Language
Module: Language — natural language processing primitives.
Word and Sentence Extraction
| Verb | Signature | Description |
|------|-----------|-------------|
| creates | `words(text String) List<String>` | Extract individual words from text |
| creates | `sentences(text String) List<String>` | Split text into sentences |
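
The module does not document its tokenization rules, so the following Python sketch only illustrates one plausible behavior for `words` and `sentences`. The regular expressions are assumptions: words are runs of word characters, and sentences end at `.`, `!`, or `?` followed by whitespace.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    # Assumption: a word is a maximal run of letters, digits, or underscores.
    return re.findall(r"\w+", text)


def sentences(text: str) -> List[str]:
    # Assumption: a sentence ends at ., !, or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

A splitter this simple breaks contractions such as "don't" into two words and mis-splits abbreviations such as "Dr. Smith"; the real module may handle these cases differently.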
Stemming
| Verb | Signature | Description |
|------|-----------|-------------|
| derives | `stem(word String) String` | Apply the Porter stemming algorithm |
| derives | `root(word String) String` | Strip common suffixes to find the root form |
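
To make the difference between the two verbs concrete, here is a rough Python sketch of plain suffix stripping in the spirit of `root`. The suffix list and the minimum stem length are hypothetical, chosen only for illustration; the Porter algorithm behind `stem` applies several rule phases with extra conditions and is not reproduced here.

```python
# Hypothetical suffix list, longest first so "ing" wins over "s".
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def root(word: str) -> str:
    for suffix in SUFFIXES:
        # Assumption: keep at least three characters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```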
String Similarity
| Verb | Signature | Description |
|------|-----------|-------------|
| creates | `distance(first String, second String) Integer` | Levenshtein edit distance between two strings |
| creates | `similarity(first String, second String) Float` | Normalized similarity (0.0 to 1.0) |
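
For readers unfamiliar with edit distance, the Python sketch below shows the standard dynamic-programming form of Levenshtein distance. The normalization used in `similarity` is an assumption (one minus the distance divided by the longer string's length); the module only states that the result lies between 0.0 and 1.0.

```python
def distance(first: str, second: str) -> int:
    """Levenshtein edit distance, keeping only one previous row in memory."""
    if len(first) < len(second):
        first, second = second, first  # iterate over the longer string
    previous = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        current = [i]
        for j, b in enumerate(second, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(first: str, second: str) -> float:
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - distance(first, second) / longest
```

Under this assumed normalization, "kitten" and "sitting" (distance 3, longer length 7) score about 0.57, so they would fall below a threshold such as 0.8.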
Phonetic Codes
| Verb | Signature | Description |
|------|-----------|-------------|
| derives | `soundex(word String) String` | Soundex phonetic code |
| derives | `metaphone(word String) String` | Double Metaphone phonetic code |
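
As an illustration of what a phonetic code looks like, the Python sketch below implements the common American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent equal codes, and pad to four characters. Whether the module follows exactly this variant is an assumption, and Double Metaphone is considerably more involved and not shown.

```python
# Consonant groups for digits 1 through 6; vowels and h, w, y carry no code.
_CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in group:
        _CODES[letter] = str(digit)


def soundex(word: str) -> str:
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets this, so a repeat across a vowel is kept
    return (result + "000")[:4]  # pad or truncate to letter plus three digits
```

Names that sound alike share a code: both "Robert" and "Rupert" map to `R163`.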
N-Grams
| Verb | Signature | Description |
|------|-----------|-------------|
| creates | `ngrams(text String, size Integer) List<String>` | Word-level n-grams |
| creates | `bigrams(text String) List<String>` | Word-level bigrams |
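
A word-level n-gram is a sliding window of `size` consecutive words. The Python sketch below assumes whitespace tokenization and assumes each n-gram is returned as a single space-joined string, which is one reading of the `List<String>` return type.

```python
from typing import List


def ngrams(text: str, size: int) -> List[str]:
    tokens = text.split()  # assumption: words are separated by whitespace
    if size < 1:
        return []
    # Slide a window of `size` words across the token list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def bigrams(text: str) -> List[str]:
    return ngrams(text, 2)
```

When the text has fewer words than `size`, the window never fits and the result is empty.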
Text Normalization
| Verb | Signature | Description |
|------|-----------|-------------|
| derives | `normalize(text String) String` | Lowercase and fold accented characters to ASCII |
| derives | `transliterate(text String) String` | Transliterate accented characters to ASCII preserving case |
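
The two verbs differ only in case handling. A minimal Python sketch of the idea, assuming accent folding is done by Unicode decomposition followed by dropping combining marks (the module's actual mapping is not documented here):

```python
import unicodedata


def transliterate(text: str) -> str:
    # Decompose characters such as "é" into "e" plus a combining accent,
    # then drop the combining marks. Case is preserved.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(text: str) -> str:
    # Same folding, then lowercase.
    return transliterate(text).lower()
```

This approach leaves characters without a decomposition, such as "ø" or "ß", unchanged; a full transliteration table would be needed to map those to ASCII.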
Stopwords and Frequency
| Verb | Signature | Description |
|------|-----------|-------------|
| derives | `stopwords() List<String>` | Common English stopwords |
| creates | `without_stopwords(text String) List<String>` | Remove stopwords, return remaining words |
| creates | `frequency(text String) Table<Value>` | Word frequency counts (keys are words, values are counts) |
| creates | `keywords(text String, count Integer) List<String>` | Top N most frequent words |
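
The relationship between these verbs can be sketched in Python with a counter. Several details here are assumptions rather than documented behavior: the tiny stopword set is illustrative only, words are lowercased before counting, and `keywords` does not filter stopwords.

```python
import re
from collections import Counter
from typing import Dict, List

# Illustrative subset only; the module's actual stopword list is not shown.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def _tokens(text: str) -> List[str]:
    # Hypothetical helper: lowercase, then take runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def without_stopwords(text: str) -> List[str]:
    return [word for word in _tokens(text) if word not in STOPWORDS]


def frequency(text: str) -> Dict[str, int]:
    return dict(Counter(_tokens(text)))


def keywords(text: str, count: int) -> List[str]:
    # Most frequent words first, limited to `count` entries.
    return [word for word, _ in Counter(_tokens(text)).most_common(count)]
```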
Token Accessors
Access properties of Token values produced by Parse.tokens(). Extract token text via Types.string(token).
| Verb | Signature | Description |
|------|-----------|-------------|
| creates | `start(token Token) Integer` | Start position in source |
| creates | `end(token Token) Integer` | End position in source |
| creates | `kind(token Token) Integer` | Kind tag (from the matched rule) |
Language creates words distance similarity keywords start end kind derives stem normalize
Parse types Token
derives find_similar(query String, candidates List<String>) List<String>
from
filter(candidates, |c| Language.similarity(query, c) > 0.8f)