Skip to content

Language

Module: Language — natural language processing primitives.

Word and Sentence Extraction

Verb Signature Description
creates words(text String) List<String> Extract individual words from text
creates sentences(text String) List<String> Split text into sentences

Stemming

Verb Signature Description
derives stem(word String) String Apply the Porter stemming algorithm
derives root(word String) String Strip common suffixes to find the root form

String Similarity

Verb Signature Description
creates distance(first String, second String) Integer Levenshtein edit distance between two strings
creates similarity(first String, second String) Float Normalized similarity (0.0 to 1.0)

Phonetic Codes

Verb Signature Description
derives soundex(word String) String Soundex phonetic code
derives metaphone(word String) String Double Metaphone phonetic code

N-Grams

Verb Signature Description
creates ngrams(text String, size Integer) List<String> Word-level n-grams
creates bigrams(text String) List<String> Word-level bigrams

Text Normalization

Verb Signature Description
derives normalize(text String) String Lowercase and fold accented characters to ASCII
derives transliterate(text String) String Transliterate accented characters to ASCII preserving case

Stopwords and Frequency

Verb Signature Description
derives stopwords() List<String> Common English stopwords
creates without_stopwords(text String) List<String> Remove stopwords, return remaining words
creates frequency(text String) Table<Value> Word frequency counts (keys are words, values are counts)
creates keywords(text String, count Integer) List<String> Top N most frequent words

Token Accessors

Access properties of Token values produced by Parse.tokens(). Extract token text via Types.string(token).

Verb Signature Description
creates start(token Token) Integer Start position in source
creates end(token Token) Integer End position in source
creates kind(token Token) Integer Kind tag (from the matched rule)
  Language creates words distance similarity keywords start end kind derives stem normalize
  Parse types Token

derives find_similar(query String, candidates List<String>) List<String>
from
    filter(candidates, |c| Language.similarity(query, c) > 0.8f)