Skip to content

Language

Module: Language — natural language processing primitives.

Defines a binary type: Token (a text span with position and kind).

Tokenization

Verb Signature Description
reads words(text String) List<String> Extract individual words from text
reads sentences(text String) List<String> Split text into sentences
reads tokens(text String) List<Token> Tokenize text with position and kind metadata

Stemming

Verb Signature Description
transforms stem(word String) String Apply the Porter stemming algorithm
transforms root(word String) String Strip common suffixes to find the root form

String Similarity

Verb Signature Description
reads distance(first String, second String) Integer Levenshtein edit distance between two strings
reads similarity(first String, second String) Float Normalized similarity (0.0 to 1.0)

Phonetic Codes

Verb Signature Description
reads soundex(word String) String Soundex phonetic code
reads metaphone(word String) String Double Metaphone phonetic code

N-Grams

Verb Signature Description
reads ngrams(text String, size Integer) List<String> Word-level n-grams
reads bigrams(text String) List<String> Word-level bigrams

Text Normalization

Verb Signature Description
transforms normalize(text String) String Lowercase and fold accented characters to ASCII
transforms transliterate(text String) String Transliterate accented characters to ASCII preserving case

Stopwords and Frequency

Verb Signature Description
reads stopwords() List<String> Common English stopwords
transforms without_stopwords(text String) List<String> Remove stopwords, return remaining words
reads frequency(text String) Table<String, Integer> Word frequency counts
reads keywords(text String, count Integer) List<String> Top N most frequent words

Token Accessors

Verb Signature Description
reads text(token Token) String Matched text of a token
reads start(token Token) Integer Start position
reads end(token Token) Integer End position
reads kind(token Token) Integer Token kind as integer
  Language reads words distance similarity keywords transforms stem normalize
  Language types Token

reads find_similar(query String, candidates List<String>) List<String>
from
    filter(candidates, |c| Language.similarity(query, c) > 0.8f)