Language

Module: Language — natural language processing primitives.

Defines a binary type: Token (a text span with position and kind).

Tokenization

Verb	Signature	Description
`reads`	`words(text String) List<String>`	Extract individual words from text
`reads`	`sentences(text String) List<String>`	Split text into sentences
`reads`	`tokens(text String) List<Token>`	Tokenize text with position and kind metadata
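
To make the difference between `words` and `tokens` concrete, here is a minimal Python sketch of the idea, not the module's implementation. The regular expression, the exclusive `end` offset, and the integer kind codes (0, 1, 2) are all assumptions chosen for illustration; the module's actual kind values are not documented here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str   # matched text
    start: int  # offset of the first character
    end: int    # offset one past the last character (assumed exclusive)
    kind: int   # hypothetical codes: 0 = word, 1 = number, 2 = punctuation


# Hypothetical token classes; the real module may split text differently.
_PATTERN = re.compile(r"(?P<number>\d+)|(?P<word>\w+)|(?P<punct>[^\w\s])")
_KINDS = {"word": 0, "number": 1, "punct": 2}


def tokens(text: str) -> list[Token]:
    """Tokenize text, keeping position and kind for every match."""
    return [
        Token(m.group(), m.start(), m.end(), _KINDS[m.lastgroup])
        for m in _PATTERN.finditer(text)
    ]


def words(text: str) -> list[str]:
    """Keep only the text of word tokens, dropping positions and kinds."""
    return [t.text for t in tokens(text) if t.kind == _KINDS["word"]]
```

In this sketch `words` is just a projection of `tokens`, which is one plausible way to keep the two functions consistent with each other.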

Stemming

Verb	Signature	Description
`transforms`	`stem(word String) String`	Apply the Porter stemming algorithm
`transforms`	`root(word String) String`	Strip common suffixes to find the root form

String Similarity

Verb	Signature	Description
`reads`	`distance(first String, second String) Integer`	Levenshtein edit distance between two strings
`reads`	`similarity(first String, second String) Float`	Normalized similarity (0.0 to 1.0)
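
The following Python sketch shows how a Levenshtein distance can be computed and turned into a score between 0.0 and 1.0. The distance function is the standard dynamic-programming algorithm. The normalization (one minus distance divided by the longer length) is an assumption; the module may normalize differently.

```python
def distance(first: str, second: str) -> int:
    """Levenshtein edit distance using two rolling rows."""
    previous = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        current = [i]
        for j, b in enumerate(second, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(first: str, second: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - distance(first, second) / longest
```

Under this normalization, "kitten" and "sitting" have distance 3 and similarity 4/7, so they would fall below a threshold such as 0.8.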

Phonetic Codes

Verb	Signature	Description
`reads`	`soundex(word String) String`	Soundex phonetic code
`reads`	`metaphone(word String) String`	Double Metaphone phonetic code
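
As an illustration of what a phonetic code captures, here is a Python sketch of the classic American Soundex rules. It is not the module's code, and the handling of empty or non-alphabetic input (returning an empty string) is an assumption. Metaphone is considerably more involved and is not sketched here.

```python
_CODES = {}
for _digit, _letters in (
    ("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
    ("4", "L"), ("5", "MN"), ("6", "R"),
):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word: str) -> str:
    """Four-character Soundex code: first letter plus three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # assumed behavior for input without letters
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset this, so repeated codes count again
    return (result + "000")[:4]
```

Names that sound alike map to the same code: both "Robert" and "Rupert" produce "R163".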

N-Grams

Verb	Signature	Description
`reads`	`ngrams(text String, size Integer) List<String>`	Word-level n-grams
`reads`	`bigrams(text String) List<String>`	Word-level bigrams
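
Word-level n-grams are sliding windows over the word sequence. The Python sketch below shows the idea. Splitting on whitespace and joining each window with a single space are assumptions about the output format, not documented behavior.

```python
def ngrams(text: str, size: int) -> list[str]:
    """Slide a window of `size` words across the text."""
    if size < 1:
        return []
    tokens = text.split()  # assumed: whitespace-separated words
    return [
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


def bigrams(text: str) -> list[str]:
    """Bigrams are the size-2 special case."""
    return ngrams(text, 2)
```

A text with fewer words than the window size yields an empty list in this sketch.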

Text Normalization

Verb	Signature	Description
`transforms`	`normalize(text String) String`	Lowercase and fold accented characters to ASCII
`transforms`	`transliterate(text String) String`	Transliterate accented characters to ASCII preserving case
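
The two functions differ only in whether case is preserved. One common way to fold accents, shown here as a Python sketch, is Unicode decomposition followed by removal of combining marks. This is an assumed technique, not necessarily what the module does, and characters with no decomposition (for example "ø") pass through unchanged in this version.

```python
import unicodedata


def transliterate(text: str) -> str:
    """Strip accents while preserving case."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents) left behind by the decomposition.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize(text: str) -> str:
    """Strip accents, then lowercase."""
    return transliterate(text).lower()
```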

Stopwords and Frequency

Verb	Signature	Description
`reads`	`stopwords() List<String>`	Common English stopwords
`transforms`	`without_stopwords(text String) List<String>`	Remove stopwords, return remaining words
`reads`	`frequency(text String) Table<String, Integer>`	Word frequency counts
`reads`	`keywords(text String, count Integer) List<String>`	Top N most frequent words
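
The Python sketch below shows how these four functions can relate to one another. The stopword list is a tiny illustrative sample, and lowercasing before counting is an assumption. `keywords` here counts every word, following the description "Top N most frequent words"; whether the module filters stopwords first is not stated.

```python
import re
from collections import Counter

# Tiny illustrative sample; the module's actual list is not shown here.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it"})


def _words(text: str) -> list[str]:
    """Hypothetical helper: lowercase alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def without_stopwords(text: str) -> list[str]:
    """Remaining words, in original order, after dropping stopwords."""
    return [w for w in _words(text) if w not in STOPWORDS]


def frequency(text: str) -> dict[str, int]:
    """Map each word to the number of times it occurs."""
    return dict(Counter(_words(text)))


def keywords(text: str, count: int) -> list[str]:
    """The `count` most frequent words, most frequent first."""
    return [word for word, _ in Counter(_words(text)).most_common(count)]
```

To rank only content words, one option is to build the counter from `without_stopwords(text)` instead.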

Token Accessors

Verb	Signature	Description
`reads`	`text(token Token) String`	Matched text of a token
`reads`	`start(token Token) Integer`	Start position
`reads`	`end(token Token) Integer`	End position
`reads`	`kind(token Token) Integer`	Token kind as integer

  Language reads words distance similarity keywords transforms stem normalize
  Language types Token

reads find_similar(query String, candidates List<String>) List<String>
from
    filter(candidates, |c| Language.similarity(query, c) > 0.8f)