This is a multilingual NLP model created by Andre Popovitch as part of the Lexide project. See GitHub for the training data, training code, and a Rust client library.
It supports tokenization, part-of-speech tagging, lemmatization, and dependency relations.
I created it because I noticed there was no high-quality model for this task. I previously used spaCy models, but their quality was too low for non-English languages.
One day, these models may be used in my language-learning app Yap.Town, though that may be a while off, as that app does not require perfect quality.
It is a LoRA for a Gemma 3 model, in this case gemma-3-270m-it.
## Supported Languages
- English
- German
- French
- Spanish
- Korean
- Portuguese
## Input Format
```rust
format!(
    "Language: {}\nSentence: {}\nTask: Analyze tokens (idx,token,ws,POS,lemma,dep,head)\n",
    language, sentence
)
```
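As a minimal sketch, the snippet above can be wrapped in a helper that builds the prompt for a given language and sentence (the function name `build_prompt` is illustrative, not part of the client library):

```rust
// Illustrative helper around the format! snippet from this README.
// Note: the format string here reproduces only the portion shown above;
// any trailing text after the task line may be truncated in the README.
fn build_prompt(language: &str, sentence: &str) -> String {
    format!(
        "Language: {}\nSentence: {}\nTask: Analyze tokens (idx,token,ws,POS,lemma,dep,head)\n",
        language, sentence
    )
}

fn main() {
    let prompt = build_prompt("English", "Cats love me");
    println!("{prompt}");
}
```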
## Output Format
The model outputs tab-separated values, with the prefix `Here's the token analysis:\n\n` and the suffix `\n</analysis>`. Here is an example output for the input "Cats love me":
```
Here's the token analysis:

1	Cats	none	NOUN	cat	nsubj	2
2	 love	none	VERB	love	root	0
3	 me	none	PRON	I	obj	2
</analysis>
```
The columns are: index, token, whitespace, POS, lemma, dep, head.
Note that the model joins leading whitespace onto the following word (e.g. ` love` rather than `love`). It is trained to do this because it greatly simplifies its task: a sentence like "Cats love me" is input to the model as three tokens (Cats, _love, and _me), and having the model output these same tokens avoids forcing it to learn a complicated mapping. In this case, the whitespace column will be none, because the whitespace is in the word field.
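A consumer of this output therefore has to undo the join if it wants spaCy-style separate whitespace fields. The sketch below (names like `Token` and `parse_analysis` are illustrative, not the client library's API) strips the documented prefix and suffix, splits each line on tabs, and detaches a leading space from the token:

```rust
// Illustrative parser for the tab-separated output described above.
#[derive(Debug)]
struct Token {
    idx: u32,
    text: String,
    ws: String, // leading whitespace detached from the token, if any
    pos: String,
    lemma: String,
    dep: String,
    head: u32,
}

fn parse_analysis(raw: &str) -> Option<Vec<Token>> {
    // Strip the documented prefix and suffix.
    let body = raw
        .strip_prefix("Here's the token analysis:\n\n")?
        .strip_suffix("\n</analysis>")?;
    body.lines()
        .map(|line| {
            let f: Vec<&str> = line.split('\t').collect();
            if f.len() != 7 {
                return None;
            }
            // Undo the whitespace join: move a leading space out of the word.
            let (ws, text) = match f[1].strip_prefix(' ') {
                Some(rest) => (" ".to_string(), rest.to_string()),
                None => (String::new(), f[1].to_string()),
            };
            Some(Token {
                idx: f[0].parse().ok()?,
                text,
                ws,
                pos: f[3].to_string(),
                lemma: f[4].to_string(),
                dep: f[5].to_string(),
                head: f[6].parse().ok()?,
            })
        })
        .collect()
}

fn main() {
    let raw = "Here's the token analysis:\n\n1\tCats\tnone\tNOUN\tcat\tnsubj\t2\n2\t love\tnone\tVERB\tlove\troot\t0\n3\t me\tnone\tPRON\tI\tobj\t2\n</analysis>";
    let tokens = parse_analysis(raw).expect("well-formed analysis");
    for t in &tokens {
        println!("{:?}", t);
    }
}
```

Returning `Option` keeps the sketch simple; a real client would likely use a proper error type so malformed model output can be reported rather than silently dropped.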