This is a multilingual NLP model created by Andre Popovitch as part of the Lexide project. See GitHub for the training data, training code, and a Rust client library.

It supports tokenization, part-of-speech tagging, lemmatization, and dependency relations.

I created it because I could not find a high-quality model for these tasks. I previously used spaCy models, but their quality was too low for non-English languages.

One day, these models may be used in my language-learning app Yap.Town, but that may be a while off, as the app does not require perfect quality.

It is a LoRA for a Gemma 3 model, in this case gemma-3-270m-it.

Supported Languages

  1. English
  2. German
  3. French
  4. Spanish
  5. Korean
  6. Portuguese

Input Format

format!(
    "Language: {}\nSentence: {}\nTask: Analyze tokens (idx,token,ws,POS,lemma,dep,head)\n\$
    language, sentence
)

Output Format

The model outputs tab-separated values, with the prefix "Here's the token analysis:\n\n" and the suffix "\n</analysis>". Here is an example output for the input "Cats love me":

Here's the token analysis:

1       Cats    none    NOUN    cat     nsubj   2
2        love   none    VERB    love    root    0
3        me     none    PRON    I       obj     2

</analysis>

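The columns are, in order: index, token, whitespace, POS, lemma, dep, head. In the example, head is the index of the token's syntactic head, and the root token has head 0.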
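As a rough illustration of consuming this format, the sketch below strips the prefix and suffix and splits each row on tabs. The names `TokenRow` and `parse_analysis` are hypothetical and are not taken from the Lexide client library; the sketch also assumes every row has exactly seven fields and that index and head are non-negative integers.

```rust
/// One row of the tab-separated analysis (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct TokenRow {
    pub index: usize,
    pub token: String,      // may start with whitespace, e.g. " love"
    pub whitespace: String, // "none" in the example output
    pub pos: String,
    pub lemma: String,
    pub dep: String,
    pub head: usize,        // 0 for the root token in the example output
}

const PREFIX: &str = "Here's the token analysis:\n\n";
const SUFFIX: &str = "\n</analysis>";

/// Parses raw model output into rows. Assumes exactly seven tab-separated
/// fields per row; anything else is reported as an error.
pub fn parse_analysis(output: &str) -> Result<Vec<TokenRow>, String> {
    let body = output
        .trim_end()
        .strip_prefix(PREFIX)
        .ok_or_else(|| "missing prefix".to_string())?
        .strip_suffix(SUFFIX)
        .ok_or_else(|| "missing suffix".to_string())?;

    let mut rows = Vec::new();
    for line in body.lines() {
        // Skip blank lines, but do not trim: a token may begin with a space.
        if line.is_empty() {
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() != 7 {
            return Err(format!("expected 7 fields, got {}: {:?}", fields.len(), line));
        }
        rows.push(TokenRow {
            index: fields[0].parse().map_err(|e| format!("bad index: {e}"))?,
            token: fields[1].to_string(),
            whitespace: fields[2].to_string(),
            pos: fields[3].to_string(),
            lemma: fields[4].to_string(),
            dep: fields[5].to_string(),
            head: fields[6].parse().map_err(|e| format!("bad head: {e}"))?,
        });
    }
    Ok(rows)
}
```

Calling `parse_analysis` on the example output above would yield three rows, with the second row's token being " love" (leading space preserved) and its head being 0.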

Note that the model joins whitespace with the following word (e.g. " love", with a leading space). It is trained to do this because it greatly simplifies the task: a sentence like "Cats love me" is input to the model as 3 tokens: Cats, _love, and _me, where _ stands for the space. Having the model output these same tokens avoids forcing it to learn a complicated mapping. In this case, the whitespace column is none, because the whitespace is already in the token field.
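For callers who want the bare word without its leading whitespace, a small helper can separate the two. This is only a sketch: `split_leading_whitespace` and `reconstruct` are hypothetical names, and it covers only the case described above, where the whitespace column is none and all spacing lives in the token field.

```rust
/// Splits a token field such as " love" into (leading whitespace, word).
pub fn split_leading_whitespace(token: &str) -> (&str, &str) {
    let word = token.trim_start();
    // trim_start removes whole characters, so this offset is a valid boundary.
    token.split_at(token.len() - word.len())
}

/// Rebuilds the sentence by concatenating token fields in order. This relies
/// on the whitespace already being part of each token field.
pub fn reconstruct(tokens: &[&str]) -> String {
    tokens.concat()
}
```

With the example above, `split_leading_whitespace(" love")` gives `(" ", "love")`, and concatenating the three token fields gives back "Cats love me".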
