Emo — on-device emoji suggestions from text

Takes a short text string and returns the best-matching emoji. Tuned for to-dos, calendar entries, notes, and message drafts across 23 languages (including CJK, Arabic, Thai, Hindi, and more). The whole thing — model and tokenizer — is about 5 MB and runs in well under 2 ms on device.

"Dentist appointment" → 🦷 · "réserver un vol pour Tokyo" → ✈️ · "犬の散歩" → 🐕 · "จองโรงแรม" → 🏨

Try it

  • Live demo: desert-ant-labs/emo-demo — type a phrase, get emojis, fully in your browser.
  • iOS / macOS: emo-swift — the Swift SDK with a built-in demo app.
  • Android / Kotlin / JVM: emo-kotlin — the Kotlin SDK (via JitPack), with an Android demo app.
  • JavaScript / TypeScript: emo-js — the npm package (Node + browser).

Files

| File | Format | Size | Contents |
| --- | --- | --- | --- |
| Emo.mlmodelc | Compiled Core ML | ~4.2 MB | 4-bit-palettized model, ready to load on Apple platforms |
| emo_tokenizer.bin | Pruned unigram tokenizer | ~0.75 MB | 48k SentencePiece pieces + scores; token ids = semantic-table rows |
| emo_meta.json | JSON | tiny | emoji labels + n-gram hashing config the runtime needs |
| emo.pt | PyTorch checkpoint | ~40 MB | Full-precision weights + semantic table + tokenizer (for retraining / other runtimes) |

Architecture

A compact two-stream classifier — no transformer, no large encoder:

  • Lexical stream — script-aware character/word n-grams (Latin, Han·Kana, Hangul jamo, Devanagari clusters, SE-Asian, …) hashed into a fixed multi-hash signed embedding table. Its size is independent of the number of languages.
  • Semantic stream — a frozen multilingual static embedding (Model2Vec potion-multilingual-128M, distilled from BAAI bge-m3), PCA-reduced to 128 dims and vocab-pruned to the 48k tokens that matter for the target languages. Gives cross-lingual generalization and handles out-of-vocabulary words. The matching ~0.75 MB unigram tokenizer ships alongside (emo_tokenizer.bin).
  • Head — a small MLP fusing the two streams into a softmax over a data-driven vocabulary of ~300 emojis (the emojis that actually come up most across the training phrases). Trained with n-gram dropout so the head relies on the semantic stream, which is what makes it generalize across languages.
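
The two streams above can be sketched in a few lines of plain Python. This is an illustration of the general technique (hashed, signed n-gram embeddings plus mean-pooled static embeddings), not the shipped implementation: the table sizes, the hash function, the n-gram lengths, and every function name here are assumptions, and the n-gram extraction is plain character windows rather than the script-aware scheme Emo uses. The tables are filled with random numbers where the real model would have trained or frozen weights.

```python
import hashlib
import random

DIM = 8          # lexical embedding width (assumed, for illustration)
BUCKETS = 1024   # rows in the hashed table (assumed)
NUM_HASHES = 2   # independent hashes per n-gram (assumed)
SEM_DIM = 128    # semantic width, matching the 128-dim PCA reduction

def char_ngrams(text, n_min=2, n_max=3):
    """Character n-grams over the lowercased text, padded with ^ and $."""
    padded = f"^{text.lower()}$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def bucket_and_sign(ngram, seed):
    """Map an n-gram to (row index, +1.0 or -1.0) with a seeded stable hash."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    value = int.from_bytes(digest, "little")
    return (value >> 1) % BUCKETS, 1.0 if value & 1 else -1.0

def lexical_vector(text, table):
    """Average the signed table rows selected by every (n-gram, hash) pair."""
    grams = char_ngrams(text)
    out = [0.0] * DIM
    for gram in grams:
        for seed in range(NUM_HASHES):
            row, sign = bucket_and_sign(gram, seed)
            for d in range(DIM):
                out[d] += sign * table[row][d]
    scale = max(len(grams) * NUM_HASHES, 1)
    return [v / scale for v in out]

def semantic_vector(token_ids, semantic_table):
    """Mean-pool frozen static embeddings; token ids index table rows."""
    if not token_ids:
        return [0.0] * SEM_DIM
    out = [0.0] * SEM_DIM
    for tid in token_ids:
        for d in range(SEM_DIM):
            out[d] += semantic_table[tid][d]
    return [v / len(token_ids) for v in out]

# Random stand-ins for the trained lexical table and the frozen semantic table.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]
semantic_table = [[rng.uniform(-1, 1) for _ in range(SEM_DIM)] for _ in range(16)]

token_ids = [3, 7, 11]  # hypothetical tokenizer output
fused = lexical_vector("犬の散歩", table) + semantic_vector(token_ids, semantic_table)
```

The concatenated `fused` vector is what a small MLP head would consume. Because every n-gram is hashed into the same fixed number of rows, adding another script or language changes which rows are hit but not the size of the table.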

Inputs and outputs

  • Input: a plain text string. Best on short, intent-oriented text.
  • Output: a probability distribution over the ~300-emoji vocabulary; take the top-1 (or top-k). Optimized for top-1 relevance.
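
Reading the output comes down to a softmax followed by a sort. The sketch below assumes the runtime hands back raw logits alongside the emoji labels from emo_meta.json; the four labels and logit values are made up for illustration, and the SDKs may already return probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative values only; the real vocabulary has ~300 entries.
labels = ["🦷", "✈️", "🐕", "🏨"]
logits = [2.0, 0.5, 0.1, -1.0]

probs = softmax(logits)
best = top_k(probs, labels, k=2)
```

Here `best[0]` is the top-1 suggestion; asking for a larger `k` gives a short list to show in a picker.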

Languages

English, Spanish, Portuguese, French, German, Italian, Dutch, Russian, Polish, Turkish, Arabic, Chinese (Simplified & Traditional), Japanese, Korean, Hindi, Indonesian, Thai, Vietnamese, Ukrainian, Swedish, Danish, Czech.

Limitations

  • Tuned for short, intent-oriented text; long-form text produces noisier suggestions.
  • Emoji semantics are imprecise; near-ties at the top of the ranking are expected.
  • Per-language quality varies; lower-resource languages in the set are somewhat weaker.
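
One way an app might cope with near-ties is to surface every candidate whose probability sits within a small margin of the top one, instead of committing to a single emoji. This is a sketch of that idea only: the `near_ties` helper, the 0.05 margin, and the example probabilities are all assumptions, not part of Emo or its SDKs.

```python
def near_ties(ranked, margin=0.05):
    """Keep the top entry plus any others within `margin` of its probability.

    `ranked` is a list of (label, probability) pairs sorted high to low.
    """
    if not ranked:
        return []
    top_prob = ranked[0][1]
    return [(label, p) for label, p in ranked if top_prob - p <= margin]

# Made-up ranking for a phrase like "birthday party".
ranked = [("🎂", 0.41), ("🎉", 0.38), ("🎁", 0.12)]
candidates = near_ties(ranked)
```

With these numbers the first two emojis are kept and the third is dropped; a tighter margin would fall back to plain top-1.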

Built on

  • minishlab/potion-multilingual-128M — MIT — semantic embedding stream (PCA-reduced, vocab-pruned derivative) + tokenizer lineage.
  • BAAI/bge-m3 — MIT — teacher the static embedding was distilled from.
  • Model2Vec — MIT — static-embedding distillation method.
  • Unicode CLDR emoji annotations — multilingual keyword grounding in the training data.

See THIRD_PARTY_NOTICES.md.

License

Released under the Desert Ant Labs Source-Available License v1.0 (see LICENSE.md).

  • Free for commercial use up to 100,000 Monthly Active Users (MAU).
  • Above 100,000 MAU a commercial license is required. Contact licensing@desertant.ai.

Citation

@software{emo_2026,
  title  = {Emo: on-device emoji suggestions from text},
  author = {Desert Ant Labs},
  year   = {2026},
  url    = {https://huggingface.co/desert-ant-labs/emo},
}

© 2026 Desert Ant Labs · https://desertant.ai
