ControlMT v2.3 — Compact Kannada ↔ English Translation (139M)
TL;DR. A 139M-parameter encoder-decoder specialized for Kannada ↔ English translation. Single-pair focus + code-mix-native training + Anti-LM contrastive decoding. Achieves competitive FLORES-200 KN↔EN performance for its parameter size, with COMET-DA above 0.84 in both directions. Apache 2.0, deployable on consumer GPU.
Headline benchmark — FLORES-200 devtest
| Metric | KN → EN | EN → KN |
|---|---|---|
| CometKiwi-DA (no ref) | 0.8437 | 0.8663 |
| COMET-DA (with ref) | 0.8459 | 0.8443 |
| BLEU | 27.20 | 18.50 |
| chrF | 55.84 | 56.12 |
All four COMET-family scores clear the 0.82 production floor; EN→KN CometKiwi-DA (0.8663) also clears the 0.85 aspirational target. BLEU/chrF measured with sacrebleu (default tokenization).
| Property | Value |
|---|---|
| Parameters | 139M |
| Architecture | Modular encoder-decoder (per-language wrappers + shared core) |
| Vocabulary | 128,000 (SentencePiece Unigram, joint KN+EN) |
| Languages | Kannada (kn) ↔ English (en) — bidirectional |
| Training data | 6.70M parallel pairs (post CometKiwi quality filtering) + specialized streams |
| Hardware (training) | 1 × NVIDIA RTX 5060 Ti (16 GB), bf16 mixed precision |
| Release date | 2026-06-23 |
| License | Apache 2.0 |
| Author | Anand Kaman |
How this got built — journey + decisions + dead ends
If you're building a similar specialized model, the docs/ folder is a first-person account of how ControlMT went from zero to public release in three months, solo, on one GPU:
- docs/top-lessons.md: 10 takeaways, one paragraph each (start here if you only have 10 minutes)
- docs/the-journey.md: chronological narrative, v1 → v2.3
- docs/what-didnt-work.md: 8 failed experiments + root-cause analysis
- docs/how-it-was-built.md: concrete data + training + eval + deployment recipes
- docs/working-with-claude.md: patterns for solo + AI-assistant collaboration
- docs/repo-map.md: folder layout, file conventions
Available releases
| Repo | What you get | Best for |
|---|---|---|
| anandkaman/controlmt-v2.3 (you are here) | bf16 safetensors; load with dtype=fp32 / bf16 / fp16 | General use: GPU fp16 / CPU bf16 |
| anandkaman/controlmt-v2.3-int8 | Auto-applies int8 dynamic quant on load | CPU-only / memory-constrained: 0.28 s/pair, ~140 MB RAM |
| anandkaman/controlmt-demo (Space) | Live web demo (FastAPI + static HTML/CSS/JS) | Try in browser, no install |
| pip install controlmt (SDK) | Python wrapper around all of the above | One-liner load + auto device/dtype + batched API |
Easiest path — the SDK does the right thing automatically:
# CPU-only (smaller install — ~200 MB torch instead of ~2 GB)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install controlmt
# GPU (CUDA) — default; pulls the full ~2 GB CUDA torch wheel
pip install controlmt
from controlmt import ControlMT
model = ControlMT.from_hf() # GPU fp16 / CPU bf16 / etc — auto
model = ControlMT.from_hf(quant="int8") # CPU int8 dynamic
model = ControlMT.from_hf(device="cpu", dtype="bf16") # explicit
print(model.translate("ನಾನು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇನೆ.")) # "I speak Kannada."
Why two install paths?
pip install controlmt pulls torch>=2.0, which by default fetches the CUDA-enabled wheel (~2 GB). If you don't have a GPU, install the CPU-only torch wheel first (the line with --index-url); it's ~200 MB and runs the model just fine on CPU at bf16 or int8. This is a PyTorch ecosystem quirk, not a ControlMT one: every model that depends on torch has the same trade-off.
Raw Transformers also works (no SDK needed):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Main repo — choose dtype at load time
tokenizer = AutoTokenizer.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
# int8 repo — quantization auto-applied
model_int8 = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3-int8", trust_remote_code=True)
→ Full deployment recipes + verified latency/memory matrix: DEPLOYMENT.md
1. Model Details
ControlMT v2.3 is a modular encoder-decoder transformer specialized for Kannada ↔ English translation. Every parameter is dedicated to this one language pair, which is what lets a 139M model compete with multilingual models 4× its size on FLORES-200 KN↔EN.
Architecture
┌── Router (per-row direction token) ──┐
│ │
┌───────▼─────────┐ ┌─────▼───────────┐
│ KN Lang Encoder │ │ EN Lang Encoder │
│ (2 layers) │ │ (2 layers) │
└───────┬─────────┘ └─────────────────┘
│
┌───────▼─────────┐
│ Shared Core Enc │ 6 layers, ~19M
└───────┬─────────┘
│
┌───────▼─────────┐
│ Shared Core Dec │ 6 layers, ~25M
└───────┬─────────┘
│
┌───────▼─────────┐ ┌─────────────────┐
│ KN Lang Decoder │ │ EN Lang Decoder │
│ (2 layers) │ │ (2 layers) │
└─────────────────┘ └─────────────────┘
↓
Output projection (tied embeddings, 128K vocab)
| Module | Parameters |
|---|---|
| Token embedding (shared, tied with output projection) | 65.5M |
| Per-language encoders (KN + EN, 2 layers each) | 12.6M |
| Shared core (6 enc + 6 dec, d_model=512, d_ff=2048, 8 heads) | 44.1M |
| Per-language decoders (KN + EN, 2 layers each) | 16.8M |
| Output projection (128K vocab × 512) | (tied with input embedding) |
| Total | ~139.2M |
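The totals in the table can be sanity-checked with a few lines of arithmetic. The sketch below uses only figures stated above (128,000 vocabulary entries, d_model=512, and the per-module counts in millions); the variable names are illustrative and not taken from the ControlMT code.

```python
# Sanity check of the parameter table. All figures come from the table above;
# the variable names are illustrative, not taken from the ControlMT code.
VOCAB_SIZE = 128_000
D_MODEL = 512

# The tied embedding is counted once: vocab x d_model weights.
embedding_params = VOCAB_SIZE * D_MODEL

modules_millions = {
    "token_embedding": embedding_params / 1e6,
    "per_language_encoders": 12.6,
    "shared_core": 44.1,
    "per_language_decoders": 16.8,
}

total_millions = sum(modules_millions.values())
embedding_share = modules_millions["token_embedding"] / total_millions

print(f"embedding: {embedding_params:,} parameters")
print(f"total: ~{total_millions:.1f}M")
print(f"embedding share of total: {embedding_share:.0%}")
```

The rounded rows sum to roughly 139.0M; the small gap to the stated ~139.2M presumably comes from rounding and terms not itemized in the table (an assumption). The embedding matrix alone accounts for a little under half of the model, a direct consequence of the 128K vocabulary.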
Why single-pair?
Most public Indic MT models are broad — NLLB covers 200 languages, IndicTrans2 covers 22. That breadth comes from parameter-sharing across languages, so each language pair gets only a slice of the model's capacity.
ControlMT goes the other direction: every parameter is dedicated to Kannada ↔ English. If you need broad multilingual coverage, use NLLB or IndicTrans2. If you need Kannada specifically — and you care about size, latency, or on-device deployment — this is what the trade-off looks like.
2. Intended Use & Out-of-Scope Use
Intended use
- Production KN↔EN translation for Indian-context content: news, government documents, e-commerce, social media, customer support, conversational interfaces
- Code-mix-aware translation — handles natural Indian Kannada that embeds English acronyms, brand names, and short loanwords
- Edge / on-device deployment — at 139M params + int8 quantization, runs on consumer hardware (laptops, mid-tier devices with ≥4 GB RAM)
- Office / form-data translation (KYC, applications, customer records) — the model demonstrated near-perfect preservation on the release evaluation suite for Aadhar, phone, email, dates, customer IDs, and PAN numbers in the KN→EN direction. EN→KN has a small edge case where mid-sentence PAN-format strings may character-by-character transliterate to Kannada syllables (information preserved, recoverable via a small regex postprocessing pass — see Limitations Section 6).
Out-of-scope use
- ❌ Not a multilingual translator — only Kannada ↔ English. For other language pairs, see NLLB-200 or IndicTrans2.
- ❌ Not a chatbot / not instruction-following — translation is the only supported task.
- ❌ Not a literal-translator for idioms — see Limitations (Section 6).
- ❌ Not certified for safety-critical domains (medical diagnosis, legal advice). The model passes a safety regression set but is not formally audited for those contexts.
- ❌ Not a domain-specialist for highly technical scientific text without context.
3. Training Data (summary)
The base corpus is 8.06M parallel KN↔EN pairs aggregated from public Indic MT datasets — Samanantar (Ramesh et al. 2022), BPCC (Gala et al. 2023 / IndicTrans2), Sangraha (Khan et al. 2024 / IndicLLMSuite), and Aksharantar (Madhani et al. 2023) for transliteration coverage.
A multi-stage filtering pipeline (profanity filter, roundtrip audit, CometKiwi quality
scoring, misalignment detection) reduces this to 6.64M clean rows in
master_v22.jsonl. Bad rows (62,853) are quarantined with a _drop_reason audit trail
rather than deleted.
Augmenting the main corpus, four small internally-generated streams target specific
weaknesses: Pattern A (30K NER-validated proper-noun pairs), Pattern B
(8K cm_paired groups for code-mix), F2 (~5K letter-spelled acronyms), and
numerical_aug (form-preservation for digits/dates/currency).
Full filtering pipeline, per-stream methodology, training principles, and reproducibility steps: see
TRAINING_GUIDE.md.
Data licensing: Model weights and ControlMT-specific generated streams are released under Apache 2.0. Public source corpora retain their original licenses (Samanantar: CC-BY-NC 4.0; others: CC-BY-4.0).
3.4 Training principles
- Decoder hygiene gate (kn_is_mixed): rows with 3+ consecutive Latin words in KN are excluded from the EN→KN target, which prevents mixed-code emission
- CM-Concatenation Level A: paired (kn_pure, kn_mixed) batching for natural code-mix handling
- EMA (decay=0.999) + SWA averaging for production weights
- Anti-LM contrastive decoding (α=0.5) at inference — kills repetition + hallucination
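The decoder hygiene gate in the first bullet can be sketched as a simple token scan. This is a minimal illustration, not the project's implementation: it assumes a "Latin word" is a whitespace-separated token containing at least one ASCII letter and no Kannada characters, and the helper `en2kn_targets` and the row layout (`{"kn": ...}`) are hypothetical.

```python
import re

# Assumption: a "Latin word" is a whitespace-separated token with at least one
# ASCII letter and no Kannada characters. The real gate may define it differently.
LATIN_LETTER = re.compile(r"[A-Za-z]")
KANNADA_CHAR = re.compile(r"[\u0C80-\u0CFF]")


def kn_is_mixed(kn_text: str, min_run: int = 3) -> bool:
    """Return True if kn_text has min_run or more consecutive Latin words."""
    run = 0
    for token in kn_text.split():
        if LATIN_LETTER.search(token) and not KANNADA_CHAR.search(token):
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0  # any non-Latin token breaks the run
    return False


def en2kn_targets(rows):
    """Keep only rows whose Kannada side is usable as an EN->KN target."""
    return [row for row in rows if not kn_is_mixed(row["kn"])]


rows = [
    {"kn": "ನಾನು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇನೆ."},                   # pure Kannada: kept
    {"kn": "ನಾನು Google ಬಳಸುತ್ತೇನೆ"},                      # one Latin word: kept
    {"kn": "Nange last meeting nalli decision aagilla"},  # long Latin run: dropped
]
print(len(en2kn_targets(rows)))
```

A gate like this only affects which rows serve as EN→KN targets; under the description above, the same rows can still be used in the KN→EN direction.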
4. Evaluation
4.1 Public benchmark sets
| Set | Pairs | License | Citation | Reference |
|---|---|---|---|---|
| FLORES-200 devtest | 1,012 | CC-BY-SA 4.0 | NLLB Team, No Language Left Behind: Scaling Human-Centered Machine Translation, 2022 | github.com/facebookresearch/flores |
| IN22-Gen | 1,024 | CC-BY-4.0 | Gala et al., IndicTrans2, TMLR 2023 | huggingface.co/datasets/ai4bharat/IN22-Gen |
| IN22-Conv | 1,503 | CC-BY-4.0 | Gala et al., IndicTrans2, TMLR 2023 | huggingface.co/datasets/ai4bharat/IN22-Conv |
| eval_curated_v22 (internal, supplementary) | 800 | n/a | Internal style-stratified sample (200/style bucket) from master_v22.jsonl | Released alongside this model |
| code_mix_eval (internal, supplementary) | 100 | n/a | Internal code-mix probe set, curated 2026-04 | Released alongside this model |
4.2 Scoring tools
| Tool | Use | Citation | Reference |
|---|---|---|---|
| Unbabel/wmt22-cometkiwi-da | Reference-free QE | Rei et al., CometKiwi: IST-Unbabel Submission for the WMT22 Quality Estimation Shared Task, WMT 2022 | huggingface.co/Unbabel/wmt22-cometkiwi-da |
| Unbabel/wmt22-comet-da | Reference-based metric | Rei et al., COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task, WMT 2022 | huggingface.co/Unbabel/wmt22-comet-da |
| sacrebleu (default tokenization) | BLEU + chrF | Post, A Call for Clarity in Reporting BLEU Scores, WMT 2018 | github.com/mjpost/sacrebleu |
4.3 Decoding configuration for reported scores
| Parameter | Value |
|---|---|
| Beam size | 6 |
| Length penalty | 1.2 |
| no-repeat n-gram size | 3 |
| Anti-LM α | 0.5 |
| Max length | 256 |
4.4 Results
FLORES-200 devtest (1,012 pairs)
| Metric | KN → EN | EN → KN |
|---|---|---|
| CometKiwi (no ref) | 0.8437 | 0.8663 |
| COMET-DA (with ref) | 0.8459 | 0.8443 |
| BLEU | 27.20 | 18.50 |
| chrF | 55.84 | 56.12 |
Ship-gate verdict: ✅ PASS. EN→KN CometKiwi-DA (0.8663) clears the 0.85 aspirational target; the other three COMET-family scores (0.8437 to 0.8459) fall just short of it. All four clear the 0.82 production floor.
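The gate logic can be written down explicitly. The sketch below applies the two thresholds named in this card (0.82 production floor, 0.85 aspirational target) to the FLORES-200 COMET-family scores from the table; the function and variable names are hypothetical and this is not the project's actual gate script.

```python
# Thresholds and scores are taken from this card; the function and variable
# names are hypothetical, not the project's actual gate script.
PRODUCTION_FLOOR = 0.82
ASPIRATIONAL_TARGET = 0.85

flores_scores = {
    ("CometKiwi-DA", "kn2en"): 0.8437,
    ("CometKiwi-DA", "en2kn"): 0.8663,
    ("COMET-DA", "kn2en"): 0.8459,
    ("COMET-DA", "en2kn"): 0.8443,
}


def ship_gate(scores, floor=PRODUCTION_FLOOR, target=ASPIRATIONAL_TARGET):
    """Classify each score as 'target', 'floor', or 'fail'."""
    verdicts = {}
    for key, value in scores.items():
        if value >= target:
            verdicts[key] = "target"
        elif value >= floor:
            verdicts[key] = "floor"
        else:
            verdicts[key] = "fail"
    return verdicts


verdicts = ship_gate(flores_scores)
passed = all(v != "fail" for v in verdicts.values())
for (metric, direction), verdict in verdicts.items():
    print(f"{metric:13s} {direction}: {verdict}")
print("PASS" if passed else "FAIL")
```

Keeping the floor and the target as separate tiers makes it easy to report which scores merely pass and which reach the stretch goal.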
IN22-Conv (1,503 pairs, AI4Bharat conversational benchmark)
| Metric | KN → EN | EN → KN |
|---|---|---|
| CometKiwi (no ref) | 0.8143 | 0.8845 |
| COMET-DA (with ref) | 0.8232 | 0.8320 |
| BLEU | 21.61 | 5.47 |
| chrF | 46.65 | 35.30 |
Ship-gate verdict: ✅ PASS. Both COMET-DA values clear the 0.82 production floor, and EN→KN CometKiwi at 0.8845 exceeds the 0.85 aspirational target; KN→EN CometKiwi (0.8143) sits just under the floor. BLEU EN→KN is naturally low on conversational data (short colloquial utterances with high lexical variance); CometKiwi/COMET-DA (semantic adequacy) is the more reliable signal here.
Aggregate JSON: eval_results/in22_conv.json.
4.5 Comparison context
Direct head-to-head COMET-DA benchmarks for distilled-size Indic MT models on FLORES KN↔EN are not uniformly published in a single source. The major academic reference (IndicTrans2, Gala et al. 2023) reports chrF++ in its main tables and includes COMET-22 values in supplementary tables; NLLB (NLLB Team 2022) reports spBLEU + chrF++ but does not publish per-language COMET-DA for the distilled checkpoints.
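As one citation-grounded anchor in the same metric space: IndicTrans2 1.1B reports COMET-22 ≈ 0.84 on IN22-Conv KN→EN (Gala et al. 2023, Appendix Table 45). ControlMT v2.3 scores COMET-DA 0.8232 on the same IN22-Conv KN→EN set (Section 4.4) and 0.8459 on FLORES KN→EN, at 139M parameters. The FLORES figure comes from a different test set, so it is an anchor rather than a head-to-head result; on both sets the model is competitive within its parameter scale class.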
For an apples-to-apples comparison on your own infrastructure, the open-source eval pipeline
in scripts/eval_release.py
can be pointed at any KN↔EN MT model (NLLB / IndicTrans2 / Sarvam-Translate) using the
same FLORES devtest pairs and same scoring tooling (CometKiwi-DA, COMET-DA, sacrebleu),
giving directly-comparable numbers without trusting any individual paper's reporting.
5. Decoding Configuration (recommended presets)
Default (production)
generate_kwargs = dict(
num_beams=6,
length_penalty=1.2,
no_repeat_ngram_size=3,
anti_lm_alpha=0.5,
max_length=256,
)
Fast (~2× throughput, ~0.5 BLEU lower)
generate_kwargs = dict(num_beams=4, anti_lm_alpha=0.0, max_length=256)
Greedy (fastest, ~1.5 BLEU lower than default)
generate_kwargs = dict(num_beams=1, max_length=256)
High-quality (~30% slower, marginal gain)
generate_kwargs = dict(num_beams=8, anti_lm_alpha=0.7, max_length=256)
What is Anti-LM contrastive decoding?
At every decoding step, the model computes two next-token distributions:
- Main: p(y_t | source, y_<t)
- Anti-LM: p(y_t | NO_source, y_<t) (cross-attention masked out)
Contrastive score: log p_main − α · log p_antilm. Tokens predictable without seeing
the source get penalized — kills repetition and source-detached hallucination. α=0
disables; α=0.5 is the production default.
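A toy example makes the scoring rule concrete. The sketch below applies log p_main − α · log p_antilm to a three-token vocabulary; the probabilities are invented for illustration, and the helper names are hypothetical. In the real model the same rule would be applied to the full vocabulary at each beam-search step.

```python
import math


def contrastive_scores(p_main, p_antilm, alpha=0.5):
    """Score each candidate token as log p_main - alpha * log p_antilm."""
    return {
        token: math.log(p_main[token]) - alpha * math.log(p_antilm[token])
        for token in p_main
    }


def best_token(p_main, p_antilm, alpha):
    """Return the candidate with the highest contrastive score."""
    scores = contrastive_scores(p_main, p_antilm, alpha)
    return max(scores, key=scores.get)


# Toy next-token probabilities, invented for illustration.
p_main = {"Bangalore": 0.40, "the": 0.45, "tomorrow": 0.15}    # sees the source
p_antilm = {"Bangalore": 0.05, "the": 0.80, "tomorrow": 0.15}  # source masked

print(best_token(p_main, p_antilm, alpha=0.0))  # plain argmax of p_main
print(best_token(p_main, p_antilm, alpha=0.5))  # contrastive choice
```

With α=0 the filler word "the" wins because it has the highest main probability. With α=0.5 it is penalized for being highly predictable without the source, and the source-grounded token "Bangalore" is selected instead.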
6. Limitations
| Class | Example | Why |
|---|---|---|
| Idioms taken literally | "break a leg" → ಕಾಲು ಮುರಿಯಿರಿ (literal); "raining cats and dogs" → literal translation | Known weakness at sub-1B parameter scale. |
| Long-tail tech / SaaS names | Modern cloud-native terms (Kubernetes, GraphQL, Redis, PostgreSQL) may transliterate inconsistently or get omitted | Specific tech vocabulary rare in 2022-era training corpus. Common names (Apple, iPhone, Google) handled well. |
| Letter-spelled acronym KN→EN | ಎನ್ಎಎಸ್ಎ → unreliable; phonetic ನಾಸಾ → reliable | Letter-spelled form is rare; phonetic form is standard in Kannada writing. |
| Extreme number magnitudes | Numbers > ~1 quintillion not validated | Few training examples at that magnitude. |
| Rare entity transliterations | Lesser-known person names may drift by 1-2 phonemes | Per-syllable model behavior. |
| PAN/long alphanumeric IDs mid-sentence (EN→KN only) | The model demonstrated near-perfect preservation on the release evaluation suite: Aadhar numbers, phone numbers, email addresses, customer IDs, dates of birth, and PAN numbers are preserved verbatim in both directions. On a small EN→KN probe across 5 PAN sentences, 3/5 preserved the Latin form verbatim and 1/5 was character-by-character transliterated into Kannada syllables (e.g. ABCDE1234F → ಎಬಿಸಿಡಿಇ1234ಎಫ್); the information is still preserved, since the syllables map deterministically back to Latin. KN→EN direction did not exhibit this on the eval suite. Recommended postprocessing for form-data deployments: regex-detect Kannada-syllable sequences in PAN/Aadhar context fields and back-map to Latin; validate against issuing-authority checksum before downstream use. | Mid-sentence PAN is rare in 2022-era training corpus. KN→EN and clear-prefix EN→KN cases preserve Latin verbatim. |
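The recommended PAN postprocessing pass could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the letter-name table is hypothetical and covers only the six letters in the ABCDE1234F example, the function name `back_map_pan` is invented here, and validation is limited to the PAN shape (five letters, four digits, one letter) rather than any issuing-authority check.

```python
import re

# Partial, hypothetical mapping from Kannada spellings of English letter names
# to Latin letters. It covers only the letters in the example above; a real
# table would need all 26 letters, checked against actual model outputs.
KN_LETTER_NAMES = {
    "\u0c8e\u0cab\u0ccd": "F",  # ಎಫ್
    "\u0cac\u0cbf": "B",        # ಬಿ
    "\u0cb8\u0cbf": "C",        # ಸಿ
    "\u0ca1\u0cbf": "D",        # ಡಿ
    "\u0c8e": "A",              # ಎ
    "\u0c87": "E",              # ಇ
}

# Longest spellings first, so the spelling of F wins over its prefix (A).
_ALTERNATION = "|".join(
    re.escape(name) for name in sorted(KN_LETTER_NAMES, key=len, reverse=True)
)
_LETTER = re.compile(_ALTERNATION)
# PAN shape: five letters, four digits, one letter.
KN_PAN = re.compile(rf"(?:{_ALTERNATION}){{5}}[0-9]{{4}}(?:{_ALTERNATION})")
LATIN_PAN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def back_map_pan(text: str) -> str:
    """Rewrite Kannada-spelled, PAN-shaped strings into Latin letters."""

    def to_latin(match):
        latin = _LETTER.sub(lambda m: KN_LETTER_NAMES[m.group(0)], match.group(0))
        # Keep the original span unless the result has a valid PAN shape.
        return latin if LATIN_PAN.fullmatch(latin) else match.group(0)

    return KN_PAN.sub(to_latin, text)


# The transliterated example from the table, written with escapes.
sample = "\u0c8e\u0cac\u0cbf\u0cb8\u0cbf\u0ca1\u0cbf\u0c871234\u0c8e\u0cab\u0ccd"
print(back_map_pan(f"PAN: {sample}"))
```

Restricting the rewrite to PAN-shaped spans, and to fields known to carry IDs, limits the risk of touching ordinary Kannada text that happens to contain the same syllables.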
Things the model does well
- ✅ Numbers preserved across multi-number sentences
- ✅ Dates preserved (including years 2024-2030)
- ✅ Indian-format numbers (2,50,000 ↔ 2.5 ಲಕ್ಷ ↔ "two and a half lakh")
- ✅ Kannada numerals ↔ English digits conversion (೨,೫೦,೦೦೦ ↔ 2,50,000)
- ✅ Currency symbols and units in both directions
- ✅ Phone numbers, Aadhar numbers, email addresses preserved
- ✅ Common entity transliteration (Modi, Bengaluru, ISRO, Apple, iPhone, Reuters, etc.)
- ✅ Long sentences with complex semantics (multi-clause, conditional, scientific)
- ✅ Negation, tense, aspect handled correctly
- ✅ Safety regression — no toxic output on provocative inputs (Falklands/Hancock/Peacock set)
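If you want to verify number preservation on your own data, a small normalizer helps compare source and output regardless of script or grouping. The sketch below is a hypothetical test helper, not part of the release: it converts Kannada digits (U+0CE6 to U+0CEF) to ASCII, strips grouping commas, and formats integers with Indian digit grouping.

```python
# Kannada digits occupy U+0CE6..U+0CEF (೦..೯).
KN_DIGITS = "".join(chr(0x0CE6 + i) for i in range(10))
KN_TO_ASCII = str.maketrans(KN_DIGITS, "0123456789")
ASCII_TO_KN = str.maketrans("0123456789", KN_DIGITS)


def normalize_number(text: str) -> str:
    """Map Kannada digits to ASCII and drop grouping commas."""
    return text.translate(KN_TO_ASCII).replace(",", "")


def indian_grouping(n: int) -> str:
    """Format a non-negative integer with Indian grouping (last 3, then 2s)."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


kn_amount = indian_grouping(250000).translate(ASCII_TO_KN)
print(indian_grouping(250000))      # Indian grouping in ASCII digits
print(kn_amount)                    # same amount in Kannada digits
print(normalize_number(kn_amount))  # back to a plain digit string
```

Comparing `normalize_number(source_number)` with `normalize_number(output_number)` gives a script-independent equality check for regression tests.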
Failure-mode honesty
This is a specialized model, not a frontier LLM. For:
- Idioms → use a 7B+ model or post-edit
- Modern technical jargon (cloud-native stack names) → either keep source-as-is or use a frontier LLM
- Multilingual translation → use NLLB-200 or IndicTrans2
7. Ethical Considerations & Bias
Safety filtering applied
- 40,586 profanity/adult-content rows dropped during corpus filtering
- Safety regression test set (Falklands/Hancock/Peacock variants) — 100% pass
Known biases (inherent to corpus)
- Indian-context skew — entities, locations, brand names from Indian public discourse over-represented (this is intentional given the deployment target)
- 2022-era training data — modern tech terminology (2023-2026) less well-covered
- News + Wikipedia heavy — colloquial chat patterns under-represented vs daily speech
Source code attribution
This release ships with HF integration code (configuration_controlmt.py,
modeling_controlmt.py, tokenization_controlmt.py) plus the native architecture
(model.py). All Apache 2.0.
Usage
Quick start — Python + Transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
# KN → EN
out = model.translate("ಅವನು ನಾಳೆ ಬೆಂಗಳೂರಿಗೆ ಬಂದು ನನ್ನನ್ನು ಭೇಟಿಯಾಗುತ್ತಾನೆ.",
tokenizer=tokenizer, direction="kn2en")
# "He will come to Bangalore tomorrow and meet me."
# EN → KN
out = model.translate("India is a country in South Asia.",
tokenizer=tokenizer, direction="en2kn")
# "ದಕ್ಷಿಣ ಏಷ್ಯಾದ ಒಂದು ದೇಶ ಭಾರತ."
8. Deployment
ControlMT v2.3 is an encoder-decoder seq2seq model, in the same family as T5/mBART, not a decoder-only LM. That distinction matters for serving (see Section 9 of the deployment guide for which platforms can run it natively).
Verified deployment matrix (RTX 5060 Ti box, beam=2, 6 KN↔EN test pairs):
| Recipe | Latency / pair | Memory | Notes |
|---|---|---|---|
| CPU int8-dynamic | 0.28 s | ~140 MB RAM | Fastest CPU path, no quality drop |
| CPU bf16 (recommended) | 0.51 s | 280 MB RAM | One-line dtype=torch.bfloat16 |
| CPU fp32 | 1.44 s | 560 MB RAM | Baseline |
| GPU fp16 (recommended) | 0.19 s | 404 MB VRAM | Volta-and-up |
| GPU bf16 | 0.19 s | 404 MB VRAM | Ampere-and-up |
| GPU fp32 | 0.20 s | 793 MB VRAM | No speed benefit, more memory |
| HF Space (Docker) | 3–15 s | shared free-tier | Live demo |
| FastAPI / Docker / Endpts | matches device | matches device | Source under assets/space/ |
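The CPU memory column is close to what the parameter count alone predicts. The sketch below multiplies the ~139.2M parameters from the Model Details table by the bytes per weight for each dtype; it ignores activations, buffers, and runtime overhead, and it assumes every weight is stored at the stated width, which is a simplification for int8 dynamic quantization.

```python
PARAMS = 139.2e6  # total parameter count from the Model Details table

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_megabytes(dtype: str, params: float = PARAMS) -> float:
    """Approximate weight storage in MB (10^6 bytes), excluding activations."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: ~{weight_megabytes(dtype):.0f} MB")
```

These estimates land within a few MB of the CPU rows (560 / 280 / ~140 MB). The GPU rows are higher than the weight-only estimate, presumably because they include runtime allocations beyond the weights (an assumption, not something the table states).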
Verified with these pinned versions: python 3.12.3 · torch 2.10.0 · transformers 4.57.6 · sentencepiece 0.2.1 · safetensors 0.7.0 · huggingface_hub 0.36.2.
Minimum supported is torch >= 2.0, transformers >= 4.40.
Not directly supported (architectural — these are decoder-only frameworks): vLLM, Ollama, llama.cpp / GGUF, HF TGI, bitsandbytes int8. Use the FastAPI wrapper instead — at 0.19 s/pair, the optimizations these tools provide are dominated by request overhead.
→ Full recipes, code, and pre-launch checklist in DEPLOYMENT.md.
→ Reproduce the matrix above: python assets/scripts/verify_deployment.py --device cuda
🎯 Help shape v2.4 — Break-the-Model Challenge
v2.4 is being designed around the gaps we find in v2.3. The live demo includes an opt-in research-data sharing toggle — when enabled, your translation (input + output + timing) is logged to a private dataset we use to identify edge cases for v2.4 training.
Things we particularly want to see fail:
- Heavily code-mixed phrases (Nange last meeting nalli decision aagilla)
- Complex numerals (೨,೩೫,೬೭೮, 1,23,45,678, mixed-script percentages)
- Regional Karnataka dialects (Mangalorean, Dharwad, Kalyana Karnataka)
- Domain terminology (cricket, finance, government schemes, temple names)
- Long literary sentences (Bendre, Karanth-era prose)
- Modern tech / SaaS jargon (already known weak — confirm + extend)
Opt-in is unchecked by default. When you do opt in, inputs are automatically PII-redacted (PAN, Aadhar, phone, email, card numbers) before storage. Full details in PRIVACY.md.
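For readers building a similar logging path, a redaction pass of this kind can be sketched with regular expressions. The patterns below are illustrative assumptions, not the demo's actual rules (those are described in PRIVACY.md): they cover email addresses, PAN-shaped strings, 16-digit card numbers, 12-digit Aadhar-style numbers, and 10-digit Indian mobile numbers.

```python
import re

# Illustrative patterns only; the demo's real redaction rules may differ.
# Order matters: longer digit runs (card) are tried before shorter ones.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("CARD", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("AADHAAR", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("PHONE", re.compile(r"(?:\+91[ -]?)?\b[6-9]\d{9}\b")),
]


def redact(text: str) -> str:
    """Replace each detected PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


message = "Call 9876543210 or mail a.b@example.com, PAN ABCDE1234F, ID 1234 5678 9012"
print(redact(message))
```

Regex redaction is a best-effort filter: it misses PII written in words or unusual formats, so it complements an opt-in policy rather than replacing one.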
Roadmap
v2.4 — Priorities locked from v2.3 evaluation
#1 — Multi-token code-mix data slice (highest-impact gap from v2.3 evaluation)
A 50k+ corpus slice of Kannada matrix sentence + 2–4 Latin-script English tokens paired
with English target preserving every Latin-script token verbatim. This is the largest
visible v2.3 weakness, identified during competitor comparison against IndicTrans2 1.1B
and Sarvam-Translate (see internal eval_results/competitor_comparison.md):
- v2.3 handles Kannada + 1 English entity cleanly
- v2.3 hallucinates entity names at Kannada + 2+ English tokens (e.g. Manyata Tech Park → Girinagar Tech Park when "Software Engineer" is also present in the same sentence)
- IndicTrans2 1.1B and Sarvam-Translate 4B both handle the 2+ case correctly
Root cause hypothesis: the decoder over-weights the Kannada language prior when the source has high English-token density, and substitutes the phonetically nearest Kannada place name from the training distribution. Closing this gap is expected to also improve:
- Long-sentence robustness (better source-attention discipline)
- Number + entity ordering in payment/transactional prose
- Tech / startup / finance jargon (which clusters multi-token English)
Other v2.4 priorities (in order of expected impact)
- Kannada proverbs & idioms corpus (5–10k pairs) — v2.3 + IT2 + Sarvam all fail on proverbs like ಮಾಡಿದ್ದುಣ್ಣೋ ಮಹಾರಾಯ (= "you reap what you sow")
- Hindi support ([HI2EN] / [EN2HI]), which opens a second language pair
- Iterative back-translation for low-resource domain expansion
- Expanded vocabulary (modern tech terms, longer alphanumeric IDs)
- Standardized BPE tokenizer (currently SentencePiece Unigram)
- Register / style control revisit (rebalanced labels + contrastive separation training)
v3.0 (TBD)
Copy-mechanism / pointer-generator for OOV-proof transliteration. A built-in solution for the entity-preservation problem instead of a corpus-only fix.
Citation
@misc{controlmt-v2.3-2026,
author = {Anand Kaman},
title = {ControlMT v2.3 — A 139M-Parameter Specialized Kannada↔English Translation Model
with Code-Mix-Native Training},
year = {2026},
howpublished = {\url{https://huggingface.co/anandkaman/controlmt-v2.3}}
}
License
Apache 2.0 — see LICENSE.