ControlMT v2.3 — Compact Kannada ↔ English Translation (139M)

TL;DR. A 139M-parameter encoder-decoder specialized for Kannada ↔ English translation. It combines a single-pair focus, code-mix-native training, and Anti-LM contrastive decoding. It achieves competitive FLORES-200 KN↔EN performance for its parameter size, with COMET-DA above 0.84 in both directions. Apache 2.0, deployable on a consumer GPU.

Headline benchmark — FLORES-200 devtest

| Metric | KN → EN | EN → KN |
| --- | --- | --- |
| CometKiwi-DA (no ref) | 0.8437 | 0.8663 |
| COMET-DA (with ref) | 0.8459 | 0.8443 |
| BLEU | 27.20 | 18.50 |
chrF 55.84 56.12

All four CometKiwi-DA and COMET-DA scores clear the 0.82 production floor; EN → KN CometKiwi-DA (0.8663) also clears the 0.85 aspirational target. BLEU/chrF measured with sacrebleu (default tokenization).

| Field | Value |
| --- | --- |
| Parameters | 139M |
| Architecture | Modular encoder-decoder (per-language wrappers + shared core) |
| Vocabulary | 128,000 (SentencePiece Unigram, joint KN+EN) |
| Languages | Kannada (kn) ↔ English (en), bidirectional |
| Training data | 6.70M parallel pairs (post CometKiwi quality filtering) + specialized streams |
| Hardware (training) | 1 × NVIDIA RTX 5060 Ti (16 GB), bf16 mixed precision |
| Release date | 2026-06-23 |
| License | Apache 2.0 |
| Author | Anand Kaman |

How this got built — journey + decisions + dead ends

If you're building a similar specialized model, the docs/ folder is a first-person account of how ControlMT went from zero to public release in three months, solo, on one GPU.


Available releases

| Repo | What you get | Best for |
| --- | --- | --- |
| anandkaman/controlmt-v2.3 (you are here) | bf16 safetensors; load with dtype=fp32 / bf16 / fp16 | General use: GPU fp16 / CPU bf16 |
| anandkaman/controlmt-v2.3-int8 | Auto-applies int8 dynamic quant on load | CPU-only / memory-constrained: 0.28 s/pair, ~140 MB RAM |
| anandkaman/controlmt-demo (Space) | Live web demo (FastAPI + static HTML/CSS/JS) | Try in browser, no install |
| pip install controlmt (SDK) | Python wrapper around all of the above | One-liner load + auto device/dtype + batched API |

Easiest path — the SDK does the right thing automatically:

# CPU-only (smaller install — ~200 MB torch instead of ~2 GB)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install controlmt

# GPU (CUDA): default; pulls the full ~2 GB CUDA torch wheel
pip install controlmt

# Python usage (after either install path)
from controlmt import ControlMT
model = ControlMT.from_hf()                              # GPU fp16 / CPU bf16 / etc — auto
model = ControlMT.from_hf(quant="int8")                  # CPU int8 dynamic
model = ControlMT.from_hf(device="cpu", dtype="bf16")    # explicit
print(model.translate("ನಾನು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇನೆ."))         # "I speak Kannada."

Why two install paths? pip install controlmt pulls torch>=2.0, which by default fetches the CUDA-enabled wheel (~2 GB). If you don't have a GPU, install the CPU-only torch wheel first (the line with --index-url) — it's ~200 MB and runs the model just fine on CPU at bf16 or int8. This is a PyTorch ecosystem quirk, not a ControlMT one — every model that depends on torch has the same trade-off.

Raw Transformers also works (no SDK needed):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Main repo — choose dtype at load time
tokenizer = AutoTokenizer.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)

# int8 repo — quantization auto-applied
model_int8 = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3-int8", trust_remote_code=True)

→ Full deployment recipes + verified latency/memory matrix: DEPLOYMENT.md


1. Model Details

ControlMT v2.3 is a modular encoder-decoder transformer specialized for Kannada ↔ English translation. Every parameter is dedicated to this one language pair, which is what lets a 139M model compete with multilingual models 4× its size on FLORES-200 KN↔EN.

Architecture

                ┌── Router (per-row direction token) ──┐
                │                                        │
        ┌───────▼─────────┐                       ┌─────▼───────────┐
        │ KN Lang Encoder │                       │ EN Lang Encoder │
        │ (2 layers)      │                       │ (2 layers)      │
        └───────┬─────────┘                       └─────────────────┘
                │
        ┌───────▼─────────┐
        │ Shared Core Enc │  6 layers, ~19M
        └───────┬─────────┘
                │
        ┌───────▼─────────┐
        │ Shared Core Dec │  6 layers, ~25M
        └───────┬─────────┘
                │
        ┌───────▼─────────┐                       ┌─────────────────┐
        │ KN Lang Decoder │                       │ EN Lang Decoder │
        │ (2 layers)      │                       │ (2 layers)      │
        └─────────────────┘                       └─────────────────┘
                                                  ↓
                                Output projection (tied embeddings, 128K vocab)

| Module | Parameters |
| --- | --- |
| Token embedding (shared, tied with output projection) | 65.5M |
| Per-language encoders (KN + EN, 2 layers each) | 12.6M |
| Shared core (6 enc + 6 dec, d_model=512, d_ff=2048, 8 heads) | 44.1M |
| Per-language decoders (KN + EN, 2 layers each) | 16.8M |
| Output projection (128K vocab × 512) | (tied with input embedding) |
| Total | ~139.2M |
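
As a rough cross-check of the table above, the weight-matrix counts can be reproduced from d_model=512, d_ff=2048, and the 128,000-entry vocabulary. This sketch assumes standard Transformer layers (four d_model × d_model projections per attention block and a two-matrix feed-forward) and ignores biases and layer norms; that layout is an assumption, not something the model card spells out.

```python
D_MODEL, D_FF, VOCAB = 512, 2048, 128_000


def attention_params(d_model: int) -> int:
    # Q, K, V and output projections, each d_model x d_model (biases ignored).
    return 4 * d_model * d_model


def ffn_params(d_model: int, d_ff: int) -> int:
    # Up-projection and down-projection matrices (biases ignored).
    return 2 * d_model * d_ff


# Encoder layer: self-attention + feed-forward.
ENC_LAYER = attention_params(D_MODEL) + ffn_params(D_MODEL, D_FF)
# Decoder layer: self-attention + cross-attention + feed-forward.
DEC_LAYER = 2 * attention_params(D_MODEL) + ffn_params(D_MODEL, D_FF)

budget = {
    "token embedding (tied)": VOCAB * D_MODEL,
    "per-language encoders (2 langs x 2 layers)": 2 * 2 * ENC_LAYER,
    "shared core (6 enc + 6 dec)": 6 * ENC_LAYER + 6 * DEC_LAYER,
    "per-language decoders (2 langs x 2 layers)": 2 * 2 * DEC_LAYER,
}
total = sum(budget.values())

for name, count in budget.items():
    print(f"{name:45s} {count / 1e6:6.1f}M")
print(f"{'total (weight matrices only)':45s} {total / 1e6:6.1f}M")
```

Under these assumptions the weight matrices alone come to about 138.9M; the small remainder up to the reported ~139.2M is plausibly biases, layer norms, and other small tensors.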

Why single-pair?

Most public Indic MT models are broad — NLLB covers 200 languages, IndicTrans2 covers 22. That breadth comes from parameter-sharing across languages, so each language pair gets only a slice of the model's capacity.

ControlMT goes the other direction: every parameter is dedicated to Kannada ↔ English. If you need broad multilingual coverage, use NLLB or IndicTrans2. If you need Kannada specifically — and you care about size, latency, or on-device deployment — this is what the trade-off looks like.


2. Intended Use & Out-of-Scope Use

Intended use

  • Production KN↔EN translation for Indian-context content: news, government documents, e-commerce, social media, customer support, conversational interfaces
  • Code-mix-aware translation — handles natural Indian Kannada that embeds English acronyms, brand names, and short loanwords
  • Edge / on-device deployment — at 139M params + int8 quantization, runs on consumer hardware (laptops, mid-tier devices with ≥4 GB RAM)
  • Office / form-data translation (KYC, applications, customer records): on the release evaluation suite the model showed near-perfect preservation of Aadhaar numbers, phone numbers, emails, dates, customer IDs, and PAN numbers in the KN→EN direction. EN→KN has a small edge case where mid-sentence PAN-format strings may be transliterated character by character into Kannada syllables (the information is preserved and recoverable via a small regex postprocessing pass; see Limitations, Section 6).

Out-of-scope use

  • ❌ Not a multilingual translator — only Kannada ↔ English. For other language pairs, see NLLB-200 or IndicTrans2.
  • ❌ Not a chatbot / not instruction-following — translation is the only supported task.
  • ❌ Not a literal-translator for idioms — see Limitations (Section 6).
  • ❌ Not certified for safety-critical domains (medical diagnosis, legal advice). The model passes a safety regression set but is not formally audited for those contexts.
  • ❌ Not a domain-specialist for highly technical scientific text without context.

3. Training Data (summary)

The base corpus is 8.06M parallel KN↔EN pairs aggregated from public Indic MT datasets — Samanantar (Ramesh et al. 2022), BPCC (Gala et al. 2023 / IndicTrans2), Sangraha (Khan et al. 2024 / IndicLLMSuite), and Aksharantar (Madhani et al. 2023) for transliteration coverage.

A multi-stage filtering pipeline (profanity filter, roundtrip audit, CometKiwi quality scoring, misalignment detection) reduces this to 6.64M clean rows in master_v22.jsonl. Bad rows (62,853) are quarantined with a _drop_reason audit trail rather than deleted.
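
The quarantine-with-audit-trail idea can be sketched as follows. Only the `_drop_reason` field name and the stage categories come from the description above; the stage functions, the `cometkiwi` field, the threshold, and the row layout are hypothetical placeholders, not the actual ControlMT pipeline.

```python
from typing import Callable, Iterable, Optional

Row = dict
Stage = Callable[[Row], Optional[str]]  # a drop reason, or None to keep the row


def empty_side(row: Row) -> Optional[str]:
    # Hypothetical stage: reject pairs with a blank source or target.
    if not row["kn"].strip() or not row["en"].strip():
        return "empty_side"
    return None


def low_cometkiwi(row: Row, threshold: float = 0.5) -> Optional[str]:
    # Hypothetical stage: `cometkiwi` is assumed to be a precomputed score.
    # The 0.5 threshold is illustrative, not the value used for ControlMT.
    if row.get("cometkiwi", 1.0) < threshold:
        return "cometkiwi_below_threshold"
    return None


def run_pipeline(rows: Iterable[Row], stages: list[Stage]):
    """Split rows into (clean, quarantined); the first failing stage names the reason."""
    clean, quarantined = [], []
    for row in rows:
        reasons = (stage(row) for stage in stages)
        reason = next((r for r in reasons if r is not None), None)
        if reason is None:
            clean.append(row)
        else:
            # Keep the row, annotated, instead of deleting it.
            quarantined.append({**row, "_drop_reason": reason})
    return clean, quarantined


rows = [
    {"kn": "ನಮಸ್ಕಾರ", "en": "Hello", "cometkiwi": 0.9},
    {"kn": "", "en": "Hi", "cometkiwi": 0.9},
    {"kn": "ಹೌದು", "en": "No", "cometkiwi": 0.2},
]
clean, quarantined = run_pipeline(rows, [empty_side, low_cometkiwi])
```

Keeping rejected rows with a reason attached makes it possible to audit each filter later, or to relax one and recover its rows without rebuilding the corpus.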

Augmenting the main corpus, four small internally-generated streams target specific weaknesses: Pattern A (30K NER-validated proper-noun pairs), Pattern B (8K cm_paired groups for code-mix), F2 (~5K letter-spelled acronyms), and numerical_aug (form-preservation for digits/dates/currency).

Full filtering pipeline, per-stream methodology, training principles, and reproducibility steps: see TRAINING_GUIDE.md.

Data licensing: Model weights and ControlMT-specific generated streams are released under Apache 2.0. Public source corpora retain their original licenses (Samanantar: CC-BY-NC 4.0; others: CC-BY-4.0).

3.4 Training principles

  • Decoder hygiene gate (kn_is_mixed): rows with 3+ consecutive Latin words in KN are excluded from EN→KN target — prevents mixed-code emission
  • CM-Concatenation Level A: paired (kn_pure, kn_mixed) batching for natural code-mix handling
  • EMA (decay=0.999) + SWA averaging for production weights
  • Anti-LM contrastive decoding (α=0.5) at inference — kills repetition + hallucination
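
A minimal sketch of the decoder hygiene gate described in the first bullet, assuming a "Latin word" is a whitespace-delimited token made only of ASCII letters once surrounding punctuation is stripped. The function name mirrors the `kn_is_mixed` label above, but the tokenization rule is an assumption rather than the released implementation.

```python
import string


def _is_latin_word(token: str) -> bool:
    # Strip surrounding punctuation, then require ASCII letters only.
    core = token.strip(string.punctuation)
    return bool(core) and all(ch.isascii() and ch.isalpha() for ch in core)


def kn_is_mixed(kn_text: str, max_run: int = 2) -> bool:
    """True if the Kannada side contains more than `max_run` consecutive Latin words."""
    run = 0
    for token in kn_text.split():
        run = run + 1 if _is_latin_word(token) else 0
        if run > max_run:
            return True
    return False


def en2kn_targets(rows):
    # Rows flagged as mixed are kept out of the EN→KN target side only.
    return [row for row in rows if not kn_is_mixed(row["kn"])]
```

With the default `max_run=2`, a single embedded entity such as ISRO passes the gate, while a run of three or more English words on the Kannada side is excluded from the EN→KN targets.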

4. Evaluation

4.1 Public benchmark sets

| Set | Pairs | License | Citation | Reference |
| --- | --- | --- | --- | --- |
| FLORES-200 devtest | 1,012 | CC-BY-SA 4.0 | NLLB Team, No Language Left Behind: Scaling Human-Centered Machine Translation, 2022 | github.com/facebookresearch/flores |
| IN22-Gen | 1,024 | CC-BY-4.0 | Gala et al., IndicTrans2, TMLR 2023 | huggingface.co/datasets/ai4bharat/IN22-Gen |
| IN22-Conv | 1,503 | CC-BY-4.0 | Gala et al., IndicTrans2, TMLR 2023 | huggingface.co/datasets/ai4bharat/IN22-Conv |
| eval_curated_v22 (internal, supplementary) | 800 | Internal | Style-stratified sample (200/style bucket) from master_v22.jsonl | Released alongside this model |
| code_mix_eval (internal, supplementary) | 100 | Internal | Code-mix probe set, curated 2026-04 | Released alongside this model |

4.2 Scoring tools

| Tool | Use | Citation | Reference |
| --- | --- | --- | --- |
| Unbabel/wmt22-cometkiwi-da | Reference-free QE | Rei et al., CometKiwi: IST-Unbabel Submission for the WMT22 Quality Estimation Shared Task, WMT 2022 | huggingface.co/Unbabel/wmt22-cometkiwi-da |
| Unbabel/wmt22-comet-da | Reference-based metric | Rei et al., COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task, WMT 2022 | huggingface.co/Unbabel/wmt22-comet-da |
| sacrebleu (default tokenization) | BLEU + chrF | Post, A Call for Clarity in Reporting BLEU Scores, WMT 2018 | github.com/mjpost/sacrebleu |

4.3 Decoding configuration for reported scores

| Parameter | Value |
| --- | --- |
| Beam size | 6 |
| Length penalty | 1.2 |
| no-repeat n-gram size | 3 |
| Anti-LM α | 0.5 |
| Max length | 256 |

4.4 Results

FLORES-200 devtest (1,012 pairs)

| Metric | KN → EN | EN → KN |
| --- | --- | --- |
| CometKiwi (no ref) | 0.8437 | 0.8663 |
| COMET-DA (with ref) | 0.8459 | 0.8443 |
| BLEU | 27.20 | 18.50 |
| chrF | 55.84 | 56.12 |

Ship-gate verdict: ✅ PASS. EN→KN CometKiwi (0.8663) clears the 0.85 aspirational target; the other three CometKiwi/COMET-DA scores (0.8437 to 0.8459) sit just below it. All four are above the 0.82 production floor.

IN22-Conv (1,503 pairs, AI4Bharat conversational benchmark)

| Metric | KN → EN | EN → KN |
| --- | --- | --- |
| CometKiwi (no ref) | 0.8143 | 0.8845 |
| COMET-DA (with ref) | 0.8232 | 0.8320 |
| BLEU | 21.61 | 5.47 |
| chrF | 46.65 | 35.30 |

Ship-gate verdict: ✅ PASS. Both COMET-DA values clear the 0.82 production floor, and EN→KN CometKiwi at 0.8845 exceeds the 0.85 aspirational target; KN→EN CometKiwi (0.8143) sits slightly below the floor. BLEU EN→KN is expected to be low on conversational data (short colloquial utterances with high lexical variance), so CometKiwi/COMET-DA (semantic adequacy) is the more reliable signal here.

Aggregate JSON: eval_results/in22_conv.json.
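
The two thresholds quoted in the verdicts (0.82 production floor, 0.85 aspirational target) can be applied mechanically to the reported scores. This is only a sketch: the scores are copied from the tables above, while the per-score gating rule and the use of `>=` are assumptions, since the exact ship-gate definition is not given here.

```python
FLOOR, TARGET = 0.82, 0.85  # thresholds quoted in the ship-gate verdicts


def gate(score: float) -> str:
    # Assumed rule: inclusive comparison against each threshold.
    if score >= TARGET:
        return "meets target"
    if score >= FLOOR:
        return "meets floor"
    return "below floor"


SCORES = {
    ("FLORES-200", "CometKiwi", "KN→EN"): 0.8437,
    ("FLORES-200", "CometKiwi", "EN→KN"): 0.8663,
    ("FLORES-200", "COMET-DA", "KN→EN"): 0.8459,
    ("FLORES-200", "COMET-DA", "EN→KN"): 0.8443,
    ("IN22-Conv", "CometKiwi", "KN→EN"): 0.8143,
    ("IN22-Conv", "CometKiwi", "EN→KN"): 0.8845,
    ("IN22-Conv", "COMET-DA", "KN→EN"): 0.8232,
    ("IN22-Conv", "COMET-DA", "EN→KN"): 0.8320,
}

report = {key: gate(value) for key, value in SCORES.items()}
for (benchmark, metric, direction), verdict in report.items():
    print(f"{benchmark:11s} {metric:10s} {direction}: {verdict}")
```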

4.5 Comparison context

Direct head-to-head COMET-DA benchmarks for distilled-size Indic MT models on FLORES KN↔EN are not uniformly published in a single source. The major academic reference (IndicTrans2, Gala et al. 2023) reports chrF++ in its main tables and includes COMET-22 values in supplementary tables; NLLB (NLLB Team 2022) reports spBLEU + chrF++ but does not publish per-language COMET-DA for the distilled checkpoints.

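As one citation-grounded anchor in the same metric space: IndicTrans2 1.1B reports COMET-22 ≈ 0.84 on IN22-Conv KN→EN (Gala et al. 2023, Appendix Table 45). At 139M parameters, ControlMT v2.3 reports COMET-DA 0.8232 on that same benchmark and direction, and 0.8459 on FLORES KN→EN, which is competitive within its parameter class. The FLORES figure is on a different test set, so only the IN22-Conv numbers are directly comparable.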

For an apples-to-apples comparison on your own infrastructure, the open-source eval pipeline in scripts/eval_release.py can be pointed at any KN↔EN MT model (NLLB / IndicTrans2 / Sarvam-Translate) using the same FLORES devtest pairs and same scoring tooling (CometKiwi-DA, COMET-DA, sacrebleu), giving directly-comparable numbers without trusting any individual paper's reporting.


5. Decoding Configuration (recommended presets)

Default (production)

generate_kwargs = dict(
    num_beams=6,
    length_penalty=1.2,
    no_repeat_ngram_size=3,
    anti_lm_alpha=0.5,
    max_length=256,
)

Fast (~2× throughput, ~0.5 BLEU lower)

generate_kwargs = dict(num_beams=4, anti_lm_alpha=0.0, max_length=256)

Greedy (fastest, ~1.5 BLEU lower than default)

generate_kwargs = dict(num_beams=1, max_length=256)

High-quality (~30% slower, marginal gain)

generate_kwargs = dict(num_beams=8, anti_lm_alpha=0.7, max_length=256)
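
If the presets are switched at runtime, it can help to keep them in one table. The values below are copied from the four presets above; the `PRESETS` name and the `generate_kwargs_for` helper are hypothetical conveniences, not part of the ControlMT SDK.

```python
PRESETS = {
    "default": dict(num_beams=6, length_penalty=1.2, no_repeat_ngram_size=3,
                    anti_lm_alpha=0.5, max_length=256),
    "fast": dict(num_beams=4, anti_lm_alpha=0.0, max_length=256),
    "greedy": dict(num_beams=1, max_length=256),
    "high_quality": dict(num_beams=8, anti_lm_alpha=0.7, max_length=256),
}


def generate_kwargs_for(preset: str = "default", **overrides) -> dict:
    """Return a copy of the named preset with any per-call overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    return {**PRESETS[preset], **overrides}


# Example: the fast preset, with a shorter length cap for chat-sized inputs.
chat_kwargs = generate_kwargs_for("fast", max_length=128)
```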

What is Anti-LM contrastive decoding?

At every decoding step, the model computes two next-token distributions:

  1. Main: p(y_t | source, y_<t)
  2. Anti-LM: p(y_t | NO_source, y_<t) (cross-attention masked out)

Contrastive score: log p_main − α · log p_antilm. Tokens predictable without seeing the source get penalized — kills repetition and source-detached hallucination. α=0 disables; α=0.5 is the production default.
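
A toy illustration of the contrastive score on a three-token vocabulary. The two distributions are made-up numbers chosen to show the effect; in the real model they would come from the decoder with and without cross-attention to the source.

```python
import math


def contrastive_scores(p_main, p_antilm, alpha=0.5):
    # score_i = log p_main_i - alpha * log p_antilm_i
    return [math.log(pm) - alpha * math.log(pa) for pm, pa in zip(p_main, p_antilm)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Token 0 is likely with or without the source (a generic continuation).
# Token 1 is likely only when the source is visible.
p_main = [0.5, 0.4, 0.1]
p_antilm = [0.8, 0.1, 0.1]

plain_choice = argmax(contrastive_scores(p_main, p_antilm, alpha=0.0))
contrastive_choice = argmax(contrastive_scores(p_main, p_antilm, alpha=0.5))
print(plain_choice, contrastive_choice)
```

With α=0 the generic token 0 wins; at α=0.5 its score is penalized because the source-blind distribution already predicts it, and the source-dependent token 1 is selected instead.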


6. Limitations

| Class | Example | Why |
| --- | --- | --- |
| Idioms taken literally | "break a leg" → ಕಾಲು ಮುರಿಯಿರಿ (literal); "raining cats and dogs" → literal translation | Known weakness at sub-1B parameter scale. |
| Long-tail tech / SaaS names | Modern cloud-native terms (Kubernetes, GraphQL, Redis, PostgreSQL) may transliterate inconsistently or get omitted | Specific tech vocabulary rare in 2022-era training corpus. Common names (Apple, iPhone, Google) handled well. |
| Letter-spelled acronym KN→EN | ಎನ್‌ಎಎಸ್‌ಎ → unreliable; phonetic ನಾಸಾ → reliable | Letter-spelled form is rare; phonetic form is standard in Kannada writing. |
| Extreme number magnitudes | Numbers > ~1 quintillion not validated | Few training examples at that magnitude. |
| Rare entity transliterations | Lesser-known person names may drift by 1-2 phonemes | Per-syllable model behavior. |
| PAN/long alphanumeric IDs mid-sentence (EN→KN only) | On the release evaluation suite the model showed near-perfect preservation of Aadhaar numbers, phone numbers, email addresses, customer IDs, dates of birth, and PAN numbers. The exception: on a small EN→KN probe of 5 PAN sentences, 3/5 preserved the Latin form verbatim and 1/5 was transliterated character by character into Kannada syllables (e.g. ABCDE1234F → ಎಬಿಸಿಡಿಇ1234ಎಫ್); the information is still preserved because the syllables map deterministically back to Latin. KN→EN did not exhibit this on the eval suite. Recommended postprocessing for form-data deployments: regex-detect Kannada-syllable sequences in PAN/Aadhaar context fields and back-map them to Latin; validate against the issuing-authority checksum before downstream use. | Mid-sentence PAN is rare in 2022-era training corpus. KN→EN and clear-prefix EN→KN cases preserve Latin verbatim. |
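
The recommended back-mapping pass could look roughly like this. The letter-name table is a hypothetical, partial one covering only the letters in the ABCDE1234F example, and the final check only enforces the PAN layout (five letters, four digits, one letter); it is a sketch, not the postprocessor shipped with the model, and it does not implement any issuing-authority validation.

```python
import re

# Hypothetical, partial table of Kannada letter names (only the letters in the
# example above); a deployment would need the full A-Z set.
KN_LETTER_NAMES = {
    "ಎಫ್": "F",
    "ಎ": "A",
    "ಬಿ": "B",
    "ಸಿ": "C",
    "ಡಿ": "D",
    "ಇ": "E",
}

# Longest names first, so the name for F is tried before its prefix, the name for A.
_names = sorted(KN_LETTER_NAMES, key=len, reverse=True)
_TOKEN = re.compile("|".join(map(re.escape, _names)) + "|[0-9]")

# PAN layout: five letters, four digits, one letter.
PAN_FORMAT = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def back_map_pan(field: str) -> str:
    """Return the Latin PAN if `field` is a transliterated one, else `field` unchanged."""
    out, pos = [], 0
    while pos < len(field):
        match = _TOKEN.match(field, pos)
        if match is None:
            return field  # not purely letter names and digits: leave untouched
        token = match.group(0)
        out.append(KN_LETTER_NAMES.get(token, token))
        pos = match.end()
    candidate = "".join(out)
    return candidate if PAN_FORMAT.fullmatch(candidate) else field
```

Applying this only to fields already known to hold a PAN keeps it from rewriting ordinary Kannada text, and anything that does not resolve to a well-formed PAN is returned unchanged.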

Things the model does well

  • ✅ Numbers preserved across multi-number sentences
  • ✅ Dates preserved (including years 2024-2030)
  • ✅ Indian-format numbers (2,50,000 ↔ 2.5 ಲಕ್ಷ ↔ "two and a half lakh")
  • ✅ Kannada numerals ↔ English digits conversion (೨,೫೦,೦೦೦ ↔ 2,50,000)
  • ✅ Currency symbols and units in both directions
  • ✅ Phone numbers, Aadhaar numbers, email addresses preserved
  • ✅ Common entity transliteration (Modi, Bengaluru, ISRO, Apple, iPhone, Reuters, etc.)
  • ✅ Long sentences with complex semantics (multi-clause, conditional, scientific)
  • ✅ Negation, tense, aspect handled correctly
  • ✅ Safety regression — no toxic output on provocative inputs (Falklands/Hancock/Peacock set)
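
For deployments that want to verify number preservation on their own data, a simple check is to normalize Kannada digits to ASCII and compare the digit sequences on both sides. This is a hypothetical helper, not part of the release; it only compares written digits, so spelled-out forms such as "two and a half lakh" are outside its scope.

```python
import re

# Kannada digits occupy U+0CE6 (zero) through U+0CEF (nine).
KANNADA_DIGITS = "".join(chr(0x0CE6 + i) for i in range(10))
TO_ASCII_DIGITS = str.maketrans(KANNADA_DIGITS, "0123456789")


def normalize_digits(text: str) -> str:
    return text.translate(TO_ASCII_DIGITS)


def number_tokens(text: str) -> list:
    # Digit groups joined by "," or "."; commas are grouping only, so drop them.
    found = re.findall(r"[0-9]+(?:[,.][0-9]+)*", normalize_digits(text))
    return sorted(token.replace(",", "") for token in found)


def numbers_preserved(source: str, hypothesis: str) -> bool:
    """True if both sides contain the same numbers, ignoring script and grouping."""
    return number_tokens(source) == number_tokens(hypothesis)
```

Because commas are dropped before comparison, Indian grouping (2,50,000) and Western grouping (250,000) of the same value are treated as equal.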

Failure-mode honesty

This is a specialized model, not a frontier LLM. For:

  • Idioms → use a 7B+ model or post-edit
  • Modern technical jargon (cloud-native stack names) → either keep source-as-is or use a frontier LLM
  • Multilingual translation → use NLLB-200 or IndicTrans2

7. Ethical Considerations & Bias

Safety filtering applied

  • 40,586 profanity/adult-content rows dropped during corpus filtering
  • Safety regression test set (Falklands/Hancock/Peacock variants) — 100% pass

Known biases (inherent to corpus)

  • Indian-context skew — entities, locations, brand names from Indian public discourse over-represented (this is intentional given the deployment target)
  • 2022-era training data — modern tech terminology (2023-2026) less well-covered
  • News + Wikipedia heavy — colloquial chat patterns under-represented vs daily speech

Source code attribution

This release ships with HF integration code (configuration_controlmt.py, modeling_controlmt.py, tokenization_controlmt.py) plus the native architecture (model.py). All Apache 2.0.


Usage

Quick start — Python + Transformers

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("anandkaman/controlmt-v2.3", trust_remote_code=True)

# KN → EN
out = model.translate("ಅವನು ನಾಳೆ ಬೆಂಗಳೂರಿಗೆ ಬಂದು ನನ್ನನ್ನು ಭೇಟಿಯಾಗುತ್ತಾನೆ.",
                       tokenizer=tokenizer, direction="kn2en")
# "He will come to Bangalore tomorrow and meet me."

# EN → KN
out = model.translate("India is a country in South Asia.",
                       tokenizer=tokenizer, direction="en2kn")
# "ದಕ್ಷಿಣ ಏಷ್ಯಾದ ಒಂದು ದೇಶ ಭಾರತ."

8. Deployment

ControlMT v2.3 is an encoder-decoder seq2seq model, in the same family as T5/mBART, not a decoder-only LM. That distinction matters for serving (see Section 9 of the deployment guide for which platforms can run it natively).

Verified deployment matrix (RTX 5060 Ti box, beam=2, 6 KN↔EN test pairs):

| Recipe | Latency / pair | Memory | Notes |
| --- | --- | --- | --- |
| CPU int8-dynamic | 0.28 s | ~140 MB RAM | Fastest CPU path, no quality drop |
| CPU bf16 (recommended) | 0.51 s | 280 MB RAM | One-line dtype=torch.bfloat16 |
| CPU fp32 | 1.44 s | 560 MB RAM | Baseline |
| GPU fp16 (recommended) | 0.19 s | 404 MB VRAM | Volta-and-up |
| GPU bf16 | 0.19 s | 404 MB VRAM | Ampere-and-up |
| GPU fp32 | 0.20 s | 793 MB VRAM | No speed benefit, more memory |
| HF Space (Docker) | 3–15 s | shared free-tier | Live demo |
| FastAPI / Docker / Endpoints | matches device | matches device | Source under assets/space/ |
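
The CPU memory figures line up with a simple bytes-per-parameter estimate. The sketch below assumes the ~139.2M total from the parameter table and that every weight is stored at the stated width, which is a simplification (dynamic int8 quantization, for instance, typically does not cover every tensor).

```python
PARAMS = 139.2e6  # total from the parameter table
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_megabytes(dtype: str) -> float:
    """Approximate size of the weights alone, in MB (10^6 bytes)."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e6


for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: ~{weight_megabytes(dtype):.0f} MB")
```

This gives roughly 557, 278, and 139 MB, close to the 560 / 280 / ~140 MB RAM rows. The GPU rows are higher than the weights alone, presumably because they include runtime overhead beyond the parameters.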

Versions we verified with: python 3.12.3 · torch 2.10.0 · transformers 4.57.6 · sentencepiece 0.2.1 · safetensors 0.7.0 · huggingface_hub 0.36.2. Minimum supported: torch >= 2.0, transformers >= 4.40.

Not directly supported (architectural — these are decoder-only frameworks): vLLM, Ollama, llama.cpp / GGUF, HF TGI, bitsandbytes int8. Use the FastAPI wrapper instead — at 0.19 s/pair, the optimizations these tools provide are dominated by request overhead.

→ Full recipes, code, and pre-launch checklist in DEPLOYMENT.md.
→ Reproduce the matrix above: python assets/scripts/verify_deployment.py --device cuda


🎯 Help shape v2.4 — Break-the-Model Challenge

v2.4 is being designed around the gaps we find in v2.3. The live demo includes an opt-in research-data sharing toggle — when enabled, your translation (input + output + timing) is logged to a private dataset we use to identify edge cases for v2.4 training.

Things we particularly want to see fail:

  • Heavily code-mixed phrases (Nange last meeting nalli decision aagilla)
  • Complex numerals (೨,೩೫,೬೭೮, 1,23,45,678, mixed-script percentages)
  • Regional Karnataka dialects (Mangalorean, Dharwad, Kalyana Karnataka)
  • Domain terminology (cricket, finance, government schemes, temple names)
  • Long literary sentences (Bendre, Karanth-era prose)
  • Modern tech / SaaS jargon (already known weak — confirm + extend)

Opt-in is unchecked by default. When you do opt in, inputs are automatically PII-redacted (PAN, Aadhaar, phone, email, card numbers) before storage. Full details in PRIVACY.md.
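
To give a sense of what regex-based redaction of these categories can look like, here is an illustrative sketch. The patterns, labels, and ordering are assumptions made for this example (10-digit Indian mobile numbers, 12-digit Aadhaar, 16-digit cards, the PAN layout); the rules actually used by the demo are the ones documented in PRIVACY.md.

```python
import re

# Order matters: longer digit patterns run first so a 16-digit card number is
# not partially consumed by the 12-digit or 10-digit patterns.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("CARD", re.compile(r"\b(?:[0-9]{4}[ -]?){3}[0-9]{4}\b")),
    ("AADHAAR", re.compile(r"\b[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}\b")),
    ("PHONE", re.compile(r"\b[6-9][0-9]{9}\b")),
]


def redact(text: str) -> str:
    """Replace each matched span with a bracketed category label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("PAN ABCDE1234F, phone 9876543210, mail a.b@example.com"))
```

A production redactor would also need to handle Kannada-script digits and country-code prefixes, which this sketch leaves out.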


Roadmap

v2.4 — Priorities locked from v2.3 evaluation

#1 — Multi-token code-mix data slice (highest-impact gap from v2.3 evaluation)

A corpus slice of 50k+ pairs, each pairing a Kannada matrix sentence containing 2–4 Latin-script English tokens with an English target that preserves every Latin-script token verbatim. This is the largest visible v2.3 weakness, identified during competitor comparison against IndicTrans2 1.1B and Sarvam-Translate (see internal eval_results/competitor_comparison.md):

  • v2.3 handles Kannada + 1 English entity cleanly
  • v2.3 hallucinates entity names at Kannada + 2+ English tokens (e.g. Manyata Tech Park → Girinagar Tech Park when "Software Engineer" is also present in the same sentence)
  • IndicTrans2 1.1B and Sarvam-Translate 4B both handle the 2+ case correctly

Root cause hypothesis: the decoder over-weights the Kannada language prior when the source has high English-token density, and substitutes the phonetically nearest Kannada place name from the training distribution. Closing this gap is expected to also improve:

  • Long-sentence robustness (better source-attention discipline)
  • Number + entity ordering in payment/transactional prose
  • Tech / startup / finance jargon (which clusters multi-token English)

Other v2.4 priorities (in order of expected impact)

  • Kannada proverbs & idioms corpus (5–10k pairs) — v2.3 + IT2 + Sarvam all fail on proverbs like ಮಾಡಿದ್ದುಣ್ಣೋ ಮಹಾರಾಯ (= "you reap what you sow")
  • Hindi support ([HI2EN] / [EN2HI]) — opens a second language pair
  • Iterative back-translation for low-resource domain expansion
  • Expanded vocabulary (modern tech terms, longer alphanumeric IDs)
  • Standardized BPE tokenizer (currently SentencePiece Unigram)
  • Register / style control revisit (rebalanced labels + contrastive separation training)

v3.0 (TBD)

Copy-mechanism / pointer-generator for OOV-proof transliteration: a built-in solution for the entity-preservation problem instead of a corpus-only fix.


Citation

@misc{controlmt-v2.3-2026,
  author = {Anand Kaman},
  title  = {ControlMT v2.3 — A 139M-Parameter Specialized Kannada↔English Translation Model
           with Code-Mix-Native Training},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/anandkaman/controlmt-v2.3}}
}

License

Apache 2.0 — see LICENSE.
