MonSub — Mongol Editor LLM v2

An LLM specialized for correcting Whisper ASR output in Mongolian and for general Mongolian-language reasoning. V2 is a LoRA fine-tune on top of the Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled base model.

Model Summary

| Field | Value |
|---|---|
| Base model | Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Qwen3.5 causal LM, 4B parameters, bfloat16 |
| Adapter type | LoRA (PEFT), r=64, alpha=128 |
| Adapter size | ~340 MB (adapter_model.safetensors) |
| Trainable params | ~84M (2.1% of base) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Context length | 768 tokens (training); up to 32k (inference) |
| Language | Mongolian (Cyrillic) + English transliteration |
| License | Apache 2.0 (inherits from base model terms) |
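
As a sanity check, the adapter size and trainable-parameter share in the table are consistent with each other if the LoRA weights are stored in fp32 (a common default for PEFT adapters; this is an inference from the numbers above, not a statement from the training logs):

```python
# Back-of-envelope check of the numbers in the table above.
trainable = 84_000_000        # ~84M trainable LoRA parameters
base = 4_000_000_000          # 4B base parameters

share = trainable / base * 100          # 2.1 (% of base)
adapter_mb = trainable * 4 / 1_000_000  # fp32 = 4 bytes/param -> 336 MB, i.e. ~340 MB
print(share, adapter_mb)
```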

Intended Use

Primary production use cases:

  1. ASR post-editing — fix raw Whisper output: punctuation restoration, capitalization, grammar normalization, brand name correction.
  2. Subtitle cleanup — turn messy chunked transcripts into readable subtitle lines with chain-of-thought explanation of what was fixed.
  3. Short Mongolian summarization — condense articles, interviews, or transcripts into 2-3 sentence summaries.
  4. Mongolian grammar Q&A — explain Mongolian language rules, suffix agreement, conjunction usage.
  5. General Mongolian reasoning — step-by-step analysis of questions with structured chain-of-thought responses (inherited from the Claude-distilled base).

Out-of-scope / not recommended:

  • Real-time conversational chat (model is optimized for edit tasks)
  • Long-form creative writing (output is typically capped at 512 tokens)
  • Languages other than Mongolian + English transliterated terms
  • Factual questions outside the training-time knowledge cutoff (2024)

Training Details

Data composition

V2 extends V1 with augmented Mongolian data targeting known weaknesses:

| Bucket | Examples | Purpose |
|---|---|---|
| Original news CPT + SFT | 95,480 | Base Mongolian fluency + punctuation |
| Brand name corrections | 1,038 × 3 weight | "чита 5" → "GTA 5", "аифон" → "iPhone" |
| Anti-hallucination pairs | 282 × 3 weight | Clean text → unchanged output |
| Comma placement rules | 100 × 3 weight | Conjunction commas (бөгөөд, харин, мөн) |
| Mongolian proper nouns | 100 × 3 weight | "улаанбаатар" → "Улаанбаатар" |
| Knowledge Q&A | 26 × 8 weight | History, culture, geography, literature |
| Summarization | 7 × 8 weight | Long text → short summary |
| Reasoning chains | 5 × 8 weight | Multi-perspective analysis |
| Content rewriting | 6 × 8 weight | Formal↔casual, long↔short |
| Language grammar rules | 5 × 8 weight | "гэж" vs "гэдэг", suffix agreement |
| **Total (weighted)** | **97,673** | |
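
The "× N weight" buckets can be realized in several ways; one minimal sketch is oversampling by repetition (the actual pipeline may instead weight losses or sample probabilistically, so treat this as illustrative):

```python
def oversample(buckets):
    """buckets: list of (examples, weight) pairs.

    Repeats each bucket's examples `weight` times, so heavily weighted
    buckets are seen proportionally more often per epoch.
    """
    out = []
    for examples, weight in buckets:
        out.extend(examples * weight)
    return out

mixed = oversample([(["news_ex"], 1), (["brand_ex"], 3)])
# brand_ex now appears 3x as often as news_ex
```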

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-5 (v1 used 1e-4) |
| Scheduler | Cosine with warmup |
| Warmup ratio | 0.03 |
| Batch size (effective) | 16 (4 × 4 grad accum) |
| Epochs | 3 |
| Max sequence length | 768 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (use_reentrant=False) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Optimizer | AdamW (torch) |
| Total steps | 18,315 |
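
The total step count follows directly from the weighted dataset size and the effective batch size:

```python
import math

weighted_examples = 97_673   # weighted total from the data composition table
effective_batch = 4 * 4      # per-device batch size x gradient accumulation
epochs = 3

steps_per_epoch = math.ceil(weighted_examples / effective_batch)  # 6105
total_steps = steps_per_epoch * epochs                            # 18315
print(total_steps)
```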

Training run

  • Hardware: NVIDIA A40 46 GB
  • Duration: ~9.5 hours (34,160 sec wall clock)
  • Resume strategy: initialized from V1 adapter, fine-tuned further
  • Final metrics:
    • train_loss: 0.7875 (down from V1's 1.105 — 29% reduction)
    • train_samples_per_second: 8.58
    • train_steps_per_second: 0.536
    • Perplexity: exp(0.7875) ≈ 2.20
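
Perplexity here is simply the exponentiated mean training loss:

```python
import math

v2_ppl = math.exp(0.7875)  # V2 train_loss -> perplexity, ~2.20
v1_ppl = math.exp(1.105)   # V1 train_loss -> perplexity, ~3.02
print(round(v2_ppl, 2), round(v1_ppl, 2))
```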

Evaluation

Heuristic eval (10 real-world cases)

| Test case | V1 quality | V2 quality |
|---|---|---|
| Whisper brand-error chunk | 3/10 | 5/10 |
| News segment | 9/10 | 9/10 |
| Interview (commas + caps) | 8/10 | 9/10 |
| Short punctuation restore | 10/10 | 10/10 |
| Long punct + comma | 9/10 | 9/10 |
| Casual YouTube vlog intro | 10/10 | 10/10 |
| Tech review (iPhone) | 8/10 | 8.5/10 |
| Clean text (hallucination test) | 5/10 ⚠️ | 10/10 |
| Podcast discussion | 8/10 | 8/10 |
| Dates and numbers | 9/10 | 9/10 |
| **Average** | **~79%** | **~87.5%** |

Key improvements in V2:

  • Hallucination guard — the biggest win. On clean text inputs, V1 sometimes invented new facts (e.g. "...хот бөгөөд 1924 онд байгуулагдсан" — "...city, and was founded in 1924"). V2 correctly outputs "Засах шаардлагагүй" ("no change needed").
  • Brand detection — V2 explicitly identifies "Брэндийн нэр буруу бичигдсэн" in chain-of-thought and produces "IPhone" (was "Аифон" in V1).
  • Cleaner CoT format — more consistent structure: "Дараах зүйл засах хэрэгтэй → Засварласан хувилбар".
  • Lower perplexity — 2.20 vs V1's 3.02 (a 27% reduction).

Residual weaknesses (for v3 roadmap):

  • Internal commas before "бөгөөд/харин" still inconsistent
  • "iPhone" vs "IPhone" capitalization
  • "Чита 5 → GTA 5" is handled by backend post-processing dictionary, not the LoRA itself
  • Slower at inference (16s/case vs V1's 12s) due to more verbose CoT

Real-world production quality

When combined with the MonSub backend's brand dictionary post-processor (apply_brand_fixes()), end-to-end subtitle quality reaches ~90%+ on typical Mongolian YouTube/podcast content.
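
The backend post-processor itself is not published here, but its role can be sketched as a simple dictionary pass (a hypothetical minimal version; the real apply_brand_fixes() in editor_postprocess.py may be more involved):

```python
import re

# Hypothetical subset of the brand dictionary; entries are taken from the
# training-data examples above.
BRAND_FIXES = {
    "чита 5": "GTA 5",
    "аифон": "iPhone",
}

def apply_brand_fixes(text: str) -> str:
    # Case-insensitive literal replacement of known mis-transcribed brand names.
    for wrong, right in BRAND_FIXES.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text

print(apply_brand_fixes("өнөөдөр бид чита 5 тоглоомын тухай ярилцана"))
# → "өнөөдөр бид GTA 5 тоглоомын тухай ярилцана"
```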

Usage

Direct inference (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE = "Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled"
ADAPTER = "Tsedee/mongol-editor-llm-v2"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

def edit(instruction: str, text: str) -> str:
    msgs = [{"role": "user", "content": f"{instruction}\n\n{text}"}]
    prompt = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.05,
            # fall back to EOS when the tokenizer defines no pad token
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example: Whisper ASR fix
result = edit(
    instruction="Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    text="өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
print(result)

Example outputs

Input (instruction): Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.
Input (text): би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн ажилладаг байсан одоо гурван зуу гаруй ажилтантай

V2 output:

Би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн
ажилладаг байсан одоо гурван зуу гаруй ажилтантай.

Input (instruction): Дараах асуудлыг шинжилж, дүгнэлт гаргана уу.
Input (text): Залуус яагаад гадаад руу явж суралцдаг вэ?

V2 output: (chain-of-thought)

Энэ асуултыг хэд хэдэн өнцгөөс авч үзье:

1. Боловсролын чанар: Олон оронд эрэлттэй байдаг шинжлэх ухаан,
   инженерчлэл, анагаах ухааны чиглэлээр дэлхийн зэрэглэлтэй их
   сургуульд суралцах боломж байдаг.
2. Ажлын зах зээл: Гадаадад сурсан дипломтой хүн Монголын зах зээл дээр
   өндөр үнэтэй мэргэжилтэн болж ажлын байр олоход илүү хялбар байдаг.
3. Хэл сурах: Англи, Япон, Солонгос хэлийг бүрэн эзэмших нь гадаад орон
   хамгийн үр дүнтэй газар.
4. Олон улсын туршлага: Өөр соёл, амьдралын хэв маяг мэдрэх нь хувь
   хүний хөгжилд чухал.

Дүгнэлт: Залуус гадаад руу очих шалтгаан нь олон талын — боловсрол,
карьер, хувь хүний хөгжил гэсэн 3 гол бүрэлдэхүүн хэсгийг агуулж байна.

Limitations

  • 4B parameters is ~100× smaller than Claude Opus / GPT-4. For broad general knowledge, a larger frontier model will outperform MonSub.
  • Mongolian-specific proper nouns not in training data (e.g. newer celebrities, 2025+ events) may not be recognized.
  • Brand correction relies on a hybrid: the LoRA detects brand errors, but the backend dictionary (editor_postprocess.py) does the actual replacement. Running the LoRA alone without the post-processor will miss some brand fixes.
  • Output verbosity — the model likes to emit chain-of-thought even for simple fixes. For minimum-latency subtitle editing, strip the reasoning block before rendering (see example code below).
  • Not safety-tuned — inherits the base model's alignment (which was distilled from Claude, so relatively aligned, but not explicitly RLHF'd by us).

Stripping chain-of-thought from output

Model output typically follows the pattern:

Энэ өгүүлбэрт дараах зүйлс засах хэрэгтэй:
1. ...
2. ...

Засварласан хувилбар:
<FINAL TEXT HERE>
</think>

<FINAL TEXT DUPLICATE>

To extract only the final corrected text:

def extract_final(raw: str) -> str:
    text = raw
    # Take content after "Засварласан хувилбар:"
    for marker in ("Засварласан хувилбар:", "Засварласан өгүүлбэр:"):
        if marker in text:
            text = text.split(marker, 1)[1]
            break
    # Cut at </think> if present (duplicate after)
    if "</think>" in text:
        text = text.split("</think>", 1)[0]
    return text.strip()

Citation

If you use this model in your research or product, please cite:

@misc{monsub-editor-v2-2026,
  author       = {Tsedee},
  title        = {MonSub Mongol Editor v2: Mongolian Subtitle Correction LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tsedee/mongol-editor-llm-v2}},
}

Framework versions

  • PEFT 0.18.1
  • Transformers 5.5.0
  • PyTorch 2.6.0+cu124
  • Datasets 4.8.4
  • Tokenizers 0.22.2