MonSub — Mongol Editor LLM v2

An LLM specialized for correcting Whisper ASR output in Mongolian and for general Mongolian-language reasoning. V2 is a LoRA fine-tune on top of the Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled base model.

Model Summary

| Field | Value |
|---|---|
| Base model | Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Qwen3.5 causal LM, 4B parameters, bfloat16 |
| Adapter type | LoRA (PEFT), r=64, alpha=128 |
| Adapter size | ~340 MB (adapter_model.safetensors) |
| Trainable params | ~84M (2.1% of base) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Context length | 768 tokens (training); up to 32k (inference) |
| Language | Mongolian (Cyrillic) + English transliteration |
| License | Apache 2.0 (inherits from base model terms) |
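
As a sanity check, the adapter size and trainable-parameter share in the table are consistent with each other if the LoRA weights are stored in fp32 (a common default for PEFT adapters; this is an inference from the numbers above, not a statement from the training logs):

```python
# Back-of-envelope check of the numbers in the table above.
trainable = 84_000_000        # ~84M trainable LoRA parameters
base = 4_000_000_000          # 4B base parameters

share = trainable / base * 100          # 2.1 (% of base)
adapter_mb = trainable * 4 / 1_000_000  # fp32 = 4 bytes/param -> 336 MB, i.e. ~340 MB
print(share, adapter_mb)
```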

Intended Use

Primary production use cases:

  1. ASR post-editing — fix raw Whisper output: punctuation restoration, capitalization, grammar normalization, brand name correction.
  2. Subtitle cleanup — turn messy chunked transcripts into readable subtitle lines with chain-of-thought explanation of what was fixed.
  3. Short Mongolian summarization — condense articles, interviews, or transcripts into 2-3 sentence summaries.
  4. Mongolian grammar Q&A — explain Mongolian language rules, suffix agreement, conjunction usage.
  5. General Mongolian reasoning — step-by-step analysis of questions with structured chain-of-thought responses (inherited from the Claude-distilled base).

Out-of-scope / not recommended:

  • Real-time conversational chat (model is optimized for edit tasks)
  • Long-form creative writing (output is typically capped at 512 tokens)
  • Languages other than Mongolian + English transliterated terms
  • Factual questions outside the training-time knowledge cutoff (2024)

Training Details

Data composition

V2 extends V1 with augmented Mongolian data targeting known weaknesses:

| Bucket | Examples | Purpose |
|---|---|---|
| Original news CPT + SFT | 95,480 | Base Mongolian fluency + punctuation |
| Brand name corrections | 1,038 × 3 weight | "чита 5" → "GTA 5", "аифон" → "iPhone" |
| Anti-hallucination pairs | 282 × 3 weight | Clean text → unchanged output |
| Comma placement rules | 100 × 3 weight | Conjunction commas (бөгөөд, харин, мөн) |
| Mongolian proper nouns | 100 × 3 weight | "улаанбаатар" → "Улаанбаатар" |
| Knowledge Q&A | 26 × 8 weight | History, culture, geography, literature |
| Summarization | 7 × 8 weight | Long text → short summary |
| Reasoning chains | 5 × 8 weight | Multi-perspective analysis |
| Content rewriting | 6 × 8 weight | Formal↔casual, long↔short |
| Language grammar rules | 5 × 8 weight | "гэж" vs "гэдэг", suffix agreement |
| **Total (weighted)** | **97,673** | |
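
The "× N weight" buckets can be realized in several ways; one minimal sketch is oversampling by repetition (the actual pipeline may instead weight losses or sample probabilistically, so treat this as illustrative):

```python
def oversample(buckets):
    """buckets: list of (examples, weight) pairs.

    Repeats each bucket's examples `weight` times, so heavily weighted
    buckets are seen proportionally more often per epoch.
    """
    out = []
    for examples, weight in buckets:
        out.extend(examples * weight)
    return out

mixed = oversample([(["news_ex"], 1), (["brand_ex"], 3)])
# brand_ex now appears 3x as often as news_ex
```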

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-5 (v1 used 1e-4) |
| Scheduler | Cosine with warmup |
| Warmup ratio | 0.03 |
| Batch size (effective) | 16 (4 × 4 grad accum) |
| Epochs | 3 |
| Max sequence length | 768 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (use_reentrant=False) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Optimizer | AdamW (torch) |
| Total steps | 18,315 |
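
The total step count follows directly from the weighted dataset size and the effective batch size:

```python
import math

weighted_examples = 97_673   # weighted total from the data composition table
effective_batch = 4 * 4      # per-device batch size x gradient accumulation
epochs = 3

steps_per_epoch = math.ceil(weighted_examples / effective_batch)  # 6105
total_steps = steps_per_epoch * epochs                            # 18315
print(total_steps)
```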

Training run

  • Hardware: NVIDIA A40 46 GB
  • Duration: ~9.5 hours (34,160 sec wall clock)
  • Resume strategy: initialized from V1 adapter, fine-tuned further
  • Final metrics:
    • train_loss: 0.7875 (down from V1's 1.105 — 29% reduction)
    • train_samples_per_second: 8.58
    • train_steps_per_second: 0.536
    • Perplexity: exp(0.7875) ≈ 2.20
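
Perplexity here is simply the exponentiated mean training loss:

```python
import math

v2_ppl = math.exp(0.7875)  # V2 train_loss -> perplexity, ~2.20
v1_ppl = math.exp(1.105)   # V1 train_loss -> perplexity, ~3.02
print(round(v2_ppl, 2), round(v1_ppl, 2))
```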

Evaluation

Heuristic eval (10 real-world cases)

| Test case | V1 quality | V2 quality |
|---|---|---|
| Whisper brand-error chunk | 3/10 | 5/10 |
| News segment | 9/10 | 9/10 |
| Interview (commas + caps) | 8/10 | 9/10 |
| Short punctuation restore | 10/10 | 10/10 |
| Long punct + comma | 9/10 | 9/10 |
| Casual YouTube vlog intro | 10/10 | 10/10 |
| Tech review (iPhone) | 8/10 | 8.5/10 |
| Clean text (hallucination test) | 5/10 ⚠️ | 10/10 |
| Podcast discussion | 8/10 | 8/10 |
| Dates and numbers | 9/10 | 9/10 |
| **Average** | **~79%** | **~87.5%** |

Key improvements in V2:

  • Hallucination guard — the biggest win. On clean text inputs, V1 sometimes invented new facts (e.g. "...хот бөгөөд 1924 онд байгуулагдсан" — "...city, and was founded in 1924"). V2 correctly outputs "Засах шаардлагагүй" ("no change needed").
  • Brand detection — V2 explicitly identifies "Брэндийн нэр буруу бичигдсэн" in chain-of-thought and produces "IPhone" (was "Аифон" in V1).
  • Cleaner CoT format — more consistent structure: "Дараах зүйл засах хэрэгтэй → Засварласан хувилбар".
  • Lower perplexity — 2.20 vs V1's 3.02 (a 27% reduction).

Residual weaknesses (for v3 roadmap):

  • Internal commas before "бөгөөд/харин" still inconsistent
  • "iPhone" vs "IPhone" capitalization
  • "Чита 5 → GTA 5" is handled by backend post-processing dictionary, not the LoRA itself
  • Slower at inference (16s/case vs V1's 12s) due to more verbose CoT

Real-world production quality

When combined with the MonSub backend's brand dictionary post-processor (apply_brand_fixes()), end-to-end subtitle quality reaches ~90%+ on typical Mongolian YouTube/podcast content.
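
The backend post-processor itself is not published here, but its role can be sketched as a simple dictionary pass (a hypothetical minimal version; the real apply_brand_fixes() in editor_postprocess.py may be more involved):

```python
import re

# Hypothetical subset of the brand dictionary; entries are taken from the
# training-data examples above.
BRAND_FIXES = {
    "чита 5": "GTA 5",
    "аифон": "iPhone",
}

def apply_brand_fixes(text: str) -> str:
    # Case-insensitive literal replacement of known mis-transcribed brand names.
    for wrong, right in BRAND_FIXES.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text

print(apply_brand_fixes("өнөөдөр бид чита 5 тоглоомын тухай ярилцана"))
# → "өнөөдөр бид GTA 5 тоглоомын тухай ярилцана"
```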

Usage

Direct inference (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE = "Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled"
ADAPTER = "Tsedee/mongol-editor-llm-v2"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

def edit(instruction: str, text: str) -> str:
    msgs = [{"role": "user", "content": f"{instruction}\n\n{text}"}]
    prompt = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.05,
            # fall back to EOS when the tokenizer defines no pad token
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example: Whisper ASR fix
result = edit(
    instruction="Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    text="өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
print(result)

Example outputs

Input (instruction): Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.
Input (text): би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн ажилладаг байсан одоо гурван зуу гаруй ажилтантай

V2 output:

Би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн
ажилладаг байсан одоо гурван зуу гаруй ажилтантай.

Input (instruction): Дараах асуудлыг шинжилж, дүгнэлт гаргана уу.
Input (text): Залуус яагаад гадаад руу явж суралцдаг вэ?

V2 output: (chain-of-thought)

Энэ асуултыг хэд хэдэн өнцгөөс авч үзье:

1. Боловсролын чанар: Олон оронд эрэлттэй байдаг шинжлэх ухаан,
   инженерчлэл, анагаах ухааны чиглэлээр дэлхийн зэрэглэлтэй их
   сургуульд суралцах боломж байдаг.
2. Ажлын зах зээл: Гадаадад сурсан дипломтой хүн Монголын зах зээл дээр
   өндөр үнэтэй мэргэжилтэн болж ажлын байр олоход илүү хялбар байдаг.
3. Хэл сурах: Англи, Япон, Солонгос хэлийг бүрэн эзэмших нь гадаад орон
   хамгийн үр дүнтэй газар.
4. Олон улсын туршлага: Өөр соёл, амьдралын хэв маяг мэдрэх нь хувь
   хүний хөгжилд чухал.

Дүгнэлт: Залуус гадаад руу очих шалтгаан нь олон талын — боловсрол,
карьер, хувь хүний хөгжил гэсэн 3 гол бүрэлдэхүүн хэсгийг агуулж байна.

Limitations

  • 4B parameters is ~100× smaller than Claude Opus / GPT-4. For broad general knowledge, a larger frontier model will outperform MonSub.
  • Mongolian-specific proper nouns not in training data (e.g. newer celebrities, 2025+ events) may not be recognized.
  • Brand correction relies on a hybrid: the LoRA detects brand errors, but the backend dictionary (editor_postprocess.py) does the actual replacement. Running the LoRA alone without the post-processor will miss some brand fixes.
  • Output verbosity — the model likes to emit chain-of-thought even for simple fixes. For minimum-latency subtitle editing, strip the reasoning block before rendering (see example code below).
  • Not safety-tuned — inherits the base model's alignment (which was distilled from Claude, so relatively aligned, but not explicitly RLHF'd by us).

Stripping chain-of-thought from output

Model output typically follows the pattern:

Энэ өгүүлбэрт дараах зүйлс засах хэрэгтэй:
1. ...
2. ...

Засварласан хувилбар:
<FINAL TEXT HERE>
</think>

<FINAL TEXT DUPLICATE>

To extract only the final corrected text:

def extract_final(raw: str) -> str:
    text = raw
    # Take content after "Засварласан хувилбар:"
    for marker in ("Засварласан хувилбар:", "Засварласан өгүүлбэр:"):
        if marker in text:
            text = text.split(marker, 1)[1]
            break
    # Cut at </think> if present (duplicate after)
    if "</think>" in text:
        text = text.split("</think>", 1)[0]
    return text.strip()

Citation

If you use this model in your research or product, please cite:

@misc{monsub-editor-v2-2026,
  author       = {Tsedee},
  title        = {MonSub Mongol Editor v2: Mongolian Subtitle Correction LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tsedee/mongol-editor-llm-v2}},
}

Framework versions

  • PEFT 0.18.1
  • Transformers 5.5.0
  • PyTorch 2.6.0+cu124
  • Datasets 4.8.4
  • Tokenizers 0.22.2