Instructions to use Tsedee/mongol-editor-llm-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use Tsedee/mongol-editor-llm-v1 with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("/workspace/qwen35-4b-claude") model = PeftModel.from_pretrained(base_model, "Tsedee/mongol-editor-llm-v1") - Transformers
How to use Tsedee/mongol-editor-llm-v1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Tsedee/mongol-editor-llm-v1") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Tsedee/mongol-editor-llm-v1", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tsedee/mongol-editor-llm-v1 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Tsedee/mongol-editor-llm-v1" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tsedee/mongol-editor-llm-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/Tsedee/mongol-editor-llm-v1
- SGLang
How to use Tsedee/mongol-editor-llm-v1 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Tsedee/mongol-editor-llm-v1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tsedee/mongol-editor-llm-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Tsedee/mongol-editor-llm-v1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Tsedee/mongol-editor-llm-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use Tsedee/mongol-editor-llm-v1 with Docker Model Runner:
docker model run hf.co/Tsedee/mongol-editor-llm-v1
| """ | |
| MonSub LLM Editor โ Self-bootstrapping RunPod Serverless Handler | |
| Loads: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled (base) | |
| + Tsedee/mongol-editor-llm-v1 (LoRA adapter) [swap to -v2 after v2 training] | |
| Accepts batches of raw Whisper-style text segments and returns edited | |
| Mongolian subtitle text with post-processing: | |
| - Brand name correction (chitaโGTA, ะฐะธัะพะฝโiPhone, etc.) | |
| - Hallucination guard (rejects outputs that are too different from input) | |
| - Chain-of-thought stripping (keeps only "ะะฐัะฒะฐัะปะฐัะฐะฝ ั ัะฒะธะปะฑะฐั:" content) | |
| - </think> tag cleanup | |
| API: | |
| Input (JSON): | |
| { | |
| "texts": ["text 1", "text 2", ...], # required | |
| "mode": "edit" | "summarize" | "rewrite", # default: "edit" | |
| "instruction": "optional custom prompt", # optional | |
| "skip_post_processing": false # optional | |
| } | |
| Output: | |
| { | |
| "edited": ["edited 1", "edited 2", ...], | |
| "stats": { "count": N, "time_s": T, "tokens_per_s": X }, | |
| "fallback_used": [idx1, idx2, ...] # indices where hallucination guard fired | |
| } | |
| """ | |
| import os, sys, subprocess, time | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # BOOTSTRAP | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| def ensure(pkg_import, pip_name=None): | |
| try: | |
| __import__(pkg_import) | |
| except ImportError: | |
| name = pip_name or pkg_import | |
| print(f"[BOOT] installing {name}...", flush=True) | |
| subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", "--no-cache-dir", name], check=True) | |
| print("[BOOT] LLM editor handler starting...", flush=True) | |
| t0 = time.time() | |
| ensure("runpod") | |
| ensure("transformers", "transformers==5.5.0") | |
| ensure("peft", "peft==0.18.1") | |
| ensure("accelerate", "accelerate>=1.0.0") | |
| ensure("huggingface_hub") | |
| print(f"[BOOT] deps ready in {time.time()-t0:.1f}s", flush=True) | |
| # โโ Module-level: only stdlib + runpod โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8") | |
| import re | |
| import traceback | |
| import runpod | |
| HF_TOKEN = os.environ.get("HF_TOKEN", "") | |
| BASE_MODEL = os.environ.get("BASE_MODEL", "Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled") | |
| ADAPTER_REPO = os.environ.get("ADAPTER_REPO", "Tsedee/mongol-editor-llm-v1") | |
| MODEL = None | |
| TOKENIZER = None | |
| torch = None # lazy-loaded | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # BRAND CORRECTION DICT โ post-processing safety net | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # Applied AFTER model output to catch brand names the model missed. | |
| # Case-insensitive substring match with word boundaries where possible. | |
| BRAND_FIXES = [ | |
| # (pattern_regex, replacement) | |
| # Games | |
| (r"\bัะธัะฐ\s*5\b", "GTA 5"), | |
| (r"\bะถะธัะฐ\s*5\b", "GTA 5"), | |
| (r"\bะณัะฐ\s*5\b", "GTA 5"), | |
| (r"\bัะธัะฐ\s*6\b", "GTA 6"), | |
| (r"\bะถะธัะฐ\s*6\b", "GTA 6"), | |
| (r"\bะณัะฐ\s*6\b", "GTA 6"), | |
| (r"\bัะธัะฐ\b", "FIFA"), | |
| (r"\bะบะพะป\s*ะพั\s*ะดััะธ\b", "Call of Duty"), | |
| (r"\bะบะฐะปะป\s*ะพั\s*ะดััะธ\b", "Call of Duty"), | |
| (r"\bะผะฐะนะฝะบัะฐัั\b", "Minecraft"), | |
| (r"\bะผะฐะนะฝ\s*ะบัะฐัั\b", "Minecraft"), | |
| (r"\bัะพะฑะปะพะบั\b", "Roblox"), | |
| (r"\bัะพััะฝะฐะนั\b", "Fortnite"), | |
| (r"\bะฒะฐะปัะพัะฐะฝั\b", "Valorant"), | |
| (r"\bะฒะฐะปะพัะฐะฝั\b", "Valorant"), | |
| (r"\bะฑะฐะณััะฐัะธ\b", "Rockstar Games"), | |
| (r"\bะฑะฐะณััะฐั\b", "Rockstar Games"), | |
| (r"\bะฟัะฑะณ\b", "PUBG"), | |
| (r"\bะบั\s*ะณะพ\b", "CS:GO"), | |
| (r"\bะดะพัะฐ\s*2\b", "Dota 2"), | |
| (r"\bัะฑะธัะพัั\b", "Ubisoft"), | |
| (r"\bััะธะผ\b", "Steam"), | |
| # Tech | |
| (r"\bะฐะธัะพะฝ\b", "iPhone"), | |
| (r"\bะฐะนัะพะฝ\b", "iPhone"), | |
| (r"\bะธะฟะฐะด\b", "iPad"), | |
| (r"\bะฐะนะฟะฐะด\b", "iPad"), | |
| (r"\bะผะฐะบะฑาฏาฏะบ\b", "MacBook"), | |
| (r"\bะผะฐะบะฑัะบ\b", "MacBook"), | |
| (r"\bัะนัะฟะพะดั\b", "AirPods"), | |
| (r"\bัะฐะผััะฝะณ\b", "Samsung"), | |
| (r"\bะณัะณะป\b", "Google"), | |
| (r"\bะณาฏาฏะณัะป\b", "Google"), | |
| (r"\bั ัะฐะฒะตะน\b", "Huawei"), | |
| (r"\bัะฐะพะผะธ\b", "Xiaomi"), | |
| (r"\bััะพะผะธ\b", "Xiaomi"), | |
| (r"\bัะตะดะผะธ\b", "Redmi"), | |
| (r"\bัะฟะป\b", "Apple"), | |
| # Apps / Social | |
| (r"\bัััะฑ\b", "YouTube"), | |
| (r"\bััาฏาฏะฑ\b", "YouTube"), | |
| (r"\bัะธะบ\s*ัะพะบ\b", "TikTok"), | |
| (r"\bัะธะบัะพะบ\b", "TikTok"), | |
| (r"\bะธะฝััะฐะณัะฐะผ\b", "Instagram"), | |
| (r"\bััะนัะฑาฏาฏะบ\b", "Facebook"), | |
| (r"\bัะตะนัะฑัะบ\b", "Facebook"), | |
| (r"\bะฒะฐัะฐะฟ\b", "WhatsApp"), | |
| (r"\bะฒะฐััะฐะฟ\b", "WhatsApp"), | |
| (r"\bัะตะปะตะณัะฐะผ\b", "Telegram"), | |
| (r"\bะดะธัะบะพัะด\b", "Discord"), | |
| (r"\bัะฒะธััะตั\b", "Twitter"), | |
| (r"\bัะฟะพัะธัะฐะน\b", "Spotify"), | |
| (r"\bะฝะตััะปะธะบั\b", "Netflix"), | |
| (r"\bัะฑะตั\b", "Uber"), | |
| (r"\bัะฐั\s*ะถะฟั\b", "ChatGPT"), | |
| (r"\bัะฐัะณะฟั\b", "ChatGPT"), | |
| (r"\bะผะธะดะถะพัะฝะธ\b", "Midjourney"), | |
| # Music / celebs | |
| (r"\bะฑัั\b", "BTS"), | |
| (r"\bะฑััั\b", "BTS"), | |
| (r"\bะฑะปัะบะฟะธะฝะบ\b", "BLACKPINK"), | |
| (r"\bะฑะปัะบ\s*ะฟะธะฝะบ\b", "BLACKPINK"), | |
| # Common proper nouns | |
| (r"\bัะปะฐะฐะฝะฑะฐะฐัะฐั\b", "ะฃะปะฐะฐะฝะฑะฐะฐัะฐั"), | |
| (r"\bะผะพะฝะณะพะป\s+ัะปั\b", "ะะพะฝะณะพะป ะฃะปั"), | |
| (r"\bะทะฐัะณะธะนะฝ\s+ะณะฐะทะฐั\b", "ะะฐัะณะธะนะฝ ะณะฐะทะฐั"), | |
| (r"\bัะธั \b", "ะฃะะฅ"), | |
| (r"\bะผัะธั\b", "ะะฃะะก"), | |
| ] | |
| COMPILED_BRAND_FIXES = [(re.compile(pat, re.IGNORECASE), rep) for pat, rep in BRAND_FIXES] | |
| def apply_brand_fixes(text: str) -> str: | |
| """Apply brand name corrections. Case-insensitive substitution.""" | |
| if not text: | |
| return text | |
| for pattern, replacement in COMPILED_BRAND_FIXES: | |
| text = pattern.sub(replacement, text) | |
| return text | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # OUTPUT PARSING & HALLUCINATION GUARD | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| def strip_reasoning(raw_output: str) -> str: | |
| """ | |
| Extract the final edited version from model output. The training format is: | |
| ะญะฝั ำฉะณาฏาฏะปะฑััั ะดะฐัะฐะฐั ะทาฏะนะปั ะทะฐัะฐั ั ัััะณััะน: | |
| 1. ... | |
| 2. ... | |
| ะะฐัะฒะฐัะปะฐัะฐะฝ ั ัะฒะธะปะฑะฐั: | |
| <FINAL TEXT> | |
| </think> | |
| <FINAL TEXT again> | |
| We want just <FINAL TEXT>. Strategy: | |
| 1. Split on "ะะฐัะฒะฐัะปะฐัะฐะฝ ั ัะฒะธะปะฑะฐั:" โ take everything after | |
| 2. Split on "</think>" โ take first half (before tag) | |
| 3. Strip whitespace | |
| 4. If step 1 fails, return input as-is (assume model output was direct) | |
| """ | |
| if not raw_output: | |
| return "" | |
| text = raw_output | |
| # Prefer content after "ะะฐัะฒะฐัะปะฐัะฐะฝ ั ัะฒะธะปะฑะฐั:" | |
| marker = "ะะฐัะฒะฐัะปะฐัะฐะฝ ั ัะฒะธะปะฑะฐั:" | |
| if marker in text: | |
| text = text.split(marker, 1)[1] | |
| else: | |
| # Fallback markers | |
| for alt in ("ะะฐัะฒะฐัะปะฐัะฐะฝ ำฉะณาฏาฏะปะฑัั:", "ะญััะธะนะฝ ั ัะฒะธะปะฑะฐั:", "ะำฉะฒ ั ัะฒะธะปะฑะฐั:"): | |
| if alt in text: | |
| text = text.split(alt, 1)[1] | |
| break | |
| # Cut at </think> โ anything after is a duplicate | |
| if "</think>" in text: | |
| text = text.split("</think>", 1)[0] | |
| if "<think>" in text: | |
| # take content after <think> ... </think> block OR before it | |
| parts = text.split("<think>", 1) | |
| text = parts[0] if parts[0].strip() else parts[1].split("</think>", 1)[-1] | |
| # Sometimes the chain-of-thought bleeds in โ cut at first blank line | |
| # AFTER a colon list ("1. ..." or similar) | |
| lines = [ln.rstrip() for ln in text.strip().split("\n")] | |
| # If first line is a list item, drop lines until we hit blank | |
| cleaned = [] | |
| skip_list = False | |
| for ln in lines: | |
| stripped = ln.strip() | |
| if re.match(r"^\d+\.\s", stripped): | |
| skip_list = True | |
| continue | |
| if skip_list and stripped == "": | |
| skip_list = False | |
| continue | |
| if skip_list: | |
| continue | |
| cleaned.append(ln) | |
| out = "\n".join(cleaned).strip() | |
| return out or text.strip() | |
| def hallucination_guard(original: str, edited: str, max_ratio: float = 1.6) -> tuple[str, bool]: | |
| """ | |
| Guard against hallucination: if the edited text is drastically longer than | |
| the original OR introduces too many new tokens, fall back to the original | |
| (optionally with light cleanup). | |
| Returns (text, fallback_used). | |
| """ | |
| if not edited: | |
| return original, True | |
| orig_len = max(len(original), 1) | |
| edit_len = len(edited) | |
| # Rule 1: too much longer (model invented content) | |
| if edit_len > orig_len * max_ratio and edit_len > orig_len + 40: | |
| return original, True | |
| # Rule 2: too much shorter (model truncated unexpectedly) | |
| if edit_len < orig_len * 0.4 and orig_len > 20: | |
| return original, True | |
| # Rule 3: zero overlap with original words (wrong topic) | |
| orig_words = set(re.findall(r"\w+", original.lower())) | |
| edit_words = set(re.findall(r"\w+", edited.lower())) | |
| if orig_words and len(orig_words & edit_words) / len(orig_words) < 0.3: | |
| return original, True | |
| return edited, False | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # MODEL LOADING (lazy, fork-safe) | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| def load_model(): | |
| global MODEL, TOKENIZER, torch | |
| if MODEL is not None: | |
| return | |
| t = time.time() | |
| print("[LOAD] importing torch...", flush=True) | |
| import torch as _torch | |
| torch = _torch | |
| print("[LOAD] importing transformers + peft...", flush=True) | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| from peft import PeftModel | |
| print(f"[LOAD] CUDA available: {torch.cuda.is_available()}", flush=True) | |
| if torch.cuda.is_available(): | |
| print(f"[LOAD] device: {torch.cuda.get_device_name(0)}", flush=True) | |
| torch.cuda.init() | |
| torch.backends.cuda.matmul.allow_tf32 = True | |
| print(f"[LOAD] tokenizer from {ADAPTER_REPO}...", flush=True) | |
| TOKENIZER = AutoTokenizer.from_pretrained( | |
| ADAPTER_REPO, token=HF_TOKEN, trust_remote_code=True | |
| ) | |
| if TOKENIZER.pad_token is None: | |
| TOKENIZER.pad_token = TOKENIZER.eos_token | |
| print(f"[LOAD] base model {BASE_MODEL}...", flush=True) | |
| base = AutoModelForCausalLM.from_pretrained( | |
| BASE_MODEL, | |
| dtype=torch.bfloat16, | |
| device_map="auto", | |
| trust_remote_code=True, | |
| token=HF_TOKEN, | |
| attn_implementation="eager", | |
| ) | |
| print(f"[LOAD] adapter {ADAPTER_REPO}...", flush=True) | |
| MODEL = PeftModel.from_pretrained(base, ADAPTER_REPO, token=HF_TOKEN) | |
| MODEL.eval() | |
| print(f"[LOAD] ready in {time.time()-t:.1f}s ยท " | |
| f"VRAM {torch.cuda.memory_allocated()/1e9:.2f}GB", flush=True) | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # INFERENCE | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| INSTRUCTIONS = { | |
| "edit": "ะะฐัะฐะฐั ASR-ััั ะณะฐััะฐะฝ ัะตะบััะธะนะณ ะทะฐัะฒะฐัะปะฐะถ, ะทำฉะฒ subtitle ะฑะพะปะณะพะฝะพ ัั.", | |
| "summarize": "ะะฐัะฐะฐั ะฑะธัะปัะณะธะนะฝ ะฐะณััะปะณัะณ ัะพะฒัะธะปะฝะพ ัั.", | |
| "rewrite": "ะะฐัะฐะฐั ำฉะณาฏาฏะปะฑััะธะนะณ ััะฐะฝ ะฑะธัะปัะณััะน ะฑะพะปะณะพะฝ ะทะฐัะฝะฐ ัั.", | |
| } | |
| def generate_one(text: str, instruction: str, max_new_tokens: int = 256) -> str: | |
| """Run the model on a single text with the given instruction.""" | |
| user_msg = f"{instruction}\n\n{text}" | |
| prompt = TOKENIZER.apply_chat_template( | |
| [{"role": "user", "content": user_msg}], | |
| tokenize=False, | |
| add_generation_prompt=True, | |
| ) | |
| inputs = TOKENIZER(prompt, return_tensors="pt", truncation=True, max_length=1024).to(MODEL.device) | |
| with torch.no_grad(): | |
| out_ids = MODEL.generate( | |
| **inputs, | |
| max_new_tokens=max_new_tokens, | |
| do_sample=False, | |
| temperature=1.0, | |
| repetition_penalty=1.05, | |
| pad_token_id=TOKENIZER.pad_token_id, | |
| ) | |
| new_tokens = out_ids[0][inputs["input_ids"].shape[1]:] | |
| raw = TOKENIZER.decode(new_tokens, skip_special_tokens=True).strip() | |
| return raw | |
| def handler(event): | |
| """RunPod serverless entry point.""" | |
| try: | |
| t_total = time.time() | |
| load_model() | |
| inp = event.get("input", {}) or {} | |
| texts = inp.get("texts") | |
| if not texts or not isinstance(texts, list): | |
| return {"error": "Missing 'texts' list in input"} | |
| mode = inp.get("mode", "edit") | |
| custom_instruction = inp.get("instruction") | |
| skip_post = bool(inp.get("skip_post_processing", False)) | |
| max_new_tokens = int(inp.get("max_new_tokens", 256)) | |
| instruction = custom_instruction or INSTRUCTIONS.get(mode, INSTRUCTIONS["edit"]) | |
| edited = [] | |
| fallback_used = [] | |
| total_tokens = 0 | |
| for i, text in enumerate(texts): | |
| if not text or not text.strip(): | |
| edited.append(text) | |
| continue | |
| try: | |
| raw = generate_one(text, instruction, max_new_tokens=max_new_tokens) | |
| parsed = strip_reasoning(raw) | |
| if mode == "edit" and not skip_post: | |
| # Hallucination guard | |
| guarded, is_fallback = hallucination_guard(text, parsed) | |
| # Brand fixes (applied to both fallback and edit) | |
| guarded = apply_brand_fixes(guarded) | |
| if is_fallback: | |
| fallback_used.append(i) | |
| edited.append(guarded) | |
| else: | |
| edited.append(parsed) | |
| total_tokens += len(raw.split()) | |
| except Exception as e: | |
| print(f"[ERR] segment {i}: {e}", flush=True) | |
| traceback.print_exc() | |
| # On any failure, return the original text unchanged | |
| edited.append(text) | |
| fallback_used.append(i) | |
| elapsed = time.time() - t_total | |
| return { | |
| "edited": edited, | |
| "stats": { | |
| "count": len(texts), | |
| "time_s": round(elapsed, 2), | |
| "tokens_per_s": round(total_tokens / elapsed, 1) if elapsed > 0 else 0, | |
| }, | |
| "fallback_used": fallback_used, | |
| "mode": mode, | |
| "model": ADAPTER_REPO, | |
| } | |
| except Exception as e: | |
| traceback.print_exc() | |
| return {"error": str(e)} | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| # ENTRY POINT | |
| # โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| if __name__ == "__main__": | |
| print(f"[BOOT] total bootstrap time: {time.time()-t0:.1f}s", flush=True) | |
| runpod.serverless.start({"handler": handler}) | |