# langid-mmbert-small-8gpu

Language identification model based on `jhu-clsp/mmBERT-small` for 14 classes:

- `ko`, `no`, `da`, `sv`, `fi`, `nl`, `en`, `fr`, `de`, `es`, `pt`, `it`, `ja`
- `UNKNOWN`

## Inference Guide

This project supports two inference styles:

1. `pipeline("text-classification")` for quick usage.
2. `AutoTokenizer` + `AutoModelForSequenceClassification` for explicit forward-pass control.

Both can use a fast UNKNOWN pre-check:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)


def fast_detect_unknown(text: str) -> bool:
    s = text.strip()
    if not s:
        # Empty input is treated as non-language.
        return True
    if URL_PATTERN.search(s):
        # URLs should map to UNKNOWN.
        return True

    total = len(s)
    alpha = sum(ch.isalpha() for ch in s)
    digits = sum(ch.isdigit() for ch in s)
    spaces = sum(ch.isspace() for ch in s)
    symbols = total - alpha - digits - spaces
    non_space = max(1, total - spaces)

    # Mostly numeric strings (ids, phone numbers, etc.).
    if digits / non_space >= 0.8:
        return True
    # Symbol-heavy text is usually not valid language content.
    if symbols / non_space >= 0.45:
        return True
    # Very low alphabetic ratio indicates gibberish-like input.
    if total >= 6 and (alpha / non_space) < 0.2:
        return True
    # Long compact mixed tokens often represent hashes/usernames/keys.
    if " " not in s and total >= 12 and (alpha / non_space) < 0.45 and (digits > 0 or symbols > 0):
        return True
    return False
```

### Option A: Pipeline

```python
import torch
from transformers import pipeline

model_id = "chiennv/langid-mmbert-small"
device = 0 if torch.cuda.is_available() else -1

clf = pipeline(
    "text-classification",
    model=model_id,
    tokenizer=model_id,
    top_k=1,
    device=device,  # GPU id (0, 1, ...)
    # or -1 for CPU
)

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    out = clf(text)[0][0]
    print({"label": out["label"], "score": round(out["score"], 4)})
```

### Option B: AutoModelForSequenceClassification Only

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "chiennv/langid-mmbert-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use FP16 on GPU for faster inference and lower memory.
dtype = torch.float16 if device.type == "cuda" else torch.float32

model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=dtype).to(device)
model.eval()

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    pred_id = int(torch.argmax(probs).item())
    pred_label = model.config.id2label[pred_id]
    pred_score = float(probs[pred_id].item())
    print({"label": pred_label, "score": round(pred_score, 4)})
```

### Run local `infer.py`

```bash
python infer.py
```

## GPU Notes

- Check CUDA availability:
  - `python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no-gpu')"`
- The AutoModel example above automatically uses GPU + FP16 when CUDA is available.
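Beyond the regex/ratio pre-check, low-confidence model predictions can also be routed to `UNKNOWN`. A minimal post-check sketch; the `route_low_confidence` helper and the `0.5` threshold are illustrative assumptions, not part of the trained model:

```python
import math


def route_low_confidence(logits, id2label, threshold=0.5):
    """Map raw logits to a label dict, falling back to UNKNOWN below `threshold`.

    `logits` is a plain list of floats (one per class); `id2label` mirrors
    `model.config.id2label`. Both the helper and the threshold value are
    assumptions for illustration.
    """
    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    pred_id = max(range(len(probs)), key=probs.__getitem__)
    score = probs[pred_id]
    if score < threshold:
        # No single language is confident enough: treat as non-language input.
        return {"label": "UNKNOWN", "score": round(score, 4)}
    return {"label": id2label[pred_id], "score": round(score, 4)}
```

In the Option B flow this could replace the manual argmax, e.g. `route_low_confidence(logits.squeeze(0).tolist(), model.config.id2label)`.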