# langid-mmbert-small-8gpu

Language identification model based on `jhu-clsp/mmBERT-small` for 14 classes:

- `ko`, `no`, `da`, `sv`, `fi`, `nl`, `en`, `fr`, `de`, `es`, `pt`, `it`, `ja`
- `UNKNOWN`

## Inference Guide

This project supports two inference styles:

1. `pipeline("text-classification")` for quick usage.
2. `AutoTokenizer` + `AutoModelForSequenceClassification` for explicit forward-pass control.

Both can use a fast `UNKNOWN` pre-check:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

def fast_detect_unknown(text: str) -> bool:
    s = text.strip()
    if not s:
        # Empty input is treated as non-language.
        return True
    if URL_PATTERN.search(s):
        # URLs should map to UNKNOWN.
        return True

    total = len(s)
    alpha = sum(ch.isalpha() for ch in s)
    digits = sum(ch.isdigit() for ch in s)
    spaces = sum(ch.isspace() for ch in s)
    symbols = total - alpha - digits - spaces
    non_space = max(1, total - spaces)

    # Mostly numeric strings (ids, phone numbers, etc.).
    if digits / non_space >= 0.8:
        return True
    # Symbol-heavy text is usually not valid language content.
    if symbols / non_space >= 0.45:
        return True
    # Very low alphabetic ratio indicates gibberish-like input.
    if total >= 6 and (alpha / non_space) < 0.2:
        return True
    # Long compact mixed tokens often represent hashes/usernames/keys.
    if " " not in s and total >= 12 and (alpha / non_space) < 0.45 and (digits > 0 or symbols > 0):
        return True
    return False
```
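
The threshold rules above can be sanity-checked with a small standalone helper that mirrors the same character counting (a sketch for illustration, not part of the repo):

```python
def char_ratios(s: str) -> dict:
    """Mirror the character counting used by fast_detect_unknown."""
    total = len(s)
    alpha = sum(ch.isalpha() for ch in s)
    digits = sum(ch.isdigit() for ch in s)
    spaces = sum(ch.isspace() for ch in s)
    symbols = total - alpha - digits - spaces
    non_space = max(1, total - spaces)
    return {
        "alpha": alpha / non_space,
        "digits": digits / non_space,
        "symbols": symbols / non_space,
    }

# A phone-number-like string trips the digit-ratio rule (>= 0.8):
# 11 digits out of 12 non-space characters.
phone = char_ratios("+1 555 123 4567")
# Ordinary prose is almost entirely alphabetic and passes every threshold.
prose = char_ratios("Bonjour tout le monde")
```

Here `phone["digits"]` is 11/12 ≈ 0.92, above the 0.8 cutoff, while `prose["alpha"]` is 1.0, so only text-like input reaches the model.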
### Option A: Pipeline

```python
import torch
from transformers import pipeline

model_id = "chiennv/langid-mmbert-small"
device = 0 if torch.cuda.is_available() else -1
clf = pipeline(
    "text-classification",
    model=model_id,
    tokenizer=model_id,
    top_k=1,
    device=device,  # GPU id (0, 1, ...) or -1 for CPU
)

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    out = clf(text)[0][0]
    print({"label": out["label"], "score": round(out["score"], 4)})
```
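
Because the model has an explicit `UNKNOWN` class, low-confidence predictions can optionally be routed there as well. A hypothetical post-processing helper (the 0.5 threshold is an illustrative assumption, not a calibrated value from this repo):

```python
def with_unknown_fallback(label: str, score: float, threshold: float = 0.5) -> dict:
    # Hypothetical fallback: demote predictions below the confidence
    # threshold to UNKNOWN. Tune the threshold on held-out data.
    if score < threshold:
        return {"label": "UNKNOWN", "score": round(score, 4)}
    return {"label": label, "score": round(score, 4)}

confident = with_unknown_fallback("fr", 0.9712)  # kept as "fr"
uncertain = with_unknown_fallback("no", 0.3105)  # demoted to "UNKNOWN"
```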
### Option B: AutoModelForSequenceClassification Only

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "chiennv/langid-mmbert-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use FP16 on GPU for faster inference and lower memory.
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=dtype).to(device)
model.eval()

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    pred_id = int(torch.argmax(probs).item())
    pred_label = model.config.id2label[pred_id]
    pred_score = float(probs[pred_id].item())
    print({"label": pred_label, "score": round(pred_score, 4)})
```
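
For many inputs at once, the same forward pass works on mini-batches: chunk the texts, tokenize each chunk with `padding=True`, and run one forward pass per chunk. A minimal chunking helper (a hypothetical utility, not part of the repo):

```python
from typing import Iterator, Sequence

def chunked(texts: Sequence[str], batch_size: int = 32) -> Iterator[list]:
    # Yield successive batches of at most batch_size texts.
    for start in range(0, len(texts), batch_size):
        yield list(texts[start : start + batch_size])

# Each batch would then be tokenized in one call, e.g.:
#   inputs = tokenizer(batch, return_tensors="pt", padding=True,
#                      truncation=True, max_length=128)
batches = list(chunked(["a", "b", "c", "d", "e"], batch_size=2))
```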
### Run local `infer.py`

```bash
python infer.py
```

## GPU Notes

- Check CUDA availability:
  - `python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no-gpu')"`
- The AutoModel example above automatically uses GPU + FP16 when CUDA is available.