# langid-mmbert-small-8gpu

A language identification model based on `jhu-clsp/mmBERT-small`, covering 14 classes: 13 languages plus a catch-all `UNKNOWN` class.

- `ko`, `no`, `da`, `sv`, `fi`, `nl`, `en`, `fr`, `de`, `es`, `pt`, `it`, `ja`
- `UNKNOWN`

## Inference Guide

This project supports two inference styles:

1. `pipeline("text-classification")` for quick usage.
2. `AutoTokenizer` + `AutoModelForSequenceClassification` for explicit forward-pass control.

Both can be combined with a fast, rule-based `UNKNOWN` pre-check that short-circuits obvious non-language input before it reaches the model:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

def fast_detect_unknown(text: str) -> bool:
    s = text.strip()
    if not s:
        # Empty input is treated as non-language.
        return True
    if URL_PATTERN.search(s):
        # URLs should map to UNKNOWN.
        return True

    total = len(s)
    alpha = sum(ch.isalpha() for ch in s)
    digits = sum(ch.isdigit() for ch in s)
    spaces = sum(ch.isspace() for ch in s)
    symbols = total - alpha - digits - spaces
    non_space = max(1, total - spaces)

    # Mostly numeric strings (ids, phone numbers, etc.).
    if digits / non_space >= 0.8:
        return True
    # Symbol-heavy text is usually not valid language content.
    if symbols / non_space >= 0.45:
        return True
    # Very low alphabetic ratio indicates gibberish-like input.
    if total >= 6 and (alpha / non_space) < 0.2:
        return True
    # Long compact mixed tokens often represent hashes/usernames/keys.
    if " " not in s and total >= 12 and (alpha / non_space) < 0.45 and (digits > 0 or symbols > 0):
        return True
    return False
```
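As a worked example of the last rule, a compact hash-like token (the string below is a made-up example) has no spaces, is at least 12 characters long, and is letter-poor, so the heuristic routes it to `UNKNOWN` without calling the model:

```python
s = "19f302x841z7"  # hypothetical hash-like token

total = len(s)                          # 12
alpha = sum(ch.isalpha() for ch in s)   # 3 letters: f, x, z
digits = sum(ch.isdigit() for ch in s)  # 9 digits
spaces = sum(ch.isspace() for ch in s)  # 0
symbols = total - alpha - digits - spaces
non_space = max(1, total - spaces)

# No space, length >= 12, alpha ratio 3/12 = 0.25 < 0.45, digits present:
is_unknown = (
    " " not in s
    and total >= 12
    and (alpha / non_space) < 0.45
    and (digits > 0 or symbols > 0)
)
print(is_unknown)  # True
```

Note the earlier rules deliberately do not fire here (digit ratio 0.75 is below the 0.8 cutoff, and the alpha ratio 0.25 is above the 0.2 gibberish cutoff), so it is the final mixed-token rule that catches it.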

### Option A: Pipeline

```python
import torch
from transformers import pipeline

model_id = "chiennv/langid-mmbert-small"
device = 0 if torch.cuda.is_available() else -1
clf = pipeline(
    "text-classification",
    model=model_id,
    tokenizer=model_id,
    top_k=1,
    device=device,  # GPU id (0,1,...) or -1 for CPU
)

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    out = clf(text)[0][0]
    print({"label": out["label"], "score": round(out["score"], 4)})
```
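The same routing generalizes to batches: filter `UNKNOWN` inputs first, then send only the remainder to the classifier. Below is a sketch using a stub `classify_batch` so it runs without downloading the model, and a `looks_unknown` stand-in covering just the empty/URL cases; swap in `fast_detect_unknown` and the pipeline for real use.

```python
import re
from typing import Callable, Dict, List

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

def looks_unknown(text: str) -> bool:
    # Minimal stand-in for fast_detect_unknown: empty or URL-bearing input.
    s = text.strip()
    return not s or bool(URL_PATTERN.search(s))

def classify_with_precheck(
    texts: List[str],
    classify_batch: Callable[[List[str]], List[Dict]],
) -> List[Dict]:
    results: List[Dict] = [{} for _ in texts]
    to_model, positions = [], []
    for i, t in enumerate(texts):
        if looks_unknown(t):
            results[i] = {"label": "UNKNOWN", "score": 1.0}
        else:
            to_model.append(t)
            positions.append(i)
    if to_model:
        # Preserve original positions when merging model outputs back in.
        for pos, out in zip(positions, classify_batch(to_model)):
            results[pos] = out
    return results

# Stub classifier so the sketch runs standalone; with the pipeline above,
# use e.g. lambda batch: [r[0] for r in clf(batch)].
stub = lambda batch: [{"label": "fr", "score": 0.99} for _ in batch]
print(classify_with_precheck(["Bonjour", "", "https://example.com"], stub))
```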

### Option B: AutoTokenizer + AutoModelForSequenceClassification

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "chiennv/langid-mmbert-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use FP16 on GPU for faster inference and lower memory.
dtype = torch.float16 if device.type == "cuda" else torch.float32
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=dtype).to(device)
model.eval()

text = "Bonjour tout le monde"
if fast_detect_unknown(text):
    print({"label": "UNKNOWN", "score": 1.0})
else:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).squeeze(0)
    pred_id = int(torch.argmax(probs).item())
    pred_label = model.config.id2label[pred_id]
    pred_score = float(probs[pred_id].item())
    print({"label": pred_label, "score": round(pred_score, 4)})
```
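The post-processing in Option B (softmax, argmax, `id2label` lookup) can be checked in isolation with toy logits. The 3-class `id2label` mapping below is illustrative only; the real one lives in the checkpoint's config:

```python
import torch

# Toy logits for a 3-class head; real logits come from model(**inputs).logits.
id2label = {0: "en", 1: "fr", 2: "UNKNOWN"}  # illustrative mapping
logits = torch.tensor([[0.2, 2.5, -1.0]])

probs = torch.softmax(logits, dim=-1).squeeze(0)  # probabilities sum to 1
pred_id = int(torch.argmax(probs).item())
pred_label = id2label[pred_id]
pred_score = float(probs[pred_id].item())
print({"label": pred_label, "score": round(pred_score, 4)})
```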

### Run local `infer.py`

```bash
python infer.py
```

## GPU Notes

- Check CUDA availability:
  - `python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no-gpu')"`
- The Option B example above automatically selects GPU + FP16 when CUDA is available and falls back to CPU + FP32 otherwise.