| --- |
| license: apache-2.0 |
| language: |
| - multilingual |
| tags: |
| - programming-language-identification |
| - code |
| - byte-level |
| - lite |
| pipeline_tag: text-classification |
| metrics: |
| - f1 |
| - accuracy |
| --- |
| |
| # programming-language-identification-100plus-lite |
|
|
Byte-level programming-language identification across **107 languages**.
**2.35M parameters**, no tokenizer, ships at **~9 MB fp32 / ~4.5 MB bf16**.
|
|
| **[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** — Download and run in Colab or Jupyter. |
|
|
| The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with |
| RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash |
| embedding), vendored from |
| [PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0) |
| and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical |
| programming languages. |
|
|
| ## Comparison with `philomath-1209/programming-language-identification` |
|
|
Evaluated on 3,057 test rows over the **26 labels** the philomath model
supports. Both models run via ONNX with `CPUExecutionProvider`, batch size 64.
|
|
| | model | params | accuracy | macro F1 | weighted F1 | speed | |
| |---|---:|---:|---:|---:|---:| |
| | **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** | |
| | philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× | |
|
|
|
|
| ## Files |
|
|
| ``` |
| model.pt fp32 PyTorch checkpoint (CommonLingua format) |
| model.bf16.pt bf16 sidecar checkpoint (smaller, same accuracy in eval) |
| lang2idx.json 107-label index |
| training_metadata.json hyperparameters and dataset stats |
| training_history.json per-epoch loss / val_acc / val_macro_f1 |
| onnx/ |
| model.onnx ONNX export (opset 20, dynamic batch) |
| model.onnx.data external weights blob |
| lang2idx.json (mirror) |
| onnx_metadata.json parity report vs PyTorch |
| ``` |
|
|
| ## Quick start — PyTorch |
|
|
| ```python |
| import torch, numpy as np, sys |
| sys.path.append("path/to/code-language-id/src") |
| from code_language_id.byte_hybrid import ByteHybrid, CONFIGS |
| |
| ckpt = torch.load("model.pt", map_location="cpu", weights_only=False) |
| model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"], |
| **CONFIGS[ckpt["config"]]).eval() |
| model.load_state_dict(ckpt["model_state_dict"]) |
| idx2lang = {v: k for k, v in ckpt["lang2idx"].items()} |
| |
| def encode(texts, max_len=ckpt["max_len"]): |
| out = np.full((len(texts), max_len), 256, dtype=np.int64) |
| for i, t in enumerate(texts): |
| b = t.encode("utf-8", errors="replace")[:max_len] |
| out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8) |
| return torch.from_numpy(out) |
| |
| with torch.no_grad(): |
| logits = model(encode(["def hello():\n print('hi')"])) |
| print(idx2lang[int(logits.argmax(-1))]) # -> Python |
| ``` |
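The raw logits are unnormalized; for a top-k ranking with probability-like scores, apply a softmax first. A self-contained numpy sketch (the `topk_langs` helper and the dummy 4-class logits are illustrative; with the real model, pass `logits[0].numpy()` and the full 107-entry `idx2lang`):

```python
import numpy as np

def topk_langs(logits: np.ndarray, idx2lang: dict, k: int = 3):
    """Return the top-k (language, probability) pairs for one row of logits."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    order = probs.argsort()[::-1][:k]
    return [(idx2lang[int(i)], float(probs[i])) for i in order]

# Dummy 4-class example, not real model output.
fake = np.array([2.0, 0.1, -1.0, 1.5])
print(topk_langs(fake, {0: "Python", 1: "C", 2: "Rust", 3: "Go"}, k=2))
# top-2 here: Python, then Go
```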
|
|
| ## Quick start — ONNX Runtime |
|
|
| ```python |
| import onnxruntime as ort, numpy as np, json |
| |
| sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"]) |
| lang2idx = json.load(open("onnx/lang2idx.json")) |
| idx2lang = {v: k for k, v in lang2idx.items()} |
| MAX_LEN = 1023 |
| |
| def encode(texts, max_len=MAX_LEN): |
| out = np.full((len(texts), max_len), 256, dtype=np.int64) |
| for i, t in enumerate(texts): |
| b = t.encode("utf-8", errors="replace")[:max_len] |
| out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8) |
| return out |
| |
| logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0] |
| print(idx2lang[int(logits.argmax(-1))]) # -> Rust |
| ``` |
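The export has a dynamic batch axis, so several snippets can be classified in one `sess.run` call by stacking them with the same `encode` helper. A self-contained check of the batching and padding behavior (reusing the helper from the snippet above; no `onnxruntime` needed for the check itself):

```python
import numpy as np

MAX_LEN = 1023  # export-time sequence length from the snippet above

def encode(texts, max_len=MAX_LEN):
    # Same helper as above: bytes keep their 0-255 values; unused slots get pad id 256.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

batch = encode(["fn main() {}", "print('hi')"])
assert batch.shape == (2, MAX_LEN)            # dynamic batch axis
assert (batch[0, :12] < 256).all()            # 12 real bytes in the Rust snippet
assert (batch[1, 11:] == 256).all()           # padding after 11 bytes of Python
# With onnxruntime: sess.run(None, {"byte_ids": batch})[0] has shape (2, 107).
```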
|
|
| ## Training summary |
|
|
| - **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1 |
| (`bigcode/the-stack`), task-split to prevent leakage. |
| 72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels. |
| - **Snippets**: variable-window (64–1023 bytes) UTF-8. |
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) with a
  cosine-with-warmup schedule, peak LR 3e-3, 5 % warmup, gradient clipping at 1.0.
- **Schedule**: 30 epochs, bf16 autocast, batch size 128 (effective batch 128;
  SDPA fused attention).
| - **Best val macro F1**: 0.9085 @ epoch 26 (early stopped). |
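The cosine-with-warmup schedule above (peak LR 3e-3, 5 % warmup) is a pure function of training progress. This is an illustrative implementation, not the exact trainer code:

```python
import math

PEAK_LR = 3e-3      # from the training summary above
WARMUP_FRAC = 0.05  # 5% warmup

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to ~zero."""
    warmup = max(1, int(WARMUP_FRAC * total_steps))
    if step < warmup:
        return PEAK_LR * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
assert lr_at(49, total) == PEAK_LR      # peak reached at the end of warmup
assert abs(lr_at(999, total)) < 1e-4    # decayed to ~0 by the final step
```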
|
|
| See `training_metadata.json` for the full hyperparameter dump. |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{mariappan2026codelangidlite, |
| author = {Mariappan, Vijayachandran}, |
| title = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite} |
| } |
| ``` |
|
|
| Upstream architecture: |
|
|
| ```bibtex |
| @misc{commonlingua, |
| author = {{PleIAs}}, |
| title = {CommonLingua: Byte-level Language Identification for 334 Languages}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/PleIAs/CommonLingua} |
| } |
| ``` |
|
|
| ## License & attribution |
|
|
| Apache-2.0. Architecture and reference inference code derive from |
| **PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are |
| original to this repository. |
|
|