---
license: apache-2.0
language:
- multilingual
tags:
- programming-language-identification
- code
- byte-level
- lite
pipeline_tag: text-classification
metrics:
- f1
- accuracy
---
# programming-language-identification-100plus-lite
Byte-level programming-language identification across **107 languages**.
**2.35M parameters**, no
tokenizer, ships at **~9 MB fp32 / ~4.5 MB bf16**.
**[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** — Download and run in Colab or Jupyter.
The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with
RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash
embedding), vendored from
[PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0)
and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
programming languages.
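The trigram-hash embedding can be pictured as hashing each sliding window of three byte ids into one of 4096 buckets, giving the model cheap local n-gram features on top of raw bytes. A minimal sketch — the specific hash function and padding convention below are assumptions for illustration, not the vendored CommonLingua implementation:

```python
import numpy as np

NUM_BUCKETS = 4096  # matches the 4096-bucket figure above; the hash itself is illustrative

def trigram_buckets(byte_ids, num_buckets=NUM_BUCKETS):
    """Hash each (b[i-2], b[i-1], b[i]) byte trigram into a bucket index.

    A polynomial rolling hash stands in here; the hash actually used by
    CommonLingua may differ.
    """
    padded = np.concatenate([np.zeros(2, dtype=np.int64), byte_ids])
    return (padded[:-2] * 257 * 257 + padded[1:-1] * 257 + padded[2:]) % num_buckets

ids = np.frombuffer(b"def f():", dtype=np.uint8).astype(np.int64)
buckets = trigram_buckets(ids)  # one bucket id per input byte position
```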
## Comparison with `philomath-1209/programming-language-identification`
3,057 test rows, restricted to the **26 labels** philomath supports. Both models
evaluated with ONNX Runtime (`CPUExecutionProvider`), batch size 64.
| model | params | accuracy | macro F1 | weighted F1 | speed |
|---|---:|---:|---:|---:|---:|
| **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** |
| philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |
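The table reports both macro and weighted F1 because they answer different questions: macro F1 averages per-class F1 equally across the 26 labels, while weighted F1 weights each class by its support. A self-contained sketch of the distinction (not the evaluation harness behind the table):

```python
import numpy as np

def f1_scores(y_true, y_pred, labels):
    """Return (macro F1, support-weighted F1) over the given label set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class, support = [], []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        support.append(np.sum(y_true == c))
    per_class, support = np.array(per_class), np.array(support)
    macro = per_class.mean()                                  # every class counts equally
    weighted = (per_class * support).sum() / support.sum()    # frequent classes count more
    return macro, weighted

macro, weighted = f1_scores(y_true=[0, 0, 0, 1], y_pred=[0, 0, 1, 1], labels=[0, 1])
```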
## Files
```
model.pt fp32 PyTorch checkpoint (CommonLingua format)
model.bf16.pt bf16 sidecar checkpoint (smaller, same accuracy in eval)
lang2idx.json 107-label index
training_metadata.json hyperparameters and dataset stats
training_history.json per-epoch loss / val_acc / val_macro_f1
onnx/
model.onnx ONNX export (opset 20, dynamic batch)
model.onnx.data external weights blob
lang2idx.json (mirror)
onnx_metadata.json parity report vs PyTorch
```
## Quick start — PyTorch
```python
import torch, numpy as np, sys
sys.path.append("path/to/code-language-id/src")
from code_language_id.byte_hybrid import ByteHybrid, CONFIGS
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
                   **CONFIGS[ckpt["config"]]).eval()
model.load_state_dict(ckpt["model_state_dict"])
idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}

def encode(texts, max_len=ckpt["max_len"]):
    # Right-pad with 256 (the pad id) and truncate to max_len bytes.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return torch.from_numpy(out)

with torch.no_grad():
    logits = model(encode(["def hello():\n print('hi')"]))
print(idx2lang[int(logits.argmax(-1))])  # -> Python
```
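To rank candidate languages instead of taking a single argmax, softmax the logits and keep the top k. A self-contained sketch on stand-in values — in practice `logits` and `idx2lang` come from the quick-start snippet above, and the label map here is a three-entry dummy:

```python
import numpy as np

idx2lang = {0: "Python", 1: "Rust", 2: "Go"}  # dummy stand-in for the 107-label map
logits = np.array([[2.0, 0.5, 0.1]])          # dummy stand-in for model output

# Numerically stabilised softmax over the label axis.
z = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

order = np.argsort(-probs[0])[:3]  # label indices, most probable first
ranked = [(idx2lang[int(i)], float(probs[0, i])) for i in order]
```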
## Quick start — ONNX Runtime
```python
import onnxruntime as ort, numpy as np, json
sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
lang2idx = json.load(open("onnx/lang2idx.json"))
idx2lang = {v: k for k, v in lang2idx.items()}
MAX_LEN = 1023
def encode(texts, max_len=MAX_LEN):
    # Right-pad with 256 (the pad id) and truncate to max_len bytes.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
print(idx2lang[int(logits.argmax(-1))])  # -> Rust
```
## Training summary
- **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1
(`bigcode/the-stack`), task-split to prevent leakage.
72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
- **Snippets**: variable-window (64–1023 bytes) UTF-8.
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) + cosine-with-warmup,
  peak LR 3e-3, 5 % warmup, gradient clipping at 1.0.
- **Schedule**: 30 epochs, bf16 autocast, batch size 128 (no gradient
  accumulation, so effective batch 128; SDPA fused attention).
- **Best val macro F1**: 0.9085 @ epoch 26 (early stopped).
See `training_metadata.json` for the full hyperparameter dump.
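The warmup-then-cosine schedule in the bullets above can be sketched as a pure function of the step index. The peak LR and warmup fraction come from the optimizer bullet; the step counts in the example are illustrative, and this is not the exact training code:

```python
import math

PEAK_LR = 3e-3      # from the optimizer bullet above
WARMUP_FRAC = 0.05  # "5 % warmup"

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_frac=WARMUP_FRAC):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # progress through decay phase
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```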
## Citation
If you use this model, please cite:
```bibtex
@misc{mariappan2026codelangidlite,
author = {Mariappan, Vijayachandran},
title = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
}
```
Upstream architecture:
```bibtex
@misc{commonlingua,
author = {{PleIAs}},
title = {CommonLingua: Byte-level Language Identification for 334 Languages},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/PleIAs/CommonLingua}
}
```
## License & attribution
Apache-2.0. Architecture and reference inference code derive from
**PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are
original to this repository.