---
license: apache-2.0
language:
- multilingual
tags:
- programming-language-identification
- code
- byte-level
- lite
pipeline_tag: text-classification
metrics:
- f1
- accuracy
---

# programming-language-identification-100plus-lite

Byte-level programming-language identification across **107 languages**.
**2.35M parameters**, no tokenizer, ships at **~9 MB fp32 / ~4.5 MB bf16**.

**[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** — Download and run in Colab or Jupyter.

The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with
RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash
embedding), vendored from
[PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0)
and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
programming languages.
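
The masked mean-pool step can be sketched in a few lines: padding positions (byte id 256, as in the quick-start encoders below) are excluded from the average so short snippets are not diluted by padding. This is an illustrative sketch, not the vendored CommonLingua code.

```python
import torch

PAD_ID = 256  # padding byte id used by the encode() helpers below

def masked_mean_pool(features, byte_ids):
    """Average per-position features over real (non-padding) bytes only.

    features: (B, L, D) hidden states; byte_ids: (B, L) with PAD_ID = padding.
    """
    mask = (byte_ids != PAD_ID).unsqueeze(-1).float()   # (B, L, 1)
    summed = (features * mask).sum(dim=1)               # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)             # avoid divide-by-zero
    return summed / counts
```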

## Comparison with `philomath-1209/programming-language-identification`

Evaluated on 3,057 test rows, restricted to the **26 labels** that the
philomath model supports. Both models run via ONNX Runtime with
`CPUExecutionProvider`, batch size 64.

| model | params | accuracy | macro F1 | weighted F1 | speed |
|---|---:|---:|---:|---:|---:|
| **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** |
| philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |


## Files

```
model.pt              fp32 PyTorch checkpoint (CommonLingua format)
model.bf16.pt         bf16 sidecar checkpoint (smaller, same accuracy in eval)
lang2idx.json         107-label index
training_metadata.json  hyperparameters and dataset stats
training_history.json   per-epoch loss / val_acc / val_macro_f1
onnx/
  model.onnx          ONNX export (opset 20, dynamic batch)
  model.onnx.data     external weights blob
  lang2idx.json       (mirror)
  onnx_metadata.json  parity report vs PyTorch
```
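
The bf16 sidecar can be upcast to fp32 after loading for CPU inference, since bf16 kernels are not available on every CPU backend. The helper below is a sketch (the function name and the decision to upcast are assumptions, not part of the repository's code):

```python
import torch

def upcast_state_dict(state_dict):
    # hypothetical helper: cast bf16 floating-point tensors to fp32,
    # leaving integer tensors (indices, counters) untouched
    return {
        k: v.float() if v.is_floating_point() else v
        for k, v in state_dict.items()
    }

# demo on a dummy state dict with mixed dtypes
sd = {"w": torch.zeros(2, 2, dtype=torch.bfloat16), "idx": torch.tensor([1])}
sd32 = upcast_state_dict(sd)
```

With the real checkpoint, apply it to `ckpt["model_state_dict"]` from `model.bf16.pt` before `load_state_dict`.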

## Quick start — PyTorch

```python
import torch, numpy as np, sys
sys.path.append("path/to/code-language-id/src")
from code_language_id.byte_hybrid import ByteHybrid, CONFIGS

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
                   **CONFIGS[ckpt["config"]]).eval()
model.load_state_dict(ckpt["model_state_dict"])
idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}

def encode(texts, max_len=ckpt["max_len"]):
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return torch.from_numpy(out)

with torch.no_grad():
    logits = model(encode(["def hello():\n    print('hi')"]))
print(idx2lang[int(logits.argmax(-1))])   # -> Python
```
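
The quick start prints only the arg-max label. For ranked guesses with confidences, a softmax over the logits works the same way for either backend. The helper below is a sketch; the three-entry `demo_idx2lang` mapping is made up for the demo, not the model's 107-label index:

```python
import torch

def top_k_predictions(logits, idx2lang, k=3):
    # softmax over the class dimension, then the k most likely labels per row
    probs = torch.softmax(logits, dim=-1)
    values, indices = probs.topk(min(k, probs.shape[-1]), dim=-1)
    return [
        [(idx2lang[int(i)], float(p)) for p, i in zip(ps, ix)]
        for ps, ix in zip(values, indices)
    ]

# demo with dummy logits over a made-up 3-label mapping
demo_idx2lang = {0: "Python", 1: "Rust", 2: "Go"}
demo_logits = torch.tensor([[2.0, 0.5, 0.1]])
print(top_k_predictions(demo_logits, demo_idx2lang))
```

With the real model, pass `model(encode([...]))` as `logits` and the `idx2lang` built from the checkpoint above.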

## Quick start — ONNX Runtime

```python
import onnxruntime as ort, numpy as np, json

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
lang2idx = json.load(open("onnx/lang2idx.json"))
idx2lang = {v: k for k, v in lang2idx.items()}
MAX_LEN = 1023

def encode(texts, max_len=MAX_LEN):
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
print(idx2lang[int(logits.argmax(-1))])   # -> Rust
```

## Training summary

- **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1
  (`bigcode/the-stack`), task-split to prevent leakage.
  72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
- **Snippets**: variable-window (64–1023 bytes) UTF-8.
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) +
  cosine-with-warmup, peak LR 3e-3, 5 % warmup, gradient clipping at 1.0.
- **Schedule**: 30 epochs, bf16 autocast, batch size 128 (no gradient
  accumulation, so effective batch 128; SDPA fused attention).
- **Best val macro F1**: 0.9085 @ epoch 26 (early stopped).
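
The cosine-with-warmup schedule above can be sketched as a step → LR function. This assumes linear warmup and decay to zero, which are common conventions and may differ in detail from the actual training script:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-3, warmup_frac=0.05):
    """LR at a given optimizer step: linear warmup to peak_lr over the
    first warmup_frac of training, then cosine decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```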

See `training_metadata.json` for the full hyperparameter dump.

## Citation

If you use this model, please cite:

```bibtex
@misc{mariappan2026codelangidlite,
  author    = {Mariappan, Vijayachandran},
  title     = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
}
```

Upstream architecture:

```bibtex
@misc{commonlingua,
  author    = {{PleIAs}},
  title     = {CommonLingua: Byte-level Language Identification for 334 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/PleIAs/CommonLingua}
}
```

## License & attribution

Apache-2.0. Architecture and reference inference code derive from
**PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are
original to this repository.