---
license: apache-2.0
language:
- multilingual
tags:
- programming-language-identification
- code
- byte-level
- lite
pipeline_tag: text-classification
metrics:
- f1
- accuracy
---

# programming-language-identification-100plus-lite

Byte-level programming-language identification across **107 languages**.
**2.35M parameters**, no tokenizer, ships at **~9 MB fp32 / ~4.5 MB bf16**.

**[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** — Download and run in Colab or Jupyter.

The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with
RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash
embedding), vendored from
[PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0)
and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
programming languages.
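
The masked mean-pool step can be sketched in a few lines: padding positions (byte id 256, as in the quick-start encoders below) are excluded from the average so short snippets are not diluted by padding. This is an illustrative sketch, not the vendored CommonLingua code.

```python
import torch

PAD_ID = 256  # padding byte id used by the encode() helpers below

def masked_mean_pool(features, byte_ids):
    """Average per-position features over real (non-padding) bytes only.

    features: (B, L, D) hidden states; byte_ids: (B, L) with PAD_ID = padding.
    """
    mask = (byte_ids != PAD_ID).unsqueeze(-1).float()   # (B, L, 1)
    summed = (features * mask).sum(dim=1)               # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)             # avoid divide-by-zero
    return summed / counts
```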

## Comparison with `philomath-1209/programming-language-identification`

Evaluated on 3,057 test rows, restricted to the **26 labels** that the
philomath model supports. Both models run via ONNX Runtime with
`CPUExecutionProvider`, batch size 64.

| model | params | accuracy | macro F1 | weighted F1 | speed |
|---|---:|---:|---:|---:|---:|
| **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** |
| philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |


## Files

```
model.pt              fp32 PyTorch checkpoint (CommonLingua format)
model.bf16.pt         bf16 sidecar checkpoint (smaller, same accuracy in eval)
lang2idx.json         107-label index
training_metadata.json  hyperparameters and dataset stats
training_history.json   per-epoch loss / val_acc / val_macro_f1
onnx/
  model.onnx          ONNX export (opset 20, dynamic batch)
  model.onnx.data     external weights blob
  lang2idx.json       (mirror)
  onnx_metadata.json  parity report vs PyTorch
```
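
The bf16 sidecar can be upcast to fp32 after loading for CPU inference, since bf16 kernels are not available on every CPU backend. The helper below is a sketch (the function name and the decision to upcast are assumptions, not part of the repository's code):

```python
import torch

def upcast_state_dict(state_dict):
    # hypothetical helper: cast bf16 floating-point tensors to fp32,
    # leaving integer tensors (indices, counters) untouched
    return {
        k: v.float() if v.is_floating_point() else v
        for k, v in state_dict.items()
    }

# demo on a dummy state dict with mixed dtypes
sd = {"w": torch.zeros(2, 2, dtype=torch.bfloat16), "idx": torch.tensor([1])}
sd32 = upcast_state_dict(sd)
```

With the real checkpoint, apply it to `ckpt["model_state_dict"]` from `model.bf16.pt` before `load_state_dict`.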

## Quick start — PyTorch

```python
import torch, numpy as np, sys
sys.path.append("path/to/code-language-id/src")
from code_language_id.byte_hybrid import ByteHybrid, CONFIGS

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
                   **CONFIGS[ckpt["config"]]).eval()
model.load_state_dict(ckpt["model_state_dict"])
idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}

def encode(texts, max_len=ckpt["max_len"]):
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return torch.from_numpy(out)

with torch.no_grad():
    logits = model(encode(["def hello():\n    print('hi')"]))
print(idx2lang[int(logits.argmax(-1))])   # -> Python
```
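
The quick start prints only the arg-max label. For ranked guesses with confidences, a softmax over the logits works the same way for either backend. The helper below is a sketch; the three-entry `demo_idx2lang` mapping is made up for the demo, not the model's 107-label index:

```python
import torch

def top_k_predictions(logits, idx2lang, k=3):
    # softmax over the class dimension, then the k most likely labels per row
    probs = torch.softmax(logits, dim=-1)
    values, indices = probs.topk(min(k, probs.shape[-1]), dim=-1)
    return [
        [(idx2lang[int(i)], float(p)) for p, i in zip(ps, ix)]
        for ps, ix in zip(values, indices)
    ]

# demo with dummy logits over a made-up 3-label mapping
demo_idx2lang = {0: "Python", 1: "Rust", 2: "Go"}
demo_logits = torch.tensor([[2.0, 0.5, 0.1]])
print(top_k_predictions(demo_logits, demo_idx2lang))
```

With the real model, pass `model(encode([...]))` as `logits` and the `idx2lang` built from the checkpoint above.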

## Quick start — ONNX Runtime

```python
import onnxruntime as ort, numpy as np, json

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
lang2idx = json.load(open("onnx/lang2idx.json"))
idx2lang = {v: k for k, v in lang2idx.items()}
MAX_LEN = 1023

def encode(texts, max_len=MAX_LEN):
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
print(idx2lang[int(logits.argmax(-1))])   # -> Rust
```

## Training summary

- **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1
  (`bigcode/the-stack`), task-split to prevent leakage.
  72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
- **Snippets**: variable-window (64–1023 bytes) UTF-8.
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) +
  cosine-with-warmup, peak LR 3e-3, 5 % warmup, gradient clipping at 1.0.
- **Schedule**: 30 epochs, bf16 autocast, batch size 128 (no gradient
  accumulation, so effective batch 128; SDPA fused attention).
- **Best val macro F1**: 0.9085 @ epoch 26 (early stopped).
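
The cosine-with-warmup schedule above can be sketched as a step → LR function. This assumes linear warmup and decay to zero, which are common conventions and may differ in detail from the actual training script:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-3, warmup_frac=0.05):
    """LR at a given optimizer step: linear warmup to peak_lr over the
    first warmup_frac of training, then cosine decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```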

See `training_metadata.json` for the full hyperparameter dump.

## Citation

If you use this model, please cite:

```bibtex
@misc{mariappan2026codelangidlite,
  author    = {Mariappan, Vijayachandran},
  title     = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
}
```

Upstream architecture:

```bibtex
@misc{commonlingua,
  author    = {{PleIAs}},
  title     = {CommonLingua: Byte-level Language Identification for 334 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/PleIAs/CommonLingua}
}
```

## License & attribution

Apache-2.0. Architecture and reference inference code derive from
**PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are
original to this repository.