| --- |
| license: apache-2.0 |
| language: |
| - multilingual |
| tags: |
| - programming-language-identification |
| - code |
| - byte-level |
| - lite |
| pipeline_tag: text-classification |
| metrics: |
| - f1 |
| - accuracy |
| --- |
| |
| # programming-language-identification-100plus-lite |
|
|
Byte-level programming-language identification across **107 languages**.
**2.35M parameters**, no tokenizer, ships at **~9 MB fp32 / ~4.5 MB bf16**.
|
|
| **[Open PyTorch Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_pytorch_demo.ipynb)** · **[Open ONNX Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite/blob/main/lite_onnx_demo.ipynb)** — Download and run in Colab or Jupyter. |
|
|
| The architecture is `ByteHybrid` (3 × Conv1D → 1 × bidirectional attention with |
| RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash |
| embedding), vendored from |
| [PleIAs/CommonLingua](https://huggingface.co/PleIAs/CommonLingua) (Apache-2.0) |
| and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical |
| programming languages. |
|
|
| ## Comparison with `philomath-1209/programming-language-identification` |
|
|
Evaluated on 3,057 test rows over the **26 labels** the philomath model
supports. Both models run via ONNX with `CPUExecutionProvider`, batch size 64.
|
|
| | model | params | accuracy | macro F1 | weighted F1 | speed | |
| |---|---:|---:|---:|---:|---:| |
| | **programming-language-identification-100plus-lite** (ONNX) | 2.35 M | 0.9094 | **0.9410** | **0.9361** | **2.37×** | |
| | philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× | |
|
|
|
|
| ## Files |
|
|
| ``` |
| model.pt fp32 PyTorch checkpoint (CommonLingua format) |
| model.bf16.pt bf16 sidecar checkpoint (smaller, same accuracy in eval) |
| lang2idx.json 107-label index |
| training_metadata.json hyperparameters and dataset stats |
| training_history.json per-epoch loss / val_acc / val_macro_f1 |
| onnx/ |
| model.onnx ONNX export (opset 20, dynamic batch) |
| model.onnx.data external weights blob |
| lang2idx.json (mirror) |
| onnx_metadata.json parity report vs PyTorch |
| ``` |
|
|
| ## Quick start — PyTorch |
|
|
| ```python |
| import torch, numpy as np, sys |
| sys.path.append("path/to/code-language-id/src") |
| from code_language_id.byte_hybrid import ByteHybrid, CONFIGS |
| |
| ckpt = torch.load("model.pt", map_location="cpu", weights_only=False) |
| model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"], |
| **CONFIGS[ckpt["config"]]).eval() |
| model.load_state_dict(ckpt["model_state_dict"]) |
| idx2lang = {v: k for k, v in ckpt["lang2idx"].items()} |
| |
| def encode(texts, max_len=ckpt["max_len"]): |
| out = np.full((len(texts), max_len), 256, dtype=np.int64) |
| for i, t in enumerate(texts): |
| b = t.encode("utf-8", errors="replace")[:max_len] |
| out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8) |
| return torch.from_numpy(out) |
| |
| with torch.no_grad(): |
| logits = model(encode(["def hello():\n print('hi')"])) |
| print(idx2lang[int(logits.argmax(-1))]) # -> Python |
| ``` |
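The raw logits are unnormalized; for a top-k ranking with probability-like scores, apply a softmax first. A self-contained numpy sketch (the `topk_langs` helper and the dummy 4-class logits are illustrative; with the real model, pass `logits[0].numpy()` and the full 107-entry `idx2lang`):

```python
import numpy as np

def topk_langs(logits: np.ndarray, idx2lang: dict, k: int = 3):
    """Return the top-k (language, probability) pairs for one row of logits."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    order = probs.argsort()[::-1][:k]
    return [(idx2lang[int(i)], float(probs[i])) for i in order]

# Dummy 4-class example, not real model output.
fake = np.array([2.0, 0.1, -1.0, 1.5])
print(topk_langs(fake, {0: "Python", 1: "C", 2: "Rust", 3: "Go"}, k=2))
# top-2 here: Python, then Go
```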
|
|
| ## Quick start — ONNX Runtime |
|
|
| ```python |
| import onnxruntime as ort, numpy as np, json |
| |
| sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"]) |
| lang2idx = json.load(open("onnx/lang2idx.json")) |
| idx2lang = {v: k for k, v in lang2idx.items()} |
| MAX_LEN = 1023 |
| |
| def encode(texts, max_len=MAX_LEN): |
| out = np.full((len(texts), max_len), 256, dtype=np.int64) |
| for i, t in enumerate(texts): |
| b = t.encode("utf-8", errors="replace")[:max_len] |
| out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8) |
| return out |
| |
| logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0] |
| print(idx2lang[int(logits.argmax(-1))]) # -> Rust |
| ``` |
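The export has a dynamic batch axis, so several snippets can be classified in one `sess.run` call by stacking them with the same `encode` helper. A self-contained check of the batching and padding behavior (reusing the helper from the snippet above; no `onnxruntime` needed for the check itself):

```python
import numpy as np

MAX_LEN = 1023  # export-time sequence length from the snippet above

def encode(texts, max_len=MAX_LEN):
    # Same helper as above: bytes keep their 0-255 values; unused slots get pad id 256.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

batch = encode(["fn main() {}", "print('hi')"])
assert batch.shape == (2, MAX_LEN)            # dynamic batch axis
assert (batch[0, :12] < 256).all()            # 12 real bytes in the Rust snippet
assert (batch[1, 11:] == 256).all()           # padding after 11 bytes of Python
# With onnxruntime: sess.run(None, {"byte_ids": batch})[0] has shape (2, 107).
```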
|
|
| ## Training summary |
|
|
| - **Data**: Rosetta Code (`cakiki/rosetta-code`) + The Stack v1 |
| (`bigcode/the-stack`), task-split to prevent leakage. |
| 72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels. |
| - **Snippets**: variable-window (64–1023 bytes) UTF-8. |
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) with a
  cosine-with-warmup schedule, peak LR 3e-3, 5 % warmup, gradient clipping at 1.0.
- **Schedule**: 30 epochs, bf16 autocast, batch size 128 (effective batch 128;
  SDPA fused attention).
| - **Best val macro F1**: 0.9085 @ epoch 26 (early stopped). |
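The cosine-with-warmup schedule above (peak LR 3e-3, 5 % warmup) is a pure function of training progress. This is an illustrative implementation, not the exact trainer code:

```python
import math

PEAK_LR = 3e-3      # from the training summary above
WARMUP_FRAC = 0.05  # 5% warmup

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to ~zero."""
    warmup = max(1, int(WARMUP_FRAC * total_steps))
    if step < warmup:
        return PEAK_LR * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
assert lr_at(49, total) == PEAK_LR      # peak reached at the end of warmup
assert abs(lr_at(999, total)) < 1e-4    # decayed to ~0 by the final step
```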
|
|
| See `training_metadata.json` for the full hyperparameter dump. |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{mariappan2026codelangidlite, |
| author = {Mariappan, Vijayachandran}, |
| title = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite} |
| } |
| ``` |
|
|
| Upstream architecture: |
|
|
| ```bibtex |
| @misc{commonlingua, |
| author = {{PleIAs}}, |
| title = {CommonLingua: Byte-level Language Identification for 334 Languages}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/PleIAs/CommonLingua} |
| } |
| ``` |
|
|
| ## License & attribution |
|
|
| Apache-2.0. Architecture and reference inference code derive from |
| **PleIAs/CommonLingua** (Apache-2.0). Trained weights and dataset curation are |
| original to this repository. |
|
|