---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- code
- programming-language-identification
- language-detection
- modernbert
base_model: answerdotai/ModernBERT-base
datasets:
- cakiki/rosetta-code
- bigcode/the-stack
metrics:
- accuracy
- f1
---

# Programming Language Identification (100+ languages)

A ModernBERT classifier that identifies the programming language of a code
snippet across **107 languages**.

## Inference

### PyTorch

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
).eval()

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```
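The argmax alone hides how confident the model is. Applying a softmax to the logits gives a per-language score you can rank. A minimal pure-Python sketch of that post-processing, using a hypothetical 4-label slice (the real head has 107 outputs; with the model above, pass `model(**inputs).logits[0].tolist()` and `model.config.id2label`):

```python
import math

def top_k(logits, id2label, k=3):
    """Softmax the raw logits and return the k most likely labels with scores."""
    m = max(logits)                                 # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Made-up logits and labels for illustration only.
id2label = {0: "Python", 1: "Ruby", 2: "Perl", 3: "Lua"}
print(top_k([4.1, 1.2, 0.3, -0.5], id2label))
```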

### Batch

```python
snippets = [
    "def f(x):\n    return x + 1",              # Python
    'fn main() { println!("hi"); }',            # Rust
    "package main\n\nfunc main() {}",           # Go
]
inputs = tokenizer(
    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
for snippet, pred in zip(snippets, logits.argmax(-1).tolist()):
    print(snippet.splitlines()[0][:40], "→", model.config.id2label[pred])
```

### ONNX Runtime

An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
pulling PyTorch — handy for non-Python consumers and edge deployments.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, subfolder="onnx"
)

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="np", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```

**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)** — download and run in Colab or Jupyter.

## Evaluation

Held-out validation split (9,495 rows, 107 labels):

| metric | value |
|---|---|
| macro F1 | **0.9206** |
| accuracy | 0.9306 |


The model wins on every shared label against the baseline it was compared to.
Largest per-label F1 gains: ARM Assembly +0.354, Erlang +0.270, COBOL +0.216,
Pascal +0.206, Fortran +0.193, Mathematica/Wolfram +0.173.
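For reference, the two metrics in the table can be computed from per-example predictions as follows. A small self-contained sketch in plain Python with toy labels (in practice you would use `sklearn.metrics`; macro F1 is the unweighted mean of per-class F1 scores, which is what makes rare languages count as much as common ones):

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: one Go snippet misclassified as Rust.
y_true = ["Python", "Rust", "Go", "Go"]
y_pred = ["Python", "Rust", "Go", "Rust"]
print(accuracy(y_true, y_pred))   # 0.75
print(macro_f1(y_true, y_pred))
```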

## Supported languages (107)

ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
Wren, Zig, jq

## Training data

91,209 code samples across 107 languages, drawn from Rosetta Code
(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.

Splits are grouped by task to prevent task-level leakage:
72,549 / 9,495 / 8,880 rows (train / val / test).
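Task-grouped splitting means every sample of a given task lands in exactly one split, so the model never sees val/test tasks during training. A minimal illustration of the idea, not the exact procedure used here, with a hypothetical `task_id` field and a hash-based assignment:

```python
import hashlib

def split_for(task_id: str, val_frac=0.10, test_frac=0.10) -> str:
    """Deterministically map a group (task) id to train/val/test.

    Hashing the task id rather than the row guarantees that all samples
    of a task share one split, so task-level leakage is impossible.
    """
    h = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 10_000
    if h < val_frac * 10_000:
        return "val"
    if h < (val_frac + test_frac) * 10_000:
        return "test"
    return "train"

rows = [
    {"task_id": "fizzbuzz", "lang": "Python"},
    {"task_id": "fizzbuzz", "lang": "Rust"},   # same task -> same split
    {"task_id": "quine", "lang": "Go"},
]
print({r["task_id"]: split_for(r["task_id"]) for r in rows})
```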

## Limitations

- Only the first **512 tokens** of each input are used; longer files are
  truncated before classification.
- The classifier is purely content-based. If you have file extensions, treat
  them as a strong prior in a production pipeline.
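One way to apply such an extension prior: upweight the model's probabilities for languages consistent with the file extension before taking the argmax. A toy sketch with made-up probabilities and a deliberately tiny extension map (both hypothetical; a production map would cover all 107 labels):

```python
# Hypothetical extension -> candidate-language map.
EXT_PRIOR = {
    ".py": {"Python"},
    ".rs": {"Rust"},
    ".pl": {"Perl", "Prolog"},   # ambiguous extensions keep several candidates
}

def rerank(probs: dict, ext: str, boost: float = 5.0) -> str:
    """Multiply the probability of extension-consistent languages by `boost`,
    then pick the highest-scoring label. Unknown extensions change nothing."""
    candidates = EXT_PRIOR.get(ext, set())
    scored = {
        lang: p * (boost if lang in candidates else 1.0)
        for lang, p in probs.items()
    }
    return max(scored, key=scored.get)

# A borderline Python/Ruby call is flipped by a ".py" extension:
probs = {"Python": 0.40, "Ruby": 0.45, "Perl": 0.10, "Prolog": 0.05}
print(rerank(probs, ".py"))   # -> "Python" (0.40 * 5 outscores Ruby's 0.45)
```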