license: bsd-3-clause
pipeline_tag: text-classification
tags:
- text-classification
- code-classification
- ngram
- browser
- typescript
model-index:
- name: SNIP
  results:
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP validation split
      type: internal
    metrics:
    - type: accuracy
      value: 1
      name: Accuracy
    - type: macro_f1
      value: 1
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP test split
      type: internal
    metrics:
    - type: accuracy
      value: 0.9962
      name: Accuracy
    - type: macro_f1
      value: 0.9926
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP held-out evaluation suites
      type: internal
    metrics:
    - type: accuracy
      value: 0.9816
      name: Accuracy
    - type: macro_f1
      value: 0.9819
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP hard-neighbor cases
      type: internal
    metrics:
    - type: accuracy
      value: 0.9932
      name: Accuracy
    - type: macro_f1
      value: 0.6905
      name: Macro F1
SNIP Model Card
Test the model at snip.wesring.com. Read the full report on Hugging Face or in the GitHub repository.
Model
- Name: Small N-gram Identifier for Pastes (SNIP)
- Package version: `1.0.0`
- Model version: `snip-109`
- Model file: `model/snip_model.json`
- Runtime source: `src/snip.ts`
- Published runtime: `dist/snip.js`
Intended Use
SNIP predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. It aims to be fast, small, and accurate; the Size, Performance, and Metrics sections below give the measured figures.
Labels
bash, c, cpp, csharp, css, csv, diff, dockerfile, go, html, ini, java, javascript, json, log, lua, markdown, php, plain_text, powershell, python, ruby, rust, sql, toml, typescript, xml, yaml.
Architecture
- Multiclass linear classifier
- Hashed character n-grams, length 1-5
- 32,768 hash buckets
- L2-normalized `log1p(count)` features
- 1,500 retained weights per label
- 4-decimal serialized weights
- TypeScript source with no runtime dependencies
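The feature pipeline listed above can be sketched roughly as follows. This is an illustrative sketch, not the code in `src/snip.ts`: the function names (`hashNgram`, `featurize`, `score`), the FNV-1a hash, and the use of UTF-16 code units as "characters" are all assumptions made for the example. Only the n-gram range, the bucket count, and the `log1p` plus L2 normalization steps come from this card.

```typescript
// Hypothetical sketch; names and hash function are assumptions, not SNIP's code.
const NUM_BUCKETS = 32768; // 32,768 hash buckets, as listed in this card
const MIN_N = 1;           // shortest character n-gram
const MAX_N = 5;           // longest character n-gram

// FNV-1a over UTF-16 code units, reduced to a bucket index.
// The card does not say which hash SNIP uses; this one is a stand-in.
function hashNgram(text: string, start: number, n: number): number {
  let h = 0x811c9dc5;
  for (let i = start; i < start + n; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % NUM_BUCKETS;
}

// Count hashed n-grams, apply log1p to each count, then L2-normalize.
function featurize(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = hashNgram(text, i, n);
      counts.set(bucket, (counts.get(bucket) || 0) + 1);
    }
  }
  let sumSq = 0;
  counts.forEach((count) => {
    const v = Math.log1p(count);
    sumSq += v * v;
  });
  const norm = Math.sqrt(sumSq);
  const features = new Map<number, number>();
  counts.forEach((count, bucket) => {
    features.set(bucket, Math.log1p(count) / norm);
  });
  return features;
}

// Sparse dot product between the features and one label's retained weights.
function score(
  features: Map<number, number>,
  weights: Map<number, number>,
  bias: number = 0
): number {
  let total = bias;
  features.forEach((value, bucket) => {
    const w = weights.get(bucket);
    if (w !== undefined) total += w * value;
  });
  return total;
}
```

In a multiclass linear classifier of this kind, the predicted label is the one whose weight vector gives the highest score. Because each label keeps only a subset of buckets, a sparse map lookup like the one in `score` skips every bucket that has no retained weight.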
Size
- Raw model JSON: 626,596 bytes
- Gzip model JSON: 203,820 bytes
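As a rough check on these figures: 28 labels with 1,500 retained weights each gives at most 42,000 stored weights, so the raw file averages about 15 bytes per weight (626,596 / 42,000 ≈ 14.9), assuming weights dominate the file. Gzip brings it to about a third of the raw size. The pruning and rounding steps listed under Architecture might look like the sketch below. The helper name `pruneAndRound`, the selection by absolute magnitude, and the output shape are assumptions; the card states only the counts and the number of decimals.

```typescript
// Hypothetical helper, not SNIP's training code.
// Keeps the k largest-magnitude weights from one label's dense weight vector
// and rounds each kept weight to a fixed number of decimals.
function pruneAndRound(
  dense: number[],
  k: number,
  decimals: number
): Record<string, number> {
  // Bucket indices ordered by descending absolute weight (an assumed criterion).
  const order = dense
    .map((_, i) => i)
    .sort((a, b) => Math.abs(dense[b]) - Math.abs(dense[a]));
  const scale = Math.pow(10, decimals);
  const kept: Record<string, number> = {};
  for (const i of order.slice(0, k)) {
    const rounded = Math.round(dense[i] * scale) / scale;
    // Weights that round to zero carry no information, so they are dropped.
    if (rounded !== 0) kept[String(i)] = rounded;
  }
  return kept;
}

// Example: keep 3 of 4 weights at 4 decimals.
const example = pruneAndRound([0.123456, -2.5, 0.00001, 1], 3, 4);
console.log(JSON.stringify(example));
```

Shorter decimal strings and fewer keys both reduce the raw JSON size directly, and repeated key and number patterns are what gzip compresses well.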
Performance
Measured in Google Chrome 149.0.7827.116 on macOS:
| Input size | Sampled chars | P50 ms | P95 ms |
|---|---|---|---|
| 1 KB | 1,024 | 1.490 | 1.548 |
| 16 KB | 16,384 | 6.580 | 6.730 |
| 100 KB | 12,292 | 5.180 | 5.310 |
| 1 MB | 12,292 | 5.170 | 5.310 |
| 5 MB | 12,292 | 5.210 | 5.380 |
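Latency is flat from 100 KB upward because the number of sampled characters stays at 12,292 regardless of input size. For readers who want to reproduce P50 and P95 figures of this kind, the sketch below shows one way to collect them. The `percentile` and `benchmark` helpers are hypothetical, the nearest-rank percentile definition is an assumption, and the card does not state how many runs or warmup iterations were used.

```typescript
// Hypothetical benchmarking helpers, not part of the SNIP package.

// Nearest-rank percentile of a list of samples (p in 0..100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Times fn over several runs after a warmup phase.
// The clock is injectable; in a browser, pass () => performance.now()
// to get sub-millisecond resolution.
function benchmark(
  fn: () => void,
  runs: number,
  warmup: number,
  now: () => number = () => Date.now()
): { p50: number; p95: number } {
  for (let i = 0; i < warmup; i++) fn();
  const times: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    times.push(now() - start);
  }
  return { p50: percentile(times, 50), p95: percentile(times, 95) };
}
```

A warmup phase matters in a browser because the JavaScript engine may still be compiling the hot path during the first few calls, which would inflate the P95 figure.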
Metrics
| Evaluation Set | Examples | Accuracy | Macro F1 |
|---|---|---|---|
| Validation | 487 | 1.0000 | 1.0000 |
| Test | 532 | 0.9962 | 0.9926 |
| Hard cases | 148 | 0.9932 | 0.6905 |
| Development holdouts | 207 | 0.9903 | - |
| Regression holdouts | 196 | 0.9796 | - |
| Final mixed holdout | 56 | 0.9821 | 0.9810 |
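The hard-cases row pairs a high accuracy with a much lower macro F1. An accuracy of 0.9932 on 148 examples is consistent with a single error (147 / 148), and macro F1 averages per-label F1 without weighting by label frequency, so one miss on a rare label can pull the average down sharply. The sketch below illustrates the effect. It assumes macro F1 is averaged over every label that appears in either the true or the predicted labels; the card does not state which convention was used, and `macroF1` is a hypothetical helper.

```typescript
// Hypothetical metric helper, not SNIP's evaluation code.
interface LabelCounts {
  tp: number; // true positives
  fp: number; // false positives
  fn: number; // false negatives
}

// Macro F1 over all labels seen in either truth or predictions (assumed convention).
function macroF1(truth: string[], predicted: string[]): number {
  if (truth.length !== predicted.length) throw new Error("length mismatch");
  const counts: Record<string, LabelCounts> = {};
  const get = (label: string): LabelCounts =>
    counts[label] || (counts[label] = { tp: 0, fp: 0, fn: 0 });
  for (let i = 0; i < truth.length; i++) {
    if (truth[i] === predicted[i]) {
      get(truth[i]).tp++;
    } else {
      get(truth[i]).fn++;
      get(predicted[i]).fp++;
    }
  }
  const labels = Object.keys(counts);
  if (labels.length === 0) return 0;
  let sum = 0;
  for (const label of labels) {
    const c = counts[label];
    const denom = 2 * c.tp + c.fp + c.fn;
    sum += denom === 0 ? 0 : (2 * c.tp) / denom;
  }
  return sum / labels.length;
}

// Nine correct "json" predictions plus one "log" example misread as "json":
// accuracy is 0.9, but "log" scores F1 = 0, so macro F1 drops below 0.5.
const truthLabels = ["json", "json", "json", "json", "json", "json", "json", "json", "json", "log"];
const predictedLabels = truthLabels.map(() => "json");
console.log(macroF1(truthLabels, predictedLabels).toFixed(4));
```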
Limitations
- Optimized for text, not binary file identification.
- Very short snippets may lack enough evidence for a specific syntax label.
- TypeScript and JavaScript can be close when a snippet has no type syntax.
- JSONL application logs can be close to JSON.
- Markdown/plain-text separation can be weak on very short prose-like Markdown.
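An application can soften these close calls by checking how far the top label leads the runner-up and falling back to a generic label when the lead is small. The sketch below assumes the application has access to per-label scores, which this card does not confirm; `pickLabel`, the `minMargin` threshold, and the fallback choice are all hypothetical.

```typescript
// Hypothetical application-side helper, not part of the SNIP API.
// Returns the top-scoring label unless its lead over the runner-up
// is smaller than minMargin, in which case it returns the fallback.
function pickLabel(
  scores: Record<string, number>,
  minMargin: number,
  fallback: string
): string {
  const ranked = Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
  if (ranked.length === 0) return fallback;
  if (ranked.length === 1) return ranked[0];
  const margin = scores[ranked[0]] - scores[ranked[1]];
  return margin >= minMargin ? ranked[0] : fallback;
}

// A TypeScript/JavaScript near-tie falls back; a clear winner does not.
console.log(pickLabel({ typescript: 1.2, javascript: 1.15, json: -0.3 }, 0.1, "plain_text"));
console.log(pickLabel({ python: 2.0, ruby: 0.5 }, 0.1, "plain_text"));
```

A suitable threshold depends on the score scale of the model, so it would need to be tuned on held-out examples rather than copied from this example.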
Training Notes
The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.