---
license: bsd-3-clause
pipeline_tag: text-classification
tags:
- text-classification
- code-classification
- ngram
- browser
- typescript
model-index:
- name: SNIP
  results:
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP validation split
      type: internal
    metrics:
    - type: accuracy
      value: 1.0
      name: Accuracy
    - type: macro_f1
      value: 1.0
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP test split
      type: internal
    metrics:
    - type: accuracy
      value: 0.9962
      name: Accuracy
    - type: macro_f1
      value: 0.9926
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP held-out evaluation suites
      type: internal
    metrics:
    - type: accuracy
      value: 0.9816
      name: Accuracy
    - type: macro_f1
      value: 0.9819
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP hard-neighbor cases
      type: internal
    metrics:
    - type: accuracy
      value: 0.9932
      name: Accuracy
    - type: macro_f1
      value: 0.6905
      name: Macro F1
---
# SNIP Model Card
Test the model at [snip.wesring.com](https://snip.wesring.com). Read the full report on Hugging Face or in the [GitHub repo](https://github.com/wesr/snip/blob/main/REPORT.md).
## Model
- Name: Small N-gram Identifier for Pastes (SNIP)
- Package version: `1.0.0`
- Model version: `snip-109`
- Model file: `model/snip_model.json`
- Runtime source: `src/snip.ts`
- Published runtime: `dist/snip.js`
## Intended Use
SNIP predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. It prioritizes low latency and a small model file while staying reasonably accurate.
## Labels
`bash`, `c`, `cpp`, `csharp`, `css`, `csv`, `diff`, `dockerfile`, `go`, `html`, `ini`, `java`, `javascript`, `json`, `log`, `lua`, `markdown`, `php`, `plain_text`, `powershell`, `python`, `ruby`, `rust`, `sql`, `toml`, `typescript`, `xml`, `yaml`.
## Architecture
- Multiclass linear classifier
- Hashed character n-grams, length 1-5
- 32,768 hash buckets
- L2-normalized `log1p(count)` features
- 1,500 retained weights per label
- 4-decimal serialized weights
- TypeScript source with no runtime dependencies
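The pipeline listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/snip.ts`: the hash function (an FNV-1a-style hash over UTF-16 code units), the function names `hashNgram`, `featurize`, and `classify`, the sparse `Map` weight layout, and the absence of a bias term are all assumptions made for the example.

```typescript
// Hypothetical sketch of hashed character n-gram features plus linear scoring.
// Only the constants below come from the model card; everything else is assumed.
const NUM_BUCKETS = 32768; // 32,768 hash buckets
const MIN_N = 1;           // n-gram lengths 1 through 5
const MAX_N = 5;

// FNV-1a-style hash over UTF-16 code units (an assumed choice of hash).
function hashNgram(text: string, start: number, len: number): number {
  let h = 0x811c9dc5;
  for (let i = start; i < start + len; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % NUM_BUCKETS;
}

// Count hashed n-grams, apply log1p to each count, then L2-normalize.
function featurize(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = hashNgram(text, i, n);
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
    }
  }
  let sumSquares = 0;
  counts.forEach((count) => {
    const v = Math.log1p(count);
    sumSquares += v * v;
  });
  const norm = Math.sqrt(sumSquares);
  const features = new Map<number, number>();
  if (norm === 0) return features; // empty input yields no features
  counts.forEach((count, bucket) => {
    features.set(bucket, Math.log1p(count) / norm);
  });
  return features;
}

// Score each label as a sparse dot product and return the best label.
// `weights` maps label -> (bucket -> weight); buckets without a retained weight count as 0.
function classify(
  text: string,
  weights: Map<string, Map<number, number>>
): string | null {
  const features = featurize(text);
  let best: string | null = null;
  let bestScore = -Infinity;
  weights.forEach((labelWeights, label) => {
    let score = 0;
    features.forEach((value, bucket) => {
      score += value * (labelWeights.get(bucket) ?? 0);
    });
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  });
  return best;
}
```

Keeping only a fixed number of weights per label makes the per-label weight vector sparse, which is why the sketch stores weights in a map keyed by bucket rather than in a dense 32,768-element array.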
## Size
- Raw model JSON: 626,596 bytes
- Gzip model JSON: 203,820 bytes
## Performance
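Measured in Google Chrome 149.0.7827.116 on macOS. Inputs of 100 KB and larger are sampled down to 12,292 characters, so latency stays flat beyond that size: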
| Input size | Sampled chars | P50 ms | P95 ms |
| --- | ---: | ---: | ---: |
| 1 KB | 1,024 | 1.490 | 1.548 |
| 16 KB | 16,384 | 6.580 | 6.730 |
| 100 KB | 12,292 | 5.180 | 5.310 |
| 1 MB | 12,292 | 5.170 | 5.310 |
| 5 MB | 12,292 | 5.210 | 5.380 |
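Latency percentiles like the P50 and P95 columns above can be collected with a small timing harness. The sketch below is an assumption about how such numbers might be gathered, not the benchmark used for this table: `percentile`, `timeRuns`, and `classifyStub` are hypothetical names, the percentile uses the nearest-rank method, and `classifyStub` stands in for a real classifier call.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Time `runs` calls of `fn` and return the per-call durations in milliseconds.
// The clock is injectable so the harness can be tested deterministically.
function timeRuns(
  fn: () => void,
  runs: number,
  now: () => number = () => performance.now()
): number[] {
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    durations.push(now() - start);
  }
  return durations;
}

// Hypothetical stand-in for a classifier call; replace with the real entry point.
function classifyStub(text: string): number {
  let checksum = 0;
  for (let i = 0; i < text.length; i++) checksum = (checksum + text.charCodeAt(i)) | 0;
  return checksum;
}

const sampleText = "x".repeat(1024); // 1 KB input
const timings = timeRuns(() => { classifyStub(sampleText); }, 200);
console.log("P50 ms:", percentile(timings, 50), "P95 ms:", percentile(timings, 95));
```

Browser timers are often coarsened for security reasons, so many runs per input size, plus a few untimed warm-up calls, tend to give more stable percentiles.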
## Metrics
| Evaluation Set | Examples | Accuracy | Macro F1 |
| --- | ---: | ---: | ---: |
| Validation | 487 | 1.0000 | 1.0000 |
| Test | 532 | 0.9962 | 0.9926 |
| Hard cases | 148 | 0.9932 | 0.6905 |
| Development holdouts | 207 | 0.9903 | - |
| Regression holdouts | 196 | 0.9796 | - |
| Final mixed holdout | 56 | 0.9821 | 0.9810 |
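Accuracy and macro F1 can diverge sharply on small or imbalanced sets, because macro F1 gives every label equal weight regardless of how many examples it has. The toy example below uses made-up data, not SNIP's evaluation data, and assumes macro F1 is averaged over every label that appears in either the truth or the predictions. The card does not state how its own evaluation computes the metric.

```typescript
// Fraction of predictions that match the true label.
function accuracy(truth: string[], pred: string[]): number {
  let correct = 0;
  for (let i = 0; i < truth.length; i++) {
    if (truth[i] === pred[i]) correct++;
  }
  return correct / truth.length;
}

// Unweighted mean of per-label F1 over labels seen in truth or predictions.
function macroF1(truth: string[], pred: string[]): number {
  const labels: string[] = [];
  for (const label of truth.concat(pred)) {
    if (labels.indexOf(label) === -1) labels.push(label);
  }
  let total = 0;
  for (const label of labels) {
    let tp = 0;
    let fp = 0;
    let fn = 0;
    for (let i = 0; i < truth.length; i++) {
      if (pred[i] === label && truth[i] === label) tp++;
      else if (pred[i] === label) fp++;
      else if (truth[i] === label) fn++;
    }
    const denom = 2 * tp + fp + fn;
    total += denom === 0 ? 0 : (2 * tp) / denom; // F1 = 2TP / (2TP + FP + FN)
  }
  return total / labels.length;
}

// Toy data: nine "json" examples and one "log" example that is misread as "json".
const toyTruth: string[] = new Array<string>(9).fill("json").concat(["log"]);
const toyPred: string[] = new Array<string>(10).fill("json");

console.log(accuracy(toyTruth, toyPred)); // 0.9
console.log(macroF1(toyTruth, toyPred));  // about 0.474: F1 is 18/19 for "json" and 0 for "log"
```

A single miss on a rare label pulls that label's F1 to zero and halves the macro average here, even though 9 of 10 predictions are correct.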
## Limitations
- Optimized for text, not binary file identification.
- Very short snippets may lack enough evidence for a specific syntax label.
- TypeScript and JavaScript can be close when a snippet has no type syntax.
- JSONL application logs can be close to JSON.
- Markdown/plain-text separation can be weak on very short prose-like Markdown.
## Training Notes
The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.