---
license: bsd-3-clause
pipeline_tag: text-classification
tags:
- text-classification
- code-classification
- ngram
- browser
- typescript
model-index:
- name: SNIP
  results:
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP validation split
      type: internal
    metrics:
    - type: accuracy
      value: 1.0
      name: Accuracy
    - type: macro_f1
      value: 1.0
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP test split
      type: internal
    metrics:
    - type: accuracy
      value: 0.9962
      name: Accuracy
    - type: macro_f1
      value: 0.9926
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP held-out evaluation suites
      type: internal
    metrics:
    - type: accuracy
      value: 0.9816
      name: Accuracy
    - type: macro_f1
      value: 0.9819
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP hard-neighbor cases
      type: internal
    metrics:
    - type: accuracy
      value: 0.9932
      name: Accuracy
    - type: macro_f1
      value: 0.6905
      name: Macro F1
---

# SNIP Model Card

Test the model at [snip.wesring.com](https://snip.wesring.com). Read the full report on Hugging Face or in the [GitHub repo](https://github.com/wesr/snip/blob/main/REPORT.md).

## Model

- Name: Small N-gram Identifier for Pastes (SNIP)
- Package version: `1.0.0`
- Model version: `snip-109`
- Model file: `model/snip_model.json`
- Runtime source: `src/snip.ts`
- Published runtime: `dist/snip.js`

## Intended Use

SNIP predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. It aims to be fast, small, and reasonably accurate.

## Labels

`bash`, `c`, `cpp`, `csharp`, `css`, `csv`, `diff`, `dockerfile`, `go`, `html`, `ini`, `java`, `javascript`, `json`, `log`, `lua`, `markdown`, `php`, `plain_text`, `powershell`, `python`, `ruby`, `rust`, `sql`, `toml`, `typescript`, `xml`, `yaml`.

## Architecture

- Multiclass linear classifier
- Hashed character n-grams, length 1-5
- 32,768 hash buckets
- L2-normalized `log1p(count)` features
- 1,500 retained weights per label
- 4-decimal serialized weights
- TypeScript source with no runtime dependencies

## Size

- Raw model JSON: 626,596 bytes
- Gzip model JSON: 203,820 bytes

## Performance

Measured in Google Chrome 149.0.7827.116 on macOS:

| Input size | Sampled chars | P50 (ms) | P95 (ms) |
| --- | ---: | ---: | ---: |
| 1 KB | 1,024 | 1.490 | 1.548 |
| 16 KB | 16,384 | 6.580 | 6.730 |
| 100 KB | 12,292 | 5.180 | 5.310 |
| 1 MB | 12,292 | 5.170 | 5.310 |
| 5 MB | 12,292 | 5.210 | 5.380 |

## Metrics

| Evaluation Set | Examples | Accuracy | Macro F1 |
| --- | ---: | ---: | ---: |
| Validation | 487 | 1.0000 | 1.0000 |
| Test | 532 | 0.9962 | 0.9926 |
| Hard cases | 148 | 0.9932 | 0.6905 |
| Development holdouts | 207 | 0.9903 | - |
| Regression holdouts | 196 | 0.9796 | - |
| Final mixed holdout | 56 | 0.9821 | 0.9810 |

## Limitations

- Optimized for text, not binary file identification.
- Very short snippets may lack enough evidence for a specific syntax label.
- TypeScript and JavaScript can be close when a snippet has no type syntax.
- JSONL application logs can be close to JSON.
- Markdown/plain-text separation can be weak on very short prose-like Markdown.

## Training Notes

The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.
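
## Illustrative Sketch

The feature pipeline listed under Architecture (hashed character n-grams of length 1-5, 32,768 buckets, L2-normalized `log1p(count)` values, linear scoring) can be sketched roughly as follows. This is not the code in `src/snip.ts`. The hash function (an FNV-1a-style hash over UTF-16 code units), the `SparseWeights` layout, the absence of a bias term, and the names `featurize` and `predict` are all assumptions made for illustration.

```typescript
// Illustrative sketch only; not the implementation in src/snip.ts.
const NUM_BUCKETS = 32768; // bucket count from the model card
const MIN_N = 1; // shortest n-gram length from the model card
const MAX_N = 5; // longest n-gram length from the model card

// ASSUMPTION: FNV-1a-style 32-bit hash over UTF-16 code units.
// The model card does not say which hash SNIP uses.
function fnv1a(text: string, start: number, end: number): number {
  let hash = 0x811c9dc5;
  for (let i = start; i < end; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Count hashed character n-grams, apply log1p, then L2-normalize.
function featurize(text: string): Float64Array {
  const features = new Float64Array(NUM_BUCKETS);
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      features[fnv1a(text, i, i + n) % NUM_BUCKETS] += 1;
    }
  }
  let sumSquares = 0;
  for (let b = 0; b < NUM_BUCKETS; b++) {
    features[b] = Math.log1p(features[b]);
    sumSquares += features[b] * features[b];
  }
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let b = 0; b < NUM_BUCKETS; b++) {
      features[b] /= norm;
    }
  }
  return features;
}

// HYPOTHETICAL weight layout: label -> list of [bucket, weight] pairs.
// The real schema of model/snip_model.json may differ.
type SparseWeights = Record<string, Array<[number, number]>>;

// Linear scoring: dot product of the retained weights with the features,
// returning the highest-scoring label (no bias term assumed).
function predict(features: Float64Array, weights: SparseWeights): string {
  let bestLabel = "";
  let bestScore = -Infinity;
  for (const label of Object.keys(weights)) {
    let score = 0;
    for (const [bucket, weight] of weights[label]) {
      score += weight * features[bucket];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLabel = label;
    }
  }
  return bestLabel;
}
```

With real weights, a call such as `predict(featurize(pastedText), weights)` would return one of the labels listed above. Because only a fixed number of weights is kept per label, the scoring cost in this sketch depends on the number of labels and retained weights rather than on the full bucket count.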