---
license: bsd-3-clause
pipeline_tag: text-classification
tags:
- text-classification
- code-classification
- ngram
- browser
- typescript
model-index:
- name: SNIP
  results:
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP validation split
      type: internal
    metrics:
    - type: accuracy
      value: 1.0
      name: Accuracy
    - type: macro_f1
      value: 1.0
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP test split
      type: internal
    metrics:
    - type: accuracy
      value: 0.9962
      name: Accuracy
    - type: macro_f1
      value: 0.9926
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP held-out evaluation suites
      type: internal
    metrics:
    - type: accuracy
      value: 0.9816
      name: Accuracy
    - type: macro_f1
      value: 0.9819
      name: Macro F1
  - task:
      type: text-classification
      name: Text and code snippet classification
    dataset:
      name: SNIP hard-neighbor cases
      type: internal
    metrics:
    - type: accuracy
      value: 0.9932
      name: Accuracy
    - type: macro_f1
      value: 0.6905
      name: Macro F1
---

# SNIP Model Card

Test the model at [snip.wesring.com](https://snip.wesring.com). Read the full report on Hugging Face or in the [GitHub repo](https://github.com/wesr/snip/blob/main/REPORT.md).

## Model

- Name: Small N-gram Identifier for Pastes (SNIP)
- Package version: `1.0.0`
- Model version: `snip-109`
- Model file: `model/snip_model.json`
- Runtime source: `src/snip.ts`
- Published runtime: `dist/snip.js`

## Intended Use

SNIP predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. The design goals are low latency, a small model file, and good accuracy on common formats.

## Labels

`bash`, `c`, `cpp`, `csharp`, `css`, `csv`, `diff`, `dockerfile`, `go`, `html`, `ini`, `java`, `javascript`, `json`, `log`, `lua`, `markdown`, `php`, `plain_text`, `powershell`, `python`, `ruby`, `rust`, `sql`, `toml`, `typescript`, `xml`, `yaml`.

## Architecture

- Multiclass linear classifier
- Hashed character n-grams, length 1-5
- 32,768 hash buckets
- L2-normalized `log1p(count)` features
- 1,500 retained weights per label
- 4-decimal serialized weights
- TypeScript source with no runtime dependencies
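
The feature pipeline listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/snip.ts`: the FNV-1a hash, the function names `extractFeatures` and `scoreLabel`, and the sparse `Map` layout for weights are all assumptions made for the example. Only the n-gram lengths, the bucket count, and the `log1p` plus L2 normalization steps come from the list.

```typescript
// Values taken from the architecture list.
const NUM_BUCKETS = 32768;
const MIN_N = 1;
const MAX_N = 5;

// Assumed hash for illustration (32-bit FNV-1a over UTF-16 code units).
// The hash SNIP actually uses is not stated in this card.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical helper: hashed character n-gram counts -> log1p -> L2 normalize.
function extractFeatures(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = fnv1a(text.slice(i, i + n)) % NUM_BUCKETS;
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
    }
  }

  // Dampen raw counts with log1p and accumulate the squared norm.
  const damped = new Map<number, number>();
  let sumSquares = 0;
  counts.forEach((count, bucket) => {
    const value = Math.log1p(count);
    damped.set(bucket, value);
    sumSquares += value * value;
  });

  // Scale to unit L2 length (an empty input yields an empty map).
  const norm = Math.sqrt(sumSquares);
  const features = new Map<number, number>();
  damped.forEach((value, bucket) => {
    features.set(bucket, value / norm);
  });
  return features;
}

// Hypothetical helper: linear score for one label. Buckets missing from the
// sparse weight map count as zero, as they would after pruning.
function scoreLabel(
  features: Map<number, number>,
  weights: Map<number, number>,
  bias = 0,
): number {
  let score = bias;
  features.forEach((value, bucket) => {
    score += value * (weights.get(bucket) ?? 0);
  });
  return score;
}
```

Under these assumptions, the predicted label is simply the one whose `scoreLabel` result is highest across all labels.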

## Size

- Raw model JSON: 626,596 bytes
- Gzip model JSON: 203,820 bytes

## Performance

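Measured in Google Chrome 149.0.7827.116 on macOS. Inputs of 100 KB and larger show a fixed 12,292 sampled characters, which is consistent with latency staying flat at those sizes: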

| Input size | Sampled chars | P50 ms | P95 ms |
| --- | ---: | ---: | ---: |
| 1 KB | 1,024 | 1.490 | 1.548 |
| 16 KB | 16,384 | 6.580 | 6.730 |
| 100 KB | 12,292 | 5.180 | 5.310 |
| 1 MB | 12,292 | 5.170 | 5.310 |
| 5 MB | 12,292 | 5.210 | 5.380 |
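
For readers who want to collect comparable P50 and P95 numbers, here is a minimal timing sketch. It is not the harness used for the table: the nearest-rank percentile method, the iteration count, and the names `percentile` and `benchmark` are assumptions, and the `run` callback stands in for whatever classification call is being timed.

```typescript
// Nearest-rank percentile over an ascending-sorted array (assumed method).
function percentile(sortedValues: number[], p: number): number {
  if (sortedValues.length === 0) {
    throw new Error("no samples");
  }
  const rank = Math.ceil((p / 100) * sortedValues.length);
  const index = Math.min(sortedValues.length - 1, Math.max(0, rank - 1));
  return sortedValues[index];
}

// Hypothetical harness: time `run` repeatedly and report P50/P95 in milliseconds.
function benchmark(
  run: () => void,
  iterations = 200,
): { p50: number; p95: number } {
  const timings: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    run();
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  return { p50: percentile(timings, 50), p95: percentile(timings, 95) };
}
```

Browser timers may be coarsened for privacy reasons, so sub-millisecond results from a harness like this should be read with some caution.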

## Metrics

| Evaluation Set | Examples | Accuracy | Macro F1 |
| --- | ---: | ---: | ---: |
| Validation | 487 | 1.0000 | 1.0000 |
| Test | 532 | 0.9962 | 0.9926 |
| Hard cases | 148 | 0.9932 | 0.6905 |
| Development holdouts | 207 | 0.9903 | - |
| Regression holdouts | 196 | 0.9796 | - |
| Final mixed holdout | 56 | 0.9821 | 0.9810 |
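
Accuracy and macro F1 can diverge sharply, as in the hard cases row, because macro F1 is an unweighted mean of per-label F1 scores, so one label with a low F1 pulls the average down regardless of how few examples it has. The sketch below shows the two calculations side by side. The function name `accuracyAndMacroF1` is hypothetical, and averaging over every label seen in either the truth or the predictions is an assumed convention, since the card does not say how SNIP's evaluation handles it.

```typescript
// Hypothetical helper: accuracy and macro F1 over parallel label arrays.
function accuracyAndMacroF1(
  truth: string[],
  predicted: string[],
): { accuracy: number; macroF1: number } {
  if (truth.length === 0 || truth.length !== predicted.length) {
    throw new Error("inputs must be non-empty and the same length");
  }

  const tp = new Map<string, number>();
  const fp = new Map<string, number>();
  const fn = new Map<string, number>();
  const labels = new Set<string>();
  const bump = (m: Map<string, number>, key: string): void => {
    m.set(key, (m.get(key) ?? 0) + 1);
  };

  let correct = 0;
  for (let i = 0; i < truth.length; i++) {
    const t = truth[i];
    const p = predicted[i];
    labels.add(t);
    labels.add(p);
    if (t === p) {
      correct++;
      bump(tp, t);
    } else {
      bump(fp, p); // predicted label gains a false positive
      bump(fn, t); // true label gains a false negative
    }
  }

  // Assumed convention: average F1 over all labels seen in truth or predictions.
  let f1Sum = 0;
  labels.forEach((label) => {
    const t = tp.get(label) ?? 0;
    const denominator = 2 * t + (fp.get(label) ?? 0) + (fn.get(label) ?? 0);
    f1Sum += denominator === 0 ? 0 : (2 * t) / denominator;
  });

  return { accuracy: correct / truth.length, macroF1: f1Sum / labels.size };
}
```

For example, with truth `["json", "json", "json", "log"]` and every prediction equal to `"json"`, accuracy is 0.75 while macro F1 is about 0.43, because the `log` label contributes an F1 of zero.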

## Limitations

- Optimized for text, not binary file identification.
- Very short snippets may lack enough evidence for a specific syntax label.
- TypeScript and JavaScript are hard to tell apart when a snippet has no type syntax.
- JSONL application logs are hard to tell apart from JSON.
- Markdown/plain-text separation can be weak on very short prose-like Markdown.

## Training Notes

The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.