license: bsd-3-clause
pipeline_tag: text-classification
tags:
  - text-classification
  - code-classification
  - ngram
  - browser
  - typescript
model-index:
  - name: SNIP
    results:
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP validation split
          type: internal
        metrics:
          - type: accuracy
            value: 1
            name: Accuracy
          - type: macro_f1
            value: 1
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP test split
          type: internal
        metrics:
          - type: accuracy
            value: 0.9962
            name: Accuracy
          - type: macro_f1
            value: 0.9926
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP held-out evaluation suites
          type: internal
        metrics:
          - type: accuracy
            value: 0.9816
            name: Accuracy
          - type: macro_f1
            value: 0.9819
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP hard-neighbor cases
          type: internal
        metrics:
          - type: accuracy
            value: 0.9932
            name: Accuracy
          - type: macro_f1
            value: 0.6905
            name: Macro F1

SNIP Model Card

Test the model at snip.wesring.com. Read the full report on Hugging Face or in the GitHub repo.

Model

Name: Small N-gram Identifier for Pastes (SNIP)
Package version: 1.0.0
Model version: snip-109
Model file: model/snip_model.json
Runtime source: src/snip.ts
Published runtime: dist/snip.js

Intended Use

SNIP predicts the most likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. The design goals are low latency, a small download, and good accuracy; measured figures appear under Size, Performance, and Metrics.

Labels

bash, c, cpp, csharp, css, csv, diff, dockerfile, go, html, ini, java, javascript, json, log, lua, markdown, php, plain_text, powershell, python, ruby, rust, sql, toml, typescript, xml, yaml.

Architecture

Multiclass linear classifier
Hashed character n-grams, length 1-5
32,768 hash buckets
L2-normalized log1p(count) features
1,500 retained weights per label
4-decimal serialized weights
TypeScript source with no runtime dependencies
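
The list above can be read as a small pipeline: count hashed character n-grams, dampen the counts with log1p, L2-normalize, then pick the highest-scoring label under a sparse linear model. The following is a minimal sketch of that pipeline, not the code in src/snip.ts. The FNV-1a hash, the per-label bias term, and the names hashNgram, featurize, predict, and LabelWeights are assumptions made for illustration.

```typescript
// Sketch only: hash choice, bias term, and all names are assumptions,
// not SNIP's actual implementation.
const NUM_BUCKETS = 32768; // from the model card
const MIN_N = 1;
const MAX_N = 5;

// FNV-1a over UTF-16 code units, reduced to a bucket index (assumed hash).
function hashNgram(text: string, start: number, length: number): number {
  let h = 0x811c9dc5;
  for (let i = start; i < start + length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % NUM_BUCKETS;
}

// Sparse features: bucket -> L2-normalized log1p(count).
function featurize(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = hashNgram(text, i, n);
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
    }
  }
  const features = new Map<number, number>();
  let sumSquares = 0;
  for (const [bucket, count] of counts) {
    const value = Math.log1p(count);
    features.set(bucket, value);
    sumSquares += value * value;
  }
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (const [bucket, value] of features) {
      features.set(bucket, value / norm);
    }
  }
  return features;
}

// One label's retained weights: bucket -> weight, plus an assumed bias.
interface LabelWeights {
  bias: number;
  weights: Map<number, number>;
}

// Return the label with the highest linear score (first label wins ties).
function predict(text: string, model: Map<string, LabelWeights>): string {
  const features = featurize(text);
  let bestLabel = "";
  let bestScore = -Infinity;
  for (const [label, { bias, weights }] of model) {
    let score = bias;
    for (const [bucket, value] of features) {
      score += (weights.get(bucket) ?? 0) * value;
    }
    if (score > bestScore) {
      bestScore = score;
      bestLabel = label;
    }
  }
  return bestLabel;
}
```

In this sketch, only buckets that appear in the input are looked up, so scoring cost grows with the number of distinct n-grams in the text rather than with the full 32,768-bucket space.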

Size

Raw model JSON: 626,596 bytes
Gzip model JSON: 203,820 bytes
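
As a rough check on these numbers, 28 labels at 1,500 retained weights each gives at most 42,000 weights, so the raw JSON averages about 15 bytes per retained weight (626,596 / 42,000 ≈ 14.9), and gzip reduces the file to roughly a third of its raw size. The sketch below shows one way the pruning and 4-decimal rounding listed under Architecture could be applied to a single label's dense weight vector. The selection rule (largest absolute value) and the helper name pruneAndRound are assumptions; the card does not say how SNIP chooses which weights to keep.

```typescript
// Hypothetical helper: keep the `keep` largest-magnitude weights for one
// label and round each to 4 decimals. Returns [bucket, weight] pairs
// sorted by bucket index.
function pruneAndRound(
  weights: number[],
  keep: number,
): Array<[number, number]> {
  const nonZero: Array<[number, number]> = [];
  weights.forEach((w, bucket) => {
    if (w !== 0) nonZero.push([bucket, w]);
  });
  // Largest absolute value first (assumed selection rule).
  nonZero.sort((a, b) => Math.abs(b[1]) - Math.abs(a[1]));
  return nonZero
    .slice(0, keep)
    .map(([bucket, w]): [number, number] => [
      bucket,
      Math.round(w * 1e4) / 1e4,
    ])
    .sort((a, b) => a[0] - b[0]);
}
```

Storing only the retained pairs, rather than a dense 32,768-entry vector per label, is what keeps a model of this shape small enough to ship to a browser.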

Performance

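Measured in Google Chrome 149.0.7827.116 on macOS. The 100 KB, 1 MB, and 5 MB inputs are each sampled down to 12,292 characters, which is why their latencies are nearly identical.

Input size	Sampled chars	P50 (ms)	P95 (ms)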
1 KB	1,024	1.490	1.548
16 KB	16,384	6.580	6.730
100 KB	12,292	5.180	5.310
1 MB	12,292	5.170	5.310
5 MB	12,292	5.210	5.380
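
The card does not describe how these timings were collected. The following is a minimal sketch of one way to produce P50 and P95 figures like those in the table, assuming a nearest-rank percentile and a caller-supplied classifier and clock. The names percentile, benchmark, classify, and now are hypothetical.

```typescript
// Nearest-rank percentile of a list of timings, with p in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Time `classify` on one input. Both `classify` and `now` are supplied by
// the caller, so this sketch does not assume any particular SNIP API.
function benchmark(
  classify: (text: string) => unknown,
  text: string,
  runs: number,
  now: () => number,
  warmup = 10,
): { p50: number; p95: number } {
  // Warm-up runs are discarded so JIT compilation does not skew results.
  for (let i = 0; i < warmup; i++) classify(text);
  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    classify(text);
    timings.push(now() - start);
  }
  return { p50: percentile(timings, 50), p95: percentile(timings, 95) };
}
```

In a browser, now would typically be () => performance.now(), and classify would wrap whatever entry point the published runtime exposes.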

Metrics

Evaluation Set	Examples	Accuracy	Macro F1
Validation	487	1.0000	1.0000
Test	532	0.9962	0.9926
Hard cases	148	0.9932	0.6905
Development holdouts	207	0.9903	-
Regression holdouts	196	0.9796	-
Final mixed holdout	56	0.9821	0.9810
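
Accuracy and macro F1 can diverge sharply on a small, skewed set such as the hard cases, because macro F1 gives every label equal weight regardless of how many examples it has. The sketch below computes both metrics, averaging F1 over every label that appears in either the truth or the predictions. That averaging rule is one common convention and an assumption here; the card does not state which convention SNIP's evaluation uses. The function names are illustrative, and the sketch says nothing about which examples SNIP misclassified.

```typescript
// Fraction of predictions that match the true label.
function accuracy(truth: string[], predicted: string[]): number {
  if (truth.length === 0) return 0;
  let correct = 0;
  truth.forEach((t, i) => {
    if (predicted[i] === t) correct++;
  });
  return correct / truth.length;
}

// Unweighted mean of per-label F1 over labels seen in truth or predictions
// (assumed convention). A label with no true positives scores 0.
function macroF1(truth: string[], predicted: string[]): number {
  const labels = new Set<string>([...truth, ...predicted]);
  if (labels.size === 0) return 0;
  let sum = 0;
  for (const label of labels) {
    let tp = 0;
    let fp = 0;
    let fn = 0;
    truth.forEach((t, i) => {
      const p = predicted[i];
      if (p === label && t === label) tp++;
      else if (p === label) fp++;
      else if (t === label) fn++;
    });
    const denom = 2 * tp + fp + fn;
    sum += denom === 0 ? 0 : (2 * tp) / denom;
  }
  return sum / labels.size;
}
```

For example, if 99 of 100 json examples are predicted correctly and one is predicted as log, accuracy is 0.99 while macro F1 under this convention is about 0.50, because log contributes an F1 of 0.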

Limitations

Optimized for text, not binary file identification.
Very short snippets may lack enough evidence for a specific syntax label.
TypeScript and JavaScript can be hard to distinguish when a snippet has no type syntax.
JSONL application logs can be hard to distinguish from JSON.
Markdown and plain text can be hard to separate on very short, prose-like Markdown.

Training Notes

The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.