metadata
license: bsd-3-clause
pipeline_tag: text-classification
tags:
  - text-classification
  - code-classification
  - ngram
  - browser
  - typescript
model-index:
  - name: SNIP
    results:
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP validation split
          type: internal
        metrics:
          - type: accuracy
            value: 1
            name: Accuracy
          - type: macro_f1
            value: 1
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP test split
          type: internal
        metrics:
          - type: accuracy
            value: 0.9962
            name: Accuracy
          - type: macro_f1
            value: 0.9926
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP held-out evaluation suites
          type: internal
        metrics:
          - type: accuracy
            value: 0.9816
            name: Accuracy
          - type: macro_f1
            value: 0.9819
            name: Macro F1
      - task:
          type: text-classification
          name: Text and code snippet classification
        dataset:
          name: SNIP hard-neighbor cases
          type: internal
        metrics:
          - type: accuracy
            value: 0.9932
            name: Accuracy
          - type: macro_f1
            value: 0.6905
            name: Macro F1

SNIP Model Card

Test the model at snip.wesring.com. Read the full report on Hugging Face or in the GitHub repo.

Model

  • Name: Small N-gram Identifier for Pastes (SNIP)
  • Package version: 1.0.0
  • Model version: snip-109
  • Model file: model/snip_model.json
  • Runtime source: src/snip.ts
  • Published runtime: dist/snip.js

Intended Use

SNIP predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. It aims to be fast, small, and reasonably accurate.

Labels

bash, c, cpp, csharp, css, csv, diff, dockerfile, go, html, ini, java, javascript, json, log, lua, markdown, php, plain_text, powershell, python, ruby, rust, sql, toml, typescript, xml, yaml.

Architecture

  • Multiclass linear classifier
  • Hashed character n-grams, length 1-5
  • 32,768 hash buckets
  • L2-normalized log1p(count) features
  • 1,500 retained weights per label
  • 4-decimal serialized weights
  • TypeScript source with no runtime dependencies
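The feature pipeline listed above can be sketched as follows. This is a minimal illustration, not the code in src/snip.ts: the hash function (an FNV-1a-style hash over UTF-16 code units), the `LabelWeights` shape, and all function names are assumptions, since the model card does not specify them.

```typescript
// Illustrative sketch only. Constants come from the model card;
// the hash function and data layout are assumptions.
const NUM_BUCKETS = 32768; // 32,768 hash buckets
const MIN_N = 1;           // shortest character n-gram
const MAX_N = 5;           // longest character n-gram

// FNV-1a-style 32-bit hash over UTF-16 code units (hypothetical choice).
function hashNgram(text: string, start: number, end: number): number {
  let hash = 0x811c9dc5;
  for (let i = start; i < end; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Counts hashed n-grams, applies log1p, then L2-normalizes the vector.
function extractFeatures(text: string): Float64Array {
  const features = new Float64Array(NUM_BUCKETS);
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      features[hashNgram(text, i, i + n) % NUM_BUCKETS] += 1;
    }
  }
  let sumSquares = 0;
  for (let b = 0; b < NUM_BUCKETS; b++) {
    if (features[b] > 0) {
      features[b] = Math.log1p(features[b]);
      sumSquares += features[b] * features[b];
    }
  }
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let b = 0; b < NUM_BUCKETS; b++) features[b] /= norm;
  }
  return features;
}

// Sparse per-label weights: bucket indices and their values (hypothetical layout).
interface LabelWeights {
  label: string;
  indices: number[];
  values: number[];
}

// Multiclass linear scoring: dot product per label, highest score wins.
function classify(text: string, model: LabelWeights[]): string {
  const features = extractFeatures(text);
  let bestLabel = "";
  let bestScore = -Infinity;
  for (const entry of model) {
    let score = 0;
    for (let k = 0; k < entry.indices.length; k++) {
      score += entry.values[k] * features[entry.indices[k]];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLabel = entry.label;
    }
  }
  return bestLabel;
}
```

Because each label keeps only a sparse set of weights, scoring cost in this sketch depends on the number of retained weights rather than on the full bucket count.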

Size

  • Raw model JSON: 626,596 bytes
  • Gzip model JSON: 203,820 bytes
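For context, 28 labels with 1,500 retained weights each gives at most 42,000 weights, so the raw JSON works out to roughly 15 bytes per weight if every label keeps the full 1,500. The sketch below shows one way to prune and round a dense weight vector before serialization. Selecting weights by largest absolute value is an assumption, as are the `SparseWeights` shape and the function name; the model card only states how many weights are retained and at what precision.

```typescript
// Illustrative sketch only: keeps the `keep` largest-magnitude weights
// (an assumed selection rule) and rounds them to `decimals` places.
interface SparseWeights {
  indices: number[]; // bucket indices, ascending
  values: number[];  // rounded weights, aligned with indices
}

function pruneAndRound(dense: number[], keep: number, decimals: number): SparseWeights {
  const scale = Math.pow(10, decimals);
  const kept = dense
    .map((_, i) => i)
    .sort((a, b) => Math.abs(dense[b]) - Math.abs(dense[a])) // largest magnitude first
    .slice(0, keep)
    .sort((a, b) => a - b); // restore ascending bucket order
  return {
    indices: kept,
    values: kept.map((i) => Math.round(dense[i] * scale) / scale),
  };
}

// Example: keep 1,500 weights per label at 4 decimals, as in the model card.
// const sparse = pruneAndRound(denseWeightsForLabel, 1500, 4);
```

Short decimal strings also tend to compress well, which is consistent with the gzip size being about a third of the raw size.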

Performance

Measured in Google Chrome 149.0.7827.116 on macOS:

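| Input size | Sampled chars | P50 (ms) | P95 (ms) |
|------------|---------------|----------|----------|
| 1 KB       | 1,024         | 1.490    | 1.548    |
| 16 KB      | 16,384        | 6.580    | 6.730    |
| 100 KB     | 12,292        | 5.180    | 5.310    |
| 1 MB       | 12,292        | 5.170    | 5.310    |
| 5 MB       | 12,292        | 5.210    | 5.380    |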
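The figures above suggest that inputs of 100 KB and larger are sampled down to 12,292 characters, which would explain why latency stays flat from 100 KB to 5 MB. The model card does not say how the timings were collected. The sketch below is one common way to obtain P50 and P95 values, using a nearest-rank percentile; the function names and the iteration count are assumptions.

```typescript
// Illustrative sketch only: nearest-rank percentile over timings in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based nearest rank
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Times `run` repeatedly and reports the median and 95th percentile.
function benchmark(run: () => void, iterations: number): { p50: number; p95: number } {
  const timings: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    run();
    timings.push(performance.now() - start);
  }
  return { p50: percentile(timings, 50), p95: percentile(timings, 95) };
}

// Hypothetical usage, where `classifyInput` stands in for the real inference call:
// const result = benchmark(() => classifyInput(sampleText), 200);
```

Browser timers can be coarsened for privacy reasons, so sub-millisecond differences between runs should be read with some caution.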

Metrics

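| Evaluation set       | Examples | Accuracy | Macro F1 |
|----------------------|----------|----------|----------|
| Validation           | 487      | 1.0000   | 1.0000   |
| Test                 | 532      | 0.9962   | 0.9926   |
| Hard cases           | 148      | 0.9932   | 0.6905   |
| Development holdouts | 207      | 0.9903   | -        |
| Regression holdouts  | 196      | 0.9796   | -        |
| Final mixed holdout  | 56       | 0.9821   | 0.9810   |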
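Accuracy and macro F1 can diverge sharply on small or imbalanced sets, as in the hard cases row. An accuracy of 0.9932 on 148 examples is consistent with a single misclassification (147/148 ≈ 0.9932), and macro F1 averages per-label F1 scores with equal weight, so one label with an F1 of zero pulls the average down regardless of how few examples it has. The sketch below shows the general calculation. It takes the label set as the union of true and predicted labels, which is an assumption; the model card does not state the convention used.

```typescript
// Illustrative sketch only: fraction of examples predicted correctly.
function accuracy(truth: string[], predicted: string[]): number {
  if (truth.length === 0) return 0;
  let correct = 0;
  for (let i = 0; i < truth.length; i++) {
    if (truth[i] === predicted[i]) correct++;
  }
  return correct / truth.length;
}

// Unweighted mean of per-label F1 over every label seen in truth or predictions.
function macroF1(truth: string[], predicted: string[]): number {
  const labels: string[] = [];
  for (const label of truth.concat(predicted)) {
    if (labels.indexOf(label) === -1) labels.push(label);
  }
  if (labels.length === 0) return 0;
  let sum = 0;
  for (const label of labels) {
    let tp = 0;
    let fp = 0;
    let fn = 0;
    for (let i = 0; i < truth.length; i++) {
      const isTrue = truth[i] === label;
      const isPredicted = predicted[i] === label;
      if (isTrue && isPredicted) tp++;
      else if (!isTrue && isPredicted) fp++;
      else if (isTrue && !isPredicted) fn++;
    }
    const denominator = 2 * tp + fp + fn;
    sum += denominator === 0 ? 0 : (2 * tp) / denominator;
  }
  return sum / labels.length;
}
```

As a toy case, ten `json` examples with nine predicted as `json` and one as `log` give an accuracy of 0.9 but a macro F1 of about 0.47, because the `log` label scores zero.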

Limitations

  • Optimized for text, not binary file identification.
  • Very short snippets may lack enough evidence for a specific syntax label.
  • TypeScript and JavaScript can be close when a snippet has no type syntax.
  • JSONL application logs can be close to JSON.
  • Markdown/plain-text separation can be weak on very short prose-like Markdown.
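An application can soften these limitations by treating close calls as uncertain. The sketch below falls back to a caller-chosen label when the gap between the top two scores is small. It assumes the caller has per-label scores available; whether the published runtime exposes them is not stated in the model card, and the `Scored` shape, function name, and threshold are all hypothetical.

```typescript
// Illustrative sketch only: per-label score as an application might receive it.
interface Scored {
  label: string;
  score: number;
}

// Returns the top label unless the margin over the runner-up is below `minMargin`.
function pickLabel(scores: Scored[], minMargin: number, fallback: string): string {
  if (scores.length === 0) return fallback;
  const sorted = scores.slice().sort((a, b) => b.score - a.score);
  if (sorted.length === 1) return sorted[0].label;
  const margin = sorted[0].score - sorted[1].score;
  return margin >= minMargin ? sorted[0].label : fallback;
}

// Hypothetical usage: prefer plain_text when the classifier is undecided.
// const label = pickLabel(scoresFromModel, 0.05, "plain_text");
```

A suitable threshold depends on the score scale and would need tuning against real inputs.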

Training Notes

The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.