	---
	license: bsd-3-clause
	pipeline_tag: text-classification
	tags:
	- text-classification
	- code-classification
	- ngram
	- browser
	- typescript
	model-index:
	- name: SNIP
	  results:
	  - task:
	      type: text-classification
	      name: Text and code snippet classification
	    dataset:
	      name: SNIP validation split
	      type: internal
	    metrics:
	    - type: accuracy
	      value: 1.0
	      name: Accuracy
	    - type: macro_f1
	      value: 1.0
	      name: Macro F1
	  - task:
	      type: text-classification
	      name: Text and code snippet classification
	    dataset:
	      name: SNIP test split
	      type: internal
	    metrics:
	    - type: accuracy
	      value: 0.9962
	      name: Accuracy
	    - type: macro_f1
	      value: 0.9926
	      name: Macro F1
	  - task:
	      type: text-classification
	      name: Text and code snippet classification
	    dataset:
	      name: SNIP held-out evaluation suites
	      type: internal
	    metrics:
	    - type: accuracy
	      value: 0.9816
	      name: Accuracy
	    - type: macro_f1
	      value: 0.9819
	      name: Macro F1
	  - task:
	      type: text-classification
	      name: Text and code snippet classification
	    dataset:
	      name: SNIP hard-neighbor cases
	      type: internal
	    metrics:
	    - type: accuracy
	      value: 0.9932
	      name: Accuracy
	    - type: macro_f1
	      value: 0.6905
	      name: Macro F1
	---

	# SNIP Model Card

	Try the model at [snip.wesring.com](https://snip.wesring.com). Read the full report on Hugging Face or in the [GitHub repo](https://github.com/wesr/snip/blob/main/REPORT.md).

	## Model

	- Name: Small N-gram Identifier for Pastes (SNIP)
	- Package version: `1.0.0`
	- Model version: `snip-109`
	- Model file: `model/snip_model.json`
	- Runtime source: `src/snip.ts`
	- Published runtime: `dist/snip.js`

	## Intended Use

	SNIP predicts the most likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. It is designed for browser-local inference in applications where sending pasted content to a server is undesirable. The design goals are low latency, a small download, and high accuracy; the Size, Performance, and Metrics sections give the measured figures.

	## Labels

	`bash`, `c`, `cpp`, `csharp`, `css`, `csv`, `diff`, `dockerfile`, `go`, `html`, `ini`, `java`, `javascript`, `json`, `log`, `lua`, `markdown`, `php`, `plain_text`, `powershell`, `python`, `ruby`, `rust`, `sql`, `toml`, `typescript`, `xml`, `yaml`.
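
For applications that want compile-time checking of label names, the list above can be written as a TypeScript union. This is only a sketch: the label strings come from this card, but the identifiers `LABELS`, `SnipLabel`, and `isSnipLabel` are hypothetical and are not claimed to exist in the SNIP package.

```typescript
// Label strings copied from the list above. The names LABELS, SnipLabel,
// and isSnipLabel are illustrative and not part of the published runtime.
const LABELS = [
  "bash", "c", "cpp", "csharp", "css", "csv", "diff", "dockerfile",
  "go", "html", "ini", "java", "javascript", "json", "log", "lua",
  "markdown", "php", "plain_text", "powershell", "python", "ruby",
  "rust", "sql", "toml", "typescript", "xml", "yaml",
] as const;

// Union of all 28 label strings.
type SnipLabel = (typeof LABELS)[number];

// Narrow an arbitrary string (for example, a stored preference) to a known label.
function isSnipLabel(value: string): value is SnipLabel {
  return (LABELS as readonly string[]).indexOf(value) !== -1;
}
```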

	## Architecture

	- Multiclass linear classifier
	- Hashed character n-grams, length 1-5
	- 32,768 hash buckets
	- L2-normalized `log1p(count)` features
	- 1,500 retained weights per label
	- 4-decimal serialized weights
	- TypeScript source with no runtime dependencies
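
The pipeline described by these bullets can be sketched roughly as follows. This is a minimal illustration, not the code in `src/snip.ts`: the hash function (FNV-1a over UTF-16 code units), the absence of a bias term, and the sparse `Map` layout for weights are all assumptions made for the example, and the names `hashNgram`, `extractFeatures`, and `predict` are hypothetical.

```typescript
const NUM_BUCKETS = 32768; // 32,768 hash buckets, as listed above
const MIN_N = 1;           // shortest character n-gram
const MAX_N = 5;           // longest character n-gram

// ASSUMPTION: FNV-1a over UTF-16 code units. The card does not say which hash SNIP uses.
function hashNgram(text: string, start: number, length: number): number {
  let h = 0x811c9dc5;
  for (let i = start; i < start + length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % NUM_BUCKETS;
}

// Count hashed n-grams, apply log1p to each count, then L2-normalize the vector.
function extractFeatures(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = hashNgram(text, i, n);
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
    }
  }
  let sumSquares = 0;
  counts.forEach((count) => {
    const v = Math.log1p(count);
    sumSquares += v * v;
  });
  const norm = Math.sqrt(sumSquares);
  const features = new Map<number, number>();
  counts.forEach((count, bucket) => {
    features.set(bucket, norm > 0 ? Math.log1p(count) / norm : 0);
  });
  return features;
}

// ASSUMPTION: each label stores a sparse bucket -> weight map and no bias term.
type SparseModel = Record<string, Map<number, number>>;

// Linear scoring: dot product of the feature vector with each label's weights, then argmax.
function predict(text: string, model: SparseModel): string {
  const features = extractFeatures(text);
  let bestLabel = "";
  let bestScore = -Infinity;
  for (const label of Object.keys(model)) {
    let score = 0;
    features.forEach((value, bucket) => {
      score += value * (model[label].get(bucket) ?? 0);
    });
    if (score > bestScore) {
      bestScore = score;
      bestLabel = label;
    }
  }
  if (bestLabel === "") throw new Error("model has no labels");
  return bestLabel;
}
```

Keeping only the strongest weights per label is what makes a sparse map a natural fit here: buckets with no retained weight contribute nothing to a label's score.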

	## Size

	- Raw model JSON: 626,596 bytes
	- Gzip model JSON: 203,820 bytes
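
A quick back-of-the-envelope check relates these sizes to the architecture figures. It assumes every one of the 28 labels keeps exactly 1,500 weights, and it attributes all JSON bytes (keys, punctuation, metadata) to the weights, so the per-weight figure is an average that includes overhead, not a measured encoding cost.

```typescript
const RAW_BYTES = 626596;       // raw model JSON, from this card
const GZIP_BYTES = 203820;      // gzipped model JSON, from this card
const LABEL_COUNT = 28;         // number of labels listed in this card
const WEIGHTS_PER_LABEL = 1500; // retained weights per label, from this card

// ASSUMPTION: every label retains exactly WEIGHTS_PER_LABEL weights.
const retainedWeights = LABEL_COUNT * WEIGHTS_PER_LABEL; // 42,000

const gzipRatio = GZIP_BYTES / RAW_BYTES;              // about 0.325
const rawBytesPerWeight = RAW_BYTES / retainedWeights; // about 14.9

console.log(`retained weights: ${retainedWeights}`);
console.log(`gzip/raw ratio: ${(gzipRatio * 100).toFixed(1)}%`);
console.log(`raw bytes per retained weight: ${rawBytesPerWeight.toFixed(2)}`);
```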

	## Performance

	Measured in Google Chrome 149.0.7827.116 on macOS:

	| Input size | Sampled chars | P50 (ms) | P95 (ms) |
	| --- | ---: | ---: | ---: |
	| 1 KB | 1,024 | 1.490 | 1.548 |
	| 16 KB | 16,384 | 6.580 | 6.730 |
	| 100 KB | 12,292 | 5.180 | 5.310 |
	| 1 MB | 12,292 | 5.170 | 5.310 |
	| 5 MB | 12,292 | 5.210 | 5.380 |
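
In the table, inputs of 100 KB and larger are all sampled down to 12,292 characters, which is consistent with their nearly identical timings. The card does not describe how the P50 and P95 values were collected. One way to gather comparable numbers is sketched below, assuming a nearest-rank percentile and a caller-supplied `classify` function. The names `percentile` and `benchmark`, and the run counts, are hypothetical.

```typescript
// Nearest-rank percentile: smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Time repeated calls to a classifier. `classify` stands in for whatever
// function the runtime exposes; its real name is not given in this card.
function benchmark(
  classify: (text: string) => unknown,
  input: string,
  runs: number = 200,
  warmup: number = 20,
): { p50: number; p95: number } {
  // Warm-up iterations let the JIT settle before timing starts.
  for (let i = 0; i < warmup; i++) classify(input);
  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    classify(input);
    timings.push(performance.now() - start);
  }
  return { p50: percentile(timings, 50), p95: percentile(timings, 95) };
}
```

Results will vary with browser, hardware, and timer resolution, so numbers from this harness are not directly comparable to the table unless the environment matches.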

	## Metrics

	| Evaluation Set | Examples | Accuracy | Macro F1 |
	| --- | ---: | ---: | ---: |
	| Validation | 487 | 1.0000 | 1.0000 |
	| Test | 532 | 0.9962 | 0.9926 |
	| Hard cases | 148 | 0.9932 | 0.6905 |
	| Development holdouts | 207 | 0.9903 | - |
	| Regression holdouts | 196 | 0.9796 | - |
	| Final mixed holdout | 56 | 0.9821 | 0.9810 |
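
The hard-case row pairs high accuracy (0.9932) with a much lower macro F1 (0.6905). Macro F1 averages per-label F1 scores with equal weight, so a single mistake involving a rare label can move it a long way while accuracy barely changes. The sketch below shows both metrics on made-up toy data. It assumes macro F1 is averaged over every label that appears in either the true or the predicted labels (a common convention); the card does not state which convention SNIP's evaluation uses, so this may not be the cause of the gap in the table.

```typescript
// Fraction of predictions that exactly match the true label.
function accuracy(yTrue: string[], yPred: string[]): number {
  if (yTrue.length !== yPred.length || yTrue.length === 0) {
    throw new Error("inputs must be non-empty and the same length");
  }
  let correct = 0;
  for (let i = 0; i < yTrue.length; i++) {
    if (yTrue[i] === yPred[i]) correct++;
  }
  return correct / yTrue.length;
}

// Unweighted mean of per-label F1.
// ASSUMPTION: labels are the union of those seen in yTrue and yPred.
function macroF1(yTrue: string[], yPred: string[]): number {
  if (yTrue.length !== yPred.length || yTrue.length === 0) {
    throw new Error("inputs must be non-empty and the same length");
  }
  const stats: Record<string, { tp: number; fp: number; fn: number }> = {};
  const get = (label: string) => (stats[label] ??= { tp: 0, fp: 0, fn: 0 });
  for (let i = 0; i < yTrue.length; i++) {
    if (yTrue[i] === yPred[i]) {
      get(yTrue[i]).tp++;
    } else {
      get(yTrue[i]).fn++; // the true label was missed
      get(yPred[i]).fp++; // the predicted label was a false alarm
    }
  }
  const labels = Object.keys(stats);
  let total = 0;
  for (const label of labels) {
    const { tp, fp, fn } = stats[label];
    const denom = 2 * tp + fp + fn;
    total += denom === 0 ? 0 : (2 * tp) / denom;
  }
  return total / labels.length;
}

// Toy data (not from SNIP's evaluation): nine correct "json" rows, one "log" row predicted as "yaml".
const toyTrue = ["json", "json", "json", "json", "json", "json", "json", "json", "json", "log"];
const toyPred = ["json", "json", "json", "json", "json", "json", "json", "json", "json", "yaml"];
const toyAccuracy = accuracy(toyTrue, toyPred); // 9 of 10 correct
const toyMacroF1 = macroF1(toyTrue, toyPred);   // mean of F1 for json (1), log (0), yaml (0)
```

In the toy data one wrong prediction leaves accuracy at 0.9 but drops macro F1 to one third, because two of the three labels in play score an F1 of zero.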

	## Limitations

	- Optimized for text, not binary file identification.
	- Very short snippets may lack enough evidence for a specific syntax label.
	- TypeScript and JavaScript can be close when a snippet has no type syntax.
	- JSONL application logs can be close to JSON.
	- Markdown/plain-text separation can be weak on very short prose-like Markdown.
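
For the close pairs listed above, an application may prefer to treat a narrow win as ambiguous instead of committing to one label. The sketch below assumes the caller has per-label scores available as a plain object. Whether the published runtime exposes such scores is not stated in this card, and `rankWithMargin` and the `0.05` threshold are hypothetical choices that would need tuning.

```typescript
interface RankedPrediction {
  label: string;           // highest-scoring label
  runnerUp: string | null; // second-highest label, if any
  margin: number;          // score gap between the top two labels
  ambiguous: boolean;      // true when the gap is below the threshold
}

// ASSUMPTION: `scores` maps each label to a comparable numeric score.
function rankWithMargin(
  scores: Record<string, number>,
  minMargin: number = 0.05, // hypothetical threshold
): RankedPrediction {
  const ranked = Object.keys(scores)
    .map((label) => ({ label, score: scores[label] }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) throw new Error("no scores provided");
  const best = ranked[0];
  const second = ranked.length > 1 ? ranked[1] : null;
  const margin = second ? best.score - second.score : Infinity;
  return {
    label: best.label,
    runnerUp: second ? second.label : null,
    margin,
    ambiguous: margin < minMargin,
  };
}
```

A caller could, for example, show both `label` and `runnerUp` in a language picker when `ambiguous` is true.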

	## Training Notes

	The training corpus combines generated structured text, curated programming examples, realistic local files, and targeted hard-neighbor examples.