Final Plan: pico-type, a 3.5M Multi-Head Byte Classifier for Universal Content Type Detection
1. Executive Summary
pico-type is a tiny, fully open-source, byte-level multi-head classifier that takes any blob (clipboard text, file bytes, image header, etc.) and emits a structured label set in one forward pass. It ships as:
- A HuggingFace model (Apache-2.0) with multi-tier Matryoshka exports
- A Rust + Python CLI (picotype)
- A Gradio Space that doubles as an MCP server (callable from Claude, Cursor, VSCode)
- A browser extension (competing with ClipGate)
- A Raycast/Alfred extension
- A companion arXiv paper
Why now / the gap: Existing clipboard tools are either regex-only (ClipGate, 13 types) or LLM-powered (Clipboard_ai, which needs Ollama and GB-scale weights). Existing tiny classifiers each do one job (code language, MIME type, intent, or PII). No model covers all of them with multi-head output from a single forward pass of a sub-5 MB model.
2. Architecture
Inputs (≤1024 UTF-8 bytes, masked/padded)
│
├─ ByteEmbed (256 → 96d, learned; optional FFT-Rotate variant from Kathleen)
│
├─ 3× Conv1D block (kernel 3,5,7) + GELU + residual
│
├─ 2× Bidirectional attention block (d=192, 4 heads, RoPE θ=500k)
│     [ablation: also test a Kathleen-style O(L) oscillator trunk
│      for the sub-1MB mobile tier]
│
├─ Pool = [mean ⊕ max ⊕ std] → 576d shared trunk
│
└─ Matryoshka heads (independent Linear at 16/64/192/576 dims)
   ├─ h_coarse (12-way softmax): primary type
   ├─ h_modality (8-way softmax): textual / binary-image / binary-archive …
   ├─ h_subtype (24-way softmax): JSON, YAML, CSV, HTML, … (only if coarse ∈ {config, markup, data})
   ├─ h_code_lang (62-way softmax + "undetected"): only if coarse = code
   ├─ h_text_lang (30-way softmax + "undetected"): only if coarse = text
   ├─ h_file_mime (90-way softmax + "undetected"): only if modality ∈ {binary-*}
   └─ h_risk (6-way sigmoid multi-label): api_key, jwt, ssh_key, password, email, phone
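The input stage at the top of the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function name `encode_blob`, the use of byte value 0 as padding, and the 1/0 mask convention are all assumptions made for the example.

```python
from typing import List, Tuple, Union

MAX_LEN = 1024  # byte cap stated in the plan
PAD_ID = 0      # assumption: padding reuses byte value 0; the mask marks it as invalid


def encode_blob(blob: Union[bytes, str], max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Return (ids, mask): byte values padded to max_len and a 1/0 validity mask."""
    if isinstance(blob, str):
        blob = blob.encode("utf-8")
    data = blob[:max_len]                 # truncate anything longer than the cap
    pad = max_len - len(data)
    ids = list(data) + [PAD_ID] * pad     # each byte becomes an integer in 0..255
    mask = [1] * len(data) + [0] * pad    # 1 = real byte, 0 = padding
    return ids, mask


ids, mask = encode_blob("print('hi')")
```

Truncating at a byte boundary can split a multi-byte UTF-8 character, which is harmless for a model that only ever sees raw bytes.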
Output schema (always returned, with confidence):
{
  "coarse": "code",
  "modality": "textual",
  "subtype": null,
  "code_language": "python",   // or "undetected"
  "text_language": null,
  "file_mime": null,
  "risk_flags": [],
  "confidence": 0.94,
  "model_tier": "base"         // tiny | small | base | pro
}
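One way to read the schema together with the head conditions in the architecture diagram is that a conditional head reports null whenever its gate is closed. The sketch below illustrates that reading. The helper `build_output`, the `head_preds` dictionary, the 0.5 risk threshold, and the single `confidence` argument are hypothetical; the plan does not specify any of them.

```python
from typing import Dict, List, Optional

RISK_LABELS = ["api_key", "jwt", "ssh_key", "password", "email", "phone"]
SUBTYPE_COARSE = {"config", "markup", "data"}  # coarse classes that enable h_subtype


def build_output(
    coarse: str,
    modality: str,
    head_preds: Dict[str, str],    # raw argmax label of each conditional head
    risk_probs: List[float],       # six sigmoid outputs, ordered as RISK_LABELS
    confidence: float,
    model_tier: str,
    risk_threshold: float = 0.5,   # assumption: threshold is not given in the plan
) -> Dict[str, object]:
    def gated(head: str, active: bool) -> Optional[str]:
        # A head's prediction is reported only when its gate condition holds.
        return head_preds.get(head) if active else None

    return {
        "coarse": coarse,
        "modality": modality,
        "subtype": gated("subtype", coarse in SUBTYPE_COARSE),
        "code_language": gated("code_language", coarse == "code"),
        "text_language": gated("text_language", coarse == "text"),
        "file_mime": gated("file_mime", modality.startswith("binary-")),
        "risk_flags": [lab for lab, p in zip(RISK_LABELS, risk_probs) if p >= risk_threshold],
        "confidence": confidence,
        "model_tier": model_tier,
    }


out = build_output(
    "code", "textual",
    {"code_language": "python", "subtype": "json", "file_mime": "image/png"},
    [0.01] * 6, 0.94, "tiny",
)
```

Here the subtype and MIME heads produced labels, but both are suppressed because the coarse type is code and the modality is textual.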
Size targets (Matryoshka tiering):
| Tier | Dim | Params | FP32 | INT8 | Target device |
|---|---|---|---|---|---|
| pico-type-tiny | 16 | ~0.5M | 2 MB | 0.5 MB | MCU / IoT |
| pico-type-small | 64 | ~1.5M | 6 MB | 1.5 MB | Browser (WASM), mobile |
| pico-type-base | 192 | ~3.5M | 14 MB | 3.5 MB | Desktop CLI, browser ext |
| pico-type-pro | 576 | ~8M | 32 MB | 8 MB | Server / accurate use |
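The FP32 and INT8 columns follow directly from the parameter counts: 4 bytes per weight for FP32 and 1 byte per weight for INT8. A quick check, assuming 1 MB = 10^6 bytes and ignoring quantization scales and file-format overhead:

```python
# Approximate parameter counts taken from the tier table.
TIER_PARAMS = {
    "pico-type-tiny": 0.5e6,
    "pico-type-small": 1.5e6,
    "pico-type-base": 3.5e6,
    "pico-type-pro": 8e6,
}


def size_mb(params: float, bytes_per_weight: int) -> float:
    """Raw weight storage in MB (10^6 bytes), excluding any container overhead."""
    return params * bytes_per_weight / 1e6


sizes = {
    tier: {"fp32_mb": size_mb(p, 4), "int8_mb": size_mb(p, 1)}
    for tier, p in TIER_PARAMS.items()
}

for tier, s in sizes.items():
    print(f"{tier}: FP32 {s['fp32_mb']:g} MB, INT8 {s['int8_mb']:g} MB")
```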
3. Training
Framework: Pure PyTorch (~600 LoC train script), no transformers dependency, permissively licensed.
Distillation teachers (Polymorph-style per-head):
- microsoft/deberta-v3-small (140M) → coarse + modality
- huggingface/CodeBERTa-language-id (84M) → code_lang
- papluca/xlm-roberta-base-language-detection (270M) → text_lang
- mjbommar/magic-bert-50m-roformer-classification (42M) → file_mime
- T=2.0, α=0.7 KD loss, plus hard-label CE on 30% of batch
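A per-sample version of the distillation loss might look like the sketch below. It assumes the common Hinton-style formulation: a KL term between temperature-softened teacher and student distributions, scaled by T², mixed with cross-entropy on the hard label. How the plan combines the two terms for the 30% of samples with hard labels is not stated; treating α as the mixing weight, and using the soft term alone when no hard label is present, are assumptions.

```python
import math
from typing import List, Optional


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    hard_label: Optional[int] = None,
    T: float = 2.0,
    alpha: float = 0.7,
) -> float:
    log_t = log_softmax(teacher_logits, T)
    log_s = log_softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions, scaled by T^2.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    soft = (T ** 2) * kl
    if hard_label is None:
        return soft  # assumption: samples without a hard label use the soft term only
    ce = -log_softmax(student_logits)[hard_label]  # cross-entropy at T = 1
    return alpha * soft + (1 - alpha) * ce


teacher = [2.0, 0.0, -1.0]
matched = kd_loss(teacher, teacher)            # student agrees with the teacher
mismatched = kd_loss([0.0, 0.0, 0.0], teacher)  # uniform student
```

The soft term vanishes when the student reproduces the teacher's logits and grows as the two distributions diverge.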
Data sources (all open / free):
- Code: bigcode/the-stack-v2 (multilingual), cakiki/rosetta-code
- Natural text: Wikipedia dumps, OSCAR, mC4 (subset)
- Configs/markup: GitHub *.{json,yaml,toml,ini,xml,html,md,tex} via BigQuery GH archive
- URLs: CommonCrawl WAT
- Errors: StackOverflow / GH issue bodies (regex-filtered for traceback patterns)
- File bytes: mjbommar/binary-tokenizer-001-64k corpora, plus synthetic magic-byte headers
- Images: 1 KB header slices from OpenImages V7
- Secrets: synthetic generation (regex + entropy filter), gptmail-secret-detection datasets
Synthetic data generator: a single Python script producing balanced mixtures of prose paragraphs, code snippets (multi-language, 64–1023 bytes each), config files, error traces, etc. This is critical for head coverage.
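The balancing and cropping idea can be sketched as follows. This is only an illustration: the pool contents are placeholders, and the function names `crop_snippet` and `balanced_mixture` are hypothetical. It draws the same number of samples from each class and crops each one to a random window of 64 to 1023 bytes.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

MIN_LEN, MAX_LEN = 64, 1023  # snippet length range from the plan


def crop_snippet(raw: bytes, rng: random.Random) -> bytes:
    """Cut a random window of MIN_LEN..MAX_LEN bytes out of raw."""
    if len(raw) < MIN_LEN:
        return raw  # too short to crop; keep as is
    length = rng.randint(MIN_LEN, min(MAX_LEN, len(raw)))
    start = rng.randint(0, len(raw) - length)
    return raw[start:start + length]


def balanced_mixture(
    pools: Dict[str, List[bytes]], per_class: int, seed: int = 0
) -> List[Tuple[bytes, str]]:
    """Draw per_class cropped samples from every class, then shuffle."""
    rng = random.Random(seed)
    samples = []
    for label, pool in pools.items():
        for _ in range(per_class):
            samples.append((crop_snippet(rng.choice(pool), rng), label))
    rng.shuffle(samples)
    return samples


# Placeholder pools; real pools would hold prose, code, configs, traces, etc.
pools = {"prose": [b"lorem ipsum " * 200], "code": [b"def f(x):\n    return x\n" * 30]}
batch = balanced_mixture(pools, per_class=5, seed=1)
label_counts = Counter(label for _, label in batch)
```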
Training recipe:
- Sequence length 1024 bytes
- AdamW, peak lr 3e-3, cosine, 5% warmup, 30 epochs
- bf16, batch 128, grad clip 1.0
- Task-balanced sampling: each head gets ≥1 sample per batch
- Loss = Σ_h w_h · L_h, with the weights w_h tuned on the validation set
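The learning-rate schedule in the recipe (peak 3e-3, cosine decay, 5% warmup) can be written as a small function. The sketch assumes linear warmup, per-step updates, and decay to zero, none of which the recipe states explicitly; `lr_at` is a hypothetical name.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-3,
    warmup_frac: float = 0.05,
    min_lr: float = 0.0,  # assumption: the cosine decays all the way to zero
) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at(s, 1000) for s in range(1000)]
```

With 1000 total steps the rate climbs for the first 50 steps, reaches the peak, and then falls monotonically toward zero.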
Hardware: Single A100 80GB, ~36 h total training (3 seeds).
4. Data → Label Pipeline
| Bucket | Source | Auto-labels |
|---|---|---|
| Code | The Stack v2 | language field, repo paths |
| Prose | Wikipedia, mC4 | lang metadata |
| Config/markup | GH files | file ext + content sniff |
| URL | CommonCrawl WAT | regex for http(s):// |
| Error | SO, GH issues | regex for Traceback, Error:, at line |
| File bytes | binary-tokenizer corpora | magic-byte first-32B lookup |
| Image header | OpenImages | first 32B → format (PNG/JPEG/WebP/…) |
| Secret | synthetic | regex + entropy ≥ 4.5 |
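Two of the auto-labelers in the table are simple enough to sketch directly: a magic-byte lookup over the first 32 bytes, and a Shannon-entropy filter for secret candidates. The signatures used below are the standard ones for PNG, JPEG, GIF, and WebP. Interpreting the 4.5 threshold as bits per character is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Optional

# Well-known file signatures (prefix, label).
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_format(header: bytes) -> Optional[str]:
    """Label a blob from its first 32 bytes, or return None if unrecognized."""
    head = header[:32]
    for magic, mime in MAGIC:
        if head.startswith(magic):
            return mime
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    return None


def shannon_entropy(text: str) -> float:
    """Empirical Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def looks_like_secret(token: str, threshold: float = 4.5) -> bool:
    # Assumption: the plan's "entropy >= 4.5" is measured in bits per character.
    return shannon_entropy(token) >= threshold
```

Under this interpretation a token needs at least 23 distinct characters to pass, since a string with k distinct characters has at most log2(k) bits of entropy per character.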
5. Distribution & Popularity
5.1 HuggingFace deliverables
- pico-type/pico-type-base: main model card with eval harness, ONNX int8/fp16, tract-onnx, GGUF
- pico-type/pico-type-tiny…-pro: Matryoshka tiers
- pico-type/picotype-space: Gradio Space (also a registered MCP server on HF Hub; free distribution via huggingface.co/mcp proxy)
- pico-type/picotype-eval: public eval suite (reproducible JSON benchmarks)
- Datasets: pico-type/synth-clipboard-v1, pico-type/eval-suite-v1
5.2 Software deliverables
- CLI (picotype): Rust binary, ONNX runtime, sub-5ms inference. Reads stdin/file/clipboard, prints JSON.
- MCP server (picotype-mcp): stdio + Streamable HTTP, exposes tools: classify, classify_batch, watch_clipboard (macOS/Windows), classify_history
- Browser extension (Chrome/Firefox): replaces ClipGate's regex with pico-type; per-type icons; on-device, 100% local
- Raycast extension + Alfred workflow
- VSCode extension: paste-with-type, status bar shows type of selection
- iOS Shortcut / Android Tasker plug-in
5.3 Launch timeline (4–6 weeks)
| Week | Milestone |
|---|---|
| 1 | Repo scaffold, byte-level pipeline working on synthetic data, all heads wired |
| 2 | Real data ingest, distillation teachers downloaded, base tier trained, eval harness |
| 3 | All four Matryoshka tiers trained, ONNX/tract/GGUF exported, ablations complete |
| 4 | CLI (Rust), MCP server (Python+Gradio), HF Space + MCP registration, model card |
| 5 | Browser extension (MV3), Raycast/VSCode extensions, docs site, demo video |
| 6 | arXiv preprint, Show HN, r/MachineLearning, r/LocalLLaMA, r/programming launches |
5.4 Growth hooks
- First MCP-native content classifier: every Claude/Cursor/VSCode user is a potential user
- Free-tier on HF = automatic inference endpoint exposure
- Trending strategy: launch on a Tuesday, cross-post 4-5 channels same day, follow-up "vs ClipGate benchmarks" post
- Maintain a "What's new in pico-type" weekly digest to build an audience
- Open evaluation suite: community contributes new types (e.g., meme_url, arxiv_id, semantic_version) via a small "head-add" fine-tuning recipe
- Roadmap V2: adding a user-contributed custom_types head (LoRA per type) à la Polymorph
6. Repo structure (proposed)
pico-type/
├── README.md
├── LICENSE                  # Apache-2.0
├── model/
│   ├── pico_type/
│   │   ├── arch.py          # ByteHybrid trunk + heads
│   │   ├── train.py         # multi-task trainer
│   │   ├── distill.py       # KD from per-head teachers
│   │   ├── data.py          # synthetic generator + dataset
│   │   ├── export.py        # ONNX, int8, tract, gguf
│   │   └── eval.py          # public eval harness
│   └── configs/             # tier configs
├── crates/picotype/         # Rust CLI
├── crates/picotype-mcp/     # Rust MCP server (or Python)
├── extensions/
│   ├── chrome/              # MV3
│   ├── raycast/
│   ├── alfred/
│   └── vscode/
├── paper/                   # arXiv LaTeX
├── spaces/picotype/         # Gradio + MCP
└── docs/                    # mintlify or mkdocs
7. Key risks & mitigations
| Risk | Mitigation |
|---|---|
| Multi-head gradients fight (task interference) | Gradient norm clipping per-head, gating + scheduled dropout |
| The 30-language text head underperforms on low-resource languages | Reuse the PleIAs/CommonLingua approach, weight by data |
| Byte-level convs slow on long inputs | Cap at 1024 bytes, document limit, Matryoshka tiers |
| MCP ecosystem churns | Support stdio + Streamable HTTP (latest spec), drop SSE |
| Competition clones us | First-mover + multi-channel launch + arXiv paper + community eval suite |
8. Open decisions before I start coding
- Name lock: pico-type confirmed. Handle: pico-type on HF? Or picotype (no dash)? I suggest pico-type for HF and picotype for the CLI to match existing naming patterns.
- Tier naming: I propose tiny / small / base / pro, matching Sentence-Transformers conventions.
- OSI license: Apache-2.0 (matches CommonLingua base, patent grant). Confirm.
- arXiv target: cs.CL (primary) + cs.LG. Co-authors: open question.
- Tagline (for the model card): "One tiny model, one forward pass, every clipboard."
Ready to execute when you give the green light. I'll start by scaffolding the repo (model/pico_type/arch.py + data.py + minimal train.py) and getting a single-head baseline running on synthetic data, proving the byte-level pipeline + multi-head design before we invest in full data ingest. Estimated time-to-first-baseline: ~2 hours.