# Final Plan: **pico-type**, a 3.5M-Parameter Multi-Head Byte Classifier for Universal Content Type Detection
## 1. Executive Summary
**pico-type** is a tiny, fully open-source, byte-level multi-head classifier that takes any blob (clipboard text, file bytes, image header, etc.) and emits a structured label set in one forward pass. It ships as:
- A HuggingFace model (Apache-2.0) with multi-tier Matryoshka exports
- A Rust + Python CLI (`picotype`)
- A Gradio Space that doubles as an **MCP server** (callable from Claude, Cursor, VSCode)
- A browser extension (competing with ClipGate)
- A Raycast/Alfred extension
- A companion arXiv paper
**Why now / the gap:** Existing clipboard tools are either regex-only (ClipGate, 13 types) or LLM-powered (Clipboard_ai, needs Ollama, GB-scale). Existing tiny classifiers do one job (code-language, MIME, intent, PII). No model does **all of them in one sub-5MB forward pass with a multi-head output**.
## 2. Architecture
```
Inputs (≤1024 UTF-8 bytes, masked/padded)
│
├─ ByteEmbed (256 → 96d, learned; optional FFT-Rotate variant from Kathleen)
│
├─ 3× Conv1D block (kernel 3,5,7) + GELU + residual
│
├─ 2× Bidirectional attention block (d=192, 4 heads, RoPE θ=500k)
│     [ablation: also test a Kathleen-style O(L) oscillator trunk
│      for the sub-1MB mobile tier]
│
├─ Pool = [mean ‖ max ‖ std] → 576d shared trunk
│
└─ Matryoshka heads (independent Linear at 16/64/192/576 dims)
    ├─ h_coarse    (12-way softmax) - primary type
    ├─ h_modality  (8-way softmax) - textual / binary-image / binary-archive …
    ├─ h_subtype   (24-way softmax) - JSON, YAML, CSV, HTML, … (only if coarse∈{config,markup,data})
    ├─ h_code_lang (62-way softmax + "undetected") - if coarse=code
    ├─ h_text_lang (30-way softmax + "undetected") - if coarse=text
    ├─ h_file_mime (90-way softmax + "undetected") - if modality∈{binary-*}
    └─ h_risk      (6-way sigmoid multi-label) - api_key, jwt, ssh_key, password, email, phone
```
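The input and pooling stages of the diagram can be sketched without any framework. This is a minimal illustration, not the planned `arch.py`: the function names, the choice of `0` as the padding id, and the use of population standard deviation are assumptions made for the example.

```python
import math

MAX_LEN = 1024  # byte cap taken from the diagram
PAD_ID = 0      # assumption: padding reuses byte id 0 and is hidden by the mask


def encode_bytes(blob, max_len=MAX_LEN):
    """Truncate or pad a blob to max_len byte ids plus a 0/1 validity mask."""
    ids = list(blob[:max_len])
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


def masked_pool(features, mask):
    """Concatenate per-channel mean, max and std over unmasked positions.

    features is a list of L vectors of width d; the result has width 3 * d,
    so d=192 yields the 576-d trunk vector.
    """
    valid = [vec for vec, keep in zip(features, mask) if keep]
    if not valid:
        raise ValueError("cannot pool an empty input")
    n, d = len(valid), len(valid[0])
    mean = [sum(v[j] for v in valid) / n for j in range(d)]
    peak = [max(v[j] for v in valid) for j in range(d)]
    std = [math.sqrt(sum((v[j] - mean[j]) ** 2 for v in valid) / n) for j in range(d)]
    return mean + peak + std


ids, mask = encode_bytes(b"print('hi')")
pooled = masked_pool([[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]], [1, 1, 0])
```

Masking matters here because padded positions would otherwise drag the mean toward zero and inflate the standard deviation for short clipboard inputs.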
**Output schema (always returned, with confidence):**
```json
{
  "coarse": "code",
  "modality": "textual",
  "subtype": null,
  "code_language": "python",   // or "undetected"
  "text_language": null,
  "file_mime": null,
  "risk_flags": [],
  "confidence": 0.94,
  "model_tier": "base"         // tiny | small | base | pro
}
```
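The conditional heads in the architecture diagram imply a small gating step when the schema is assembled. The sketch below shows one way to do it; `build_output`, the 0.5 risk threshold, and reporting the coarse head's probability as `confidence` are assumptions, since the plan does not specify them.

```python
SUBTYPE_COARSE = {"config", "markup", "data"}
RISK_THRESHOLD = 0.5  # assumption: the plan does not fix a threshold


def build_output(heads, tier="base"):
    """Assemble the output schema from per-head predictions.

    heads maps a head name to (label, probability); the optional "risk"
    entry maps each risk flag to its sigmoid probability.
    """
    coarse, coarse_p = heads["coarse"]
    modality, _ = heads["modality"]

    def pick(name, active):
        # A gated head only reports a label when its condition holds.
        return heads[name][0] if active and name in heads else None

    return {
        "coarse": coarse,
        "modality": modality,
        "subtype": pick("subtype", coarse in SUBTYPE_COARSE),
        "code_language": pick("code_lang", coarse == "code"),
        "text_language": pick("text_lang", coarse == "text"),
        "file_mime": pick("file_mime", modality.startswith("binary-")),
        "risk_flags": sorted(
            flag for flag, p in heads.get("risk", {}).items() if p >= RISK_THRESHOLD
        ),
        "confidence": round(coarse_p, 2),  # assumption: coarse head probability
        "model_tier": tier,
    }


example = build_output({
    "coarse": ("code", 0.94),
    "modality": ("textual", 0.99),
    "code_lang": ("python", 0.90),
    "text_lang": ("en", 0.40),   # ignored because coarse != "text"
    "risk": {"api_key": 0.10},
})
```

Gating at decode time keeps every key present in the response, so clients can rely on a fixed schema and treat `null` as "head not applicable".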
**Size targets (Matryoshka tiering):**
| Tier | Dim | Params | FP32 | INT8 | Target device |
|---|---|---|---|---|---|
| `pico-type-tiny` | 16 | ~0.5M | 2 MB | 0.5 MB | MCU / IoT |
| `pico-type-small` | 64 | ~1.5M | 6 MB | 1.5 MB | Browser (WASM), mobile |
| `pico-type-base` | 192 | ~3.5M | 14 MB | 3.5 MB | Desktop CLI, browser ext |
| `pico-type-pro` | 576 | ~8M | 32 MB | 8 MB | Server / accurate use |
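The FP32 and INT8 columns follow directly from the parameter counts at 4 bytes and 1 byte per weight. A quick check, assuming decimal megabytes and ignoring quantization scales and file-format overhead:

```python
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}

# Approximate parameter counts from the table.
TIER_PARAMS = {"tiny": 0.5e6, "small": 1.5e6, "base": 3.5e6, "pro": 8e6}


def weight_size_mb(params, dtype):
    """Raw weight size in decimal megabytes (10**6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


sizes = {
    tier: (weight_size_mb(n, "fp32"), weight_size_mb(n, "int8"))
    for tier, n in TIER_PARAMS.items()
}
```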
## 3. Training
**Framework:** Pure PyTorch (~600 LoC training script) with no `transformers` dependency, permissively licensed.
**Distillation teachers (Polymorph-style per-head):**
- `microsoft/deberta-v3-small` (140M) → coarse + modality
- `huggingface/CodeBERTa-language-id` (84M) → code_lang
- `papluca/xlm-roberta-base-language-detection` (270M) → text_lang
- `mjbommar/magic-bert-50m-roformer-classification` (42M) → file_mime
- KD loss with temperature T=2.0 and weight α=0.7, plus hard-label CE on 30% of each batch
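For reference, a per-sample version of a standard temperature-scaled distillation loss is sketched below. It assumes α weights the soft (teacher) term and 1 − α weights the hard-label cross-entropy, and it models "hard labels on part of the batch" by letting `label` be `None`; the plan itself does not spell out either detail, and the function names are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_loss(student_logits, teacher_logits, label=None, T=2.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    log_t = log_softmax(teacher_logits, T)
    log_s = log_softmax(student_logits, T)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    soft = (T ** 2) * kl  # T^2 keeps gradient scale comparable across temperatures
    hard = 0.0 if label is None else -log_softmax(student_logits)[label]
    return alpha * soft + (1.0 - alpha) * hard


matched = kd_loss([2.0, 0.0], [2.0, 0.0])
mismatched = kd_loss([0.0, 2.0], [2.0, 0.0])
with_label = kd_loss([2.0, 0.0], [2.0, 0.0], label=0)
```

With per-head teachers, each head would apply this loss against its own teacher's logits, restricted to the classes that teacher covers.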
**Data sources (all open / free):**
- Code: `bigcode/the-stack-v2` (multilingual), `cakiki/rosetta-code`
- Natural text: Wikipedia dumps, OSCAR, mC4 (subset)
- Configs/markup: GitHub `*.{json,yaml,toml,ini,xml,html,md,tex}` via BigQuery GH archive
- URLs: CommonCrawl WAT
- Errors: StackOverflow / GH issue bodies (regex-filtered for traceback patterns)
- File bytes: `mjbommar/binary-tokenizer-001-64k` corpora, plus synthetic magic-byte headers
- Images: 1 KB header slices from OpenImages V7
- Secrets: synthetic generation (regex + entropy filter), `gptmail-secret-detection` datasets
**Synthetic data generator:** A single Python script that produces balanced mixtures of prose paragraphs, code snippets (multi-language, 64–1023 bytes each), config files, error traces, and so on. This is critical for head coverage.
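A minimal sketch of the balancing and snippet-clipping logic might look like the following. The helper names and the `pools` structure are hypothetical; the real generator would draw from the data sources listed above rather than in-memory lists.

```python
import random

MIN_BYTES, MAX_BYTES = 64, 1023  # snippet length range from the plan


def clip_snippet(data, rng):
    """Return a random window of MIN_BYTES..MAX_BYTES bytes, or None if too short."""
    if len(data) < MIN_BYTES:
        return None
    length = rng.randint(MIN_BYTES, min(MAX_BYTES, len(data)))
    start = rng.randint(0, len(data) - length)
    return data[start:start + length]


def balanced_batch(pools, per_class, seed=0):
    """Draw per_class samples from every category so no label dominates.

    pools maps a label (e.g. "code", "prose") to a list of byte strings.
    """
    rng = random.Random(seed)
    batch = []
    for label, items in sorted(pools.items()):
        for _ in range(per_class):
            batch.append((rng.choice(items), label))
    rng.shuffle(batch)
    return batch


demo_pools = {"code": [b"def f(): pass\n" * 10], "prose": [b"Hello world. " * 10]}
demo_batch = balanced_batch(demo_pools, per_class=3)
demo_clip = clip_snippet(b"a" * 5000, random.Random(0))
```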
**Training recipe:**
- Sequence length 1024 bytes
- AdamW, peak lr 3e-3, cosine, 5% warmup, 30 epochs
- bf16, batch 128, grad clip 1.0
- Task-balanced sampling: each head gets ≥1 sample per batch
- Loss = Σ_h w_h · L_h, with the per-head weights w_h tuned on the validation set
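The schedule and the weighted loss in this recipe reduce to two small functions. The sketch assumes linear warmup and decay to zero, neither of which the recipe states, and the head weights in the example are placeholders rather than tuned values.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-3, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def combine_losses(losses, weights):
    """Weighted sum over heads: sum_h w_h * L_h."""
    return sum(weights[head] * value for head, value in losses.items())


first = lr_at(0, 1000)
peak = lr_at(49, 1000)
last = lr_at(1000, 1000)
total = combine_losses({"coarse": 1.0, "risk": 2.0}, {"coarse": 1.0, "risk": 0.5})
```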
**Hardware:** Single A100 80GB, ~36 h total training (3 seeds).
## 4. Data → Label Pipeline
| Bucket | Source | Auto-labels |
|---|---|---|
| Code | The Stack v2 | `language` field, repo paths |
| Prose | Wikipedia, mC4 | `lang` metadata |
| Config/markup | GH files | file ext + content sniff |
| URL | CommonCrawl WAT | regex for `http(s)://` |
| Error | SO, GH issues | regex for `Traceback`, `Error:`, `at line` |
| File bytes | binary-tokenizer corpora | magic-byte first-32B lookup |
| Image header | OpenImages | first 32B → format (PNG/JPEG/WebP/…) |
| Secret | synthetic | regex + entropy ≥ 4.5 |
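Three of the auto-labelers in the table are simple enough to sketch directly. The signature list below covers only a handful of well-known formats, the function names are illustrative, and the entropy threshold is interpreted as Shannon entropy in bits per character, which the table does not state explicitly.

```python
import math
import re
from collections import Counter

# A few well-known magic-byte prefixes; a real table would be much longer.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]

TRACE_RE = re.compile(r"Traceback|Error:|at line")


def sniff_mime(blob):
    """Label a blob from its first 32 bytes, or return None if unknown."""
    head = blob[:32]
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"  # WebP keeps its tag after the RIFF size field
    for prefix, mime in MAGIC:
        if head.startswith(prefix):
            return mime
    return None


def shannon_entropy(text):
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def looks_like_error(text):
    return bool(TRACE_RE.search(text))


def passes_entropy_filter(token, min_entropy=4.5):
    return shannon_entropy(token) >= min_entropy
```

Note that 4.5 bits per character requires at least 23 distinct characters in the token, so short secrets such as 16-character passwords can never pass this filter on entropy alone; that is presumably why the table pairs it with a regex.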
## 5. Distribution & Popularity
### 5.1 HuggingFace deliverables
- `pico-type/pico-type-base`: main model card with eval harness, ONNX int8/fp16, tract-onnx, GGUF
- `pico-type/pico-type-tiny` … `-pro`: Matryoshka tiers
- `pico-type/picotype-space`: Gradio Space (also a **registered MCP server** on HF Hub, with free distribution via the `huggingface.co/mcp` proxy)
- `pico-type/picotype-eval`: public eval suite (reproducible JSON benchmarks)
- Datasets: `pico-type/synth-clipboard-v1`, `pico-type/eval-suite-v1`
### 5.2 Software deliverables
- **CLI** (`picotype`): Rust binary, ONNX runtime, sub-5ms inference. Reads stdin/file/clipboard, prints JSON.
- **MCP server** (`picotype-mcp`): stdio + Streamable HTTP, exposes tools: `classify`, `classify_batch`, `watch_clipboard` (macOS/Windows), `classify_history`
- **Browser extension** (Chrome/Firefox): replaces ClipGate's regex with pico-type; per-type icons; on-device, 100% local
- **Raycast extension** + Alfred workflow
- **VSCode extension**: paste-with-type, status bar shows type of selection
- **iOS Shortcut / Android Tasker** plug-in
### 5.3 Launch timeline (4–6 weeks)
| Week | Milestone |
|---|---|
| 1 | Repo scaffold, byte-level pipeline working on synthetic data, all heads wired |
| 2 | Real data ingest, distillation teachers downloaded, base tier trained, eval harness |
| 3 | All four Matryoshka tiers trained, ONNX/tract/GGUF exported, ablations complete |
| 4 | CLI (Rust), MCP server (Python+Gradio), HF Space + MCP registration, model card |
| 5 | Browser extension (MV3), Raycast/VSCode extensions, docs site, demo video |
| 6 | arXiv preprint, Show HN, r/MachineLearning, r/LocalLLaMA, r/programming launches |
### 5.4 Growth hooks
- First **MCP-native content classifier**: every Claude/Cursor/VSCode user is a potential user
- Free-tier on HF = automatic inference endpoint exposure
- Trending strategy: launch on a Tuesday, cross-post to 4-5 channels the same day, follow up with a "vs ClipGate benchmarks" post
- Maintain a weekly "What's new in pico-type" digest to build an audience
- Open evaluation suite: the community contributes new types (e.g., `meme_url`, `arxiv_id`, `semantic_version`) via a small "head-add" fine-tuning recipe
- Roadmap V2: add a user-contributed `custom_types` head (LoRA per type) à la Polymorph
## 6. Repo structure (proposed)
```
pico-type/
├── README.md
├── LICENSE                  # Apache-2.0
├── model/
│   ├── pico_type/
│   │   ├── arch.py          # ByteHybrid trunk + heads
│   │   ├── train.py         # multi-task trainer
│   │   ├── distill.py       # KD from per-head teachers
│   │   ├── data.py          # synthetic generator + dataset
│   │   ├── export.py        # ONNX, int8, tract, gguf
│   │   └── eval.py          # public eval harness
│   └── configs/             # tier configs
├── crates/picotype/         # Rust CLI
├── crates/picotype-mcp/     # Rust MCP server (or Python)
├── extensions/
│   ├── chrome/              # MV3
│   ├── raycast/
│   ├── alfred/
│   └── vscode/
├── paper/                   # arXiv LaTeX
├── spaces/picotype/         # Gradio + MCP
└── docs/                    # mintlify or mkdocs
```
## 7. Key risks & mitigations
| Risk | Mitigation |
|---|---|
| Multi-head gradients fight (task interference) | Gradient norm clipping per-head, gating + scheduled dropout |
| Low-resource languages among the 30 text languages underperform | Reuse the PleIAs/CommonLingua approach, weight by data |
| Byte-level convs slow on long inputs | Cap at 1024 bytes, document limit, Matryoshka tiers |
| MCP ecosystem churns | Support stdio + Streamable HTTP (latest spec), drop SSE |
| Competition clones us | First-mover + multi-channel launch + arXiv paper + community eval suite |
## 8. Open decisions before I start coding
1. **Name lock**: `pico-type` confirmed. Handle: `pico-type` on HF? Or `picotype` (no dash)? I suggest `pico-type` for HF and `picotype` for the CLI to match existing naming patterns.
2. **Tier naming**: I propose `tiny`/`small`/`base`/`pro` matching Sentence-Transformers conventions.
3. **OSI license**: Apache-2.0 (matches CommonLingua base, patent grant). Confirm.
4. **arXiv target**: `cs.CL` (primary) + `cs.LG`. Co-authors: open question.
5. **Tagline** (for the model card): "One tiny model, one forward pass, every clipboard."
---
**Ready to execute when you give the green light.** I'll start by scaffolding the repo (`model/pico_type/arch.py` + `data.py` + minimal `train.py`) and getting a single-head baseline running on synthetic data, which proves out the byte-level pipeline and multi-head design before we invest in full data ingest. Estimated time-to-first-baseline: ~2 hours.