Final Plan: pico-type, a 3.5M Multi-Head Byte Classifier for Universal Content Type Detection

1. Executive Summary

pico-type is a tiny, fully open-source, byte-level multi-head classifier that takes any blob (clipboard text, file bytes, image header, etc.) and emits a structured label set in one forward pass. It ships as:

  • A HuggingFace model (Apache-2.0) with multi-tier Matryoshka exports
  • A Rust + Python CLI (picotype)
  • A Gradio Space that doubles as an MCP server (callable from Claude, Cursor, VSCode)
  • A browser extension (competing with ClipGate)
  • A Raycast/Alfred extension
  • A companion arXiv paper

Why now / the gap: Existing clipboard tools are either regex-only (ClipGate, 13 types) or LLM-powered (Clipboard_ai, needs Ollama, GB-scale). Existing tiny classifiers do one job (code-language, MIME, intent, PII). No model does all of them in one sub-5MB forward pass with a multi-head output.

2. Architecture

Inputs (≤1024 UTF-8 bytes, masked/padded)
  │
  ├─ ByteEmbed (256 → 96d, learned; optional FFT-Rotate variant from Kathleen)
  │
  ├─ 3× Conv1D block (kernel 3,5,7)  + GELU + residual
  │
  ├─ 2× Bidirectional attention block  (d=192, 4 heads, RoPE θ=500k)
  │     [ablation: also test a Kathleen-style O(L) oscillator trunk
  │      for the sub-1MB mobile tier]
  │
  ├─ Pool = [mean ‖ max ‖ std]  → 576d shared trunk
  │
  └─ Matryoshka heads (independent Linear at 16/64/192/576 dims)
       ├─ h_coarse        (12-way softmax)   - primary type
       ├─ h_modality      (8-way softmax)    - textual / binary-image / binary-archive …
       ├─ h_subtype       (24-way softmax)   - JSON, YAML, CSV, HTML, … (only if coarse∈{config,markup,data})
       ├─ h_code_lang     (62-way softmax + "undetected")  - if coarse=code
       ├─ h_text_lang     (30-way softmax + "undetected")  - if coarse=text
       ├─ h_file_mime     (90-way softmax + "undetected")  - if modality∈{binary-*}
       └─ h_risk          (6-way sigmoid multi-label)      - api_key, jwt, ssh_key, password, email, phone
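
The input and pooling stages of the diagram can be illustrated with a minimal stdlib-only sketch. It assumes zero-padding with a separate mask (the diagram only says "masked/padded") and a population standard deviation; `encode` and `pool` are hypothetical helper names, and the toy 2-channel features stand in for the real 192-channel trunk output, which would pool to 3 × 192 = 576 values.

```python
import math

MAX_LEN = 1024  # byte cap from the diagram above

def encode(blob: bytes, max_len: int = MAX_LEN):
    """Truncate to max_len bytes, right-pad with zeros; mask marks the real bytes."""
    ids = list(blob[:max_len])
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids += [0] * (max_len - len(ids))  # assumption: pad id 0, disambiguated by the mask
    return ids, mask

def pool(features, mask):
    """Concatenate per-channel mean, max and std over the unmasked positions."""
    valid = [f for f, m in zip(features, mask) if m]
    if not valid:
        raise ValueError("cannot pool an empty input")
    n, dim = len(valid), len(valid[0])
    mean = [sum(f[c] for f in valid) / n for c in range(dim)]
    peak = [max(f[c] for f in valid) for c in range(dim)]
    std = [math.sqrt(sum((f[c] - mean[c]) ** 2 for f in valid) / n) for c in range(dim)]
    return mean + peak + std

ids, mask = encode("print('hi')".encode("utf-8"), max_len=16)
# Toy 2-channel "trunk features": the raw byte id and its parity.
feats = [[float(i), float(i % 2)] for i in ids]
pooled = pool(feats, mask)  # length 3 * 2 = 6
```

Padding positions never reach the statistics, so a short clipboard string and the same string followed by padding pool to the same vector.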

Output schema (always returned, with confidence):

{
  "coarse":  "code",
  "modality":"textual",
  "subtype": null,
  "code_language":    "python",        // or "undetected"
  "text_language":    null,
  "file_mime":        null,
  "risk_flags":       [],
  "confidence":       0.94,
  "model_tier":       "base"           // tiny | small | base | pro
}
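
The conditional heads in the architecture map onto this schema roughly as follows. This is a sketch under assumptions: `assemble` and its arguments are hypothetical names, and the 0.5 threshold on the sigmoid risk head is a placeholder, since the plan does not specify one.

```python
SUBTYPE_COARSE = {"config", "markup", "data"}
RISK_THRESHOLD = 0.5  # assumption: the plan does not state a threshold

def assemble(coarse, modality, heads, confidence, tier):
    """Build the output record; heads whose gating condition fails are set to None."""
    risk = heads.get("risk", {})
    return {
        "coarse": coarse,
        "modality": modality,
        "subtype": heads.get("subtype") if coarse in SUBTYPE_COARSE else None,
        "code_language": heads.get("code_lang", "undetected") if coarse == "code" else None,
        "text_language": heads.get("text_lang", "undetected") if coarse == "text" else None,
        "file_mime": heads.get("file_mime", "undetected") if modality.startswith("binary-") else None,
        "risk_flags": sorted(k for k, p in risk.items() if p >= RISK_THRESHOLD),
        "confidence": confidence,
        "model_tier": tier,
    }

record = assemble(
    "code", "textual",
    {"code_lang": "python", "subtype": "json", "risk": {"api_key": 0.1, "jwt": 0.9}},
    confidence=0.94, tier="tiny",
)
```

Here the `subtype` prediction is dropped because the coarse type is `code`, while the risk head is always reported.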

Size targets (Matryoshka tiering):

| Tier | Dim | Params | FP32 | INT8 | Target device |
|---|---|---|---|---|---|
| pico-type-tiny | 16 | ~0.5M | 2 MB | 0.5 MB | MCU / IoT |
| pico-type-small | 64 | ~1.5M | 6 MB | 1.5 MB | Browser (WASM), mobile |
| pico-type-base | 192 | ~3.5M | 14 MB | 3.5 MB | Desktop CLI, browser ext |
| pico-type-pro | 576 | ~8M | 32 MB | 8 MB | Server / accurate use |
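
The size columns follow from 4 bytes per FP32 weight and 1 byte per INT8 weight. The sketch below reproduces that arithmetic and also shows one reading of the tiering, namely that each tier's heads read the first `dim` coordinates of the pooled vector, as in standard Matryoshka representation learning. That reading, and the helper names `size_mb` and `tier_view`, are assumptions rather than something the plan spells out.

```python
TIERS = {  # name: (head input dim, approximate parameter count)
    "tiny": (16, 0.5e6),
    "small": (64, 1.5e6),
    "base": (192, 3.5e6),
    "pro": (576, 8e6),
}

def size_mb(params: float, bytes_per_weight: int) -> float:
    """Weight storage in MB (10^6 bytes), ignoring file-format overhead."""
    return params * bytes_per_weight / 1e6

def tier_view(pooled, tier):
    """Assumed Matryoshka view: the first `dim` coordinates of the pooled vector."""
    dim, _ = TIERS[tier]
    return pooled[:dim]

for name, (dim, params) in TIERS.items():
    print(f"{name:5s} dim={dim:3d} fp32={size_mb(params, 4):.1f} MB int8={size_mb(params, 1):.1f} MB")
```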

3. Training

Framework: Pure PyTorch (~600 LoC train script) with no transformers dependency, so the training code stays permissively licensed.

Distillation teachers (Polymorph-style per-head):

  • microsoft/deberta-v3-small (140M) → coarse + modality
  • huggingface/CodeBERTa-language-id (84M) → code_lang
  • papluca/xlm-roberta-base-language-detection (270M) → text_lang
  • mjbommar/magic-bert-50m-roformer-classification (42M) → file_mime
  • T=2.0, α=0.7 KD loss, plus hard-label CE on 30% of batch
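
One way to read the last bullet is the standard Hinton-style distillation objective: a temperature-softened KL term weighted by α and scaled by T², plus a (1 − α)-weighted cross-entropy term on the samples that carry a hard label. The plan does not state the exact combination, so the formula below is an assumption, written in plain Python for a single sample and a single head.

```python
import math

T, ALPHA = 2.0, 0.7  # temperature and KD weight from the list above

def softmax(logits, temp=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temp) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, hard_label=None):
    """alpha * T^2 * KL(teacher || student) at temperature T, plus optional hard-label CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    soft = ALPHA * T * T * kl
    if hard_label is None:  # the ~70% of the batch without a hard label
        return soft
    ce = -math.log(softmax(student_logits)[hard_label])
    return soft + (1 - ALPHA) * ce
```

Passing `hard_label` only for the 30% of samples that have one matches the "hard-label CE on 30% of batch" wording; the T² factor keeps the soft-target gradients on a comparable scale to the hard-label term.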

Data sources (all open / free):

  • Code: bigcode/the-stack-v2 (multilingual), cakiki/rosetta-code
  • Natural text: Wikipedia dumps, OSCAR, mC4 (subset)
  • Configs/markup: GitHub *.{json,yaml,toml,ini,xml,html,md,tex} via BigQuery GH archive
  • URLs: CommonCrawl WAT
  • Errors: StackOverflow / GH issue bodies (regex-filtered for traceback patterns)
  • File bytes: mjbommar/binary-tokenizer-001-64k corpora, plus synthetic magic-byte headers
  • Images: 1 KB header slices from OpenImages V7
  • Secrets: synthetic generation (regex + entropy filter), gptmail-secret-detection datasets

Synthetic data generator: a single Python script producing balanced mixtures of prose paragraphs, code snippets (multi-language, 64–1023 bytes each), config files, error traces, etc. Critical for head coverage.
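
A balanced mixture of this kind could be drawn as in the sketch below. It assumes a round-robin over buckets and applies the 64–1023 byte window to every bucket (the plan states it only for code snippets); `BUCKETS`, `clip_snippet` and `balanced_batch` are hypothetical names, and every bucket is assumed to contain at least one text of 64 bytes or more.

```python
import random

BUCKETS = ["prose", "code", "config", "error"]  # subset of the types named above
MIN_BYTES, MAX_BYTES = 64, 1023

def clip_snippet(text: str, rng: random.Random):
    """Return a random 64..1023-byte slice of the UTF-8 text, or None if it is too short."""
    raw = text.encode("utf-8")
    if len(raw) < MIN_BYTES:
        return None
    n = rng.randint(MIN_BYTES, min(MAX_BYTES, len(raw)))
    start = rng.randint(0, len(raw) - n)
    return raw[start:start + n]  # may cut mid-character, which a byte-level model tolerates

def balanced_batch(pools, batch_size, rng):
    """Round-robin over buckets so each contributes an equal share of the batch."""
    out = []
    while len(out) < batch_size:
        bucket = BUCKETS[len(out) % len(BUCKETS)]
        snippet = clip_snippet(rng.choice(pools[bucket]), rng)
        if snippet is not None:  # too-short texts are skipped and the bucket is retried
            out.append((bucket, snippet))
    return out

rng = random.Random(0)
pools = {b: [f"{b} sample line\n" * 100] for b in BUCKETS}
batch = balanced_batch(pools, 8, rng)
```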

Training recipe:

  • Sequence length 1024 bytes
  • AdamW, peak lr 3e-3, cosine, 5% warmup, 30 epochs
  • bf16, batch 128, grad clip 1.0
  • Task-balanced sampling: each head gets ≥1 sample per batch
  • Loss = Σ_head w_h · L_h with w tuned on val
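
The schedule and loss in this recipe can be sketched in a few lines. Linear warmup and decay to zero are assumptions (the recipe says only "cosine, 5% warmup"), and `lr_at` and `total_loss` are hypothetical helper names.

```python
import math

PEAK_LR, WARMUP_FRAC = 3e-3, 0.05  # values from the recipe above

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup over the first 5% of steps, then cosine decay to zero."""
    warmup = max(1, int(WARMUP_FRAC * total_steps))
    if step < warmup:
        return PEAK_LR * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

def total_loss(head_losses, weights):
    """Loss = sum over heads of w_h * L_h."""
    return sum(weights[h] * loss for h, loss in head_losses.items())
```

With 1,000 total steps, the rate climbs to 3e-3 by step 49, holds that peak at step 50, and reaches zero at step 1,000.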

Hardware: Single A100 80GB, ~36 h total training (3 seeds).

4. Data → Label Pipeline

| Bucket | Source | Auto-labels |
|---|---|---|
| Code | The Stack v2 | language field, repo paths |
| Prose | Wikipedia, mC4 | lang metadata |
| Config/markup | GH files | file ext + content sniff |
| URL | CommonCrawl WAT | regex for http(s):// |
| Error | SO, GH issues | regex for Traceback, Error:, at line |
| File bytes | binary-tokenizer corpora | magic-byte first-32B lookup |
| Image header | OpenImages | first 32B → format (PNG/JPEG/WebP/…) |
| Secret | synthetic | regex + entropy ≥ 4.5 |
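
Two of these auto-labelers are simple enough to sketch directly. The magic-byte table below covers only a handful of well-known signatures, and the entropy filter assumes Shannon entropy in bits per character over the token's own character distribution; the plan gives the 4.5 threshold but not the unit, and the regex half of the secret filter is omitted here. `sniff` and `looks_like_secret` are hypothetical names.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Entropy in bits per character of the string's empirical character distribution."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

MAGIC = [  # (signature prefix, MIME label); a small illustrative subset
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]

def sniff(head: bytes) -> str:
    """Label a file from its first 32 bytes, falling back to "undetected"."""
    head = head[:32]
    for magic, mime in MAGIC:
        if head.startswith(magic):
            return mime
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":  # WebP: RIFF container, size, then "WEBP"
        return "image/webp"
    return "undetected"

def looks_like_secret(token: str, threshold: float = 4.5) -> bool:
    return shannon_entropy(token) >= threshold
```

A token needs at least 23 distinct characters to reach 4.5 bits per character under this definition, so short dictionary passwords are rejected by the entropy test alone.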

5. Distribution & Popularity

5.1 HuggingFace deliverables

  • pico-type/pico-type-base: main model card with eval harness, ONNX int8/fp16, tract-onnx, GGUF
  • pico-type/pico-type-tiny … -pro: Matryoshka tiers
  • pico-type/picotype-space: Gradio Space (also a registered MCP server on HF Hub; free distribution via huggingface.co/mcp proxy)
  • pico-type/picotype-eval: public eval suite (reproducible JSON benchmarks)
  • Datasets: pico-type/synth-clipboard-v1, pico-type/eval-suite-v1

5.2 Software deliverables

  • CLI (picotype): Rust binary, ONNX runtime, sub-5ms inference. Reads stdin/file/clipboard, prints JSON.
  • MCP server (picotype-mcp): stdio + Streamable HTTP, exposes tools: classify, classify_batch, watch_clipboard (macOS/Windows), classify_history
  • Browser extension (Chrome/Firefox): replaces ClipGate's regex with pico-type; per-type icons; on-device, 100% local
  • Raycast extension + Alfred workflow
  • VSCode extension: paste-with-type, status bar shows type of selection
  • iOS Shortcut / Android Tasker plug-in

5.3 Launch timeline (4–6 weeks)

| Week | Milestone |
|---|---|
| 1 | Repo scaffold, byte-level pipeline working on synthetic data, all heads wired |
| 2 | Real data ingest, distillation teachers downloaded, base tier trained, eval harness |
| 3 | All four Matryoshka tiers trained, ONNX/tract/GGUF exported, ablations complete |
| 4 | CLI (Rust), MCP server (Python+Gradio), HF Space + MCP registration, model card |
| 5 | Browser extension (MV3), Raycast/VSCode extensions, docs site, demo video |
| 6 | arXiv preprint, Show HN, r/MachineLearning, r/LocalLLaMA, r/programming launches |

5.4 Growth hooks

  • First MCP-native content classifier: every Claude/Cursor/VSCode user is a potential user
  • Free-tier on HF = automatic inference endpoint exposure
  • Trending strategy: launch on a Tuesday, cross-post 4-5 channels same day, follow-up "vs ClipGate benchmarks" post
  • Maintain a "What's new in pico-type" weekly digest → builds audience
  • Open evaluation suite → community contributes new types (e.g., meme_url, arxiv_id, semantic_version) via a small "head-add" fine-tuning recipe
  • Roadmap V2: adding a user-contributed custom_types head (LoRA per type) à la Polymorph

6. Repo structure (proposed)

pico-type/
├── README.md
├── LICENSE                          # Apache-2.0
├── model/
│   ├── pico_type/
│   │   ├── arch.py                  # ByteHybrid trunk + heads
│   │   ├── train.py                 # multi-task trainer
│   │   ├── distill.py               # KD from per-head teachers
│   │   ├── data.py                  # synthetic generator + dataset
│   │   ├── export.py                # ONNX, int8, tract, gguf
│   │   └── eval.py                  # public eval harness
│   └── configs/                     # tier configs
├── crates/picotype/                 # Rust CLI
├── crates/picotype-mcp/             # Rust MCP server (or Python)
├── extensions/
│   ├── chrome/                      # MV3
│   ├── raycast/
│   ├── alfred/
│   └── vscode/
├── paper/                           # arXiv LaTeX
├── spaces/picotype/                 # Gradio + MCP
└── docs/                            # mintlify or mkdocs

7. Key risks & mitigations

| Risk | Mitigation |
|---|---|
| Multi-head gradients fight (task interference) | Gradient norm clipping per-head, gating + scheduled dropout |
| 30 langs underperform on low-resource | Reuse PleIAs/CommonLingua approach, weight by data |
| Byte-level convs slow on long inputs | Cap at 1024 bytes, document limit, Matryoshka tiers |
| MCP ecosystem churns | Support stdio + Streamable HTTP (latest spec), drop SSE |
| Competition clones us | First-mover + multi-channel launch + arXiv paper + community eval suite |
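
The first mitigation, per-head gradient norm clipping, can be pictured as clipping each head's gradient on the shared trunk separately before summing, so that no single head dominates an update. The sketch below uses plain lists in place of tensors; the 1.0 norm bound reuses the global grad-clip value from the training recipe, which is an assumption, and `clip_to_norm` and `combine_head_grads` are hypothetical names.

```python
import math

def clip_to_norm(grad, max_norm=1.0):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def combine_head_grads(per_head_grads, max_norm=1.0):
    """Clip each head's trunk gradient on its own, then sum coordinate-wise."""
    clipped = [clip_to_norm(g, max_norm) for g in per_head_grads.values()]
    return [sum(col) for col in zip(*clipped)]

# One head with a large gradient (norm 5) and one with a small gradient (norm 0.1).
update = combine_head_grads({"code_lang": [3.0, 4.0], "risk": [0.1, 0.0]})
```

In this example the large gradient is scaled down to unit norm while the small one passes through unchanged, so the quieter head still contributes to the shared update.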

8. Open decisions before I start coding

  1. Name lock: pico-type confirmed. Handle: pico-type on HF? Or picotype (no dash)? I suggest pico-type for HF and picotype for the CLI to match existing naming patterns.
  2. Tier naming: I propose tiny/small/base/pro matching Sentence-Transformers conventions.
  3. OSI license: Apache-2.0 (matches CommonLingua base, patent grant). Confirm.
  4. arXiv target: cs.CL (primary) + cs.LG. Co-authors: open question.
  5. Tagline (for the model card): "One tiny model, one forward pass, every clipboard."

Ready to execute when you give the green light. I'll start by scaffolding the repo (model/pico_type/arch.py + data.py + minimal train.py) and getting a single-head baseline running on synthetic data, proving the byte-level pipeline + multi-head design before we invest in full data ingest. Estimated time-to-first-baseline: ~2 hours.