GautamKishore committed
Commit df6bb25 · verified · 1 Parent(s): 8ebf8ff

Upload README.md with huggingface_hub

  1. README.md +71 -49
README.md CHANGED
@@ -9,8 +9,11 @@ tags:
9
  - onnx
10
  - edge-ai
11
  - matryoshka
 
 
 
12
  pipeline_tag: text-classification
13
- library_name: generic
14
  inference:
15
  parameters:
16
  provider: CPUExecutionProvider
@@ -18,17 +21,25 @@ inference:
18
 
19
  <div align="center">
20
 
 
 
21
  # pico-type 🔍
22
 
23
  **A tiny byte-level multi-head content classifier**: ~1.5M params, ~209KB ONNX, <6ms inference.
24
 
25
- Classifies any content into **7 categories** from raw bytes: coarse type, modality, subtype, code language, text language, file MIME, and risk flags.
26
 
27
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
28
  [![Python](https://img.shields.io/badge/python-3.11%2B-blue)]()
29
- [![ONNX](https://img.shields.io/badge/ONNX-exported-success)](checkpoints)
 
30
  [![HuggingFace Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/eulogik/pico-type)
 
31
  [![GitHub](https://img.shields.io/badge/GitHub-eulogik/pico--type-181717?logo=github)](https://github.com/eulogik/pico-type)
 
 
 
 
32
 
33
  </div>
34
 
@@ -36,26 +47,28 @@ Classifies any content into **7 categories** from raw bytes: coarse type, modali
36
 
37
  ## ✨ Features
38
 
39
- - **No tokenizer**: operates directly on raw UTF-8 bytes (supports all languages)
40
- - **7 heads, one forward pass**: coarse type, modality, subtype, code lang, text lang, file MIME, risk
41
  - **4 Matryoshka tiers**: tiny (16d) → small (64d) → base (192d) → pro (576d)
42
- - **~200KB ONNX**: deploy on edge devices, serverless, browser (WebAssembly)
43
- - **<12ms inference** on CPU via ONNX Runtime
44
- - **CLI, Gradio Space, MCP server**: ready to use
45
 
46
  ## 📊 Performance
47
 
48
- | Head | Classes | Accuracy |
49
- |------|---------|----------|
50
- | coarse | 12 | **100%** |
51
- | modality | 8 | **100%** |
52
- | subtype | 24 | **98.4%** |
53
- | code_lang | 62 | **53.9%** |
54
- | text_lang | 30 | **100%** |
55
- | file_mime | 90 | **100%** |
56
- | risk (mAP) | 6 | **100%** |
57
 
58
- _1000 evaluation samples, 9000 training steps (5000 synthetic + 4000 real-code fine-tune), base tier, ~5.6ms inference._
 
 
59
 
60
  ## 🚀 Quick Start
61
 
@@ -70,11 +83,11 @@ picotype --clip
70
 
71
  ### Python
72
  ```python
73
- from model.pico_type.cli import load_onnx_model, run_onnx
74
 
75
- session = load_onnx_model("base", "checkpoints")
76
- result = run_onnx(session, "def hello(): pass")
77
- print(result)
78
  ```
79
 
80
  ### MCP Server (Claude/Cursor)
@@ -88,52 +101,61 @@ PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
88
  Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
89
  ```
90
 
91
- - **ByteEmbed**: lookup-free byte embedding (256 vocab, 96 dim)
92
- - **Conv1D**: 3 parallel kernels (width 3, 5, 7) with residual + layer norm
93
- - **BiAttention**: bidirectional self-attention with RoPE, 4 heads, 96 dim
94
- - **Pool**: mean + max + std concatenation
95
- - **Matryoshka Heads**: 4 slices of the pooled vector (16/64/192/576 dim) β†’ 7 linear classifiers
 
 
96
 
97
- Total parameters: **1.43M** (tiny) / **1.45M** (small) / **1.48M** (base) / **1.56M** (pro)
98
 
99
  ## 🔧 Model Tiers
100
 
101
- | Tier | Dim | Params | ONNX Size |
102
- |------|-----|--------|-----------|
103
- | tiny | 16 | 1.43M | 207 KB |
104
- | small | 64 | 1.45M | 207 KB |
105
- | base | 192 | 1.48M | 209 KB |
106
- | pro | 576 | 1.56M | 206 KB |
107
 
108
- All tiers share the same trunk; only the final linear layers differ.
109
 
110
  ## 🧪 Classification Heads
111
 
112
- | Head | Classes | Examples |
113
- |------|---------|----------|
114
- | **coarse** | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
115
- | **modality** | 8 | textual, binary_image, binary_archive, binary_executable, etc. |
116
- | **subtype** | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile, etc. |
117
- | **code_lang** | 62 | python, javascript, typescript, java, c, cpp, go, rust, etc. |
118
- | **text_lang** | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi, etc. |
119
- | **file_mime** | 90 | text/html, application/json, application/pdf, image/png, video/mp4, etc. |
120
- | **risk** | 6 | api_key, jwt, password, email, phone, ssh_key |
121
 
122
  ## 🌐 Deployment
123
 
124
- | Platform | Location |
125
- |----------|----------|
126
  | HuggingFace Space | [eulogik/pico-type](https://huggingface.co/spaces/eulogik/pico-type) |
127
  | HuggingFace Model | [eulogik/pico-type](https://huggingface.co/eulogik/pico-type) |
128
  | GitHub | [eulogik/pico-type](https://github.com/eulogik/pico-type) |
129
  | PyPI | `pip install picotype` |
 
130
 
131
  ## 📚 Documentation
132
 
133
- - [Model Card](MODEL_CARD.md): detailed architecture, training, and evaluation
134
  - [Architecture Plan](docs/PLAN.md): full design document
135
- - [Walkthrough](walkthrough.md): development log
136
 
137
  ## 📄 License
138
 
139
- Apache 2.0
 
 
 
 
 
 
 
9
  - onnx
10
  - edge-ai
11
  - matryoshka
12
+ - multi-head
13
+ - classifier
14
+ - clipboard
15
  pipeline_tag: text-classification
16
+ library_name: pico-type
17
  inference:
18
  parameters:
19
  provider: CPUExecutionProvider
 
21
 
22
  <div align="center">
23
 
24
+ <img src="https://raw.githubusercontent.com/eulogik/pico-type/main/docs/logo.png" alt="pico-type" height="64" />
25
+
26
  # pico-type 🔍
27
 
28
  **A tiny byte-level multi-head content classifier**: ~1.5M params, ~209KB ONNX, <6ms inference.
29
 
30
+ _Classifies any content into **7 categories** from raw bytes in a single forward pass._
31
 
32
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
33
  [![Python](https://img.shields.io/badge/python-3.11%2B-blue)]()
34
+ [![ONNX](https://img.shields.io/badge/ONNX-exported-success)](https://huggingface.co/eulogik/pico-type)
35
+ [![PyPI](https://img.shields.io/pypi/v/pico-type?color=blue)](https://pypi.org/project/pico-type/)
36
  [![HuggingFace Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/eulogik/pico-type)
37
+ [![HuggingFace Model](https://img.shields.io/badge/HuggingFace-Model-orange)](https://huggingface.co/eulogik/pico-type)
38
  [![GitHub](https://img.shields.io/badge/GitHub-eulogik/pico--type-181717?logo=github)](https://github.com/eulogik/pico-type)
39
+ [![CI](https://img.shields.io/github/actions/workflow/status/eulogik/pico-type/ci.yml?logo=githubactions&label=CI)](https://github.com/eulogik/pico-type/actions)
40
+ [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20758542.svg)](https://doi.org/10.5281/zenodo.20758542)
41
+
42
+ _Built by [**eulogik**](https://eulogik.com), AI infrastructure for developers._
43
 
44
  </div>
45
 
 
47
 
48
  ## ✨ Features
49
 
50
+ - **No tokenizer**: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
51
+ - **7 heads, one forward pass**: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
52
  - **4 Matryoshka tiers**: tiny (16d) → small (64d) → base (192d) → pro (576d)
53
+ - **~200KB ONNX**: deploy on edge devices, serverless functions, browser (WebAssembly)
54
+ - **<6ms inference** on CPU via ONNX Runtime (base tier, 1024 bytes)
55
+ - **CLI, Gradio Space, MCP server**: ready for any integration
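The "no tokenizer" claim above can be illustrated with a minimal sketch of byte-level input preparation. This is not the package's own preprocessing: the function name `to_byte_ids`, the padding value `0`, and the 1024-byte window (borrowed from the benchmark note) are all assumptions made for illustration.

```python
MAX_LEN = 1024  # assumed input window, taken from the "1024 bytes" benchmark note
PAD = 0         # assumed padding value; the real model may pad differently


def to_byte_ids(content, max_len=MAX_LEN):
    """Return (ids, mask): byte values 0-255 padded to max_len, plus a 1/0 validity mask."""
    raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
    raw = raw[:max_len]  # truncate long inputs to the window
    pad = max_len - len(raw)
    ids = list(raw) + [PAD] * pad
    mask = [1] * len(raw) + [0] * pad
    return ids, mask


ids, mask = to_byte_ids("héllo")
print(ids[:6], sum(mask))  # [104, 195, 169, 108, 108, 111] 6
```

Because the input is just bytes, non-ASCII text needs no vocabulary: the `é` above simply becomes two byte values.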
56
 
57
  ## 📊 Performance
58
 
59
+ | Head | Classes | Synthetic Accuracy | Real-World Accuracy |
60
+ |------|---------|-------------------|---------------------|
61
+ | coarse | 12 | **100%** | **86%** |
62
+ | modality | 8 | **100%** | **100%** |
63
+ | subtype | 24 | **95%** | n/a |
64
+ | code_lang | 62 | **39%** | n/a |
65
+ | text_lang | 30 | **99%** | **100%** |
66
+ | file_mime | 90 | **100%** | n/a |
67
+ | risk (mAP) | 6 | **100%** | n/a |
68
 
69
+ _Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference._
70
+
71
+ > Note: the low code_lang synthetic accuracy reflects the difficulty of 62-way classification with limited per-class support. Real-world accuracy across all heads is **52%** (11/21 correct), up from a **23%** baseline before diverse training.
72
 
73
  ## 🚀 Quick Start
74
 
 
83
 
84
  ### Python
85
  ```python
86
+ from picotype import PicoType, PicoTypeConfig, decode_output
87
 
88
+ model = PicoType(PicoTypeConfig()).eval()
89
+ # ... load checkpoint ...
90
+ result = decode_output(model(b"input bytes"), tier="base")
91
  ```
92
 
93
  ### MCP Server (Claude/Cursor)
 
101
  Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean‖max‖std) → 7×Matryoshka Heads
102
  ```
103
 
104
+ | Component | Description |
105
+ |-----------|-------------|
106
+ | **ByteEmbed** | `nn.Embedding(256, 96)`: tokenizer-free byte embedding (one 96-d vector per byte value) |
107
+ | **Conv1D** | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
108
+ | **BiAttention** | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
109
+ | **Pool** | Mean + Max + Std concatenation over masked positions |
110
+ | **Matryoshka Heads** | 4 tier slices of the pooled vector β†’ 7 linear classifiers |
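The Pool row can be made concrete with a small pure-Python sketch of masked mean, max, and std pooling. It is an illustration only: the function name `masked_pool` is hypothetical, and the use of the population standard deviation is an assumption, since the README does not say which estimator the model uses.

```python
import math


def masked_pool(features, mask):
    """Concatenate per-channel mean, max and std over positions where mask == 1."""
    valid = [row for row, keep in zip(features, mask) if keep]
    if not valid:
        raise ValueError("at least one unmasked position is required")
    n, dim = len(valid), len(valid[0])
    mean = [sum(row[c] for row in valid) / n for c in range(dim)]
    peak = [max(row[c] for row in valid) for c in range(dim)]
    # Population standard deviation (assumed; the real model may differ).
    std = [math.sqrt(sum((row[c] - mean[c]) ** 2 for row in valid) / n) for c in range(dim)]
    return mean + peak + std


features = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]  # last row stands in for padding
pooled = masked_pool(features, mask=[1, 1, 0])
print(pooled)  # [2.0, 4.0, 3.0, 6.0, 1.0, 2.0]
```

The padded row is ignored entirely, so the pooled vector depends only on real input bytes and is three times as wide as the per-position features.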
111
 
112
+ **Total parameters**: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
113
 
114
  ## 🔧 Model Tiers
115
 
116
+ | Tier | Dim | Params | ONNX Size | Speed |
117
+ |------|-----|--------|-----------|-------|
118
+ | tiny | 16 | 1.43M | 207 KB | ~3ms |
119
+ | small | 64 | 1.45M | 207 KB | ~4ms |
120
+ | base | 192 | 1.48M | 209 KB | ~5ms |
121
+ | pro | 576 | 1.56M | 206 KB | ~12ms |
122
 
123
+ All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
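One way to read the tier description is that each tier scores classes from a prefix of the same pooled vector, using its own small linear layer. The sketch below follows that reading; the function `tier_logits` and the `heads` layout are hypothetical and are not taken from the package.

```python
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}  # from the tier table


def tier_logits(pooled, heads, tier):
    """Score classes with the tier's own linear layer over a prefix of the pooled vector."""
    dim = TIER_DIMS[tier]
    if len(pooled) < dim:
        raise ValueError(f"pooled vector has {len(pooled)} features, tier needs {dim}")
    weights, bias = heads[tier]  # assumed layout: one (weights, bias) pair per tier
    prefix = pooled[:dim]        # the trunk output is shared; only the slice width changes
    return [sum(w * x for w, x in zip(row, prefix)) + b for row, b in zip(weights, bias)]


pooled = [1.0] * 576
heads = {
    "tiny": ([[0.5] * 16, [-0.5] * 16], [0.0, 1.0]),
    "small": ([[0.25] * 64, [0.0] * 64], [0.0, 0.0]),
}
print(tier_logits(pooled, heads, "tiny"))   # [8.0, -7.0]
print(tier_logits(pooled, heads, "small"))  # [16.0, 0.0]
```

Under this assumption, switching tiers means choosing a different slice and classifier while the trunk computation is reused unchanged.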
124
 
125
  ## 🧪 Classification Heads
126
 
127
+ | Head | Classes | Gated By | Examples |
128
+ |------|---------|----------|----------|
129
+ | **coarse** | 12 | none | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
130
+ | **modality** | 8 | none | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
131
+ | **subtype** | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
132
+ | **code_lang** | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
133
+ | **text_lang** | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
134
+ | **file_mime** | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
135
+ | **risk** | 6 | none | api_key, jwt, password, email, phone, ssh_key (probabilities) |
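The "Gated By" column suggests that some heads are only meaningful for certain coarse types. A minimal sketch of that idea, assuming gating is applied as a filter on an already decoded prediction (the README does not say how the package implements it, and `apply_gates` is a hypothetical name):

```python
# Gate sets copied from the "Gated By" column; ungated heads are always kept.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def apply_gates(prediction):
    """Drop gated heads whose gate does not include the predicted coarse type."""
    coarse = prediction["coarse"]
    return {
        head: value
        for head, value in prediction.items()
        if head not in GATES or coarse in GATES[head]
    }


raw = {
    "coarse": "code",
    "modality": "textual",
    "subtype": "json",
    "code_lang": "python",
    "text_lang": "en",
    "file_mime": "text/html",
    "risk": {"api_key": 0.02},
}
print(sorted(apply_gates(raw)))  # ['coarse', 'code_lang', 'modality', 'risk']
```

For an input classified as `code`, only `code_lang` survives among the gated heads; the others are dropped rather than reported as noise.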
136
 
137
  ## 🌐 Deployment
138
 
139
+ | Platform | URL |
140
+ |----------|-----|
141
  | HuggingFace Space | [eulogik/pico-type](https://huggingface.co/spaces/eulogik/pico-type) |
142
  | HuggingFace Model | [eulogik/pico-type](https://huggingface.co/eulogik/pico-type) |
143
  | GitHub | [eulogik/pico-type](https://github.com/eulogik/pico-type) |
144
  | PyPI | `pip install picotype` |
145
+ | Zenodo | [10.5281/zenodo.20758542](https://doi.org/10.5281/zenodo.20758542) |
146
 
147
  ## 📚 Documentation
148
 
149
+ - [Model Card](MODEL_CARD.md): detailed architecture, training, evaluation
150
  - [Architecture Plan](docs/PLAN.md): full design document
151
+ - [Walkthrough](walkthrough.md): development log with all decisions
152
 
153
  ## 📄 License
154
 
155
+ Apache 2.0: free for commercial and personal use.
156
+
157
+ ---
158
+
159
+ <div align="center">
160
+ <sub>Built with ❤️ by <a href="https://eulogik.com">eulogik</a></sub>
161
+ </div>