Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -9,8 +9,11 @@ tags:
|
|
| 9 |
- onnx
|
| 10 |
- edge-ai
|
| 11 |
- matryoshka
|
|
|
|
|
|
|
|
|
|
| 12 |
pipeline_tag: text-classification
|
| 13 |
-
library_name:
|
| 14 |
inference:
|
| 15 |
parameters:
|
| 16 |
provider: CPUExecutionProvider
|
|
@@ -18,17 +21,25 @@ inference:
|
|
| 18 |
|
| 19 |
<div align="center">
|
| 20 |
|
|
|
|
|
|
|
| 21 |
# pico-type
|
| 22 |
|
| 23 |
**A tiny byte-level multi-head content classifier**: ~1.5M params, ~209KB ONNX, <6ms inference.
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
[](LICENSE)
|
| 28 |
[]()
|
| 29 |
-
[](
|
|
|
|
| 30 |
[](https://huggingface.co/spaces/eulogik/pico-type)
|
|
|
|
| 31 |
[](https://github.com/eulogik/pico-type)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
</div>
|
| 34 |
|
|
@@ -36,26 +47,28 @@ Classifies any content into **7 categories** from raw bytes: coarse type, modali
|
|
| 36 |
|
| 37 |
## ✨ Features
|
| 38 |
|
| 39 |
-
- **No tokenizer**: operates directly on raw UTF-8 bytes (supports all languages)
|
| 40 |
-
- **7 heads, one forward pass**: coarse type, modality, subtype, code lang, text lang, file MIME, risk
|
| 41 |
- **4 Matryoshka tiers**: tiny (16d), small (64d), base (192d), pro (576d)
|
| 42 |
-
- **~200KB ONNX**: deploy on edge devices, serverless, browser (WebAssembly)
|
| 43 |
-
- **<
|
| 44 |
-
- **CLI, Gradio Space, MCP server**: ready
|
| 45 |
|
| 46 |
## Performance
|
| 47 |
|
| 48 |
-
| Head | Classes | Accuracy |
|
| 49 |
-
|------|---------|----------|
|
| 50 |
-
| coarse | 12 | **100%** |
|
| 51 |
-
| modality | 8 | **100%** |
|
| 52 |
-
| subtype | 24 | **
|
| 53 |
-
| code_lang | 62 | **
|
| 54 |
-
| text_lang | 30 | **100%** |
|
| 55 |
-
| file_mime | 90 | **100%** |
|
| 56 |
-
| risk (mAP) | 6 | **100%** |
|
| 57 |
|
| 58 |
-
|
|
|
|
|
|
|
| 59 |
|
| 60 |
## Quick Start
|
| 61 |
|
|
@@ -70,11 +83,11 @@ picotype --clip
|
|
| 70 |
|
| 71 |
### Python
|
| 72 |
```python
|
| 73 |
-
from
|
| 74 |
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
```
|
| 79 |
|
| 80 |
### MCP Server (Claude/Cursor)
|
|
@@ -88,52 +101,61 @@ PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
|
|
| 88 |
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean+max+std) → 7×Matryoshka Heads
|
| 89 |
```
|
| 90 |
|
| 91 |
-
|
| 92 |
-
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
|
|
|
|
|
|
| 96 |
|
| 97 |
-
Total parameters
|
| 98 |
|
| 99 |
## Model Tiers
|
| 100 |
|
| 101 |
-
| Tier | Dim | Params | ONNX Size |
|
| 102 |
-
|------|-----|--------|-----------|
|
| 103 |
-
| tiny | 16 | 1.43M | 207 KB |
|
| 104 |
-
| small | 64 | 1.45M | 207 KB |
|
| 105 |
-
| base | 192 | 1.48M | 209 KB |
|
| 106 |
-
| pro | 576 | 1.56M | 206 KB |
|
| 107 |
|
| 108 |
-
All tiers share the same trunk; only the final linear layers differ.
|
| 109 |
|
| 110 |
## 🧪 Classification Heads
|
| 111 |
|
| 112 |
-
| Head | Classes | Examples |
|
| 113 |
-
|------|---------|----------|
|
| 114 |
-
| **coarse** | 12 | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
|
| 115 |
-
| **modality** | 8 | textual, binary_image, binary_archive, binary_executable,
|
| 116 |
-
| **subtype** | 24 | json, yaml, toml, csv, html, markdown, sql, log, dockerfile
|
| 117 |
-
| **code_lang** | 62 | python, javascript, typescript, java, c, cpp, go, rust,
|
| 118 |
-
| **text_lang** | 30 | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi
|
| 119 |
-
| **file_mime** | 90 | text/html, application/json, application/pdf, image/png, video/mp4
|
| 120 |
-
| **risk** | 6 | api_key, jwt, password, email, phone, ssh_key |
|
| 121 |
|
| 122 |
## Deployment
|
| 123 |
|
| 124 |
-
| Platform |
|
| 125 |
-
|----------|-----
|
| 126 |
| HuggingFace Space | [eulogik/pico-type](https://huggingface.co/spaces/eulogik/pico-type) |
|
| 127 |
| HuggingFace Model | [eulogik/pico-type](https://huggingface.co/eulogik/pico-type) |
|
| 128 |
| GitHub | [eulogik/pico-type](https://github.com/eulogik/pico-type) |
|
| 129 |
| PyPI | `pip install picotype` |
|
|
|
|
| 130 |
|
| 131 |
## Documentation
|
| 132 |
|
| 133 |
-
- [Model Card](MODEL_CARD.md): detailed architecture, training,
|
| 134 |
- [Architecture Plan](docs/PLAN.md): full design document
|
| 135 |
-
- [Walkthrough](walkthrough.md): development log
|
| 136 |
|
| 137 |
## License
|
| 138 |
|
| 139 |
-
Apache 2.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
- onnx
|
| 10 |
- edge-ai
|
| 11 |
- matryoshka
|
| 12 |
+
- multi-head
|
| 13 |
+
- classifier
|
| 14 |
+
- clipboard
|
| 15 |
pipeline_tag: text-classification
|
| 16 |
+
library_name: pico-type
|
| 17 |
inference:
|
| 18 |
parameters:
|
| 19 |
provider: CPUExecutionProvider
|
|
|
|
| 21 |
|
| 22 |
<div align="center">
|
| 23 |
|
| 24 |
+
<img src="https://raw.githubusercontent.com/eulogik/pico-type/main/docs/logo.png" alt="pico-type" height="64" />
|
| 25 |
+
|
| 26 |
# pico-type
|
| 27 |
|
| 28 |
**A tiny byte-level multi-head content classifier**: ~1.5M params, ~209KB ONNX, <6ms inference.
|
| 29 |
|
| 30 |
+
_Classifies any content into **7 categories** from raw bytes in a single forward pass._
|
| 31 |
|
| 32 |
[](LICENSE)
|
| 33 |
[]()
|
| 34 |
+
[](https://huggingface.co/eulogik/pico-type)
|
| 35 |
+
[](https://pypi.org/project/pico-type/)
|
| 36 |
[](https://huggingface.co/spaces/eulogik/pico-type)
|
| 37 |
+
[](https://huggingface.co/eulogik/pico-type)
|
| 38 |
[](https://github.com/eulogik/pico-type)
|
| 39 |
+
[](https://github.com/eulogik/pico-type/actions)
|
| 40 |
+
[](https://doi.org/10.5281/zenodo.20758542)
|
| 41 |
+
|
| 42 |
+
_Built by [**eulogik**](https://eulogik.com), AI infrastructure for developers._
|
| 43 |
|
| 44 |
</div>
|
| 45 |
|
|
|
|
| 47 |
|
| 48 |
## ✨ Features
|
| 49 |
|
| 50 |
+
- **No tokenizer**: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
|
| 51 |
+
- **7 heads, one forward pass**: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
|
| 52 |
- **4 Matryoshka tiers**: tiny (16d), small (64d), base (192d), pro (576d)
|
| 53 |
+
- **~200KB ONNX**: deploy on edge devices, serverless functions, browser (WebAssembly)
|
| 54 |
+
- **<6ms inference** on CPU via ONNX Runtime (base tier, 1024 bytes)
|
| 55 |
+
- **CLI, Gradio Space, MCP server**: ready for any integration
|
| 56 |
|
| 57 |
## Performance
|
| 58 |
|
| 59 |
+
| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|
| 60 |
+
|------|---------|-------------------|---------------------|
|
| 61 |
+
| coarse | 12 | **100%** | **86%** |
|
| 62 |
+
| modality | 8 | **100%** | **100%** |
|
| 63 |
+
| subtype | 24 | **95%** | - |
|
| 64 |
+
| code_lang | 62 | **39%** | - |
|
| 65 |
+
| text_lang | 30 | **99%** | **100%** |
|
| 66 |
+
| file_mime | 90 | **100%** | - |
|
| 67 |
+
| risk (mAP) | 6 | **100%** | - |
|
| 68 |
|
| 69 |
+
_Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference._
|
| 70 |
+
|
| 71 |
+
> Note: code_lang synthetic accuracy reflects the challenge of 62-way classification with limited per-class support. Real-world accuracy across all heads is **52%** (11/21 correct), up from **23%** baseline before diverse training.
|
| 72 |
|
| 73 |
## Quick Start
|
| 74 |
|
|
|
|
| 83 |
|
| 84 |
### Python
|
| 85 |
```python
|
| 86 |
+
from picotype import PicoType, PicoTypeConfig, decode_output
|
| 87 |
|
| 88 |
+
model = PicoType(PicoTypeConfig()).eval()
|
| 89 |
+
# ... load checkpoint ...
|
| 90 |
+
result = decode_output(model(b"input bytes"), tier="base")
|
| 91 |
```
|
| 92 |
|
| 93 |
### MCP Server (Claude/Cursor)
|
|
|
|
| 101 |
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean+max+std) → 7×Matryoshka Heads
|
| 102 |
```
|
| 103 |
|
| 104 |
+
| Component | Description |
|
| 105 |
+
|-----------|-------------|
|
| 106 |
+
| **ByteEmbed** | `nn.Embedding(256, 96)`: lookup-free byte embedding |
|
| 107 |
+
| **Conv1D** | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
|
| 108 |
+
| **BiAttention** | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
|
| 109 |
+
| **Pool** | Mean + Max + Std concatenation over masked positions |
|
| 110 |
+
| **Matryoshka Heads** | 4 tier slices of the pooled vector → 7 linear classifiers |
|
| 111 |
|
| 112 |
+
**Total parameters**: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
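
The Pool row above can be read as a masked reduction followed by concatenation. The snippet below is a minimal pure-Python sketch of that idea, not the project's code: the function name `masked_pool` is hypothetical, and the use of the population standard deviation is an assumption, since the source only says "Std".

```python
import math


def masked_pool(features, mask):
    """Concatenate mean, max, and std over the positions where mask is True.

    features: list of per-position vectors (lists of floats, equal length).
    mask: list of bools, True for real bytes and False for padding.
    """
    kept = [vec for vec, keep in zip(features, mask) if keep]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    n = len(kept)
    mean = [sum(v[d] for v in kept) / n for d in range(dim)]
    peak = [max(v[d] for v in kept) for d in range(dim)]
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = [
        math.sqrt(sum((v[d] - mean[d]) ** 2 for v in kept) / n)
        for d in range(dim)
    ]
    # The pooled vector is three times as wide as one position's features.
    return mean + peak + std


pooled = masked_pool(
    [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]],
    [True, True, False],  # the last position is padding and is ignored
)
```

Because padding is masked out before the reduction, short inputs are not skewed by the padded tail.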
|
| 113 |
|
| 114 |
## Model Tiers
|
| 115 |
|
| 116 |
+
| Tier | Dim | Params | ONNX Size | Speed |
|
| 117 |
+
|------|-----|--------|-----------|-------|
|
| 118 |
+
| tiny | 16 | 1.43M | 207 KB | ~3ms |
|
| 119 |
+
| small | 64 | 1.45M | 207 KB | ~4ms |
|
| 120 |
+
| base | 192 | 1.48M | 209 KB | ~5ms |
|
| 121 |
+
| pro | 576 | 1.56M | 206 KB | ~12ms |
|
| 122 |
|
| 123 |
+
All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
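
One way to picture tier switching is sketched below. It assumes the usual Matryoshka convention that a tier uses the first `dim` entries of the pooled vector; the README does not state how the slices are taken, and `TIER_DIMS`, `tier_slice`, and `linear_logits` are hypothetical names, not part of the `picotype` API.

```python
# Tier widths as listed in the Model Tiers table.
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_slice(pooled, tier):
    """Return the prefix of the pooled vector used by the given tier.

    Assumption: tiers are nested prefixes (Matryoshka-style slicing).
    """
    dim = TIER_DIMS[tier]
    if len(pooled) < dim:
        raise ValueError(f"pooled vector has {len(pooled)} dims, need {dim}")
    return pooled[:dim]


def linear_logits(x, weight, bias):
    """Plain linear layer: one logit per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


pooled = [float(i) for i in range(576)]
tiny_features = tier_slice(pooled, "tiny")
base_features = tier_slice(pooled, "base")
```

Under this assumption the trunk runs once and only the slice width and the matching linear layer change per tier, which is consistent with the parameter counts differing only slightly between tiers.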
|
| 124 |
|
| 125 |
## 🧪 Classification Heads
|
| 126 |
|
| 127 |
+
| Head | Classes | Gated By | Examples |
|
| 128 |
+
|------|---------|----------|----------|
|
| 129 |
+
| **coarse** | 12 | - | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
|
| 130 |
+
| **modality** | 8 | - | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
|
| 131 |
+
| **subtype** | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
|
| 132 |
+
| **code_lang** | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
|
| 133 |
+
| **text_lang** | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
|
| 134 |
+
| **file_mime** | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
|
| 135 |
+
| **risk** | 6 | - | api_key, jwt, password, email, phone, ssh_key (probabilities) |
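
The "Gated By" column can be read as a lookup from the predicted coarse label to the heads worth reporting. The sketch below encodes the table that way. It assumes gating means "only report this head when the coarse label is in the listed set", which the README does not spell out, and `active_heads` is a hypothetical helper rather than a `picotype` function.

```python
# Gating sets copied from the Classification Heads table.
GATED_BY = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}
# Heads with no gate are always reported.
UNGATED = ("coarse", "modality", "risk")


def active_heads(coarse_label):
    """List the heads whose output is meaningful for a given coarse label."""
    heads = list(UNGATED)
    heads += [head for head, allowed in GATED_BY.items() if coarse_label in allowed]
    return heads


code_heads = active_heads("code")
link_heads = active_heads("link")
```

For example, a snippet classified as `code` would report `code_lang` but skip `text_lang`, while a `link` would report only the three ungated heads.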
|
| 136 |
|
| 137 |
## Deployment
|
| 138 |
|
| 139 |
+
| Platform | URL |
|
| 140 |
+
|----------|-----|
|
| 141 |
| HuggingFace Space | [eulogik/pico-type](https://huggingface.co/spaces/eulogik/pico-type) |
|
| 142 |
| HuggingFace Model | [eulogik/pico-type](https://huggingface.co/eulogik/pico-type) |
|
| 143 |
| GitHub | [eulogik/pico-type](https://github.com/eulogik/pico-type) |
|
| 144 |
| PyPI | `pip install picotype` |
|
| 145 |
+
| Zenodo | [10.5281/zenodo.20758542](https://doi.org/10.5281/zenodo.20758542) |
|
| 146 |
|
| 147 |
## Documentation
|
| 148 |
|
| 149 |
+
- [Model Card](MODEL_CARD.md): detailed architecture, training, evaluation
|
| 150 |
- [Architecture Plan](docs/PLAN.md): full design document
|
| 151 |
+
- [Walkthrough](walkthrough.md): development log with all decisions
|
| 152 |
|
| 153 |
## License
|
| 154 |
|
| 155 |
+
Apache 2.0, free for commercial and personal use.
|
| 156 |
+
|
| 157 |
+
---
|
| 158 |
+
|
| 159 |
+
<div align="center">
|
| 160 |
+
<sub>Built with ❤️ by <a href="https://eulogik.com">eulogik</a></sub>
|
| 161 |
+
</div>
|