---
license: mit
language:
- en
tags:
- text-classification
- onnx
- bert
- torrent
- content-classification
base_model: prajjwal1/bert-tiny
pipeline_tag: text-classification
---
# BERT Torrent Classifier
A fine-tuned BERT-tiny model for classifying torrent content into media types.
## Model Details
- **Base model:** [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny)
- **Task:** Multi-class text classification
- **Labels:** audio, video, software, book, other
- **Format:** ONNX (with embedded weights)
- **Size:** ~17 MB
## Training
- **Training data:** ~10k torrent names, labeled by consensus voting across four LLMs
- **LLM ensemble:** qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
- **Consensus rules:** all four agree = high-confidence label, 3-1 split = majority label, 2-2 split = sample discarded
- **Accuracy:** ~92% on held-out test set
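
The consensus rules above can be sketched as a small voting function. This is an illustrative sketch, not the actual labeling pipeline: the `Consensus` type and `consensus` function are hypothetical names, and the handling of splits the rules do not mention (for example 2-1-1) is an assumption, treated here as discarded.

```rust
use std::collections::HashMap;

/// Outcome of combining four LLM votes for one torrent name (hypothetical type).
#[derive(Debug, PartialEq)]
enum Consensus {
    /// All four models agree.
    HighConfidence(&'static str),
    /// Three models agree, one dissents.
    Majority(&'static str),
    /// No label reaches three votes, so the sample is dropped.
    /// Assumption: splits such as 2-1-1 are treated like 2-2.
    Discarded,
}

/// Counts the votes and applies the 4 / 3-1 / 2-2 rules.
fn consensus(votes: [&'static str; 4]) -> Consensus {
    let mut counts: HashMap<&'static str, usize> = HashMap::new();
    for vote in votes {
        *counts.entry(vote).or_insert(0) += 1;
    }
    // With four votes, at most one label can reach three or more.
    match counts.into_iter().max_by_key(|&(_, n)| n) {
        Some((label, 4)) => Consensus::HighConfidence(label),
        Some((label, 3)) => Consensus::Majority(label),
        _ => Consensus::Discarded,
    }
}
```

For example, `consensus(["video", "video", "video", "book"])` evaluates to `Consensus::Majority("video")`, while a `["audio", "audio", "video", "video"]` split evaluates to `Consensus::Discarded`.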
## Usage
This model is designed for use with [mimmo](https://github.com/lelloman/mimmo), a Rust library for torrent content classification. The ONNX model is embedded directly in the binary at compile time.
```rust
// The model files are downloaded automatically during the build and embedded at compile time
const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");
```
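
Once the model has produced its output, the scores still need to be mapped to one of the five labels. The sketch below shows one way to do that, assuming the ONNX graph returns one raw logit per class. The `LABELS` order is an assumption (the authoritative mapping is whatever `config.json` defines), and `softmax` and `top_label` are hypothetical helpers, not part of mimmo's API.

```rust
/// Assumed label order; check `config.json` for the real index-to-label mapping.
const LABELS: [&str; 5] = ["audio", "video", "software", "book", "other"];

/// Numerically stable softmax: subtract the maximum before exponentiating.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Returns the most likely label together with its probability.
fn top_label(logits: &[f32; 5]) -> (&'static str, f32) {
    let probs = softmax(logits);
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .expect("logits array is never empty");
    (LABELS[idx], p)
}
```

Returning the probability alongside the label lets a caller apply its own confidence threshold before trusting the prediction.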
## Performance
- Inference: <10 ms per sample (CPU)
- Used as ML fallback when pattern matching is inconclusive
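
The fallback arrangement described above might look roughly like the following. This is a minimal sketch under stated assumptions: `classify_by_pattern` and `classify` are hypothetical names, the extension table is invented for illustration, and mimmo's real pattern matching is likely more involved.

```rust
/// Hypothetical pattern stage: decide from the file extension, or give up.
fn classify_by_pattern(name: &str) -> Option<&'static str> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "mp3" | "flac" => Some("audio"),
        "mkv" | "mp4" => Some("video"),
        "epub" | "pdf" => Some("book"),
        "exe" | "msi" => Some("software"),
        _ => None, // inconclusive: defer to the model
    }
}

/// Tries patterns first and calls the ML model only when they are inconclusive.
/// `ml_fallback` stands in for the BERT classifier.
fn classify<F>(name: &str, ml_fallback: F) -> &'static str
where
    F: FnOnce(&str) -> &'static str,
{
    classify_by_pattern(name).unwrap_or_else(|| ml_fallback(name))
}
```

Because the closure runs only when the pattern stage returns `None`, names with an unambiguous extension never pay the inference cost.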
## Files
- `model_embedded.onnx` - ONNX model with embedded weights
- `tokenizer.json` - HuggingFace tokenizer
- `vocab.txt` - Vocabulary file
- `config.json` - Model configuration
## License
MIT