---
license: mit
language:
- en
tags:
- text-classification
- onnx
- bert
- torrent
- content-classification
base_model: prajjwal1/bert-tiny
pipeline_tag: text-classification
---

# BERT Torrent Classifier

A fine-tuned BERT-tiny model for classifying torrent content into media types.

## Model Details

- **Base model:** [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny)
- **Task:** Multi-class text classification
- **Labels:** audio, video, software, book, other
- **Format:** ONNX (with embedded weights)
- **Size:** ~17 MB

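As a rough illustration of the multi-class output, the sketch below turns a vector of five raw logits into a label and a probability using softmax and argmax. The order of `LABELS` is an assumption taken from the list above; the real index-to-label mapping should be read from `config.json` (for example an `id2label` entry, if it defines one). The `softmax` and `decode` helpers are hypothetical names, not part of mimmo.

```rust
/// Label names from the model card. The index order is an ASSUMPTION;
/// verify it against the mapping stored in `config.json`.
const LABELS: [&str; 5] = ["audio", "video", "software", "book", "other"];

/// Numerically stable softmax over raw logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Returns the most probable label and its probability, or `None` when the
/// slice does not hold exactly one logit per label.
fn decode(logits: &[f32]) -> Option<(&'static str, f32)> {
    if logits.len() != LABELS.len() {
        return None;
    }
    let probs = softmax(logits);
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    Some((LABELS[idx], p))
}
```

Calling `decode(&logits)` on the model's output row would yield a pair such as a label name and its probability, which a caller could compare against a confidence threshold of its own choosing.
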
## Training

- **Training data:** ~10k torrent names, labeled by consensus vote among four LLMs
- **LLM ensemble:** qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
- **Consensus rules:** all 4 agree = high-confidence label, 3 vs 1 = majority label, 2 vs 2 = sample discarded
- **Accuracy:** ~92% on a held-out test set

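The consensus rules above can be sketched as a small voting function. This is a minimal illustration, not the actual labeling pipeline: the `Consensus` enum and `consensus` function are hypothetical names, and the handling of a 2-1-1 split (treated as discarded here) is an assumption, since the rules only mention 4-0, 3-1, and 2-2 outcomes.

```rust
use std::collections::HashMap;

/// Outcome of combining the four LLM votes for one torrent name.
#[derive(Debug, PartialEq)]
enum Consensus {
    /// All four models agree.
    HighConfidence(String),
    /// Three models agree, one dissents.
    Majority(String),
    /// No label reaches three votes (2v2, or 2-1-1 by ASSUMPTION): sample dropped.
    Discarded,
}

fn consensus(votes: [&str; 4]) -> Consensus {
    // Count how many models voted for each label.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    // With four votes, at most one label can reach three or more.
    match counts.into_iter().max_by_key(|&(_, n)| n) {
        Some((label, 4)) => Consensus::HighConfidence(label.to_string()),
        Some((label, 3)) => Consensus::Majority(label.to_string()),
        _ => Consensus::Discarded,
    }
}
```

Keeping the 4-0 and 3-1 cases as separate variants leaves room to weight or filter high-confidence samples differently during training, should that be useful.
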
## Usage

This model is designed for use with [mimmo](https://github.com/lelloman/mimmo), a Rust library for torrent content classification. The ONNX model is embedded directly in the binary at compile time.

```rust
// The model files are downloaded automatically during the build,
// then embedded into the binary at compile time.
const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");
```

## Performance

- Inference: <10 ms per sample (CPU)
- Used as ML fallback when pattern matching is inconclusive

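The fallback arrangement described above might look roughly like the following sketch: cheap rules run first, and the model is consulted only when they return nothing. Everything here is hypothetical. The extension table is illustrative and is not mimmo's actual rule set, and the `model` closure merely stands in for ONNX inference.

```rust
/// Hypothetical rule-based stage: guesses a label from a file extension.
/// The extension table is illustrative only.
fn match_pattern(name: &str) -> Option<&'static str> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "mp3" | "flac" => Some("audio"),
        "mkv" | "mp4" | "avi" => Some("video"),
        "epub" | "pdf" => Some("book"),
        "exe" | "msi" => Some("software"),
        _ => None,
    }
}

/// Tries the rules first and calls the model only when they are inconclusive.
/// `model` is a stand-in for the classifier; any `Fn(&str) -> &'static str` works.
fn classify<F>(name: &str, model: F) -> &'static str
where
    F: Fn(&str) -> &'static str,
{
    match_pattern(name).unwrap_or_else(|| model(name))
}
```

Because the model is passed in lazily, inputs that the rules already resolve never pay the inference cost.
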
## Files

- `model_embedded.onnx` - ONNX model with embedded weights
- `tokenizer.json` - Hugging Face tokenizer
- `vocab.txt` - Vocabulary file
- `config.json` - Model configuration

## License

MIT