---
license: mit
language:
- en
tags:
- text-classification
- onnx
- bert
- torrent
- content-classification
base_model: prajjwal1/bert-tiny
pipeline_tag: text-classification
---

# BERT Torrent Classifier

A fine-tuned BERT-tiny model for classifying torrent content into media types.

## Model Details

- **Base model:** [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny)
- **Task:** Multi-class text classification
- **Labels:** audio, video, software, book, other
- **Format:** ONNX (with embedded weights)
- **Size:** ~17MB

## Training

- **Training data:** ~10k torrent names with 4-LLM consensus voting
- **LLM ensemble:** qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
- **Consensus rules:** 4-agree = high confidence, 3v1 = majority vote, 2v2 = discarded
- **Accuracy:** ~92% on held-out test set

## Usage

This model is designed for use with [mimmo](https://github.com/lelloman/mimmo), a Rust library for torrent content classification.

The ONNX model is embedded directly in the binary at compile time.

```rust
// Model is automatically downloaded during build
const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");
```

## Performance

- Inference: <10ms per sample (CPU)
- Used as ML fallback when pattern matching is inconclusive

## Files

- `model_embedded.onnx` - ONNX model with embedded weights
- `tokenizer.json` - HuggingFace tokenizer
- `vocab.txt` - Vocabulary file
- `config.json` - Model configuration

## License

MIT
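
## Consensus Labeling Sketch

The consensus rules listed under Training (4-agree, 3v1, 2v2) can be sketched roughly as follows. This is a minimal illustration, not the actual labeling pipeline: the `Consensus` enum and the `consensus` function are hypothetical names, and the handling of splits other than 2v2 (for example 2-1-1) is an assumption, since the rules above do not cover them. Here any case without at least three matching votes is discarded.

```rust
use std::collections::HashMap;

/// Outcome of combining four LLM votes for one torrent name.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq)]
enum Consensus {
    /// All four models agreed on the label.
    HighConfidence(String),
    /// Three models agreed and one dissented.
    Majority(String),
    /// No label reached three votes (e.g. a 2v2 split); the sample is dropped.
    Discarded,
}

/// Apply the 4-agree / 3v1 / 2v2 rules to one set of votes.
/// Assumption: splits such as 2-1-1 are discarded as well.
fn consensus(votes: [&str; 4]) -> Consensus {
    // Count how many models voted for each label.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for vote in votes {
        *counts.entry(vote).or_insert(0) += 1;
    }

    // Pick the label with the most votes; ties only occur below three
    // votes, where the sample is discarded anyway.
    let (label, n) = counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .expect("there are always four votes");

    match n {
        4 => Consensus::HighConfidence(label.to_string()),
        3 => Consensus::Majority(label.to_string()),
        _ => Consensus::Discarded,
    }
}
```

For example, `consensus(["video", "video", "audio", "video"])` yields `Consensus::Majority("video".to_string())`, while `consensus(["book", "book", "software", "software"])` yields `Consensus::Discarded`.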