	---
	license: mit
	language:
	- en
	tags:
	- text-classification
	- onnx
	- bert
	- torrent
	- content-classification
	base_model: prajjwal1/bert-tiny
	pipeline_tag: text-classification
	---

	# BERT Torrent Classifier

	A fine-tuned BERT-tiny model for classifying torrent content into media types.

	## Model Details

	- Base model: [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny)
	- Task: Multi-class text classification
	- Labels: audio, video, software, book, other
	- Format: ONNX (with embedded weights)
	- Size: ~17MB
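
As a rough illustration of how a five-way classifier's raw output is usually turned into one of the labels above, the sketch below applies a softmax and picks the most probable class. It assumes the ONNX graph emits one logit per label and that the logits follow the order in which the labels are listed here. Neither is confirmed by this card, so check `config.json` for the real index-to-label mapping. `LABELS`, `softmax`, and `top_label` are hypothetical names, not part of mimmo.

```rust
/// ASSUMPTION: label order copied from the model card's list; the real
/// index-to-label mapping should be read from `config.json`.
pub const LABELS: [&str; 5] = ["audio", "video", "software", "book", "other"];

/// Numerically stable softmax: subtract the max logit before exponentiating.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Returns the most probable label together with its probability.
pub fn top_label(logits: &[f32; 5]) -> (&'static str, f32) {
    let probs = softmax(logits);
    let (idx, &p) = probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .expect("five logits are always present");
    (LABELS[idx], p)
}
```

A caller could compare the returned probability against a threshold of its own choosing and fall back to `other` when the model is unsure.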

	## Training

	- Training data: ~10k torrent names, labeled by consensus voting across four LLMs
	- LLM ensemble: qwen2.5:3b, gemma3:4b, mistral:7b, qwen3-coder:30b
	- Consensus rules: all four agree = high-confidence label, 3-to-1 split = majority label, 2-to-2 split = sample discarded
	- Accuracy: ~92% on a held-out test set
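
The consensus rules above can be sketched as a small voting function. This is a minimal illustration, not the actual labeling pipeline: `Label`, `Consensus`, and `consensus` are hypothetical names. The card does not say what happens to splits such as 2-1-1 or 1-1-1-1, so the sketch assumes any sample without at least three matching votes is discarded.

```rust
use std::collections::HashMap;

/// The five labels listed in the model card.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Label {
    Audio,
    Video,
    Software,
    Book,
    Other,
}

/// Outcome of combining the four LLM votes for one torrent name.
#[derive(Debug, PartialEq, Eq)]
pub enum Consensus {
    /// All four models agreed.
    HighConfidence(Label),
    /// Three models agreed and one dissented.
    Majority(Label),
    /// No label reached three votes (ASSUMPTION: this covers 2-2,
    /// 2-1-1 and 1-1-1-1 splits alike).
    Discarded,
}

pub fn consensus(votes: [Label; 4]) -> Consensus {
    // Count how many votes each label received.
    let mut counts: HashMap<Label, usize> = HashMap::new();
    for v in votes.iter() {
        *counts.entry(*v).or_insert(0) += 1;
    }
    // Pick the label with the most votes; ties only occur at counts
    // of 2 or 1, which are discarded regardless of which label wins.
    let (label, count) = counts
        .into_iter()
        .max_by_key(|&(_, c)| c)
        .expect("there are always four votes");
    match count {
        4 => Consensus::HighConfidence(label),
        3 => Consensus::Majority(label),
        _ => Consensus::Discarded,
    }
}
```

Keeping the high-confidence and majority cases as separate variants would let a training script weight or filter the two groups differently, if that were desired.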

	## Usage

	This model is designed for use with [mimmo](https://github.com/lelloman/mimmo), a Rust library for torrent content classification. The ONNX model is embedded directly in the binary at compile time.

	```rust
	// The model files are downloaded automatically during the build,
	// then embedded into the binary at compile time.
	const MODEL_BYTES: &[u8] = include_bytes!("../models/bert/model_embedded.onnx");
	const TOKENIZER_JSON: &str = include_str!("../models/bert/tokenizer.json");
	```

	## Performance

	- Inference: <10ms per sample (CPU)
	- Role: used as the ML fallback when pattern matching is inconclusive
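
The fallback arrangement can be pictured as a two-stage classifier: a cheap rule-based pass runs first, and the model is consulted only when no rule matches. The sketch below is an assumption about the general shape, not mimmo's actual code. `classify_by_pattern`, its extension table, and `classify` are all hypothetical, and the model call is passed in as a closure so the example stays self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Label {
    Audio,
    Video,
    Software,
    Book,
    Other,
}

/// HYPOTHETICAL rule-based pass: looks only at the trailing extension.
/// Returns `None` when no rule matches, i.e. the result is inconclusive.
pub fn classify_by_pattern(name: &str) -> Option<Label> {
    let ext = name.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "mp3" | "flac" => Some(Label::Audio),
        "mkv" | "mp4" | "avi" => Some(Label::Video),
        "epub" | "pdf" => Some(Label::Book),
        "exe" | "iso" => Some(Label::Software),
        _ => None,
    }
}

/// Tries the pattern pass first and calls the (more expensive) ML
/// fallback only when the patterns give no answer.
pub fn classify<F>(name: &str, ml_fallback: F) -> Label
where
    F: FnOnce(&str) -> Label,
{
    classify_by_pattern(name).unwrap_or_else(|| ml_fallback(name))
}
```

Because the closure is evaluated lazily, inputs that the pattern pass already resolves never pay the model's inference cost.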

	## Files

	- `model_embedded.onnx` - ONNX model with embedded weights
	- `tokenizer.json` - HuggingFace tokenizer
	- `vocab.txt` - Vocabulary file
	- `config.json` - Model configuration

	## License

	MIT