# 📊 Multilingual Quality Classifier
**A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.**
This repository provides:
- ✅ Language-specific quality classifiers (one model per language)
- ✅ Streaming JSONL inference (handles multi-TB corpora without exhausting RAM)
- ✅ Single-GPU and multi-GPU (DDP) support
- ✅ Corruption-safe pipeline (skips bad JSON, logs errors, does not crash on bad data)
- ✅ Per-class sharded outputs
- ✅ Automatic logging of progress and failures
## 🧠 What is this?
This system classifies text into 5 quality buckets: 0, 1, 2, 3, and 4.
Each language has its own model, and all models share a single label mapping.
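
As a rough illustration of how a shared label mapping is typically used at inference time, the sketch below inverts a label-to-id dictionary so that a predicted class index can be mapped back to its label. The JSON contents shown are an assumption made for illustration; the actual structure of `models/label_to_id.json` is not documented here and may differ.

```python
import json

# Hypothetical contents; the real models/label_to_id.json may look different.
label_to_id = json.loads('{"0": 0, "1": 1, "2": 2, "3": 3, "4": 4}')

# Invert the mapping so a predicted class index can be turned back into its label.
id_to_label = {idx: label for label, idx in label_to_id.items()}

predicted_index = 3  # e.g. the argmax over a model's five output logits
print(id_to_label[predicted_index])
```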
## πŸ“ Repository Structure
```json
Quality-Classifier/
β”œβ”€β”€ models/
β”‚ β”œβ”€β”€ en/
β”‚ β”œβ”€β”€ bn/
β”‚ β”œβ”€β”€ hi/
β”‚ β”œβ”€β”€ ...
β”‚ └── label_to_id.json
└── qc_infer.py
```
```models/<language>/``` β†’ HuggingFace-style model directory
```models/label_to_id.json``` β†’ Shared class mapping
```qc_infer.py``` β†’ Production inference script
## 📂 Input Format
Input can be:
- ✅ A single `.jsonl` file, or
- ✅ A directory containing many `.jsonl` files (searched recursively)
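
The two input modes can be sketched as a small discovery helper. This is a minimal sketch, not the code in `qc_infer.py`: the function name `discover_jsonl` is hypothetical, and the sorted ordering is an assumption made so results are deterministic.

```python
from pathlib import Path
from typing import List

def discover_jsonl(input_path: str) -> List[Path]:
    """Return the .jsonl files behind a file path or a directory path."""
    root = Path(input_path)
    if root.is_file():
        # Single-file mode: accept it only if it has the .jsonl extension.
        return [root] if root.suffix == ".jsonl" else []
    # Directory mode: rglob also descends into subdirectories.
    return sorted(p for p in root.rglob("*.jsonl") if p.is_file())
```

Calling `discover_jsonl("/data/jsonl_inputs")` would return every `.jsonl` file below that directory, while passing a single file returns a one-element list.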
Each line must be a JSON object containing at least one of the text keys:
```json
{"text": "This is a sample sentence"}
{"content": "Another example"}
{"body": "Yet another example"}
```
You can control which keys are checked using:
```bash
--text_key text content body
```
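
One plausible way to resolve the text field from several candidate keys is shown below. It is a sketch under assumptions: the helper name `pick_text` is hypothetical, and taking the first key (in the order given) that holds a non-empty string is a guess at the behavior, not something confirmed by `qc_infer.py`.

```python
import json
from typing import Iterable, Optional

def pick_text(record: dict, text_keys: Iterable[str]) -> Optional[str]:
    """Return the first non-empty string found under the candidate keys."""
    for key in text_keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return None  # no usable text key in this record

line = '{"content": "Another example"}'
print(pick_text(json.loads(line), ["text", "content", "body"]))
```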
## How to Run
Multi-GPU inference with DDP (8 GPUs in this example):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
--input_path /data/jsonl_inputs \
--output_path /data/qc_outputs \
--language en \
--text_key text content generated_text
```
Single-GPU inference (still launched through `torchrun`, with one process):
```bash
CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
--input_path /data/jsonl_inputs \
--output_path /data/qc_outputs \
--language en \
--text_key text content generated_text
```
## 📤 Output Format
The script writes class-wise sharded JSONL files:
```text
output_dir/
├── 0.rank*.jsonl
├── 1.rank*.jsonl
├── 2.rank*.jsonl
├── 3.rank*.jsonl
├── 4.rank*.jsonl
└── qc_infer.log
```
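
After a run, the per-class shards can be tallied to see how the corpus is distributed across quality buckets. The helper below is a hypothetical post-processing sketch, not part of the repository; it relies only on the `<class>.rank*.jsonl` naming shown above and assumes one record per non-empty line.

```python
from pathlib import Path
from typing import Dict

def count_per_class(output_dir: str, num_classes: int = 5) -> Dict[int, int]:
    """Count non-empty JSONL lines per quality class across all rank shards."""
    counts: Dict[int, int] = {}
    for cls in range(num_classes):
        total = 0
        # Every rank writes its own shard, so sum over all files for this class.
        for shard in Path(output_dir).glob(f"{cls}.rank*.jsonl"):
            with shard.open("r", encoding="utf-8") as fh:
                total += sum(1 for line in fh if line.strip())
        counts[cls] = total
    return counts
```

For example, `count_per_class("/data/qc_outputs")` returns a dictionary keyed by class index 0 to 4.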
## 🧾 Logging
A full log is written to:
```bash
output_dir/qc_infer.log
```
It contains:
- Number of files discovered
- Number of lines indexed
- Corrupted JSON lines
- Missing text key errors
- Batch failures (with stack traces)
- Progress info
The script:
- ✅ Skips corrupted JSON
- ✅ Skips invalid samples
- ✅ Never crashes due to bad data
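
The skip-and-log behavior described above can be sketched as a generator that processes one line at a time, so memory use stays flat regardless of file size. This is an illustrative sketch under assumptions, not the implementation in `qc_infer.py`: the function name `iter_texts`, the logger name, and the log message wording are all hypothetical.

```python
import json
import logging
from typing import Iterable, Iterator, Sequence

logger = logging.getLogger("qc_infer_sketch")  # hypothetical logger name

def iter_texts(lines: Iterable[str], text_keys: Sequence[str]) -> Iterator[str]:
    """Yield one text per valid line; log and skip everything else."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            logger.warning("line %d: corrupted JSON (%s)", line_no, exc)
            continue
        if not isinstance(record, dict):
            logger.warning("line %d: not a JSON object", line_no)
            continue
        # Take the first candidate key that holds a string value.
        text = next((record[k] for k in text_keys if isinstance(record.get(k), str)), None)
        if text is None:
            logger.warning("line %d: none of the text keys %s found", line_no, list(text_keys))
            continue
        yield text

sample = ['{"text": "ok"}', '{"text": broken', '{"id": 7}', '{"body": "also ok"}']
print(list(iter_texts(sample, ["text", "content", "body"])))  # ['ok', 'also ok']
```

Because `iter_texts` accepts any iterable of strings, an open file handle can be passed directly and lines are read lazily rather than loaded all at once.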