📊 Multilingual Quality Classifier

A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.

This repository provides:

✅ Language-specific quality classifiers (one model per language)
✅ Streaming JSONL inference (handles multi-TB corpora without RAM blowup)
✅ Single-GPU and Multi-GPU (DDP) support
✅ Corruption-safe pipeline (skips bad JSON, logs errors, never crashes)
✅ Per-class sharded outputs
✅ Automatic logging of progress and failures

🧠 What is this?

This system classifies text into five quality buckets: 0, 1, 2, 3, and 4.

Each language has its own model; all models share a single label mapping.

📁 Repository Structure

Quality-Classifier/
├── models/
│   ├── en/
│   ├── bn/
│   ├── hi/
│   ├── ...
│   └── label_to_id.json
└── qc_infer.py

models/<language>/ → HuggingFace-style model directory

models/label_to_id.json → Shared class mapping

qc_infer.py → Production inference script
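
The shared mapping could be loaded and inverted as sketched below, so that a predicted class index can be turned back into its label. The contents of label_to_id.json are not shown in this README, so the flat `{label: id}` layout and the helper name `load_id_to_label` are assumptions for illustration, not the script's actual code.

```python
import json
from typing import Dict


def load_id_to_label(path: str) -> Dict[int, str]:
    """Load a {label: id} JSON file and return the inverse {id: label} map.

    Assumes a flat JSON object such as {"0": 0, "1": 1, ...}; the real
    file layout may differ.
    """
    with open(path, "r", encoding="utf-8") as handle:
        label_to_id = json.load(handle)
    # Invert the mapping so model output indices can be looked up directly.
    return {int(idx): str(label) for label, idx in label_to_id.items()}
```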

📂 Input Format

Input can be either:

✅ A single .jsonl file
✅ A directory containing many .jsonl files (searched recursively)
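
A minimal sketch of how both input forms might be resolved into a list of files. The function name `discover_jsonl_files` is hypothetical, and the sorted ordering is an assumption made so that every process sees the files in the same order.

```python
from pathlib import Path
from typing import List


def discover_jsonl_files(input_path: str) -> List[Path]:
    """Return the .jsonl files for a single-file or directory input."""
    root = Path(input_path)
    if root.is_file():
        # A single file is accepted only if it carries the .jsonl suffix.
        return [root] if root.suffix == ".jsonl" else []
    # rglob descends into subdirectories; sorting keeps the order stable.
    return sorted(p for p in root.rglob("*.jsonl") if p.is_file())
```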

Each line must be a JSON object containing at least one of the text keys:

{"text": "This is a sample sentence"}
{"content": "Another example"}
{"body": "Yet another example"}

You can control which keys are checked using:

--text_key text content body
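
The key lookup can be pictured roughly as follows. This is a sketch under assumptions: the helper name `extract_text` is hypothetical, keys are tried in the order given on the command line, and empty or non-string values are treated as missing. The actual script may resolve keys differently.

```python
from typing import Any, Dict, Optional, Sequence


def extract_text(record: Dict[str, Any], text_keys: Sequence[str]) -> Optional[str]:
    """Return the first non-empty string found under any of text_keys."""
    for key in text_keys:
        value = record.get(key)
        # Skip missing keys, non-string values, and whitespace-only strings.
        if isinstance(value, str) and value.strip():
            return value
    # No usable text: the caller can log a missing-key error and skip the line.
    return None
```

For example, `extract_text({"content": "Another example"}, ["text", "content", "body"])` falls through the missing `text` key and returns the `content` value.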

How to Run

For multi-GPU (DDP) inference

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text

For single-GPU inference

CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text
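
torchrun exports RANK and WORLD_SIZE to every worker process it launches. One common way to split work across workers is to give each rank every WORLD_SIZE-th line, as sketched below. Whether qc_infer.py shards by line, by file, or by some other scheme is not stated here, so the round-robin strategy and both function names are assumptions.

```python
import os
from typing import Iterable, Iterator, Tuple


def get_rank_and_world_size() -> Tuple[int, int]:
    """Read the rank and world size that torchrun sets in the environment."""
    # Fall back to a single process when the script is run without torchrun.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size


def shard_for_rank(items: Iterable[str], rank: int, world_size: int) -> Iterator[str]:
    """Yield only the items assigned to this rank (round-robin by index)."""
    for index, item in enumerate(items):
        if index % world_size == rank:
            yield item
```

With `--nproc_per_node=1` the world size is 1, so the single worker keeps every line.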

📤 Output Format

The script writes class-wise sharded JSONL files, one per class and rank, into the directory given by --output_path (shown here as output_dir):

output_dir/
 ├── 0.rank*.jsonl
 ├── 1.rank*.jsonl
 ├── 2.rank*.jsonl
 ├── 3.rank*.jsonl
 ├── 4.rank*.jsonl
 └── qc_infer.log
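
The per-class, per-rank layout above could be produced by a small writer like the following. This is an illustrative sketch: the class name `ShardedWriter` is hypothetical, and the exact rank formatting in the file name (here `rank0`, `rank1`, ...) is an assumption based on the `rank*` pattern shown.

```python
import json
from pathlib import Path
from typing import IO, Any, Dict


class ShardedWriter:
    """Append records to one JSONL file per predicted class for a given rank."""

    def __init__(self, output_dir: str, rank: int, num_classes: int = 5) -> None:
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        # Each rank owns its own files, so workers never write to the same file.
        self.files: Dict[int, IO[str]] = {
            label: open(self.output_dir / f"{label}.rank{rank}.jsonl", "a", encoding="utf-8")
            for label in range(num_classes)
        }

    def write(self, label: int, record: Dict[str, Any]) -> None:
        # ensure_ascii=False keeps non-Latin scripts readable in the output.
        self.files[label].write(json.dumps(record, ensure_ascii=False) + "\n")

    def close(self) -> None:
        for handle in self.files.values():
            handle.close()
```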

🧾 Logging

A full log is written to:

output_dir/qc_infer.log

It contains:

Number of files discovered
Number of lines indexed
Corrupted JSON lines
Missing text key errors
Batch failures (with stack traces)
Progress info
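
A log file of this kind can be set up with Python's standard logging module. The sketch below is an assumption about how it might be wired, not the script's actual configuration; the logger name, the message format, and the `build_logger` helper are all hypothetical.

```python
import logging
from pathlib import Path


def build_logger(output_dir: str, rank: int = 0) -> logging.Logger:
    """Create a logger that appends to <output_dir>/qc_infer.log."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"qc_infer_sketch.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:
        handler = logging.FileHandler(Path(output_dir) / "qc_infer.log", encoding="utf-8")
        # Tag every line with the rank so multi-GPU logs can be told apart.
        handler.setFormatter(
            logging.Formatter(f"%(asctime)s [rank {rank}] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Inside an `except` block, `logger.exception("batch failed")` records the message together with the stack trace.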

The script:

✅ Skips corrupted JSON
✅ Skips invalid samples
✅ Never crashes due to bad data
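
The skip-and-log behavior can be sketched as a generator that reads one line at a time, so memory use stays flat regardless of file size. This is a minimal illustration under assumptions: the function name `iter_valid_records` and the logger name are hypothetical, and the real script may handle encoding errors or non-object lines differently.

```python
import json
import logging
from typing import Any, Dict, Iterator

logger = logging.getLogger("qc_infer_sketch")


def iter_valid_records(path: str) -> Iterator[Dict[str, Any]]:
    """Stream JSON objects from a JSONL file, skipping lines that cannot be used."""
    # errors="replace" keeps a stray invalid byte from aborting the whole file.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                logger.warning("%s:%d corrupted JSON skipped: %s", path, line_number, exc)
                continue
            if not isinstance(record, dict):
                logger.warning("%s:%d not a JSON object, skipped", path, line_number)
                continue
            yield record
```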