| # Multilingual Quality Classifier |
|
|
| **A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.** |
|
|
| This repository provides: |
|
|
| - Language-specific quality classifiers (one model per language) |
| - Streaming JSONL inference (handles multi-TB corpora without RAM blowup) |
| - Single-GPU and multi-GPU (DDP) support |
| - Corruption-safe pipeline (skips bad JSON, logs errors, never crashes) |
| - Per-class sharded outputs |
| - Automatic logging of progress and failures |
|
|
| ## What is this? |
|
|
| This system classifies text into five quality buckets: 0, 1, 2, 3, and 4. |
|
|
| Each language has its own model; all languages share a single label mapping. |
|
|
| ## Repository Structure |
|
|
| ```text |
| Quality-Classifier/ |
| ├── models/ |
| │   ├── en/ |
| │   ├── bn/ |
| │   ├── hi/ |
| │   ├── ... |
| │   └── label_to_id.json |
| └── qc_infer.py |
| ``` |
|
|
| - `models/<language>/`: Hugging Face-style model directory |
| - `models/label_to_id.json`: Shared class mapping |
| - `qc_infer.py`: Production inference script |
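
As an illustration of how the shared class mapping might be consumed, the sketch below loads a label-to-id JSON file and inverts it, so that a predicted class id can be turned back into its label. The schema is an assumption inferred from the file name (a JSON object mapping label strings to integer ids); `load_id_to_label` is a hypothetical helper, not a function from `qc_infer.py`.

```python
import json
from typing import Dict


def load_id_to_label(path: str) -> Dict[int, str]:
    """Load a label -> id JSON mapping and invert it to id -> label.

    Assumes the file holds a flat JSON object such as {"0": 0, "1": 1},
    which is a guess at the schema rather than a documented format.
    """
    with open(path, encoding="utf-8") as handle:
        label_to_id = json.load(handle)
    return {int(idx): str(label) for label, idx in label_to_id.items()}
```

Inverting once at start-up keeps the per-sample lookup a plain dictionary access.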
|
|
| ## Input Format |
|
|
| Input can be: |
| - A single .jsonl file, or |
| - A directory containing many .jsonl files (searched recursively) |
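
The two accepted input shapes can be resolved with a few lines of `pathlib`. This is a minimal sketch, not the script's actual discovery code; `discover_jsonl_files` is a hypothetical name, and sorting the result is an added assumption so that every process sees the files in the same order.

```python
from pathlib import Path
from typing import List


def discover_jsonl_files(input_path: str) -> List[Path]:
    """Return the .jsonl files named by input_path.

    A file path is returned as a one-element list; a directory is
    searched recursively for files ending in .jsonl.
    """
    root = Path(input_path)
    if root.is_file():
        return [root]
    # rglob walks all subdirectories; sorting makes the order deterministic.
    return sorted(p for p in root.rglob("*.jsonl") if p.is_file())
```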
|
|
| Each line must be a JSON object containing at least one of the text keys: |
|
|
| ```json |
| {"text": "This is a sample sentence"} |
| {"content": "Another example"} |
| {"body": "Yet another example"} |
| ``` |
|
|
| You can control which keys are checked using: |
| ```bash |
| --text_key text content body |
| ``` |
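
One plausible reading of this flag is that the keys are tried in the order given and the first one holding a non-empty string wins. The README does not state the precedence, so treat the order-sensitive lookup below as an assumption; `extract_text` is a hypothetical helper.

```python
from typing import Optional, Sequence


def extract_text(record: dict, text_keys: Sequence[str]) -> Optional[str]:
    """Return the first non-empty string found under text_keys, else None."""
    for key in text_keys:
        value = record.get(key)
        # Ignore missing keys, non-string values, and whitespace-only text.
        if isinstance(value, str) and value.strip():
            return value
    return None
```

With `text_keys = ["text", "content", "body"]`, the record `{"content": "Another example"}` yields `"Another example"`, and a record with none of the keys yields `None`.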
|
|
| ## How to Run |
|
|
| For multi-GPU inference (DDP via `torchrun`): |
|
|
| ```bash |
| CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \ |
| --input_path /data/jsonl_inputs \ |
| --output_path /data/qc_outputs \ |
| --language en \ |
| --text_key text content generated_text |
| ``` |
|
|
| For single-GPU inference: |
|
|
| ```bash |
| CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \ |
| --input_path /data/jsonl_inputs \ |
| --output_path /data/qc_outputs \ |
| --language en \ |
| --text_key text content generated_text |
| ``` |
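
`torchrun` exports `RANK` and `WORLD_SIZE` to every process it launches, which is what lets the same command scale from one GPU to eight. The sketch below shows one simple way a process could pick its share of the input (round-robin by line index). How `qc_infer.py` actually partitions the work is not documented here, so both helper names and the round-robin scheme are assumptions.

```python
import os
from typing import Iterable, Iterator, Tuple


def rank_and_world_size() -> Tuple[int, int]:
    """Read the rank and world size exported by torchrun.

    Falls back to a single-process setup when the variables are unset.
    """
    return int(os.environ.get("RANK", "0")), int(os.environ.get("WORLD_SIZE", "1"))


def lines_for_rank(lines: Iterable[str], rank: int, world_size: int) -> Iterator[str]:
    """Yield every world_size-th line, starting at offset rank."""
    for index, line in enumerate(lines):
        if index % world_size == rank:
            yield line
```

Because each index maps to exactly one rank, no line is processed twice and none is dropped, and the generator never holds more than one line in memory.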
|
|
| ## Output Format |
|
|
| The script writes class-wise sharded JSONL files, one per class and rank, into the directory passed as `--output_path` (shown here as `output_dir`): |
| ```text |
| output_dir/ |
| ├── 0.rank*.jsonl |
| ├── 1.rank*.jsonl |
| ├── 2.rank*.jsonl |
| ├── 3.rank*.jsonl |
| ├── 4.rank*.jsonl |
| └── qc_infer.log |
| ``` |
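
A per-class, per-rank layout like this means no two processes ever write to the same file, so no locking is needed. The following is a rough sketch of such a writer; `ShardedWriter` is a hypothetical class, and the exact file name (`<class>.rank<rank>.jsonl`) is a guess modelled on the `0.rank*.jsonl` pattern above.

```python
import json
import os
from typing import Dict, TextIO


class ShardedWriter:
    """Append records to one JSONL file per (class label, rank) pair."""

    def __init__(self, output_dir: str, rank: int) -> None:
        self.output_dir = output_dir
        self.rank = rank
        self._files: Dict[int, TextIO] = {}
        os.makedirs(output_dir, exist_ok=True)

    def write(self, label: int, record: dict) -> None:
        handle = self._files.get(label)
        if handle is None:
            # Assumed naming scheme; files are opened lazily on first use.
            path = os.path.join(self.output_dir, f"{label}.rank{self.rank}.jsonl")
            handle = open(path, "a", encoding="utf-8")
            self._files[label] = handle
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

    def close(self) -> None:
        for handle in self._files.values():
            handle.close()
        self._files.clear()
```

Opening files lazily means a class that never occurs on a given rank produces no empty shard.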
|
|
| ## Logging |
|
|
| A full log is written to: |
|
|
| ```bash |
| output_dir/qc_infer.log |
| ``` |
|
|
| It contains: |
| - Number of files discovered |
| - Number of lines indexed |
| - Corrupted JSON lines |
| - Missing text key errors |
| - Batch failures (with stack traces) |
| - Progress info |
|
|
| The script: |
|
|
| - Skips corrupted JSON |
| - Skips invalid samples |
| - Never crashes due to bad data |
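
The skip-and-log behaviour described above can be sketched as a generator that consumes lines one at a time, so it works on an open file handle without loading the file into memory. This is an illustration under assumptions, not the code in `qc_infer.py`: the function name, the logger name, and the message wording are all hypothetical.

```python
import json
import logging
from typing import Iterable, Iterator, Sequence, Tuple

logger = logging.getLogger("qc_infer_sketch")  # hypothetical logger name


def iter_valid_samples(
    lines: Iterable[str], text_keys: Sequence[str]
) -> Iterator[Tuple[dict, str]]:
    """Yield (record, text) pairs, logging and skipping unusable lines."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            logger.warning("line %d: corrupted JSON (%s)", line_no, exc)
            continue
        if not isinstance(record, dict):
            logger.warning("line %d: expected a JSON object", line_no)
            continue
        # First key that holds a non-empty string wins (assumed precedence).
        text = next(
            (record[k] for k in text_keys
             if isinstance(record.get(k), str) and record[k].strip()),
            None,
        )
        if text is None:
            logger.warning("line %d: none of the text keys %s found", line_no, list(text_keys))
            continue
        yield record, text
```

Passing `open(path, encoding="utf-8", errors="replace")` as `lines` would also keep undecodable bytes from raising, at the cost of replacing them with U+FFFD.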
|
|