
📊 Multilingual Quality Classifier

A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.

This repository provides:

  • ✅ Language-specific quality classifiers (one model per language)
  • ✅ Streaming JSONL inference (handles multi-TB corpora without RAM blowup)
  • ✅ Single-GPU and Multi-GPU (DDP) support
  • ✅ Corruption-safe pipeline (skips bad JSON, logs errors, never crashes)
  • ✅ Per-class sharded outputs
  • ✅ Automatic logging of progress and failures

🧠 What is this?

This system classifies text into 5 quality buckets: 0, 1, 2, 3, and 4.

Each language has its own model; all languages share a single label mapping.

πŸ“ Repository Structure

Quality-Classifier/
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ en/
β”‚   β”œβ”€β”€ bn/
β”‚   β”œβ”€β”€ hi/
β”‚   β”œβ”€β”€ ...
β”‚   └── label_to_id.json
└── qc_infer.py

models/<language>/ β†’ HuggingFace-style model directory

models/label_to_id.json β†’ Shared class mapping

qc_infer.py β†’ Production inference script

📂 Input Format

Input can be either: ✅ a single .jsonl file, or ✅ a directory containing .jsonl files (searched recursively)
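The two input modes above can be sketched as a small discovery helper. This is an illustration only, not the code in qc_infer.py: the function name `discover_jsonl` and the sorted ordering are assumptions.

```python
from pathlib import Path
from typing import Iterator


def discover_jsonl(input_path: str) -> Iterator[Path]:
    """Yield .jsonl files from a single file path or, recursively, from a directory."""
    root = Path(input_path)
    if root.is_file():
        # Single-file mode: accept the path only if it is a .jsonl file.
        if root.suffix == ".jsonl":
            yield root
        return
    # Directory mode: rglob descends into subdirectories.
    # Sorting gives every process the same file order (an assumption, not a documented guarantee).
    yield from sorted(p for p in root.rglob("*.jsonl") if p.is_file())
```

Sorting is a design choice for the sketch: when several processes list the same directory, a stable order makes it easier to split the work between them consistently.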

Each line must be a JSON object containing at least one of the text keys:

{"text": "This is a sample sentence"}
{"content": "Another example"}
{"body": "Yet another example"}

You can control which keys are checked using:

--text_key text content body
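One plausible way to resolve the text field from a list of candidate keys is to take the first key that holds a non-empty string. The README does not state the precedence rule the script uses, so the first-match behavior and the helper name `extract_text` below are assumptions.

```python
from typing import Optional, Sequence


def extract_text(record: dict, text_keys: Sequence[str]) -> Optional[str]:
    """Return the first non-empty string found under any of text_keys, else None."""
    for key in text_keys:
        value = record.get(key)
        # Ignore missing keys, non-string values, and whitespace-only strings.
        if isinstance(value, str) and value.strip():
            return value
    return None


keys = ["text", "content", "body"]
sample = extract_text({"content": "Another example"}, keys)
```

A record for which this returns `None` would correspond to the "missing text key" case that the log reports.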

How to Run

For multi-GPU (DDP) inference

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text

For single-GPU inference

CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
  --input_path /data/jsonl_inputs \
  --output_path /data/qc_outputs \
  --language en \
  --text_key text content generated_text
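When launched with torchrun, each process receives `RANK` and `WORLD_SIZE` environment variables. The README does not describe how qc_infer.py divides work between processes; the round-robin split below is a common pattern shown only as a sketch, and `shard_for_rank` is a hypothetical helper.

```python
import os
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every world_size-th item, starting at position rank (round-robin split)."""
    for index, item in enumerate(items):
        if index % world_size == rank:
            yield item


# torchrun sets these variables; the defaults cover a plain single-process run.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

my_lines = list(shard_for_rank(range(10), rank=1, world_size=4))
```

With this scheme every item goes to exactly one rank, so the per-rank outputs can be concatenated afterwards without deduplication.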

📀 Output Format

The script writes class-wise sharded JSONL files:

output_dir/
 ├── 0.rank*.jsonl
 ├── 1.rank*.jsonl
 ├── 2.rank*.jsonl
 ├── 3.rank*.jsonl
 ├── 4.rank*.jsonl
 └── qc_infer.log
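To inspect a finished run, the shards for each class can be gathered with the `<class>.rank*.jsonl` pattern shown above. The sketch below counts records per class; `count_per_class` is a hypothetical helper, and it assumes one record per non-empty line.

```python
import glob
import os
from collections import Counter


def count_per_class(output_dir: str, num_classes: int = 5) -> Counter:
    """Count non-empty lines across all rank shards of each quality class."""
    counts: Counter = Counter()
    for cls in range(num_classes):
        pattern = os.path.join(output_dir, f"{cls}.rank*.jsonl")
        for shard in glob.glob(pattern):
            with open(shard, "r", encoding="utf-8") as f:
                # Stream line by line so large shards are never fully loaded.
                counts[cls] += sum(1 for line in f if line.strip())
    return counts
```

For example, `count_per_class("/data/qc_outputs")` would return a `Counter` keyed by class index 0 to 4, which gives a quick view of the quality distribution of a corpus.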

🧾 Logging

A full log is written to:

output_dir/qc_infer.log

It contains:

  • Number of files discovered
  • Number of lines indexed
  • Corrupted JSON lines
  • Missing text key errors
  • Batch failures (with stack traces)
  • Progress info

The script:

  • ✅ Skips corrupted JSON
  • ✅ Skips invalid samples
  • ✅ Never crashes due to bad data
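The skip-and-log behavior listed above can be illustrated with a streaming reader that never raises on a bad line. This is a minimal sketch, not the script's implementation: the logger name, the message wording, and the function `iter_valid_records` are assumptions.

```python
import json
import logging
from typing import Iterator

logger = logging.getLogger("qc_sketch")  # hypothetical logger name


def iter_valid_records(path: str) -> Iterator[dict]:
    """Stream JSON objects from a .jsonl file, logging and skipping bad lines."""
    # errors="replace" keeps invalid UTF-8 bytes from raising during decoding.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                logger.warning("%s:%d corrupted JSON skipped (%s)", path, line_no, exc)
                continue
            if not isinstance(record, dict):
                logger.warning("%s:%d not a JSON object, skipped", path, line_no)
                continue
            yield record
```

Because the function is a generator that reads one line at a time, memory use stays flat regardless of file size, which matches the streaming design described at the top of this README.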