atharv-savarkar / Quality-Classifier

	# 📊 Multilingual Quality Classifier

	A high-throughput, production-grade text quality classification system for large-scale corpora across multiple languages.

	This repository provides:

	- ✅ Language-specific quality classifiers (one model per language)
	- ✅ Streaming JSONL inference (handles multi-TB corpora without RAM blowup)
	- ✅ Single-GPU and Multi-GPU (DDP) support
	- ✅ Corruption-safe pipeline (skips bad JSON, logs errors, never crashes)
	- ✅ Per-class sharded outputs
	- ✅ Automatic logging of progress and failures

	## 🧠 What is this?

	This system classifies text into five quality buckets: 0, 1, 2, 3, and 4.

	Each language has its own model; all languages share a single label mapping.

	## 📁 Repository Structure

	```text
	Quality-Classifier/
	├── models/
	│ ├── en/
	│ ├── bn/
	│ ├── hi/
	│ ├── ...
	│ └── label_to_id.json
	└── qc_infer.py
	```

	- `models/<language>/` → HuggingFace-style model directory
	- `models/label_to_id.json` → Shared class mapping
	- `qc_infer.py` → Production inference script
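As an illustration of how the shared mapping might be consumed, the sketch below loads `label_to_id.json` and inverts it so that a predicted class id can be turned back into a label. The file's schema is not documented here; the code assumes a flat JSON object of label names to integer ids, and `load_id_to_label` is a hypothetical helper, not part of `qc_infer.py`.

```python
import json
from pathlib import Path


def load_id_to_label(mapping_path):
    """Load a label -> id JSON mapping and return the inverse id -> label dict.

    Assumes the file is a flat JSON object such as {"0": 0, "1": 1, ...};
    the real schema may differ.
    """
    with Path(mapping_path).open(encoding="utf-8") as f:
        label_to_id = json.load(f)
    return {int(idx): label for label, idx in label_to_id.items()}
```

With a mapping like that, `load_id_to_label("models/label_to_id.json")[3]` would give the label for class id 3.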

	## 📂 Input Format

	Input can be:

	- ✅ A single `.jsonl` file, or
	- ✅ A directory containing many `.jsonl` files (searched recursively)
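The two input modes could be resolved along the following lines. This is only a sketch: `find_jsonl_files` is a hypothetical helper, and the sorted ordering is an assumption made so that every process sees the files in the same order.

```python
from pathlib import Path


def find_jsonl_files(input_path):
    """Return the .jsonl files to process for a file or directory path."""
    path = Path(input_path)
    if path.is_file():
        # A single file was given: process just that file.
        return [path]
    # A directory was given: walk it recursively and keep only .jsonl files.
    return sorted(p for p in path.rglob("*.jsonl") if p.is_file())
```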

	Each line must be a JSON object containing at least one of the text keys:

	```json
	{"text": "This is a sample sentence"}
	{"content": "Another example"}
	{"body": "Yet another example"}
	```

	You can control which keys are checked using:
	```bash
	--text_key text content body
	```
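One plausible reading of `--text_key` is that the keys are tried in the order given and the first usable one wins. The sketch below implements that reading; the priority order and the rule that empty strings are skipped are assumptions, and `extract_text` is a hypothetical name.

```python
def extract_text(record, text_keys=("text", "content", "body")):
    """Return the first non-empty string found under any of text_keys, else None."""
    for key in text_keys:
        value = record.get(key)
        # Skip missing keys, non-string values, and blank strings.
        if isinstance(value, str) and value.strip():
            return value
    return None
```

A record with none of the keys yields `None`, which a caller can treat as a "missing text key" case and skip.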

	## How to Run

	For multi-GPU (DDP) inference:

	```bash
	CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qc_infer.py \
	--input_path /data/jsonl_inputs \
	--output_path /data/qc_outputs \
	--language en \
	--text_key text content generated_text
	```

	For single-GPU inference:

	```bash
	CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 qc_infer.py \
	--input_path /data/jsonl_inputs \
	--output_path /data/qc_outputs \
	--language en \
	--text_key text content generated_text
	```
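When launched with `torchrun`, each process receives its rank and the total process count through the `RANK` and `WORLD_SIZE` environment variables. How `qc_infer.py` divides the input among ranks is not described here; the sketch below shows one common scheme (round-robin by line index) purely as an assumption, using hypothetical helper names.

```python
import os


def rank_and_world_size():
    """Read the rank and world size that torchrun exports, defaulting to one process."""
    return int(os.environ.get("RANK", "0")), int(os.environ.get("WORLD_SIZE", "1"))


def lines_for_rank(lines, rank, world_size):
    """Yield only the lines assigned to this rank (round-robin by line index)."""
    for index, line in enumerate(lines):
        if index % world_size == rank:
            yield line
```

With this scheme every line is handled by exactly one rank, so the per-rank outputs can be concatenated without deduplication.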

	## 📤 Output Format

	The script writes class-wise sharded JSONL files:
	```text
	output_dir/
	├── 0.rank*.jsonl
	├── 1.rank*.jsonl
	├── 2.rank*.jsonl
	├── 3.rank*.jsonl
	├── 4.rank*.jsonl
	└── qc_infer.log
	```
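To get a quick class distribution from these shards, the per-rank files can be grouped by their class prefix. The sketch below only counts non-empty lines, since the schema of the output records is not specified here; it assumes file names of the form `<class>.rank<N>.jsonl`, and `count_by_class` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_by_class(output_dir):
    """Count output records per quality class across all rank shards."""
    counts = Counter()
    for shard in Path(output_dir).glob("*.rank*.jsonl"):
        # The class label is everything before ".rank" in the file name.
        label = shard.name.split(".rank", 1)[0]
        with shard.open(encoding="utf-8") as f:
            counts[label] += sum(1 for line in f if line.strip())
    return counts
```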

	## 🧾 Logging

	A full log is written to:

	```bash
	output_dir/qc_infer.log
	```

	It contains:
	- Number of files discovered
	- Number of lines indexed
	- Corrupted JSON lines
	- Missing text key errors
	- Batch failures (with stack traces)
	- Progress info

	The script:

	- ✅ Skips corrupted JSON
	- ✅ Skips invalid samples
	- ✅ Never crashes due to bad data
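The skip-and-log behaviour listed above can be sketched as a generator that never raises on bad input. This is a minimal illustration, not the code in `qc_infer.py`: the logger name, the message wording, and the `iter_valid_records` helper are all assumptions.

```python
import json
import logging

logger = logging.getLogger("qc_infer_sketch")  # hypothetical logger name


def iter_valid_records(lines, text_keys=("text", "content", "body")):
    """Yield parsed records, logging and skipping corrupted or unusable lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            logger.warning("line %d: corrupted JSON (%s)", line_number, exc)
            continue
        # Require a JSON object with a string under at least one text key.
        if not isinstance(record, dict) or not any(
            isinstance(record.get(key), str) for key in text_keys
        ):
            logger.warning("line %d: no usable text key", line_number)
            continue
        yield record
```

Because it is a generator over lines, it can wrap an open file handle directly and process the file as a stream rather than loading it into memory.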