
	---
	license: apache-2.0
	language:
	- az
	base_model: jhu-clsp/mmBERT-base
	pipeline_tag: text-classification
	tags:
	- azerbaijani
	- text-quality
	- data-filtering
	datasets:
	- LocalDoc/azerbaijani-text-quality-labeled
	---

	# Azerbaijani Text Quality Classifier

	Regression model that scores the quality of Azerbaijani web text on a
	continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived)
	before language-model pretraining.

	- Base model: jhu-clsp/mmBERT-base
	- Task: regression, single output (~0..3). Higher = cleaner text.
	- Max length: 4096 tokens

	## Score scale

	- 3 — clean, coherent Azerbaijani prose
	- 2 — substantial good prose mixed with junk (menus, footers, ads)
	- 1 — mostly junk, little recoverable prose
	- 0 — pure junk: navigation pages, spam, machine translation, non-Azerbaijani text
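Because the model emits a continuous value that can fall slightly outside 0..3, it can be handy to map a raw score back onto the four levels above. The following is a minimal sketch, not part of the released code: `score_to_bucket` and `BUCKET_MEANING` are hypothetical names, and clamping followed by round-half-up is an assumed convention.

```python
import math

# Short descriptions of the four levels, paraphrased from the score scale.
BUCKET_MEANING = {
    3: "clean, coherent prose",
    2: "good prose mixed with junk",
    1: "mostly junk",
    0: "pure junk",
}


def score_to_bucket(score: float) -> int:
    """Clamp a raw regression score to [0, 3] and round half up to an integer level.

    Hypothetical helper: the rounding convention is an assumption, not
    something the model card specifies.
    """
    clamped = min(3.0, max(0.0, score))
    # floor(x + 0.5) avoids Python's round-half-to-even behaviour in round().
    return int(math.floor(clamped + 0.5))


if __name__ == "__main__":
    for raw in (-0.2, 1.49, 1.5, 2.6, 3.4):
        level = score_to_bucket(raw)
        print(raw, level, BUCKET_MEANING[level])
```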

	## Usage

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	tok = AutoTokenizer.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
	model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
	model.eval()

	text = "..."
	enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
	with torch.no_grad():
	    # The model has a single regression output, so squeeze the logits to one float.
	    score = model(**enc).logits.squeeze().item()
	print(score)
	```
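For corpus filtering, the single-text call above can be wrapped in a function and applied document by document. The sketch below shows only the filtering step, under stated assumptions: `filter_by_quality` is a hypothetical helper, `score_fn` stands for any callable that returns the model score for one text, and the default threshold of 2.0 is an illustrative choice rather than a recommended value.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 2.0,  # assumed cut-off, tune on your own data
) -> Iterator[str]:
    """Yield only the documents whose quality score is at least `threshold`.

    `score_fn` is expected to wrap the tokenizer + model call shown in the
    usage snippet; it is passed in so the filter stays independent of how
    scores are computed.
    """
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


if __name__ == "__main__":
    # Stand-in scores for demonstration; a real run would call the model.
    fake_scores = {"doc a": 2.7, "doc b": 0.3, "doc c": 2.0}
    kept = list(filter_by_quality(fake_scores, fake_scores.get))
    print(kept)
```

Streaming the documents through a generator keeps memory use flat on a large web corpus; batching the model calls inside `score_fn` would be the natural next step for throughput.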

	## Limitations

	Training labels were generated by an LLM (Mistral-Small-24B), not by humans.
	The reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) therefore
	measure agreement with the LLM labels, not agreement with human judgement;
	the model has not yet been evaluated against a human-annotated test set.
	Use with this caveat in mind.
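To check the model against your own human-annotated sample, the two metrics named above can be recomputed in a few lines. This is a sketch under an assumption: "rounded accuracy" is taken here to mean the share of predictions that match the integer label after clamping to 0..3 and rounding to the nearest integer, which the card does not define explicitly. The function names are hypothetical.

```python
import math
from typing import Sequence


def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predicted scores and reference labels."""
    if len(preds) != len(labels) or not preds:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)


def rounded_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the integer label after rounding.

    Assumed definition: clamp to [0, 3], then round half up.
    """
    if len(preds) != len(labels) or not preds:
        raise ValueError("preds and labels must be non-empty and equal in length")
    hits = 0
    for p, y in zip(preds, labels):
        rounded = int(math.floor(min(3.0, max(0.0, p)) + 0.5))
        hits += rounded == y
    return hits / len(preds)


if __name__ == "__main__":
    # Toy numbers for illustration only, not real evaluation data.
    preds = [2.8, 0.4, 1.6, 2.2]
    labels = [3, 0, 1, 2]
    print(mse(preds, labels), rounded_accuracy(preds, labels))
```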