---
license: apache-2.0
language:
- az
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
tags:
- azerbaijani
- text-quality
- data-filtering
datasets:
- LocalDoc/azerbaijani-text-quality-labeled
---

# Azerbaijani Text Quality Classifier

Regression model that scores the quality of Azerbaijani web text on a continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived) before language-model pretraining.

- **Base model:** jhu-clsp/mmBERT-base
- **Task:** regression, single output (~0..3). Higher = cleaner text.
- **Max length:** 4096 tokens

## Score scale

- **3**: clean, coherent Azerbaijani prose
- **2**: substantial good prose mixed with junk (menus, footers, ads)
- **1**: mostly junk, little recoverable prose
- **0**: pure junk: navigation pages, spam, machine translation, non-Azerbaijani text

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model.eval()

text = "..."
enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    score = model(**enc).logits.squeeze().item()
print(score)
```

## Limitations

Training labels were generated by an LLM (Mistral-Small-24B), not by humans. Reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) measure **agreement with the LLM labels**, not agreement with human judgement; the latter has not yet been measured against a human-annotated test set. Use with this caveat in mind.
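
The two validation metrics quoted above can be computed along the following lines. This is a minimal sketch, not the training code: it assumes "rounded accuracy" means rounding each predicted score to the nearest integer, clipping it to the 0-3 scale, and comparing it with the integer label. The helper names (`to_label`, `mse`, `rounded_accuracy`) and the toy values are hypothetical.

```python
import math
from typing import Sequence


def to_label(score: float, lo: int = 0, hi: int = 3) -> int:
    """Round a continuous score half-up, then clip it to [lo, hi].

    Assumption: this is how a regression output maps to a discrete label.
    """
    return max(lo, min(hi, math.floor(score + 0.5)))


def mse(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between raw scores and reference labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def rounded_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Share of examples whose rounded, clipped score equals the label."""
    hits = sum(to_label(p) == y for p, y in zip(preds, labels))
    return hits / len(labels)


# Hypothetical toy data: model scores and reference labels for four texts.
preds = [2.8, 0.2, 1.6, 3.3]
labels = [3, 0, 1, 3]

print("MSE:", mse(preds, labels))
print("Rounded accuracy:", rounded_accuracy(preds, labels))
```

Passing human-annotated labels as `labels` instead of LLM-generated ones would give the agreement-with-humans figure that has not yet been measured.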