---
license: apache-2.0
language:
- az
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
tags:
- azerbaijani
- text-quality
- data-filtering
datasets:
- LocalDoc/azerbaijani-text-quality-labeled
---

# Azerbaijani Text Quality Classifier

Regression model that scores the quality of Azerbaijani web text on a
continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived)
before language-model pretraining.

- **Base model:** jhu-clsp/mmBERT-base
- **Task:** regression with a single output, roughly in the range 0 to 3. Higher = cleaner text.
- **Max length:** 4096 tokens

## Score scale

- **3**: clean, coherent Azerbaijani prose
- **2**: substantial good prose mixed with junk (menus, footers, ads)
- **1**: mostly junk, little recoverable prose
- **0**: pure junk (navigation pages, spam, machine translation, non-Azerbaijani text)
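
When discrete labels are needed, for instance to bucket documents by the classes above, the continuous score can be mapped to the nearest class. The following is a minimal sketch: the helper name `score_to_label` and the clamp-then-round rule are illustrative assumptions, not something the model card prescribes.

```python
import math


def score_to_label(score: float) -> int:
    """Map a continuous quality score to the nearest class in 0..3.

    Assumption: clamp to [0, 3], then round half up. The rounding rule
    is a choice made for this sketch.
    """
    clamped = min(3.0, max(0.0, score))
    # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even.
    return int(math.floor(clamped + 0.5))


for s in (-0.2, 1.49, 2.5, 3.4):
    print(s, "->", score_to_label(s))
```

Clamping first keeps slightly out-of-range regression outputs inside the labelled scale.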

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the single-output regression model.
tok = AutoTokenizer.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/azerbaijani-text-quality-classifier")
model.eval()  # disable dropout for inference

text = "..."
# Truncate inputs to the 4096-token maximum.
enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    # The single logit is the quality score.
    score = model(**enc).logits.squeeze().item()
print(score)
```
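
For corpus filtering, documents are usually scored in batches and kept only above some cutoff. The sketch below shows one way this might look. The names `make_model_scorer` and `filter_corpus`, the default threshold of 2.0, and the batch size of 16 are all hypothetical choices for illustration; the right threshold depends on the corpus and should be tuned.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

ScoreFn = Callable[[List[str]], List[float]]


def make_model_scorer(tok, model, max_length: int = 4096) -> ScoreFn:
    """Wrap a loaded tokenizer and model into a batch scoring function."""
    import torch  # imported here so filter_corpus can be used without torch

    def score_fn(batch: List[str]) -> List[float]:
        enc = tok(batch, padding=True, truncation=True,
                  max_length=max_length, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits  # shape (batch, 1) for a single-output head
        return logits.squeeze(-1).tolist()

    return score_fn


def _keep(batch: List[str], score_fn: ScoreFn, threshold: float):
    """Score one batch and yield the (text, score) pairs that pass."""
    for text, score in zip(batch, score_fn(batch)):
        if score >= threshold:
            yield text, score


def filter_corpus(texts: Iterable[str], score_fn: ScoreFn,
                  threshold: float = 2.0, batch_size: int = 16
                  ) -> Iterator[Tuple[str, float]]:
    """Yield (text, score) pairs whose score is at least `threshold`."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from _keep(batch, score_fn, threshold)
            batch = []
    if batch:  # score the final partial batch
        yield from _keep(batch, score_fn, threshold)
```

With the `tok` and `model` objects loaded as above, `list(filter_corpus(docs, make_model_scorer(tok, model)))` would return the documents that pass, where `docs` is any iterable of strings. Passing the scorer in as a function keeps the filtering logic testable without loading the model.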

## Limitations

Training labels were generated by an LLM (Mistral-Small-24B), not by humans.
Reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) measure
**agreement with the LLM labels**, not agreement with human judgement.
Agreement with human judgement has not yet been measured against a
human-annotated test set. Use with this caveat in mind.
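
To make the two reported metrics concrete, here is a small sketch of how they could be computed from predictions and reference labels. The card does not define "rounded accuracy" precisely, so the definition here (clamp to [0, 3], round half up, compare with the integer label) is an assumption, and the numbers are made-up toy values rather than model outputs.

```python
import math
from typing import Sequence


def mse(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between continuous scores and reference labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def rounded_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Share of predictions whose nearest class equals the reference label.

    Assumption: clamp to [0, 3] and round half up before comparing.
    """
    def to_class(p: float) -> int:
        return int(math.floor(min(3.0, max(0.0, p)) + 0.5))

    hits = sum(to_class(p) == y for p, y in zip(preds, labels))
    return hits / len(labels)


preds = [2.9, 0.2, 1.6, 2.4]   # made-up scores
labels = [3, 0, 1, 2]          # made-up reference labels
print(mse(preds, labels), rounded_accuracy(preds, labels))
```

The same two functions could be run against a human-annotated sample, once one exists, to estimate agreement with human judgement instead of agreement with the LLM labels.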