---
license: apache-2.0
language:
- az
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
tags:
- azerbaijani
- text-quality
- data-filtering
datasets:
- LocalDoc/azerbaijani-text-quality-labeled
---

# Azerbaijani Text Quality Classifier

Regression model that scores the quality of Azerbaijani web text on a
continuous 0-3 scale. Built to filter a raw web corpus (OSCAR-derived)
before language-model pretraining.

- **Base model:** jhu-clsp/mmBERT-base
- **Task:** regression, single output (~0..3). Higher = cleaner text.
- **Max length:** 4096 tokens

## Score scale

- **3** — clean, coherent Azerbaijani prose
- **2** — substantial good prose mixed with junk (menus, footers, ads)
- **1** — mostly junk, little recoverable prose
- **0** — pure junk: navigation pages, spam, machine translation, non-Azerbaijani text
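
Because the model returns a continuous value, mapping it back onto the four levels above needs a rounding rule. The following is a minimal sketch, assuming outputs are clamped to [0, 3] and rounded half up. The `score_to_bucket` helper and the `BUCKETS` table are hypothetical names introduced for illustration and are not part of the released model.

```python
import math

# Short paraphrases of the score scale above (illustrative only).
BUCKETS = {
    3: "clean prose",
    2: "good prose mixed with junk",
    1: "mostly junk",
    0: "pure junk",
}


def score_to_bucket(score: float) -> int:
    """Clamp a raw regression output to [0, 3] and round half up.

    Assumption: clamping and half-up rounding are choices made for this
    sketch; the model card does not specify a rounding rule.
    """
    clamped = min(3.0, max(0.0, score))
    # math.floor(x + 0.5) avoids Python's round-half-to-even behaviour.
    return int(math.floor(clamped + 0.5))


for raw in (-0.2, 1.4, 2.5, 3.3):
    bucket = score_to_bucket(raw)
    print(f"{raw:+.1f} -> {bucket} ({BUCKETS[bucket]})")
```

Clamping matters here because a regression head is not bounded, so a raw output can fall slightly below 0 or above 3.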

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "LocalDoc/azerbaijani-text-quality-classifier"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()  # inference mode: disables dropout

text = "..."
# Truncate to the model's 4096-token maximum.
enc = tok(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    # Single regression output: squeeze the (1, 1) logits down to a float.
    score = model(**enc).logits.squeeze().item()
print(score)  # roughly 0..3, higher = cleaner text
```
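
For corpus filtering, the per-text score can be wrapped in a simple threshold filter. The sketch below is one possible shape for that step, not the pipeline actually used to build the corpus. `filter_corpus`, `fake_score`, and the `min_score=2.0` cutoff are all assumptions made for illustration. In real use, `score_fn` would wrap the tokenizer and model calls from the snippet above.

```python
from typing import Callable, Iterable, Iterator, Tuple


def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    min_score: float = 2.0,  # hypothetical cutoff, not from the model card
) -> Iterator[Tuple[str, float]]:
    """Yield (doc, score) pairs whose score is at least min_score."""
    for doc in docs:
        score = score_fn(doc)
        if score >= min_score:
            yield doc, score


def fake_score(doc: str) -> float:
    # Stand-in scorer so the sketch runs without downloading the model:
    # it simply gives longer texts a higher score, capped at 3.0.
    return min(3.0, len(doc.split()) / 2)


docs = [
    "Home | Contact",
    "A longer stand-in sentence that the stub scores as clean.",
]
kept = list(filter_corpus(docs, fake_score))
for doc, score in kept:
    print(f"{score:.2f}  {doc}")
```

Taking the scorer as a callable keeps the filter independent of how scores are produced, so batching or caching can be added later without changing the filter. The right cutoff depends on how much data can be discarded and is best chosen by inspecting samples near the threshold.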

## Limitations

Training labels were generated by an LLM (Mistral-Small-24B), not by humans.
Reported validation metrics (val-MSE ~0.14, rounded accuracy ~0.83) measure
**agreement with the LLM labels**, not agreement with human judgement —
the latter has not yet been measured against a human-annotated test set.
Use with this caveat in mind.
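
To make the two reported metrics concrete, here is a minimal sketch of how they could be computed against any reference labels, LLM-generated or human. It assumes "rounded accuracy" means the share of examples whose prediction, clamped to [0, 3] and rounded half up, equals the integer label; the card does not define the metric, so treat that as a guess. The function names and the sample numbers are made up for illustration.

```python
import math
from typing import Sequence


def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean squared error between predictions and reference labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def rounded_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Share of predictions that round to the reference label.

    Assumption: predictions are clamped to [0, 3] and rounded half up.
    """
    hits = sum(
        int(math.floor(min(3.0, max(0.0, p)) + 0.5)) == y
        for p, y in zip(preds, labels)
    )
    return hits / len(labels)


# Made-up predictions and labels, purely to exercise the functions.
preds = [2.8, 0.4, 1.6, 2.2]
labels = [3, 0, 1, 2]

print(f"MSE: {mse(preds, labels):.2f}")
print(f"Rounded accuracy: {rounded_accuracy(preds, labels):.2f}")
```

Running the same two functions on a human-annotated sample would give the human-agreement numbers that are currently missing.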