Galahad-Quality-Classifier

Model Type: Sequence Classification Model
Base Model: Qwen/Qwen3-Embedding-0.6B

This text classification model supports qualitative data annotation, the creation of quality-specific data blends, and the addition of metadata tags based on document quality. Unlike NVIDIA's quality-classifier, it is a domain-specific classifier.

The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:

Class ID	Class Name	Score Mapping	Quality Description
2	High Quality	Score > 4	Highly informative, professional, and well-structured content.
1	Medium Quality	3 ≤ Score ≤ 4	Acceptable, non-offensive, and relevant content with moderate structure.
0	Low Quality	Score < 3	Poorly structured, potentially spammy, low-information, or inappropriate content.
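The score-to-class mapping in the table can be sketched as a small function. This is only an illustration of the thresholds listed above, not the labeling code actually used; the names `ID2LABEL` and `score_to_class_id` are hypothetical.

```python
# Class names copied from the table above; the dict name is illustrative.
ID2LABEL = {0: "Low Quality", 1: "Medium Quality", 2: "High Quality"}


def score_to_class_id(score: float) -> int:
    """Map a quality score to a class ID using the table's thresholds."""
    if score > 4:
        return 2  # High Quality: Score > 4
    if score >= 3:
        return 1  # Medium Quality: 3 <= Score <= 4
    return 0      # Low Quality: Score < 3


if __name__ == "__main__":
    for s in (2.5, 3, 4, 4.5):
        print(s, ID2LABEL[score_to_class_id(s)])
```

Note that both boundary values, 3 and 4, fall into the Medium Quality class.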

The scores were labeled by prompting DeepSeek-V3.2 (Chat) through the official API.

Results

For the 300M version:

Label	Precision	Recall	F1-Score	Support
0	0.81	0.66	0.73	119
1	0.78	0.84	0.81	354
2	0.42	0.39	0.41	69
Accuracy	—	—	0.75	542
Macro Avg	0.67	0.63	0.65	542
Weighted Avg	0.74	0.75	0.74	542

For the 0.6B model:

Label	Precision	Recall	F1-Score	Support
0	0.75	0.72	0.74	119
1	0.81	0.81	0.81	354
2	0.47	0.52	0.49	69
Accuracy	—	—	0.75	542
Macro Avg	0.68	0.68	0.68	542
Weighted Avg	0.76	0.75	0.75	542
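As a reading aid, the Macro Avg rows are the unweighted mean of the three per-class values, so every class counts equally regardless of its support. The sketch below recomputes them from the rounded figures in the two tables; the names `REPORTED` and `macro_average` are illustrative. Because the inputs are already rounded to two decimals, recomputed values can differ from the reported ones in the last digit.

```python
# Per-class (precision, recall, f1) values copied from the tables above.
REPORTED = {
    "300M": {0: (0.81, 0.66, 0.73), 1: (0.78, 0.84, 0.81), 2: (0.42, 0.39, 0.41)},
    "0.6B": {0: (0.75, 0.72, 0.74), 1: (0.81, 0.81, 0.81), 2: (0.47, 0.52, 0.49)},
}


def macro_average(per_class):
    """Return the unweighted mean of each metric across classes."""
    rows = list(per_class.values())
    return tuple(sum(row[i] for row in rows) / len(rows) for i in range(3))


if __name__ == "__main__":
    for name, per_class in REPORTED.items():
        precision, recall, f1 = macro_average(per_class)
        print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The macro average weights the small High Quality class (69 of 542 samples) as heavily as the others, which is why it sits below the weighted average in both tables.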

Limitations: The quality assessment is inherently subjective. The data was labeled by a single model, and the labels may differ if another model is used. The dataset is also relatively small, with a size of only 30k.