Galahad-Quality-Classifier

This is a text classification model designed to enable qualitative data annotation, facilitate the creation of quality-specific data blends, and allow metadata tags to be added based on document quality. Unlike NVIDIA's quality-classifier, this is a domain-specific classifier.

The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:

| Class ID | Class Name | Score Mapping | Quality Description |
|---|---|---|---|
| 2 | High Quality | Score > 4 | Highly informative, professional, and well-structured content. |
| 1 | Medium Quality | 3 ≤ Score ≤ 4 | Acceptable, non-offensive, and relevant content with moderate structure. |
| 0 | Low Quality | Score < 3 | Poorly structured, potentially spammy, low-information, or inappropriate content. |
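A minimal inference sketch follows, assuming the checkpoint loads with the standard `transformers` text-classification pipeline and that its labels end in the integer class ID (for example `LABEL_2`). Neither assumption is confirmed by this card. The repository ID is the one shown on the model page; the helper names `label_to_class_id` and `classify` are hypothetical.

```python
from typing import List

# Class names taken from the class table in this card.
CLASS_NAMES = {0: "Low Quality", 1: "Medium Quality", 2: "High Quality"}

# Repository ID of the 0.6B checkpoint as shown on the model page.
MODEL_ID = "TerenceLau/galahad-classifier-0.6B"


def label_to_class_id(label: str) -> int:
    """Parse a pipeline label such as 'LABEL_2' or '2' into an integer class ID.

    Assumption: the label ends in the class ID. If the checkpoint's config
    defines other label names, this raises ValueError and must be adapted.
    """
    class_id = int(label.rsplit("_", 1)[-1])
    if class_id not in CLASS_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return class_id


def classify(texts: List[str]) -> List[int]:
    """Return one class ID per input text (downloads the model on first use)."""
    # Imported lazily so the parsing helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=MODEL_ID)
    # truncation=True clips inputs longer than the model's maximum length.
    results = classifier(texts, truncation=True)
    return [label_to_class_id(r["label"]) for r in results]
```

Calling `classify(["some document text"])` returns a list of class IDs, which can be turned into names with `CLASS_NAMES`.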

The scores were labeled by prompting DeepSeek-V3.2 (Chat) through the official API.
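The thresholds in the class table can be expressed as a small function that converts a labeler score into a class ID. This is an illustrative sketch, not the author's labeling code; the function name `score_to_class_id` is hypothetical, and the score scale itself is not stated in this card.

```python
def score_to_class_id(score: float) -> int:
    """Map a labeler score to a class ID using the thresholds in the class table."""
    if score > 4:
        return 2  # High Quality: score > 4
    if score >= 3:
        return 1  # Medium Quality: 3 <= score <= 4
    return 0  # Low Quality: score < 3
```

Note that the boundaries are asymmetric: a score of exactly 4 is Medium Quality, and a score of exactly 3 is also Medium Quality.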

Results

For the 300M version:

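| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.81 | 0.66 | 0.73 | 119 |
| 1 | 0.78 | 0.84 | 0.81 | 354 |
| 2 | 0.42 | 0.39 | 0.41 | 69 |
| Accuracy | | | 0.75 | 542 |
| Macro Avg | 0.67 | 0.63 | 0.65 | 542 |
| Weighted Avg | 0.74 | 0.75 | 0.74 | 542 |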

For the 0.6B model:

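| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.75 | 0.72 | 0.74 | 119 |
| 1 | 0.81 | 0.81 | 0.81 | 354 |
| 2 | 0.47 | 0.52 | 0.49 | 69 |
| Accuracy | | | 0.75 | 542 |
| Macro Avg | 0.68 | 0.68 | 0.68 | 542 |
| Weighted Avg | 0.76 | 0.75 | 0.75 | 542 |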
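For readers unfamiliar with the summary rows, the sketch below recomputes the macro and weighted averages from the per-class rows of the 0.6B results. The macro average is the unweighted mean over the three classes, while the weighted average weights each class by its support. The names `ROWS_06B`, `macro_avg`, and `weighted_avg` are illustrative only.

```python
from typing import Dict, Tuple

# Per-class (precision, recall, f1, support) copied from the 0.6B results.
ROWS_06B: Dict[int, Tuple[float, float, float, int]] = {
    0: (0.75, 0.72, 0.74, 119),
    1: (0.81, 0.81, 0.81, 354),
    2: (0.47, 0.52, 0.49, 69),
}

PRECISION, RECALL, F1, SUPPORT = 0, 1, 2, 3


def macro_avg(rows: Dict[int, Tuple[float, float, float, int]], col: int) -> float:
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in rows.values()) / len(rows)


def weighted_avg(rows: Dict[int, Tuple[float, float, float, int]], col: int) -> float:
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[SUPPORT] for row in rows.values())
    return sum(row[col] * row[SUPPORT] for row in rows.values()) / total


for name, col in (("precision", PRECISION), ("recall", RECALL), ("f1", F1)):
    print(f"{name}: macro={macro_avg(ROWS_06B, col):.2f} "
          f"weighted={weighted_avg(ROWS_06B, col):.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one by about 0.01. The gap between the macro and weighted figures reflects the class imbalance: class 2 has only 69 of the 542 evaluation samples and the lowest per-class scores.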

Limitations: The quality assessment is inherently subjective, because the data is labeled by a single model and the labels may vary across models. The dataset is also relatively small, with a size of only 30k.
