Galahad
Qwen/Qwen3-Embedding-0.6B

This is a text classification model for qualitative data annotation. It can be used to build quality-specific data blends and to add metadata tags to documents based on their quality. Unlike NVIDIA's quality-classifier, this is a domain-specific classifier.
The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:
| Class ID | Class Name | Score Mapping | Quality Description |
|---|---|---|---|
| 2 | High Quality | Score > 4 | Highly informative, professional, and well-structured content. |
| 1 | Medium Quality | 3 ≤ Score ≤ 4 | Acceptable, non-offensive, and relevant content with moderate structure. |
| 0 | Low Quality | Score < 3 | Poorly structured, potentially spammy, low-information, or inappropriate content. |
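The thresholds in the table can be written as a small mapping function. The following is a minimal sketch, not the author's code: `score_to_class`, `tag_documents`, `select_blend`, and the metadata keys `quality_id` and `quality_label` are hypothetical names chosen for illustration, and the predicted class IDs are assumed to come from the classifier.

```python
from typing import Dict, List

# Class IDs and names copied from the table above.
CLASS_NAMES = {0: "Low Quality", 1: "Medium Quality", 2: "High Quality"}


def score_to_class(score: float) -> int:
    """Map a raw quality score to a class ID using the table's thresholds."""
    if score > 4:
        return 2  # High Quality: Score > 4
    if score >= 3:
        return 1  # Medium Quality: 3 <= Score <= 4
    return 0      # Low Quality: Score < 3


def tag_documents(docs: List[Dict], class_ids: List[int]) -> List[Dict]:
    """Attach quality metadata (hypothetical field names) to each document."""
    if len(docs) != len(class_ids):
        raise ValueError("docs and class_ids must have the same length")
    return [
        {**doc, "quality_id": cid, "quality_label": CLASS_NAMES[cid]}
        for doc, cid in zip(docs, class_ids)
    ]


def select_blend(tagged: List[Dict], min_class: int = 1) -> List[Dict]:
    """Keep only documents whose class ID is at least min_class."""
    return [doc for doc in tagged if doc["quality_id"] >= min_class]
```

For example, `select_blend(tagged, min_class=2)` would keep only documents tagged High Quality, while the default keeps Medium and High.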
The scores were produced by prompting DeepSeek-V3.2 (Chat) through the official API.
For the 300M version:
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.81 | 0.66 | 0.73 | 119 |
| 1 | 0.78 | 0.84 | 0.81 | 354 |
| 2 | 0.42 | 0.39 | 0.41 | 69 |
| Accuracy | — | — | 0.75 | 542 |
| Macro Avg | 0.67 | 0.63 | 0.65 | 542 |
| Weighted Avg | 0.74 | 0.75 | 0.74 | 542 |
For the 0.6B version:
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.75 | 0.72 | 0.74 | 119 |
| 1 | 0.81 | 0.81 | 0.81 | 354 |
| 2 | 0.47 | 0.52 | 0.49 | 69 |
| Accuracy | — | — | 0.75 | 542 |
| Macro Avg | 0.68 | 0.68 | 0.68 | 542 |
| Weighted Avg | 0.76 | 0.75 | 0.75 | 542 |
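The Macro Avg and Weighted Avg rows follow from the per-class rows: the macro average is the unweighted mean over the three classes, and the weighted average weights each class by its support. The sketch below recomputes both from the 0.6B table. The function names are illustrative, and because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one by 0.01.

```python
from typing import Dict, Tuple

# (precision, recall, f1, support) per class, copied from the 0.6B table.
PER_CLASS: Dict[int, Tuple[float, float, float, int]] = {
    0: (0.75, 0.72, 0.74, 119),
    1: (0.81, 0.81, 0.81, 354),
    2: (0.47, 0.52, 0.49, 69),
}


def macro_average(per_class: Dict[int, Tuple[float, float, float, int]]) -> Tuple[float, ...]:
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(per_class)
    return tuple(sum(row[i] for row in per_class.values()) / n for i in range(3))


def weighted_average(per_class: Dict[int, Tuple[float, float, float, int]]) -> Tuple[float, ...]:
    """Mean of precision, recall, and F1, weighted by each class's support."""
    total = sum(row[3] for row in per_class.values())
    return tuple(
        sum(row[i] * row[3] for row in per_class.values()) / total for i in range(3)
    )


if __name__ == "__main__":
    print("macro   :", [round(v, 2) for v in macro_average(PER_CLASS)])
    print("weighted:", [round(v, 2) for v in weighted_average(PER_CLASS)])
```

The gap between the macro and weighted rows reflects class imbalance: class 1 accounts for 354 of the 542 evaluation samples, so its stronger scores dominate the weighted average, while the weaker class 2 scores pull the macro average down.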
Limitations: The quality assessment is inherently subjective. The data was labeled by a single model, and labels may differ if another model is used. The dataset is also relatively small, at only 30k samples.