---
library_name: transformers
license: mit
base_model: agentlans/snowflake-arctic-embed-xs-zyda-2
tags:
- generated_from_trainer
- text-classification
- grammar-classification
metrics:
- accuracy
model-index:
- name: agentlans/snowflake-arctic-xs-grammar-classifier
  results:
  - task:
      type: text-classification
      name: Grammar Classification
    dataset:
      name: agentlans/grammar-classification
      type: agentlans/grammar-classification
    metrics:
    - type: accuracy
      value: 0.8724
      name: Accuracy
datasets:
- agentlans/grammar-classification
- liweili/c4_200m
language:
- en
pipeline_tag: text-classification
---

# snowflake-arctic-xs-grammar-classifier

This model is a fine-tuned version of [agentlans/snowflake-arctic-embed-xs-zyda-2](https://huggingface.co/agentlans/snowflake-arctic-embed-xs-zyda-2) for grammar classification. It achieves an accuracy of 0.8724 on the evaluation set.

## Model description

The snowflake-arctic-xs-grammar-classifier classifies English sentences by grammatical correctness.
It is based on the snowflake-arctic-embed-xs-zyda-2 model and has been fine-tuned on a grammar classification dataset derived from C4 (the Colossal Clean Crawled Corpus).

## Intended uses & limitations

This model is intended for classifying English sentences as grammatical or not. It can be used in applications such as writing assistance tools, educational software, or content moderation systems.

### Usage example

```python
from transformers import pipeline
import torch

# Use the first GPU if one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
    device=device,
)

text = "I absolutely loved this movie!"
result = classifier(text)
print(result)  # [{'label': 'grammatical', 'score': 0.8963921666145325}]
```

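When several sentences need to be screened at once, the pipeline output can be post-processed with a small helper. The sketch below is an illustration under stated assumptions, not part of the model's API: `filter_grammatical` and `min_score` are hypothetical names, and only the `grammatical` label is confirmed by the output shown above, so any other label is simply treated as a rejection.

```python
from typing import Callable, Dict, List, Union

# A classifier is anything that maps a list of texts to one
# {"label": ..., "score": ...} dict per text, like the pipeline above.
Prediction = Dict[str, Union[str, float]]
Classifier = Callable[[List[str]], List[Prediction]]


def filter_grammatical(
    texts: List[str],
    classify: Classifier,
    min_score: float = 0.5,  # hypothetical threshold, tune on your own data
) -> List[str]:
    """Keep texts labelled 'grammatical' with at least `min_score` confidence."""
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        # Only the 'grammatical' label is known from the model card;
        # every other label counts as a rejection here.
        if pred["label"] == "grammatical" and pred["score"] >= min_score
    ]
```

With the `classifier` object from the usage example, a call such as `filter_grammatical(sentences, classifier, min_score=0.8)` would keep only confident positives. Passing the classifier as an argument also makes the helper easy to test with a stub instead of the real model.
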
### Example Classifications

| Status | Text | Explanation |
|:--------:|------|-------------|
| ✔️ | I absolutely loved this movie! | Grammatically correct, clear sentence structure |
| ❌ | How do I shot web? | Grammatically incorrect, improper verb usage |
| ✔️ | Beware the Jabberwock, my son! | Poetic language, grammatically sound |
| ✔️ | Colourless green ideas sleep furiously. | Grammatically correct, though semantically nonsensical |
| ❌ | Has anyone really been far even as decided to use even go want to do look more like? | Completely incoherent and grammatically incorrect |

### Limitations

The model's performance is limited by the quality and diversity of its training data. It may not perform well on specialized or domain-specific text, or on languages other than English. Additionally, it may struggle with complex grammatical structures or nuanced language use.

## Training and evaluation data

The model was trained on the [agentlans/grammar-classification](https://huggingface.co/datasets/agentlans/grammar-classification) dataset, which contains 600 000 examples for binary classification of grammatical correctness in English. This dataset is derived from a subset of the C4_200M Synthetic Dataset for Grammatical Error Correction.

## Training procedure

### Training hyperparameters

- Learning rate: 5e-05
- Batch size: 128
- Number of epochs: 10
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: Linear

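To make the linear scheduler concrete, the following sketch computes the learning rate at a given optimizer step. It assumes zero warmup steps, which the card does not state, and takes the total of 37,500 steps from the final row of the training results; `linear_lr` is a hypothetical helper written for illustration, not the code used for training.

```python
def linear_lr(
    step: int,
    base_lr: float = 5e-5,      # learning rate listed above
    total_steps: int = 37_500,  # final step in the training results
    warmup_steps: int = 0,      # assumption: the card does not mention warmup
) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at the start, midpoint, and end of training
for step in (0, 18_750, 37_500):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at the base value, is halved by the end of epoch 5, and reaches zero at the last step.
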
<details>
<summary>Detailed Training Results</summary>

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:-----------------:|
| 0.5192 | 1.0 | 3750 | 0.4722 | 0.7738 | 61 440 000 |
| 0.4875 | 2.0 | 7500 | 0.4521 | 0.7881 | 122 880 000 |
| 0.4590 | 3.0 | 11250 | 0.3895 | 0.8227 | 184 320 000 |
| 0.4351 | 4.0 | 15000 | 0.3981 | 0.8197 | 245 760 000 |
| 0.4157 | 5.0 | 18750 | 0.3690 | 0.8337 | 307 200 000 |
| 0.3955 | 6.0 | 22500 | 0.3260 | 0.8585 | 368 640 000 |
| 0.3788 | 7.0 | 26250 | 0.3267 | 0.8566 | 430 080 000 |
| 0.3616 | 8.0 | 30000 | 0.3192 | 0.8621 | 491 520 000 |
| 0.3459 | 9.0 | 33750 | 0.3017 | 0.8707 | 552 960 000 |
| 0.3382 | 10.0 | 37500 | 0.2971 | 0.8724 | 614 400 000 |

</details>

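The step and token counts in the training results can be cross-checked against the batch size with a little arithmetic. This is a sketch under assumptions the card does not confirm: that 128 is the effective global batch size (a single device, no gradient accumulation) and that every example contributes the same number of tokens. The derived figures are inferences, not reported values.

```python
batch_size = 128               # from the training hyperparameters
steps_per_epoch = 3_750        # step count at epoch 1.0
tokens_per_epoch = 61_440_000  # "Input Tokens Seen" at epoch 1.0

# Assumption: one optimizer step consumes exactly one batch of 128 examples
examples_per_epoch = batch_size * steps_per_epoch
tokens_per_example = tokens_per_epoch / examples_per_epoch

print(examples_per_epoch)  # 480000
print(tokens_per_example)  # 128.0
```

If these assumptions hold, each epoch covers 480,000 training examples at 128 tokens apiece, which would suggest inputs padded or truncated to a fixed length of 128 tokens.
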
### Framework versions

- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.2.0
- Tokenizers: 0.20.3