	---
	library_name: transformers
	license: mit
	base_model: agentlans/snowflake-arctic-embed-xs-zyda-2
	tags:
	- generated_from_trainer
	- text-classification
	- grammar-classification
	metrics:
	- accuracy
	model-index:
	- name: agentlans/snowflake-arctic-xs-grammar-classifier
	  results:
	  - task:
	      type: text-classification
	      name: Grammar Classification
	    dataset:
	      name: agentlans/grammar-classification
	      type: agentlans/grammar-classification
	    metrics:
	    - type: accuracy
	      value: 0.8724
	      name: Accuracy
	datasets:
	- agentlans/grammar-classification
	- liweili/c4_200m
	language:
	- en
	pipeline_tag: text-classification
	---

	# snowflake-arctic-xs-grammar-classifier

	This model is a fine-tuned version of [agentlans/snowflake-arctic-embed-xs-zyda-2](https://huggingface.co/agentlans/snowflake-arctic-embed-xs-zyda-2) for grammar classification. It achieves an accuracy of 0.8724 on the evaluation set.

	## Model description

	The snowflake-arctic-xs-grammar-classifier classifies English sentences as grammatical or ungrammatical.
	It is based on snowflake-arctic-embed-xs-zyda-2 and was fine-tuned on a grammar classification dataset derived from C4 (the Colossal Clean Crawled Corpus).

	## Intended uses & limitations

	This model is intended for classifying the grammatical correctness of English sentences. It can be used in various applications such as writing assistance tools, educational software, or content moderation systems.

	### Usage example

	```python
	from transformers import pipeline
	import torch

	device = 0 if torch.cuda.is_available() else -1
	classifier = pipeline(
	    "text-classification",
	    model="agentlans/snowflake-arctic-xs-grammar-classifier",
	    device=device,
	)

	text = "I absolutely loved this movie!"
	result = classifier(text)
	print(result) # [{'label': 'grammatical', 'score': 0.8963921666145325}]
	```
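The pipeline returns a list of dicts, each with a `label` and a `score`, as in the printed output above. A minimal sketch of turning that into a yes/no decision follows. The helper name `is_grammatical`, the 0.5 default threshold, and the assumption that any label other than `grammatical` means the sentence was rejected are illustrative choices, not part of the model's API.

```python
def is_grammatical(result, threshold=0.5):
    """Return True if the top prediction is 'grammatical' with enough confidence.

    `result` is the list of dicts the text-classification pipeline returns
    for a single input, e.g. [{'label': 'grammatical', 'score': 0.89}].
    The threshold is a hypothetical tuning knob, not a documented setting.
    """
    top = result[0]
    return top["label"] == "grammatical" and top["score"] >= threshold


# Output copied from the usage example above.
sample = [{"label": "grammatical", "score": 0.8963921666145325}]
print(is_grammatical(sample))        # True
print(is_grammatical(sample, 0.95))  # False: score is below the stricter threshold
```

Raising the threshold trades recall for precision, which may suit applications where a false "grammatical" verdict is costlier than a false rejection.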

	### Example Classifications

	| Status | Text | Explanation |
	|:--------:|------|-------------|
	| ✔️ | I absolutely loved this movie! | Grammatically correct, clear sentence structure |
	| ❌ | How do I shot web? | Grammatically incorrect, improper verb usage |
	| ✔️ | Beware the Jabberwock, my son! | Poetic language, grammatically sound |
	| ✔️ | Colourless green ideas sleep furiously. | Grammatically correct, though semantically nonsensical |
	| ❌ | Has anyone really been far even as decided to use even go want to do look more like? | Completely incoherent and grammatically incorrect |

	### Limitations

	The model's performance is limited by the quality and diversity of its training data. It may not perform well on specialized or domain-specific text, or on languages other than English. Additionally, it may struggle with complex grammatical structures or nuanced language use.

	## Training and evaluation data

	The model was trained on the [agentlans/grammar-classification](https://huggingface.co/datasets/agentlans/grammar-classification) dataset, which contains 600 000 examples for binary classification of grammatical correctness in English. This dataset is derived from a subset of the C4_200M Synthetic Dataset for Grammatical Error Correction.

	## Training procedure

	### Training hyperparameters

	- Learning rate: 5e-05
	- Batch size: 128
	- Number of epochs: 10
	- Optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
	- Learning rate scheduler: Linear
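To make the linear scheduler concrete, here is a minimal sketch of the learning rate at a given step, using the base rate listed above and the final step count (37500) from the training results table. It assumes linear decay to zero with no warmup; the card does not state a warmup setting, so `warmup_steps=0` is an assumption, and `linear_lr` is a hypothetical helper rather than the code used for training.

```python
BASE_LR = 5e-05       # learning rate from the list above
TOTAL_STEPS = 37_500  # final step in the training results table


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero.

    warmup_steps=0 is an assumption; the model card does not report warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Start, midpoint (end of epoch 5), and end of training.
for step in (0, 18_750, 37_500):
    print(step, linear_lr(step))
```

Under these assumptions the rate halves by the end of epoch 5 and reaches zero at the final step.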

	<details>
	<summary>📊 Detailed Training Results</summary>

	| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
	|:-------------:|:-----:|:-----:|:---------------:|:--------:|:-----------------:|
	| 0.5192 | 1.0 | 3750 | 0.4722 | 0.7738 | 61 440 000 |
	| 0.4875 | 2.0 | 7500 | 0.4521 | 0.7881 | 122 880 000 |
	| 0.4590 | 3.0 | 11250 | 0.3895 | 0.8227 | 184 320 000 |
	| 0.4351 | 4.0 | 15000 | 0.3981 | 0.8197 | 245 760 000 |
	| 0.4157 | 5.0 | 18750 | 0.3690 | 0.8337 | 307 200 000 |
	| 0.3955 | 6.0 | 22500 | 0.3260 | 0.8585 | 368 640 000 |
	| 0.3788 | 7.0 | 26250 | 0.3267 | 0.8566 | 430 080 000 |
	| 0.3616 | 8.0 | 30000 | 0.3192 | 0.8621 | 491 520 000 |
	| 0.3459 | 9.0 | 33750 | 0.3017 | 0.8707 | 552 960 000 |
	| 0.3382 | 10.0 | 37500 | 0.2971 | 0.8724 | 614 400 000 |

	</details>
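The figures in the training results can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and full batches throughout. The resulting values (480 000 training examples per epoch, 128 tokens per example, an 80% training share of the 600 000 examples) are inferences from the reported numbers, not statements made elsewhere in this card.

```python
BATCH_SIZE = 128               # from the hyperparameters
STEPS_PER_EPOCH = 3_750        # step count at epoch 1.0
TOKENS_PER_EPOCH = 61_440_000  # input tokens seen at epoch 1.0
DATASET_SIZE = 600_000         # examples in agentlans/grammar-classification

# Examples processed in one epoch, assuming every batch is full.
examples_per_epoch = BATCH_SIZE * STEPS_PER_EPOCH

# Tokens per example; a constant value suggests fixed-length padding.
tokens_per_example = TOKENS_PER_EPOCH // examples_per_epoch

# Share of the full dataset used for training.
train_fraction = examples_per_epoch / DATASET_SIZE

print(examples_per_epoch, tokens_per_example, train_fraction)
```

If these assumptions hold, each input was padded or truncated to 128 tokens, and the remaining 20% of the data was held out from training.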

	### Framework versions

	- Transformers: 4.46.3
	- PyTorch: 2.5.1+cu124
	- Datasets: 3.2.0
	- Tokenizers: 0.20.3