Need for clarification and support
I am currently working on a grammar classification model, and I'm trying to make it lightweight and fast enough to use as a scorer for a synthetic dataset. It would be great if I could get some guidance from you!
Hi, what do you want to learn more about?
From my experience training this model, here are some things to watch out for:
- Synthetic training data won't work. You need human-generated grammar mistakes.
- The grammar-correct group should have similar content to the grammar-incorrect group. Otherwise, your classifier will be biased towards certain kinds of content.
- Prepare a validation dataset because this task is easy to overfit.
- Use small lightweight fast models due to the overfitting problem.
Hi agentlans, thanks for responding! It would be great if I could get your email or some other ID, if possible, so I can contact you directly, since your work on these topics is exactly what I want to learn and build on. I have many questions and have posted them in this comment. It would be great to get a reply or a chance to discuss this more!
Models:
• Is it necessary to start from a pretrained language model and then fine-tune it for the grammar classification task?
• Would it work to train a transformer from scratch, with an architecture whose final layer is a classification head, directly on properly labeled good-grammar/bad-grammar data (like the grammar classification dataset you created), especially for smaller models?
Regarding dataset construction:
• You mentioned that grammatically correct and incorrect sentences should ideally share similar semantic content.
Does this imply that the dataset should contain one correct sentence and multiple grammatically incorrect variants derived from it?
• Why wouldn’t it be sufficient to have correct sentences with certain content and incorrect sentences with completely different content?
• What kinds of failure modes or biases could arise if semantic alignment between correct and incorrect samples is not maintained?
For dataset creation:
• Would querying large language models to generate domain-controlled synthetic data be a good alternative to cleaning and filtering large raw text corpora?
• In your experience, does LLM-based controlled generation help improve coverage and balance for grammar classification tasks?
I would really appreciate any suggestions on recommended ideas or dataset corpora I can look into, especially focused on:
• Dataset design for this use case
• Ensuring grammatical diversity without introducing unintended biases.
My primary goal is to create a diverse dataset and train a small model that still generalizes well across a wide range of grammatical phenomena, so dataset diversity is critical.
I would highly appreciate it if you could help answer these questions. I’m genuinely trying to learn, contribute, and work in this area, and your work is particularly relevant to what I’m aiming to do.
Would be great to get a reply or a chance to discuss more about this!!!
My e-mail is langesant@outlook.com
• Is it necessary to start from a pretrained language model and then fine-tune it for the grammar classification task?
Practically, you must start from a pretrained English language model (for example, a pretrained BERT). Otherwise, the model won't recognize English at all. While it's possible to pretrain the English language model from scratch, that is, from randomly initialized weights, only the biggest AI companies and labs can afford to do that.
• Would it work to train a transformer from scratch, with an architecture whose final layer is a classification head, directly on properly labeled good-grammar/bad-grammar data (like the grammar classification dataset you created), especially for smaller models?
The model should already have a basic understanding of English before you fine-tune. The fine-tune itself involves adding a classification head (good/bad grammar) and updating the model's weights on the training data. This model was fine-tuned with all parameters unfrozen, but you can also freeze the base weights and only update the classification head. That's most useful if you don't have a lot of training data.
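The head-only option can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the actual training code: the frozen `nn.TransformerEncoder` here stands in for a real pretrained encoder like BERT, and the vocabulary size and dimensions are made up.

```python
import torch
import torch.nn as nn

class GrammarClassifier(nn.Module):
    """A frozen encoder plus a small trainable classification head."""

    def __init__(self, vocab_size=30522, d_model=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Freeze everything created so far: only the head (added below)
        # will have requires_grad=True and get updated during fine-tuning.
        for p in self.parameters():
            p.requires_grad = False
        self.head = nn.Linear(d_model, num_classes)  # good/bad grammar

    def forward(self, token_ids):
        x = self.encoder(self.embed(token_ids))
        return self.head(x.mean(dim=1))  # mean-pool over tokens, classify

model = GrammarClassifier()
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # ['head.bias', 'head.weight']

logits = model(torch.randint(0, 30522, (8, 16)))  # batch of 8 fake sentences
print(logits.shape)  # torch.Size([8, 2])
```

Passing only the head's parameters to the optimizer (or relying on `requires_grad` as here) is what makes this cheap enough for small datasets.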
• Does this imply that the dataset should contain one correct sentence and multiple grammatically incorrect variants derived from it?
• Why wouldn’t it be sufficient to have correct sentences with certain content and incorrect sentences with completely different content?
• What kinds of failure modes or biases could arise if semantic alignment between correct and incorrect samples is not maintained?
- It's not necessary to have multiple incorrect variants for each correct sentence. A balanced mix of correct and incorrect sentences is good enough.
- The problem with different domains is the confounding. For example, if you train a classifier on these datasets:
- mostly "correct grammar": agentlans/high-quality-english-sentences
- mostly "incorrect grammar": agentlans/bluesky
- You'd be training the model to classify "educational vs. social media posts", not "correct vs. incorrect grammar", which is what you actually want.
• Would querying large language models to generate domain-controlled synthetic data be a good alternative to cleaning and filtering large raw text corpora?
• In your experience, does LLM-based controlled generation help improve coverage and balance for grammar classification tasks?
It depends on what you're trying to do. But overall, LLM-generated data isn't as diverse as large raw text corpora, even if you already have a diverse list of domains (such as agentlans/library-classification-systems). In any case, you probably want human-generated grammar and spelling mistakes and not artificially mutated ones.
• Dataset design for this use case
• Ensuring grammatical diversity without introducing unintended biases.
- This dataset is probably the simplest design for the grammar classification task: agentlans/grammar-classification.
- You can also look for Grammatical Error Correction (GEC) datasets like agentlans/grammar-correction and split each correction pair into labelled rows for the classifier data. That gives well-balanced domains without any confounding effect between the classes.
My primary goal is to create a diverse dataset and train a small model that still generalizes well across a wide range of grammatical phenomena, so dataset diversity is critical.
Small models are the way to go for this classification task. It's not so complex that you need large LLMs and it needs to be fast enough to detect grammar mistakes in real time.
I would highly appreciate it if you could help answer these questions. I’m genuinely trying to learn, contribute, and work in this area, and your work is particularly relevant to what I’m aiming to do.
You're welcome and thanks for your interest.