Model Card for the Virginia Gardening Chatbot
Model Details
Introduction
Welcome to the Virginia Gardening chatbot! As someone who enjoys nature and gardening (and is learning about LLMs), I've noticed that many large, generalized models give horticultural advice that is either out of date or meant for a different region. If you open a new session with your LLM of choice and ask it, for example, "My rose leaves have white spots and holes in them. What causes this and how do I treat it?", you will most likely receive information gathered from across the web, covering a variety of regions and a variety of rose species. The more specialized the advice you need, the less certain you are to get the answer you want.
I wanted this chatbot designed for someone like me: someone with limited gardening experience who is, for example, in Hardiness Zone 7a, in a coastal climate, with clay soil.
So my model is trained very specifically for the climate and soil of Northern Virginia (though it should be helpful for anyone in the Mid-Atlantic region).
My goal was to create a chatbot that gave timely, relevant advice for this region, while still being a fairly lightweight model.
Model Data
While I got my information from a variety of sources, I primarily sourced from the following:
- The Virginia Cooperative Extension's Educational Resources (https://ext.vt.edu/)
- The Master Gardeners of Northern Virginia (https://mgnv.org/) Fact sheets and blogs
- The University of Maryland Extension (https://extension.umd.edu/home/) (this may seem counterintuitive, given it is a Virginia chatbot, but all of the information used was transferable due to geographic proximity; no state legal information was used)
- USDA Plants Database (https://plants.usda.gov/)
- Fairfax County Master Gardeners (https://fairfaxgardening.org/)
The dataset creation was multi-stage. First, data was collected using web scraping (from the sites that allowed it; some data was collected manually) and stored in JSON. Questions were then generated both programmatically and manually from this data.
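The collection code itself is not part of this card; purely as an illustration of that first scraping-to-JSON stage, a minimal sketch might look like the following. The URL, field names, and output file are placeholders, and real sites need site-specific parsing and permission checks.

```python
# Illustrative sketch only: fetch a page and store its text in a JSON Lines file.
import json
import requests
from bs4 import BeautifulSoup

url = "https://example.org/some-gardening-fact-sheet"  # placeholder, not an actual source
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Keep the page title and paragraph text; real pages need site-specific parsing.
doc = {
    "source": url,
    "title": soup.title.get_text(strip=True) if soup.title else "",
    "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
}

with open("scraped_pages.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(doc) + "\n")
```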
Of the programmatically generated questions, some were created by feeding the tokenized JSON to meta-llama/Llama-3.2-1B to create question/answer pairs. More structured data, such as that from the USDA Plants Database, was turned into questions using a simple Python script.
For example, if the database had a scientific name of "Parthenocissus quinquefolia" and a common name of "Virginia creeper", the generated question/answer pair was "What is the common name of Parthenocissus quinquefolia?" "The common name of Parthenocissus quinquefolia is the Virginia creeper."
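The actual script is not included here, but a minimal sketch of that template-based approach might look like the following. The record field names are assumptions for illustration, not the real USDA column names.

```python
# Illustrative sketch: turn structured plant records into question/answer pairs.
import json

def make_qa_pairs(records):
    """Build Q/A pairs from records with 'scientific_name' and 'common_name' keys
    (field names are assumed for this example)."""
    pairs = []
    for rec in records:
        sci = rec["scientific_name"]
        common = rec["common_name"]
        pairs.append({
            "question": f"What is the common name of {sci}?",
            "answer": f"The common name of {sci} is the {common}.",
        })
    return pairs

records = [{"scientific_name": "Parthenocissus quinquefolia", "common_name": "Virginia creeper"}]
print(json.dumps(make_qa_pairs(records), indent=2))
```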
Still other questions were created manually and reviewed for structure and completeness. After QA pair generation, results were reviewed and stripped of duplicates and nonsense entries. Initial results were poor, so a reduced dataset is currently in use for training. This dataset was highly curated for both information and tone, and consists of 1,000 high-quality examples. The goal is to expand this training set in the future.
The data was given an 80/15/5 split, for training, testing, and validation.
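As a rough illustration of that split (the file names and JSON Lines format are assumptions; the card does not specify them):

```python
# Illustrative 80/15/5 train/test/validation split over the Q/A pairs.
import json
import random

with open("qa_pairs.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.80 * n)]
test = pairs[int(0.80 * n): int(0.95 * n)]
validation = pairs[int(0.95 * n):]

for name, split in [("train", train), ("test", test), ("validation", validation)]:
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for ex in split:
            f.write(json.dumps(ex) + "\n")
```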
Model Methodology
- At a high level, this model is based on google/gemma-2-2b-it.
- Evaluation benchmarks: GSM8K, HellaSwag, PIQA
- Training data: Curated question/answer pairs on local gardening topics
Model Parameters
My final choice in parameters was determined largely by the size of my dataset.
I found I needed stricter token sizes and larger learning rates to avoid overfitting.
The final parameters for LoRA were:

| Parameter | Value |
|---|---|
| R | 64 |
| Alpha | 64 |
| Dropout | 0.5 |
| Learning rate | 0.00002 |
| Epochs | 5 |
| Batch size | 1 |
I found that while changing the alpha and dropout didn't affect my scores too strongly, the model was most sensitive to the learning rate. For example, a learning rate of 0.0002 resulted in a training loss of over 6, while a learning rate of 0.000002 gave an implausibly low loss of 0.0041. The final learning rate of 0.00002 yielded a far more reasonable 1.8.
The final combination of the highly curated, smaller dataset and the switch to LoRA resulted in a model that responded far more accurately and in the intended style. While I found this result counterintuitive at first, given the small model size, my theory is that Gemma was already well suited to the gardening chatbot task. And since LoRA updates relatively few parameters, a highly curated dataset that was specific to the region and formatted for tone was sufficient to guide Gemma to my intended result.
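For readers who want to reproduce a similar setup, the table above maps roughly onto a peft/transformers configuration like the sketch below. This is a minimal illustration, not the exact training script; the target modules, output directory, and trainer wiring are assumptions, and only the hyperparameters mirror the table.

```python
# Rough sketch of a LoRA fine-tune using the hyperparameters listed above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base_model = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.5,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="va-gemma-lora",   # placeholder output path
    learning_rate=2e-5,           # 0.00002, from the table
    num_train_epochs=5,
    per_device_train_batch_size=1,
)
# A Trainer (or SFTTrainer) would then be built from training_args and the
# tokenized Q/A dataset; that wiring is omitted here.
```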
Model Evaluation - Model Choice and Benchmark Performance
I wanted to choose a model that was both conversational in tone and had a good aptitude for scientific domains.
As such, I chose a mixture of benchmarks primarily intended to test tone, reasoning, and general aptitude in scientific questions.
I chose GSM8K for its strength in general reasoning, and because it was the closest widely accepted benchmark to scientific reasoning specifically.
HellaSwag was selected as a general-purpose LLM/chatbot-style benchmark that is also known for testing common sense.
PIQA was selected for its ability to check commonsense reasoning.
I chose three models to compare:
- Qwen/Qwen3-4B
- meta-llama/Llama-3.2-1B
- google/gemma-2-2b-it

Each model was evaluated on all three benchmarks, tuned and untuned, and then on the holdout data.
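The card does not name the benchmark harness used for the tables below; one plausible way to run these three benchmarks is EleutherAI's lm-evaluation-harness, roughly as sketched here. The exact tool and settings behind the reported numbers are not confirmed.

```python
# Hypothetical benchmark run; the actual harness and settings are unstated in this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2-2b-it",
    tasks=["gsm8k", "hellaswag", "piqa"],
    batch_size=1,
)
print(results["results"])  # per-task accuracy and standard error
```
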
| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Gemma Base | 0.45 | 0.53 | 0.90 |
| Standard error | Gemma Base | 0.09 | 0.09 | 0.05 |
| Accuracy | Gemma Tuned | 0.60 | 0.59 | 0.43 |
| Standard error | Gemma Tuned | 0.09 | 0.08 | 0.09 |

| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Qwen Base | 0.83 | 0.53 | 0.77 |
| Standard error | Qwen Base | 0.06 | 0.09 | 0.8 |
| Accuracy | Qwen Tuned | 0.09 | 0.43 | 0.5 |
| Standard error | Qwen Tuned | 0.09 | 0.09 | 0.1 |

| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Meta Llama Base | 0.06 | 0.50 | 0.75 |
| Standard error | Meta Llama Base | 0.05 | 0.09 | 0.7 |
| Accuracy | Meta Llama Tuned | 0.33 | 0.40 | |
| Standard error | Meta Llama Tuned | 0.09 | 0.09 | |
While no model was a clear standout, Gemma consistently responded the most naturally, and was the easiest to tweak the tone of.
I also tested the models on a small holdout set of data, roughly 5% of the total volume of question/answer pairs, using BERTScore.
All three models were evaluated on the holdout data; a minimal sketch of that scoring step is shown below, followed by the results.
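This sketch uses the Hugging Face `evaluate` wrapper around BERTScore; the example prediction and reference strings are made up for illustration.

```python
# Illustrative BERTScore comparison of generated answers against holdout references.
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["Plant tomatoes after the last frost, usually early to mid May."]
references = ["In Virginia, tomatoes are typically planted after the last frost in early May."]

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
precision = sum(scores["precision"]) / len(scores["precision"])
f1 = sum(scores["f1"]) / len(scores["f1"])
print(f"Precision: {precision:.2f}, F1: {f1:.2f}")
```
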
| Metric | Model Type | Gemma | Qwen | Meta Llama |
|---|---|---|---|---|
| Precision | Base | 0.76 | 0.75 | 0.77 |
| F1 | Base | 0.78 | 0.74 | 0.77 |
| Precision | Tuned | 0.80 | 0.80 | 0.73 |
| F1 | Tuned | 0.81 | 0.83 | 0.79 |
The results here were promising, though of course the small sample size limits how much we can infer.
While Qwen performed well on the benchmark metrics, it consistently rambled and did not maintain its Virginia-specific tuning.
As such, Gemma was chosen as a strongly performing model that was also responsive to tuning.
Model Usage and Example Code
This chatbot is designed to help mid-Atlantic gardeners get information accurate to their area, without the generalized information or advice that can be found when querying a larger model.
Below is some sample code to get you started!
I recommend including the prompt formatting in order to get the best answer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "YOUR_USERNAME/va-gemma"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def format_prompt(question: str) -> str:
    # Wrap the question in Gemma's chat-turn format with a short instruction.
    return (
        "<bos><start_of_turn>user\n"
        "You are a friendly Virginia gardening assistant.\n"
        "Answer concisely in 1 to 3 sentences.\n"
        f"{question}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

question = "When is the best time to plant tomatoes in Virginia?"
prompt = format_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.3,
        top_p=0.9,
    )

answer = tokenizer.decode(output[0], skip_special_tokens=True)
print(answer)
```
| Sample Question | Sample Answer |
|---|---|
| What shrubs are well-suited to drier, well-drained sites in Northern Virginia? | Kalmia latifolia (mountain laurel), Physocarpus opulifolius (common or Eastern ninebark), and Viburnum prunifolium (black haw) are all native Virginia shrubs that will grow well in dry, well-drained soil in Northern Virginia. |
Intended Uses
This model is designed to give gardeners and farmers in Virginia and the Mid-Atlantic region advice. It is not guaranteed to be correct, so please double-check its recommendations.
While the model was trained with some information on pesticides and poisons, the focus was primarily horticultural, so it should not be used as a primary source on any toxic materials.
Bias, Risks, and Limitations
There are several limitations with the model that lead to a number of risks. Limitations include:
- A small dataset size: The primary limiting factor is the small number of training examples.
- Smaller, lightweight model: Power was sacrificed for speed and portability. Thus, the model is not as robust as a larger model.
- No current RAG implementation: As changes in the Virginia landscape occur, some of this information will fall out of date.
Models Used
Models Tested
https://huggingface.co/google/gemma-2-2b-it
https://huggingface.co/Qwen/Qwen3-4B
https://huggingface.co/meta-llama/Llama-3.2-1B
Models Used for Question Generation
https://huggingface.co/meta-llama/Llama-3.2-1B
Models Used for Evaluation
https://huggingface.co/spaces/evaluate-metric/bertscore