Model Card for the Virginia Gardening Chatbot
Model Details
Introduction
Welcome to the Virginia Gardening chatbot! As someone who enjoys nature and gardening (and is learning about LLMs), I've noticed that many large, generalized models give horticultural advice that is either out of date or meant for a different region. If you open a new session with your LLM of choice and ask it, for example, "My rose leaves have white spots and holes in them. What causes this and how do I treat it?", you will most likely receive information gathered from across the web, covering a variety of regions and a variety of rose species. The more specialized the advice you need, the less certain you are to get the answer you want.
I wanted this chatbot designed for someone like me: someone with limited gardening experience who is, for example, in Hardiness Zone 7a, in a coastal climate, with clay soil.
So my model is trained very specifically for the climate and soil of Northern Virginia (though it should be helpful for anyone in the Mid-Atlantic region).
My goal was to create a chatbot that gave timely, relevant advice for this region, while still being a fairly lightweight model.
Model Data
While I got my information from a variety of sources, I primarily sourced from the following:
- The Virginia Cooperative Extension's Educational Resources (https://ext.vt.edu/)
- The Master Gardeners of Northern Virginia (https://mgnv.org/) Fact sheets and blogs
- The University of Maryland Extension (https://extension.umd.edu/home/) (this may seem counterintuitive, given it is a Virginia chatbot, but all of the information used was transferable due to geographic proximity; no state legal information was used)
- USDA Plants Database (https://plants.usda.gov/)
- Fairfax County Master Gardeners (https://fairfaxgardening.org/)
The dataset creation was multi-stage. First, data was collected using web scraping (from the sites that allowed it; some data was collected manually) and stored in JSON. Questions were then generated both programmatically and manually from this data.
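The collection code itself is not part of this card; purely as an illustration of that first scraping-to-JSON stage, a minimal sketch might look like the following. The URL, field names, and output file are placeholders, and real sites need site-specific parsing and permission checks.

```python
# Illustrative sketch only: fetch a page and store its text in a JSON Lines file.
import json
import requests
from bs4 import BeautifulSoup

url = "https://example.org/some-gardening-fact-sheet"  # placeholder, not an actual source
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Keep the page title and paragraph text; real pages need site-specific parsing.
doc = {
    "source": url,
    "title": soup.title.get_text(strip=True) if soup.title else "",
    "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
}

with open("scraped_pages.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(doc) + "\n")
```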
Of the programmatically generated questions, some were created by feeding the tokenized JSON to meta-llama/Llama-3.2-1B to create question/answer pairs. More structured data, such as that from the USDA Plants Database, was turned into questions using a simple Python script.
For example, if the database had a scientific name of "Parthenocissus quinquefolia" and a common name of "Virginia creeper", the generated question/answer pair was "What is the common name of Parthenocissus quinquefolia?" "The common name of Parthenocissus quinquefolia is the Virginia creeper."
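The actual script is not included here, but a minimal sketch of that template-based approach might look like the following. The record field names are assumptions for illustration, not the real USDA column names.

```python
# Illustrative sketch: turn structured plant records into question/answer pairs.
import json

def make_qa_pairs(records):
    """Build Q/A pairs from records with 'scientific_name' and 'common_name' keys
    (field names are assumed for this example)."""
    pairs = []
    for rec in records:
        sci = rec["scientific_name"]
        common = rec["common_name"]
        pairs.append({
            "question": f"What is the common name of {sci}?",
            "answer": f"The common name of {sci} is the {common}.",
        })
    return pairs

records = [{"scientific_name": "Parthenocissus quinquefolia", "common_name": "Virginia creeper"}]
print(json.dumps(make_qa_pairs(records), indent=2))
```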
Still other questions were created manually and reviewed for structure and completeness. After QA pair generation, results were reviewed and stripped of duplicates and nonsense entries. Initial results were poor, so a reduced dataset is currently in use for training. This dataset was highly curated for both information and tone, and consists of 1,000 high-quality examples. The goal is to expand this training set in the future.
The data was given an 80/15/5 split, for training, testing, and validation.
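As a rough illustration of that split (the file names and JSON Lines format are assumptions; the card does not specify them):

```python
# Illustrative 80/15/5 train/test/validation split over the Q/A pairs.
import json
import random

with open("qa_pairs.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.80 * n)]
test = pairs[int(0.80 * n): int(0.95 * n)]
validation = pairs[int(0.95 * n):]

for name, split in [("train", train), ("test", test), ("validation", validation)]:
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for ex in split:
            f.write(json.dumps(ex) + "\n")
```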
Model Methodology
- At a high level, this model is based on google/gemma-2-2b-it.
- Evaluation benchmarks: GSM8K, HellaSwag, PIQA
- Training data: Curated question/answer pairs on local gardening topics
Model Parameters
My final choice in parameters was determined largely by the size of my dataset.
I found I needed stricter token sizes and larger learning rates to avoid overfitting.
The final parameters for LoRA were:

| Parameter | Value |
|---|---|
| R | 64 |
| Alpha | 64 |
| Dropout | 0.5 |
| Learning rate | 0.00002 |
| Epochs | 5 |
| Batch size | 1 |
I found that while changing the alpha and dropout didn't affect my scores too strongly, the model was most sensitive to the learning rate. For example, a learning rate of 0.0002 resulted in a training loss of over 6, while a learning rate of 0.000002 gave an implausibly low loss of 0.0041. The final learning rate of 0.00002 yielded a far more reasonable 1.8.
The final combination of the highly curated, smaller dataset and the switch to LoRA resulted in a model that responded far more accurately and in the intended style. While I found this result counterintuitive at first, given the small model size, my theory is that Gemma was already well suited to the gardening chatbot task. And since LoRA updates relatively few parameters, a highly curated dataset that was specific to the region and formatted for tone was sufficient to guide Gemma to my intended result.
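For readers who want to reproduce a similar setup, the table above maps roughly onto a peft/transformers configuration like the sketch below. This is a minimal illustration, not the exact training script; the target modules, output directory, and trainer wiring are assumptions, and only the hyperparameters mirror the table.

```python
# Rough sketch of a LoRA fine-tune using the hyperparameters listed above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

base_model = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.5,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="va-gemma-lora",   # placeholder output path
    learning_rate=2e-5,           # 0.00002, from the table
    num_train_epochs=5,
    per_device_train_batch_size=1,
)
# A Trainer (or SFTTrainer) would then be built from training_args and the
# tokenized Q/A dataset; that wiring is omitted here.
```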
Model Evaluation - Model Choice and Benchmark Performance
I wanted to choose a model that was both conversational in tone and had a good aptitude for scientific domains.
As such, I chose a mixture of benchmarks primarily intended to test tone, reasoning, and general aptitude in scientific questions.
I chose GSM8K for its strength in general reasoning, and because it was the closest widely accepted benchmark to scientific reasoning specifically.
HellaSwag was selected as a general-purpose LLM/chatbot-style benchmark that is also known for testing common sense.
PIQA was selected for its ability to check commonsense reasoning.
I chose three models to compare:
- Qwen/Qwen3-4B
- meta-llama/Llama-3.2-1B
- google/gemma-2-2b-it

Each model was evaluated on all three benchmarks, tuned and untuned, and then on the holdout data.
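The card does not name the benchmark harness used for the tables below; one plausible way to run these three benchmarks is EleutherAI's lm-evaluation-harness, roughly as sketched here. The exact tool and settings behind the reported numbers are not confirmed.

```python
# Hypothetical benchmark run; the actual harness and settings are unstated in this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2-2b-it",
    tasks=["gsm8k", "hellaswag", "piqa"],
    batch_size=1,
)
print(results["results"])  # per-task accuracy and standard error
```
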
| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Gemma Base | 0.45 | 0.53 | 0.90 |
| Standard error | Gemma Base | 0.09 | 0.09 | 0.05 |
| Accuracy | Gemma Tuned | 0.60 | 0.59 | 0.43 |
| Standard error | Gemma Tuned | 0.09 | 0.08 | 0.09 |

| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Qwen Base | 0.83 | 0.53 | 0.77 |
| Standard error | Qwen Base | 0.06 | 0.09 | 0.8 |
| Accuracy | Qwen Tuned | 0.09 | 0.43 | 0.5 |
| Standard error | Qwen Tuned | 0.09 | 0.09 | 0.1 |

| Metric | Model | GSM8K | HellaSwag | PIQA |
|---|---|---|---|---|
| Accuracy | Meta Llama Base | 0.06 | 0.50 | 0.75 |
| Standard error | Meta Llama Base | 0.05 | 0.09 | 0.7 |
| Accuracy | Meta Llama Tuned | 0.33 | 0.40 | |
| Standard error | Meta Llama Tuned | 0.09 | 0.09 | |
While no model was a clear standout, Gemma consistently responded the most naturally, and was the easiest to tweak the tone of.
I also tested the models on a small holdout set of data, roughly 5% of the total volume of question/answer pairs, using BERTScore.
All three models were evaluated on the holdout data; a minimal sketch of that scoring step is shown below, followed by the results.
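This sketch uses the Hugging Face `evaluate` wrapper around BERTScore; the example prediction and reference strings are made up for illustration.

```python
# Illustrative BERTScore comparison of generated answers against holdout references.
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["Plant tomatoes after the last frost, usually early to mid May."]
references = ["In Virginia, tomatoes are typically planted after the last frost in early May."]

scores = bertscore.compute(predictions=predictions, references=references, lang="en")
precision = sum(scores["precision"]) / len(scores["precision"])
f1 = sum(scores["f1"]) / len(scores["f1"])
print(f"Precision: {precision:.2f}, F1: {f1:.2f}")
```
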
| Metric | Model Type | Gemma | Qwen | Meta Llama |
|---|---|---|---|---|
| Precision | Base | 0.76 | 0.75 | 0.77 |
| F1 | Base | 0.78 | 0.74 | 0.77 |
| Precision | Tuned | 0.80 | 0.80 | 0.73 |
| F1 | Tuned | 0.81 | 0.83 | 0.79 |
The results here were promising, though of course the small sample size limits how much we can infer.
While Qwen performed well on the benchmark metrics, it consistently rambled and did not maintain its Virginia-specific tuning.
As such, Gemma was chosen as a strongly performing model that was also responsive to tuning.
Model Usage and Example Code
This chatbot is designed to help mid-Atlantic gardeners get information accurate to their area, without the generalized information or advice that can be found when querying a larger model.
Below is some sample code to get you started!
I recommend including the prompt formatting in order to get the best answer.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "YOUR_USERNAME/va-gemma"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def format_prompt(question: str) -> str:
    # Wrap the question in Gemma's chat-turn format with a short instruction.
    return (
        "<bos><start_of_turn>user\n"
        "You are a friendly Virginia gardening assistant.\n"
        "Answer concisely in 1 to 3 sentences.\n"
        f"{question}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

question = "When is the best time to plant tomatoes in Virginia?"
prompt = format_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.3,
        top_p=0.9,
    )

answer = tokenizer.decode(output[0], skip_special_tokens=True)
print(answer)
```
| Sample Question | Sample Answer |
|---|---|
| What shrubs are well-suited to drier, well-drained sites in Northern Virginia? | Kalmia latifolia (mountain laurel), Physocarpus opulifolius (common or Eastern ninebark), and Viburnum prunifolium (black haw) are all native Virginia shrubs that will grow well in dry, well-drained soil in Northern Virginia. |
Intended Uses
This model is designed to give gardeners and farmers in Virginia and the Mid-Atlantic region advice. It is not guaranteed to be correct, so please double-check its recommendations.
While the model was trained with some information on pesticides and poisons, the focus was primarily horticultural, so it should not be used as a primary source on any toxic materials.
Bias, Risks, and Limitations
There are several limitations with the model that lead to a number of risks. Limitations include:
- A small dataset size: The primary limiting factor is the small number of training examples.
- Smaller, lightweight model: Power was sacrificed for speed and portability. Thus, the model is not as robust as a larger model.
- No current RAG implementation: As changes in the Virginia landscape occur, some of this information will fall out of date.
Models Used
Models Tested
https://huggingface.co/google/gemma-2-2b-it
https://huggingface.co/Qwen/Qwen3-4B
https://huggingface.co/meta-llama/Llama-3.2-1B
Models Used for Question Generation
https://huggingface.co/meta-llama/Llama-3.2-1B
Models Used for Evaluation
https://huggingface.co/spaces/evaluate-metric/bertscore