Recipe Extractor - Gemma-3 270M Fine-tuned
This is a fine-tuned version of Gemma-3 270M trained to extract structured JSON-LD recipe data from unstructured blog posts. The model can parse messy recipe blog posts (with stories, ads, and chaotic formatting) and output clean, valid schema.org Recipe objects.
Model Details
Model Description
This model extracts structured recipe information from natural language text in the form of JSON-LD following the schema.org Recipe specification. It's designed to handle various blog post styles including:
- Minimal, organized formats
- Fluffy blog posts with stories and advertisements
- Chaotic, poorly formatted text
- Short Instagram-style posts
- Extremely verbose, unstructured content
The model was fine-tuned using LoRA (Low-Rank Adaptation) to maintain efficiency while achieving good extraction accuracy.
- Developed by: Vlad Rusu
- Model type: Text Generation (Recipe Extraction)
- Language(s): English
- License: MIT
- Fine-tuned from model: unsloth/gemma-3-270m-it
- Training Framework: Unsloth + PEFT (LoRA)
Model Sources
- Repository: https://github.com/v-rusu/finetune-recipe-extractor
- Dataset: https://huggingface.co/datasets/v-rusu/recipe-extractor-dataset
- Developer Website: https://vladr.com
- LinkedIn: https://www.linkedin.com/in/vrusu
Uses
Direct Use
The model can be used directly to extract recipe information from blog posts, social media posts, or any text containing recipe information. It outputs valid JSON-LD in schema.org Recipe format, which can be:
- Embedded in web pages for SEO
- Used in recipe management applications
- Parsed by search engines and recipe aggregators
- Stored in structured databases
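For reference, here is a minimal sketch of the shape of object the model targets. The field values are illustrative examples, not model output:

```python
import json

# A hypothetical schema.org Recipe object in the shape the model is
# trained to emit; values are illustrative only.
example_output = """
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Chocolate Chip Cookies",
  "recipeIngredient": ["2 cups all-purpose flour", "1 cup chocolate chips"],
  "recipeInstructions": [
    {"@type": "HowToStep", "text": "Preheat the oven to 350°F."}
  ],
  "totalTime": "PT30M"
}
"""

recipe = json.loads(example_output)
print(recipe["@type"])  # Recipe
```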
Example Usage
```python
from unsloth import FastLanguageModel

# Load the fine-tuned model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="v-rusu/recipe-extractor-gemma-3-270m",
    max_seq_length=8192,
    load_in_4bit=False,
)

messages = [
    {"role": "system", "content": "You are a recipe extraction assistant. Extract recipe information from the provided text and output it as a valid JSON-LD object following the schema.org Recipe format."},
    {"role": "user", "content": "Extract recipe information from this text:\n\n[your blog post text here]"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to("cuda")

# Gemma-3 recommended sampling settings
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1500,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
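Because the decoded response contains the full chat transcript, a small post-processing step is usually needed to isolate the JSON object. A minimal sketch, assuming the model emits a single `{...}` block (the helper name is hypothetical, not part of this project):

```python
import json
import re

def extract_json(response: str):
    """Pull the first-to-last {...} span out of a decoded chat response
    and parse it. Returns None if nothing parses as JSON."""
    # Greedy match from the first '{' to the last '}' is a simple
    # heuristic that works when the model emits one JSON object.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

demo = 'Sure! Here is the extracted recipe:\n{"@context": "https://schema.org", "@type": "Recipe", "name": "Pancakes"}'
print(extract_json(demo)["name"])  # Pancakes
```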
Downstream Use
The extracted JSON-LD can be integrated into:
- Recipe website SEO optimization
- Recipe aggregation platforms
- Meal planning applications
- Nutritional analysis tools
- Content management systems
Out-of-Scope Use
This model is not suitable for:
- Medical or dietary advice
- Allergen detection (requires specialized models)
- Nutritional calculation (outputs rely on source accuracy)
- Non-recipe content extraction
- Languages other than English
Bias, Risks, and Limitations
- Dataset Bias: The model was trained on data derived from AllRecipes, which may not represent global cuisine diversity
- Synthetic Data: Training data was synthetically generated, which may not capture all real-world edge cases
- Format Assumptions: The model expects blog-style recipe text and may not handle highly structured or tabular inputs well
- Accuracy: Recipe quantities and instructions depend on accurate extraction from source text
- No Validation: The model does not verify recipe feasibility or safety
Recommendations
- Always validate extracted recipes for completeness and accuracy
- Use with diverse recipe sources to identify potential biases
- Implement additional validation for allergen and dietary information
- Consider human review for production recipe applications
- Test with your specific use case before deployment
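A validation gate along these lines can route incomplete extractions to human review. This is a hypothetical sketch, not part of the released pipeline:

```python
REQUIRED_FIELDS = ("@context", "@type", "name")

def needs_review(recipe: dict) -> list:
    """Return a list of reasons an extracted recipe should be flagged
    for human review (hypothetical gate, not part of the model)."""
    reasons = []
    for field in REQUIRED_FIELDS:
        if not recipe.get(field):
            reasons.append(f"missing required field: {field}")
    if recipe.get("@type") != "Recipe":
        reasons.append("@type is not 'Recipe'")
    if not recipe.get("recipeIngredient"):
        reasons.append("no ingredients extracted")
    if not recipe.get("recipeInstructions"):
        reasons.append("no instructions extracted")
    return reasons

print(needs_review({"@context": "https://schema.org", "@type": "Recipe", "name": "Toast"}))
# ['no ingredients extracted', 'no instructions extracted']
```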
How to Get Started with the Model
Installation
```bash
pip install unsloth transformers
```
Quick Start
```python
from unsloth import FastLanguageModel

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="v-rusu/recipe-extractor-gemma-3-270m",
    max_seq_length=8192,
)

# Prepare your input
blog_post = """
Today I'm sharing my grandmother's famous chocolate chip cookies!
These cookies are the best - soft, chewy, and packed with chocolate.
Ingredients:
- 2 cups all-purpose flour
- 1 cup sugar
- 1 cup chocolate chips
- 2 eggs
- 1 tsp vanilla extract
Instructions:
First, preheat your oven to 350°F. Then mix all dry ingredients...
"""

messages = [
    {"role": "system", "content": "You are a recipe extraction assistant. Extract recipe information from the provided text and output it as a valid JSON-LD object following the schema.org Recipe format."},
    {"role": "user", "content": f"Extract recipe information from this text:\n\n{blog_post}"},
]

# Generate
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, return_tensors="pt", add_generation_prompt=True
).to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=1500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Training Data
The model was trained on the recipe-extractor-dataset, which contains:
- 6,196 training examples
- 100% synthetic data generated using Deepseek v3.2
- Source: Derived from the AllRecipes Kaggle Dataset
The dataset generation pipeline:
- Real recipe data downloaded from Kaggle
- Synthetic blog posts generated in 5 different styles (minimal, fluffy, chaotic, etc.) using Deepseek v3.2
- Recipe JSON-LD extraction with chain-of-thought reasoning traces using Deepseek v3.2
- Reasoning traces removed for training (preserved in dataset for potential reasoning model training)
Blog Post Styles (weighted distribution):
- Instagram short (weight: 1) - Very brief posts
- Minimal organized (weight: 2) - Clean, structured format
- Fluffy organized (weight: 5) - Typical recipe blogs with stories
- Chaotic unstructured (weight: 2) - Poorly formatted content
- Super fluffy chaotic (weight: 1) - Extremely verbose and messy
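The weighted distribution above can be realized with simple weighted sampling. A sketch assuming the weights drive random style selection (the actual generation pipeline may implement this differently):

```python
import random

# Style weights as listed above; assumed to drive random sampling
# during dataset generation.
STYLE_WEIGHTS = {
    "instagram_short": 1,
    "minimal_organized": 2,
    "fluffy_organized": 5,
    "chaotic_unstructured": 2,
    "super_fluffy_chaotic": 1,
}

rng = random.Random(42)
styles = rng.choices(
    population=list(STYLE_WEIGHTS),
    weights=list(STYLE_WEIGHTS.values()),
    k=1000,
)
# fluffy_organized should dominate at roughly 5/11 ≈ 0.45 of samples
print(styles.count("fluffy_organized") / len(styles))
```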
Training Procedure
Preprocessing
- Reasoning traces (`<think>...</think>`) removed from assistant responses, since Gemma-3 270M is a non-reasoning model
- Messages formatted using the Gemma-3 chat template
- 95/5 train/test split (seed: 42)
- Maximum sequence length: 8,192 tokens
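The trace-stripping step can be sketched with a short regex pass (a sketch of the preprocessing described above, not the project's exact code):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> reasoning traces from an assistant response."""
    return THINK_RE.sub("", text)

raw = '<think>The post lists 5 ingredients...</think>{"@type": "Recipe"}'
print(strip_reasoning(raw))  # {"@type": "Recipe"}
```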
Training Hyperparameters
- Base Model: unsloth/gemma-3-270m-it
- Training regime: LoRA fine-tuning with 4-bit quantization
- LoRA Configuration:
- Rank (r): 64
- Alpha: 64
- Dropout: 0
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Training steps: 35 (max_steps)
- Batch size: 4 per device
- Gradient accumulation: 4 steps (effective batch size: 16)
- Learning rate: 5e-4
- Optimizer: AdamW 8-bit
- Weight decay: 0.001
- LR scheduler: Linear
- Warmup steps: 5
- Seed: 3407
- Training objective: Supervised fine-tuning (SFT) on assistant responses only
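The schedule and batch settings above combine as follows; this is a sketch of a linear warmup/decay schedule under those hyperparameters (the trainer's exact boundary handling may differ):

```python
def linear_lr(step, peak_lr=5e-4, warmup_steps=5, max_steps=35):
    """Linear warmup to peak_lr over warmup_steps, then linear decay
    to zero at max_steps (sketch of the scheduler settings above)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max_steps - warmup_steps
    return peak_lr * (max_steps - step) / remaining

# per-device batch * gradient accumulation = effective batch
effective_batch = 4 * 4
print(effective_batch, linear_lr(4))  # 16 0.0005
```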
Speeds, Sizes, Times
- Training framework: Unsloth (optimized training)
- Model size: ~270M parameters base + LoRA adapters
- Training time: Varies by hardware (optimized for Google Colab T4 GPU)
- Gradient checkpointing: Enabled (Unsloth mode)
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a held-out 5% test set from the recipe-extractor-dataset (~310 examples).
Evaluation Methodology
Two-stage evaluation process:
Structural Validation (`RecipeEvaluator`):
- Validates JSON syntax
- Checks required fields: `@context`, `@type`, `name`
- Validates optional fields: description, ingredients, instructions, times, yield, category, cuisine, keywords
- Verifies schema.org Recipe format compliance
- Checks ISO 8601 duration formats
- Checks ISO 8601 duration formats
Quality Assessment (`VibesEvaluator`):
- LLM-based evaluation using DeepSeek v3.2
- Assesses extraction quality and completeness
- Scores on a 1-10 scale
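The ISO 8601 duration check can be approximated with a simplified pattern; this is a sketch in the spirit of the structural checks above, and the project's `RecipeEvaluator` may be stricter:

```python
import re

# Simplified ISO 8601 duration pattern, e.g. "PT30M", "PT1H15M", "P1DT2H".
DURATION_RE = re.compile(r"^P(?:\d+D)?(?:T(?:\d+H)?(?:\d+M)?(?:\d+S)?)?$")

def is_valid_duration(value: str) -> bool:
    # Guard against "P"/"PT", which the all-optional pattern would accept.
    return value not in ("P", "PT") and bool(DURATION_RE.fullmatch(value))

print(is_valid_duration("PT1H15M"), is_valid_duration("45 minutes"))  # True False
```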
Metrics
- Structural Validity Rate: Percentage of outputs that are valid JSON-LD
- Field Completeness: Coverage of optional schema.org fields
- Quality Score: Average LLM-assessed quality rating
Results
Results vary by training configuration. The best performing model (gemma-3-r64-max35-lr5e-4) achieved:
- High structural validity on test set
- Consistent extraction of required fields
- Good handling of diverse blog post styles
Full evaluation results are saved per training run and include per-sample validation details.
Environmental Impact
- Hardware Type: NVIDIA GPU (T4 or better recommended)
- Training optimizations: Unsloth framework, 4-bit quantization, LoRA
- Compute efficiency: Optimized for Google Colab free tier
- Carbon footprint: Minimal due to small model size and efficient training
The use of a small model (270M parameters) and efficient fine-tuning techniques (LoRA, quantization) significantly reduces computational requirements compared to training larger models.
Technical Specifications
Model Architecture and Objective
- Architecture: Gemma-3 270M (decoder-only transformer)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Objective: Supervised fine-tuning for structured information extraction
- Context length: 8,192 tokens
- Output format: JSON-LD (schema.org Recipe)
Compute Infrastructure
Hardware
- Development: Local machines with LM Studio (CPU/GPU)
- Training: Google Colab with T4 GPU (recommended)
- Inference: CPU or GPU (model supports various quantization levels)
Software
- Framework: Unsloth
- Libraries: PEFT 0.18.1, Transformers, TRL, PyTorch
- Dataset Generation: LM Studio with Qwen3-14B and Deepseek v3.2
- Quantization: Supports GGUF export (Q8_0, BF16, F16)
- Compatible with: llama.cpp, Ollama, LM Studio
Citation
BibTeX:
```bibtex
@software{rusu2026recipe_extractor,
  author = {Rusu, Vlad},
  title = {Recipe Extractor: Fine-tuned Gemma-3 270M for Recipe JSON-LD Extraction},
  year = {2026},
  url = {https://github.com/v-rusu/finetune-recipe-extractor},
  note = {Fine-tuned on synthetic recipe blog data}
}
```
APA:
Rusu, V. (2026). Recipe Extractor: Fine-tuned Gemma-3 270M for Recipe JSON-LD Extraction [Computer software]. https://github.com/v-rusu/finetune-recipe-extractor
Glossary
- JSON-LD: JSON for Linking Data, a method of encoding linked data using JSON
- Schema.org Recipe: A standardized format for representing recipe information on the web
- LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique
- Unsloth: An optimized framework for efficient LLM training
- Chain-of-thought: A reasoning approach where models show step-by-step thinking
- GGUF: A file format for storing language models for efficient inference
More Information
Project Resources
- GitHub Repository: https://github.com/v-rusu/finetune-recipe-extractor
- Training Dataset: https://huggingface.co/datasets/v-rusu/recipe-extractor-dataset
- Blog: https://vladr.com
Pipeline Scripts
The training pipeline includes:
- `01_download_dataset.py` - Download AllRecipes from Kaggle
- `02_generate_blogs.py` - Generate synthetic blog posts
- `03_generate_recipe_json.py` - Extract recipes with reasoning
- `04_generate_finetuning_dataset.py` - Create training dataset
- `05_finetune.py` - Fine-tune the model
- `06_eval.py` - Evaluate model performance
Complete documentation available in the GitHub repository.
Model Card Authors
Vlad Rusu
Model Card Contact
- LinkedIn: https://www.linkedin.com/in/vrusu
- Website: https://vladr.com
- GitHub: https://github.com/v-rusu
Framework versions
- PEFT 0.18.1
- Transformers (latest compatible version)
- Unsloth