Model Card for SmolLM3-3B_rl_vending_local
This model is a fine-tuned version of HuggingFaceTB/SmolLM3-3B, optimized using Group Relative Policy Optimization (GRPO) for strategic decision-making in a competitive vending machine simulation environment.
Model Details
Model Description
The model acts as a Strategic Business Manager for 'LLMMachine'. It is trained to adjust product prices (Soda, Chips, Candy Bar, Water) to maximize net profit while competing with a 'BasicMachine' in a shared client pool. The training focuses on balancing profit margins against sales volume based on market data and stockout feedback.
- Developed by: Andres Santos
- Model type: Causal Language Model (Fine-tuned with RL/GRPO)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: HuggingFaceTB/SmolLM3-3B
Model Sources
- Repository: santos-sanz/vending-machine-rl-model
Uses
Direct Use
The model is intended to generate strategic price adjustments in response to market conditions. It writes its reasoning inside <thought> tags, followed by a JSON action block such as:
{
  "action": "change_price",
  "parameters": {
    "machine_name": "LLMMachine",
    "product_name": "Soda",
    "new_price": 2.75
  }
}
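A downstream simulator has to separate the reasoning from the action before applying a price change. The following is a minimal sketch of one way to do that with the standard library only; the function name `parse_response` is hypothetical and not part of the released code, and it assumes the response contains at most one `<thought>` block followed by a single JSON object.

```python
import json
import re


def parse_response(text):
    """Split a model response into (reasoning, action dict or None)."""
    thought = re.search(r"<thought>(.*?)</thought>", text, re.DOTALL)
    reasoning = thought.group(1).strip() if thought else ""

    # Only look for JSON after the reasoning block, if one was found.
    remainder = text[thought.end():] if thought else text
    start = remainder.find("{")
    if start == -1:
        return reasoning, None

    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        action, _ = json.JSONDecoder().raw_decode(remainder[start:])
    except json.JSONDecodeError:
        return reasoning, None

    if not isinstance(action, dict):
        return reasoning, None
    return reasoning, action
```

Returning `None` for a missing or malformed action lets the caller decide whether to skip the turn or keep the previous prices.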
Out-of-Scope Use
This model is a specialized simulation agent and should not be used as a general-purpose assistant or for financial advice in real-world markets without further validation.
Bias, Risks, and Limitations
The model is trained on synthetic simulation data. Its strategies are optimized for the specific logic of the VendingMachine simulation engine and may not transfer to real-world consumer behavior, which is more complex and less predictable.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "santos-sanz/vending-machine-rl-model"  # LoRA adapter repository
base_model = "HuggingFaceTB/SmolLM3-3B"  # must match the "Finetuned from model" entry above

# Load the tokenizer from the adapter repository and the base weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto", device_map="auto")
# Attach the GRPO-trained LoRA adapter to the base model.
model = PeftModel.from_pretrained(model, model_id)

prompt = """
You are a Strategic Business Manager for 'LLMMachine'.
... (see training script for full prompt)
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded text contains the prompt followed by the generated completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained using synthetic market data generated by a custom simulation engine. Each prompt includes:
- Competitor prices.
- Estimated market demand (Low/Medium/High).
- Marketing intensity and upcoming events.
- Feedback on stockout events from the previous week.
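To make the shape of one training observation concrete, the four items above could be held in a small record and rendered into the market-data part of a prompt. This is only an illustrative sketch: every field name, the `render_observation` helper, and the output wording are hypothetical, and the actual prompt template is defined in the training script.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WeeklyObservation:
    # Hypothetical field names; the real schema lives in the training script.
    competitor_prices: Dict[str, float]
    demand_level: str  # "Low", "Medium", or "High"
    marketing_intensity: str
    upcoming_events: List[str]
    stockouts_last_week: List[str]


def render_observation(obs):
    """Render one observation as plain text for inclusion in a prompt."""
    prices = ", ".join(
        f"{name}: ${price:.2f}" for name, price in obs.competitor_prices.items()
    )
    events = ", ".join(obs.upcoming_events) or "none"
    stockouts = ", ".join(obs.stockouts_last_week) or "none"
    return "\n".join(
        [
            f"Competitor prices: {prices}",
            f"Estimated demand: {obs.demand_level}",
            f"Marketing intensity: {obs.marketing_intensity}",
            f"Upcoming events: {events}",
            f"Stockouts last week: {stockouts}",
        ]
    )
```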
Training Procedure
Training Hyperparameters
- Training regime: GRPO (float16 mixed precision with LoRA)
- Learning rate: 1e-4
- Max steps: 300
- Batch size: 1
- Gradient accumulation steps: 4
- Num generations per prompt: 2
- LoRA Rank (R): 16
- LoRA Alpha: 32
Speeds, Sizes, Times
- Hardware: MacBook Air (Apple Silicon - MPS)
- Training duration: ~8.54 hours
- Average step time: ~102 seconds
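The hyperparameters and timings above can be cross-checked with a few lines of arithmetic. The sketch below assumes a single device and assumes that the batch size counts completions rather than prompts (as in recent TRL releases); the variable names are illustrative only.

```python
# Values copied from the hyperparameter and timing lists above.
max_steps = 300
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_generations = 2
avg_step_seconds = 102

# Assumption: one device, and batch size counts completions, not prompts.
completions_per_step = per_device_batch_size * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
total_completions = completions_per_step * max_steps

# Wall-clock estimate implied by the average step time.
estimated_hours = max_steps * avg_step_seconds / 3600

print(f"completions per optimizer step: {completions_per_step}")
print(f"distinct prompts per optimizer step: {prompts_per_step}")
print(f"total completions scored: {total_completions}")
print(f"estimated training time: {estimated_hours:.2f} h")
```

Under these assumptions each optimizer step sees 4 completions from 2 prompts, and 300 steps at about 102 seconds each come to 8.5 hours, which agrees with the reported duration.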
Evaluation
The model was evaluated based on reward functions tracking:
- Format Reward: Adherence to JSON output format.
- Profit Reward: Accumulated net profit over a 5-week simulation window.
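As an illustration of the format check, a reward function might look like the sketch below. It is not the function used in training: the name `format_reward_func`, the 1.0/0.0 reward values, and the exact acceptance rule are assumptions, and completions are assumed to arrive as plain strings. The profit reward is not sketched because it depends on the simulation engine.

```python
import json
import re

# A <thought> block followed by a JSON object (first "{" to last "}").
RESPONSE_PATTERN = re.compile(r"<thought>.*?</thought>\s*(\{.*\})", re.DOTALL)


def format_reward_func(completions, **kwargs):
    """Return 1.0 per completion that follows the expected format, else 0.0."""
    rewards = []
    for text in completions:
        reward = 0.0
        match = RESPONSE_PATTERN.search(text)
        if match:
            try:
                action = json.loads(match.group(1))
            except json.JSONDecodeError:
                action = None
            # Require the two top-level keys shown in the Direct Use example.
            if isinstance(action, dict) and {"action", "parameters"} <= action.keys():
                reward = 1.0
        rewards.append(reward)
    return rewards
```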
Results
Training logs show the profit_reward_func value rising from negative to clearly positive over the run:
- Initial Reward: ~ -5.00
- Final Reward: ~ +69.00 (per-step average)
Environmental Impact
- Hardware Type: Apple M-series (MPS)
- Hours used: 8.5
- Carbon Emitted: Not measured; estimated to be low (local consumer hardware)
Technical Specifications
Model Architecture and Objective
The model uses a LoRA adapter on top of the SmolLM3-3B architecture. The GRPO objective optimizes the policy to maximize the expected reward from the vending machine simulation.
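The core of GRPO is that each completion is scored relative to the other completions sampled for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage is shown below; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration, and real implementations may differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for two completions of the same prompt.
advantages = group_relative_advantages([10.0, 30.0])
print(advantages)
```

With two generations per prompt, as in the configuration listed in this card, the normalized advantages reduce to roughly -1 for the weaker completion and +1 for the stronger one, or 0 for both when the rewards tie.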
Citation
@article{shao2024deepseekmath,
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
year = 2024,
eprint = {arXiv:2402.03300},
}
Model Card Authors
Andres Santos