Model Card for santos-sanz/vending-machine-rl-model

This model is a fine-tuned version of Qwen/Qwen3-0.6B, optimized using Group Relative Policy Optimization (GRPO) for strategic decision-making in a competitive vending machine simulation environment.

Model Details

Model Description

The model acts as a Strategic Business Manager for 'LLMMachine'. It is trained to adjust product prices (Soda, Chips, Candy Bar, Water) to maximize net profit while competing with a 'BasicMachine' for a shared pool of customers. Training focuses on balancing profit margins against sales volume, using market data and stockout feedback.

  • Developed by: Andres Santos
  • Model type: Causal Language Model (Fine-tuned with RL/GRPO)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen3-0.6B

Uses

Direct Use

The model is intended to generate strategic price adjustments in response to market conditions. It provides reasoning in <thought> tags followed by a JSON action block.

{
  "action": "change_price",
  "parameters": {
    "machine_name": "LLMMachine",
    "product_name": "Soda",
    "new_price": 2.75
  }
}
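
A complete response pairs the reasoning with the action block. For illustration only (the reasoning text below is invented, not an actual model output):

<thought>
BasicMachine sells Soda at $3.00 and market demand is High, so raising Soda to $2.75 improves margin while keeping LLMMachine the cheaper option.
</thought>

followed by a JSON action block like the one above.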

Out-of-Scope Use

This model is a specialized simulation agent and should not be used as a general-purpose assistant or for financial advice in real-world markets without further validation.

Bias, Risks, and Limitations

The model is trained on synthetic simulation data. Its strategies are optimized for the specific logic of the VendingMachine simulation engine and may not translate to real-world consumer behavior, which is more complex and less predictable.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "santos-sanz/vending-machine-rl-model"  # LoRA adapter repository
base_model = "Qwen/Qwen3-0.6B"                     # base model the adapter was trained on

# Load the tokenizer and base weights, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, model_id)

prompt = """
You are a Strategic Business Manager for 'LLMMachine'.
... (see training script for full prompt)
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
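
To act on the output programmatically, the reasoning can be stripped and the trailing JSON parsed. A minimal sketch (the helper name extract_action is illustrative, not part of the released code):

import json
import re

def extract_action(response: str) -> dict:
    """Drop the <thought>...</thought> reasoning and parse the JSON action."""
    body = re.sub(r"<thought>.*?</thought>", "", response, flags=re.DOTALL)
    match = re.search(r"\{.*\}", body, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON action block found in model output")
    return json.loads(match.group(0))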

Training Details

Training Data

The model was trained using synthetic market data generated by a custom simulation engine. Each prompt includes:

  • Competitor prices.
  • Estimated market demand (Low/Medium/High).
  • Marketing intensity and upcoming events.
  • Feedback on stockout events from the previous week.
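
For illustration, a weekly observation might carry fields along these lines (the names and values below are assumptions; the exact prompt template is defined in the training script):

# Illustrative weekly market state (field names are assumed, not the exact schema)
market_state = {
    "week": 12,
    "competitor_prices": {"Soda": 2.50, "Chips": 1.75, "Candy Bar": 1.50, "Water": 1.25},
    "estimated_demand": "High",
    "marketing_intensity": "Medium",
    "upcoming_event": "Local football game",
    "stockouts_last_week": ["Soda"],
}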

Training Procedure

Training Hyperparameters

  • Training regime: GRPO (float16 mixed precision with LoRA)
  • Learning rate: 1e-4
  • Max steps: 300
  • Batch size: 1
  • Gradient accumulation steps: 4
  • Num generations per prompt: 2
  • LoRA Rank (R): 16
  • LoRA Alpha: 32
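
Assuming the training script uses trl's GRPOTrainer together with peft, these settings map roughly onto the following configuration (a sketch, not the exact script):

from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings from the table above
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# GRPO settings from the table above; float16 mixed precision was used,
# but the exact mixed-precision flag depends on the backend (here, MPS).
training_args = GRPOConfig(
    learning_rate=1e-4,
    max_steps=300,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=2,  # completions sampled per prompt for the group baseline
)

# The trainer is then constructed roughly as:
# GRPOTrainer(model="Qwen/Qwen3-0.6B", reward_funcs=[...], args=training_args,
#             train_dataset=..., peft_config=lora_config)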

Speeds, Sizes, Times

  • Hardware: MacBook Air (Apple Silicon - MPS)
  • Training duration: ~8.54 hours
  • Average step time: ~102 seconds

Evaluation

The model was evaluated based on reward functions tracking:

  1. Format Reward: Adherence to JSON output format.
  2. Profit Reward: Accumulated net profit over a 5-week simulation window.
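
In spirit, the two rewards can be sketched as follows (function names and scales are illustrative; the profit reward depends on the simulation engine, which is not reproduced here):

import json
import re

def format_reward_func(completion: str) -> float:
    """Illustrative: reward adherence to the <thought> + JSON action format."""
    has_thought = "<thought>" in completion and "</thought>" in completion
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    try:
        action = json.loads(match.group(0)) if match else None
    except json.JSONDecodeError:
        action = None
    valid_action = isinstance(action, dict) and action.get("action") == "change_price"
    return 1.0 if (has_thought and valid_action) else -1.0

def profit_reward_func(completion: str) -> float:
    """Illustrative: net profit accumulated over a 5-week simulation window."""
    # The real implementation applies the parsed price change inside the
    # VendingMachine engine and returns LLMMachine's accumulated net profit.
    raise NotImplementedError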

Results

Final training logs showed substantial improvement in the profit_reward_func:

  • Initial reward: ~ -5.00 (per-step average)
  • Final reward: ~ +69.00 (per-step average)

Environmental Impact

  • Hardware Type: Apple M-series (MPS)
  • Hours used: 8.5
  • Carbon Emitted: Est. low (local consumer hardware)

Technical Specifications

Model Architecture and Objective

The model uses a LoRA adapter on top of the Qwen3-0.6B architecture. The GRPO objective optimizes the policy to maximize the expected reward from the vending machine simulation.
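
Concretely, GRPO (Shao et al., 2024) replaces a learned value baseline with a group-relative advantage: for each prompt, G completions are sampled and each completion's reward is normalized against the group,

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})},$$

and the policy is updated with a clipped PPO-style objective using these advantages (here G = 2, per the num_generations setting above).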

Citation

@article{shao2024deepseekmath,
    title         = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author        = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year          = {2024},
    eprint        = {2402.03300},
    archivePrefix = {arXiv},
}

Model Card Authors

Andres Santos
