---
license: apache-2.0
language:
- en
tags:
- memory-routing
- marketing
- classification
- llama
- lora
- tinker
- prompt-distillation
base_model: meta-llama/Llama-3.1-8B
metrics:
- f1
- accuracy
pipeline_tag: text-classification
---
# Memory Routing Agent
**A specialized 8B model that outperforms its 104B teacher on marketing conversation classification.**
[![HuggingFace](https://img.shields.io/badge/πŸ€—%20Model-Marketing--Memory--Routing--8B-blue)](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
[![GitHub](https://img.shields.io/badge/GitHub-memory--routing--agent-black)](https://github.com/muratcankoylan/memory-routing-agent)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
---
## The Experiment
This project demonstrates **prompt distillation**: training a small, specialized model to outperform the large model that generated its training data.
### The Challenge
Marketing AI assistants need to remember the right information from conversations. Not everything is worth storing, so the router must distinguish between:
- **Valuable**: "Our brand voice is professional but approachable" β†’ Store in long-term memory
- **Transactional**: "What time is the meeting tomorrow?" β†’ Don't store
This is a **13-category classification problem** with nuanced distinctions between company-level and user-level information, different persistence horizons, and the critical ability to say "none" for irrelevant content.
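To make the task concrete, a single labeled training example might look like this (a hypothetical record; field names are illustrative, not the repo's exact schema):
```python
# Hypothetical training record (illustrative field names, not the exact schema)
example = {
    "conversation": (
        "USER: Our brand voice is professional but approachable.\n"
        "ASSISTANT: So authoritative content with a conversational tone?\n"
        "USER: Exactly."
    ),
    "categories": ["company.brand_core"],  # one or more of the 13 labels, or ["none"]
    "persistence": "long",                 # expected storage horizon
    "scope": "company",                    # company-level vs. user-level
}
```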
### The Approach
1. **Generate synthetic data** using Cohere Command-R-Plus (104B) as the teacher
2. **Fine-tune Llama-3.1-8B** with LoRA using Tinker's training platform
3. **Apply reinforcement learning** with a custom reward function
4. **Benchmark against the teacher** on challenging, held-out scenarios
### The Result
| Model | Parameters | Avg F1 | Exact Match |
|-------|------------|--------|-------------|
| **Ours** | **8B** | **0.68** | **60%** |
| Cohere Command-R-Plus | 104B | 0.61 | 26% |
**Our 8B model achieves 11.1% higher F1 and 2.3x better exact match accuracy than the 104B teacher, while being 13x smaller.**
The student surpassed the teacher through:
- **Focused training**: The model only learns this one task, not general capabilities
- **RL refinement**: The reward function optimizes for exact category matching, not just plausible outputs
- **Clean data**: Synthetic data with consistent labeling, no noise from human annotation disagreements
---
## Training Visualizations
### Phase 1: Supervised Fine-Tuning
![SFT Loss](assets/sft_loss.png)
100 training steps reduced loss from 5.47 to 0.26 (95% reduction). The model learned the basic classification task in the first epoch.
### Phase 2: Reinforcement Learning
![RL Reward](assets/rl_reward.png)
30 RL iterations improved mean reward from 0.73 to 0.93. The reward function combines F1 score, temporal alignment, scope correctness, and storage efficiency.
### Model Comparison
![Model Comparison](assets/model_comparison.png)
Our model excels at exact matching (60% vs 26%) because RL optimizes for getting all categories right, not just some.
### Performance by Difficulty
![Difficulty Comparison](assets/difficulty_comparison.png)
The 8B model dominates on easy cases (+79% F1) and matches the teacher on medium cases. The 104B model still wins on hard multi-label scenarios.
---
## Key Results
| Metric | Our Model (8B) | Cohere (104B) |
|--------|----------------|---------------|
| **Avg F1** | **0.68** | 0.61 |
| **Exact Match** | **60%** | 26% |
| Any Match | 72% | 82% |
| Model Size | 8B | 104B |
| **Improvement** | **+11.1% F1** | baseline |
### Reward Components (Final RL Iteration)
| Component | Score | Description |
|-----------|-------|-------------|
| R_F1 | 0.90 | F1 score vs gold labels |
| R_temp | 0.95 | Temporal alignment |
| R_parity | 1.00 | Company/user scope |
| R_eff | 1.00 | Storage efficiency |
---
## What It Does
The Memory Routing Agent classifies marketing conversations into 13 memory categories:
### Company Categories (Long-term business context)
| Category | Description | Persistence |
|----------|-------------|-------------|
| `company.brand_core` | Voice, values, positioning | Long (>1y) |
| `company.strategic_signatures` | Decision frameworks | Long (>1y) |
| `company.knowledge_artifacts` | Docs, style guides | Long (>1y) |
| `company.business_priorities` | Quarterly goals | Short (<3m) |
| `company.tools_config` | Integrations, APIs | Medium (~6m) |
| `company.performance_context` | Campaign metrics | Rolling (~6m) |
### User Categories (Personal preferences)
| Category | Description | Persistence |
|----------|-------------|-------------|
| `user.communication_style` | Tone, format preferences | Long (>1y) |
| `user.strategic_approach` | Personal priorities | Long (>1y) |
| `user.role_context` | Title, scope | Medium (~1y) |
| `user.workflow_patterns` | Review cadence | Medium (~1y) |
| `user.session_history` | Immediate context | Short (<2w) |
| `user.interaction_preferences` | Coaching style | Evolving |
### Special
| Category | Description |
|----------|-------------|
| `none` | Transactional or irrelevant content |
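For programmatic use, the full taxonomy maps naturally onto a lookup table. A minimal sketch mirroring the tables above (the repo may organize this differently):
```python
# Memory category taxonomy with approximate persistence horizons
# (mirrors the tables above; not necessarily the repo's internal representation)
CATEGORIES = {
    # Company-level (long-term business context)
    "company.brand_core":           "long (>1y)",
    "company.strategic_signatures": "long (>1y)",
    "company.knowledge_artifacts":  "long (>1y)",
    "company.business_priorities":  "short (<3m)",
    "company.tools_config":         "medium (~6m)",
    "company.performance_context":  "rolling (~6m)",
    # User-level (personal preferences)
    "user.communication_style":     "long (>1y)",
    "user.strategic_approach":      "long (>1y)",
    "user.role_context":            "medium (~1y)",
    "user.workflow_patterns":       "medium (~1y)",
    "user.session_history":         "short (<2w)",
    "user.interaction_preferences": "evolving",
    # Special
    "none":                         "not stored",
}
```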
---
## Training Pipeline
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      TRAINING PIPELINE                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  1. SYNTHETIC DATA GENERATION                                β”‚
β”‚     β”œβ”€β”€ Cohere Command-R-Plus (104B) as teacher              β”‚
β”‚     β”œβ”€β”€ 2,001 marketing conversations                        β”‚
β”‚     └── 13 category labels + persistence horizons            β”‚
β”‚                                                              β”‚
β”‚  2. SUPERVISED FINE-TUNING (SFT)                             β”‚
β”‚     β”œβ”€β”€ Base: meta-llama/Llama-3.1-8B                        β”‚
β”‚     β”œβ”€β”€ LoRA rank 32                                         β”‚
β”‚     β”œβ”€β”€ 100 steps, batch size 128                            β”‚
β”‚     └── Cross-entropy loss                                   β”‚
β”‚                                                              β”‚
β”‚  3. REINFORCEMENT LEARNING (RL)                              β”‚
β”‚     β”œβ”€β”€ 30 iterations, 64 groups Γ— 32 samples                β”‚
β”‚     β”œβ”€β”€ Importance sampling policy gradient                  β”‚
β”‚     └── Composite reward: F1 + temporal + parity + efficiency β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
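For orientation, here is what a single SFT step looks like on Tinker. This is a minimal sketch assuming current Tinker SDK signatures; the real loop, batching, and checkpointing live in `training/train_v2.py`:
```python
import tinker
from tinker import types
from tinker_cookbook.tokenizer_utils import get_tokenizer

# One SFT step (a sketch; exact SDK signatures may differ --
# see training/train_v2.py for the actual training loop)
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.1-8B", rank=32
)

tokenizer = get_tokenizer("meta-llama/Llama-3.1-8B")
tokens = tokenizer.encode(
    "USER: Our brand voice is playful.\nCATEGORIES: company.brand_core"
)

# Next-token prediction: the model sees tokens[:-1] and is scored on tokens[1:]
datum = types.Datum(
    model_input=types.ModelInput.from_ints(tokens[:-1]),
    loss_fn_inputs={"weights": [1.0] * (len(tokens) - 1), "target_tokens": tokens[1:]},
)

training_client.forward_backward([datum], loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=1e-4)).result()
```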
### Reward Function
```
R_total = 0.6 Γ— R_F1 + 0.2 Γ— R_temp + 0.1 Γ— R_parity + 0.1 Γ— R_eff
```
| Component | Weight | Description |
|-----------|--------|-------------|
| R_F1 | 60% | F1 score vs gold labels |
| R_temp | 20% | Persistence horizon alignment |
| R_parity | 10% | Company/user scope correctness |
| R_eff | 10% | Storage efficiency (≀3 categories) |
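A minimal Python sketch of this composite reward (the real implementation lives in `training/rl_env.py`; the binary form of `R_eff` here is an assumption):
```python
def composite_reward(pred: set[str], gold: set[str],
                     r_temp: float, r_parity: float) -> float:
    """Composite RL reward (sketch; the real version is in training/rl_env.py)."""
    # R_F1: set-level F1 between predicted and gold category labels
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    r_f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # R_eff: storage efficiency -- assumed binary penalty for predicting >3 categories
    r_eff = 1.0 if len(pred) <= 3 else 0.0

    # R_total = 0.6*R_F1 + 0.2*R_temp + 0.1*R_parity + 0.1*R_eff
    return 0.6 * r_f1 + 0.2 * r_temp + 0.1 * r_parity + 0.1 * r_eff


print(composite_reward({"company.brand_core"}, {"company.brand_core"}, 1.0, 1.0))
# -> 1.0 (up to float rounding): a perfect single-label prediction earns max reward
```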
---
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/muratcankoylan/memory-routing-agent.git
cd memory-routing-agent
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
### Environment Setup
```bash
# Create .env file with your API keys
echo "TINKER_API_KEY=your_tinker_key" >> .env
echo "COHERE_API_KEY=your_cohere_key" >> .env
echo "HF_TOKEN=your_huggingface_token" >> .env
```
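The scripts read these keys from the environment. If you want to load the `.env` file from Python directly, `python-dotenv` works (an assumption; it may not be listed in `requirements.txt`):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv (assumed dependency)

load_dotenv()  # reads .env in the current directory into os.environ
tinker_key = os.environ["TINKER_API_KEY"]
```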
### Run Inference
```python
import tinker
from tinker import types
from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer
# Load model from Tinker checkpoint
service_client = tinker.ServiceClient()
checkpoint = "tinker://4f4bae1f-5a95-5f53-a55a-a14f2872825c:train:0/sampler_weights/rl_iter_012"
sampling_client = service_client.create_sampling_client(model_path=checkpoint)
# Setup tokenizer and renderer
tokenizer = get_tokenizer("meta-llama/Llama-3.1-8B")
renderer = renderers.get_renderer(name="llama3", tokenizer=tokenizer)
# Classify a conversation
conversation = """
USER: Our brand voice is professional but approachable. Think Harvard Business Review meets Slack.
ASSISTANT: So authoritative content with a conversational tone?
USER: Exactly. We never use jargon without explaining it first.
"""
messages = [
{"role": "system", "content": "You route marketing conversations into structured memory categories..."},
{"role": "user", "content": f"Analyze this conversation:\n\n{conversation}"}
]
prompt = renderer.build_generation_prompt(messages)
params = types.SamplingParams(max_tokens=100, temperature=0.1, stop=renderer.get_stop_sequences())
result = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=1).result()
response, _ = renderer.parse_response(result.sequences[0].tokens)
print(f"Categories: {response['content']}")
# Output: company.brand_core
```
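The model returns category names as plain text. For multi-label outputs, a small post-processing step can split and validate the completion (a sketch, assuming comma-separated categories):
```python
# Post-process the raw completion into validated category labels
# (assumes the model separates multiple categories with commas)
VALID = {
    "company.brand_core", "company.strategic_signatures",
    "company.knowledge_artifacts", "company.business_priorities",
    "company.tools_config", "company.performance_context",
    "user.communication_style", "user.strategic_approach",
    "user.role_context", "user.workflow_patterns",
    "user.session_history", "user.interaction_preferences", "none",
}

def parse_categories(raw: str) -> list[str]:
    labels = [part.strip() for part in raw.split(",")]
    return [label for label in labels if label in VALID]

print(parse_categories("company.brand_core, user.communication_style"))
# ['company.brand_core', 'user.communication_style']
```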
---
## Project Structure
```
memory-routing-agent/
β”œβ”€β”€ assets/                       # Training visualizations
β”‚   β”œβ”€β”€ sft_loss.png
β”‚   β”œβ”€β”€ rl_reward.png
β”‚   β”œβ”€β”€ rl_components.png
β”‚   β”œβ”€β”€ model_comparison.png
β”‚   └── difficulty_comparison.png
β”œβ”€β”€ synthetic_data/               # Data generation pipeline
β”‚   β”œβ”€β”€ pipeline.py               # Cohere-based conversation generator
β”‚   β”œβ”€β”€ run_diverse_generation.py
β”‚   └── merged_training_dataset_2001.jsonl
β”œβ”€β”€ training/                     # Training scripts
β”‚   β”œβ”€β”€ train_v2.py               # Main training script (SFT + RL)
β”‚   β”œβ”€β”€ preprocess.py             # Data preprocessing
β”‚   β”œβ”€β”€ rl_env.py                 # RL environment and reward function
β”‚   β”œβ”€β”€ final_benchmark.py        # Benchmark evaluation
β”‚   β”œβ”€β”€ logs/                     # Training logs (JSONL)
β”‚   └── benchmarks/               # Benchmark results
β”œβ”€β”€ huggingface/                  # HuggingFace upload scripts
β”œβ”€β”€ docs/                         # Documentation
β”‚   β”œβ”€β”€ PRD.md                    # Product requirements
β”‚   └── tinker_docs.md            # Tinker reference
β”œβ”€β”€ MODEL_CARD.md                 # Model card
└── README.md                     # This file
```
---
## Benchmark
The Marketing Routing Benchmark contains 50 challenging scenarios across 7 domains:
| Domain | Scenarios | Description |
|--------|-----------|-------------|
| Brand & Positioning | 8 | Brand voice, values, identity |
| Strategic Decisions | 8 | Decision frameworks, heuristics |
| Performance & Metrics | 8 | Campaign metrics, learnings |
| Tools & Integrations | 6 | Tech stack, APIs |
| User Preferences | 10 | Communication style, workflow |
| Business Priorities | 6 | Goals, focus areas |
| Knowledge Artifacts | 4 | Docs, playbooks, templates |
### Run Benchmark
```bash
python training/final_benchmark.py
```
---
## Training Your Own Model
### 1. Generate Synthetic Data
```bash
cd synthetic_data
python run_diverse_generation.py --num_items 1000
```
### 2. Preprocess Data
```bash
python training/preprocess.py
```
### 3. Run Training
```bash
python training/train_v2.py
```
### 4. Evaluate
```bash
python training/final_benchmark.py
```
---
## Limitations
- **Multi-label**: Under-predicts when multiple categories apply
- **Overlap**: Struggles with company/user category overlap on edge cases
- **Domain**: Marketing-specific; not tested on other domains
---
## Links
- **HuggingFace Model**: [MuratcanKoylan/Marketing-Memory-Routing-8B](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
- **GitHub Repository**: [muratcankoylan/memory-routing-agent](https://github.com/muratcankoylan/memory-routing-agent)
- **Training Platform**: [Tinker by Thinking Machines](https://thinkingmachines.ai/)
---
## Citation
```bibtex
@misc{memory-routing-agent-2025,
  title={Memory Routing Agent: Prompt Distillation for Marketing AI},
  author={Muratcan Koylan},
  year={2025},
  howpublished={\url{https://github.com/muratcankoylan/memory-routing-agent}},
}
```
---
## License
Apache 2.0
---
## Acknowledgments
- [Thinking Machines](https://thinkingmachines.ai/) for the Tinker training platform
- [Cohere](https://cohere.com/) for Command-R-Plus teacher model
- [Meta](https://ai.meta.com/) for Llama 3.1 base model
- [Anthropic](https://anthropic.com/) for Claude, which assisted in developing this project