---
license: apache-2.0
language:
- en
tags:
- memory-routing
- marketing
- classification
- llama
- lora
- tinker
- prompt-distillation
base_model: meta-llama/Llama-3.1-8B
metrics:
- f1
- accuracy
pipeline_tag: text-classification
---
# Memory Routing Agent
**A specialized 8B model that outperforms its 104B teacher on marketing conversation classification.**
[![HuggingFace](https://img.shields.io/badge/πŸ€—%20Model-Marketing--Memory--Routing--8B-blue)](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
[![GitHub](https://img.shields.io/badge/GitHub-memory--routing--agent-black)](https://github.com/muratcankoylan/memory-routing-agent)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
---
## The Experiment
This project demonstrates **prompt distillation**: training a small, specialized model to outperform the large model that generated its training data.
### The Challenge
Marketing AI assistants need to remember the right information from conversations. Not everything is worth storing, so the router must distinguish between:
- **Valuable**: "Our brand voice is professional but approachable" β†’ Store in long-term memory
- **Transactional**: "What time is the meeting tomorrow?" β†’ Don't store
This is a **13-category classification problem** with nuanced distinctions between company-level and user-level information, different persistence horizons, and the critical ability to say "none" for irrelevant content.
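To make the task concrete, a single labeled training example might look like this (a hypothetical record; field names are illustrative, not the repo's exact schema):
```python
# Hypothetical training record (illustrative field names, not the exact schema)
example = {
    "conversation": (
        "USER: Our brand voice is professional but approachable.\n"
        "ASSISTANT: So authoritative content with a conversational tone?\n"
        "USER: Exactly."
    ),
    "categories": ["company.brand_core"],  # one or more of the 13 labels, or ["none"]
    "persistence": "long",                 # expected storage horizon
    "scope": "company",                    # company-level vs. user-level
}
```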
### The Approach
1. **Generate synthetic data** using Cohere Command-R-Plus (104B) as the teacher
2. **Fine-tune Llama-3.1-8B** with LoRA using Tinker's training platform
3. **Apply reinforcement learning** with a custom reward function
4. **Benchmark against the teacher** on challenging, held-out scenarios
### The Result
| Model | Parameters | Avg F1 | Exact Match |
|-------|------------|--------|-------------|
| **Ours** | **8B** | **0.68** | **60%** |
| Cohere Command-R-Plus | 104B | 0.61 | 26% |
**Our 8B model achieves 11.1% higher F1 and 2.3x better exact match accuracy than the 104B teacher, while being 13x smaller.**
The student surpassed the teacher through:
- **Focused training**: The model only learns this one task, not general capabilities
- **RL refinement**: The reward function optimizes for exact category matching, not just plausible outputs
- **Clean data**: Synthetic data with consistent labeling, no noise from human annotation disagreements
---
## Training Visualizations
### Phase 1: Supervised Fine-Tuning
![SFT Loss](assets/sft_loss.png)
100 training steps reduced loss from 5.47 to 0.26 (95% reduction). The model learned the basic classification task in the first epoch.
### Phase 2: Reinforcement Learning
![RL Reward](assets/rl_reward.png)
30 RL iterations improved mean reward from 0.73 to 0.93. The reward function combines F1 score, temporal alignment, scope correctness, and storage efficiency.
### Model Comparison
![Model Comparison](assets/model_comparison.png)
Our model excels at exact matching (60% vs 26%) because RL optimizes for getting all categories right, not just some.
### Performance by Difficulty
![Difficulty Comparison](assets/difficulty_comparison.png)
The 8B model dominates on easy cases (+79% F1) and matches the teacher on medium cases. The 104B model still wins on hard multi-label scenarios.
---
## Key Results
| Metric | Our Model (8B) | Cohere (104B) |
|--------|----------------|---------------|
| **Avg F1** | **0.68** | 0.61 |
| **Exact Match** | **60%** | 26% |
| Any Match | 72% | 82% |
| Model Size | 8B | 104B |
| **Improvement** | **+11.1% F1** | baseline |
### Reward Components (Final RL Iteration)
| Component | Score | Description |
|-----------|-------|-------------|
| R_F1 | 0.90 | F1 score vs gold labels |
| R_temp | 0.95 | Temporal alignment |
| R_parity | 1.00 | Company/user scope |
| R_eff | 1.00 | Storage efficiency |
---
## What It Does
The Memory Routing Agent classifies marketing conversations into 13 memory categories:
### Company Categories (Long-term business context)
| Category | Description | Persistence |
|----------|-------------|-------------|
| `company.brand_core` | Voice, values, positioning | Long (>1y) |
| `company.strategic_signatures` | Decision frameworks | Long (>1y) |
| `company.knowledge_artifacts` | Docs, style guides | Long (>1y) |
| `company.business_priorities` | Quarterly goals | Short (<3m) |
| `company.tools_config` | Integrations, APIs | Medium (~6m) |
| `company.performance_context` | Campaign metrics | Rolling (~6m) |
### User Categories (Personal preferences)
| Category | Description | Persistence |
|----------|-------------|-------------|
| `user.communication_style` | Tone, format preferences | Long (>1y) |
| `user.strategic_approach` | Personal priorities | Long (>1y) |
| `user.role_context` | Title, scope | Medium (~1y) |
| `user.workflow_patterns` | Review cadence | Medium (~1y) |
| `user.session_history` | Immediate context | Short (<2w) |
| `user.interaction_preferences` | Coaching style | Evolving |
### Special
| Category | Description |
|----------|-------------|
| `none` | Transactional or irrelevant content |
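For programmatic use, the full taxonomy maps naturally onto a lookup table. A minimal sketch mirroring the tables above (the repo may organize this differently):
```python
# Memory category taxonomy with approximate persistence horizons
# (mirrors the tables above; not necessarily the repo's internal representation)
CATEGORIES = {
    # Company-level (long-term business context)
    "company.brand_core":           "long (>1y)",
    "company.strategic_signatures": "long (>1y)",
    "company.knowledge_artifacts":  "long (>1y)",
    "company.business_priorities":  "short (<3m)",
    "company.tools_config":         "medium (~6m)",
    "company.performance_context":  "rolling (~6m)",
    # User-level (personal preferences)
    "user.communication_style":     "long (>1y)",
    "user.strategic_approach":      "long (>1y)",
    "user.role_context":            "medium (~1y)",
    "user.workflow_patterns":       "medium (~1y)",
    "user.session_history":         "short (<2w)",
    "user.interaction_preferences": "evolving",
    # Special
    "none":                         "not stored",
}
```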
---
## Training Pipeline
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      TRAINING PIPELINE                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  1. SYNTHETIC DATA GENERATION                                β”‚
β”‚     β”œβ”€β”€ Cohere Command-R-Plus (104B) as teacher              β”‚
β”‚     β”œβ”€β”€ 2,001 marketing conversations                        β”‚
β”‚     └── 13 category labels + persistence horizons            β”‚
β”‚                                                              β”‚
β”‚  2. SUPERVISED FINE-TUNING (SFT)                             β”‚
β”‚     β”œβ”€β”€ Base: meta-llama/Llama-3.1-8B                        β”‚
β”‚     β”œβ”€β”€ LoRA rank 32                                         β”‚
β”‚     β”œβ”€β”€ 100 steps, batch size 128                            β”‚
β”‚     └── Cross-entropy loss                                   β”‚
β”‚                                                              β”‚
β”‚  3. REINFORCEMENT LEARNING (RL)                              β”‚
β”‚     β”œβ”€β”€ 30 iterations, 64 groups Γ— 32 samples                β”‚
β”‚     β”œβ”€β”€ Importance sampling policy gradient                  β”‚
β”‚     └── Composite reward: F1 + temporal + parity + efficiency β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
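For orientation, here is what a single SFT step looks like on Tinker. This is a minimal sketch assuming current Tinker SDK signatures; the real loop, batching, and checkpointing live in `training/train_v2.py`:
```python
import tinker
from tinker import types
from tinker_cookbook.tokenizer_utils import get_tokenizer

# One SFT step (a sketch; exact SDK signatures may differ --
# see training/train_v2.py for the actual training loop)
service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.1-8B", rank=32
)

tokenizer = get_tokenizer("meta-llama/Llama-3.1-8B")
tokens = tokenizer.encode(
    "USER: Our brand voice is playful.\nCATEGORIES: company.brand_core"
)

# Next-token prediction: the model sees tokens[:-1] and is scored on tokens[1:]
datum = types.Datum(
    model_input=types.ModelInput.from_ints(tokens[:-1]),
    loss_fn_inputs={"weights": [1.0] * (len(tokens) - 1), "target_tokens": tokens[1:]},
)

training_client.forward_backward([datum], loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=1e-4)).result()
```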
### Reward Function
```
R_total = 0.6 Γ— R_F1 + 0.2 Γ— R_temp + 0.1 Γ— R_parity + 0.1 Γ— R_eff
```
| Component | Weight | Description |
|-----------|--------|-------------|
| R_F1 | 60% | F1 score vs gold labels |
| R_temp | 20% | Persistence horizon alignment |
| R_parity | 10% | Company/user scope correctness |
| R_eff | 10% | Storage efficiency (≀3 categories) |
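A minimal Python sketch of this composite reward (the real implementation lives in `training/rl_env.py`; the binary form of `R_eff` here is an assumption):
```python
def composite_reward(pred: set[str], gold: set[str],
                     r_temp: float, r_parity: float) -> float:
    """Composite RL reward (sketch; the real version is in training/rl_env.py)."""
    # R_F1: set-level F1 between predicted and gold category labels
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    r_f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # R_eff: storage efficiency -- assumed binary penalty for predicting >3 categories
    r_eff = 1.0 if len(pred) <= 3 else 0.0

    # R_total = 0.6*R_F1 + 0.2*R_temp + 0.1*R_parity + 0.1*R_eff
    return 0.6 * r_f1 + 0.2 * r_temp + 0.1 * r_parity + 0.1 * r_eff


print(composite_reward({"company.brand_core"}, {"company.brand_core"}, 1.0, 1.0))
# -> 1.0 (up to float rounding): a perfect single-label prediction earns max reward
```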
---
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/muratcankoylan/memory-routing-agent.git
cd memory-routing-agent
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
```
### Environment Setup
```bash
# Create .env file with your API keys
echo "TINKER_API_KEY=your_tinker_key" >> .env
echo "COHERE_API_KEY=your_cohere_key" >> .env
echo "HF_TOKEN=your_huggingface_token" >> .env
```
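The scripts read these keys from the environment. If you want to load the `.env` file from Python directly, `python-dotenv` works (an assumption; it may not be listed in `requirements.txt`):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv (assumed dependency)

load_dotenv()  # reads .env in the current directory into os.environ
tinker_key = os.environ["TINKER_API_KEY"]
```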
### Run Inference
```python
import tinker
from tinker import types
from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer
# Load model from Tinker checkpoint
service_client = tinker.ServiceClient()
checkpoint = "tinker://4f4bae1f-5a95-5f53-a55a-a14f2872825c:train:0/sampler_weights/rl_iter_012"
sampling_client = service_client.create_sampling_client(model_path=checkpoint)
# Setup tokenizer and renderer
tokenizer = get_tokenizer("meta-llama/Llama-3.1-8B")
renderer = renderers.get_renderer(name="llama3", tokenizer=tokenizer)
# Classify a conversation
conversation = """
USER: Our brand voice is professional but approachable. Think Harvard Business Review meets Slack.
ASSISTANT: So authoritative content with a conversational tone?
USER: Exactly. We never use jargon without explaining it first.
"""
messages = [
{"role": "system", "content": "You route marketing conversations into structured memory categories..."},
{"role": "user", "content": f"Analyze this conversation:\n\n{conversation}"}
]
prompt = renderer.build_generation_prompt(messages)
params = types.SamplingParams(max_tokens=100, temperature=0.1, stop=renderer.get_stop_sequences())
result = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=1).result()
response, _ = renderer.parse_response(result.sequences[0].tokens)
print(f"Categories: {response['content']}")
# Output: company.brand_core
```
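The model returns category names as plain text. For multi-label outputs, a small post-processing step can split and validate the completion (a sketch, assuming comma-separated categories):
```python
# Post-process the raw completion into validated category labels
# (assumes the model separates multiple categories with commas)
VALID = {
    "company.brand_core", "company.strategic_signatures",
    "company.knowledge_artifacts", "company.business_priorities",
    "company.tools_config", "company.performance_context",
    "user.communication_style", "user.strategic_approach",
    "user.role_context", "user.workflow_patterns",
    "user.session_history", "user.interaction_preferences", "none",
}

def parse_categories(raw: str) -> list[str]:
    labels = [part.strip() for part in raw.split(",")]
    return [label for label in labels if label in VALID]

print(parse_categories("company.brand_core, user.communication_style"))
# ['company.brand_core', 'user.communication_style']
```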
---
## Project Structure
```
memory-routing-agent/
β”œβ”€β”€ assets/                       # Training visualizations
β”‚   β”œβ”€β”€ sft_loss.png
β”‚   β”œβ”€β”€ rl_reward.png
β”‚   β”œβ”€β”€ rl_components.png
β”‚   β”œβ”€β”€ model_comparison.png
β”‚   └── difficulty_comparison.png
β”œβ”€β”€ synthetic_data/               # Data generation pipeline
β”‚   β”œβ”€β”€ pipeline.py               # Cohere-based conversation generator
β”‚   β”œβ”€β”€ run_diverse_generation.py
β”‚   └── merged_training_dataset_2001.jsonl
β”œβ”€β”€ training/                     # Training scripts
β”‚   β”œβ”€β”€ train_v2.py               # Main training script (SFT + RL)
β”‚   β”œβ”€β”€ preprocess.py             # Data preprocessing
β”‚   β”œβ”€β”€ rl_env.py                 # RL environment and reward function
β”‚   β”œβ”€β”€ final_benchmark.py        # Benchmark evaluation
β”‚   β”œβ”€β”€ logs/                     # Training logs (JSONL)
β”‚   └── benchmarks/               # Benchmark results
β”œβ”€β”€ huggingface/                  # HuggingFace upload scripts
β”œβ”€β”€ docs/                         # Documentation
β”‚   β”œβ”€β”€ PRD.md                    # Product requirements
β”‚   └── tinker_docs.md            # Tinker reference
β”œβ”€β”€ MODEL_CARD.md                 # Model card
└── README.md                     # This file
```
---
## Benchmark
The Marketing Routing Benchmark contains 50 challenging scenarios across 7 domains:
| Domain | Scenarios | Description |
|--------|-----------|-------------|
| Brand & Positioning | 8 | Brand voice, values, identity |
| Strategic Decisions | 8 | Decision frameworks, heuristics |
| Performance & Metrics | 8 | Campaign metrics, learnings |
| Tools & Integrations | 6 | Tech stack, APIs |
| User Preferences | 10 | Communication style, workflow |
| Business Priorities | 6 | Goals, focus areas |
| Knowledge Artifacts | 4 | Docs, playbooks, templates |
### Run Benchmark
```bash
python training/final_benchmark.py
```
---
## Training Your Own Model
### 1. Generate Synthetic Data
```bash
cd synthetic_data
python run_diverse_generation.py --num_items 1000
```
### 2. Preprocess Data
```bash
python training/preprocess.py
```
### 3. Run Training
```bash
python training/train_v2.py
```
### 4. Evaluate
```bash
python training/final_benchmark.py
```
---
## Limitations
- **Multi-label**: Under-predicts when multiple categories apply
- **Overlap**: Struggles with company/user category overlap on edge cases
- **Domain**: Marketing-specific; not tested on other domains
---
## Links
- **HuggingFace Model**: [MuratcanKoylan/Marketing-Memory-Routing-8B](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
- **GitHub Repository**: [muratcankoylan/memory-routing-agent](https://github.com/muratcankoylan/memory-routing-agent)
- **Training Platform**: [Tinker by Thinking Machines](https://thinkingmachines.ai/)
---
## Citation
```bibtex
@misc{memory-routing-agent-2025,
  title={Memory Routing Agent: Prompt Distillation for Marketing AI},
  author={Muratcan Koylan},
  year={2025},
  howpublished={\url{https://github.com/muratcankoylan/memory-routing-agent}},
}
```
---
## License
Apache 2.0
---
## Acknowledgments
- [Thinking Machines](https://thinkingmachines.ai/) for the Tinker training platform
- [Cohere](https://cohere.com/) for Command-R-Plus teacher model
- [Meta](https://ai.meta.com/) for Llama 3.1 base model
- [Anthropic](https://anthropic.com/) for Claude, which assisted in developing this project