Upload README.md with huggingface_hub
README.md (CHANGED)

# Memory Routing Agent

**A specialized 8B model that outperforms its 104B teacher on marketing conversation classification.**

[Model on Hugging Face](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
[GitHub Repository](https://github.com/muratcankoylan/memory-routing-agent)
[License: Apache 2.0](LICENSE)

---

## The Experiment

This project demonstrates **prompt distillation**: training a small, specialized model to outperform the large model that generated its training data.

### The Challenge

Marketing AI assistants need to remember the right information from conversations. Not everything is worth storing; you need to distinguish between:

- **Valuable**: "Our brand voice is professional but approachable" → Store in long-term memory
- **Transactional**: "What time is the meeting tomorrow?" → Don't store

This is a **13-category classification problem** with nuanced distinctions between company-level and user-level information, different persistence horizons, and the critical ability to say "none" for irrelevant content.
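
To make the input/output contract concrete, here is a small illustrative sketch (the messages echo the examples above; the exact output format is an assumption, and the category names are defined in the taxonomy later in this README):

```python
# Illustrative routing examples; labels come from the 13-category taxonomy below.
examples = [
    {
        "message": "Our brand voice is professional but approachable.",
        "categories": ["company.brand_core"],  # valuable -> store long-term
    },
    {
        "message": "What time is the meeting tomorrow?",
        "categories": ["none"],                # transactional -> don't store
    },
]
```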

### The Approach

1. **Generate synthetic data** using Cohere Command-R-Plus (104B) as the teacher
2. **Fine-tune Llama-3.1-8B** with LoRA using Tinker's training platform
3. **Apply reinforcement learning** with a custom reward function
4. **Benchmark against the teacher** on challenging, held-out scenarios

### The Result

| Model | Parameters | Avg F1 | Exact Match |
|-------|------------|--------|-------------|
| **Ours** | **8B** | **0.68** | **60%** |
| Cohere Command-R-Plus | 104B | 0.61 | 26% |

**Our 8B model achieves 11.1% higher F1 and 2.3x better exact match accuracy than the 104B teacher, while being 13x smaller.**

The student surpassed the teacher through:

- **Focused training**: The model only learns this one task, not general capabilities
- **RL refinement**: The reward function optimizes for exact category matching, not just plausible outputs
- **Clean data**: Synthetic data with consistent labeling, no noise from human annotation disagreements

---

## Training Visualizations

### Phase 1: Supervised Fine-Tuning



100 training steps reduced loss from 5.47 to 0.26 (95% reduction). The model learned the basic classification task in the first epoch.

### Phase 2: Reinforcement Learning



30 RL iterations improved mean reward from 0.73 to 0.93. The reward function combines F1 score, temporal alignment, scope correctness, and storage efficiency.

### Model Comparison



Our model excels at exact matching (60% vs 26%) because RL optimizes for getting all categories right, not just some.

### Performance by Difficulty



The 8B model dominates on easy cases (+79% F1) and matches on medium cases. The 104B model still wins on hard multi-label scenarios.

---

## Key Results

| Metric | Our Model (8B) | Cohere (104B) |
|--------|----------------|---------------|
| **Avg F1** | **0.68** | 0.61 |
| **Exact Match** | **60%** | 26% |
| Any Match | 72% | 82% |
| Model Size | 8B | 104B |
| **Improvement** | **+11.1% F1** | baseline |
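
As a point of reference, here is a minimal sketch of how per-example multi-label F1 and exact match can be computed over category sets (conceptually consistent with the table above, but not necessarily the exact code in `training/final_benchmark.py`):

```python
def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 between predicted and gold category sets."""
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision, recall = tp / len(predicted), tp / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def exact_match(predicted: set[str], gold: set[str]) -> bool:
    """True only when every gold category is predicted and nothing extra is added."""
    return predicted == gold

# A partially correct prediction earns F1 credit but no exact-match credit.
pred, gold = {"company.brand_core"}, {"company.brand_core", "user.role_context"}
print(round(f1_score(pred, gold), 2), exact_match(pred, gold))  # 0.67 False
```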

### Reward Components (Final RL Iteration)

| Component | Score | Description |
|-----------|-------|-------------|
| R_F1 | 0.90 | F1 score vs gold labels |
| R_temp | 0.95 | Temporal alignment |
| R_parity | 1.00 | Company/user scope |
| R_eff | 1.00 | Storage efficiency |

---

## What It Does

The Memory Routing Agent classifies marketing conversations into 13 memory categories:

### Company Categories (Long-term business context)

| Category | Description | Persistence |
|----------|-------------|-------------|
| `company.brand_core` | Voice, values, positioning | Long (>1y) |
| `company.strategic_signatures` | Decision frameworks | Long (>1y) |
| `company.knowledge_artifacts` | Docs, style guides | Long (>1y) |
| `company.business_priorities` | Quarterly goals | Short (<3m) |
| `company.tools_config` | Integrations, APIs | Medium (~6m) |
| `company.performance_context` | Campaign metrics | Rolling (~6m) |

### User Categories (Personal preferences)

| Category | Description | Persistence |
|----------|-------------|-------------|
| `user.communication_style` | Tone, format preferences | Long (>1y) |
| `user.strategic_approach` | Personal priorities | Long (>1y) |
| `user.role_context` | Title, scope | Medium (~1y) |
| `user.workflow_patterns` | Review cadence | Medium (~1y) |
| `user.session_history` | Immediate context | Short (<2w) |
| `user.interaction_preferences` | Coaching style | Evolving |

### Special

| Category | Description |
|----------|-------------|
| `none` | Transactional or irrelevant content |
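
A minimal sketch of the taxonomy as a lookup table, useful for validating model outputs (category names and horizons are copied from the tables above; the dictionary itself is illustrative, not part of the released code):

```python
# Category -> rough persistence horizon (from the tables above).
PERSISTENCE = {
    "company.brand_core": "long (>1y)",
    "company.strategic_signatures": "long (>1y)",
    "company.knowledge_artifacts": "long (>1y)",
    "company.business_priorities": "short (<3m)",
    "company.tools_config": "medium (~6m)",
    "company.performance_context": "rolling (~6m)",
    "user.communication_style": "long (>1y)",
    "user.strategic_approach": "long (>1y)",
    "user.role_context": "medium (~1y)",
    "user.workflow_patterns": "medium (~1y)",
    "user.session_history": "short (<2w)",
    "user.interaction_preferences": "evolving",
    "none": None,  # transactional content is not stored
}

def keep_known(labels: list[str]) -> list[str]:
    """Drop any label that is not one of the 13 categories."""
    return [label for label in labels if label in PERSISTENCE]
```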

---

## Training Pipeline

### Reward Function

The RL phase uses a composite reward:

```
R_total = 0.6 × R_F1 + 0.2 × R_temp + 0.1 × R_parity + 0.1 × R_eff
```

| Component | Weight | Description |
|-----------|--------|-------------|
| R_F1 | 60% | F1 score vs gold labels |
| R_temp | 20% | Persistence horizon alignment |
| R_parity | 10% | Company/user scope correctness |
| R_eff | 10% | Storage efficiency (≤3 categories) |
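
A minimal sketch of the composite reward (weights match the formula above; the component scores themselves would come from the task-specific scoring code, which is not shown here):

```python
# Weights from R_total = 0.6*R_F1 + 0.2*R_temp + 0.1*R_parity + 0.1*R_eff
def total_reward(r_f1: float, r_temp: float, r_parity: float, r_eff: float) -> float:
    """Composite reward; each component is assumed to lie in [0, 1]."""
    return 0.6 * r_f1 + 0.2 * r_temp + 0.1 * r_parity + 0.1 * r_eff

# With the final-iteration component scores reported above:
print(round(total_reward(0.90, 0.95, 1.00, 1.00), 2))  # 0.93
```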

---

## Quick Start

### Installation

```bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### Environment Setup

```bash
# Create .env file with your API keys
echo "TINKER_API_KEY=your_tinker_key" >> .env
echo "COHERE_API_KEY=your_cohere_key" >> .env
echo "HF_TOKEN=your_huggingface_token" >> .env
```
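
To read those keys back from Python, here is a minimal sketch using `python-dotenv` (an assumption about tooling; the training scripts may load configuration differently):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file created above
tinker_key = os.getenv("TINKER_API_KEY")
cohere_key = os.getenv("COHERE_API_KEY")
hf_token = os.getenv("HF_TOKEN")
```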

### Usage

```python
import tinker
from tinker import types
from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer

# Load model from Tinker checkpoint
service_client = tinker.ServiceClient()
checkpoint = "tinker://4f4bae1f-5a95-5f53-a55a-a14f2872825c:train:0/sampler_weights/rl_iter_012"
sampling_client = service_client.create_sampling_client(model_path=checkpoint)

# ... render the conversation, sample, and collect the reply into `response` ...

print(f"Categories: {response['content']}")
# Output: company.brand_core
```
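
The reply arrives as plain text; a small post-processing sketch, assuming the model returns a comma-separated list of category names (the exact reply format is an assumption rather than documented behavior):

```python
def parse_categories(raw: str) -> list[str]:
    """Split a reply like 'company.brand_core, user.role_context' into labels."""
    return [part.strip() for part in raw.split(",") if part.strip()]

print(parse_categories("company.brand_core"))  # ['company.brand_core']
```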

---

## Project Structure

```
└── README.md              # This file
```

---

## Benchmark

The Marketing Routing Benchmark contains 50 challenging scenarios across 7 domains. Run it with:

```bash
python training/final_benchmark.py
```

---

## Training Your Own Model

### 1. Generate Synthetic Data

---

## Limitations

- **Multi-label**: Under-predicts when multiple categories apply
- **Overlap**: Struggles with company/user category overlap on edge cases
- **Domain**: Marketing-specific; not tested on other domains

---

## Links

- **Hugging Face Model**: [MuratcanKoylan/Marketing-Memory-Routing-8B](https://huggingface.co/MuratcanKoylan/Marketing-Memory-Routing-8B)
- **GitHub Repository**: [muratcankoylan/memory-routing-agent](https://github.com/muratcankoylan/memory-routing-agent)
- **Training Platform**: [Tinker by Thinking Machines](https://thinkingmachines.ai/)

---

## Citation

```bibtex
}
```

---

## License

Apache 2.0

---

## Acknowledgments

- [Thinking Machines](https://thinkingmachines.ai/) for the Tinker training platform
- [Cohere](https://cohere.com/) for the Command-R-Plus teacher model
- [Meta](https://ai.meta.com/) for the Llama 3.1 base model
- [Anthropic](https://anthropic.com/) for Claude, which assisted in developing this project