| # Product Requirements Document: Intelligent Memory Routing System (Tinker Implementation) | |
| ## 1. Executive Summary | |
| Deliver a production memory-routing agent for marketing AI systems using Tinker as the exclusive training platform. The agent ingests conversation snippets, decides if the content merits storage, assigns the correct ontology slot, and respects persistence horizons. We follow a two-stage pipeline: supervised prompt distillation for initialization and reinforcement learning to optimize downstream retrieval utility. All code must rely on Tinkerβs asynchronous APIs and built-in loss functions (cross-entropy and importance sampling) to stay within supported patterns. | |
| ## 2. Goals and Non-Goals | |
| - **Goals** | |
| - Maintain selective, temporally-aware storage across the 12-category taxonomy plus `none`. | |
| - Achieve reliable multi-label routing with tight output formatting compatible with Tinker renderers. | |
| - Support RL reward shaping for retrieval F1, temporal correctness, company/user separation, and storage efficiency. | |
| - Produce checkpoints consumable by downstream services via Tinker sampling clients. | |
| - **Non-Goals** | |
| - Building a retrieval engine or serving layer. | |
| - Extending Tinker beyond LoRA or supported loss functions. | |
| ## 3. Success Metrics | |
| - Exact-match accuracy β₯80% on held-out labeled data. | |
| - Macro F1 β₯90% across categories; `none` precision β₯90%, recall β₯85%. | |
| - Average predicted categories per utterance β€2.0. | |
| - Temporal alignment accuracy β₯90% (long/medium/short mapping). | |
| - KL divergence: target <0.005, warn 0.005β0.01, critical >0.01. | |
| ## 4. System Architecture Overview | |
| 1. **Synthetic Conversation Library** β Scenario templates drive GPT-5 generation to cover ontology breadth and noise patterns. | |
| 2. **Teacher Labeling** β GPT-5, prompted with taxonomy/persistence guidance, produces gold labels (multi-label + `none`). | |
| 3. **Prompt Distillation (SFT)** β Llama-3.1-8B LoRA (rank 32) is trained asynchronously via Tinker `forward_backward_async(..., loss_fn="cross_entropy")`. | |
| 4. **RL Optimization** β Same model undergoes importance sampling policy gradient loss with a custom `MemoryRoutingEnv`. | |
| 5. **Evaluation Harness** β Tinker evaluator builders and offline scripts verify accuracy, pruning behavior, and reward stability. | |
| ### Model Selection Rationale | |
| - Using `meta-llama/Llama-3.1-8B` (π Base, π§± Dense, π¦ Small) as the foundation model for this classification task. While Tinker recommends MoE models for cost efficiency and instruction-tuned models for task-specific work, we choose the base model for three reasons: | |
| 1. **Routing Neutrality**: Instruction-tuned models may have ingrained biases toward helpfulness/verbosity that conflict with selective storage decisions. The base model learns routing behavior purely from our synthetic data. | |
| 2. **Prompt Distillation Alignment**: Our two-stage pipeline (teacher labels β SFT β RL) is a classic prompt distillation setup where starting from a base model ensures we're not fighting pre-existing instruction-following patterns. | |
| 3. **Evaluation Baseline**: Establishes a clean baseline for comparing LoRA vs full fine-tuning effects without confounding variables from prior post-training. | |
| - LoRA rank 32 mirrors Tinker defaults for classification-style tasks. Higher ranks can be evaluated later if capacity becomes a bottleneck. | |
| - **Post-MVP**: Once baseline performance is established, evaluate `meta-llama/Llama-3.1-8B-Instruct` (to measure instruction-tuning impact) and `Qwen/Qwen3-30B-A3B` (MoE cost efficiency) as alternative starting points. | |
| ### Environment Design Notes | |
| - Each `MemoryRoutingEnv` is a single-step bandit: `initial_observation()` returns a tokenized conversation + stop conditions, `step()` receives the model's generated classification tokens and terminates immediately with reward. | |
| - EnvGroupBuilder clones each conversation across `group_size` rollouts for variance reduction; dataset builder provides `batch_size` EnvGroupBuilders per iteration. | |
| - No multi-turn transitions, which matches Tinker's Env definitions and keeps reward computation simple. | |
| ### Reward Computation Details | |
| The `step()` method in `MemoryRoutingEnv` performs the following sequence: | |
| 1. **Parse Model Output**: Extract predicted categories from generated tokens using renderer stop sequences. Expected format: `category1, category2, category3` (comma-separated, from valid taxonomy). | |
| 2. **Format Validation**: If parsing fails or any category is invalid, assign `R_format = -1.0` and return immediately (zero for all other reward components). | |
| 3. **Component Calculation**: | |
| - `R_F1`: Token-level F1 between predicted and gold category sets. Use macro-averaging if multi-label. | |
| - `R_temp`: Persistence alignment. +1.0 if predicted persistence matches gold (long/medium/short), +0.5 if adjacent (longβmedium or mediumβshort), 0.0 otherwise. Use majority vote if multi-label predictions span multiple persistence horizons. | |
| - `R_parity`: Company/user scope alignment. +1.0 if predicted scope (company/user/mixed/none) matches gold scope exactly, 0.0 otherwise. | |
| - `R_eff`: Storage efficiency. `1.0` if β€3 categories predicted, `0.7` if 4 categories, `0.4` if 5 categories, `0.0` if β₯6 categories. | |
| 4. **Composite Reward**: `R_total = 0.6 * R_F1 + 0.2 * R_temp + 0.1 * R_parity + 0.1 * R_eff` (unless format validation failed, then `R_total = -1.0`). | |
| **Edge Cases**: | |
| - Model outputs empty string or only stop tokens β format validation failure. | |
| - Model outputs `none` + other categories β invalid, format failure (none must be exclusive). | |
| - Model outputs duplicate categories β deduplicate before computing metrics. | |
| - Model exceeds max_tokens without hitting stop sequence β truncate and attempt parse, format failure if no valid categories extracted. | |
| ## 5. Memory Ontology | |
| | Category | Description | Persistence | | |
| | --- | --- | --- | | |
| | `company.brand_core` | Voice, values, positioning, identity anchors. | Long (>1y) | | |
| | `company.strategic_signatures` | Decision frameworks, strategic heuristics. | Long (>1y) | | |
| | `company.knowledge_artifacts` | Docs, style guides, playbooks. | Long (>1y) | | |
| | `company.business_priorities` | Quarterly/seasonal goals, active campaigns. | Short (<3m) | | |
| | `company.tools_config` | Integrations, API keys, workflow settings. | Medium (~6m) | | |
| | `company.performance_context` | Campaign metrics, retrospectives, learnings. | Rolling (~6m) | | |
| | `user.communication_style` | Tone, verbosity, format expectations. | Long (>1y) | | |
| | `user.strategic_approach` | Personal priorities, success definitions. | Long (>1y) | | |
| | `user.role_context` | Title, scope, decision authority. | Medium (~1y) | | |
| | `user.workflow_patterns` | Review cadence, collaboration norms. | Medium (~1y) | | |
| | `user.session_history` | Immediate context, recent asks. | Short (<2w) | | |
| | `user.interaction_preferences` | Coaching style, feedback expectations. | Evolving | | |
| | `none` | Irrelevant, vague, or transactional content. | Critical for noise reduction | | |
| ## 6. Data & Prompt Strategy | |
| ### Scenario Generation | |
| - Script: customize `tinker_cookbook/recipes/prompt_distillation/create_data.py`. | |
| - Inputs: category focus, distractor category, emotional tone, required signal; 4β10 turns per dialogue. | |
| - Outputs: JSONL with scenario metadata, teacher confidence, persistence hints. | |
| ### Teacher Prompt | |
| ``` | |
| System: You route marketing conversations into persistent memory. Consider each utterance and decide if it conveys a durable fact. Prefer `none` unless confident. | |
| Ontology: <category table with definitions + persistence> | |
| Rules: | |
| 1. Distinguish company.* from user.* details. | |
| 2. Match persistence horizon (long/medium/short) to signal lifetime. | |
| 3. Predict β€3 categories unless strictly necessary. | |
| Output: | |
| categories: cat1, cat2 (use `none` for no storage) | |
| ``` | |
| - Temperature 0.2, max tokens 256, stop newline. | |
| ### Student Prompt | |
| ``` | |
| System: You route marketing conversations into structured memory categories. | |
| User: Conversation: | |
| {dialogue} | |
| Available categories: | |
| - company.brand_core ... | |
| - ... | |
| - none | |
| Respond with comma-separated categories. | |
| ``` | |
| ### Renderer Configuration | |
| ```python | |
| from tinker_cookbook import renderers, tokenizer_utils | |
| tokenizer = tokenizer_utils.get_tokenizer("meta-llama/Llama-3.1-8B") | |
| renderer = renderers.get_renderer(name="llama3", tokenizer=tokenizer) | |
| stop_sequences = renderer.get_stop_sequences() | |
| sampling_params = types.SamplingParams( | |
| max_tokens=150, | |
| temperature=0.0, | |
| stop=stop_sequences, | |
| ) | |
| ``` | |
| ### Parsing & Validation | |
| - Normalize whitespace/case, strip bullets, deduplicate, enforce taxonomy membership. | |
| - Validation helper: | |
| ```python | |
| def validate_datum(datum: types.Datum, vocab_size: int) -> bool: | |
| if datum.model_input.length > 512: | |
| return False | |
| weights = datum.loss_fn_inputs["weights"].tolist() | |
| if sum(weights) == 0: | |
| return False | |
| target_tokens = datum.loss_fn_inputs["target_tokens"].tolist() | |
| if not all(0 <= t < vocab_size for t in target_tokens): | |
| return False | |
| return True | |
| ``` | |
| ## 6.5 Synthetic Data Quality Assurance | |
| - **Coverage:** β₯20 examples per category Γ persistence; β₯20% multi-label, β₯10% `none`-only. | |
| - **Noise:** β₯30% dialogues include distractors to stress selectivity. | |
| - **Signal Density:** >60% of turns include relevant info; length 6.5 Β± 1.5 turns. | |
| - **Human Audit:** Spot-check 100 samples per refresh; require β₯95% teacher agreement. | |
| - **Continuous Improvement:** Log production misses, refresh quarterly, retrain teacher prompt if accuracy drops >10%. | |
| ## 6.6 Data Preprocessing Pipeline | |
| Before SFT training, synthetic JSONL conversations must be converted to Tinker-compatible `types.Datum` objects: | |
| **Step 1: Load Synthetic Data** | |
| ```python | |
| import json | |
| with open("train.jsonl", "r") as f: | |
| conversations = [json.loads(line) for line in f] | |
| ``` | |
| **Step 2: Convert to Datum Objects** | |
| ```python | |
| from tinker import types | |
| from tinker_cookbook import renderers, tokenizer_utils | |
| tokenizer = tokenizer_utils.get_tokenizer("meta-llama/Llama-3.1-8B") | |
| renderer = renderers.get_renderer(name="llama3", tokenizer=tokenizer) | |
| def conversation_to_datum(conversation_json: dict) -> types.Datum: | |
| """Convert synthetic conversation to training datum.""" | |
| tokens, weights = renderer.build_supervised_example( | |
| conversation_json["conversation"] | |
| ) | |
| model_input = types.ModelInput.from_ints(tokens[:-1]) | |
| datum = types.Datum( | |
| model_input=model_input, | |
| loss_fn_inputs=dict( | |
| target_tokens=tokens[1:], | |
| weights=weights[1:], | |
| ), | |
| ) | |
| return datum | |
| train_data = [conversation_to_datum(conv) for conv in conversations] | |
| ``` | |
| **Step 3: Validate Datum Objects** | |
| ```python | |
| vocab_size = len(tokenizer) | |
| valid_data = [] | |
| for datum in train_data: | |
| if datum.model_input.length > 4096: | |
| print(f"Warning: Skipping example with length {datum.model_input.length}") | |
| continue | |
| weights = datum.loss_fn_inputs["weights"].tolist() | |
| if sum(weights) == 0: | |
| print("Warning: Skipping example with zero loss weights") | |
| continue | |
| target_tokens = datum.loss_fn_inputs["target_tokens"].tolist() | |
| if not all(0 <= t < vocab_size for t in target_tokens): | |
| print(f"Warning: Invalid token IDs found") | |
| continue | |
| valid_data.append(datum) | |
| print(f"Preprocessed {len(valid_data)}/{len(train_data)} examples") | |
| ``` | |
| **Step 4: Split and Save** | |
| ```python | |
| train_size = int(0.8 * len(valid_data)) | |
| train_dataset = valid_data[:train_size] | |
| test_dataset = valid_data[train_size:] | |
| ``` | |
| ## 7. Stage 1 β Prompt Distillation (Supervised Learning) | |
| ### Dataset & Batch Size | |
| - 1β2k labeled conversations (80/20 split after preprocessing per Section 6.6). | |
| - Batch size 128 (per Tinker SL guidance) balances stability/throughput; if changed, scale LR β βbatch_size. | |
| - Expected preprocessing yield: ~90-95% of raw JSONL (some examples filtered for length/validity). | |
| ### Hyperparameter Selection | |
| ```python | |
| from tinker_cookbook.hyperparam_utils import get_lr | |
| model_name = "meta-llama/Llama-3.1-8B" | |
| learning_rate = get_lr(model_name) # Returns LoRA-adjusted LR: ~2.86e-4 | |
| ``` | |
| - Tinker's `get_lr()` utility already returns the LoRA-optimized learning rate for the specified model, accounting for model size and architecture. No manual scaling needed. | |
| - Use Adam Ξ²1=0.9, Ξ²2=0.95, Ξ΅=1e-8 (Tinker SL defaults). | |
| - **Training Duration**: Start with 300 steps minimum (β20-25 epochs for 1.5k samples at batch_size=128). Tinker SL guidance recommends "at least 100 steps but usually best results with 1000 or more" - for LoRA classification tasks, 300-500 steps typically ensures convergence. | |
| - **Early Stopping**: Validate every 20 steps on test set. Stop if test loss doesn't improve for 5 consecutive evaluations (100 steps patience). | |
| - **Convergence Check**: Plot train/test loss curves. If test loss hasn't plateaued by step 300, extend to 500 steps before RL initialization. | |
| ### Async Training Loop | |
| ```python | |
| import tinker | |
| from tinker import types | |
| from tinker_cookbook.hyperparam_utils import get_lr | |
| service_client = tinker.ServiceClient() | |
| training_client = await service_client.create_lora_training_client_async( | |
| base_model="meta-llama/Llama-3.1-8B", | |
| rank=32, | |
| ) | |
| learning_rate = get_lr("meta-llama/Llama-3.1-8B") | |
| for step in range(num_steps): | |
| # Submit forward-backward pass | |
| fwd_bwd_future = await training_client.forward_backward_async( | |
| batch_data, | |
| loss_fn="cross_entropy", | |
| ) | |
| # Submit optimizer step (can overlap with forward-backward) | |
| adam_params = types.AdamParams( | |
| learning_rate=learning_rate, | |
| beta1=0.9, | |
| beta2=0.95, | |
| eps=1e-8, | |
| ) | |
| optim_future = await training_client.optim_step_async(adam_params) | |
| # Wait for both operations to complete | |
| fwd_bwd_result = await fwd_bwd_future.result_async() | |
| optim_result = await optim_future.result_async() | |
| # Log metrics from both operations | |
| log_metrics(step, fwd_bwd_result, optim_result) | |
| ``` | |
| ### Checkpointing & Sampling | |
| ```python | |
| # Save checkpoint for sampling (every 20 steps) | |
| checkpoint_future = await training_client.save_weights_for_sampler_async( | |
| name=f"sft_{step:04d}" | |
| ) | |
| checkpoint_result = await checkpoint_future.result_async() | |
| sampling_path = checkpoint_result.path | |
| # Create sampling client with the checkpoint | |
| sampling_client = service_client.create_sampling_client( | |
| model_path=sampling_path | |
| ) | |
| # Use with renderer stop sequences for evaluation | |
| stop_sequences = renderer.get_stop_sequences() | |
| sampling_params = types.SamplingParams( | |
| max_tokens=150, | |
| temperature=0.0, | |
| stop=stop_sequences, | |
| ) | |
| ``` | |
| - Save weights every 20 steps for periodic evaluation. | |
| - Critical: Must call `.result_async()` on the checkpoint future to get the path before creating sampling client. | |
| ## 8. Stage 2 β Reinforcement Learning | |
| ### Environment & Reward | |
| - `MemoryRoutingEnv` implements single-step episodes; EnvGroupBuilder replicates conversations across `group_size=8`. | |
| - Reward: `0.6 * R_F1 + 0.2 * R_temp + 0.1 * R_parity + 0.1 * R_eff`. | |
| - `R_F1`: F1 overlap with teacher labels. | |
| - `R_temp`: +1 (correct persistence), +0.5 (adjacent), 0 otherwise. | |
| - `R_parity`: +1 when company/user presence matches ground truth. | |
| - `R_eff`: 1.0 (β€3 cats), 0.7 (4), 0.4 (5), 0 (β₯6) with hard penalty for parser failures. | |
| ### Policy & Sampling Workflow | |
| ```python | |
| # Save current policy weights for sampling | |
| checkpoint_future = await training_client.save_weights_for_sampler_async( | |
| name=f"rl_step_{step:04d}" | |
| ) | |
| checkpoint_result = await checkpoint_future.result_async() | |
| sampling_path = checkpoint_result.path | |
| # Create sampling client with current policy | |
| sampling_client = service_client.create_sampling_client( | |
| model_path=sampling_path, | |
| ) | |
| # Wrap in policy completer for RL rollouts | |
| policy = TinkerTokenCompleter( | |
| sampling_client=sampling_client, | |
| max_tokens=150, | |
| temperature=0.0, | |
| stop=renderer.get_stop_sequences(), | |
| ) | |
| ``` | |
| ### Async Training Loop | |
| ```python | |
| for iteration in range(num_iterations): | |
| # 1. Gather rollouts concurrently | |
| trajectory_groups = await asyncio.gather( | |
| *[do_group_rollout(env_builder, policy) for env_builder in env_builders] | |
| ) | |
| # 2. Process trajectories | |
| filtered_groups = remove_constant_reward_groups(trajectory_groups) | |
| advantages = compute_advantages(filtered_groups) | |
| train_data, metadata = assemble_training_data(filtered_groups, advantages) | |
| # 3. Submit forward-backward pass | |
| fwd_bwd_future = await training_client.forward_backward_async( | |
| train_data, | |
| loss_fn="importance_sampling" | |
| ) | |
| # 4. Submit optimizer step | |
| adam_params = types.AdamParams( | |
| learning_rate=2e-5, | |
| beta1=0.9, | |
| beta2=0.95, | |
| eps=1e-8 | |
| ) | |
| optim_future = await training_client.optim_step_async(adam_params) | |
| # 5. Wait for both operations to complete | |
| fwd_bwd_result = await fwd_bwd_future.result_async() | |
| optim_result = await optim_future.result_async() | |
| # 6. Log metrics including KL divergence | |
| log_metrics(iteration, fwd_bwd_result, optim_result, metadata) | |
| ``` | |
| - Run β25 iterations (256 rollouts each). Adjust based on convergence and KL monitoring. | |
| ### KL Monitoring | |
| | Status | KL Range | Action | | |
| | --- | --- | --- | | |
| | Target | <0.005 | Optimal on-policy stability | | |
| | Warning | 0.005β0.01 | Log warning, monitor closely; still stable per Tinker guidance | | |
| | Critical | >0.01 | Halt run immediately, inspect sampler vs learner drift | | |
| **Implementation Notes**: | |
| - Always log `kl_sample_train_v1` and `kl_sample_train_v2` (two KL estimators per Tinker RL docs). | |
| - Per Tinker: "training is stable with KL divergence below 0.01" - values above this threshold indicate numerical instability or off-policy issues. | |
| - Even with full on-policy training, KL won't be exactly zero due to [non-determinism](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/) in batched inference. | |
| - Keep sampling temperature at 0.0 for deterministic rollouts. | |
| - Enable gradient clipping (max_norm=1.0) if KL repeatedly exceeds 0.005. | |
| - If KL exceeds 0.01, halt training, inspect checkpoint drift, and verify sampling client is using correct weights. | |
| ### Future Throughput Optimizations | |
| - After MVP, consider `StreamMinibatchConfig` to overlap sampling/training per Tinker RL docs (20β30% throughput gain). | |
| ## 9. Evaluation & Monitoring | |
| ### Inline | |
| - SFT: track train/test loss, exact-match, macro/micro F1, avg categories. | |
| - RL: log reward components, KL metrics, entropy, avg categories, stop reasons. | |
| ### Evaluators | |
| ```python | |
| from tinker_cookbook.evaluators import SamplingClientEvaluator | |
| from tinker import types | |
| class MemoryRoutingEvaluator(SamplingClientEvaluator): | |
| """Evaluates memory routing classification on held-out test set.""" | |
| def __init__(self, test_set, renderer, tokenizer): | |
| self.test_set = test_set # List of preprocessed conversations with gold labels | |
| self.renderer = renderer | |
| self.tokenizer = tokenizer | |
| async def __call__(self, sampling_client): | |
| """Run holdout evaluation on the test set.""" | |
| predictions = [] | |
| gold_labels = [] | |
| # Sample predictions for each test conversation | |
| for example in self.test_set: | |
| # Build generation prompt from conversation | |
| prompt = self.renderer.build_generation_prompt( | |
| example["conversation"] | |
| ) | |
| # Generate classification | |
| sampling_params = types.SamplingParams( | |
| max_tokens=150, | |
| temperature=0.0, | |
| stop=self.renderer.get_stop_sequences(), | |
| ) | |
| result = await sampling_client.sample_async( | |
| prompt=prompt, | |
| num_samples=1, | |
| sampling_params=sampling_params | |
| ) | |
| # Parse model output into categories | |
| pred_tokens = result.sequences[0].tokens | |
| pred_text = self.tokenizer.decode(pred_tokens) | |
| pred_categories = self._parse_categories(pred_text) | |
| predictions.append(pred_categories) | |
| gold_labels.append(set(example["labels"]["categories"])) | |
| # Compute metrics | |
| return { | |
| "exact_match": self._compute_exact_match(predictions, gold_labels), | |
| "macro_f1": self._compute_macro_f1(predictions, gold_labels), | |
| "none_precision": self._compute_none_precision(predictions, gold_labels), | |
| "temporal_accuracy": self._compute_temporal_accuracy(predictions, gold_labels), | |
| } | |
| def _parse_categories(self, text: str) -> set: | |
| """Parse comma-separated categories from model output.""" | |
| # Implementation: split on comma, strip whitespace, validate against taxonomy | |
| # Return set of valid categories or {"none"} if parsing fails | |
| pass | |
| def _compute_exact_match(self, preds, golds) -> float: | |
| """Fraction of examples where predicted set exactly matches gold set.""" | |
| pass | |
| def _compute_macro_f1(self, preds, golds) -> float: | |
| """Macro-averaged F1 across all categories.""" | |
| pass | |
| def _compute_none_precision(self, preds, golds) -> float: | |
| """Precision of 'none' category predictions.""" | |
| pass | |
| def _compute_temporal_accuracy(self, preds, golds) -> float: | |
| """Accuracy of persistence horizon alignment (requires loading full examples).""" | |
| pass | |
| ``` | |
| - Register evaluator builders with `eval_every=20` for SFT (every checkpoint) and RL loops. | |
| - Consider Inspect AI tasks after MVP for standardized benchmarking. | |
| ### Offline & Compliance | |
| - Offline script computes exact-match, macro/micro F1, `none` precision/recall, temporal accuracy, confusion matrix. | |
| - Regression suite: 100 held-out dialogues rerun after each checkpoint. | |
| - Format validator ensures comma-separated taxonomy outputs and β€3 categories typical. | |
| ## 10. Implementation Plan | |
| 1. **Scenario Refresh & QA** β Generate new datasets, run teacher labeling, enforce Section 6.5 checks. | |
| 2. **Preprocessing & Validation** β Convert to `Datum`, run parser + validator. | |
| 3. **SFT Training** β 120β160 async steps with early stopping, checkpoint weights. | |
| 4. **RL Environment Build** β Implement env/reward/evaluators, add unit tests. | |
| 5. **RL Training** β 25 iteration importance sampling run with KL monitoring. | |
| 6. **Evaluation & Sign-off** β Execute evaluator builders + offline scripts, capture qualitative samples, document results. | |
| 7. **Future Optimization** β Investigate streaming minibatch and Inspect AI integration after MVP. | |
| ## 11. Risks & Mitigations | |
| - **Format Drift:** reward penalty + strict parser; renderer stop sequences enforce termination. | |
| - **`none` Collapse:** reward weights emphasize recall, track per-category confusion, rebalance data. | |
| - **Off-Policy Instability:** monitor KL each step, warn at 0.01, halt at 0.05, keep temperature=0.0, clip gradients. | |
| - **Temporal Mislabeling:** targeted scenario generation plus dedicated reward component; run temporal audits weekly. | |
| - **Synthetic Bias:** quarterly data refresh with human audits; ingest production edge cases. | |
| ## 12. Deployment Considerations | |
| - **Inference:** Export final LoRA checkpoint via `save_weights_for_sampler(name="prod_v1")`; serve via Tinker SamplingClient or export to preferred inference stack. | |
| - **Performance Targets:** <200 ms p95 latency per routing decision; β₯100 decisions/sec on A100 (LoRA overhead β8 GB). | |
| - **Monitoring:** Weekly dashboards for category distribution, `none` precision (>85%), avg categories (<2.5), temporal accuracy, reward drift. | |
| - **Versioning:** Semantic versioning (major.minor.patch); record lineage (base β SFT β RL); keep last 3 versions for rollback. | |
| --- | |
| **Owner:** Technical Architecture Lead | |