**Commit message:** Add RL design doc with beginner-friendly explanations

Document covers Double DQN, Actor-Critic, network architecture, training loop, API endpoints, and hyperparameters. Each section includes accessible explanations for newcomers to RL concepts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- App_RL_Design_Doc.md (added, +178 -0)
# RL Text Classification Agent - Design Document

## Overview

This application implements a **Double DQN with Actor-Critic** approach for text classification. It classifies user messages into three actions: `TRIP`, `GITHUB`, or `MAIL`.

> **For Beginners:** Think of this as a smart assistant that reads your message and decides what you want to do - book a trip, do something on GitHub, or send an email. Instead of using fixed rules, it learns from examples like a human would.

---

## Architecture

```
User Message → DistilBERT Encoder → State (768-dim) → RL Agent → Action + Confidence
```

> **For Beginners:** The flow is simple: your text message gets converted into numbers (768 of them) by a pre-trained language model. These numbers capture the meaning of your message. Then our RL agent looks at these numbers and picks the best action.
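The encoding step can be sketched in a few lines. This is a minimal sketch assuming the Hugging Face `transformers` package; the exact checkpoint (`distilbert-base-uncased`) and the pooling strategy are illustrative assumptions, not details confirmed by this doc:

```python
# Sketch: text -> 768-dim state vector via a frozen DistilBERT encoder.
# The checkpoint name and [CLS] pooling are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()  # frozen: used only for inference, never trained

def encode(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[:, 0, :].squeeze(0)  # [CLS] token embedding, shape (768,)

state = encode("Book a flight to Paris")
print(state.shape)  # torch.Size([768])
```

Taking the `[CLS]` token embedding is one common pooling choice; mean-pooling over all tokens is another.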
---

## Libraries

| Library | Purpose |
|---------|---------|
| `torch` | Neural network framework |
| `transformers` | Text encoding (DistilBERT) |
| `FastAPI` | REST API server |
| `pydantic` | Request/response validation |

> **For Beginners:** These are the main tools we use. PyTorch builds the brain (neural networks), Transformers helps understand text, FastAPI creates a web server so other apps can talk to ours, and Pydantic makes sure data is in the right format.

---

## RL Concepts Used

> **For Beginners:** Reinforcement Learning (RL) is like training a dog - the agent tries actions, gets rewards for good ones, and learns to make better choices over time. Below are the specific techniques we use.

### Double DQN
Separates action selection from action evaluation to reduce overestimation:
```python
# Select actions with the online network
best_actions = self.q_net(next_states).argmax(dim=1, keepdim=True)
# Evaluate those actions with the target network
next_q_values = self.target_q_net(next_states).gather(1, best_actions).squeeze(1)
```

> **For Beginners:** Regular DQN tends to be overconfident about how good actions are. Double DQN uses two networks - one picks the action, another judges it. It's like having a friend double-check your decisions to avoid being too optimistic.
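The snippet above can be made self-contained to show the full TD-target computation. This is a sketch with stand-in linear networks and random data; in this app each classification is a single step, so the bootstrap term may matter less than in multi-step tasks:

```python
# Self-contained sketch of the Double DQN target computation.
# The linear "networks", batch size, and variable names are illustrative.
import torch
import torch.nn as nn

state_dim, n_actions, batch = 768, 3, 4
q_net = nn.Linear(state_dim, n_actions)         # stand-in online network
target_q_net = nn.Linear(state_dim, n_actions)  # stand-in target network

next_states = torch.randn(batch, state_dim)
rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])
gamma = 0.95  # discount factor from the hyperparameter table

with torch.no_grad():
    # The online network selects the best next action...
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    # ...and the target network evaluates it (reduces overestimation).
    next_q = target_q_net(next_states).gather(1, best_actions).squeeze(1)
    td_targets = rewards + gamma * next_q

print(td_targets.shape)  # torch.Size([4])
```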
### Actor-Critic
- **Critic (Q-Network)**: Estimates action values
- **Actor (Policy Network)**: Outputs action probabilities

> **For Beginners:** Imagine a movie set - the Actor performs actions, while the Critic scores how good they were. The Actor learns to do better based on the Critic's feedback. Together, they improve faster than either would alone.

### Soft Target Update
Gradual target network updates for stability:
```python
# For each parameter pair (tau = 0.005):
target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```

> **For Beginners:** Instead of suddenly copying all knowledge to the target network (which can cause instability), we blend in just 0.5% of the new knowledge each time. It's like slowly adjusting to new information rather than making sudden changes.
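In practice the blend runs over every parameter pair of the two networks. A runnable sketch (function and variable names are assumptions):

```python
# Sketch of a soft target update over all parameter pairs, tau = 0.005.
import torch
import torch.nn as nn

def soft_update(online: nn.Module, target: nn.Module, tau: float = 0.005) -> None:
    with torch.no_grad():
        for param, target_param in zip(online.parameters(), target.parameters()):
            # Blend 0.5% of the online weights into the target weights.
            target_param.data.mul_(1 - tau).add_(tau * param.data)

online = nn.Linear(4, 2)
target = nn.Linear(4, 2)
target.load_state_dict(online.state_dict())  # start identical
soft_update(online, target)  # target stays equal since the weights match
```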
### Entropy Regularization
Encourages exploration by penalizing overconfident policies:
```python
entropy = -(probs * log_probs).sum(dim=-1).mean()
policy_loss = -advantage_weighted_loss - 0.05 * entropy
```

> **For Beginners:** We don't want the agent to become too stubborn and always pick the same action. Entropy measures "randomness" - by rewarding some entropy, we encourage the agent to stay open-minded and keep exploring different options.

### Epsilon-Greedy Exploration
During training: random actions with probability ε (decays from 1.0 → 0.05)

> **For Beginners:** At the start, the agent picks random actions 100% of the time (exploring). As training progresses, it gradually shifts to using what it learned, eventually only being random 5% of the time. It's like a new employee who tries everything at first, then settles into what works.
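A sketch of epsilon-greedy selection with multiplicative decay. The doc only states the range 1.0 → 0.05; the decay rate (0.995) and the decision logic here are illustrative assumptions:

```python
# Sketch: epsilon-greedy selection with decay toward a 0.05 floor.
import random

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995  # decay rate is an assumption
n_actions = 3  # TRIP, GITHUB, MAIL

def select_action(greedy_action: int) -> int:
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random action
    return greedy_action                    # exploit: learned action

for _ in range(2000):  # decay once per training step
    epsilon = max(eps_min, epsilon * eps_decay)

print(round(epsilon, 2))  # 0.05
```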
### Confidence Scoring
Combines entropy and probability for uncertainty estimation:
```python
confidence = (1 - entropy/max_entropy) * raw_probability
```

> **For Beginners:** The agent tells us how sure it is about its decision. If it's torn between options (high entropy) or the chosen action has low probability, confidence drops. This helps us know when to trust the agent vs. when to ask a human.
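The formula can be written out as a small function. For 3 actions the maximum entropy is ln(3); the function name and taking the top probability as `raw_probability` are assumptions:

```python
# Sketch of the confidence formula: (1 - entropy/max_entropy) * raw_probability.
import math

def confidence(probs: list) -> float:
    max_entropy = math.log(len(probs))  # ln(3) for three actions
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    raw_probability = max(probs)  # probability of the chosen action
    return (1 - entropy / max_entropy) * raw_probability

# A peaked distribution scores high; a near-uniform one scores near zero.
print(confidence([0.90, 0.05, 0.05]) > confidence([0.34, 0.33, 0.33]))  # True
```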
### Outlier Detection
Uses cosine similarity to class centroids to reject out-of-distribution inputs.

> **For Beginners:** If someone asks "What's the weather?" (not TRIP, GITHUB, or MAIL), the agent shouldn't guess. We measure how similar the input is to known categories - if it's too different from anything we trained on, we return "NONE" instead of a wrong guess.
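A sketch of the rejection step. The 0.93 threshold matches the hyperparameter table below; how the centroids are computed and stored, and the function name, are assumptions:

```python
# Sketch: reject inputs whose best cosine similarity to any class
# centroid falls below the distance threshold (0.93 per the doc).
import torch
import torch.nn.functional as F

def classify_or_none(state, centroids, threshold=0.93):
    # centroids: (3, 768), one mean embedding per class (assumed layout)
    sims = F.cosine_similarity(state.unsqueeze(0), centroids, dim=1)
    if sims.max().item() < threshold:
        return None  # out-of-distribution -> caller returns "NONE"
    return int(sims.argmax().item())

centroids = torch.eye(3, 768)  # toy orthogonal centroids for illustration
in_dist = classify_or_none(centroids[1] * 2.0, centroids)  # aligned with class 1
outlier = classify_or_none(torch.ones(768), centroids)     # close to nothing
print(in_dist, outlier)  # 1 None
```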
---

## Network Architecture

> **For Beginners:** Neural networks are layers of math operations stacked together. Data flows through each layer, getting transformed until we get our final answer. Below are our two networks - one decides what to do (Policy), the other evaluates how good actions are (Q-Network).

**Policy Network:**
```
Linear(768→128) → LayerNorm → ReLU → Dropout(0.1) →
Linear(128→128) → LayerNorm → ReLU → Dropout(0.1) →
Linear(128→3) → Softmax
```

> **For Beginners:** Takes the 768 numbers from DistilBERT, shrinks them to 128, processes them, then outputs 3 probabilities (one for each action). Dropout randomly ignores some neurons during training to prevent overfitting. Softmax ensures probabilities sum to 1.
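The diagram above maps directly onto a PyTorch module. A sketch; layer sizes follow the diagram, everything else (initialization, `nn.Sequential` rather than a custom class) is an assumption:

```python
# The policy network diagram, written as a PyTorch module.
import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Linear(768, 128), nn.LayerNorm(128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 3), nn.Softmax(dim=-1),
)

probs = policy_net(torch.randn(1, 768))
print(probs.shape)  # torch.Size([1, 3]); the 3 values sum to ~1
```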
**Q-Network:**
```
Linear(768→128) → LayerNorm → ReLU →
Linear(128→128) → LayerNorm → ReLU →
Linear(128→3)
```

> **For Beginners:** Similar structure but outputs raw scores (Q-values) for each action instead of probabilities. No Dropout here because we want stable value estimates.

---

## Training Loop

> **For Beginners:** Training is how the agent learns. We show it examples, tell it what's right/wrong, and it adjusts its internal numbers (weights) to do better next time.

1. Encode texts with DistilBERT (frozen)
2. For each batch:
   - Create positive examples (correct action → reward +1)
   - Create negative examples (wrong action → reward -1)
   - Update Q-network via TD learning
   - Update policy via advantage-weighted loss
   - Soft update target network
3. Decay epsilon

> **For Beginners:** We freeze DistilBERT (don't change it) because it's already great at understanding text. We only train our smaller RL networks. "Frozen" means we use it as a fixed tool, like using a calculator without modifying it.
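The positive/negative example construction in step 2 can be sketched as follows. Enumerating every wrong action as a negative example is an assumption (the doc doesn't specify how many negatives are drawn), as are the function and variable names:

```python
# Sketch of step 2's reward assignment: the correct action earns +1,
# each wrong action earns -1.
import torch

ACTIONS = ["TRIP", "GITHUB", "MAIL"]

def make_examples(states, labels):
    """For each (state, label) pair, emit one transition per action."""
    xs, acts, rewards = [], [], []
    for state, label in zip(states, labels):
        for action in range(len(ACTIONS)):
            xs.append(state)
            acts.append(action)
            rewards.append(1.0 if action == label.item() else -1.0)
    return torch.stack(xs), torch.tensor(acts), torch.tensor(rewards)

states = torch.randn(2, 768)
labels = torch.tensor([0, 2])  # TRIP, MAIL
_, actions, rewards = make_examples(states, labels)
print(rewards.tolist())  # [1.0, -1.0, -1.0, -1.0, -1.0, 1.0]
```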
---

## API Endpoints

> **For Beginners:** APIs let other programs talk to ours over the internet. Think of endpoints as different "phone numbers" - you call the right one depending on what you need.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check model status |
| `/action` | POST | Classify message → action + score |

**Request:**
```json
{"message": "Book a flight to Paris"}
```

**Response:**
```json
{"action": "TRIP", "score": 0.87}
```

> **For Beginners:** Send a message, get back an action and confidence score. The score (0.87 = 87%) tells you how confident the agent is. If confidence is too low, you'll get "NONE" instead.
---

## Key Hyperparameters

> **For Beginners:** Hyperparameters are settings we choose before training - they control how the learning happens. Think of them as recipe ingredients that affect the final result.

| Parameter | Value | What it means |
|-----------|-------|---------------|
| Learning rate | 1e-3 | How big each learning step is (0.001) |
| Gamma (discount) | 0.95 | How much future rewards matter vs immediate |
| Batch size | 64 | Examples processed together per step |
| Epochs | 50 | Times we go through the entire dataset |
| Confidence threshold | 0.6 | Below this, return "NONE" |
| Distance threshold | 0.93 | Similarity needed to not be an outlier |
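These values can be collected in one place so they are easy to tweak. A sketch; the field names and the use of a dataclass are assumptions:

```python
# The hyperparameter table as a single config object.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-3                   # learning rate
    gamma: float = 0.95                # discount factor
    batch_size: int = 64
    epochs: int = 50
    confidence_threshold: float = 0.6  # below this, return "NONE"
    distance_threshold: float = 0.93   # cosine-similarity floor for outliers

cfg = TrainConfig()
print(cfg.gamma)  # 0.95
```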
---

## In General

Our OS assistant app communicates with the RL agent. The app tells the agent what is currently on the screen, and the RL agent decides which action should be performed. For the demo we introduced three types of actions:
- TRIP
- GITHUB
- MAIL

> **For Beginners:** This is the big picture - imagine an assistant watching your screen. When you type something, it figures out your intent and triggers the right action automatically. It's like having a smart helper that knows whether you want to travel, code, or email just from what you say.