PiotrPasztor Claude committed on
Commit
e8e26aa
·
1 Parent(s): a40c8da

Add RL design doc with beginner-friendly explanations

Document covers Double DQN, Actor-Critic, network architecture,
training loop, API endpoints, and hyperparameters. Each section
includes accessible explanations for newcomers to RL concepts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (1)
  1. App_RL_Design_Doc.md +178 -0
App_RL_Design_Doc.md ADDED
# RL Text Classification Agent - Design Document

## Overview

This application implements a **Double DQN with Actor-Critic** approach for text classification. It classifies user messages into three actions: `TRIP`, `GITHUB`, or `MAIL`.

> **For Beginners:** Think of this as a smart assistant that reads your message and decides what you want to do - book a trip, do something on GitHub, or send an email. Instead of using fixed rules, it learns from examples like a human would.

---

## Architecture

```
User Message → DistilBERT Encoder → State (768-dim) → RL Agent → Action + Confidence
```

> **For Beginners:** The flow is simple: your text message gets converted into numbers (768 of them) by a pre-trained language model. These numbers capture the meaning of your message. Then our RL agent looks at these numbers and picks the best action.

---

## Libraries

| Library | Purpose |
|---------|---------|
| `torch` | Neural network framework |
| `transformers` | Text encoding (DistilBERT) |
| `FastAPI` | REST API server |
| `pydantic` | Request/response validation |

> **For Beginners:** These are the main tools we use. PyTorch builds the brain (neural networks), Transformers helps understand text, FastAPI creates a web server so other apps can talk to ours, and Pydantic makes sure data is in the right format.

---

## RL Concepts Used

> **For Beginners:** Reinforcement Learning (RL) is like training a dog - the agent tries actions, gets rewards for good ones, and learns to make better choices over time. Below are the specific techniques we use.

### Double DQN
Separates action selection from evaluation to reduce overestimation:
```python
# Select actions with the online network
best_actions = self.q_net(next_states).argmax(dim=1, keepdim=True)
# Evaluate those actions with the target network
next_q_values = self.target_q_net(next_states).gather(1, best_actions).squeeze(1)
```

> **For Beginners:** Regular DQN tends to be overconfident about how good actions are. Double DQN uses two networks - one picks the action, another judges it. It's like having a friend double-check your decisions to avoid being too optimistic.

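The idea can be made concrete with toy networks. The names `q_net` and `target_q_net` follow the document; the shapes, reward values, and `dones` handling are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

STATE_DIM, N_ACTIONS, GAMMA = 768, 3, 0.95  # gamma from the hyperparameter table

# Toy stand-ins for the online and target Q-networks.
q_net = nn.Linear(STATE_DIM, N_ACTIONS)
target_q_net = nn.Linear(STATE_DIM, N_ACTIONS)

def double_dqn_targets(rewards, next_states, dones):
    """TD targets: the online net selects the action, the target net evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_q_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + GAMMA * (1 - dones) * next_q

batch = 4
targets = double_dqn_targets(
    rewards=torch.tensor([1.0, -1.0, 1.0, -1.0]),
    next_states=torch.randn(batch, STATE_DIM),
    dones=torch.zeros(batch),
)
print(targets.shape)  # torch.Size([4])
```

In this single-step classification setting each episode ends after one action, so in practice `dones` would be all ones and the target reduces to the reward; the full form is shown for generality.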
### Actor-Critic
- **Critic (Q-Network)**: Estimates action values
- **Actor (Policy Network)**: Outputs action probabilities

> **For Beginners:** Imagine a movie set - the Actor performs actions, while the Critic scores how good they were. The Actor learns to do better based on the Critic's feedback. Together, they improve faster than either would alone.

### Soft Target Update
Gradual target network updates for stability:
```python
target_param = tau * param + (1 - tau) * target_param  # tau = 0.005
```

> **For Beginners:** Instead of suddenly copying all knowledge to the target network (which can cause instability), we blend in just 0.5% of the new knowledge each time. It's like slowly adjusting to new information rather than making sudden changes.

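Applied across a whole network, the update above becomes a short loop over parameter pairs; the layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

TAU = 0.005  # blending factor from the document

q_net = nn.Linear(768, 3)
target_q_net = nn.Linear(768, 3)

def soft_update(online: nn.Module, target: nn.Module, tau: float = TAU) -> None:
    """Blend a small fraction of the online weights into the target network."""
    with torch.no_grad():
        for param, target_param in zip(online.parameters(), target.parameters()):
            target_param.mul_(1 - tau).add_(tau * param)

before = target_q_net.weight.clone()
soft_update(q_net, target_q_net)
# Each target weight has now moved 0.5% of the way toward the online weight.
```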
### Entropy Regularization
Encourages exploration by penalizing confident policies:
```python
entropy = -(probs * log_probs).sum(dim=-1).mean()
policy_loss = -advantage_weighted_loss - 0.05 * entropy
```

> **For Beginners:** We don't want the agent to become too stubborn and always pick the same action. Entropy measures "randomness" - by rewarding some entropy, we encourage the agent to stay open-minded and keep exploring different options.

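The entropy term on its own is easy to run; the logits below are made up, and note that a uniform distribution over the 3 actions attains the maximum entropy, log(3):

```python
import torch

logits = torch.tensor([[2.0, 0.1, -1.0],   # a fairly confident distribution
                       [0.0, 0.0, 0.0]])   # a uniform distribution
probs = torch.softmax(logits, dim=-1)
log_probs = torch.log_softmax(logits, dim=-1)

# Per-distribution entropy; higher means less certain.
entropy = -(probs * log_probs).sum(dim=-1)
print(entropy[1].item())  # close to log(3) ~= 1.0986
```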
### Epsilon-Greedy Exploration
During training: random actions with probability ε (decays from 1.0 → 0.05)

> **For Beginners:** At the start, the agent picks random actions 100% of the time (exploring). As training progresses, it gradually shifts to using what it learned, eventually only being random 5% of the time. It's like a new employee who tries everything at first, then settles into what works.

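A sketch of the mechanism; only the 1.0 → 0.05 range comes from the document, while the exponential decay rate and the Q-values are assumptions for illustration:

```python
import random

EPS_START, EPS_END, DECAY = 1.0, 0.05, 0.99  # DECAY is an assumed schedule

def epsilon_at(step: int) -> float:
    """Exponentially decay epsilon from EPS_START toward the EPS_END floor."""
    return max(EPS_END, EPS_START * DECAY ** step)

def select_action(q_values: list[float], epsilon: float) -> int:
    """With probability epsilon explore randomly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_at(0))     # 1.0
print(epsilon_at(1000))  # 0.05 (hit the floor)
```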
### Confidence Scoring
Combines entropy and probability for uncertainty estimation:
```python
confidence = (1 - entropy / max_entropy) * raw_probability
```

> **For Beginners:** The agent tells us how sure it is about its decision. If it's torn between options (high entropy) or the chosen action has low probability, confidence drops. This helps us know when to trust the agent vs. when to ask a human.

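A runnable version of the formula, assuming `raw_probability` is the probability of the chosen (argmax) action:

```python
import torch

def confidence_score(probs: torch.Tensor) -> float:
    """(1 - entropy/max_entropy) scaled by the chosen action's probability."""
    log_probs = torch.log(probs.clamp_min(1e-12))
    entropy = -(probs * log_probs).sum()
    max_entropy = torch.log(torch.tensor(float(len(probs))))
    raw_probability = probs.max()
    return ((1 - entropy / max_entropy) * raw_probability).item()

print(confidence_score(torch.tensor([0.98, 0.01, 0.01])))  # high confidence
print(confidence_score(torch.tensor([1/3, 1/3, 1/3])))     # ~0: maximally torn
```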
### Outlier Detection
Uses cosine similarity to class centroids to reject out-of-distribution inputs.

> **For Beginners:** If someone asks "What's the weather?" (not TRIP, GITHUB, or MAIL), the agent shouldn't guess. We measure how similar the input is to known categories - if it's too different from anything we trained on, we return "NONE" instead of a wrong guess.

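A sketch with 2-D vectors standing in for the 768-dim embeddings; the 0.93 threshold comes from the hyperparameter table, while the centroid values are made up:

```python
import torch
import torch.nn.functional as F

DISTANCE_THRESHOLD = 0.93  # from the hyperparameter table

# Toy class centroids (mean embedding per class); real ones are 768-dim.
centroids = {
    "TRIP":   torch.tensor([1.0, 0.0]),
    "GITHUB": torch.tensor([0.0, 1.0]),
    "MAIL":   torch.tensor([0.7, 0.7]),
}

def is_outlier(embedding: torch.Tensor) -> bool:
    """Reject inputs whose best centroid similarity falls below the threshold."""
    sims = [F.cosine_similarity(embedding, c, dim=0) for c in centroids.values()]
    return max(sims).item() < DISTANCE_THRESHOLD

print(is_outlier(torch.tensor([0.9, 0.1])))    # near the TRIP centroid -> False
print(is_outlier(torch.tensor([-1.0, -1.0])))  # far from everything -> True
```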
---

## Network Architecture

> **For Beginners:** Neural networks are layers of math operations stacked together. Data flows through each layer, getting transformed until we get our final answer. Below are our two networks - one decides what to do (Policy), the other evaluates how good actions are (Q-Network).

**Policy Network:**
```
Linear(768→128) → LayerNorm → ReLU → Dropout(0.1) →
Linear(128→128) → LayerNorm → ReLU → Dropout(0.1) →
Linear(128→3) → Softmax
```

> **For Beginners:** Takes the 768 numbers from DistilBERT, shrinks them to 128, processes them, then outputs 3 probabilities (one for each action). Dropout randomly ignores some neurons during training to prevent overfitting. Softmax ensures probabilities sum to 1.

**Q-Network:**
```
Linear(768→128) → LayerNorm → ReLU →
Linear(128→128) → LayerNorm → ReLU →
Linear(128→3)
```

> **For Beginners:** Similar structure but outputs raw scores (Q-values) for each action instead of probabilities. No Dropout here because we want stable value estimates.

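The two diagrams translate directly into PyTorch; a sketch using the layer sizes above and otherwise standard `torch.nn` modules:

```python
import torch
import torch.nn as nn

# Policy network: 768 -> 128 -> 128 -> 3, with Dropout and a final Softmax.
policy_net = nn.Sequential(
    nn.Linear(768, 128), nn.LayerNorm(128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 3), nn.Softmax(dim=-1),
)

# Q-network: same backbone, no Dropout, raw scores (no Softmax).
q_net = nn.Sequential(
    nn.Linear(768, 128), nn.LayerNorm(128), nn.ReLU(),
    nn.Linear(128, 128), nn.LayerNorm(128), nn.ReLU(),
    nn.Linear(128, 3),
)

state = torch.randn(1, 768)        # one encoded message
action_probs = policy_net(state)   # sums to 1 across the 3 actions
q_values = q_net(state)            # raw scores, one per action
print(action_probs.shape, q_values.shape)
```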
---

## Training Loop

> **For Beginners:** Training is how the agent learns. We show it examples, tell it what's right/wrong, and it adjusts its internal numbers (weights) to do better next time.

1. Encode texts with DistilBERT (frozen)
2. For each batch:
   - Create positive examples (correct action → reward +1)
   - Create negative examples (wrong action → reward -1)
   - Update Q-network via TD learning
   - Update policy via advantage-weighted loss
   - Soft update target network
3. Decay epsilon

> **For Beginners:** We freeze DistilBERT (don't change it) because it's already great at understanding text. We only train our smaller RL networks. "Frozen" means we use it as a fixed tool, like using a calculator without modifying it.

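One inner-loop step can be sketched as follows. Random embeddings stand in for the frozen DistilBERT encoder, single-layer networks stand in for the real ones, and the advantage definition (chosen Q-value minus the policy-weighted mean Q-value) is one common choice, not necessarily the document's exact loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

q_net = nn.Linear(768, 3)  # toy stand-ins for the document's networks
policy_net = nn.Sequential(nn.Linear(768, 3), nn.Softmax(dim=-1))
q_optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
pi_optim = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

states = torch.randn(64, 768)         # "frozen encoder" embeddings
actions = torch.randint(0, 3, (64,))  # sampled actions
rewards = torch.where(torch.rand(64) < 0.5,
                      torch.tensor(1.0), torch.tensor(-1.0))  # +1 / -1

# Q update: single-step episodes, so the TD target is just the reward.
q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
q_loss = F.mse_loss(q_pred, rewards)
q_optim.zero_grad(); q_loss.backward(); q_optim.step()

# Policy update: advantage-weighted log-probs plus an entropy bonus.
probs = policy_net(states)
log_probs = torch.log(probs.clamp_min(1e-12))
with torch.no_grad():
    q_values = q_net(states)
    advantage = (q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
                 - (probs * q_values).sum(dim=1))
entropy = -(probs * log_probs).sum(dim=-1).mean()
chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
policy_loss = -(advantage * chosen_log_probs).mean() - 0.05 * entropy
pi_optim.zero_grad(); policy_loss.backward(); pi_optim.step()
```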
---

## API Endpoints

> **For Beginners:** APIs let other programs talk to ours over the internet. Think of endpoints as different "phone numbers" - you call the right one depending on what you need.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check model status |
| `/action` | POST | Classify message → action + score |

**Request:**
```json
{"message": "Book a flight to Paris"}
```

**Response:**
```json
{"action": "TRIP", "score": 0.87}
```

> **For Beginners:** Send a message, get back an action and confidence score. The score (0.87 = 87%) tells you how confident the agent is. If confidence is too low, you'll get "NONE" instead.

---

## Key Hyperparameters

> **For Beginners:** Hyperparameters are settings we choose before training - they control how the learning happens. Think of them as recipe ingredients that affect the final result.

| Parameter | Value | What it means |
|-----------|-------|---------------|
| Learning rate | 1e-3 | How big each learning step is (0.001) |
| Gamma (discount) | 0.95 | How much future rewards matter vs immediate |
| Batch size | 64 | Examples processed together per step |
| Epochs | 50 | Times we go through the entire dataset |
| Confidence threshold | 0.6 | Below this, return "NONE" |
| Distance threshold | 0.93 | Similarity needed to not be an outlier |

---

## In General

Our OS assistant app communicates with the RL agent: the app tells the agent what is currently on the screen, and the agent decides which action should be performed. For the demo we introduced three types of actions:
- TRIP
- GITHUB
- MAIL

> **For Beginners:** This is the big picture - imagine an assistant watching your screen. When you type something, it figures out your intent and triggers the right action automatically. It's like having a smart helper that knows whether you want to travel, code, or email just from what you say.