mangubee committed on
Commit
87de1a7
·
1 Parent(s): 8e48c56

Update dev record with correct test/ folder paths

dev/dev_260102_13_stage2_tool_development.md CHANGED
@@ -20,12 +20,14 @@ Stage 1 established the LangGraph StateGraph skeleton with placeholder nodes. St
20
  **Chosen:** Direct Python function implementations for all tools
21
 
22
  **Why:**
 
23
  - HuggingFace Spaces doesn't support running MCP servers (requires separate processes)
24
  - Direct API approach is simpler and more reliable for deployment
25
  - Full control over retry logic, error handling, and timeouts
26
  - MCP servers are external dependencies with additional failure points
27
 
28
  **Rejected alternative:** Using MCP protocol servers for Tavily/Exa
 
29
  - Would require complex Docker configuration on HF Spaces
30
  - Additional process management overhead
31
  - Not necessary for MVP stage
@@ -35,12 +37,14 @@ Stage 1 established the LangGraph StateGraph skeleton with placeholder nodes. St
35
  **Chosen:** Use `tenacity` library with exponential backoff, max 3 retries
36
 
37
  **Why:**
 
38
  - Industry-standard retry library with clean decorator syntax
39
  - Exponential backoff prevents API rate limit issues
40
  - Configurable retry conditions (only retry on connection errors, not on validation errors)
41
  - Easy to test with mocking
42
 
43
  **Configuration:**
 
44
  - Max retries: 3
45
  - Min wait: 1 second
46
  - Max wait: 10 seconds
@@ -49,11 +53,13 @@ Stage 1 established the LangGraph StateGraph skeleton with placeholder nodes. St
49
  ### Decision 3: Tool Architecture - Unified Functions with Fallback
50
 
51
  **Pattern applied to all tools:**
 
52
  - Primary implementation (e.g., `tavily_search`)
53
  - Fallback implementation (e.g., `exa_search`)
54
  - Unified function with automatic fallback (e.g., `search`)
55
 
56
  **Example:**
 
57
  ```python
58
  def search(query):
59
  if default_tool == "tavily":
@@ -70,6 +76,7 @@ def search(query):
70
  **Chosen:** Custom AST visitor with whitelisted operations only
71
 
72
  **Why:**
 
73
  - Python's `eval()` is dangerous (arbitrary code execution)
74
  - `ast.literal_eval()` is too restrictive (doesn't support math operations)
75
  - Custom AST visitor allows precise control over allowed operations
@@ -77,10 +84,12 @@ def search(query):
77
  - Whitelist approach: only allow known-safe operations (add, multiply, sin, cos, etc.)
78
 
79
  **Rejected alternatives:**
 
80
  - Using `eval()`: Major security vulnerability
81
  - Using `sympify()` from sympy: Too complex, allows too much
82
 
83
  **Security layers:**
 
84
  1. AST whitelist (only allow specific node types)
85
  2. Expression length limit (500 chars)
86
  3. Number size limit (prevent huge calculations)
@@ -102,6 +111,7 @@ def parse_file(file_path):
102
  ```
103
 
104
  **Why:**
 
105
  - Simple interface for users (one function for all file types)
106
  - Easy to add new file types (just add new parser and update dispatcher)
107
  - Each parser can have format-specific logic
@@ -112,11 +122,13 @@ def parse_file(file_path):
112
  **Chosen:** Gemini 2.0 Flash as primary, Claude Sonnet 4.5 as fallback
113
 
114
  **Why:**
 
115
  - Gemini 2.0 Flash: Free tier (1500 req/day), fast, good quality
116
  - Claude Sonnet 4.5: Paid but highest quality, automatic fallback if Gemini fails
117
  - Same pattern as web search (primary + fallback = reliability)
118
 
119
  **Image handling:**
 
120
  - Load file, encode as base64
121
  - Check file size (max 10MB)
122
  - Support common formats (JPG, PNG, GIF, WEBP, BMP)
@@ -132,7 +144,7 @@ Successfully implemented 4 production-ready tools with comprehensive error handl
132
  - Tavily API integration (primary, free tier)
133
  - Exa API integration (fallback, paid)
134
  - Automatic fallback if primary fails
135
- - 10 passing tests (mock API, retry logic, fallback mechanism)
136
 
137
  2. **File Parser Tool** ([src/tools/file_parser.py](../src/tools/file_parser.py))
138
  - PDF parsing (PyPDF2)
@@ -140,21 +152,21 @@ Successfully implemented 4 production-ready tools with comprehensive error handl
140
  - Word parsing (python-docx)
141
  - Text/CSV parsing (built-in open)
142
  - Generic `parse_file()` dispatcher
143
- - 19 passing tests (real files + error handling)
144
 
145
  3. **Calculator Tool** ([src/tools/calculator.py](../src/tools/calculator.py))
146
  - Safe AST-based expression evaluation
147
  - Whitelisted operations only (no code execution)
148
  - Mathematical functions (sin, cos, sqrt, factorial, etc.)
149
  - Security hardened (timeout, complexity limits)
150
- - 41 passing tests (arithmetic, functions, security)
151
 
152
  4. **Vision Tool** ([src/tools/vision.py](../src/tools/vision.py))
153
  - Multimodal image analysis using LLMs
154
  - Gemini 2.0 Flash (primary, free)
155
  - Claude Sonnet 4.5 (fallback, paid)
156
  - Image loading and base64 encoding
157
- - 15 passing tests (mock LLM responses)
158
 
159
  5. **Tool Registry** ([src/tools/__init__.py](../src/tools/__init__.py))
160
  - Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
@@ -167,12 +179,14 @@ Successfully implemented 4 production-ready tools with comprehensive error handl
167
  - Stage 3: Will add dynamic tool selection and execution
168
 
169
  **Test Coverage:**
 
170
  - 85 tool tests passing (web_search: 10, file_parser: 19, calculator: 41, vision: 15)
171
  - 6 existing agent tests still passing
172
  - 91 total tests passing
173
  - No regressions from Stage 1
174
 
175
  **Deployment:**
 
176
  - All changes committed and pushed to HuggingFace Spaces
177
  - Build succeeded
178
  - Agent now reports: "Stage 2 complete: 4 tools ready for execution in Stage 3"
@@ -198,6 +212,7 @@ def tool_name(args):
198
  ```
199
 
200
  **Why it works:**
 
201
  - Maximizes reliability (2 chances to succeed)
202
  - Transparent to users (single function call)
203
  - Preserves cost optimization (use free tier first, paid only as fallback)
@@ -209,17 +224,19 @@ def tool_name(args):
209
  Creating real test fixtures (sample.pdf, sample.xlsx, etc.) was critical for file parser testing:
210
 
211
  **What worked:**
 
212
  - Tests are realistic (test actual file parsing, not just mocks)
213
  - Easy to add new test cases (just add new fixture files)
214
  - Catches edge cases that mocks miss
215
 
216
  **Created fixtures:**
217
- - `tests/fixtures/sample.txt` - Plain text
218
- - `tests/fixtures/sample.csv` - CSV data
219
- - `tests/fixtures/sample.xlsx` - Excel spreadsheet
220
- - `tests/fixtures/sample.docx` - Word document
221
- - `tests/fixtures/test_image.jpg` - Test image (red square)
222
- - `tests/fixtures/generate_fixtures.py` - Script to regenerate fixtures
 
223
 
224
  **Recommendation:** For any file processing tool, create a comprehensive fixture library.
225
 
@@ -241,6 +258,7 @@ with patch('google.genai.Client') as mock_client:
241
  Initially planned to create `tests/test_tools_integration.py` for cross-tool testing. **Decision:** Skip for Stage 2.
242
 
243
  **Why:**
 
244
  - Tools work independently (don't need to interact yet)
245
  - Integration testing makes sense in Stage 3 when tools are orchestrated
246
  - Unit tests provide sufficient coverage for Stage 2
@@ -255,16 +273,16 @@ Initially planned to create `tests/test_tools_integration.py` for cross-tool tes
255
  - `src/tools/file_parser.py` - PDF/Excel/Word/Text parsing with retry logic
256
  - `src/tools/calculator.py` - Safe AST-based math evaluation
257
  - `src/tools/vision.py` - Multimodal image analysis (Gemini/Claude)
258
- - `tests/test_web_search.py` - 10 tests for web search tool
259
- - `tests/test_file_parser.py` - 19 tests for file parser
260
- - `tests/test_calculator.py` - 41 tests for calculator (including security)
261
- - `tests/test_vision.py` - 15 tests for vision tool
262
- - `tests/fixtures/sample.txt` - Test text file
263
- - `tests/fixtures/sample.csv` - Test CSV file
264
- - `tests/fixtures/sample.xlsx` - Test Excel file
265
- - `tests/fixtures/sample.docx` - Test Word document
266
- - `tests/fixtures/test_image.jpg` - Test image
267
- - `tests/fixtures/generate_fixtures.py` - Fixture generation script
268
 
269
  **What was modified:**
270
 
 
20
  **Chosen:** Direct Python function implementations for all tools
21
 
22
  **Why:**
23
+
24
  - HuggingFace Spaces doesn't support running MCP servers (requires separate processes)
25
  - Direct API approach is simpler and more reliable for deployment
26
  - Full control over retry logic, error handling, and timeouts
27
  - MCP servers are external dependencies with additional failure points
28
 
29
  **Rejected alternative:** Using MCP protocol servers for Tavily/Exa
30
+
31
  - Would require complex Docker configuration on HF Spaces
32
  - Additional process management overhead
33
  - Not necessary for MVP stage
 
37
  **Chosen:** Use `tenacity` library with exponential backoff, max 3 retries
38
 
39
  **Why:**
40
+
41
  - Industry-standard retry library with clean decorator syntax
42
  - Exponential backoff prevents API rate limit issues
43
  - Configurable retry conditions (only retry on connection errors, not on validation errors)
44
  - Easy to test with mocking
45
 
46
  **Configuration:**
47
+
48
  - Max retries: 3
49
  - Min wait: 1 second
50
  - Max wait: 10 seconds
 
53
  ### Decision 3: Tool Architecture - Unified Functions with Fallback
54
 
55
  **Pattern applied to all tools:**
56
+
57
  - Primary implementation (e.g., `tavily_search`)
58
  - Fallback implementation (e.g., `exa_search`)
59
  - Unified function with automatic fallback (e.g., `search`)
60
 
61
  **Example:**
62
+
63
  ```python
64
  def search(query):
65
  if default_tool == "tavily":
 
76
  **Chosen:** Custom AST visitor with whitelisted operations only
77
 
78
  **Why:**
79
+
80
  - Python's `eval()` is dangerous (arbitrary code execution)
81
  - `ast.literal_eval()` is too restrictive (doesn't support math operations)
82
  - Custom AST visitor allows precise control over allowed operations
 
84
  - Whitelist approach: only allow known-safe operations (add, multiply, sin, cos, etc.)
85
 
86
  **Rejected alternatives:**
87
+
88
  - Using `eval()`: Major security vulnerability
89
  - Using `sympify()` from sympy: Too complex, allows too much
90
 
91
  **Security layers:**
92
+
93
  1. AST whitelist (only allow specific node types)
94
  2. Expression length limit (500 chars)
95
  3. Number size limit (prevent huge calculations)
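A minimal sketch of the three layers, using hypothetical names (`safe_eval`, `_eval`, `_BIN_OPS`) rather than the actual `src/tools/calculator.py` internals:

```python
import ast
import math
import operator

# Layer 1: explicit whitelists -- anything not listed is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_FUNCS = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt, "factorial": math.factorial}

def safe_eval(expr: str) -> float:
    if len(expr) > 500:                                   # layer 2: length limit
        raise ValueError("expression too long")
    return _eval(ast.parse(expr, mode="eval").body)

def _eval(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        if abs(node.value) > 1e100:                       # layer 3: number size limit
            raise ValueError("number too large")
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS):
        return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
    raise ValueError("disallowed expression")             # whitelist miss -> reject
```

Note the difference from `eval()`: an input like `__import__('os')` parses fine, but the `Call` branch only accepts names in `_FUNCS`, so it is rejected before anything executes.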
 
111
  ```
112
 
113
  **Why:**
114
+
115
  - Simple interface for users (one function for all file types)
116
  - Easy to add new file types (just add new parser and update dispatcher)
117
  - Each parser can have format-specific logic
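The dispatcher pattern can be sketched as follows, with only the plain-text parsers filled in (names here are illustrative, not the real module's; the binary formats are stubbed out):

```python
from pathlib import Path

def _parse_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

# Extension -> parser dispatch table. Adding a file type means adding
# one parser function and one entry here.
PARSERS = {
    ".txt": _parse_text,
    ".csv": _parse_text,
    # ".pdf": _parse_pdf (PyPDF2), ".xlsx": _parse_xlsx, ".docx": _parse_docx, ...
}

def parse_file(file_path: str) -> str:
    suffix = Path(file_path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"unsupported file type: {suffix}")
    return PARSERS[suffix](file_path)

# Demo: write a scratch file, parse it, clean up.
_demo = Path("parse_demo.txt")
_demo.write_text("hello fixtures", encoding="utf-8")
content = parse_file(str(_demo))
_demo.unlink()
```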
 
122
  **Chosen:** Gemini 2.0 Flash as primary, Claude Sonnet 4.5 as fallback
123
 
124
  **Why:**
125
+
126
  - Gemini 2.0 Flash: Free tier (1500 req/day), fast, good quality
127
  - Claude Sonnet 4.5: Paid but highest quality, automatic fallback if Gemini fails
128
  - Same pattern as web search (primary + fallback = reliability)
129
 
130
  **Image handling:**
131
+
132
  - Load file, encode as base64
133
  - Check file size (max 10MB)
134
  - Support common formats (JPG, PNG, GIF, WEBP, BMP)
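The load-and-encode step might look like the following sketch (`load_image_b64` is a hypothetical name; the real tool passes the encoded payload on to Gemini, or Claude on fallback):

```python
import base64
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024                              # 10MB limit from above
ALLOWED = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}

def load_image_b64(path: str) -> str:
    p = Path(path)
    if p.suffix.lower() not in ALLOWED:                   # format check
        raise ValueError(f"unsupported image format: {p.suffix}")
    data = p.read_bytes()
    if len(data) > MAX_BYTES:                             # size check before encoding
        raise ValueError("image exceeds 10MB limit")
    return base64.b64encode(data).decode("ascii")

# Demo: encode a few bytes standing in for a real image file.
_demo = Path("demo_image.png")
_demo.write_bytes(b"\x89PNG-demo-bytes")
encoded = load_image_b64(str(_demo))
_demo.unlink()
```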
 
144
  - Tavily API integration (primary, free tier)
145
  - Exa API integration (fallback, paid)
146
  - Automatic fallback if primary fails
147
+ - 10 passing tests in [test/test_web_search.py](../test/test_web_search.py)
148
 
149
  2. **File Parser Tool** ([src/tools/file_parser.py](../src/tools/file_parser.py))
150
  - PDF parsing (PyPDF2)
 
152
  - Word parsing (python-docx)
153
  - Text/CSV parsing (built-in open)
154
  - Generic `parse_file()` dispatcher
155
+ - 19 passing tests in [test/test_file_parser.py](../test/test_file_parser.py)
156
 
157
  3. **Calculator Tool** ([src/tools/calculator.py](../src/tools/calculator.py))
158
  - Safe AST-based expression evaluation
159
  - Whitelisted operations only (no code execution)
160
  - Mathematical functions (sin, cos, sqrt, factorial, etc.)
161
  - Security hardened (timeout, complexity limits)
162
+ - 41 passing tests in [test/test_calculator.py](../test/test_calculator.py)
163
 
164
  4. **Vision Tool** ([src/tools/vision.py](../src/tools/vision.py))
165
  - Multimodal image analysis using LLMs
166
  - Gemini 2.0 Flash (primary, free)
167
  - Claude Sonnet 4.5 (fallback, paid)
168
  - Image loading and base64 encoding
169
+ - 15 passing tests in [test/test_vision.py](../test/test_vision.py)
170
 
171
  5. **Tool Registry** ([src/tools/__init__.py](../src/tools/__init__.py))
172
  - Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
 
179
  - Stage 3: Will add dynamic tool selection and execution
180
 
181
  **Test Coverage:**
182
+
183
  - 85 tool tests passing (web_search: 10, file_parser: 19, calculator: 41, vision: 15)
184
  - 6 existing agent tests still passing
185
  - 91 total tests passing
186
  - No regressions from Stage 1
187
 
188
  **Deployment:**
189
+
190
  - All changes committed and pushed to HuggingFace Spaces
191
  - Build succeeded
192
  - Agent now reports: "Stage 2 complete: 4 tools ready for execution in Stage 3"
 
212
  ```
213
 
214
  **Why it works:**
215
+
216
  - Maximizes reliability (2 chances to succeed)
217
  - Transparent to users (single function call)
218
  - Preserves cost optimization (use free tier first, paid only as fallback)
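The pattern reduces to a small higher-order wrapper. This is a generic sketch with stand-in functions, not the project's actual code:

```python
def with_fallback(primary, fallback):
    """Wrap two implementations into one tool: try primary, else fallback."""
    def tool(*args, **kwargs):
        try:
            return primary(*args, **kwargs)     # e.g. free-tier Tavily first
        except Exception:
            return fallback(*args, **kwargs)    # e.g. paid Exa only on failure
    return tool

# Stand-ins demonstrating the two paths:
def broken_primary(query):
    raise ConnectionError("primary API down")

def working_fallback(query):
    return f"fallback result for {query!r}"

search = with_fallback(broken_primary, working_fallback)
result = search("capital of France")
```

The caller sees a single `search()` regardless of which branch served the request, which is the "transparent to users" property above.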
 
224
  Creating real test fixtures (sample.pdf, sample.xlsx, etc.) was critical for file parser testing:
225
 
226
  **What worked:**
227
+
228
  - Tests are realistic (test actual file parsing, not just mocks)
229
  - Easy to add new test cases (just add new fixture files)
230
  - Catches edge cases that mocks miss
231
 
232
  **Created fixtures:**
233
+
234
+ - `test/fixtures/sample.txt` - Plain text
235
+ - `test/fixtures/sample.csv` - CSV data
236
+ - `test/fixtures/sample.xlsx` - Excel spreadsheet
237
+ - `test/fixtures/sample.docx` - Word document
238
+ - `test/fixtures/test_image.jpg` - Test image (red square)
239
+ - `test/fixtures/generate_fixtures.py` - Script to regenerate fixtures
240
 
241
  **Recommendation:** For any file processing tool, create a comprehensive fixture library.
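A trimmed-down sketch of what such a regeneration script can look like, covering only the plain-text fixtures (the real `generate_fixtures.py` also builds the .xlsx, .docx, and image fixtures with the respective libraries):

```python
import csv
from pathlib import Path

def generate_fixtures(root: str = "test/fixtures") -> None:
    """Recreate the plain-text fixture files under the given directory."""
    out = Path(root)
    out.mkdir(parents=True, exist_ok=True)
    (out / "sample.txt").write_text("Hello, fixture!\n", encoding="utf-8")
    with (out / "sample.csv").open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "value"])
        writer.writerow(["pi", "3.14159"])
```

Keeping fixture generation in a script (rather than committing opaque binaries alone) makes it easy to regenerate or extend the library when a parser changes.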
242
 
 
258
  Initially planned to create `tests/test_tools_integration.py` for cross-tool testing. **Decision:** Skip for Stage 2.
259
 
260
  **Why:**
261
+
262
  - Tools work independently (don't need to interact yet)
263
  - Integration testing makes sense in Stage 3 when tools are orchestrated
264
  - Unit tests provide sufficient coverage for Stage 2
 
273
  - `src/tools/file_parser.py` - PDF/Excel/Word/Text parsing with retry logic
274
  - `src/tools/calculator.py` - Safe AST-based math evaluation
275
  - `src/tools/vision.py` - Multimodal image analysis (Gemini/Claude)
276
+ - `test/test_web_search.py` - 10 tests for web search tool
277
+ - `test/test_file_parser.py` - 19 tests for file parser
278
+ - `test/test_calculator.py` - 41 tests for calculator (including security)
279
+ - `test/test_vision.py` - 15 tests for vision tool
280
+ - `test/fixtures/sample.txt` - Test text file
281
+ - `test/fixtures/sample.csv` - Test CSV file
282
+ - `test/fixtures/sample.xlsx` - Test Excel file
283
+ - `test/fixtures/sample.docx` - Test Word document
284
+ - `test/fixtures/test_image.jpg` - Test image
285
+ - `test/fixtures/generate_fixtures.py` - Fixture generation script
286
 
287
  **What was modified:**
288