Update dev record with correct test/ folder paths

dev/dev_260102_13_stage2_tool_development.md
**Chosen:** Direct Python function implementations for all tools

**Why:**

- HuggingFace Spaces doesn't support running MCP servers (requires separate processes)
- Direct API approach is simpler and more reliable for deployment
- Full control over retry logic, error handling, and timeouts
- MCP servers are external dependencies with additional failure points

**Rejected alternative:** Using MCP protocol servers for Tavily/Exa

- Would require complex Docker configuration on HF Spaces
- Additional process management overhead
- Not necessary for MVP stage

**Chosen:** Use the `tenacity` library with exponential backoff, max 3 retries

**Why:**

- Industry-standard retry library with clean decorator syntax
- Exponential backoff prevents API rate-limit issues
- Configurable retry conditions (only retry on connection errors, not on validation errors)
- Easy to test with mocking

**Configuration:**

- Max retries: 3
- Min wait: 1 second
- Max wait: 10 seconds

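This configuration maps directly onto tenacity's `retry` decorator with `stop_after_attempt(3)`, `wait_exponential(min=1, max=10)`, and `retry_if_exception_type(ConnectionError)`. As a dependency-free sketch of the same behavior (the helper name is illustrative, not from the codebase):

```python
import time

def retry_on_connection_error(func, max_attempts=3, min_wait=1.0, max_wait=10.0):
    """Retry `func` on ConnectionError with exponential backoff.

    Stdlib stand-in for the tenacity equivalent:
        @retry(stop=stop_after_attempt(3),
               wait=wait_exponential(min=1, max=10),
               retry=retry_if_exception_type(ConnectionError))
    """
    def wrapper(*args, **kwargs):
        wait = min_wait
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # retries exhausted
                time.sleep(wait)
                wait = min(wait * 2, max_wait)  # exponential backoff, capped
    return wrapper
```

Note that a `ValueError` (e.g. a validation error) is not caught, so it propagates immediately without retrying.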
### Decision 3: Tool Architecture - Unified Functions with Fallback

**Pattern applied to all tools:**

- Primary implementation (e.g., `tavily_search`)
- Fallback implementation (e.g., `exa_search`)
- Unified function with automatic fallback (e.g., `search`)

**Example:**

```python
def search(query):
    if default_tool == "tavily":
        try:
            return tavily_search(query)
        except Exception:
            return exa_search(query)  # automatic fallback
    ...
```

**Chosen:** Custom AST visitor with whitelisted operations only

**Why:**

- Python's `eval()` is dangerous (arbitrary code execution)
- `ast.literal_eval()` is too restrictive (doesn't support math operations)
- A custom AST visitor allows precise control over allowed operations

- Whitelist approach: only allow known-safe operations (add, multiply, sin, cos, etc.)

**Rejected alternatives:**

- Using `eval()`: major security vulnerability
- Using `sympify()` from sympy: too complex, allows too much

**Security layers:**

1. AST whitelist (only allow specific node types)
2. Expression length limit (500 chars)
3. Number size limit (prevent huge calculations)

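A minimal sketch of the whitelist approach, assuming trimmed operator and function tables (the real tool reportedly also enforces a timeout and number-size limits, omitted here):

```python
import ast
import math
import operator

# Whitelisted operators and functions -- anything not listed is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg,
}
_FUNCS = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt,
          "factorial": math.factorial}

def safe_eval(expression: str):
    if len(expression) > 500:  # layer 2: expression length limit
        raise ValueError("expression too long")

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*(_eval(a) for a in node.args))
        # layer 1: anything outside the whitelist is rejected
        raise ValueError(f"disallowed operation: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))
```

Because evaluation walks the parsed AST rather than executing source, attribute access, imports, and name lookups never reach the interpreter: `safe_eval("__import__('os')")` raises `ValueError` instead of running code.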
**Why:**

- Simple interface for users (one function for all file types)
- Easy to add new file types (just add a new parser and update the dispatcher)
- Each parser can have format-specific logic

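A minimal sketch of such an extension-based dispatcher, with only the text/CSV branch implemented (the helper names and extension map are illustrative, not the project's actual code):

```python
from pathlib import Path

def parse_text(file_path: str) -> str:
    # Text/CSV: just read with the built-in open()
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# Extension -> parser map; adding a file type means adding one entry.
# PDF/Excel/Word parsers (PyPDF2, openpyxl, python-docx) are omitted here.
_PARSERS = {".txt": parse_text, ".csv": parse_text}

def parse_file(file_path: str) -> str:
    suffix = Path(file_path).suffix.lower()
    if suffix not in _PARSERS:
        raise ValueError(f"unsupported file type: {suffix}")
    return _PARSERS[suffix](file_path)
```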
**Chosen:** Gemini 2.0 Flash as primary, Claude Sonnet 4.5 as fallback

**Why:**

- Gemini 2.0 Flash: free tier (1500 req/day), fast, good quality
- Claude Sonnet 4.5: paid but highest quality, automatic fallback if Gemini fails
- Same pattern as web search (primary + fallback = reliability)

**Image handling:**

- Load file, encode as base64
- Check file size (max 10MB)
- Support common formats (JPG, PNG, GIF, WEBP, BMP)

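The three image-handling steps can be sketched as follows (the function name, limits constant, and error messages are illustrative):

```python
import base64
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # max file size: 10MB
SUPPORTED = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}

def load_image_b64(path: str) -> str:
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:  # format check
        raise ValueError(f"unsupported image format: {p.suffix}")
    data = p.read_bytes()                  # load file
    if len(data) > MAX_BYTES:              # size check
        raise ValueError("image exceeds 10MB limit")
    return base64.b64encode(data).decode("ascii")  # encode as base64
```

The base64 string is then what gets attached to the multimodal LLM request.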
   - Tavily API integration (primary, free tier)
   - Exa API integration (fallback, paid)
   - Automatic fallback if primary fails
   - 10 passing tests in [test/test_web_search.py](../test/test_web_search.py)

2. **File Parser Tool** ([src/tools/file_parser.py](../src/tools/file_parser.py))
   - PDF parsing (PyPDF2)
   - Word parsing (python-docx)
   - Text/CSV parsing (built-in `open`)
   - Generic `parse_file()` dispatcher
   - 19 passing tests in [test/test_file_parser.py](../test/test_file_parser.py)

3. **Calculator Tool** ([src/tools/calculator.py](../src/tools/calculator.py))
   - Safe AST-based expression evaluation
   - Whitelisted operations only (no code execution)
   - Mathematical functions (sin, cos, sqrt, factorial, etc.)
   - Security hardened (timeout, complexity limits)
   - 41 passing tests in [test/test_calculator.py](../test/test_calculator.py)

4. **Vision Tool** ([src/tools/vision.py](../src/tools/vision.py))
   - Multimodal image analysis using LLMs
   - Gemini 2.0 Flash (primary, free)
   - Claude Sonnet 4.5 (fallback, paid)
   - Image loading and base64 encoding
   - 15 passing tests in [test/test_vision.py](../test/test_vision.py)

5. **Tool Registry** ([src/tools/__init__.py](../src/tools/__init__.py))
   - Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`

- Stage 3: will add dynamic tool selection and execution

**Test Coverage:**

- 85 tool tests passing (web_search: 10, file_parser: 19, calculator: 41, vision: 15)
- 6 existing agent tests still passing
- 91 total tests passing
- No regressions from Stage 1

**Deployment:**

- All changes committed and pushed to HuggingFace Spaces
- Build succeeded
- Agent now reports: "Stage 2 complete: 4 tools ready for execution in Stage 3"

**Why it works:**

- Maximizes reliability (2 chances to succeed)
- Transparent to users (single function call)
- Preserves cost optimization (use free tier first, paid only as fallback)

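The primary-plus-fallback pattern can be captured generically; a minimal sketch (the `with_fallback` helper is illustrative, not from the codebase):

```python
def with_fallback(primary, fallback):
    """Wrap two implementations into one function that falls back automatically."""
    def unified(*args, **kwargs):
        try:
            return primary(*args, **kwargs)   # free tier first
        except Exception:
            return fallback(*args, **kwargs)  # paid fallback on any failure
    return unified
```

For example, `search = with_fallback(tavily_search, exa_search)` gives callers a single function while preserving the cost ordering.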
Creating real test fixtures (sample.pdf, sample.xlsx, etc.) was critical for file parser testing:

**What worked:**

- Tests are realistic (they exercise actual file parsing, not just mocks)
- Easy to add new test cases (just add new fixture files)
- Catches edge cases that mocks miss

**Created fixtures:**

- `test/fixtures/sample.txt` - Plain text
- `test/fixtures/sample.csv` - CSV data
- `test/fixtures/sample.xlsx` - Excel spreadsheet
- `test/fixtures/sample.docx` - Word document
- `test/fixtures/test_image.jpg` - Test image (red square)
- `test/fixtures/generate_fixtures.py` - Script to regenerate fixtures

**Recommendation:** For any file-processing tool, create a comprehensive fixture library.

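A sketch of what such a regeneration script might contain, covering only the plain-text fixtures (the binary fixtures would need openpyxl, python-docx, and Pillow; the fixture contents here are invented for illustration):

```python
from pathlib import Path

def generate_fixtures(fixtures_dir: str) -> None:
    """Write the plain-text fixtures into the given directory."""
    root = Path(fixtures_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "sample.txt").write_text("Hello, fixture!\n", encoding="utf-8")
    (root / "sample.csv").write_text("name,value\nalpha,1\nbeta,2\n",
                                     encoding="utf-8")
```

Keeping the generator under version control means fixtures can be rebuilt deterministically instead of being opaque binary blobs.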
Initially planned to create `tests/test_tools_integration.py` for cross-tool testing. **Decision:** Skip for Stage 2.

**Why:**

- Tools work independently (they don't need to interact yet)
- Integration testing makes sense in Stage 3, when tools are orchestrated
- Unit tests provide sufficient coverage for Stage 2

- `src/tools/file_parser.py` - PDF/Excel/Word/Text parsing with retry logic
- `src/tools/calculator.py` - Safe AST-based math evaluation
- `src/tools/vision.py` - Multimodal image analysis (Gemini/Claude)
- `test/test_web_search.py` - 10 tests for web search tool
- `test/test_file_parser.py` - 19 tests for file parser
- `test/test_calculator.py` - 41 tests for calculator (including security)
- `test/test_vision.py` - 15 tests for vision tool
- `test/fixtures/sample.txt` - Test text file
- `test/fixtures/sample.csv` - Test CSV file
- `test/fixtures/sample.xlsx` - Test Excel file
- `test/fixtures/sample.docx` - Test Word document
- `test/fixtures/test_image.jpg` - Test image
- `test/fixtures/generate_fixtures.py` - Fixture generation script

**What was modified:**