# MCP Download Issue - Fix Documentation
## Problem Summary
The `download_paper` tool completed successfully on the remote MCP server, but the downloaded PDF files never appeared in the MCP arXiv client's local `data/mcp_papers/` directory.
### Root Cause
The issue stems from the **client-server architecture** of MCP (Model Context Protocol):
1. **MCP Server** runs as a separate process (possibly remote)
2. **Server downloads PDFs** to its own storage location
3. **Server returns** `{"status": "success"}` without a file path
4. **Client expects files** in its local `data/mcp_papers/` directory
5. **No file transfer mechanism** exists between server and client storage
This is fundamentally a **storage path mismatch** between what the server uses and what the client expects.
## Solution Implemented
### 1. Tool Discovery (Diagnostic)
Added automatic tool discovery when connecting to the MCP server:
- Lists all available MCP tools at session initialization
- Logs tool names, descriptions, and schemas
- Helps diagnose what capabilities the server provides
**Location:** `utils/mcp_arxiv_client.py:88-112` (`_discover_tools` method)
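A minimal sketch of what such a discovery step might look like. It assumes the session object exposes an async `list_tools()` whose result carries a `.tools` list of objects with `name`, `description`, and `inputSchema` attributes; the function name `discover_tools` and the `FakeSession` stand-in are illustrative and are not taken from `utils/mcp_arxiv_client.py`.

```python
import asyncio
import logging
from types import SimpleNamespace

logger = logging.getLogger("utils.mcp_arxiv_client")


async def discover_tools(session):
    """Log every tool the server advertises and return the tool names.

    Assumption: `session.list_tools()` is awaitable and returns an object
    with a `.tools` list (name, description, inputSchema per tool).
    """
    result = await session.list_tools()
    tools = list(result.tools)
    logger.info("MCP server provides %d tools:", len(tools))
    for tool in tools:
        logger.info("  - %s: %s", tool.name, tool.description)
        logger.debug("    schema: %s", getattr(tool, "inputSchema", None))
    return [tool.name for tool in tools]


class FakeSession:
    """Offline stand-in for a real MCP session (hypothetical)."""

    async def list_tools(self):
        return SimpleNamespace(tools=[
            SimpleNamespace(name="search_papers",
                            description="Search arXiv for papers",
                            inputSchema={}),
            SimpleNamespace(name="download_paper",
                            description="Download paper PDF",
                            inputSchema={}),
        ])


# Run the sketch against the fake session; no server is contacted.
tool_names = asyncio.run(discover_tools(FakeSession()))
print(tool_names)
```

If the returned names do not include a file retrieval tool, that is a strong hint that the server has no way to hand the PDF back to the client.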
### 2. Direct Download Fallback
Implemented a fallback mechanism that downloads PDFs directly from arXiv when MCP download fails:
- Detects when MCP download completes but file is not accessible
- Downloads PDF directly from `https://arxiv.org/pdf/{paper_id}.pdf`
- Writes file to client's local storage directory
- Maintains same retry logic and error handling
**Location:** `utils/mcp_arxiv_client.py:114-152` (`_download_from_arxiv_direct` method)
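The direct download could be sketched roughly as follows, using only the standard library. The function names, the injectable `fetch` parameter, the `%PDF` header check, and the temporary `.part` file are assumptions made for illustration; the retry logic mentioned above is omitted, and old-style IDs containing a slash are not handled.

```python
import tempfile
import urllib.request
from pathlib import Path

ARXIV_PDF_URL = "https://arxiv.org/pdf/{paper_id}.pdf"


def fetch_bytes(url, timeout=30.0):
    """Default fetcher: a plain HTTP GET via the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_from_arxiv_direct(paper_id, storage_dir, fetch=fetch_bytes):
    """Fetch a PDF from arXiv and store it in the client's own directory."""
    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)

    data = fetch(ARXIV_PDF_URL.format(paper_id=paper_id))
    if not data.startswith(b"%PDF"):
        # An HTML error page or an empty body should not be saved as a PDF.
        raise ValueError(f"response for {paper_id} is not a PDF")

    target = storage / f"{paper_id}.pdf"
    # Write to a temporary name first so a partial file never appears
    # under the final name.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(data)
    partial.replace(target)
    return target


# Offline demonstration with a fake fetcher (no network access needed).
demo_dir = tempfile.mkdtemp()
demo_path = download_from_arxiv_direct(
    "1706.03762", demo_dir, fetch=lambda url: b"%PDF-1.5 demo"
)
print(demo_path.name)
```

Passing the fetcher as a parameter keeps the function testable without network access; in real use the default `fetch_bytes` would be left in place.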
### 3. Enhanced Error Handling
Updated `download_paper_async` to:
- Try MCP download first (preserves existing functionality)
- Check multiple possible file locations
- Fall back to direct download if MCP fails
- Provide detailed logging at each step
**Location:** `utils/mcp_arxiv_client.py:462-479` (updated error handling)
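The control flow described above might look something like this sketch. `download_paper_with_fallback`, `mcp_download`, and `direct_download` are hypothetical names, and the two callables are injected so the example runs without a server; the real `download_paper_async` may be structured differently. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("utils.mcp_arxiv_client")


async def download_paper_with_fallback(paper_id, storage_dir,
                                       mcp_download, direct_download):
    """Try the MCP tool first; fall back to a direct download if no file appears.

    Assumptions: `mcp_download(paper_id)` is an async callable and
    `direct_download(paper_id, storage_dir)` is a blocking callable that
    returns the path of the saved file.
    """
    target = Path(storage_dir) / f"{paper_id}.pdf"
    if target.exists():
        return target  # already downloaded earlier

    try:
        logger.info("Downloading paper %s via MCP", paper_id)
        await mcp_download(paper_id)
    except Exception as exc:  # any MCP failure simply triggers the fallback
        logger.warning("MCP download raised %r", exc)

    if target.exists():
        return target

    logger.warning("Falling back to direct arXiv download...")
    # Run the blocking download in a worker thread to keep the loop free.
    return await asyncio.to_thread(direct_download, paper_id, storage_dir)


async def fake_mcp_download(paper_id):
    """Mimics the reported bug: success status, but no local file."""
    return {"status": "success"}


def fake_direct_download(paper_id, storage_dir):
    path = Path(storage_dir) / f"{paper_id}.pdf"
    path.write_bytes(b"%PDF-1.5 demo")
    return path


demo_dir = tempfile.mkdtemp()
result = asyncio.run(download_paper_with_fallback(
    "2203.08975v2", demo_dir, fake_mcp_download, fake_direct_download))
print(result.name)
```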
## How It Works Now
### Download Flow
```
1. Check if file already exists locally → Return if found
2. Call MCP server's download_paper tool
3. Check if file appeared in expected locations:
a. Expected path: data/mcp_papers/{paper_id}.pdf
b. MCP-returned path (if provided in response)
c. Any file in storage matching paper_id
4. If file not found → Fall back to direct arXiv download
5. Download PDF directly to client storage
6. Return path to downloaded file
```
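Step 3 of the flow above, the search across candidate locations, could be sketched like this. The helper name `find_downloaded_pdf` and the glob pattern are assumptions for illustration; when several versions match, the sketch simply returns the first one in sorted order.

```python
import tempfile
from pathlib import Path


def find_downloaded_pdf(storage_dir, paper_id, mcp_path=None):
    """Return the first existing candidate file, or None if none exists.

    Candidates are checked in the order listed in the download flow:
    the expected path, the path reported by the MCP server (if any),
    and finally any PDF in storage whose name starts with the paper ID.
    """
    storage = Path(storage_dir)

    expected = storage / f"{paper_id}.pdf"
    if expected.is_file():
        return expected

    if mcp_path and Path(mcp_path).is_file():
        return Path(mcp_path)

    # Catches versioned names such as 1706.03762v7.pdf.
    matches = sorted(storage.glob(f"{paper_id}*.pdf"))
    return matches[0] if matches else None


# Demonstration: the stored file carries a version suffix.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "1706.03762v7.pdf").write_bytes(b"%PDF-1.5 demo")

found = find_downloaded_pdf(demo_dir, "1706.03762")
missing = find_downloaded_pdf(demo_dir, "9999.99999")
print(found.name if found else None, missing)
```

A `None` result is the signal to move on to step 4 and download directly from arXiv.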
### Benefits
- **Zero breaking changes**: Existing MCP functionality preserved
- **Automatic fallback**: Works even with remote MCP servers
- **Better diagnostics**: Tool discovery helps troubleshoot issues
- **More reliable downloads**: Direct fallback retrieves files whenever the MCP path does not deliver them
- **Client-side storage**: Files always accessible to client process
## Using the Fix
### Running the Application
No changes needed! The fix is automatic:
```bash
# Set environment variables (optional - defaults work)
export USE_MCP_ARXIV=true
export MCP_ARXIV_STORAGE_PATH=data/mcp_papers
# Run the application
python app.py
```
The system will:
1. Try MCP download first
2. Automatically fall back to direct download if needed
3. Log which method succeeded
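For orientation, the two environment variables shown above could be read along these lines. The `McpArxivSettings` class, the `load_settings` helper, the accepted truthy strings, and the default values (`false` and `data/mcp_papers`) are all assumptions; this document does not state what the application's actual defaults are.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class McpArxivSettings:
    use_mcp: bool
    storage_path: Path


def load_settings(env=None):
    """Read the MCP settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Assumed default: MCP disabled unless explicitly switched on.
    raw_flag = env.get("USE_MCP_ARXIV", "false")
    use_mcp = raw_flag.strip().lower() in {"1", "true", "yes", "on"}
    # Assumed default: the directory used throughout this document.
    storage = Path(env.get("MCP_ARXIV_STORAGE_PATH", "data/mcp_papers"))
    return McpArxivSettings(use_mcp=use_mcp, storage_path=storage)


settings = load_settings({
    "USE_MCP_ARXIV": "true",
    "MCP_ARXIV_STORAGE_PATH": "data/mcp_papers",
})
print(settings)
```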
### Running Diagnostics
Use the diagnostic script to test your MCP setup:
```bash
python test_mcp_diagnostic.py
```
This will:
- Check environment configuration
- Verify storage directory setup
- List available MCP tools
- Test search functionality
- Test download with detailed logging
- Show file system state before/after
**Expected Output:**
```
================================================================================
MCP arXiv Client Diagnostic Test
================================================================================
[1] Environment Configuration:
USE_MCP_ARXIV: true
MCP_ARXIV_STORAGE_PATH: data/mcp_papers
[2] Storage Directory:
Path: /path/to/data/mcp_papers
Exists: True
Contains 0 PDF files
[3] Initializing MCP Client:
✓ Client initialized successfully
[4] Testing Search Functionality:
✓ Search successful, found 2 papers
First paper: Attention Is All You Need...
Paper ID: 1706.03762
[5] Testing Download Functionality:
Attempting to download: 1706.03762
PDF URL: https://arxiv.org/pdf/1706.03762.pdf
✓ Download successful!
File path: data/mcp_papers/1706.03762v7.pdf
File exists: True
File size: 2,215,520 bytes (2.11 MB)
[6] Storage Directory After Download:
Contains 1 PDF files
Files: ['1706.03762v7.pdf']
[7] Cleaning Up:
✓ MCP session closed
================================================================================
Diagnostic Test Complete
================================================================================
```
## Interpreting Logs
### Successful MCP Download
If the MCP server works correctly, you'll see:
```
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Downloading paper 2203.08975v2 via MCP
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - MCP download_paper response type: <class 'dict'>
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Successfully downloaded paper to data/mcp_papers/2203.08975v2.pdf
```
### Fallback to Direct Download
If MCP fails but direct download succeeds:
```
2025-11-12 01:50:27 - utils.mcp_arxiv_client - WARNING - File not found at expected path
2025-11-12 01:50:27 - utils.mcp_arxiv_client - ERROR - MCP download call completed but file not found
2025-11-12 01:50:27 - utils.mcp_arxiv_client - WARNING - Falling back to direct arXiv download...
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Attempting direct download from arXiv for 2203.08975v2
2025-11-12 01:50:28 - utils.mcp_arxiv_client - INFO - Successfully downloaded 1234567 bytes to data/mcp_papers/2203.08975v2.pdf
```
### Tool Discovery
At session initialization:
```
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO - MCP server provides 3 tools:
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO - - search_papers: Search arXiv for papers
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO - - download_paper: Download paper PDF
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO - - list_papers: List cached papers
```
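The log lines in this section are consistent with a standard `logging` format string like the one below, though the application's actual logging configuration is not shown in this document. The `download_method` helper is a hypothetical convenience that classifies a log excerpt using only the messages quoted above.

```python
import io
import logging

# Assumed format, inferred from the sample log lines in this document.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def download_method(log_text):
    """Report which path produced the file, judging by the log messages."""
    if "Falling back to direct arXiv download" in log_text:
        return "fallback"
    if "Successfully downloaded paper to" in log_text:
        return "mcp"
    return "unknown"


# Capture output in memory so the sketch is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))

logger = logging.getLogger("utils.mcp_arxiv_client")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning("Falling back to direct arXiv download...")
line = stream.getvalue().strip()
print(line)
print(download_method(line))
```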
## Troubleshooting
### Issue: MCP server not found
**Symptom:** Error during initialization: `command not found: arxiv-mcp-server`
**Solution:**
- Ensure MCP server is installed and in PATH
- Check server configuration in your MCP settings
- Try using direct ArxivClient instead: `export USE_MCP_ARXIV=false`
### Issue: Files still not downloading
**Symptom:** Both MCP and direct download fail
**Possible causes:**
1. Network connectivity issues
2. arXiv API rate limiting
3. Invalid paper IDs
4. Storage directory permissions
**Debugging steps:**
```bash
# Check network connectivity
curl https://arxiv.org/pdf/1706.03762.pdf -o test.pdf
# Check storage permissions
ls -la data/mcp_papers/
touch data/mcp_papers/test.txt
# Run diagnostic script
python test_mcp_diagnostic.py
```
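Two of the possible causes above, invalid paper IDs and storage permissions, can also be checked from Python before any download is attempted. This is a sketch with hypothetical helper names; the patterns cover the new-style (`1706.03762`, `2203.08975v2`) and old-style (`hep-th/9901001`) arXiv identifier formats.

```python
import re
import tempfile
from pathlib import Path

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by a version.
NEW_STYLE_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Old-style IDs: archive(.SUBJECT)/YYMMNNN, optionally followed by a version.
OLD_STYLE_ID = re.compile(r"[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?")


def is_valid_arxiv_id(paper_id):
    """Check the ID's shape only; it does not confirm the paper exists."""
    return bool(NEW_STYLE_ID.fullmatch(paper_id)
                or OLD_STYLE_ID.fullmatch(paper_id))


def is_writable_dir(path):
    """Return True if a file can actually be created inside `path`."""
    directory = Path(path)
    if not directory.is_dir():
        return False
    try:
        # The temporary file is removed again when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


for candidate in ["1706.03762", "2203.08975v2", "hep-th/9901001", "not-an-id"]:
    print(candidate, is_valid_arxiv_id(candidate))
print(is_writable_dir(tempfile.gettempdir()))
```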
### Issue: MCP server uses different storage path
**Symptom:** MCP downloads succeed but client can't find files
**Current solution:** Direct download fallback handles this automatically
**Future enhancement:** A file transfer mechanism could be added if the MCP server provides retrieval tools
## Technical Details
### Architecture Decision: Why Fallback Instead of File Transfer?
We chose direct download fallback over implementing a file transfer mechanism because:
1. **Server is third-party**: Cannot modify MCP server to add file retrieval tools
2. **Simpler implementation**: Direct download is straightforward and reliable
3. **Better performance**: Avoids two-step download (server → client transfer)
4. **Same result**: Client gets PDFs either way
5. **Fail-safe**: Works even if MCP server is completely unavailable
### Performance Impact
- **MCP successful**: No performance change (same as before)
- **MCP fails**: Extra ~2-5 seconds for direct download
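- **Network overhead**: Unchanged for the client (one download either way); in the fallback case the server may also have fetched its own copy
- **Storage**: The client reads only from its own directory and does not depend on the server's storage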
### Comparison with Direct ArxivClient
| Feature | MCPArxivClient (with fallback) | Direct ArxivClient |
|---------|-------------------------------|-------------------|
| Search via MCP | ✅ | ❌ |
| Download via MCP | Tries first | ❌ |
| Direct download | Fallback | Primary |
| Remote MCP server | ✅ | N/A |
| File storage | Client-side | Client-side |
| Reliability | High (dual method) | High |
## Future Enhancements
If the MCP server's capabilities expand, possible improvements include:
1. **File retrieval tool**: MCP server adds `get_file(paper_id)` tool
2. **Streaming transfer**: MCP response includes base64-encoded PDF
3. **Shared storage**: Configure MCP server to write to shared filesystem
4. **Batch downloads**: Optimize multi-paper downloads
For now, the fallback solution provides robust, reliable downloads without requiring MCP server changes.
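To make the streaming-transfer idea concrete, here is a purely hypothetical sketch of how a client might handle a response that embeds the PDF as base64. No such response exists today; the field names `paper_id` and `pdf_base64` and the helper `save_embedded_pdf` are invented for illustration.

```python
import base64
import binascii
import tempfile
from pathlib import Path


def save_embedded_pdf(response, storage_dir):
    """Decode a base64 PDF payload and write it to client storage.

    Hypothetical response shape: {"paper_id": str, "pdf_base64": str}.
    """
    try:
        data = base64.b64decode(response["pdf_base64"], validate=True)
    except (KeyError, binascii.Error) as exc:
        raise ValueError("response carries no valid base64 PDF payload") from exc
    if not data.startswith(b"%PDF"):
        raise ValueError("decoded payload is not a PDF")

    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)
    target = storage / f"{response['paper_id']}.pdf"
    target.write_bytes(data)
    return target


# Demonstration with a tiny fake payload.
payload = base64.b64encode(b"%PDF-1.5 demo").decode("ascii")
saved = save_embedded_pdf(
    {"paper_id": "1706.03762", "pdf_base64": payload}, tempfile.mkdtemp()
)
print(saved.name)
```

Base64 inflates the payload by roughly a third, which is one reason a dedicated file retrieval tool or shared storage might be preferable for large PDFs.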
## Files Modified
1. `utils/mcp_arxiv_client.py` - Core client with fallback logic
2. `test_mcp_diagnostic.py` - New diagnostic script
3. `MCP_FIX_DOCUMENTATION.md` - This document
## Testing
Run the test suite to verify the fix:
```bash
# Test MCP client
pytest tests/test_mcp_arxiv_client.py -v
# Run diagnostic
python test_mcp_diagnostic.py
# Full integration test
python app.py
# Then use the Gradio UI to analyze papers with MCP enabled
```
## Summary
The fix ensures **reliable PDF downloads** by combining MCP capabilities with direct arXiv fallback:
- ✅ **Preserves MCP functionality** for servers that work correctly
- ✅ **Automatic fallback** when MCP fails or files aren't accessible
- ✅ **No configuration changes** required
- ✅ **Better diagnostics** via tool discovery
- ✅ **Comprehensive logging** for troubleshooting
- ✅ **Zero breaking changes** to existing code
The system now works reliably with **remote MCP servers**, **local servers**, or **no MCP at all**.