# MCP Download Issue - Fix Documentation

## Problem Summary

The MCP arXiv client's `download_paper` tool completed successfully on the remote MCP server, but the downloaded PDF files never appeared in the client's local `data/mcp_papers/` directory.

### Root Cause

The issue stems from the **client-server architecture** of MCP (Model Context Protocol):

1. **MCP Server** runs as a separate process (possibly remote)
2. **Server downloads PDFs** to its own storage location
3. **Server returns** `{"status": "success"}` without a file path
4. **Client expects files** in its local `data/mcp_papers/` directory
5. **No file transfer mechanism** exists between server and client storage

This is fundamentally a **storage path mismatch** between what the server uses and what the client expects.

## Solution Implemented

### 1. Tool Discovery (Diagnostic)

Added automatic tool discovery when connecting to MCP server:
- Lists all available MCP tools at session initialization
- Logs tool names, descriptions, and schemas
- Helps diagnose what capabilities the server provides

**Location:** `utils/mcp_arxiv_client.py:88-112` (`_discover_tools` method)
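
A minimal sketch of what a discovery step like this might look like. It assumes the session object exposes an async `list_tools()` whose result carries a `.tools` list with `name`, `description` and `inputSchema` attributes, as the MCP Python SDK's client session does. The function name `discover_tools` and the `FakeSession` stand-in are hypothetical and exist only so the example runs offline; this is not the project's `_discover_tools` code.

```python
import asyncio
import logging
from types import SimpleNamespace

logger = logging.getLogger("utils.mcp_arxiv_client")


async def discover_tools(session):
    """Log every tool the server advertises and return the tool names.

    Assumption: `session.list_tools()` is awaitable and returns an object
    with a `.tools` list whose items have `name` and `description`.
    """
    result = await session.list_tools()
    tools = list(result.tools)
    logger.info("MCP server provides %d tools:", len(tools))
    for tool in tools:
        logger.info("  - %s: %s", tool.name, tool.description)
        # Schemas can be long, so they are logged at DEBUG level here.
        logger.debug("    schema: %s", getattr(tool, "inputSchema", None))
    return [tool.name for tool in tools]


class FakeSession:
    """Hypothetical stand-in for a real MCP session (offline demo only)."""

    async def list_tools(self):
        return SimpleNamespace(tools=[
            SimpleNamespace(name="search_papers",
                            description="Search arXiv for papers",
                            inputSchema={}),
            SimpleNamespace(name="download_paper",
                            description="Download paper PDF",
                            inputSchema={}),
        ])


tool_names = asyncio.run(discover_tools(FakeSession()))
```

Returning the names as well as logging them lets the caller check up front whether a tool such as `download_paper` is available.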

### 2. Direct Download Fallback

Implemented a fallback mechanism that downloads PDFs directly from arXiv when MCP download fails:
- Detects when MCP download completes but file is not accessible
- Downloads PDF directly from `https://arxiv.org/pdf/{paper_id}.pdf`
- Writes file to client's local storage directory
- Maintains same retry logic and error handling

**Location:** `utils/mcp_arxiv_client.py:114-152` (`_download_from_arxiv_direct` method)
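
As an illustration only, a direct download along these lines could be sketched with the standard library. The names `download_from_arxiv_direct`, `fetch_bytes` and the injectable `fetch` parameter are assumptions made for this example (the injection lets it run without network access), and the retry logic mentioned above is left out for brevity.

```python
import tempfile
import urllib.request
from pathlib import Path

ARXIV_PDF_URL = "https://arxiv.org/pdf/{paper_id}.pdf"


def fetch_bytes(url, timeout=30.0):
    """Default fetcher: a plain HTTP GET using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_from_arxiv_direct(paper_id, storage_dir, fetch=fetch_bytes):
    """Save the PDF into storage_dir and return its path, or None on failure."""
    target = Path(storage_dir) / f"{paper_id}.pdf"
    try:
        data = fetch(ARXIV_PDF_URL.format(paper_id=paper_id))
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return None
    if not data.startswith(b"%PDF"):
        # An HTML error page or an empty body is not a usable PDF.
        return None
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(data)
    # Rename last so other readers never see a half-written file.
    partial.replace(target)
    return target


# Offline demo: inject a fake fetcher instead of contacting arXiv.
with tempfile.TemporaryDirectory() as tmp:
    saved = download_from_arxiv_direct(
        "1706.03762", tmp, fetch=lambda url: b"%PDF-1.5 demo body")
    saved_name = saved.name
```

Checking for the `%PDF` header is a cheap guard against storing an error page under a `.pdf` name.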

### 3. Enhanced Error Handling

Updated `download_paper_async` to:
- Try MCP download first (preserves existing functionality)
- Check multiple possible file locations
- Fall back to direct download if MCP fails
- Provide detailed logging at each step

**Location:** `utils/mcp_arxiv_client.py:462-479` (updated error handling)

## How It Works Now

### Download Flow

```
1. Check if file already exists locally → Return if found
2. Call MCP server's download_paper tool
3. Check if file appeared in expected locations:
   a. Expected path: data/mcp_papers/{paper_id}.pdf
   b. MCP-returned path (if provided in response)
   c. Any file in storage matching paper_id
4. If file not found → Fall back to direct arXiv download
5. Download PDF directly to client storage
6. Return path to downloaded file
```
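
The flow above can be sketched as two small functions. This is a simplified, synchronous illustration rather than the project's `download_paper_async`: the callables `mcp_download` and `direct_download`, the helper `locate_pdf`, and the `"path"` response key are all hypothetical names chosen for the example.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("utils.mcp_arxiv_client")


def locate_pdf(storage_dir, paper_id, returned_path=None):
    """Return the first existing candidate file for paper_id, or None."""
    storage = Path(storage_dir)
    candidates = [storage / f"{paper_id}.pdf"]           # 3a. expected path
    if returned_path:
        candidates.append(Path(returned_path))           # 3b. server-reported path
    if storage.is_dir():
        # 3c. any file matching the ID, e.g. a versioned "1706.03762v7.pdf"
        candidates.extend(sorted(storage.glob(f"{paper_id}*.pdf")))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def download_paper(paper_id, storage_dir, mcp_download, direct_download):
    """Try the local cache, then MCP, then the direct fallback."""
    existing = locate_pdf(storage_dir, paper_id)         # step 1
    if existing is not None:
        return existing
    returned_path = None
    try:
        response = mcp_download(paper_id)                # step 2
        if isinstance(response, dict):
            returned_path = response.get("path")         # hypothetical key
    except Exception as exc:
        logger.error("MCP download failed: %s", exc)
    found = locate_pdf(storage_dir, paper_id, returned_path)  # step 3
    if found is not None:
        return found
    logger.warning("Falling back to direct arXiv download...")
    return direct_download(paper_id, storage_dir)        # steps 4-6


def fake_direct(paper_id, storage_dir):
    """Offline stand-in for the direct download."""
    path = Path(storage_dir) / f"{paper_id}.pdf"
    path.write_bytes(b"%PDF-1.5 demo")
    return path


# Offline demo: the "server" reports success but writes nothing locally.
with tempfile.TemporaryDirectory() as tmp:
    result = download_paper(
        "1706.03762", tmp, lambda pid: {"status": "success"}, fake_direct)
    result_name = result.name
```

Keeping the location check separate from the orchestration makes it easy to add further candidate paths without touching the fallback logic.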

### Benefits

- **Zero breaking changes**: Existing MCP functionality preserved
- **Automatic fallback**: Works even with remote MCP servers
- **Better diagnostics**: Tool discovery helps troubleshoot issues
- **More reliable downloads**: Direct fallback retrieves the file whenever arXiv is reachable from the client
- **Client-side storage**: Files always accessible to client process

## Using the Fix

### Running the Application

No changes needed! The fix is automatic:

```bash
# Set environment variables (optional - defaults work)
export USE_MCP_ARXIV=true
export MCP_ARXIV_STORAGE_PATH=data/mcp_papers

# Run the application
python app.py
```

The system will:
1. Try MCP download first
2. Automatically fall back to direct download if needed
3. Log which method succeeded
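
For illustration, the two environment variables could be read as follows. The function name `load_settings`, the accepted truthy spellings, and the default values used here are assumptions for this sketch; the application's actual parsing and defaults are not shown in this document.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Return (use_mcp, storage_path) from environment-style settings.

    Assumed defaults: MCP enabled, storage under data/mcp_papers.
    """
    env = os.environ if env is None else env
    raw_flag = env.get("USE_MCP_ARXIV", "true")
    use_mcp = raw_flag.strip().lower() in {"1", "true", "yes"}
    storage_path = Path(env.get("MCP_ARXIV_STORAGE_PATH", "data/mcp_papers"))
    return use_mcp, storage_path


# Demo with an explicit mapping so the result does not depend on the shell.
use_mcp, storage_path = load_settings({"USE_MCP_ARXIV": "false"})
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to test.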

### Running Diagnostics

Use the diagnostic script to test your MCP setup:

```bash
python test_mcp_diagnostic.py
```

This will:
- Check environment configuration
- Verify storage directory setup
- List available MCP tools
- Test search functionality
- Test download with detailed logging
- Show file system state before/after

**Expected Output:**

```
================================================================================
MCP arXiv Client Diagnostic Test
================================================================================

[1] Environment Configuration:
  USE_MCP_ARXIV: true
  MCP_ARXIV_STORAGE_PATH: data/mcp_papers

[2] Storage Directory:
  Path: /path/to/data/mcp_papers
  Exists: True
  Contains 0 PDF files

[3] Initializing MCP Client:
  ✓ Client initialized successfully

[4] Testing Search Functionality:
  ✓ Search successful, found 2 papers
  First paper: Attention Is All You Need...
  Paper ID: 1706.03762

[5] Testing Download Functionality:
  Attempting to download: 1706.03762
  PDF URL: https://arxiv.org/pdf/1706.03762.pdf
  ✓ Download successful!
  File path: data/mcp_papers/1706.03762v7.pdf
  File exists: True
  File size: 2,215,520 bytes (2.11 MB)

[6] Storage Directory After Download:
  Contains 1 PDF files
  Files: ['1706.03762v7.pdf']

[7] Cleaning Up:
  ✓ MCP session closed

================================================================================
Diagnostic Test Complete
================================================================================
```

## Interpreting Logs

### Successful MCP Download

If MCP server works correctly, you'll see:

```
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Downloading paper 2203.08975v2 via MCP
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - MCP download_paper response type: <class 'dict'>
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Successfully downloaded paper to data/mcp_papers/2203.08975v2.pdf
```

### Fallback to Direct Download

If MCP fails but direct download succeeds:

```
2025-11-12 01:50:27 - utils.mcp_arxiv_client - WARNING - File not found at expected path
2025-11-12 01:50:27 - utils.mcp_arxiv_client - ERROR - MCP download call completed but file not found
2025-11-12 01:50:27 - utils.mcp_arxiv_client - WARNING - Falling back to direct arXiv download...
2025-11-12 01:50:27 - utils.mcp_arxiv_client - INFO - Attempting direct download from arXiv for 2203.08975v2
2025-11-12 01:50:28 - utils.mcp_arxiv_client - INFO - Successfully downloaded 1234567 bytes to data/mcp_papers/2203.08975v2.pdf
```

### Tool Discovery

At session initialization:

```
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO - MCP server provides 3 tools:
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO -   - search_papers: Search arXiv for papers
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO -   - download_paper: Download paper PDF
2025-11-12 01:50:26 - utils.mcp_arxiv_client - INFO -   - list_papers: List cached papers
```
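
The log lines in this section share the layout `timestamp - logger - level - message`. A minimal sketch of a standard-library `logging` setup that yields this layout is shown here; the format strings are inferred from the sample lines, and the project's real logging configuration may differ.

```python
import logging

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

# In an application this would normally be applied once at startup:
#   logging.basicConfig(level=logging.INFO, format=LOG_FORMAT, datefmt=DATE_FORMAT)

# Format a single record by hand to show the resulting layout.
formatter = logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT)
record = logging.LogRecord(
    name="utils.mcp_arxiv_client",
    level=logging.INFO,
    pathname="example.py",
    lineno=0,
    msg="Downloading paper %s via MCP",
    args=("2203.08975v2",),
    exc_info=None,
)
line = formatter.format(record)
```

Setting the level to `DEBUG` instead of `INFO` would also surface any lower-level messages the client emits.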

## Troubleshooting

### Issue: MCP server not found

**Symptom:** Error during initialization: `command not found: arxiv-mcp-server`

**Solution:**
- Ensure MCP server is installed and in PATH
- Check server configuration in your MCP settings
- Try using direct ArxivClient instead: `export USE_MCP_ARXIV=false`

### Issue: Files still not downloading

**Symptom:** Both MCP and direct download fail

**Possible causes:**
1. Network connectivity issues
2. arXiv API rate limiting
3. Invalid paper IDs
4. Storage directory permissions
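
Invalid paper IDs can be ruled out before any network call. The sketch below checks the two public arXiv identifier shapes: the current `YYMM.NNNN` or `YYMM.NNNNN` form and the older `archive/YYMMNNN` form, each with an optional version suffix. The helper name `is_valid_arxiv_id` is hypothetical, and the check is purely syntactic: it does not confirm that the paper exists.

```python
import re

# Current scheme: 4-digit year/month, a dot, 4 or 5 digits, optional version.
NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Older scheme: archive name, optional subject class, slash, 7 digits.
OLD_STYLE = re.compile(r"[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?")


def is_valid_arxiv_id(paper_id):
    """Return True if paper_id has the shape of an arXiv identifier."""
    if not isinstance(paper_id, str):
        return False
    return bool(NEW_STYLE.fullmatch(paper_id) or OLD_STYLE.fullmatch(paper_id))


checks = {
    pid: is_valid_arxiv_id(pid)
    for pid in ["1706.03762", "2203.08975v2", "hep-th/9901001", "1706.3762x", ""]
}
```

Old-style IDs contain a slash, so they would need sanitizing before being used as a file name in the storage directory.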

**Debugging steps:**
```bash
# Check network connectivity
curl https://arxiv.org/pdf/1706.03762.pdf -o test.pdf

# Check storage permissions
ls -la data/mcp_papers/
touch data/mcp_papers/test.txt

# Run diagnostic script
python test_mcp_diagnostic.py
```

### Issue: MCP server uses different storage path

**Symptom:** MCP downloads succeed but client can't find files

**Current solution:** Direct download fallback handles this automatically

**Future enhancement:** Could add file transfer mechanism if MCP provides retrieval tools

## Technical Details

### Architecture Decision: Why Fallback Instead of File Transfer?

We chose direct download fallback over implementing a file transfer mechanism because:

1. **Server is third-party**: Cannot modify MCP server to add file retrieval tools
2. **Simpler implementation**: Direct download is straightforward and reliable
3. **Better performance**: Avoids two-step download (server → client transfer)
4. **Same result**: Client gets PDFs either way
5. **Fail-safe**: Works even if MCP server is completely unavailable

### Performance Impact

- **MCP successful**: No performance change (same as before)
- **MCP fails**: Extra ~2-5 seconds for direct download
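- **Network overhead**: One download to the client either way; when the fallback triggers, the server has already made its own download as well
- **Storage**: The client reads only from its own storage directory; any copy the server keeps is not used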

### Comparison with Direct ArxivClient

| Feature | MCPArxivClient (with fallback) | Direct ArxivClient |
|---------|-------------------------------|-------------------|
| Search via MCP | ✓ | ✗ |
| Download via MCP | Tries first | ✗ |
| Direct download | Fallback | Primary |
| Remote MCP server | ✓ | N/A |
| File storage | Client-side | Client-side |
| Reliability | High (dual method) | High |

## Future Enhancements

If MCP server capabilities expand, possible improvements:

1. **File retrieval tool**: MCP server adds `get_file(paper_id)` tool
2. **Streaming transfer**: MCP response includes base64-encoded PDF
3. **Shared storage**: Configure MCP server to write to shared filesystem
4. **Batch downloads**: Optimize multi-paper downloads

For now, the fallback solution provides robust, reliable downloads without requiring MCP server changes.
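
To make the streaming-transfer idea concrete, here is a sketch of how a client might handle a base64-encoded PDF if a server ever returned one. The `pdf_base64` field and the `save_pdf_from_response` helper are entirely hypothetical: no such response format is described for the current server.

```python
import base64
import binascii
import tempfile
from pathlib import Path


def save_pdf_from_response(response, target):
    """Decode a hypothetical `pdf_base64` field and write it to target.

    Returns the written path, or None if the field is missing or invalid.
    """
    encoded = response.get("pdf_base64")  # hypothetical field name
    if not encoded:
        return None
    try:
        data = base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError):
        return None
    if not data.startswith(b"%PDF"):
        return None  # decoded bytes do not look like a PDF
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


# Offline demo: build a fake response and round-trip it through the helper.
fake_response = {
    "status": "success",
    "pdf_base64": base64.b64encode(b"%PDF-1.5 demo").decode("ascii"),
}
with tempfile.TemporaryDirectory() as tmp:
    written = save_pdf_from_response(fake_response, Path(tmp) / "1706.03762.pdf")
    written_bytes = written.read_bytes()
```

Base64 inflates the payload by roughly a third, which is one reason a dedicated file retrieval tool or shared storage may be preferable for large PDFs.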

## Files Modified

1. `utils/mcp_arxiv_client.py` - Core client with fallback logic
2. `test_mcp_diagnostic.py` - New diagnostic script
3. `MCP_FIX_DOCUMENTATION.md` - This document

## Testing

Run the test suite to verify the fix:

```bash
# Test MCP client
pytest tests/test_mcp_arxiv_client.py -v

# Run diagnostic
python test_mcp_diagnostic.py

# Full integration test
python app.py
# Then use the Gradio UI to analyze papers with MCP enabled
```

## Summary

The fix ensures **reliable PDF downloads** by combining MCP capabilities with direct arXiv fallback:

- ✅ **Preserves MCP functionality** for servers that work correctly
- ✅ **Automatic fallback** when MCP fails or files aren't accessible
- ✅ **No configuration changes** required
- ✅ **Better diagnostics** via tool discovery
- ✅ **Comprehensive logging** for troubleshooting
- ✅ **Zero breaking changes** to existing code

The system now works reliably with **remote MCP servers**, **local servers**, or **no MCP at all**.