# Data Validation Fix Documentation
## Problem Summary

### Original Error

```
2025-11-12 14:36:16,506 - agents.retriever - ERROR - Error processing paper 1411.6643v4:
int() argument must be a string, a bytes-like object or a real number, not 'dict'
```
### Root Cause

The MCP arXiv server was returning paper metadata with dict objects instead of the expected primitive types (lists, strings). Specifically:

- `authors` field: dict instead of `List[str]`
- `categories` field: dict instead of `List[str]`
- Other fields: potentially dicts instead of strings

When these malformed `Paper` objects were passed to `PDFProcessor.chunk_text()`, metadata creation failed because dict values were used where lists or strings were expected.
### Impact
- All 4 papers failed PDF processing
- Entire pipeline broken at the Retriever stage
- All downstream agents (Analyzer, Synthesis, Citation) never executed
## Solution: Multi-Layer Data Validation
We implemented a defense-in-depth approach with validation at multiple levels:
### 1. Pydantic Schema Validators (`utils/schemas.py`)

Added `@validator` decorators to the `Paper` class that automatically normalize malformed data:
Features:

- **Authors normalization**: handles dict, list, string, or unknown types
  - Dict format: extracts values from nested structures
  - String format: converts to a single-element list
  - Invalid format: returns an empty list with a warning
- **Categories normalization**: same robust handling as authors
- **String field normalization**: ensures `title`, `abstract`, and `pdf_url` are always strings
  - Dict format: extracts nested values
  - Invalid format: converts to a string representation
Code Example:

```python
@validator('authors', pre=True)
def normalize_authors(cls, v):
    if isinstance(v, list):
        return [str(author) if not isinstance(author, str) else author for author in v]
    elif isinstance(v, dict):
        logger.warning(f"Authors field is dict, extracting values: {v}")
        if 'names' in v:
            return v['names'] if isinstance(v['names'], list) else [str(v['names'])]
        # ... more extraction logic
    elif isinstance(v, str):
        return [v]
    else:
        logger.warning(f"Unexpected authors format: {type(v)}, returning empty list")
        return []
```
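The string-field normalization follows the same shape. A minimal sketch of the helper such a validator could delegate to — the helper name and the probed keys (`value`, `text`, `name`) are assumptions, not the actual `utils/schemas.py` code:

```python
import logging

logger = logging.getLogger(__name__)

def normalize_string_field(v):
    """Coerce a value that should be a plain string (title, abstract, pdf_url)."""
    if isinstance(v, str):
        return v
    if isinstance(v, dict):
        # Probe a few plausible nested keys before falling back to the dict's repr
        logger.warning(f"String field is dict, extracting value: {v}")
        for key in ("value", "text", "name"):
            if key in v:
                return str(v[key])
        return str(v)
    # None becomes an empty string; anything else its string representation
    return "" if v is None else str(v)
```

Registered with `@validator('title', 'abstract', 'pdf_url', pre=True)`, this guarantees those fields are strings by the time the model is constructed.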
### 2. MCP Client Data Parsing (`utils/mcp_arxiv_client.py`)

Enhanced the `_parse_mcp_paper()` method with explicit type checking and normalization:
Features:
- Pre-validation: Checks and normalizes data types before creating Paper object
- Comprehensive logging: Warnings for each malformed field
- Graceful fallbacks: Safe defaults for invalid data
- Detailed error context: Logs raw paper data on parsing failure
Key Improvements:
- Authors: Explicit type checking and dict extraction (lines 209-225)
- Categories: Same robust handling (lines 227-243)
- Title, abstract, pdf_url: String normalization (lines 245-270)
- Published date: Enhanced datetime parsing with fallbacks (lines 195-207)
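The pre-validation step for list-valued fields can be sketched as one helper shared by `authors` and `categories` (a sketch assuming dict payloads nest the real list under keys like `names` or `values`; the actual `_parse_mcp_paper()` handles more shapes than this):

```python
import logging

logger = logging.getLogger(__name__)

def normalize_list_field(v, field_name, known_keys=("names", "values")):
    """Coerce a field that should be List[str], logging every repair."""
    if isinstance(v, list):
        return [str(item) for item in v]
    if isinstance(v, dict):
        logger.warning(f"{field_name} field is dict, extracting values: {v}")
        for key in known_keys:
            if key in v:
                inner = v[key]
                return [str(i) for i in inner] if isinstance(inner, list) else [str(inner)]
        # No known key: fall back to whatever values the dict holds
        return [str(i) for i in v.values()]
    if isinstance(v, str):
        return [v]
    logger.warning(f"Unexpected {field_name} format: {type(v)}, returning empty list")
    return []
```

For example, `normalize_list_field({'names': ['A. Author']}, 'authors')` yields `['A. Author']`.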
### 3. PDF Processor Error Handling (`utils/pdf_processor.py`)

Added defensive metadata creation in `chunk_text()`:
Features:
- Type validation: Checks authors is list before use
- Safe conversion: Falls back to empty list if invalid
- Try-except blocks: Catches and logs chunk creation errors
- Graceful continuation: Processes remaining chunks even if one fails
Code Example:

```python
try:
    # Ensure authors is a list of strings
    authors_metadata = paper.authors
    if not isinstance(authors_metadata, list):
        logger.warning(
            f"Paper {paper.arxiv_id} has invalid authors type: "
            f"{type(authors_metadata)}, converting to list"
        )
        authors_metadata = [str(authors_metadata)] if authors_metadata else []

    metadata = {
        "title": title_metadata,
        "authors": authors_metadata,
        "chunk_index": chunk_index,
        "token_count": len(chunk_tokens)
    }
except Exception as e:
    logger.warning(f"Error creating metadata for chunk {chunk_index}: {str(e)}, using fallback")
    # Use safe fallback metadata
```
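The elided fallback could be as simple as metadata built only from values that cannot fail — a hypothetical sketch, since the real fallback in `utils/pdf_processor.py` is not shown here:

```python
def fallback_metadata(paper, chunk_index, chunk_tokens):
    """Minimal chunk metadata that downstream stages can always consume."""
    title = getattr(paper, "title", "")
    return {
        "title": str(title) if title else "Unknown title",
        "authors": [],  # author info is dropped rather than crashing the chunk
        "chunk_index": chunk_index,
        "token_count": len(chunk_tokens),
    }
```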
### 4. Retriever Agent Validation (`agents/retriever.py`)

Added post-parsing validation to check data quality:
Features:
- Diagnostic checks: Validates all Paper object fields after MCP parsing
- Quality reporting: Logs specific data quality issues
- Filtering: Can skip papers with critical validation failures
- Error tracking: Reports validation failures in state["errors"]
Checks Performed:
- Authors is list type
- Categories is list type
- Title, pdf_url, abstract are string types
- Authors list is not empty
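The diagnostic loop might look like the following sketch. The checks mirror the bullets above, but the function names and the exact `state` shape are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def validate_paper(paper):
    """Return a list of human-readable data-quality issues for one Paper."""
    issues = []
    if not isinstance(paper.authors, list):
        issues.append(f"authors is {type(paper.authors).__name__}, expected list")
    elif not paper.authors:
        issues.append("authors list is empty")
    if not isinstance(paper.categories, list):
        issues.append(f"categories is {type(paper.categories).__name__}, expected list")
    for field in ("title", "pdf_url", "abstract"):
        value = getattr(paper, field, None)
        if not isinstance(value, str):
            issues.append(f"{field} is {type(value).__name__}, expected str")
    return issues

def filter_valid_papers(papers, state):
    """Keep papers that pass checks; record every issue in state['errors']."""
    valid = []
    for paper in papers:
        issues = validate_paper(paper)
        if issues:
            logger.warning(f"Paper {paper.arxiv_id} has data quality issues: {issues}")
            state["errors"].extend(f"{paper.arxiv_id}: {issue}" for issue in issues)
        # Only type problems are critical; an empty authors list is warn-only
        if not any("expected" in issue for issue in issues):
            valid.append(paper)
    return valid
```

Empty author lists are logged but not filtered, so such papers still flow downstream (see Known Limitations).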
## Testing

Created a comprehensive test suite (`test_data_validation.py`) that verifies:
### Test 1: Paper Schema Validators

- ✅ Authors as dict → normalized to list
- ✅ Categories as dict → normalized to list
- ✅ Multiple malformed fields → all normalized correctly
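A condensed, self-contained version of what Test 1 exercises. The `Paper` model below is a minimal stand-in for the real `utils/schemas.py` class, and the probed keys (`names`, `values`) are assumptions:

```python
from typing import List
from pydantic import BaseModel, validator

class Paper(BaseModel):
    arxiv_id: str
    authors: List[str] = []
    categories: List[str] = []

    @validator('authors', 'categories', pre=True)
    def normalize_list(cls, v):
        # Dict payloads from the MCP server: pull out a plausible inner value
        if isinstance(v, dict):
            for key in ('names', 'values'):
                if key in v:
                    v = v[key]
                    break
            else:
                v = list(v.values())
        if isinstance(v, str):
            return [v]
        if isinstance(v, list):
            return [str(item) for item in v]
        return []

def test_dict_fields_are_normalized():
    paper = Paper(
        arxiv_id="1411.6643v4",
        authors={"names": ["A. Author", "B. Author"]},
        categories={"values": ["cs.CV"]},
    )
    assert paper.authors == ["A. Author", "B. Author"]
    assert paper.categories == ["cs.CV"]
```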
### Test 2: PDF Processor Resilience

- ✅ Processes Papers with normalized data successfully
- ✅ Creates chunks with proper metadata structure
- ✅ Chunk metadata contains lists for authors field
Test Results:

```
✅ ALL TESTS PASSED - The data validation fixes are working correctly!
```
## Impact on All Agents
### RetrieverAgent ✅
- Primary beneficiary of all fixes
- Handles malformed MCP responses gracefully
- Validates and filters papers before processing
- Continues with valid papers even if some fail
### AnalyzerAgent ✅
- Protected by upstream validation
- Receives only validated Paper objects
- No changes required
- Works with clean, normalized data
### SynthesisAgent ✅
- No changes needed
- Operates on validated analyses
- Unaffected by MCP data issues
### CitationAgent ✅
- No changes needed
- Gets validated citations from upstream
- Unaffected by MCP data issues
## Files Modified
### `utils/schemas.py` (lines 1-93)
- Added logging import
- Added 6 Pydantic validators for Paper class
- Normalizes authors, categories, title, abstract, pdf_url
### `utils/mcp_arxiv_client.py` (lines 175-290)

- Enhanced `_parse_mcp_paper()` method
- Added explicit type checking for all fields
- Improved logging and error handling
### `utils/pdf_processor.py` (lines 134-175)

- Added metadata validation in `chunk_text()`
- Try-except around metadata creation
- Try-except around chunk creation
- Graceful continuation on errors
### `agents/retriever.py` (lines 89-134)
- Added post-parsing validation loop
- Diagnostic checks for all Paper fields
- Paper filtering capability
- Enhanced error reporting
### `test_data_validation.py` (NEW)
- Comprehensive test suite
- Verifies all validation layers work correctly
## How to Verify the Fix

Run the validation test:

```shell
python test_data_validation.py
```
Expected output:

```
✅ ALL TESTS PASSED - The data validation fixes are working correctly!
```
Run with your actual MCP data:
The next time you run the application with MCP papers that previously failed, you should see:
- Warning logs for malformed fields (e.g., "Authors field is dict, extracting values")
- Successful PDF processing instead of errors
- Papers properly chunked and stored in vector database
- All downstream agents execute successfully
Check logs for validation warnings:

```
# Run your application and look for these log patterns:
# - "Authors field is dict, extracting values"
# - "Categories field is dict, extracting values"
# - "Paper X has data quality issues: ..."
# - "Successfully parsed paper X: Y authors, Z categories"
```
## Why This Works
Defense in Depth: Multiple validation layers ensure data quality
- MCP client normalizes on parse
- Pydantic validators normalize on object creation
- PDF processor validates before use
- Retriever agent performs diagnostic checks
Graceful Degradation: System continues with valid papers even if some fail
- Individual paper failures don't stop the pipeline
- Partial results better than complete failure
- Clear error reporting shows what failed and why
Clear Error Reporting: Users see which papers had issues and why
- Warnings logged for each malformed field
- Diagnostic checks report specific issues
- Errors accumulated in state["errors"]
Future-Proof: Handles variations in MCP server response formats
- Supports multiple dict structures
- Falls back to safe defaults
- Continues to work if MCP format changes
## Known Limitations
Data Extraction from Dicts: We extract values from dicts heuristically
- May not capture all data in complex nested structures
- Assumes common field names ('names', 'authors', 'categories')
- Better than failing completely, but may lose some metadata
Empty Authors Lists: If authors dict has no extractable values
- Falls back to empty list
- Papers still process but lack author metadata
- Logged as warning for manual review
Performance: Additional validation adds small overhead
- Negligible impact for typical workloads
- Logging warnings can increase log size
- Trade-off for robustness is worthwhile
## Recommendations
Monitor Logs: Watch for validation warnings in production
- Indicates ongoing MCP data quality issues
- May need to work with MCP server maintainers
Report to MCP Maintainers: The MCP server should return proper types
- Authors should be `List[str]`, not `Dict`
- Categories should be `List[str]`, not `Dict`
- This fix is a workaround, not a permanent solution
Extend Validation: If more fields show issues, add validators
- Follow the same pattern used for authors/categories
- Add tests to verify behavior
- Document in this file
Consider Alternative MCP Servers: If issues persist
- Try different arXiv MCP implementations
- Or fall back to the direct arXiv API (already supported)
- Set `USE_MCP_ARXIV=false` in `.env`
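The flag check itself can be kept trivial. A sketch of how the fallback could be gated — the helper name is hypothetical, and loading `.env` into the environment is assumed to happen elsewhere (e.g. via python-dotenv):

```python
import os

def use_mcp_arxiv() -> bool:
    """Interpret USE_MCP_ARXIV from the environment, defaulting to enabled."""
    return os.getenv("USE_MCP_ARXIV", "true").strip().lower() in ("1", "true", "yes")
```

When this returns `False`, the retriever would construct the direct arXiv API client instead of the MCP client.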
## Rollback Instructions
If this fix causes issues, you can roll back by:

1. Revert the files:

   ```shell
   git checkout HEAD~1 utils/schemas.py utils/mcp_arxiv_client.py utils/pdf_processor.py agents/retriever.py
   ```

2. Remove the test file:

   ```shell
   rm test_data_validation.py
   ```

3. Switch to the direct arXiv API:

   ```
   # In .env file:
   USE_MCP_ARXIV=false
   ```
## Version History
- v1.0 (2025-11-12): Initial implementation
  - Added Pydantic validators
  - Enhanced MCP client parsing
  - Improved PDF processor error handling
  - Added Retriever validation
  - Created comprehensive tests
  - All tests passing ✅