# Data Validation Fix Documentation
## Problem Summary
### Original Error
```
2025-11-12 14:36:16,506 - agents.retriever - ERROR - Error processing paper 1411.6643v4:
int() argument must be a string, a bytes-like object or a real number, not 'dict'
```
### Root Cause
The MCP arXiv server was returning paper metadata containing **dict objects** where lists or strings were expected. Specifically:
- `authors` field: dict instead of `List[str]`
- `categories` field: dict instead of `List[str]`
- Other fields: potentially dicts instead of strings
When these malformed Paper objects were passed to `PDFProcessor.chunk_text()`, metadata creation failed because dict values appeared where lists or strings were expected.
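The failure mode is easy to reproduce in isolation. A minimal sketch (the dict shape below is an illustrative assumption, not the exact MCP payload):

```python
# Passing a dict where a primitive is expected makes int() (and similar
# conversions inside chunk_text's metadata handling) raise TypeError.
malformed_field = {"value": "3"}  # hypothetical shape of a malformed MCP field

try:
    int(malformed_field)
except TypeError as e:
    print(e)  # int() argument must be a string, a bytes-like object or a real number, not 'dict'
```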
### Impact
- **All 4 papers** failed PDF processing
- **Entire pipeline** broken at the Retriever stage
- **All downstream agents** (Analyzer, Synthesis, Citation) never executed
## Solution: Multi-Layer Data Validation
We implemented a **defense-in-depth** approach with validation at multiple levels:
### 1. Pydantic Schema Validators (`utils/schemas.py`)
Added `@validator` decorators to the `Paper` class that automatically normalize malformed data:
**Features:**
- **Authors normalization**: Handles dict, list, string, or unknown types
  - Dict format: Extracts values from nested structures
  - String format: Converts to a single-element list
  - Invalid format: Returns empty list with a warning
- **Categories normalization**: Same robust handling as authors
- **String field normalization**: Ensures title, abstract, pdf_url are always strings
  - Dict format: Extracts nested values
  - Invalid format: Converts to string representation
**Code Example:**
```python
@validator('authors', pre=True)
def normalize_authors(cls, v):
    if isinstance(v, list):
        return [str(author) if not isinstance(author, str) else author for author in v]
    elif isinstance(v, dict):
        logger.warning(f"Authors field is dict, extracting values: {v}")
        if 'names' in v:
            return v['names'] if isinstance(v['names'], list) else [str(v['names'])]
        # ... more extraction logic
    elif isinstance(v, str):
        return [v]
    else:
        logger.warning(f"Unexpected authors format: {type(v)}, returning empty list")
        return []
```
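The string-field validators follow the same pattern. A hedged sketch of what the title/abstract/pdf_url normalization might look like (the nested keys `'value'` and `'text'` are illustrative assumptions, not the exact keys the real validator inspects):

```python
import logging

logger = logging.getLogger(__name__)

# Sketch of a string-field normalizer in the same style as normalize_authors.
def normalize_string_field(v):
    if isinstance(v, str):
        return v
    if isinstance(v, dict):
        logger.warning(f"String field is dict, extracting value: {v}")
        for key in ('value', 'text'):  # assumed key names
            if isinstance(v.get(key), str):
                return v[key]
    # Invalid format: fall back to the string representation
    return str(v)

print(normalize_string_field({"value": "Attention Is All You Need"}))  # prints the extracted string
```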
### 2. MCP Client Data Parsing (`utils/mcp_arxiv_client.py`)
Enhanced the `_parse_mcp_paper()` method with explicit type checking and normalization:
**Features:**
- **Pre-validation**: Checks and normalizes data types before creating the Paper object
- **Comprehensive logging**: Warnings for each malformed field
- **Graceful fallbacks**: Safe defaults for invalid data
- **Detailed error context**: Logs raw paper data on parsing failure
**Key Improvements:**
- Authors: Explicit type checking and dict extraction (lines 209-225)
- Categories: Same robust handling (lines 227-243)
- Title, abstract, pdf_url: String normalization (lines 245-270)
- Published date: Enhanced datetime parsing with fallbacks (lines 195-207)
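The date-parsing fallback can be sketched as follows. This is a minimal illustration, assuming a few common arXiv timestamp formats and a safe default; the client's actual format list and default value may differ:

```python
from datetime import datetime, timezone

# Try progressively simpler formats, then fall back to a safe default
# rather than raising and aborting the paper.
def parse_published_date(raw):
    if isinstance(raw, datetime):
        return raw
    if isinstance(raw, str):
        for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d"):
            try:
                return datetime.strptime(raw, fmt)
            except ValueError:
                continue
    # Graceful fallback: a fixed default instead of a hard failure
    return datetime(1970, 1, 1, tzinfo=timezone.utc)

print(parse_published_date("2014-11-24").date())  # 2014-11-24
```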
### 3. PDF Processor Error Handling (`utils/pdf_processor.py`)
Added defensive metadata creation in `chunk_text()`:
**Features:**
- **Type validation**: Checks that authors is a list before use
- **Safe conversion**: Falls back to an empty list if invalid
- **Try-except blocks**: Catches and logs chunk creation errors
- **Graceful continuation**: Processes remaining chunks even if one fails
**Code Example:**
```python
try:
    # Ensure title is a string (mirrors the authors handling below)
    title_metadata = paper.title if isinstance(paper.title, str) else str(paper.title)
    # Ensure authors is a list of strings
    authors_metadata = paper.authors
    if not isinstance(authors_metadata, list):
        logger.warning(f"Paper {paper.arxiv_id} has invalid authors type: {type(authors_metadata)}, converting to list")
        authors_metadata = [str(authors_metadata)] if authors_metadata else []
    metadata = {
        "title": title_metadata,
        "authors": authors_metadata,
        "chunk_index": chunk_index,
        "token_count": len(chunk_tokens)
    }
except Exception as e:
    logger.warning(f"Error creating metadata for chunk {chunk_index}: {str(e)}, using fallback")
    # Use safe fallback metadata
```
### 4. Retriever Agent Validation (`agents/retriever.py`)
Added post-parsing validation to check data quality:
**Features:**
- **Diagnostic checks**: Validates all Paper object fields after MCP parsing
- **Quality reporting**: Logs specific data quality issues
- **Filtering**: Can skip papers with critical validation failures
- **Error tracking**: Reports validation failures in `state["errors"]`
**Checks Performed:**
- Authors is a list type
- Categories is a list type
- Title, pdf_url, abstract are string types
- Authors list is not empty
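The checks above can be sketched as a small diagnostic function. The Paper attribute names match this document, but the exact wording of the real checks and how the retriever filters papers are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

# Collects data-quality issues for one parsed Paper; an empty list means
# the paper passed all diagnostic checks.
def validate_paper(paper):
    issues = []
    if not isinstance(paper.authors, list):
        issues.append(f"authors is {type(paper.authors).__name__}, expected list")
    elif not paper.authors:
        issues.append("authors list is empty")
    if not isinstance(paper.categories, list):
        issues.append(f"categories is {type(paper.categories).__name__}, expected list")
    for field in ("title", "pdf_url", "abstract"):
        if not isinstance(getattr(paper, field), str):
            issues.append(f"{field} is not a string")
    if issues:
        logger.warning(f"Paper {paper.arxiv_id} has data quality issues: {issues}")
    return issues
```

The retriever can then skip papers whose issue list includes critical failures while recording all issues in `state["errors"]`.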
## Testing
Created a comprehensive test suite (`test_data_validation.py`) that verifies:
### Test 1: Paper Schema Validators
- ✅ Authors as dict → normalized to list
- ✅ Categories as dict → normalized to list
- ✅ Multiple malformed fields → all normalized correctly
### Test 2: PDF Processor Resilience
- ✅ Processes Papers with normalized data successfully
- ✅ Creates chunks with proper metadata structure
- ✅ Chunk metadata contains lists for the authors field
**Test Results:**
```
✅ ALL TESTS PASSED - The data validation fixes are working correctly!
```
## Impact on All Agents
### RetrieverAgent ✅
- **Primary beneficiary** of all fixes
- Handles malformed MCP responses gracefully
- Validates and filters papers before processing
- Continues with valid papers even if some fail
### AnalyzerAgent ✅
- **Protected by upstream validation**
- Receives only validated Paper objects
- No changes required
- Works with clean, normalized data
### SynthesisAgent ✅
- **No changes needed**
- Operates on validated analyses
- Unaffected by MCP data issues
### CitationAgent ✅
- **No changes needed**
- Gets validated citations from upstream
- Unaffected by MCP data issues
## Files Modified
1. **utils/schemas.py** (lines 1-93)
   - Added logging import
   - Added 6 Pydantic validators for the Paper class
   - Normalizes authors, categories, title, abstract, pdf_url
2. **utils/mcp_arxiv_client.py** (lines 175-290)
   - Enhanced `_parse_mcp_paper()` method
   - Added explicit type checking for all fields
   - Improved logging and error handling
3. **utils/pdf_processor.py** (lines 134-175)
   - Added metadata validation in `chunk_text()`
   - Try-except around metadata creation
   - Try-except around chunk creation
   - Graceful continuation on errors
4. **agents/retriever.py** (lines 89-134)
   - Added post-parsing validation loop
   - Diagnostic checks for all Paper fields
   - Paper filtering capability
   - Enhanced error reporting
5. **test_data_validation.py** (NEW)
   - Comprehensive test suite
   - Verifies all validation layers work correctly
## How to Verify the Fix
### Run the validation test:
```bash
python test_data_validation.py
```
Expected output:
```
✅ ALL TESTS PASSED - The data validation fixes are working correctly!
```
### Run with your actual MCP data:
The next time you run the application with MCP papers that previously failed, you should see:
- Warning logs for malformed fields (e.g., "Authors field is dict, extracting values")
- Successful PDF processing instead of errors
- Papers properly chunked and stored in the vector database
- All downstream agents executing successfully
### Check logs for validation warnings:
```bash
# Run your application and look for these log patterns:
# - "Authors field is dict, extracting values"
# - "Categories field is dict, extracting values"
# - "Paper X has data quality issues: ..."
# - "Successfully parsed paper X: Y authors, Z categories"
```
## Why This Works
1. **Defense in Depth**: Multiple validation layers ensure data quality
   - MCP client normalizes on parse
   - Pydantic validators normalize on object creation
   - PDF processor validates before use
   - Retriever agent performs diagnostic checks
2. **Graceful Degradation**: The system continues with valid papers even if some fail
   - Individual paper failures don't stop the pipeline
   - Partial results are better than complete failure
3. **Clear Error Reporting**: Users see which papers had issues and why
   - Warnings logged for each malformed field
   - Diagnostic checks report specific issues
   - Errors accumulated in `state["errors"]`
4. **Future-Proof**: Handles variations in MCP server response formats
   - Supports multiple dict structures
   - Falls back to safe defaults
   - Continues to work if the MCP format changes
## Known Limitations
1. **Data Extraction from Dicts**: We extract values from dicts heuristically
   - May not capture all data in complex nested structures
   - Assumes common field names ('names', 'authors', 'categories')
   - Better than failing completely, but may lose some metadata
2. **Empty Authors Lists**: If an authors dict has no extractable values
   - Falls back to an empty list
   - Papers still process but lack author metadata
   - Logged as a warning for manual review
3. **Performance**: The additional validation adds a small overhead
   - Negligible impact for typical workloads
   - Logging warnings can increase log size
   - The trade-off for robustness is worthwhile
## Recommendations
1. **Monitor Logs**: Watch for validation warnings in production
   - They indicate ongoing MCP data quality issues
   - May need to work with MCP server maintainers
2. **Report to MCP Maintainers**: The MCP server should return proper types
   - Authors should be `List[str]`, not `Dict`
   - Categories should be `List[str]`, not `Dict`
   - This fix is a workaround, not a permanent solution
3. **Extend Validation**: If more fields show issues, add validators
   - Follow the same pattern used for authors/categories
   - Add tests to verify behavior
   - Document in this file
4. **Consider Alternative MCP Servers**: If issues persist
   - Try different arXiv MCP implementations
   - Or fall back to the direct arXiv API (already supported)
   - Set `USE_MCP_ARXIV=false` in `.env`
## Rollback Instructions
If this fix causes issues, you can roll back by:
1. **Reverting the files**:
```bash
git checkout HEAD~1 utils/schemas.py utils/mcp_arxiv_client.py utils/pdf_processor.py agents/retriever.py
```
2. **Removing the test file**:
```bash
rm test_data_validation.py
```
3. **Switching to the direct arXiv API**:
```bash
# In .env file:
USE_MCP_ARXIV=false
```
## Version History
- **v1.0** (2025-11-12): Initial implementation
  - Added Pydantic validators
  - Enhanced MCP client parsing
  - Improved PDF processor error handling
  - Added Retriever validation
  - Created comprehensive tests
  - All tests passing ✅