Data Validation Fix Documentation

Problem Summary

Original Error

2025-11-12 14:36:16,506 - agents.retriever - ERROR - Error processing paper 1411.6643v4:
int() argument must be a string, a bytes-like object or a real number, not 'dict'

Root Cause

The MCP arXiv server was returning paper metadata with dict objects where lists of strings or plain strings were expected. Specifically:

  • authors field: Dict instead of List[str]
  • categories field: Dict instead of List[str]
  • Other fields: Potentially dicts instead of strings

When these malformed Paper objects were passed to PDFProcessor.chunk_text(), the metadata creation failed because it tried to use dict values where lists or strings were expected.

Impact

  • All 4 papers failed PDF processing
  • Entire pipeline broken at the Retriever stage
  • All downstream agents (Analyzer, Synthesis, Citation) never executed

Solution: Multi-Layer Data Validation

We implemented a defense-in-depth approach with validation at multiple levels:

1. Pydantic Schema Validators (utils/schemas.py)

Added @validator decorators to the Paper class that automatically normalize malformed data:

Features:

  • Authors normalization: Handles dict, list, string, or unknown types

    • Dict format: Extracts values from nested structures
    • String format: Converts to single-element list
    • Invalid format: Returns empty list with warning
  • Categories normalization: Same robust handling as authors

  • String field normalization: Ensures title, abstract, pdf_url are always strings

    • Dict format: Extracts nested values
    • Invalid format: Converts to string representation

Code Example:

@validator('authors', pre=True)
def normalize_authors(cls, v):
    if isinstance(v, list):
        return [author if isinstance(author, str) else str(author) for author in v]
    elif isinstance(v, dict):
        logger.warning(f"Authors field is dict, extracting values: {v}")
        if 'names' in v:
            return v['names'] if isinstance(v['names'], list) else [str(v['names'])]
        # ... more extraction logic
        return [str(value) for value in v.values()]  # generic fallback so a dict never falls through
    elif isinstance(v, str):
        return [v]
    else:
        logger.warning(f"Unexpected authors format: {type(v)}, returning empty list")
        return []
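The string-field validators follow the same shape. As a stand-alone sketch of that logic (the wrapper keys 'value' and 'text' tried here are assumptions, not necessarily the exact keys the real validators use):

```python
def normalize_string_field(v, field_name="title"):
    """Normalize a field that should be a plain string.

    Mirrors the Pydantic string validators in utils/schemas.py as a
    plain function; the dict keys tried here are assumptions.
    """
    if isinstance(v, str):
        return v
    if isinstance(v, dict):
        # Try common wrapper keys before falling back to the dict's repr
        for key in ("value", "text", field_name):
            if key in v and isinstance(v[key], str):
                return v[key]
        return str(v)
    # None becomes an empty string; anything else is stringified
    return str(v) if v is not None else ""
```

The same function can back the title, abstract, and pdf_url validators, varying only `field_name`.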

2. MCP Client Data Parsing (utils/mcp_arxiv_client.py)

Enhanced _parse_mcp_paper() method with explicit type checking and normalization:

Features:

  • Pre-validation: Checks and normalizes data types before creating Paper object
  • Comprehensive logging: Warnings for each malformed field
  • Graceful fallbacks: Safe defaults for invalid data
  • Detailed error context: Logs raw paper data on parsing failure

Key Improvements:

  • Authors: Explicit type checking and dict extraction (lines 209-225)
  • Categories: Same robust handling (lines 227-243)
  • Title, abstract, pdf_url: String normalization (lines 245-270)
  • Published date: Enhanced datetime parsing with fallbacks (lines 195-207)
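The list-field handling above can be sketched as a stand-alone helper (a simplification of the real _parse_mcp_paper() logic; the dict keys tried are assumptions):

```python
import logging

logger = logging.getLogger("utils.mcp_arxiv_client")

def normalize_list_field(raw, field_name):
    """Coerce an MCP response field into a list of strings.

    Sketch of the pre-validation done before constructing a Paper;
    the dict keys tried ('names', then the field name) are assumptions.
    """
    if isinstance(raw, list):
        return [str(item) for item in raw]
    if isinstance(raw, dict):
        logger.warning("%s field is dict, extracting values: %r", field_name, raw)
        for key in ("names", field_name):
            value = raw.get(key)
            if isinstance(value, list):
                return [str(item) for item in value]
        return [str(item) for item in raw.values()]
    if isinstance(raw, str):
        return [raw]
    # None or any unexpected type falls back to an empty list
    return []
```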

3. PDF Processor Error Handling (utils/pdf_processor.py)

Added defensive metadata creation in chunk_text():

Features:

  • Type validation: Checks authors is list before use
  • Safe conversion: Falls back to empty list if invalid
  • Try-except blocks: Catches and logs chunk creation errors
  • Graceful continuation: Processes remaining chunks even if one fails

Code Example:

try:
    # Ensure authors is a list of strings
    authors_metadata = paper.authors
    if not isinstance(authors_metadata, list):
        logger.warning(f"Paper {paper.arxiv_id} has invalid authors type: {type(authors_metadata)}, converting to list")
        authors_metadata = [str(authors_metadata)] if authors_metadata else []

    # Ensure title is a string
    title_metadata = paper.title if isinstance(paper.title, str) else str(paper.title)

    metadata = {
        "title": title_metadata,
        "authors": authors_metadata,
        "chunk_index": chunk_index,
        "token_count": len(chunk_tokens)
    }
except Exception as e:
    logger.warning(f"Error creating metadata for chunk {chunk_index}: {str(e)}, using fallback")
    # Use safe fallback metadata so the chunk can still be stored
    metadata = {"title": "", "authors": [], "chunk_index": chunk_index, "token_count": 0}

4. Retriever Agent Validation (agents/retriever.py)

Added post-parsing validation to check data quality:

Features:

  • Diagnostic checks: Validates all Paper object fields after MCP parsing
  • Quality reporting: Logs specific data quality issues
  • Filtering: Can skip papers with critical validation failures
  • Error tracking: Reports validation failures in state["errors"]

Checks Performed:

  • Authors is list type
  • Categories is list type
  • Title, pdf_url, abstract are string types
  • Authors list is not empty
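The checks above can be sketched as a single diagnostic function (a simplified stand-alone version; the real code inspects Paper objects, and plain dicts are used here only for illustration):

```python
def diagnose_paper(paper):
    """Return a list of data quality issues for a parsed paper.

    Sketch of the post-parsing checks in agents/retriever.py;
    an empty list means the paper passed all checks.
    """
    issues = []
    authors = paper.get("authors")
    if not isinstance(authors, list):
        issues.append("authors is not a list")
    elif not authors:
        issues.append("authors list is empty")
    if not isinstance(paper.get("categories"), list):
        issues.append("categories is not a list")
    for field in ("title", "pdf_url", "abstract"):
        if not isinstance(paper.get(field), str):
            issues.append(f"{field} is not a string")
    return issues
```

Papers whose issue list is non-empty can be logged (and optionally filtered out) while the rest of the batch proceeds.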

Testing

Created comprehensive test suite (test_data_validation.py) that verifies:

Test 1: Paper Schema Validators

  • ✓ Authors as dict → normalized to list
  • ✓ Categories as dict → normalized to list
  • ✓ Multiple malformed fields → all normalized correctly

Test 2: PDF Processor Resilience

  • ✓ Processes Papers with normalized data successfully
  • ✓ Creates chunks with proper metadata structure
  • ✓ Chunk metadata contains lists for authors field

Test Results:

✓ ALL TESTS PASSED - The data validation fixes are working correctly!

Impact on All Agents

RetrieverAgent ✓

  • Primary beneficiary of all fixes
  • Handles malformed MCP responses gracefully
  • Validates and filters papers before processing
  • Continues with valid papers even if some fail

AnalyzerAgent ✓

  • Protected by upstream validation
  • Receives only validated Paper objects
  • No changes required
  • Works with clean, normalized data

SynthesisAgent ✓

  • No changes needed
  • Operates on validated analyses
  • Unaffected by MCP data issues

CitationAgent ✓

  • No changes needed
  • Gets validated citations from upstream
  • Unaffected by MCP data issues

Files Modified

  1. utils/schemas.py (lines 1-93)

    • Added logging import
    • Added 6 Pydantic validators for Paper class
    • Normalizes authors, categories, title, abstract, pdf_url
  2. utils/mcp_arxiv_client.py (lines 175-290)

    • Enhanced _parse_mcp_paper() method
    • Added explicit type checking for all fields
    • Improved logging and error handling
  3. utils/pdf_processor.py (lines 134-175)

    • Added metadata validation in chunk_text()
    • Try-except around metadata creation
    • Try-except around chunk creation
    • Graceful continuation on errors
  4. agents/retriever.py (lines 89-134)

    • Added post-parsing validation loop
    • Diagnostic checks for all Paper fields
    • Paper filtering capability
    • Enhanced error reporting
  5. test_data_validation.py (NEW)

    • Comprehensive test suite
    • Verifies all validation layers work correctly

How to Verify the Fix

Run the validation test:

python test_data_validation.py

Expected output:

✓ ALL TESTS PASSED - The data validation fixes are working correctly!

Run with your actual MCP data:

The next time you run the application with MCP papers that previously failed, you should see:

  • Warning logs for malformed fields (e.g., "Authors field is dict, extracting values")
  • Successful PDF processing instead of errors
  • Papers properly chunked and stored in vector database
  • All downstream agents execute successfully

Check logs for validation warnings:

# Run your application and look for these log patterns:
# - "Authors field is dict, extracting values"
# - "Categories field is dict, extracting values"
# - "Paper X has data quality issues: ..."
# - "Successfully parsed paper X: Y authors, Z categories"
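A small script can count these patterns in a captured log (the patterns match the warnings listed above; how you capture the log text is left to your setup):

```python
import re

# Log patterns emitted by the validation layers (from the list above)
PATTERNS = [
    r"Authors field is dict, extracting values",
    r"Categories field is dict, extracting values",
    r"has data quality issues",
    r"Successfully parsed paper",
]

def count_validation_messages(log_text):
    """Return the total number of validation-related log lines found."""
    return sum(len(re.findall(pattern, log_text)) for pattern in PATTERNS)
```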

Why This Works

  1. Defense in Depth: Multiple validation layers ensure data quality

    • MCP client normalizes on parse
    • Pydantic validators normalize on object creation
    • PDF processor validates before use
    • Retriever agent performs diagnostic checks
  2. Graceful Degradation: System continues with valid papers even if some fail

    • Individual paper failures don't stop the pipeline
    • Partial results better than complete failure
    • Clear error reporting shows what failed and why
  3. Clear Error Reporting: Users see which papers had issues and why

    • Warnings logged for each malformed field
    • Diagnostic checks report specific issues
    • Errors accumulated in state["errors"]
  4. Future-Proof: Handles variations in MCP server response formats

    • Supports multiple dict structures
    • Falls back to safe defaults
    • Continues to work if MCP format changes

Known Limitations

  1. Data Extraction from Dicts: We extract values from dicts heuristically

    • May not capture all data in complex nested structures
    • Assumes common field names ('names', 'authors', 'categories')
    • Better than failing completely, but may lose some metadata
  2. Empty Authors Lists: If authors dict has no extractable values

    • Falls back to empty list
    • Papers still process but lack author metadata
    • Logged as warning for manual review
  3. Performance: Additional validation adds small overhead

    • Negligible impact for typical workloads
    • Logging warnings can increase log size
    • Trade-off for robustness is worthwhile

Recommendations

  1. Monitor Logs: Watch for validation warnings in production

    • Indicates ongoing MCP data quality issues
    • May need to work with MCP server maintainers
  2. Report to MCP Maintainers: The MCP server should return proper types

    • Authors should be List[str], not Dict
    • Categories should be List[str], not Dict
    • This fix is a workaround, not a permanent solution
  3. Extend Validation: If more fields show issues, add validators

    • Follow the same pattern used for authors/categories
    • Add tests to verify behavior
    • Document in this file
  4. Consider Alternative MCP Servers: If issues persist

    • Try different arXiv MCP implementations
    • Or fallback to direct arXiv API (already supported)
    • Set USE_MCP_ARXIV=false in .env

Rollback Instructions

If this fix causes issues, you can roll back as follows:

  1. Revert the files:

    git checkout HEAD~1 utils/schemas.py utils/mcp_arxiv_client.py utils/pdf_processor.py agents/retriever.py
    
  2. Remove the test file:

    rm test_data_validation.py
    
  3. Switch to direct arXiv API:

    # In .env file:
    USE_MCP_ARXIV=false
    

Version History

  • v1.0 (2025-11-12): Initial implementation
    • Added Pydantic validators
    • Enhanced MCP client parsing
    • Improved PDF processor error handling
    • Added Retriever validation
    • Created comprehensive tests
    • All tests passing ✓