# SPARKNET Document Intelligence

A vision-first agentic document understanding platform that goes beyond OCR, supports complex layouts, and produces LLM-ready, visually grounded outputs suitable for RAG and field extraction at scale.

## Overview

The Document Intelligence subsystem provides:

- **Vision-First Understanding**: Treats documents as visual objects, not just text
- **Semantic Chunking**: Classifies regions by type (text, table, figure, chart, form, etc.)
- **Visual Grounding**: Every extraction includes evidence (page, bbox, snippet, confidence)
- **Zero-Shot Capability**: Works across diverse document formats without training
- **Schema-Driven Extraction**: Define fields using JSON Schema or Pydantic models
- **Abstention Policy**: Never guesses; abstains when confidence is low
- **Local-First**: All processing happens locally for privacy

## Quick Start

### Basic Parsing

```python
from src.document_intelligence import DocumentParser, ParserConfig

# Configure parser
config = ParserConfig(
    render_dpi=200,
    max_pages=10,
    include_markdown=True,
)

parser = DocumentParser(config=config)
result = parser.parse("document.pdf")

print(f"Parsed {len(result.chunks)} chunks from {result.num_pages} pages")

# Access chunks
for chunk in result.chunks:
    print(f"[Page {chunk.page}] {chunk.chunk_type.value}: {chunk.text[:100]}...")
```

### Field Extraction

```python
from src.document_intelligence import (
    FieldExtractor,
    ExtractionSchema,
    create_invoice_schema,
)

# Use preset schema
schema = create_invoice_schema()

# Or create custom schema
schema = ExtractionSchema(name="CustomSchema")
schema.add_string_field("company_name", "Name of the company", required=True)
schema.add_date_field("document_date", "Date on document")
schema.add_currency_field("total_amount", "Total amount")

# Extract fields
extractor = FieldExtractor()
extraction = extractor.extract(parse_result, schema)  # parse_result from DocumentParser.parse()

print("Extracted Data:")
for key, value in extraction.data.items():
    if key in extraction.abstained_fields:
        print(f"  {key}: [ABSTAINED]")
    else:
        print(f"  {key}: {value}")

print(f"Confidence: {extraction.overall_confidence:.2f}")
```

### Visual Grounding

```python
from src.document_intelligence import (
    load_document,
    RenderOptions,
)
from src.document_intelligence.grounding import (
    crop_region,
    create_annotated_image,
    EvidenceBuilder,
)

# Load and render page
loader, renderer = load_document("document.pdf")
page_image = renderer.render_page(1, RenderOptions(dpi=200))

# Create annotated visualization
bboxes = [chunk.bbox for chunk in result.chunks if chunk.page == 1]
labels = [chunk.chunk_type.value for chunk in result.chunks if chunk.page == 1]
annotated = create_annotated_image(page_image, bboxes, labels)

# Crop a specific region (here, the first chunk on the page)
crop = crop_region(page_image, bboxes[0], padding_percent=0.02)
```

### Question Answering

```python
from src.document_intelligence.tools import get_tool

qa_tool = get_tool("answer_question")
result = qa_tool.execute(
    parse_result=parse_result,
    question="What is the total amount due?",
)

if result.success:
    print(f"Answer: {result.data['answer']}")
    print(f"Confidence: {result.data['confidence']:.2f}")
    for ev in result.evidence:
        print(f"  Evidence: Page {ev['page']}, {ev['snippet'][:50]}...")
```
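### Cropping Evidence for Review

The grounding and extraction APIs above compose naturally for human review: render the page an evidence reference points to, then crop the referenced region. A minimal sketch, assuming the `extraction` object from the Field Extraction example and that crops come back as PIL images (pillow is a core dependency); the output path is illustrative:

```python
from src.document_intelligence import load_document, RenderOptions
from src.document_intelligence.grounding import crop_region

loader, renderer = load_document("document.pdf")

for evidence in extraction.evidence:
    # Render the page this evidence points to, then cut out the region.
    page_image = renderer.render_page(evidence.page, RenderOptions(dpi=200))
    crop = crop_region(page_image, evidence.bbox, padding_percent=0.02)
    crop.save(f"evidence_p{evidence.page}.png")  # illustrative output path
```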
## Architecture

### Module Structure

```
src/document_intelligence/
├── __init__.py              # Main exports
├── chunks/                  # Core data models
│   ├── models.py            # BoundingBox, DocumentChunk, TableChunk, etc.
│   └── __init__.py
├── io/                      # Document loading
│   ├── base.py              # Abstract interfaces
│   ├── pdf.py               # PDF loading (PyMuPDF)
│   ├── image.py             # Image loading (PIL)
│   ├── cache.py             # Page caching
│   └── __init__.py
├── models/                  # Model interfaces
│   ├── base.py              # BaseModel, BatchableModel
│   ├── ocr.py               # OCRModel interface
│   ├── layout.py            # LayoutModel interface
│   ├── table.py             # TableModel interface
│   ├── chart.py             # ChartModel interface
│   ├── vlm.py               # VisionLanguageModel interface
│   └── __init__.py
├── parsing/                 # Document parsing
│   ├── parser.py            # DocumentParser orchestrator
│   ├── chunking.py          # Semantic chunking utilities
│   └── __init__.py
├── grounding/               # Visual evidence
│   ├── evidence.py          # EvidenceBuilder, EvidenceTracker
│   ├── crops.py             # Image cropping utilities
│   └── __init__.py
├── extraction/              # Field extraction
│   ├── schema.py            # ExtractionSchema, FieldSpec
│   ├── extractor.py         # FieldExtractor
│   ├── validator.py         # ExtractionValidator
│   └── __init__.py
├── tools/                   # Agent tools
│   ├── document_tools.py    # Tool implementations
│   └── __init__.py
├── validation/              # Result validation
│   └── __init__.py
└── agent_adapter.py         # Agent integration
```

### Data Models

#### BoundingBox

Represents a rectangular region in XYXY format:

```python
from src.document_intelligence.chunks import BoundingBox

# Normalized coordinates (0-1)
bbox = BoundingBox(
    x_min=0.1, y_min=0.2,
    x_max=0.9, y_max=0.3,
    normalized=True,
)

# Convert to pixels
pixel_bbox = bbox.to_pixel(width=1000, height=800)

# Calculate IoU
overlap = bbox1.iou(bbox2)

# Check containment
is_inside = bbox.contains((0.5, 0.25))
```

#### DocumentChunk

Base semantic chunk:

```python
from src.document_intelligence.chunks import DocumentChunk, ChunkType

chunk = DocumentChunk(
    chunk_id="abc123",
    doc_id="doc001",
    chunk_type=ChunkType.PARAGRAPH,
    text="Content...",
    page=1,
    bbox=bbox,
    confidence=0.95,
    sequence_index=0,
)
```

#### TableChunk

Table with cell structure:

```python
from src.document_intelligence.chunks import TableChunk, TableCell

# Given a TableChunk `table` from parsing:
# Access cells
cell = table.get_cell(row=0, col=1)

# Export formats
csv_data = table.to_csv()
markdown = table.to_markdown()
json_data = table.to_structured_json()
```

#### EvidenceRef

Links extractions to visual sources:

```python
from src.document_intelligence.chunks import EvidenceRef

evidence = EvidenceRef(
    chunk_id="chunk_001",
    doc_id="doc_001",
    page=1,
    bbox=bbox,
    source_type="text",
    snippet="The total is $500",
    confidence=0.9,
    cell_id=None,    # For table cells
    crop_path=None,  # Path to cropped image
)
```
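Because every chunk carries its identifiers, page, type, bounding box, and confidence, a parse result maps directly onto the RAG use case mentioned in the overview. A minimal sketch that flattens chunks into metadata-rich records using only the fields shown above; the record layout is illustrative, not a fixed SPARKNET format:

```python
def chunks_to_records(result):
    """Flatten parsed chunks into metadata-rich records for a vector store."""
    records = []
    for chunk in result.chunks:
        records.append({
            "id": chunk.chunk_id,
            "text": chunk.text,
            "metadata": {
                "doc_id": chunk.doc_id,
                "page": chunk.page,
                "type": chunk.chunk_type.value,
                "bbox": chunk.bbox.xyxy,
                "confidence": chunk.confidence,
            },
        })
    return records

records = chunks_to_records(result)  # result from DocumentParser.parse()
```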
## CLI Commands

```bash
# Parse document
sparknet docint parse document.pdf -o result.json
sparknet docint parse document.pdf --format markdown

# Extract fields
sparknet docint extract invoice.pdf --preset invoice
sparknet docint extract doc.pdf -f vendor_name -f total_amount
sparknet docint extract doc.pdf --schema my_schema.json

# Ask questions
sparknet docint ask document.pdf "What is the contract value?"

# Classify document
sparknet docint classify document.pdf

# Search content
sparknet docint search document.pdf -q "payment terms"
sparknet docint search document.pdf --type table

# Visualize regions
sparknet docint visualize document.pdf --page 1 --annotate
```

## Configuration

### Parser Configuration

```python
from src.document_intelligence import ParserConfig

config = ParserConfig(
    # Rendering
    render_dpi=200,     # DPI for page rasterization
    max_pages=None,     # Limit pages (None = all)

    # OCR
    ocr_enabled=True,
    ocr_languages=["en"],
    ocr_min_confidence=0.5,

    # Layout
    layout_enabled=True,
    reading_order_enabled=True,

    # Specialized extraction
    table_extraction_enabled=True,
    chart_extraction_enabled=True,

    # Chunking
    merge_adjacent_text=True,
    min_chunk_chars=10,
    max_chunk_chars=4000,

    # Output
    include_markdown=True,
    cache_enabled=True,
)
```

### Extraction Configuration

```python
from src.document_intelligence import ExtractionConfig

config = ExtractionConfig(
    # Confidence
    min_field_confidence=0.5,
    min_overall_confidence=0.5,

    # Abstention
    abstain_on_low_confidence=True,
    abstain_threshold=0.3,

    # Search
    search_all_chunks=True,
    prefer_structured_sources=True,

    # Validation
    validate_extracted_values=True,
    normalize_values=True,
)
```

## Preset Schemas

### Invoice

```python
from src.document_intelligence import create_invoice_schema

schema = create_invoice_schema()
# Fields: invoice_number, invoice_date, due_date, vendor_name, vendor_address,
#         customer_name, customer_address, subtotal, tax_amount, total_amount,
#         currency, payment_terms
```

### Receipt

```python
from src.document_intelligence import create_receipt_schema

schema = create_receipt_schema()
# Fields: merchant_name, merchant_address, transaction_date, transaction_time,
#         subtotal, tax_amount, total_amount, payment_method, last_four_digits
```

### Contract

```python
from src.document_intelligence import create_contract_schema

schema = create_contract_schema()
# Fields: contract_title, effective_date, expiration_date, party_a_name,
#         party_b_name, contract_value, governing_law, termination_clause
```

## Agent Integration

```python
from src.document_intelligence.agent_adapter import (
    DocumentIntelligenceAdapter,
    EnhancedDocumentAgent,
    AgentConfig,
)

# Create adapter
config = AgentConfig(
    render_dpi=200,
    min_confidence=0.5,
    max_iterations=10,
)

# With existing LLM client
agent = EnhancedDocumentAgent(
    llm_client=ollama_client,
    config=config,
)

# Load document
await agent.load_document("document.pdf")

# Extract with schema
result = await agent.extract_fields(schema)

# Answer questions
answer, evidence = await agent.answer_question("What is the total?")

# Classify
classification = await agent.classify()
```

## Available Tools

| Tool | Description |
|------|-------------|
| `parse_document` | Parse document into semantic chunks |
| `extract_fields` | Schema-driven field extraction |
| `search_chunks` | Search document content |
| `get_chunk_details` | Get detailed chunk information |
| `get_table_data` | Extract structured table data |
| `answer_question` | Document Q&A |
| `crop_region` | Extract visual regions |
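Every tool follows the registry pattern shown under Question Answering: fetch it by name with `get_tool`, call `execute`, and check `result.success`. A minimal sketch for `search_chunks`; the `query` keyword is an assumption mirroring the CLI's `-q` flag, not a confirmed signature:

```python
from src.document_intelligence.tools import get_tool

search_tool = get_tool("search_chunks")
result = search_tool.execute(
    parse_result=parse_result,
    query="payment terms",  # parameter name assumed, mirroring the CLI -q flag
)

if result.success:
    print(result.data)
```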
## Best Practices

### 1. Always Check Confidence

```python
if extraction.overall_confidence < 0.7:
    print("Low confidence - manual review recommended")

for field, value in extraction.data.items():
    if field in extraction.abstained_fields:
        print(f"{field}: Needs manual verification")
```

### 2. Use Evidence for Verification

```python
for evidence in extraction.evidence:
    print(f"Found on page {evidence.page}")
    print(f"Location: {evidence.bbox.xyxy}")
    print(f"Source text: {evidence.snippet}")
```

### 3. Handle Abstention Gracefully

```python
result = extractor.extract(parse_result, schema)

for field in schema.get_required_fields():
    if field.name in result.abstained_fields:
        # Request human review
        flag_for_review(field.name, parse_result.doc_id)
```

### 4. Validate Before Use

```python
from src.document_intelligence import ExtractionValidator

validator = ExtractionValidator(min_confidence=0.7)
validation = validator.validate(result, schema)

if not validation.is_valid:
    for issue in validation.issues:
        print(f"[{issue.severity}] {issue.field_name}: {issue.message}")
```

## Dependencies

- `pymupdf` - PDF loading and rendering
- `pillow` - Image processing
- `numpy` - Array operations
- `pydantic` - Data validation

Optional:

- `paddleocr` - OCR engine
- `tesseract` - Alternative OCR
- `chromadb` - Vector storage for RAG

## License

MIT License - see LICENSE file for details.