# SPARKNET Document Intelligence

A vision-first agentic document understanding platform that goes beyond OCR, supports complex layouts, and produces LLM-ready, visually grounded outputs suitable for RAG and field extraction at scale.

## Overview

The Document Intelligence subsystem provides:

- **Vision-First Understanding**: Treats documents as visual objects, not just text
- **Semantic Chunking**: Classifies regions by type (text, table, figure, chart, form, etc.)
- **Visual Grounding**: Every extraction includes evidence (page, bbox, snippet, confidence)
- **Zero-Shot Capability**: Works across diverse document formats without training
- **Schema-Driven Extraction**: Define fields using JSON Schema or Pydantic models
- **Abstention Policy**: Abstains rather than guessing when confidence is low
- **Local-First**: All processing happens locally for privacy
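The abstention policy amounts to refusing to emit a value whose confidence falls below a threshold. A minimal, self-contained sketch of that idea (the helper and sentinel are hypothetical; the threshold name mirrors `abstain_threshold` from the extraction config):

```python
ABSTAIN = object()  # sentinel meaning "no answer rather than a guess"

def apply_abstention(value, confidence, threshold=0.3):
    """Return the value only when confidence clears the threshold."""
    return value if confidence >= threshold else ABSTAIN

# A low-confidence extraction abstains instead of guessing.
high = apply_abstention("ACME Corp", 0.92)   # kept
low = apply_abstention("ACM3 C0rp", 0.12)    # abstained
```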
## Quick Start

### Basic Parsing

```python
from src.document_intelligence import DocumentParser, ParserConfig

# Configure parser
config = ParserConfig(
    render_dpi=200,
    max_pages=10,
    include_markdown=True,
)

parser = DocumentParser(config=config)
result = parser.parse("document.pdf")

print(f"Parsed {len(result.chunks)} chunks from {result.num_pages} pages")

# Access chunks
for chunk in result.chunks:
    print(f"[Page {chunk.page}] {chunk.chunk_type.value}: {chunk.text[:100]}...")
```
### Field Extraction

```python
from src.document_intelligence import (
    FieldExtractor,
    ExtractionSchema,
    create_invoice_schema,
)

# Use preset schema
schema = create_invoice_schema()

# Or create custom schema
schema = ExtractionSchema(name="CustomSchema")
schema.add_string_field("company_name", "Name of the company", required=True)
schema.add_date_field("document_date", "Date on document")
schema.add_currency_field("total_amount", "Total amount")

# Extract fields (parse_result comes from DocumentParser.parse(); see Basic Parsing)
extractor = FieldExtractor()
extraction = extractor.extract(parse_result, schema)

print("Extracted Data:")
for key, value in extraction.data.items():
    if key in extraction.abstained_fields:
        print(f"  {key}: [ABSTAINED]")
    else:
        print(f"  {key}: {value}")
print(f"Confidence: {extraction.overall_confidence:.2f}")
```
### Visual Grounding

```python
from src.document_intelligence import (
    load_document,
    RenderOptions,
)
from src.document_intelligence.grounding import (
    crop_region,
    create_annotated_image,
    EvidenceBuilder,
)

# Load and render page
loader, renderer = load_document("document.pdf")
page_image = renderer.render_page(1, RenderOptions(dpi=200))

# Create annotated visualization (result comes from DocumentParser.parse())
bboxes = [chunk.bbox for chunk in result.chunks if chunk.page == 1]
labels = [chunk.chunk_type.value for chunk in result.chunks if chunk.page == 1]
annotated = create_annotated_image(page_image, bboxes, labels)

# Crop a specific chunk's region
crop = crop_region(page_image, result.chunks[0].bbox, padding_percent=0.02)
```
### Question Answering

```python
from src.document_intelligence.tools import get_tool

qa_tool = get_tool("answer_question")
result = qa_tool.execute(
    parse_result=parse_result,
    question="What is the total amount due?",
)

if result.success:
    print(f"Answer: {result.data['answer']}")
    print(f"Confidence: {result.data['confidence']:.2f}")
    for ev in result.evidence:
        print(f"  Evidence: Page {ev['page']}, {ev['snippet'][:50]}...")
```
## Architecture

### Module Structure

```
src/document_intelligence/
├── __init__.py           # Main exports
├── chunks/               # Core data models
│   ├── models.py         # BoundingBox, DocumentChunk, TableChunk, etc.
│   └── __init__.py
├── io/                   # Document loading
│   ├── base.py           # Abstract interfaces
│   ├── pdf.py            # PDF loading (PyMuPDF)
│   ├── image.py          # Image loading (PIL)
│   ├── cache.py          # Page caching
│   └── __init__.py
├── models/               # Model interfaces
│   ├── base.py           # BaseModel, BatchableModel
│   ├── ocr.py            # OCRModel interface
│   ├── layout.py         # LayoutModel interface
│   ├── table.py          # TableModel interface
│   ├── chart.py          # ChartModel interface
│   ├── vlm.py            # VisionLanguageModel interface
│   └── __init__.py
├── parsing/              # Document parsing
│   ├── parser.py         # DocumentParser orchestrator
│   ├── chunking.py       # Semantic chunking utilities
│   └── __init__.py
├── grounding/            # Visual evidence
│   ├── evidence.py       # EvidenceBuilder, EvidenceTracker
│   ├── crops.py          # Image cropping utilities
│   └── __init__.py
├── extraction/           # Field extraction
│   ├── schema.py         # ExtractionSchema, FieldSpec
│   ├── extractor.py      # FieldExtractor
│   ├── validator.py      # ExtractionValidator
│   └── __init__.py
├── tools/                # Agent tools
│   ├── document_tools.py # Tool implementations
│   └── __init__.py
├── validation/           # Result validation
│   └── __init__.py
└── agent_adapter.py      # Agent integration
```
### Data Models

#### BoundingBox

Represents a rectangular region in XYXY format:

```python
from src.document_intelligence.chunks import BoundingBox

# Normalized coordinates (0-1)
bbox = BoundingBox(
    x_min=0.1, y_min=0.2,
    x_max=0.9, y_max=0.3,
    normalized=True,
)

# Convert to pixels
pixel_bbox = bbox.to_pixel(width=1000, height=800)

# Calculate IoU against another BoundingBox instance
overlap = bbox.iou(other_bbox)

# Check whether a point falls inside the box
is_inside = bbox.contains((0.5, 0.25))
```
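For reference, IoU over two XYXY boxes reduces to intersection area over union area. A plain-Python sketch of the math, independent of the `BoundingBox` class:

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```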
#### DocumentChunk

Base semantic chunk:

```python
from src.document_intelligence.chunks import DocumentChunk, ChunkType

chunk = DocumentChunk(
    chunk_id="abc123",
    doc_id="doc001",
    chunk_type=ChunkType.PARAGRAPH,
    text="Content...",
    page=1,
    bbox=bbox,
    confidence=0.95,
    sequence_index=0,
)
```
#### TableChunk

Table with cell structure:

```python
from src.document_intelligence.chunks import TableChunk, TableCell

# Given a TableChunk instance (e.g. a table-typed chunk from result.chunks)
cell = table.get_cell(row=0, col=1)

# Export formats
csv_data = table.to_csv()
markdown = table.to_markdown()
json_data = table.to_structured_json()
```
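The `to_markdown` export presumably renders the cell grid as a pipe table. A minimal sketch of that conversion over a plain list-of-rows grid (not the library's actual implementation):

```python
def grid_to_markdown(rows):
    """Render a list of equal-length rows as a Markdown pipe table.

    The first row is treated as the header.
    """
    header, *body = rows
    lines = ["| " + " | ".join(str(c) for c in header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    for row in body:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)
```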
#### EvidenceRef

Links extractions to visual sources:

```python
from src.document_intelligence.chunks import EvidenceRef

evidence = EvidenceRef(
    chunk_id="chunk_001",
    doc_id="doc_001",
    page=1,
    bbox=bbox,
    source_type="text",
    snippet="The total is $500",
    confidence=0.9,
    cell_id=None,    # For table cells
    crop_path=None,  # Path to cropped image
)
```
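When surfacing evidence to users, a compact one-line citation built from these fields works well. A hypothetical formatter (not part of the library API):

```python
def format_evidence(page, snippet, confidence, max_len=60):
    """One-line citation: page number, truncated snippet, confidence."""
    text = snippet if len(snippet) <= max_len else snippet[: max_len - 3] + "..."
    return f'p.{page}: "{text}" (conf {confidence:.2f})'
```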
## CLI Commands

```bash
# Parse document
sparknet docint parse document.pdf -o result.json
sparknet docint parse document.pdf --format markdown

# Extract fields
sparknet docint extract invoice.pdf --preset invoice
sparknet docint extract doc.pdf -f vendor_name -f total_amount
sparknet docint extract doc.pdf --schema my_schema.json

# Ask questions
sparknet docint ask document.pdf "What is the contract value?"

# Classify document
sparknet docint classify document.pdf

# Search content
sparknet docint search document.pdf -q "payment terms"
sparknet docint search document.pdf --type table

# Visualize regions
sparknet docint visualize document.pdf --page 1 --annotate
```
## Configuration

### Parser Configuration

```python
from src.document_intelligence import ParserConfig

config = ParserConfig(
    # Rendering
    render_dpi=200,   # DPI for page rasterization
    max_pages=None,   # Limit pages (None = all)
    # OCR
    ocr_enabled=True,
    ocr_languages=["en"],
    ocr_min_confidence=0.5,
    # Layout
    layout_enabled=True,
    reading_order_enabled=True,
    # Specialized extraction
    table_extraction_enabled=True,
    chart_extraction_enabled=True,
    # Chunking
    merge_adjacent_text=True,
    min_chunk_chars=10,
    max_chunk_chars=4000,
    # Output
    include_markdown=True,
    cache_enabled=True,
)
```
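The interplay of `merge_adjacent_text`, `min_chunk_chars`, and `max_chunk_chars` can be pictured as a greedy merge over consecutive text runs. This is an illustrative sketch only; the parser's actual strategy may differ:

```python
def merge_text_runs(texts, max_chars=4000, min_chars=10):
    """Greedily merge consecutive text runs up to max_chars,
    then drop merged chunks shorter than min_chars."""
    merged, buf = [], ""
    for t in texts:
        # Start a new chunk when appending would exceed the limit
        if buf and len(buf) + 1 + len(t) > max_chars:
            merged.append(buf)
            buf = t
        else:
            buf = f"{buf} {t}" if buf else t
    if buf:
        merged.append(buf)
    return [m for m in merged if len(m) >= min_chars]
```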
### Extraction Configuration

```python
from src.document_intelligence import ExtractionConfig

config = ExtractionConfig(
    # Confidence
    min_field_confidence=0.5,
    min_overall_confidence=0.5,
    # Abstention
    abstain_on_low_confidence=True,
    abstain_threshold=0.3,
    # Search
    search_all_chunks=True,
    prefer_structured_sources=True,
    # Validation
    validate_extracted_values=True,
    normalize_values=True,
)
```
## Preset Schemas

### Invoice

```python
from src.document_intelligence import create_invoice_schema

schema = create_invoice_schema()
# Fields: invoice_number, invoice_date, due_date, vendor_name, vendor_address,
#         customer_name, customer_address, subtotal, tax_amount, total_amount,
#         currency, payment_terms
```

### Receipt

```python
from src.document_intelligence import create_receipt_schema

schema = create_receipt_schema()
# Fields: merchant_name, merchant_address, transaction_date, transaction_time,
#         subtotal, tax_amount, total_amount, payment_method, last_four_digits
```

### Contract

```python
from src.document_intelligence import create_contract_schema

schema = create_contract_schema()
# Fields: contract_title, effective_date, expiration_date, party_a_name,
#         party_b_name, contract_value, governing_law, termination_clause
```
## Agent Integration

```python
from src.document_intelligence.agent_adapter import (
    DocumentIntelligenceAdapter,
    EnhancedDocumentAgent,
    AgentConfig,
)

# Create adapter config
config = AgentConfig(
    render_dpi=200,
    min_confidence=0.5,
    max_iterations=10,
)

# With an existing LLM client (e.g. an Ollama client instance)
agent = EnhancedDocumentAgent(
    llm_client=ollama_client,
    config=config,
)

# The calls below are coroutines; await them inside an async function

# Load document
await agent.load_document("document.pdf")

# Extract with schema
result = await agent.extract_fields(schema)

# Answer questions
answer, evidence = await agent.answer_question("What is the total?")

# Classify
classification = await agent.classify()
```
## Available Tools

| Tool | Description |
|------|-------------|
| `parse_document` | Parse document into semantic chunks |
| `extract_fields` | Schema-driven field extraction |
| `search_chunks` | Search document content |
| `get_chunk_details` | Get detailed chunk information |
| `get_table_data` | Extract structured table data |
| `answer_question` | Document Q&A |
| `crop_region` | Extract visual regions |
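The `get_tool` lookup presumably sits on top of a simple name-to-tool registry. A minimal sketch of that pattern (class and method names here are hypothetical, not the library's API):

```python
class ToolRegistry:
    """Map tool names to callables; raise on unknown names."""

    def __init__(self):
        self._tools = {}

    def register(self, name, tool):
        self._tools[name] = tool

    def get(self, name):
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name]

registry = ToolRegistry()
registry.register("answer_question", lambda **kw: {"answer": "stub"})
qa = registry.get("answer_question")
```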
## Best Practices

### 1. Always Check Confidence

```python
if extraction.overall_confidence < 0.7:
    print("Low confidence - manual review recommended")

for field, value in extraction.data.items():
    if field in extraction.abstained_fields:
        print(f"{field}: Needs manual verification")
```

### 2. Use Evidence for Verification

```python
for evidence in extraction.evidence:
    print(f"Found on page {evidence.page}")
    print(f"Location: {evidence.bbox.xyxy}")
    print(f"Source text: {evidence.snippet}")
```

### 3. Handle Abstention Gracefully

```python
result = extractor.extract(parse_result, schema)

for field in schema.get_required_fields():
    if field.name in result.abstained_fields:
        # Request human review
        flag_for_review(field.name, parse_result.doc_id)
```

### 4. Validate Before Use

```python
from src.document_intelligence import ExtractionValidator

validator = ExtractionValidator(min_confidence=0.7)
validation = validator.validate(result, schema)

if not validation.is_valid:
    for issue in validation.issues:
        print(f"[{issue.severity}] {issue.field_name}: {issue.message}")
```
## Dependencies

- `pymupdf` - PDF loading and rendering
- `pillow` - Image processing
- `numpy` - Array operations
- `pydantic` - Data validation

Optional:

- `paddleocr` - OCR engine
- `tesseract` - Alternative OCR
- `chromadb` - Vector storage for RAG

## License

MIT License - see LICENSE file for details.