# SPARKNET Document Intelligence
A vision-first agentic document understanding platform that goes beyond OCR, supports complex layouts, and produces LLM-ready, visually grounded outputs suitable for RAG and field extraction at scale.
## Overview
The Document Intelligence subsystem provides:
- **Vision-First Understanding**: Treats documents as visual objects, not just text
- **Semantic Chunking**: Classifies regions by type (text, table, figure, chart, form, etc.)
- **Visual Grounding**: Every extraction includes evidence (page, bbox, snippet, confidence)
- **Zero-Shot Capability**: Works across diverse document formats without training
- **Schema-Driven Extraction**: Define fields using JSON Schema or Pydantic models
- **Abstention Policy**: Never guesses; abstains when confidence is low
- **Local-First**: All processing happens locally for privacy
## Quick Start
### Basic Parsing
```python
from src.document_intelligence import DocumentParser, ParserConfig

# Configure parser
config = ParserConfig(
    render_dpi=200,
    max_pages=10,
    include_markdown=True,
)

parser = DocumentParser(config=config)
result = parser.parse("document.pdf")

print(f"Parsed {len(result.chunks)} chunks from {result.num_pages} pages")

# Access chunks
for chunk in result.chunks:
    print(f"[Page {chunk.page}] {chunk.chunk_type.value}: {chunk.text[:100]}...")
```
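The chunk list is what makes the output RAG-ready: each chunk already carries its own grounding metadata. A minimal sketch of turning chunks into retrieval records (the record layout is illustrative, not a SPARKNET API; it uses only the `DocumentChunk` fields documented later in this page):

```python
# Build grounded retrieval records from parsed chunks.
# The dict layout is illustrative; the fields come from DocumentChunk.
records = [
    {
        "id": chunk.chunk_id,
        "text": chunk.text,
        "metadata": {
            "doc_id": chunk.doc_id,
            "page": chunk.page,
            "type": chunk.chunk_type.value,
            "confidence": chunk.confidence,
        },
    }
    for chunk in result.chunks
]
```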
### Field Extraction
```python
from src.document_intelligence import (
    FieldExtractor,
    ExtractionSchema,
    create_invoice_schema,
)

# Use a preset schema...
schema = create_invoice_schema()

# ...or create a custom schema
schema = ExtractionSchema(name="CustomSchema")
schema.add_string_field("company_name", "Name of the company", required=True)
schema.add_date_field("document_date", "Date on document")
schema.add_currency_field("total_amount", "Total amount")

# Extract fields (parse_result is the output of DocumentParser.parse)
extractor = FieldExtractor()
extraction = extractor.extract(parse_result, schema)

print("Extracted Data:")
for key, value in extraction.data.items():
    if key in extraction.abstained_fields:
        print(f"  {key}: [ABSTAINED]")
    else:
        print(f"  {key}: {value}")

print(f"Confidence: {extraction.overall_confidence:.2f}")
```
### Visual Grounding
```python
from src.document_intelligence import (
    load_document,
    RenderOptions,
)
from src.document_intelligence.grounding import (
    crop_region,
    create_annotated_image,
    EvidenceBuilder,
)

# Load and render page
loader, renderer = load_document("document.pdf")
page_image = renderer.render_page(1, RenderOptions(dpi=200))

# Create annotated visualization (result is the output of DocumentParser.parse)
bboxes = [chunk.bbox for chunk in result.chunks if chunk.page == 1]
labels = [chunk.chunk_type.value for chunk in result.chunks if chunk.page == 1]
annotated = create_annotated_image(page_image, bboxes, labels)

# Crop a specific chunk's region
chunk = result.chunks[0]
crop = crop_region(page_image, chunk.bbox, padding_percent=0.02)
```
### Question Answering
```python
from src.document_intelligence.tools import get_tool

qa_tool = get_tool("answer_question")
result = qa_tool.execute(
    parse_result=parse_result,
    question="What is the total amount due?",
)

if result.success:
    print(f"Answer: {result.data['answer']}")
    print(f"Confidence: {result.data['confidence']:.2f}")
    for ev in result.evidence:
        print(f"  Evidence: Page {ev['page']}, {ev['snippet'][:50]}...")
```
## Architecture
### Module Structure
```
src/document_intelligence/
├── __init__.py               # Main exports
├── chunks/                   # Core data models
│   ├── models.py             # BoundingBox, DocumentChunk, TableChunk, etc.
│   └── __init__.py
├── io/                       # Document loading
│   ├── base.py               # Abstract interfaces
│   ├── pdf.py                # PDF loading (PyMuPDF)
│   ├── image.py              # Image loading (PIL)
│   ├── cache.py              # Page caching
│   └── __init__.py
├── models/                   # Model interfaces
│   ├── base.py               # BaseModel, BatchableModel
│   ├── ocr.py                # OCRModel interface
│   ├── layout.py             # LayoutModel interface
│   ├── table.py              # TableModel interface
│   ├── chart.py              # ChartModel interface
│   ├── vlm.py                # VisionLanguageModel interface
│   └── __init__.py
├── parsing/                  # Document parsing
│   ├── parser.py             # DocumentParser orchestrator
│   ├── chunking.py           # Semantic chunking utilities
│   └── __init__.py
├── grounding/                # Visual evidence
│   ├── evidence.py           # EvidenceBuilder, EvidenceTracker
│   ├── crops.py              # Image cropping utilities
│   └── __init__.py
├── extraction/               # Field extraction
│   ├── schema.py             # ExtractionSchema, FieldSpec
│   ├── extractor.py          # FieldExtractor
│   ├── validator.py          # ExtractionValidator
│   └── __init__.py
├── tools/                    # Agent tools
│   ├── document_tools.py     # Tool implementations
│   └── __init__.py
├── validation/               # Result validation
│   └── __init__.py
└── agent_adapter.py          # Agent integration
```
### Data Models
#### BoundingBox
Represents a rectangular region in XYXY format:
```python
from src.document_intelligence.chunks import BoundingBox

# Normalized coordinates (0-1)
bbox = BoundingBox(
    x_min=0.1, y_min=0.2,
    x_max=0.9, y_max=0.3,
    normalized=True,
)

# Convert to pixels
pixel_bbox = bbox.to_pixel(width=1000, height=800)

# Calculate IoU between two BoundingBox instances
overlap = bbox1.iou(bbox2)

# Check whether a point lies inside the box
is_inside = bbox.contains((0.5, 0.25))
```
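For reference, IoU over XYXY boxes reduces to simple interval arithmetic. A standalone sketch in plain Python, independent of the `BoundingBox` class:

```python
def iou_xyxy(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes in the same coordinate space."""
    # Intersection rectangle (clamped to zero when boxes don't overlap)
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    # Union = sum of areas minus intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

iou_xyxy((0.1, 0.2, 0.9, 0.3), (0.1, 0.25, 0.9, 0.35))  # ≈ 0.33
```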
#### DocumentChunk
Base semantic chunk:
```python
from src.document_intelligence.chunks import DocumentChunk, ChunkType

chunk = DocumentChunk(
    chunk_id="abc123",
    doc_id="doc001",
    chunk_type=ChunkType.PARAGRAPH,
    text="Content...",
    page=1,
    bbox=bbox,
    confidence=0.95,
    sequence_index=0,
)
```
#### TableChunk
Table with cell structure:
```python
from src.document_intelligence.chunks import TableChunk, TableCell

# `table` is a TableChunk, e.g. taken from a parse result

# Access cells
cell = table.get_cell(row=0, col=1)

# Export formats
csv_data = table.to_csv()
markdown = table.to_markdown()
json_data = table.to_structured_json()
```
#### EvidenceRef
Links extractions to visual sources:
```python
from src.document_intelligence.chunks import EvidenceRef

evidence = EvidenceRef(
    chunk_id="chunk_001",
    doc_id="doc_001",
    page=1,
    bbox=bbox,
    source_type="text",
    snippet="The total is $500",
    confidence=0.9,
    cell_id=None,    # For table cells
    crop_path=None,  # Path to cropped image
)
```
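Since every `EvidenceRef` carries a page, box, and snippet, rendering it as a reviewer-friendly citation is a one-liner. A sketch using only the fields shown above:

```python
def format_citation(ev):
    # Uses only documented EvidenceRef fields: snippet, page, bbox, confidence
    return (
        f'"{ev.snippet}" '
        f"(page {ev.page}, bbox {ev.bbox.xyxy}, confidence {ev.confidence:.2f})"
    )

print(format_citation(evidence))
```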
## CLI Commands
```bash
# Parse document
sparknet docint parse document.pdf -o result.json
sparknet docint parse document.pdf --format markdown
# Extract fields
sparknet docint extract invoice.pdf --preset invoice
sparknet docint extract doc.pdf -f vendor_name -f total_amount
sparknet docint extract doc.pdf --schema my_schema.json
# Ask questions
sparknet docint ask document.pdf "What is the contract value?"
# Classify document
sparknet docint classify document.pdf
# Search content
sparknet docint search document.pdf -q "payment terms"
sparknet docint search document.pdf --type table
# Visualize regions
sparknet docint visualize document.pdf --page 1 --annotate
```
## Configuration
### Parser Configuration
```python
from src.document_intelligence import ParserConfig

config = ParserConfig(
    # Rendering
    render_dpi=200,   # DPI for page rasterization
    max_pages=None,   # Limit pages (None = all)

    # OCR
    ocr_enabled=True,
    ocr_languages=["en"],
    ocr_min_confidence=0.5,

    # Layout
    layout_enabled=True,
    reading_order_enabled=True,

    # Specialized extraction
    table_extraction_enabled=True,
    chart_extraction_enabled=True,

    # Chunking
    merge_adjacent_text=True,
    min_chunk_chars=10,
    max_chunk_chars=4000,

    # Output
    include_markdown=True,
    cache_enabled=True,
)
```
### Extraction Configuration
```python
from src.document_intelligence import ExtractionConfig

config = ExtractionConfig(
    # Confidence
    min_field_confidence=0.5,
    min_overall_confidence=0.5,

    # Abstention
    abstain_on_low_confidence=True,
    abstain_threshold=0.3,

    # Search
    search_all_chunks=True,
    prefer_structured_sources=True,

    # Validation
    validate_extracted_values=True,
    normalize_values=True,
)
```
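To apply it, the extractor presumably takes the config the same way `DocumentParser` does; a sketch assuming a `config` keyword (an assumption, so check the `FieldExtractor` signature):

```python
# Assumption: FieldExtractor accepts the config via a `config` keyword,
# mirroring DocumentParser(config=config) above.
extractor = FieldExtractor(config=config)
extraction = extractor.extract(parse_result, schema)
```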
## Preset Schemas
### Invoice
```python
from src.document_intelligence import create_invoice_schema
schema = create_invoice_schema()
# Fields: invoice_number, invoice_date, due_date, vendor_name, vendor_address,
# customer_name, customer_address, subtotal, tax_amount, total_amount,
# currency, payment_terms
```
### Receipt
```python
from src.document_intelligence import create_receipt_schema
schema = create_receipt_schema()
# Fields: merchant_name, merchant_address, transaction_date, transaction_time,
# subtotal, tax_amount, total_amount, payment_method, last_four_digits
```
### Contract
```python
from src.document_intelligence import create_contract_schema
schema = create_contract_schema()
# Fields: contract_title, effective_date, expiration_date, party_a_name,
# party_b_name, contract_value, governing_law, termination_clause
```
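Presets can be combined with the custom-field methods from the Quick Start. Assuming they return a regular `ExtractionSchema` (which the shared `extractor.extract(parse_result, schema)` call suggests), extending one looks like:

```python
from src.document_intelligence import create_invoice_schema

# Extend a preset with a project-specific field (assumes presets return
# a regular ExtractionSchema exposing the add_*_field methods).
schema = create_invoice_schema()
schema.add_string_field("po_number", "Purchase order number referenced on the invoice")
```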
## Agent Integration
```python
import asyncio

from src.document_intelligence.agent_adapter import (
    DocumentIntelligenceAdapter,
    EnhancedDocumentAgent,
    AgentConfig,
)

# Create adapter config
config = AgentConfig(
    render_dpi=200,
    min_confidence=0.5,
    max_iterations=10,
)

async def main():
    # With an existing LLM client (e.g. an Ollama client instance)
    agent = EnhancedDocumentAgent(
        llm_client=ollama_client,
        config=config,
    )

    # Load document
    await agent.load_document("document.pdf")

    # Extract with schema
    result = await agent.extract_fields(schema)

    # Answer questions
    answer, evidence = await agent.answer_question("What is the total?")

    # Classify
    classification = await agent.classify()

asyncio.run(main())
```
## Available Tools
| Tool | Description |
|------|-------------|
| `parse_document` | Parse document into semantic chunks |
| `extract_fields` | Schema-driven field extraction |
| `search_chunks` | Search document content |
| `get_chunk_details` | Get detailed chunk information |
| `get_table_data` | Extract structured table data |
| `answer_question` | Document Q&A |
| `crop_region` | Extract visual regions |
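All of these go through the same `get_tool` entry point shown in the Q&A example. A sketch for `search_chunks` (the `execute` keyword names are assumptions, modeled on the `answer_question` call above):

```python
from src.document_intelligence.tools import get_tool

# Assumption: search_chunks takes a `query` keyword, by analogy with
# answer_question's `question` keyword; the result payload shape varies by tool.
search_tool = get_tool("search_chunks")
result = search_tool.execute(
    parse_result=parse_result,
    query="payment terms",
)
if result.success:
    print(result.data)
```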
## Best Practices
### 1. Always Check Confidence
```python
if extraction.overall_confidence < 0.7:
    print("Low confidence - manual review recommended")

for field in extraction.data:
    if field in extraction.abstained_fields:
        print(f"{field}: Needs manual verification")
```
### 2. Use Evidence for Verification
```python
for evidence in extraction.evidence:
    print(f"Found on page {evidence.page}")
    print(f"Location: {evidence.bbox.xyxy}")
    print(f"Source text: {evidence.snippet}")
```
### 3. Handle Abstention Gracefully
```python
result = extractor.extract(parse_result, schema)

for field in schema.get_required_fields():
    if field.name in result.abstained_fields:
        # Request human review (flag_for_review is your own hook)
        flag_for_review(field.name, parse_result.doc_id)
```
### 4. Validate Before Use
```python
from src.document_intelligence import ExtractionValidator

validator = ExtractionValidator(min_confidence=0.7)
validation = validator.validate(result, schema)

if not validation.is_valid:
    for issue in validation.issues:
        print(f"[{issue.severity}] {issue.field_name}: {issue.message}")
```
## Dependencies
- `pymupdf` - PDF loading and rendering
- `pillow` - Image processing
- `numpy` - Array operations
- `pydantic` - Data validation
Optional:
- `paddleocr` - OCR engine
- `tesseract` - Alternative OCR
- `chromadb` - Vector storage for RAG
## License
MIT License - see LICENSE file for details.