# System Architecture: Agentic Business Digitization Framework
## Architecture Overview
### System Philosophy
The architecture follows a **multi-agent microservices pattern** where specialized agents collaborate to transform unstructured documents into structured business profiles. Each agent has a single responsibility and communicates through well-defined interfaces.
### Core Principles
1. **Separation of Concerns**: Each agent handles one aspect of processing
2. **Fail Gracefully**: Missing information results in empty fields, not errors
3. **Deterministic Parsing**: Deterministic scripts handle extraction; LLMs handle classification and field mapping
4. **Data Provenance**: Track the source of every extracted field
5. **Extensibility**: Easy to add new document types or agents
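Principles 2 and 4 can be combined in a single value type. The following is a minimal sketch, not part of the codebase: `ExtractedField` and `extract_phone` are hypothetical names, and the phone regex is only an illustration of an extractor that returns an empty field instead of raising.

```python
import re
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ExtractedField(Generic[T]):
    """A value plus where it came from (hypothetical helper)."""
    value: Optional[T] = None          # None means "not found", never an error
    source_file: Optional[str] = None  # provenance: originating file
    page: Optional[int] = None         # provenance: 1-based page number

    @property
    def is_empty(self) -> bool:
        return self.value is None


def extract_phone(text: str, source_file: str, page: int) -> ExtractedField[str]:
    """Return the first phone-like token, or an empty field if none is present."""
    match = re.search(r"\+?\d[\d\s\-()]{6,}\d", text)
    if match is None:
        return ExtractedField()  # fail gracefully: empty field, no exception
    return ExtractedField(match.group().strip(), source_file, page)
```

Downstream agents can then check `is_empty` rather than catching exceptions, and the UI can show which file and page each value came from.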
## High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    User Interface Layer                     │
│    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│    │  ZIP Upload  │  │ Profile View │  │Edit Interface│     │
│    └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Orchestration Layer                     │
│             ┌────────────────────────────────┐              │
│             │  BusinessDigitizationPipeline  │              │
│             │  - Workflow Coordination       │              │
│             │  - Error Handling              │              │
│             │  - Progress Tracking           │              │
│             └────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                               │
             ┌─────────────────┼─────────────────┐
             ▼                 ▼                 ▼
     ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
     │File Discovery│  │Document Parse│  │Media Extract │
     │    Agent     │  │    Agent     │  │    Agent     │
     └──────────────┘  └──────────────┘  └──────────────┘
             │                 │                 │
             ▼                 ▼                 ▼
     ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
     │Table Extract │  │Vision/Image  │  │Schema Mapping│
     │    Agent     │  │    Agent     │  │    Agent     │
     └──────────────┘  └──────────────┘  └──────────────┘
             │                 │                 │
             └─────────────────┼─────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Indexing & RAG Layer                     │
│             ┌────────────────────────────────┐              │
│             │  Page Index (Vectorless)       │              │
│             │  - Document-level indexing     │              │
│             │  - Page-level context          │              │
│             │  - Metadata storage            │              │
│             └────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Validation Layer                       │
│             ┌────────────────────────────────┐              │
│             │  Schema Validator              │              │
│             │  - Field validation            │              │
│             │  - Completeness scoring        │              │
│             │  - Data quality checks         │              │
│             └────────────────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         Data Layer                          │
│    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│    │ File Storage │  │ Index Store  │  │Profile Store │     │
│    │ (Filesystem) │  │(SQLite/JSON) │  │    (JSON)    │     │
│    └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
```
## Component Architecture
### 1. User Interface Layer
#### 1.1 Upload Component
**Purpose**: Accept ZIP files from users
**Technology**: React with react-dropzone
**Responsibilities**:
- Drag-and-drop file upload
- ZIP validation (size, format)
- Upload progress tracking
- Error messaging
**Interface**:
```typescript
interface UploadComponentProps {
  onUploadComplete: (jobId: string) => void;
  maxFileSize: number; // in MB
  acceptedFormats: string[];
}
```
#### 1.2 Profile Viewer
**Purpose**: Display generated business profiles
**Technology**: React with dynamic rendering
**Responsibilities**:
- Conditional rendering based on business type
- Product inventory display
- Service inventory display
- Media gallery
- Metadata presentation
**Interface**:
```typescript
interface BusinessProfile {
  businessInfo: BusinessInfo;
  products?: Product[];
  services?: Service[];
  media: MediaFile[];
  metadata: ProfileMetadata;
}
```
#### 1.3 Edit Interface
**Purpose**: Allow post-digitization editing
**Technology**: React Hook Form with Zod validation
**Responsibilities**:
- Form-based editing
- Field validation
- Media upload/removal
- Save/discard changes
- Version history
### 2. Orchestration Layer
#### BusinessDigitizationPipeline
**Purpose**: Coordinate multi-agent workflow
**Technology**: Python async/await with concurrent processing
**Core Workflow**:
```python
import asyncio


class BusinessDigitizationPipeline:
    def __init__(self):
        self.file_discovery = FileDiscoveryAgent()
        self.parsing = DocumentParsingAgent()
        self.table_extraction = TableExtractionAgent()
        self.media_extraction = MediaExtractionAgent()
        self.vision = VisionAgent()
        self.indexing = IndexingAgent()
        self.schema_mapping = SchemaMappingAgent()
        self.validation = ValidationAgent()

    async def process(self, zip_path: str) -> BusinessProfile:
        # The discovery, parsing, table, and media agents in this document
        # expose synchronous methods, so they run in worker threads. The
        # remaining agents are awaited directly and must expose coroutines.
        try:
            # Phase 1: Discover files
            files = await asyncio.to_thread(self.file_discovery.discover, zip_path)

            # Phase 2: Parse documents (parallel)
            parsed_docs = await asyncio.gather(*[
                asyncio.to_thread(self.parsing.parse, f) for f in files.documents
            ])

            # Phase 3: Extract tables (parallel); yields one list of tables per document
            tables = await asyncio.gather(*[
                asyncio.to_thread(self.table_extraction.extract, doc)
                for doc in parsed_docs
            ])

            # Phase 4: Extract media
            media = await asyncio.to_thread(
                self.media_extraction.extract_all, parsed_docs, files.media_files
            )

            # Phase 5: Vision processing for images
            image_metadata = await asyncio.gather(*[
                self.vision.analyze(img) for img in media.images
            ])

            # Phase 6: Build page index
            page_index = await self.indexing.build_index(
                parsed_docs, tables, media
            )

            # Phase 7: LLM-assisted schema mapping
            profile = await self.schema_mapping.map_to_schema(
                page_index, image_metadata
            )

            # Phase 8: Validation
            validated_profile = await self.validation.validate(profile)
            return validated_profile
        except Exception as e:
            # Log and record the failure, then let the caller see it
            self.handle_error(e)
            raise
```
**Error Handling Strategy**:
- Graceful degradation per agent
- Detailed error logging
- Partial results on failure
- User-friendly error messages
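One way to get partial results out of a parallel phase is `asyncio.gather(..., return_exceptions=True)`, which returns exceptions as values instead of cancelling the whole batch. The sketch below is illustrative only: `gather_partial` and `_demo_parse` are hypothetical names, and the pipeline above does not currently work this way.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_partial(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
) -> Tuple[List[R], List[Tuple[T, Exception]]]:
    """Run worker on every item; return successes and (item, error) pairs."""
    outcomes = await asyncio.gather(
        *(worker(item) for item in items), return_exceptions=True
    )
    results, failures = [], []
    for item, outcome in zip(items, outcomes):
        if isinstance(outcome, Exception):
            failures.append((item, outcome))  # keep for logging and user messages
        else:
            results.append(outcome)
    return results, failures


async def _demo_parse(path: str) -> str:
    # Stand-in for a per-document parse call; fails on one input.
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return path.upper()


results, failures = asyncio.run(
    gather_partial(["a.pdf", "b.bad", "c.docx"], _demo_parse)
)
```

Here the two readable files still produce results, while the failing file ends up in `failures` where it can be logged in detail and summarized for the user.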
### 3. Agent Layer
#### 3.1 File Discovery Agent
**Purpose**: Extract and classify files from ZIP
**Input**: ZIP file path
**Output**: Classified file collection
**Implementation**:
```python
import mimetypes


class FileDiscoveryAgent:
    def discover(self, zip_path: str) -> FileCollection:
        """
        Extract ZIP and classify files by type
        """
        extracted_files = self.extract_zip(zip_path)
        return FileCollection(
            documents=self.classify_documents(extracted_files),
            media_files=self.classify_media(extracted_files),
            spreadsheets=self.classify_spreadsheets(extracted_files),
            directory_structure=self.map_structure(extracted_files)
        )

    def classify_file(self, file_path: str) -> FileType:
        """
        Determine file type using mimetypes and extension
        """
        # guess_type infers the MIME type from the file extension and
        # returns (None, None) when the extension is unknown
        mime_type, _ = mimetypes.guess_type(file_path)
        return self.mime_to_file_type(mime_type)
```
**Supported File Types**:
- Documents: PDF, DOC, DOCX
- Spreadsheets: XLS, XLSX, CSV
- Images: JPG, PNG, GIF, WEBP
- Videos: MP4, AVI, MOV
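The supported types above map naturally onto an extension lookup table. This is a small sketch under assumptions: `FileCategory` and `categorize` are hypothetical names (the document's own `FileType` enum is not shown), and `.jpeg` is added as a common alias for JPG.

```python
from enum import Enum
from pathlib import Path


class FileCategory(Enum):
    DOCUMENT = "document"
    SPREADSHEET = "spreadsheet"
    IMAGE = "image"
    VIDEO = "video"
    UNKNOWN = "unknown"


# Extensions taken from the supported-types list; ".jpeg" is an added alias.
_EXTENSIONS = {
    FileCategory.DOCUMENT: {".pdf", ".doc", ".docx"},
    FileCategory.SPREADSHEET: {".xls", ".xlsx", ".csv"},
    FileCategory.IMAGE: {".jpg", ".jpeg", ".png", ".gif", ".webp"},
    FileCategory.VIDEO: {".mp4", ".avi", ".mov"},
}

# Invert to a flat extension -> category table for O(1) lookups.
_BY_EXTENSION = {
    ext: category for category, exts in _EXTENSIONS.items() for ext in exts
}


def categorize(file_path: str) -> FileCategory:
    """Classify a path by its lowercased extension; unknown types are not errors."""
    return _BY_EXTENSION.get(Path(file_path).suffix.lower(), FileCategory.UNKNOWN)
```

Returning `UNKNOWN` instead of raising keeps discovery in line with the fail-gracefully principle: unsupported files are skipped rather than aborting the job.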
#### 3.2 Document Parsing Agent
**Purpose**: Extract text and structure from documents
**Input**: Document file path
**Output**: Parsed document with metadata
**Implementation**:
```python
class DocumentParsingAgent:
    def __init__(self):
        self.parsers = {
            FileType.PDF: PDFParser(),
            FileType.DOCX: DOCXParser(),
            FileType.DOC: DOCParser()
        }

    def parse(self, file_path: str) -> ParsedDocument:
        """
        Factory pattern to select appropriate parser
        """
        file_type = self.detect_type(file_path)
        parser = self.parsers.get(file_type)
        if not parser:
            raise UnsupportedFileTypeError(file_type)
        return parser.parse(file_path)
```
**PDF Parser**:
```python
import pdfplumber


class PDFParser:
    def parse(self, pdf_path: str) -> ParsedDocument:
        """
        Extract text, preserve structure, identify sections
        """
        with pdfplumber.open(pdf_path) as pdf:
            pages = []
            for i, page in enumerate(pdf.pages):
                pages.append(Page(
                    number=i + 1,  # 1-based page numbers
                    text=page.extract_text() or "",  # normalize a missing result to ""
                    tables=page.extract_tables(),
                    images=self.extract_images(page),
                    metadata=self.extract_page_metadata(page)
                ))
            return ParsedDocument(
                source=pdf_path,
                pages=pages,
                total_pages=len(pages),
                metadata=self.extract_doc_metadata(pdf)
            )
```
**DOCX Parser**:
```python
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph


class DOCXParser:
    def parse(self, docx_path: str) -> ParsedDocument:
        """
        Extract paragraphs, tables, images with structure
        """
        doc = Document(docx_path)
        elements = []
        # iter_block_items is assumed to be a project helper that yields
        # paragraphs and tables in document order
        for elem in iter_block_items(doc):
            if isinstance(elem, Paragraph):
                elements.append(TextElement(
                    text=elem.text,
                    style=elem.style.name,
                    formatting=self.extract_formatting(elem)
                ))
            elif isinstance(elem, Table):
                elements.append(TableElement(
                    data=self.parse_table(elem),
                    style=elem.style.name
                ))
        return ParsedDocument(
            source=docx_path,
            elements=elements,
            images=self.extract_images(doc),
            metadata=self.extract_metadata(doc)
        )
```
#### 3.3 Table Extraction Agent
**Purpose**: Identify and structure table data
**Input**: Parsed document
**Output**: Structured table data
**Implementation**:
```python
from typing import List


class TableExtractionAgent:
    def extract(self, parsed_doc: ParsedDocument) -> List[StructuredTable]:
        """
        Convert raw tables to structured format
        """
        tables = []
        for page in parsed_doc.pages:
            for raw_table in page.tables:
                structured = self.structure_table(raw_table)
                if self.is_valid_table(structured):
                    tables.append(StructuredTable(
                        data=structured,
                        context=self.extract_context(page, raw_table),
                        type=self.classify_table(structured),
                        source_page=page.number
                    ))
        return tables

    def classify_table(self, table: List[List[str]]) -> TableType:
        """
        Identify table purpose (pricing, itinerary, specs, etc.)
        """
        # The first row is treated as the header row
        headers = table[0] if table else []
        if self.has_price_columns(headers):
            return TableType.PRICING
        elif self.has_time_columns(headers):
            return TableType.ITINERARY
        elif self.has_spec_columns(headers):
            return TableType.SPECIFICATIONS
        else:
            return TableType.GENERAL
```
**Table Types**:
- Pricing tables (product/service pricing)
- Itinerary tables (schedules, timelines)
- Specification tables (product specs)
- Inventory tables (stock levels)
- General tables (miscellaneous data)
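The header checks used by `classify_table` are not shown in this document. A minimal sketch of how they could work is keyword matching on header tokens, as below. The hint sets and the function name `classify_headers` are assumptions chosen for illustration, and an inventory branch is included because the list above names that type.

```python
import re
from typing import Iterable, Optional, Set

# Hypothetical keyword hints; a real deployment would tune these per domain.
PRICE_HINTS = {"price", "cost", "rate", "amount", "mrp"}
TIME_HINTS = {"day", "time", "date", "schedule"}
SPEC_HINTS = {"spec", "specification", "dimension", "weight", "material"}
STOCK_HINTS = {"stock", "quantity", "qty", "inventory"}


def _header_tokens(headers: Iterable[Optional[str]]) -> Set[str]:
    """Lowercase word tokens from all header cells; empty (None) cells are skipped."""
    tokens: Set[str] = set()
    for cell in headers:
        tokens.update(re.findall(r"[a-z]+", (cell or "").lower()))
    return tokens


def classify_headers(headers: Iterable[Optional[str]]) -> str:
    """Return a table type name; the first matching category wins."""
    tokens = _header_tokens(headers)
    if tokens & PRICE_HINTS:
        return "pricing"
    if tokens & TIME_HINTS:
        return "itinerary"
    if tokens & SPEC_HINTS:
        return "specifications"
    if tokens & STOCK_HINTS:
        return "inventory"
    return "general"
```

The check order mirrors `classify_table`, so a table with both price and quantity columns is treated as a pricing table.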
#### 3.4 Media Extraction Agent
**Purpose**: Extract and organize media files
**Input**: Parsed documents + standalone media files
**Output**: Organized media collection
**Implementation**:
```python
from typing import List


class MediaExtractionAgent:
    def extract_all(
        self,
        parsed_docs: List[ParsedDocument],
        media_files: List[str]
    ) -> MediaCollection:
        """
        Extract embedded + standalone media
        """
        embedded_images = []
        for doc in parsed_docs:
            embedded_images.extend(self.extract_embedded(doc))
        standalone_media = self.process_standalone(media_files)
        return MediaCollection(
            images=embedded_images + standalone_media.images,
            videos=standalone_media.videos,
            metadata=self.generate_metadata_all()
        )

    def extract_embedded(self, doc: ParsedDocument) -> List[Image]:
        """
        Extract images from PDFs and DOCX
        """
        # Compare case-insensitively so "MENU.PDF" is handled like "menu.pdf"
        source = doc.source.lower()
        if source.endswith('.pdf'):
            return self.extract_from_pdf(doc)
        elif source.endswith('.docx'):
            return self.extract_from_docx(doc)
        return []
```
#### 3.5 Vision Agent
**Purpose**: Analyze images using vision-language models
**Input**: Image files
**Output**: Descriptive metadata
**Implementation**:
```python
from ollama import AsyncClient


class VisionAgent:
    def __init__(self):
        # AsyncClient keeps analyze() from blocking the event loop
        self.ollama_client = AsyncClient(host='http://localhost:11434')
        self.model = "qwen3.5:0.8b"

    async def analyze(self, image: Image) -> ImageMetadata:
        """
        Generate descriptive metadata using Qwen3.5:0.8B vision (via Ollama)
        """
        # Call Qwen via Ollama with image
        response = await self.ollama_client.chat(
            model=self.model,
            messages=[{
                "role": "user",
                "content": self.get_vision_prompt(),
                "images": [image.path]
            }]
        )
        return ImageMetadata(
            description=response['message']['content'],
            suggested_category=self.extract_category(response),
            tags=self.extract_tags(response),
            is_product_image=self.is_product(response),
            confidence=0.85  # fixed placeholder; the model does not report a score
        )

    def get_vision_prompt(self) -> str:
        return """
        Analyze this image and provide:
        1. A brief description (2-3 sentences)
        2. Category (product, service, food, destination, other)
        3. Relevant tags (comma-separated)
        4. Is this a product image? (yes/no)
        Format your response as JSON.
        """
```
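Because the prompt asks for JSON but small models often wrap it in extra prose, the helpers `extract_category`, `extract_tags`, and `is_product` need tolerant parsing. The sketch below shows one approach. The function name `parse_vision_reply` and the JSON key names (`description`, `category`, `tags`, `is_product`) are assumptions, since the prompt does not fix a key schema.

```python
import json
import re
from typing import Any, Dict


def parse_vision_reply(reply: str) -> Dict[str, Any]:
    """Pull the first JSON object out of a model reply; fall back to defaults."""
    defaults = {"description": "", "category": "other", "tags": [], "is_product": False}
    # Grab everything from the first "{" to the last "}" so surrounding
    # prose or code fences are ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return defaults
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return defaults
    if not isinstance(data, dict):
        return defaults

    tags = data.get("tags", [])
    if isinstance(tags, str):
        # The prompt asks for comma-separated tags, so accept a plain string too.
        tags = [t.strip() for t in tags.split(",") if t.strip()]

    return {
        "description": str(data.get("description", "")),
        "category": str(data.get("category", "other")).lower(),
        "tags": tags,
        "is_product": str(data.get("is_product", "")).strip().lower() in {"yes", "true"},
    }
```

An unparseable reply yields empty defaults rather than an exception, which keeps a single bad image from failing the whole vision phase.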
#### 3.6 Schema Mapping Agent
**Purpose**: Map extracted data to business profile schema
**Input**: Page index, parsed data, media metadata
**Output**: Structured business profile
**Implementation**:
```python
import json
import os
from typing import List

from openai import AsyncOpenAI


class SchemaMappingAgent:
    def __init__(self):
        # Groq exposes an OpenAI-compatible endpoint; the async client
        # lets the coroutines below await the API calls
        self.client = AsyncOpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.getenv("GROQ_API_KEY")
        )
        self.model = "gpt-oss-120b"

    async def map_to_schema(
        self,
        page_index: PageIndex,
        image_metadata: List[ImageMetadata]
    ) -> BusinessProfile:
        """
        Use Groq (gpt-oss-120b) to intelligently map data to schema fields
        """
        # Step 1: Classify business type
        business_type = await self.classify_business_type(page_index)

        # Step 2: Extract business info
        business_info = await self.extract_business_info(page_index)

        # Step 3: Extract products or services
        if business_type in [BusinessType.PRODUCT, BusinessType.MIXED]:
            products = await self.extract_products(page_index, image_metadata)
        else:
            products = None
        if business_type in [BusinessType.SERVICE, BusinessType.MIXED]:
            services = await self.extract_services(page_index, image_metadata)
        else:
            services = None

        return BusinessProfile(
            business_info=business_info,
            products=products,
            services=services,
            business_type=business_type,
            extraction_metadata=self.generate_metadata()
        )

    async def extract_business_info(self, page_index: PageIndex) -> BusinessInfo:
        """
        Extract core business information using Groq
        """
        context = page_index.get_relevant_context([
            "business name",
            "description",
            "hours",
            "location",
            "contact"
        ])
        prompt = self.build_extraction_prompt(context, "business_info")
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=2000
        )
        # The prompt is expected to request a JSON object as the reply
        extracted_data = json.loads(response.choices[0].message.content)
        # Missing keys fall back to empty values instead of raising
        return BusinessInfo(
            description=extracted_data.get("description", ""),
            working_hours=extracted_data.get("working_hours", ""),
            location=extracted_data.get("location", {}),
            contact=extracted_data.get("contact", {}),
            payment_methods=extracted_data.get("payment_methods", []),
            tags=extracted_data.get("tags", [])
        )
```
### 4. Indexing & RAG Layer
#### Page Index (Vectorless RAG)
**Purpose**: Enable efficient context retrieval without embeddings
**Architecture**:
```python
from typing import Dict, List


class PageIndex:
    """
    Vectorless retrieval using inverted index on pages
    """

    def __init__(self):
        self.documents: Dict[str, ParsedDocument] = {}
        self.page_index: Dict[str, List[PageReference]] = {}
        self.table_index: Dict[str, List[TableReference]] = {}
        self.media_index: Dict[str, List[MediaReference]] = {}

    def build_index(self, parsed_docs: List[ParsedDocument]) -> None:
        """
        Create inverted index for fast lookup
        """
        for doc in parsed_docs:
            self.documents[doc.id] = doc
            for page in doc.pages:
                # Index by keywords, lowercased so lookups are case-insensitive
                keywords = self.extract_keywords(page.text)
                for keyword in keywords:
                    self.page_index.setdefault(keyword.lower(), []).append(PageReference(
                        doc_id=doc.id,
                        page_number=page.number,
                        context=self.extract_snippet(page.text, keyword)
                    ))

    def get_relevant_context(self, query_terms: List[str]) -> str:
        """
        Retrieve relevant pages/context for given terms
        """
        # PageReference must be hashable (e.g. a frozen dataclass) to be stored in a set
        relevant_pages = set()
        for term in query_terms:
            # Try the whole term and each of its words, so multi-word terms
            # such as "business name" still match single-word index keys
            lowered = term.lower()
            for candidate in {lowered, *lowered.split()}:
                relevant_pages.update(self.page_index.get(candidate, []))
        # Rank by relevance
        ranked = self.rank_pages(relevant_pages, query_terms)
        # Build context from top pages
        return self.build_context(ranked[:5])
```
**Advantages**:
- No embedding generation overhead
- Fast exact keyword matching
- Easy to debug and understand
- Low memory footprint
- Deterministic results
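To make the vectorless lookup concrete, here is a self-contained toy version using plain dictionaries. It is a sketch, not the project's `PageIndex`: `tokenize`, `build_inverted_index`, and `lookup` are hypothetical names, and ranking by the number of matched query tokens is an assumption, since `rank_pages` is not specified in this document.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

PageKey = Tuple[str, int]  # (doc_id, page_number)


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages: Dict[PageKey, str]) -> Dict[str, Set[PageKey]]:
    """Map each token to the set of pages that contain it."""
    index: Dict[str, Set[PageKey]] = defaultdict(set)
    for key, text in pages.items():
        for token in tokenize(text):
            index[token].add(key)
    return index


def lookup(
    index: Dict[str, Set[PageKey]], query_terms: List[str], top_k: int = 5
) -> List[PageKey]:
    """Rank pages by how many distinct query tokens they contain."""
    hits: Dict[PageKey, int] = defaultdict(int)
    query_tokens = {token for term in query_terms for token in tokenize(term)}
    for token in query_tokens:
        for key in index.get(token, ()):
            hits[key] += 1
    # Sort by descending hit count, then by key so the order is deterministic.
    return sorted(hits, key=lambda k: (-hits[k], k))[:top_k]
```

The same query always returns the same pages in the same order, which is the determinism and debuggability that the advantages list refers to.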
### 5. Validation Layer
#### Schema Validator
**Purpose**: Ensure data quality and completeness
**Implementation**:
```python
class SchemaValidator:
    def validate(self, profile: BusinessProfile) -> ValidationResult:
        """
        Validate business profile against schema rules
        """
        errors = []
        warnings = []

        # Validate business info
        if not profile.business_info.description:
            warnings.append("Missing business description")
        if profile.business_info.contact:
            # A missing email is an empty field, not an error; only a
            # present but malformed address is reported
            email = profile.business_info.contact.email
            if email and not self.is_valid_email(email):
                errors.append("Invalid email format")

        # Validate products
        if profile.products:
            for i, product in enumerate(profile.products):
                product_errors = self.validate_product(product)
                if product_errors:
                    errors.extend([f"Product {i+1}: {e}" for e in product_errors])

        # Calculate completeness score
        completeness = self.calculate_completeness(profile)
        return ValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
            completeness_score=completeness,
            profile=profile
        )

    def calculate_completeness(self, profile: BusinessProfile) -> float:
        """
        Score based on populated vs empty fields
        """
        total_fields = self.count_schema_fields()
        if total_fields == 0:
            return 0.0  # avoid division by zero on an empty schema
        populated_fields = self.count_populated_fields(profile)
        return populated_fields / total_fields
```
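The completeness score depends on what counts as "populated". A minimal sketch over a flat dictionary is shown below, assuming that `None`, blank strings, and empty containers are unpopulated. The names `is_populated` and `completeness` are hypothetical, and a real implementation would also need to walk nested models.

```python
from typing import Any, Mapping


def is_populated(value: Any) -> bool:
    """Treat None, blank strings, and empty containers as unpopulated."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True  # numbers and booleans count, including 0 and False


def completeness(record: Mapping[str, Any]) -> float:
    """Fraction of top-level fields that carry a value, in the range 0.0 to 1.0."""
    if not record:
        return 0.0
    populated = sum(1 for value in record.values() if is_populated(value))
    return populated / len(record)
```

For a business info record with six fields of which three are filled in, this returns `0.5`.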
## Data Flow
### End-to-End Processing Flow
```
User uploads ZIP
    ↓
FileDiscoveryAgent extracts and classifies files
    ↓
DocumentParsingAgent parses each document (parallel)
    ↓
TableExtractionAgent extracts tables from parsed docs
    ↓
MediaExtractionAgent extracts embedded + standalone media
    ↓
VisionAgent analyzes images (parallel)
    ↓
IndexingAgent builds page index
    ↓
SchemaMappingAgent uses Groq + page index to map fields
    ↓
ValidationAgent validates and scores profile
    ↓
BusinessProfile saved as JSON
    ↓
UI renders profile dynamically
```
## Technology Stack
### Backend
- **Language**: Python 3.10+
- **Async Framework**: asyncio
- **Document Parsing**: pdfplumber, python-docx, openpyxl
- **Image Processing**: Pillow, pdf2image
- **LLM Integration**: Groq API (gpt-oss-120b), Ollama (Qwen3.5:0.8B for vision)
- **Validation**: Pydantic
- **Testing**: pytest, pytest-asyncio
### Frontend
- **Framework**: React 18 with TypeScript
- **State Management**: Zustand
- **UI Components**: shadcn/ui
- **Forms**: React Hook Form + Zod
- **File Upload**: react-dropzone
- **Build Tool**: Vite
### Storage
- **Documents**: Filesystem with organized structure
- **Index**: SQLite or JSON-based lightweight store
- **Profiles**: JSON files with schema validation
## Deployment Architecture
### Development Environment
```
/project
├── backend/
│   ├── agents/
│   ├── parsers/
│   ├── indexing/
│   ├── validation/
│   └── main.py
├── frontend/
│   ├── src/
│   ├── components/
│   └── pages/
├── storage/
│   ├── uploads/
│   ├── extracted/
│   ├── profiles/
│   └── index/
└── tests/
```
### Production Considerations
- Docker containerization for consistent deployment
- Environment variable management for API keys
- Logging and monitoring integration
- Error tracking (Sentry)
- Performance monitoring
## Security Considerations
1. **File Upload Security**
   - Virus scanning on uploaded ZIPs
   - Size limits (500MB max)
   - Type validation
   - Sandboxed extraction
2. **API Key Management**
   - Environment variables only
   - Never commit keys
   - Rotate periodically
3. **Data Privacy**
   - No data sent to third parties except Groq API
   - Vision processing is fully local (Ollama)
   - User data isolated by session
   - Option to delete processed files
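The size limit and sandboxed extraction items can be enforced before any file is written. The sketch below uses only the standard `zipfile` module. `safe_extract` and `UnsafeArchiveError` are hypothetical names, and applying the 500MB limit to the total uncompressed size (as a guard against decompression bombs) is an assumption; the limit above may refer to the upload size instead.

```python
import zipfile
from pathlib import Path
from typing import List

MAX_TOTAL_BYTES = 500 * 1024 * 1024  # assumed to mirror the 500MB limit


class UnsafeArchiveError(ValueError):
    """Raised when an archive is too large or tries to escape its sandbox."""


def safe_extract(
    zip_path: str, dest_dir: str, max_total_bytes: int = MAX_TOTAL_BYTES
) -> List[str]:
    """Validate every entry first, then extract; return extracted file names."""
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(zip_path) as archive:
        infos = archive.infolist()

        # Check declared uncompressed sizes before writing anything.
        total = sum(info.file_size for info in infos)
        if total > max_total_bytes:
            raise UnsafeArchiveError(f"archive expands to {total} bytes")

        # Reject entries whose resolved path falls outside the destination
        # (for example "../evil.txt" or an absolute path).
        for info in infos:
            target = (dest / info.filename).resolve()
            if target != dest and dest not in target.parents:
                raise UnsafeArchiveError(f"entry escapes destination: {info.filename}")

        archive.extractall(dest)
    return [info.filename for info in infos if not info.is_dir()]
```

Validating all entries before calling `extractall` means a malicious archive is rejected as a whole, leaving no partially extracted files to clean up.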
## Performance Optimization
1. **Parallel Processing**
   - Parse documents concurrently
   - Process images in parallel
   - Async LLM calls
2. **Caching**
   - Cache parsed documents
   - Reuse vision analysis results
   - Index caching
3. **Resource Management**
   - Stream large files
   - Cleanup temporary files
   - Memory limits for document processing
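Caching parsed documents and vision results works best when the cache key is derived from file content rather than file name, so that re-uploads and duplicates are recognized. The sketch below is one possible shape, with hypothetical names (`file_digest`, `ContentCache`) and an in-memory store standing in for whatever persistence the project uses.

```python
import hashlib
from typing import Callable, Dict, Generic, TypeVar

R = TypeVar("R")


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files are streamed."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ContentCache(Generic[R]):
    """In-memory cache keyed by file content instead of file name."""

    def __init__(self) -> None:
        self._store: Dict[str, R] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, path: str, compute: Callable[[str], R]) -> R:
        key = file_digest(path)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(path)  # e.g. a parse or vision-analysis call
        self._store[key] = result
        return result
```

Two files with identical bytes share one cache entry, so the expensive parse or vision call runs only once for both.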
## Monitoring & Observability
### Metrics to Track
- Processing time per phase
- Success/failure rates
- LLM token usage
- Extraction accuracy (sampled)
- User satisfaction scores
### Logging Strategy
- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Contextual information (job_id, file_name)
- Performance timings
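Structured JSON logging with contextual fields can be built on the standard `logging` module alone. The following is a minimal sketch: `JsonFormatter` and `get_logger` are hypothetical names, and the `phase` and `duration_ms` keys are assumed additions alongside the `job_id` and `file_name` fields named above. Note that Python reports the WARN level as `WARNING`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Context keys copied from the record when passed via `extra=`.
    CONTEXT_KEYS = ("job_id", "file_name", "phase", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_logger(name: str = "digitization") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (configured once)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
        logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

A call such as `get_logger().info("parsed document", extra={"job_id": "job-1", "file_name": "menu.pdf", "duration_ms": 412})` then emits a single JSON line carrying the message, level, and job context.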
## Conclusion
This architecture provides a robust, scalable foundation for the agentic business digitization system. The multi-agent approach allows for:
- Independent development and testing of each component
- Graceful handling of failures
- Easy extension with new capabilities
- Clear data provenance and debugging
The vectorless RAG approach keeps the system lightweight while the LLM integration provides intelligent field mapping and classification.