System Architecture: Agentic Business Digitization Framework

Architecture Overview

System Philosophy

The architecture follows a multi-agent microservices pattern where specialized agents collaborate to transform unstructured documents into structured business profiles. Each agent has a single responsibility and communicates through well-defined interfaces.

Core Principles

  1. Separation of Concerns: Each agent handles one aspect of processing
  2. Fail Gracefully: Missing information results in empty fields, not errors
  3. Deterministic Parsing: Scripts handle extraction, LLMs handle intelligence
  4. Data Provenance: Track source of every extracted field
  5. Extensibility: Easy to add new document types or agents
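
Principles 2 and 4 fall straight out of the data model if every field defaults to an empty value and provenance rides alongside each extracted value. A minimal Pydantic sketch (field names are illustrative, not the project's actual schema):

from typing import List, Optional
from pydantic import BaseModel, Field

class ExtractedField(BaseModel):
    """A value plus provenance for where it came from."""
    value: str = ""                    # empty, never missing (Principle 2)
    source_file: Optional[str] = None  # provenance (Principle 4)
    source_page: Optional[int] = None

class BusinessInfoSketch(BaseModel):
    description: ExtractedField = Field(default_factory=ExtractedField)
    working_hours: ExtractedField = Field(default_factory=ExtractedField)
    tags: List[str] = Field(default_factory=list)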

High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     User Interface Layer                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ ZIP Upload   β”‚  β”‚ Profile View β”‚  β”‚Edit Interfaceβ”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Orchestration Layer                         β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚         β”‚  BusinessDigitizationPipeline  β”‚                  β”‚
β”‚         β”‚  - Workflow Coordination       β”‚                  β”‚
β”‚         β”‚  - Error Handling              β”‚                  β”‚
β”‚         β”‚  - Progress Tracking           β”‚                  β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β–Ό                   β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚File Discoveryβ”‚   β”‚Document Parseβ”‚   β”‚Media Extract β”‚
β”‚    Agent     β”‚   β”‚    Agent     β”‚   β”‚    Agent     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                   β”‚                   β”‚
        β–Ό                   β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Table Extract β”‚   β”‚Vision/Image  β”‚   β”‚Schema Mappingβ”‚
β”‚    Agent     β”‚   β”‚    Agent     β”‚   β”‚    Agent     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                   β”‚                   β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Indexing & RAG Layer                      β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚         β”‚     Page Index (Vectorless)    β”‚                  β”‚
β”‚         β”‚  - Document-level indexing     β”‚                  β”‚
β”‚         β”‚  - Page-level context          β”‚                  β”‚
β”‚         β”‚  - Metadata storage            β”‚                  β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Validation Layer                           β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚         β”‚      Schema Validator          β”‚                  β”‚
β”‚         β”‚  - Field validation            β”‚                  β”‚
β”‚         β”‚  - Completeness scoring        β”‚                  β”‚
β”‚         β”‚  - Data quality checks         β”‚                  β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Data Layer                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ File Storage β”‚  β”‚ Index Store  β”‚  β”‚ Profile Storeβ”‚      β”‚
β”‚  β”‚ (Filesystem) β”‚  β”‚ (SQLite/JSON)β”‚  β”‚    (JSON)    β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Architecture

1. User Interface Layer

1.1 Upload Component

Purpose: Accept ZIP files from users

Technology: React with react-dropzone

Responsibilities:

  • Drag-and-drop file upload
  • ZIP validation (size, format)
  • Upload progress tracking
  • Error messaging

Interface:

interface UploadComponentProps {
  onUploadComplete: (jobId: string) => void;
  maxFileSize: number; // in MB
  acceptedFormats: string[];
}

1.2 Profile Viewer

Purpose: Display generated business profiles

Technology: React with dynamic rendering

Responsibilities:

  • Conditional rendering based on business type
  • Product inventory display
  • Service inventory display
  • Media gallery
  • Metadata presentation

Interface:

interface BusinessProfile {
  businessInfo: BusinessInfo;
  products?: Product[];
  services?: Service[];
  media: MediaFile[];
  metadata: ProfileMetadata;
}

1.3 Edit Interface

Purpose: Allow post-digitization editing

Technology: React Hook Form with Zod validation

Responsibilities:

  • Form-based editing
  • Field validation
  • Media upload/removal
  • Save/discard changes
  • Version history

2. Orchestration Layer

BusinessDigitizationPipeline

Purpose: Coordinate multi-agent workflow

Technology: Python async/await with concurrent processing

Core Workflow:

import asyncio


class BusinessDigitizationPipeline:
    def __init__(self):
        self.file_discovery = FileDiscoveryAgent()
        self.parsing = DocumentParsingAgent()
        self.table_extraction = TableExtractionAgent()
        self.media_extraction = MediaExtractionAgent()
        self.vision = VisionAgent()
        self.indexing = IndexingAgent()
        self.schema_mapping = SchemaMappingAgent()
        self.validation = ValidationAgent()
    
    async def process(self, zip_path: str) -> BusinessProfile:
        try:
            # Phase 1: Discover files
            files = await self.file_discovery.discover(zip_path)
            
            # Phase 2: Parse documents (parallel)
            parsed_docs = await asyncio.gather(*[
                self.parsing.parse(f) for f in files.documents
            ])
            
            # Phase 3: Extract tables (parallel)
            tables = await asyncio.gather(*[
                self.table_extraction.extract(doc) for doc in parsed_docs
            ])
            
            # Phase 4: Extract media
            media = await self.media_extraction.extract_all(
                parsed_docs, files.media_files
            )
            
            # Phase 5: Vision processing for images
            image_metadata = await asyncio.gather(*[
                self.vision.analyze(img) for img in media.images
            ])
            
            # Phase 6: Build page index
            page_index = await self.indexing.build_index(
                parsed_docs, tables, media
            )
            
            # Phase 7: LLM-assisted schema mapping
            profile = await self.schema_mapping.map_to_schema(
                page_index, image_metadata
            )
            
            # Phase 8: Validation
            validated_profile = await self.validation.validate(profile)
            
            return validated_profile
            
        except Exception as e:
            self.handle_error(e)
            raise

Error Handling Strategy:

  • Graceful degradation per agent
  • Detailed error logging
  • Partial results on failure
  • User-friendly error messages

3. Agent Layer

3.1 File Discovery Agent

Purpose: Extract and classify files from ZIP

Input: ZIP file path

Output: Classified file collection

Implementation:

import mimetypes


class FileDiscoveryAgent:
    async def discover(self, zip_path: str) -> FileCollection:
        """
        Extract ZIP and classify files by type
        """
        extracted_files = self.extract_zip(zip_path)
        
        return FileCollection(
            documents=self.classify_documents(extracted_files),
            media_files=self.classify_media(extracted_files),
            spreadsheets=self.classify_spreadsheets(extracted_files),
            directory_structure=self.map_structure(extracted_files)
        )
    
    def classify_file(self, file_path: str) -> FileType:
        """
        Determine file type using mimetypes and extension
        """
        mime_type, _ = mimetypes.guess_type(file_path)
        return self.mime_to_file_type(mime_type)

Supported File Types:

  • Documents: PDF, DOC, DOCX
  • Spreadsheets: XLS, XLSX, CSV
  • Images: JPG, PNG, GIF, WEBP
  • Videos: MP4, AVI, MOV

3.2 Document Parsing Agent

Purpose: Extract text and structure from documents

Input: Document file path

Output: Parsed document with metadata

Implementation:

class DocumentParsingAgent:
    def __init__(self):
        self.parsers = {
            FileType.PDF: PDFParser(),
            FileType.DOCX: DOCXParser(),
            FileType.DOC: DOCParser()
        }
    
    async def parse(self, file_path: str) -> ParsedDocument:
        """
        Factory pattern to select appropriate parser
        """
        file_type = self.detect_type(file_path)
        parser = self.parsers.get(file_type)
        
        if not parser:
            raise UnsupportedFileTypeError(file_type)
        
        return parser.parse(file_path)

PDF Parser:

import pdfplumber


class PDFParser:
    def parse(self, pdf_path: str) -> ParsedDocument:
        """
        Extract text, preserve structure, identify sections
        """
        with pdfplumber.open(pdf_path) as pdf:
            pages = []
            for i, page in enumerate(pdf.pages):
                pages.append(Page(
                    number=i + 1,
                    text=page.extract_text(),
                    tables=page.extract_tables(),
                    images=self.extract_images(page),
                    metadata=self.extract_page_metadata(page)
                ))
            
            return ParsedDocument(
                source=pdf_path,
                pages=pages,
                total_pages=len(pages),
                metadata=self.extract_doc_metadata(pdf)
            )

DOCX Parser:

from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph


class DOCXParser:
    def parse(self, docx_path: str) -> ParsedDocument:
        """
        Extract paragraphs, tables, images with structure
        """
        doc = Document(docx_path)
        
        elements = []
        for elem in iter_block_items(doc):  # well-known python-docx recipe, not a built-in
            if isinstance(elem, Paragraph):
                elements.append(TextElement(
                    text=elem.text,
                    style=elem.style.name,
                    formatting=self.extract_formatting(elem)
                ))
            elif isinstance(elem, Table):
                elements.append(TableElement(
                    data=self.parse_table(elem),
                    style=elem.style.name
                ))
        
        return ParsedDocument(
            source=docx_path,
            elements=elements,
            images=self.extract_images(doc),
            metadata=self.extract_metadata(doc)
        )

3.3 Table Extraction Agent

Purpose: Identify and structure table data

Input: Parsed document

Output: Structured table data

Implementation:

from typing import List


class TableExtractionAgent:
    async def extract(self, parsed_doc: ParsedDocument) -> List[StructuredTable]:
        """
        Convert raw tables to structured format
        """
        tables = []
        for page in parsed_doc.pages:
            for raw_table in page.tables:
                structured = self.structure_table(raw_table)
                if self.is_valid_table(structured):
                    tables.append(StructuredTable(
                        data=structured,
                        context=self.extract_context(page, raw_table),
                        type=self.classify_table(structured),
                        source_page=page.number
                    ))
        return tables
    
    def classify_table(self, table: List[List[str]]) -> TableType:
        """
        Identify table purpose (pricing, itinerary, specs, etc.)
        """
        headers = table[0] if table else []
        
        if self.has_price_columns(headers):
            return TableType.PRICING
        elif self.has_time_columns(headers):
            return TableType.ITINERARY
        elif self.has_spec_columns(headers):
            return TableType.SPECIFICATIONS
        else:
            return TableType.GENERAL

Table Types:

  • Pricing tables (product/service pricing)
  • Itinerary tables (schedules, timelines)
  • Specification tables (product specs)
  • Inventory tables (stock levels)
  • General tables (miscellaneous data)

3.4 Media Extraction Agent

Purpose: Extract and organize media files

Input: Parsed documents + standalone media files

Output: Organized media collection

Implementation:

from typing import List


class MediaExtractionAgent:
    async def extract_all(
        self, 
        parsed_docs: List[ParsedDocument],
        media_files: List[str]
    ) -> MediaCollection:
        """
        Extract embedded + standalone media
        """
        embedded_images = []
        for doc in parsed_docs:
            embedded_images.extend(self.extract_embedded(doc))
        
        standalone_media = self.process_standalone(media_files)
        
        return MediaCollection(
            images=embedded_images + standalone_media.images,
            videos=standalone_media.videos,
            metadata=self.generate_metadata_all()
        )
    
    def extract_embedded(self, doc: ParsedDocument) -> List[Image]:
        """
        Extract images from PDFs and DOCX
        """
        if doc.source.endswith('.pdf'):
            return self.extract_from_pdf(doc)
        elif doc.source.endswith('.docx'):
            return self.extract_from_docx(doc)
        return []

3.5 Vision Agent

Purpose: Analyze images using vision-language models

Input: Image files

Output: Descriptive metadata

Implementation:

class VisionAgent:
    def __init__(self):
        from ollama import Client
        self.ollama_client = Client(host='http://localhost:11434')
        self.model = "qwen3.5:0.8b"

    async def analyze(self, image: Image) -> ImageMetadata:
        """
        Generate descriptive metadata using Qwen3.5:0.8B vision (via Ollama)
        """
        # Call Qwen via Ollama with image
        response = self.ollama_client.chat(
            model=self.model,
            messages=[{
                "role": "user",
                "content": self.get_vision_prompt(),
                "images": [image.path]
            }]
        )

        return ImageMetadata(
            description=response['message']['content'],
            suggested_category=self.extract_category(response),
            tags=self.extract_tags(response),
            is_product_image=self.is_product(response),
            confidence=0.85
        )

    def get_vision_prompt(self) -> str:
        return """
        Analyze this image and provide:
        1. A brief description (2-3 sentences)
        2. Category (product, service, food, destination, other)
        3. Relevant tags (comma-separated)
        4. Is this a product image? (yes/no)

        Format your response as JSON.
        """

3.6 Schema Mapping Agent

Purpose: Map extracted data to business profile schema

Input: Page index, parsed data, media metadata

Output: Structured business profile

Implementation:

import json
import os
from typing import List


class SchemaMappingAgent:
    def __init__(self):
        from openai import OpenAI
        # Groq API endpoint
        self.client = OpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.getenv("GROQ_API_KEY")
        )
        self.model = "gpt-oss-120b"

    async def map_to_schema(
        self,
        page_index: PageIndex,
        image_metadata: List[ImageMetadata]
    ) -> BusinessProfile:
        """
        Use Groq (gpt-oss-120b) to intelligently map data to schema fields
        """
        # Step 1: Classify business type
        business_type = await self.classify_business_type(page_index)

        # Step 2: Extract business info
        business_info = await self.extract_business_info(page_index)

        # Step 3: Extract products or services
        if business_type in [BusinessType.PRODUCT, BusinessType.MIXED]:
            products = await self.extract_products(page_index, image_metadata)
        else:
            products = None

        if business_type in [BusinessType.SERVICE, BusinessType.MIXED]:
            services = await self.extract_services(page_index, image_metadata)
        else:
            services = None

        return BusinessProfile(
            business_info=business_info,
            products=products,
            services=services,
            business_type=business_type,
            extraction_metadata=self.generate_metadata()
        )

    async def extract_business_info(self, page_index: PageIndex) -> BusinessInfo:
        """
        Extract core business information using Groq
        """
        context = page_index.get_relevant_context([
            "business name",
            "description",
            "hours",
            "location",
            "contact"
        ])

        prompt = self.build_extraction_prompt(context, "business_info")

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=2000
        )

        extracted_data = json.loads(response.choices[0].message.content)

        return BusinessInfo(
            description=extracted_data.get("description", ""),
            working_hours=extracted_data.get("working_hours", ""),
            location=extracted_data.get("location", {}),
            contact=extracted_data.get("contact", {}),
            payment_methods=extracted_data.get("payment_methods", []),
            tags=extracted_data.get("tags", [])
        )
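
The build_extraction_prompt helper is referenced but not shown. A plausible standalone sketch (in the class it would be a method taking self; the JSON shape string is illustrative):

def build_extraction_prompt(context: str, target: str) -> str:
    """Pair retrieved context with a JSON shape for one schema section."""
    shapes = {
        "business_info": (
            '{"description": str, "working_hours": str, "location": {}, '
            '"contact": {}, "payment_methods": [], "tags": []}'
        ),
    }
    return (
        f"Extract the '{target}' section of a business profile.\n"
        f"Return ONLY valid JSON with this shape: {shapes[target]}\n"
        "Use empty strings/lists/objects for anything not in the context.\n\n"
        f"Context:\n{context}"
    )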

4. Indexing & RAG Layer

Page Index (Vectorless RAG)

Purpose: Enable efficient context retrieval without embeddings

Architecture:

from typing import Dict, List


class PageIndex:
    """
    Vectorless retrieval using inverted index on pages
    """
    def __init__(self):
        self.documents: Dict[str, ParsedDocument] = {}
        self.page_index: Dict[str, List[PageReference]] = {}
        self.table_index: Dict[str, List[TableReference]] = {}
        self.media_index: Dict[str, List[MediaReference]] = {}
    
    def build_index(self, parsed_docs: List[ParsedDocument]) -> None:
        """
        Create inverted index for fast lookup
        """
        for doc in parsed_docs:
            self.documents[doc.id] = doc
            
            for page in doc.pages:
                # Index by keywords
                keywords = self.extract_keywords(page.text)
                for keyword in keywords:
                    keyword = keyword.lower()  # normalize so lookups in get_relevant_context match
                    if keyword not in self.page_index:
                        self.page_index[keyword] = []
                    
                    self.page_index[keyword].append(PageReference(
                        doc_id=doc.id,
                        page_number=page.number,
                        context=self.extract_snippet(page.text, keyword)
                    ))
    
    def get_relevant_context(self, query_terms: List[str]) -> str:
        """
        Retrieve relevant pages/context for given terms
        """
        relevant_pages = set()
        
        for term in query_terms:
            if term.lower() in self.page_index:
                relevant_pages.update(self.page_index[term.lower()])
        
        # Rank by relevance
        ranked = self.rank_pages(relevant_pages, query_terms)
        
        # Build context from top pages
        return self.build_context(ranked[:5])
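
extract_keywords and rank_pages are left abstract above. A simple realization is stop-word-filtered tokenization plus scoring each page by how many query terms hit its snippet (one option among many; a sketch):

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "with"}

def extract_keywords(text: str) -> set[str]:
    """Lowercased word tokens minus stop words and very short tokens."""
    tokens = re.findall(r"[a-z0-9]+", (text or "").lower())
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 2}

def rank_pages(pages, query_terms):
    """Order page references by how many query terms hit their snippet."""
    def score(ref) -> int:
        snippet = (ref.context or "").lower()
        return sum(term.lower() in snippet for term in query_terms)
    return sorted(pages, key=score, reverse=True)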

Advantages:

  • No embedding generation overhead
  • Fast exact keyword matching
  • Easy to debug and understand
  • Low memory footprint
  • Deterministic results

5. Validation Layer

Schema Validator

Purpose: Ensure data quality and completeness

Implementation:

class SchemaValidator:
    def validate(self, profile: BusinessProfile) -> ValidationResult:
        """
        Validate business profile against schema rules
        """
        errors = []
        warnings = []
        
        # Validate business info
        if not profile.business_info.description:
            warnings.append("Missing business description")
        
        if profile.business_info.contact:
            if not self.is_valid_email(profile.business_info.contact.email):
                errors.append("Invalid email format")
        
        # Validate products
        if profile.products:
            for i, product in enumerate(profile.products):
                product_errors = self.validate_product(product)
                if product_errors:
                    errors.extend([f"Product {i+1}: {e}" for e in product_errors])
        
        # Calculate completeness score
        completeness = self.calculate_completeness(profile)
        
        return ValidationResult(
            is_valid=len(errors) == 0,
            errors=errors,
            warnings=warnings,
            completeness_score=completeness,
            profile=profile
        )
    
    def calculate_completeness(self, profile: BusinessProfile) -> float:
        """
        Score based on populated vs empty fields
        """
        total_fields = self.count_schema_fields()
        populated_fields = self.count_populated_fields(profile)
        
        return populated_fields / total_fields
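
Counting populated vs. total fields can be done generically by walking the profile's dict representation, which Pydantic (already in the stack) provides. A sketch assuming Pydantic v2:

def count_populated(value) -> tuple[int, int]:
    """Return (populated, total) over the leaf fields of a nested structure."""
    if isinstance(value, dict):
        pairs = [count_populated(v) for v in value.values()]
        return sum(p for p, _ in pairs), sum(t for _, t in pairs)
    if isinstance(value, list):
        return (1, 1) if value else (0, 1)
    return (1, 1) if value not in (None, "") else (0, 1)

# With a Pydantic v2 profile model:
#   populated, total = count_populated(profile.model_dump())
#   completeness = populated / total if total else 0.0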

Data Flow

End-to-End Processing Flow

User uploads ZIP
    ↓
FileDiscoveryAgent extracts and classifies files
    ↓
DocumentParsingAgent parses each document (parallel)
    ↓
TableExtractionAgent extracts tables from parsed docs
    ↓
MediaExtractionAgent extracts embedded + standalone media
    ↓
VisionAgent analyzes images (parallel)
    ↓
IndexingAgent builds page index
    ↓
SchemaMappingAgent uses Groq + page index to map fields
    ↓
ValidationAgent validates and scores profile
    ↓
BusinessProfile saved as JSON
    ↓
UI renders profile dynamically
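
A minimal driver tying the flow together (paths and the model_dump call are illustrative, assuming Pydantic v2 models):

import asyncio
import json

async def main() -> None:
    pipeline = BusinessDigitizationPipeline()
    profile = await pipeline.process("storage/uploads/business_docs.zip")

    # Persist the validated profile as JSON for the UI to render
    with open("storage/profiles/profile.json", "w") as f:
        json.dump(profile.model_dump(), f, indent=2, default=str)

if __name__ == "__main__":
    asyncio.run(main())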

Technology Stack

Backend

  • Language: Python 3.10+
  • Async Framework: asyncio
  • Document Parsing: pdfplumber, python-docx, openpyxl
  • Image Processing: Pillow, pdf2image
  • LLM Integration: Groq API (gpt-oss-120b), Ollama (Qwen3.5:0.8B for vision)
  • Validation: Pydantic
  • Testing: pytest, pytest-asyncio

Frontend

  • Framework: React 18 with TypeScript
  • State Management: Zustand
  • UI Components: shadcn/ui
  • Forms: React Hook Form + Zod
  • File Upload: react-dropzone
  • Build Tool: Vite

Storage

  • Documents: Filesystem with organized structure
  • Index: SQLite or JSON-based lightweight store
  • Profiles: JSON files with schema validation

Deployment Architecture

Development Environment

/project
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ agents/
β”‚   β”œβ”€β”€ parsers/
β”‚   β”œβ”€β”€ indexing/
β”‚   β”œβ”€β”€ validation/
β”‚   └── main.py
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ components/
β”‚   └── pages/
β”œβ”€β”€ storage/
β”‚   β”œβ”€β”€ uploads/
β”‚   β”œβ”€β”€ extracted/
β”‚   β”œβ”€β”€ profiles/
β”‚   └── index/
└── tests/

Production Considerations

  • Docker containerization for consistent deployment
  • Environment variable management for API keys
  • Logging and monitoring integration
  • Error tracking (Sentry)
  • Performance monitoring

Security Considerations

  1. File Upload Security

    • Virus scanning on uploaded ZIPs
    • Size limits (500MB max)
    • Type validation
    • Sandboxed extraction
  2. API Key Management

    • Environment variables only
    • Never commit keys
    • Rotate periodically
  3. Data Privacy

    • No data sent to third parties except Groq API
    • Vision processing is fully local (Ollama)
    • User data isolated by session
    • Option to delete processed files

Performance Optimization

  1. Parallel Processing

    • Parse documents concurrently
    • Process images in parallel
    • Async LLM calls
  2. Caching

    • Cache parsed documents
    • Reuse vision analysis results
    • Index caching
  3. Resource Management

    • Stream large files
    • Cleanup temporary files
    • Memory limits for document processing

Monitoring & Observability

Metrics to Track

  • Processing time per phase
  • Success/failure rates
  • LLM token usage
  • Extraction accuracy (sampled)
  • User satisfaction scores

Logging Strategy

  • Structured JSON logging
  • Log levels: DEBUG, INFO, WARN, ERROR
  • Contextual information (job_id, file_name)
  • Performance timings

Conclusion

This architecture provides a robust, scalable foundation for the agentic business digitization system. The multi-agent approach allows for:

  • Independent development and testing of each component
  • Graceful handling of failures
  • Easy extension with new capabilities
  • Clear data provenance and debugging

The vectorless RAG approach keeps the system lightweight while the LLM integration provides intelligent field mapping and classification.