| # Digi-Biz Documentation |
|
|
| ## Agentic Business Digitization Framework |
|
|
| **Version:** 1.0.0 |
| **Last Updated:** March 17, 2026 |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| 1. [Overview](#overview) |
| 2. [Architecture](#architecture) |
| 3. [Agents](#agents) |
| 4. [Installation](#installation) |
| 5. [Usage](#usage) |
| 6. [API Reference](#api-reference) |
| 7. [Troubleshooting](#troubleshooting) |
| 8. [Testing](#testing) |
| 9. [Project Structure](#project-structure) |
| 10. [Performance Benchmarks](#performance-benchmarks) |
|
|
| --- |
|
|
| ## Overview |
|
|
| **Digi-Biz** is an AI-powered agentic framework that automatically converts unstructured business documents into structured digital business profiles. |
|
|
| ### What It Does |
|
|
| - Accepts ZIP files containing mixed business documents (PDF, DOCX, Excel, images, videos) |
| - Intelligently extracts and structures information using multi-agent workflows |
| - Generates comprehensive digital business profiles with product/service inventories |
| - Provides dynamic UI for viewing and editing results |
|
|
| ### Key Features |
|
|
| - ✅ **Multi-Agent Pipeline** - 6 specialized agents working together |
| - ✅ **Vectorless RAG** - Fast document retrieval without embeddings |
| - ✅ **Groq Vision** - Image analysis with Llama-4-Scout (17B) |
| - ✅ **Production-Ready** - Error handling, validation, logging |
| - ✅ **Streamlit UI** - Interactive web interface |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ### High-Level Overview |
|
|
| ``` |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β User Interface (Streamlit) β |
| β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β |
| β β ZIP Upload β β Results View β β Vision Tab β β |
| β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β Agent Pipeline β |
| β 1. File Discovery β 2. Document Parsing β 3. Table Extract β |
| β 4. Media Extraction β 5. Vision (Groq) β 6. Indexing (RAG) β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β Data Layer β |
| β File Storage (FileSystem) β’ Index (In-Memory) β’ Profiles β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| ``` |
|
|
| ### Technology Stack |
|
|
| | Component | Technology | |
| |-----------|-----------| |
| | **Backend** | Python 3.10+ | |
| | **Document Parsing** | pdfplumber, python-docx, openpyxl | |
| | **Image Processing** | Pillow, pdf2image | |
| | **Vision AI** | Groq API (Llama-4-Scout-17B) | |
| | **LLM (Text)** | Groq API (gpt-oss-120b) | |
| | **Validation** | Pydantic | |
| | **Frontend** | Streamlit | |
| | **Storage** | Local Filesystem | |
|
|
| --- |
|
|
| ## Agents |
|
|
| ### 1. File Discovery Agent |
|
|
| **Purpose:** Extract ZIP files and classify all contained files |
|
|
| **Input:** |
| ```python |
| FileDiscoveryInput( |
| zip_file_path="/path/to/upload.zip", |
| job_id="job_123", |
| max_file_size=524288000, # 500MB |
| max_files=100 |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| FileDiscoveryOutput( |
| job_id="job_123", |
| success=True, |
| documents=[...], # PDFs, DOCX |
| spreadsheets=[...], # XLSX, CSV |
| images=[...], # JPG, PNG |
| videos=[...], # MP4, AVI |
| total_files=10, |
| extraction_dir="/storage/extracted/job_123" |
| ) |
| ``` |
|
|
| **Features:** |
| - ZIP bomb detection (1000:1 ratio limit) |
| - Path traversal prevention |
| - File type classification (3-strategy approach) |
| - Directory structure preservation |
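
The safety checks above can be sketched with the standard `zipfile` module. Names like `safe_extract` and the exact limits are illustrative, not the agent's actual implementation:

```python
import zipfile
from pathlib import Path

MAX_RATIO = 1000  # 1000:1 compression ratio limit (ZIP bomb guard)

def safe_extract(zip_path: str, dest: str) -> None:
    """Extract a ZIP only if every member passes bomb and traversal checks."""
    dest_dir = Path(dest).resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # ZIP bomb check: reject members with extreme compression ratios
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                raise ValueError(f"Suspicious compression ratio: {info.filename}")
            # Path traversal check: the resolved target must stay inside dest
            target = (dest_dir / info.filename).resolve()
            if not target.is_relative_to(dest_dir):
                raise ValueError(f"Path traversal attempt: {info.filename}")
        zf.extractall(dest_dir)
```

`Path.is_relative_to` requires Python 3.9+, which the 3.10+ prerequisite covers.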
|
|
| **File:** `backend/agents/file_discovery.py` |
|
|
| --- |
|
|
| ### 2. Document Parsing Agent |
|
|
| **Purpose:** Extract text and structure from PDF/DOCX files |
|
|
| **Input:** |
| ```python |
| DocumentParsingInput( |
| documents=[...], # From File Discovery |
| job_id="job_123", |
| enable_ocr=True |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| DocumentParsingOutput( |
| job_id="job_123", |
| success=True, |
| parsed_documents=[...], |
| total_pages=56, |
| processing_time=2.5 |
| ) |
| ``` |
|
|
| **Features:** |
| - PDF parsing (pdfplumber primary, PyPDF2 fallback, OCR final) |
| - DOCX parsing with structure preservation |
| - Table extraction |
| - Embedded image extraction |
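
The primary/fallback parsing chain (pdfplumber first, PyPDF2 next, OCR last) can be expressed as a generic helper. `first_successful` is a hypothetical name illustrating the pattern, not the agent's actual code:

```python
from typing import Callable, Iterable, Optional, Tuple

def first_successful(
    parsers: Iterable[Tuple[str, Callable[[str], str]]],
    path: str,
) -> Tuple[Optional[str], str]:
    """Try each (name, parser) in order; return the first non-empty result."""
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception:
            continue  # parser crashed: fall through to the next strategy
        if text and text.strip():
            return name, text
    return None, ""  # every strategy failed or returned empty text
```

In the agent, the chain would be something like `[("pdfplumber", ...), ("pypdf2", ...), ("ocr", ...)]`.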
|
|
| **File:** `backend/agents/document_parsing.py` |
|
|
| --- |
|
|
| ### 3. Table Extraction Agent |
|
|
| **Purpose:** Detect and classify tables from parsed documents |
|
|
| **Input:** |
| ```python |
| TableExtractionInput( |
| parsed_documents=[...], |
| job_id="job_123" |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| TableExtractionOutput( |
| job_id="job_123", |
| success=True, |
| tables=[...], |
| total_tables=42, |
| tables_by_type={ |
| "itinerary": 33, |
| "pricing": 6, |
| "general": 3 |
| } |
| ) |
| ``` |
|
|
| **Table Types:** |
| | Type | Detection Criteria | |
| |------|-------------------| |
| | **PRICING** | Headers: price/cost/rate; Currency: $, €, ₹ | |
| | **ITINERARY** | Headers: day/time/date; Patterns: "Day 1", "9:00 AM" | |
| | **SPECIFICATIONS** | Headers: spec/feature/dimension/weight | |
| | **MENU** | Headers: menu/dish/food/meal | |
| | **INVENTORY** | Headers: stock/quantity/available | |
| | **GENERAL** | Fallback | |
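
A minimal sketch of the header-keyword classification above. `classify_table` and the exact keyword sets are illustrative; the real agent also checks currency symbols and cell patterns such as "Day 1":

```python
# Checked in order; the first matching type wins, GENERAL is the fallback.
TABLE_TYPE_RULES = {
    "pricing": {"price", "cost", "rate"},
    "itinerary": {"day", "time", "date"},
    "specifications": {"spec", "feature", "dimension", "weight"},
    "menu": {"menu", "dish", "food", "meal"},
    "inventory": {"stock", "quantity", "available"},
}

def classify_table(headers: list[str]) -> str:
    """Classify a table by intersecting its lowercased headers with keyword sets."""
    lowered = {h.strip().lower() for h in headers}
    for table_type, keywords in TABLE_TYPE_RULES.items():
        if lowered & keywords:
            return table_type
    return "general"
```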
|
|
| **File:** `backend/agents/table_extraction.py` |
|
|
| --- |
|
|
| ### 4. Media Extraction Agent |
|
|
| **Purpose:** Extract embedded and standalone media |
|
|
| **Input:** |
| ```python |
| MediaExtractionInput( |
| parsed_documents=[...], |
| standalone_files=[...], |
| job_id="job_123" |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| MediaExtractionOutput( |
| job_id="job_123", |
| success=True, |
| media=MediaCollection( |
| images=[...], |
| total_count=15, |
| extraction_summary={...} |
| ), |
| duplicates_removed=3 |
| ) |
| ``` |
|
|
| **Features:** |
| - PDF embedded image extraction (xref method) |
| - DOCX embedded image extraction (ZIP method) |
| - Perceptual hashing for deduplication |
| - Quality assessment |
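
Perceptual deduplication can be sketched with a difference hash over grayscale values. Real pipelines typically use a library such as `imagehash` on resized thumbnails, so treat `dhash_bits` and the threshold below as illustrative:

```python
def dhash_bits(pixels: list[list[int]]) -> tuple:
    """Difference hash over a 2D grid of grayscale values: compare adjacent pixels."""
    return tuple(
        row[i] > row[i + 1]
        for row in pixels
        for i in range(len(row) - 1)
    )

def hamming(a: tuple, b: tuple) -> int:
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(images, threshold: int = 2) -> list[str]:
    """Keep only images whose hash is not within `threshold` bits of one already seen."""
    kept, hashes = [], []
    for name, pixels in images:
        h = dhash_bits(pixels)
        if any(hamming(h, seen) <= threshold for seen in hashes):
            continue  # near-duplicate: drop it
        hashes.append(h)
        kept.append(name)
    return kept
```

Because the hash captures brightness *gradients* rather than raw values, re-encoded or slightly brightened copies of the same image hash identically.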
|
|
| **File:** `backend/agents/media_extraction.py` |
|
|
| --- |
|
|
| ### 5. Vision Agent (Groq) |
|
|
| **Purpose:** Analyze images using Groq Vision API |
|
|
| **Input:** |
| ```python |
| VisionAnalysisInput( |
| image=ExtractedImage(...), |
| context="Restaurant menu with burgers", |
| job_id="job_123" |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| ImageAnalysis( |
| image_id="img_001", |
| description="A delicious burger with lettuce...", |
| category=ImageCategory.FOOD, |
| tags=["burger", "food", "restaurant"], |
| is_product=False, |
| is_service_related=True, |
| confidence=0.92, |
| metadata={ |
| 'provider': 'groq', |
| 'model': 'llama-4-scout-17b', |
| 'processing_time': 1.85 |
| } |
| ) |
| ``` |
|
|
| **Features:** |
| - Groq API integration (Llama-4-Scout-17B) |
| - Ollama fallback |
| - Context-aware prompts |
| - JSON response parsing |
| - Batch processing |
| - Automatic image resizing (<4MB) |
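
JSON response parsing has to tolerate prose and code fences around the model's answer. A minimal sketch (the function name is illustrative):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Extract and decode the first JSON object embedded in a model reply."""
    # Greedy match from the first '{' to the last '}' spans code fences and prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))
```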
|
|
| **File:** `backend/agents/vision_agent.py` |
|
|
| --- |
|
|
| ### 6. Indexing Agent (Vectorless RAG) |
|
|
| **Purpose:** Build inverted index for fast document retrieval |
|
|
| **Input:** |
| ```python |
| IndexingInput( |
| parsed_documents=[...], |
| tables=[...], |
| images=[...], |
| job_id="job_123" |
| ) |
| ``` |
|
|
| **Output:** |
| ```python |
| IndexingOutput( |
| job_id="job_123", |
| success=True, |
| page_index=PageIndex( |
| documents={...}, |
| page_index={ |
| "burger": [PageReference(...)], |
| "price": [PageReference(...)] |
| }, |
| table_index={...}, |
| media_index={...} |
| ), |
| total_keywords=1250 |
| ) |
| ``` |
|
|
| **Features:** |
| - Keyword extraction (tokenization, N-grams, entities) |
| - Inverted index creation |
| - Query expansion with synonyms |
| - Context-aware retrieval |
| - Relevance scoring |
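
The core of a vectorless index is an inverted map from keywords to pages, queried by counting term overlaps. A minimal sketch, without the agent's N-grams, synonyms, or entity extraction:

```python
import re
from collections import defaultdict

def build_inverted_index(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each lowercase token to the sorted list of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(page_id)
    return {token: sorted(ids) for token, ids in index.items()}

def retrieve(index: dict[str, list[str]], query: str, max_pages: int = 3) -> list[str]:
    """Score pages by the number of query tokens they contain; return the top hits."""
    scores = defaultdict(int)
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        for page_id in index.get(token, []):
            scores[page_id] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page_id for page_id, _ in ranked[:max_pages]]
```

No embeddings are involved: both build and query are plain dictionary operations, which is what makes retrieval fast.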
|
|
| **File:** `backend/agents/indexing.py` |
|
|
| --- |
|
|
| ## Installation |
|
|
| ### Prerequisites |
|
|
| - Python 3.10+ |
| - Git (for cloning) |
| - Groq API account (free at https://console.groq.com) |
|
|
| ### Step 1: Clone Repository |
|
|
| ```bash |
| git clone <repository-url> |
| cd digi-biz |
| ``` |
|
|
| ### Step 2: Install Dependencies |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Step 3: Configure Environment |
|
|
| Create `.env` file: |
|
|
| ```bash |
| # Groq API (required for vision and text LLM) |
| GROQ_API_KEY=gsk_your_actual_key_here |
| GROQ_MODEL=gpt-oss-120b |
| GROQ_VISION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct |
| |
| # Optional: Ollama for local fallback |
| OLLAMA_HOST=http://localhost:11434 |
| OLLAMA_VISION_MODEL=qwen3.5:0.8b |
| |
| # Application settings |
| APP_ENV=development |
| LOG_LEVEL=INFO |
| MAX_FILE_SIZE=524288000 # 500MB |
| MAX_FILES_PER_ZIP=100 |
| |
| # Storage |
| STORAGE_BASE=./storage |
| ``` |
|
|
| ### Step 4: Get Groq API Key |
|
|
| 1. Visit https://console.groq.com |
| 2. Sign up / Log in |
| 3. Go to "API Keys" |
| 4. Create new key |
| 5. Copy to `.env` file |
|
|
| ### Step 5: Verify Installation |
|
|
| ```bash |
| # Test Groq connection |
| python test_groq_vision.py |
| |
| # Run tests |
| pytest tests/ -v |
| |
| # Start Streamlit app |
| streamlit run app.py |
| ``` |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Quick Start |
|
|
| 1. **Start the app:** |
| ```bash |
| streamlit run app.py |
| ``` |
|
|
| 2. **Open browser:** http://localhost:8501 |
|
|
| 3. **Upload ZIP** containing: |
| - Business documents (PDF, DOCX) |
| - Spreadsheets (XLSX, CSV) |
| - Images (JPG, PNG) |
| - Videos (MP4, AVI) |
|
|
| 4. **Click "Start Processing"** |
|
|
| 5. **View results** in tabs: |
| - Results (documents, tables) |
| - Vision Analysis (image descriptions) |
|
|
| ### Command Line Usage |
|
|
| ```python |
| from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput |
| |
| # Initialize agent |
| agent = FileDiscoveryAgent() |
| |
| # Create input |
| input_data = FileDiscoveryInput( |
| zip_file_path="business_docs.zip", |
| job_id="job_001" |
| ) |
| |
| # Run discovery |
| output = agent.discover(input_data) |
| |
| print(f"Discovered {output.total_files} files") |
| ``` |
|
|
| ### Batch Processing |
|
|
| ```python |
| from backend.agents.vision_agent import VisionAgent |
| |
| # Initialize with Groq |
| agent = VisionAgent(provider="groq") |
| |
| # Analyze multiple images |
| analyses = agent.analyze_batch(images, context="Product catalog") |
| |
| for analysis in analyses: |
| print(f"{analysis.category.value}: {analysis.description}") |
| ``` |
|
|
| --- |
|
|
| ## API Reference |
|
|
| ### File Discovery Agent |
|
|
| ```python |
| class FileDiscoveryAgent: |
| def discover(self, input: FileDiscoveryInput) -> FileDiscoveryOutput: |
| """Extract ZIP and classify files""" |
| pass |
| ``` |
|
|
| ### Document Parsing Agent |
|
|
| ```python |
| class DocumentParsingAgent: |
| def parse(self, input: DocumentParsingInput) -> DocumentParsingOutput: |
| """Parse documents and extract text/tables/images""" |
| pass |
| ``` |
|
|
| ### Vision Agent |
|
|
| ```python |
| class VisionAgent: |
| def analyze(self, input: VisionAnalysisInput) -> ImageAnalysis: |
| """Analyze single image""" |
| pass |
| |
| def analyze_batch(self, images: List[ExtractedImage], context: str) -> List[ImageAnalysis]: |
| """Analyze multiple images""" |
| pass |
| ``` |
|
|
| ### Indexing Agent |
|
|
| ```python |
| class IndexingAgent: |
| def build_index(self, input: IndexingInput) -> PageIndex: |
| """Build inverted index""" |
| pass |
| |
| def retrieve_context(self, query: str, page_index: PageIndex, max_pages: int) -> Dict: |
| """Retrieve relevant context""" |
| pass |
| ``` |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Groq API Issues |
|
|
| **Error:** `Groq API Key Missing` |
|
|
| **Solution:** |
| ```bash |
| # Check .env file |
| cat .env | grep GROQ_API_KEY |
| |
| # Should show your actual key, not placeholder |
| GROQ_API_KEY=gsk_xxxxx |
| ``` |
|
|
| **Error:** `Request Entity Too Large (413)` |
|
|
| **Solution:** Images are automatically resized. If still failing, compress images before uploading. |
|
|
| --- |
|
|
| ### Ollama Issues |
|
|
| **Error:** `Cannot connect to Ollama` |
|
|
| **Solution:** |
| ```bash |
| # Start Ollama server |
| ollama serve |
| |
| # Verify running |
| ollama list |
| ``` |
|
|
| --- |
|
|
| ### Memory Issues |
|
|
| **Error:** `Out of memory` |
|
|
| **Solution:** |
| ```bash |
| # Reduce concurrent processing |
| # In .env: |
| MAX_CONCURRENT_PARSING=3 |
| MAX_CONCURRENT_VISION=2 |
| ``` |
|
|
| --- |
|
|
| ### Performance Issues |
|
|
| **Slow processing:** |
|
|
| 1. Check internet connection (Groq API requires internet) |
| 2. Reduce image sizes before upload |
| 3. Process fewer files at once |
| 4. Check Groq API status: https://status.groq.com |
|
|
| --- |
|
|
| ## Testing |
|
|
| ### Run All Tests |
|
|
| ```bash |
| pytest tests/ -v |
| ``` |
|
|
| ### Run Specific Agent Tests |
|
|
| ```bash |
| # File Discovery |
| pytest tests/agents/test_file_discovery.py -v |
| |
| # Document Parsing |
| pytest tests/agents/test_document_parsing.py -v |
| |
| # Vision Agent |
| pytest tests/agents/test_vision_agent.py -v |
| |
| # Indexing Agent |
| pytest tests/agents/test_indexing.py -v # (to be created) |
| ``` |
|
|
| ### Test Coverage |
|
|
| ```bash |
| pytest tests/ --cov=backend --cov-report=html |
| start htmlcov/index.html # Windows |
| open htmlcov/index.html # macOS/Linux |
| ``` |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| digi-biz/ |
| βββ backend/ |
| β βββ agents/ |
| β β βββ file_discovery.py β
Complete |
| β β βββ document_parsing.py β
Complete |
| β β βββ table_extraction.py β
Complete |
| β β βββ media_extraction.py β
Complete |
| β β βββ vision_agent.py β
Complete |
| β β βββ indexing.py β
Complete |
| β βββ models/ |
| β β βββ schemas.py β
Complete |
| β β βββ enums.py β
Complete |
| β βββ utils/ |
| β βββ storage_manager.py |
| β βββ file_classifier.py |
| β βββ logger.py |
| β βββ groq_vision_client.py |
| βββ tests/ |
| β βββ agents/ |
| β βββ test_file_discovery.py |
| β βββ test_document_parsing.py |
| β βββ test_table_extraction.py |
| β βββ test_media_extraction.py |
| β βββ test_vision_agent.py |
| βββ app.py β
Streamlit App |
| βββ requirements.txt |
| βββ .env.example |
| βββ docs/ |
| βββ DOCUMENTATION.md β
This file |
| ``` |
|
|
| --- |
|
|
| ## Performance Benchmarks |
|
|
| | Agent | Processing Time | Test Data | |
| |-------|----------------|-----------| |
| | File Discovery | ~1-2s | 10 files ZIP | |
| | Document Parsing | ~50ms/doc | PDF 10 pages | |
| | Table Extraction | ~100ms/doc | 5 tables | |
| | Media Extraction | ~200ms/image | 5 images | |
| | Vision Analysis | ~2s/image | Groq API | |
| | Indexing | ~500ms | 50 pages | |
|
|
| **End-to-End:** <2 minutes for typical business folder (10 documents, 5 images) |
|
|
| --- |
|
|
| ## License |
|
|
| MIT License - See LICENSE file for details |
|
|
| --- |
|
|
| ## Support |
|
|
| - **GitHub Issues:** Report bugs and feature requests |
| - **Documentation:** This file + inline code comments |
| - **Email:** [Your contact here] |
|
|
| --- |
|
|
| **Last Updated:** March 17, 2026 |
| **Version:** 1.0.0 |
| **Status:** Production Ready ✅ |
|
|
|