# Digi-Biz Documentation

## Agentic Business Digitization Framework

**Version:** 1.0.0

**Last Updated:** March 17, 2026

---

## 📋 Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Agents](#agents)
4. [Installation](#installation)
5. [Usage](#usage)
6. [API Reference](#api-reference)
7. [Troubleshooting](#troubleshooting)
8. [Testing](#testing)
9. [Project Structure](#project-structure)
10. [Performance Benchmarks](#performance-benchmarks)
11. [License](#license)
12. [Support](#support)

---

## Overview

**Digi-Biz** is an AI-powered agentic framework that automatically converts unstructured business documents into structured digital business profiles.

### What It Does

- Accepts ZIP files containing mixed business documents (PDF, DOCX, Excel, images, videos)
- Extracts and structures information using multi-agent workflows
- Generates comprehensive digital business profiles with product/service inventories
- Provides a dynamic UI for viewing and editing results

### Key Features

- ✅ **Multi-Agent Pipeline** - 6 specialized agents working together
- ✅ **Vectorless RAG** - Fast document retrieval without embeddings
- ✅ **Groq Vision** - Image analysis with Llama-4-Scout (17B)
- ✅ **Production-Ready** - Error handling, validation, logging
- ✅ **Streamlit UI** - Interactive web interface

---

## Architecture

### High-Level Overview

```
┌─────────────────────────────────────────────────────────────┐
│                 User Interface (Streamlit)                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ ZIP Upload   │  │ Results View │  │ Vision Tab   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                       Agent Pipeline                        │
│  1. File Discovery → 2. Document Parsing → 3. Table Extract │
│  4. Media Extraction → 5. Vision (Groq) → 6. Indexing (RAG) │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                         Data Layer                          │
│  File Storage (FileSystem) • Index (In-Memory) • Profiles   │
└─────────────────────────────────────────────────────────────┘
```

### Technology Stack

| Component | Technology |
|-----------|-----------|
| **Backend** | Python 3.10+ |
| **Document Parsing** | pdfplumber, python-docx, openpyxl |
| **Image Processing** | Pillow, pdf2image |
| **Vision AI** | Groq API (Llama-4-Scout-17B) |
| **LLM (Text)** | Groq API (gpt-oss-120b) |
| **Validation** | Pydantic |
| **Frontend** | Streamlit |
| **Storage** | Local Filesystem |

---

## Agents

### 1. File Discovery Agent

**Purpose:** Extract ZIP files and classify all contained files

**Input:**

```python
FileDiscoveryInput(
    zip_file_path="/path/to/upload.zip",
    job_id="job_123",
    max_file_size=524288000,  # 500MB
    max_files=100
)
```

**Output:**

```python
FileDiscoveryOutput(
    job_id="job_123",
    success=True,
    documents=[...],     # PDFs, DOCX
    spreadsheets=[...],  # XLSX, CSV
    images=[...],        # JPG, PNG
    videos=[...],        # MP4, AVI
    total_files=10,
    extraction_dir="/storage/extracted/job_123"
)
```

**Features:**

- ZIP bomb detection (1000:1 ratio limit)
- Path traversal prevention
- File type classification (3-strategy approach)
- Directory structure preservation

**File:** `backend/agents/file_discovery.py`

---

### 2. Document Parsing Agent

**Purpose:** Extract text and structure from PDF/DOCX files

**Input:**

```python
DocumentParsingInput(
    documents=[...],  # From File Discovery
    job_id="job_123",
    enable_ocr=True
)
```

**Output:**

```python
DocumentParsingOutput(
    job_id="job_123",
    success=True,
    parsed_documents=[...],
    total_pages=56,
    processing_time=2.5
)
```

**Features:**

- PDF parsing (pdfplumber first, PyPDF2 as fallback, OCR as last resort)
- DOCX parsing with structure preservation
- Table extraction
- Embedded image extraction

**File:** `backend/agents/document_parsing.py`

---

### 3. Table Extraction Agent

**Purpose:** Detect and classify tables from parsed documents

**Input:**

```python
TableExtractionInput(
    parsed_documents=[...],
    job_id="job_123"
)
```

**Output:**

```python
TableExtractionOutput(
    job_id="job_123",
    success=True,
    tables=[...],
    total_tables=42,
    tables_by_type={
        "itinerary": 33,
        "pricing": 6,
        "general": 3
    }
)
```

**Table Types:**

| Type | Detection Criteria |
|------|-------------------|
| **PRICING** | Headers: price/cost/rate; Currency: $, €, ₹ |
| **ITINERARY** | Headers: day/time/date; Patterns: "Day 1", "9:00 AM" |
| **SPECIFICATIONS** | Headers: spec/feature/dimension/weight |
| **MENU** | Headers: menu/dish/food/meal |
| **INVENTORY** | Headers: stock/quantity/available |
| **GENERAL** | Fallback |

**File:** `backend/agents/table_extraction.py`

---

### 4. Media Extraction Agent

**Purpose:** Extract embedded and standalone media

**Input:**

```python
MediaExtractionInput(
    parsed_documents=[...],
    standalone_files=[...],
    job_id="job_123"
)
```

**Output:**

```python
MediaExtractionOutput(
    job_id="job_123",
    success=True,
    media=MediaCollection(
        images=[...],
        total_count=15,
        extraction_summary={...}
    ),
    duplicates_removed=3
)
```

**Features:**

- PDF embedded image extraction (xref method)
- DOCX embedded image extraction (ZIP method)
- Perceptual hashing for deduplication
- Quality assessment

**File:** `backend/agents/media_extraction.py`

---

### 5. Vision Agent (Groq)

**Purpose:** Analyze images using Groq Vision API

**Input:**

```python
VisionAnalysisInput(
    image=ExtractedImage(...),
    context="Restaurant menu with burgers",
    job_id="job_123"
)
```

**Output:**

```python
ImageAnalysis(
    image_id="img_001",
    description="A delicious burger with lettuce...",
    category=ImageCategory.FOOD,
    tags=["burger", "food", "restaurant"],
    is_product=False,
    is_service_related=True,
    confidence=0.92,
    metadata={
        'provider': 'groq',
        'model': 'llama-4-scout-17b',
        'processing_time': 1.85
    }
)
```

**Features:**

- Groq API integration (Llama-4-Scout-17B)
- Ollama fallback
- Context-aware prompts
- JSON response parsing
- Batch processing
- Automatic image resizing (<4MB)

**File:** `backend/agents/vision_agent.py`

---

### 6. Indexing Agent (Vectorless RAG)

**Purpose:** Build an inverted index for fast document retrieval

**Input:**

```python
IndexingInput(
    parsed_documents=[...],
    tables=[...],
    images=[...],
    job_id="job_123"
)
```

**Output:**

```python
IndexingOutput(
    job_id="job_123",
    success=True,
    page_index=PageIndex(
        documents={...},
        page_index={
            "burger": [PageReference(...)],
            "price": [PageReference(...)]
        },
        table_index={...},
        media_index={...}
    ),
    total_keywords=1250
)
```

**Features:**

- Keyword extraction (tokenization, N-grams, entities)
- Inverted index creation
- Query expansion with synonyms
- Context-aware retrieval
- Relevance scoring

**File:** `backend/agents/indexing.py`

---

## Installation

### Prerequisites

- Python 3.10+
- Git (for cloning)
- Groq API account (free at https://console.groq.com)

### Step 1: Clone Repository

Clone the repository, then change into the project directory (a Windows path is shown; adjust it for your system):

```bash
cd D:\Viswam_Projects\digi-biz
```

### Step 2: Install Dependencies

```bash
pip install -r requirements.txt
```

### Step 3: Configure Environment

Create a `.env` file in the project root (`.env.example` can serve as a starting point):

```bash
# Groq API (required for vision and text LLM)
GROQ_API_KEY=gsk_your_actual_key_here
GROQ_MODEL=gpt-oss-120b
GROQ_VISION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct

# Optional: Ollama for local fallback
OLLAMA_HOST=http://localhost:11434
OLLAMA_VISION_MODEL=qwen3.5:0.8b

# Application settings
APP_ENV=development
LOG_LEVEL=INFO
# 500MB, in bytes
MAX_FILE_SIZE=524288000
MAX_FILES_PER_ZIP=100

# Storage
STORAGE_BASE=./storage
```

### Step 4: Get Groq API Key

1. Visit https://console.groq.com
2. Sign up / Log in
3. Go to "API Keys"
4. Create a new key
5. Copy it into the `.env` file as `GROQ_API_KEY`

### Step 5: Verify Installation

```bash
# Test Groq connection
python test_groq_vision.py

# Run tests
pytest tests/ -v

# Start Streamlit app
streamlit run app.py
```

---

## Usage

### Quick Start

1. **Start the app:**

   ```bash
   streamlit run app.py
   ```

2. **Open browser:** http://localhost:8501

3. **Upload ZIP** containing:
   - Business documents (PDF, DOCX)
   - Spreadsheets (XLSX, CSV)
   - Images (JPG, PNG)
   - Videos (MP4, AVI)

4. **Click "Start Processing"**

5. **View results** in tabs:
   - Results (documents, tables)
   - Vision Analysis (image descriptions)

### Programmatic Usage

```python
from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput

# Initialize agent
agent = FileDiscoveryAgent()

# Create input
input_data = FileDiscoveryInput(
    zip_file_path="business_docs.zip",
    job_id="job_001"
)

# Run discovery
output = agent.discover(input_data)
print(f"Discovered {output.total_files} files")
```

### Batch Processing

```python
from backend.agents.vision_agent import VisionAgent

# Initialize with Groq
agent = VisionAgent(provider="groq")

# Analyze multiple images
# `images` is a list of ExtractedImage objects, e.g. from the Media Extraction Agent
analyses = agent.analyze_batch(images, context="Product catalog")

for analysis in analyses:
    print(f"{analysis.category.value}: {analysis.description}")
```

---

## API Reference

### File Discovery Agent

```python
class FileDiscoveryAgent:
    def discover(self, input: FileDiscoveryInput) -> FileDiscoveryOutput:
        """Extract ZIP and classify files"""
        pass
```

### Document Parsing Agent

```python
class DocumentParsingAgent:
    def parse(self, input: DocumentParsingInput) -> DocumentParsingOutput:
        """Parse documents and extract text/tables/images"""
        pass
```

### Vision Agent

```python
from typing import List

class VisionAgent:
    def analyze(self, input: VisionAnalysisInput) -> ImageAnalysis:
        """Analyze single image"""
        pass

    def analyze_batch(self, images: List[ExtractedImage], context: str) -> List[ImageAnalysis]:
        """Analyze multiple images"""
        pass
```

### Indexing Agent

```python
from typing import Dict

class IndexingAgent:
    def build_index(self, input: IndexingInput) -> PageIndex:
        """Build inverted index"""
        pass

    def retrieve_context(self, query: str, page_index: PageIndex, max_pages: int) -> Dict:
        """Retrieve relevant context"""
        pass
```

---

## Troubleshooting

### Groq API Issues

**Error:** `Groq API Key Missing`

**Solution:**

```bash
# Check .env file
grep GROQ_API_KEY .env

# Expected output (your actual key, not the placeholder):
# GROQ_API_KEY=gsk_xxxxx
```

**Error:** `Request Entity Too Large (413)`

**Solution:** Images are resized automatically before upload to the API. If the error persists, compress the images before adding them to the ZIP.

---

### Ollama Issues

**Error:** `Cannot connect to Ollama`

**Solution:**

```bash
# Start Ollama server
ollama serve

# Verify running
ollama list
```

---

### Memory Issues

**Error:** `Out of memory`

**Solution:**

```bash
# Reduce concurrent processing
# In .env:
MAX_CONCURRENT_PARSING=3
MAX_CONCURRENT_VISION=2
```

---

### Performance Issues

**Slow processing:**

1. Check internet connection (Groq API requires internet)
2. Reduce image sizes before upload
3. Process fewer files at once
4. Check Groq API status: https://status.groq.com

---

## Testing

### Run All Tests

```bash
pytest tests/ -v
```

### Run Specific Agent Tests

```bash
# File Discovery
pytest tests/agents/test_file_discovery.py -v

# Document Parsing
pytest tests/agents/test_document_parsing.py -v

# Vision Agent
pytest tests/agents/test_vision_agent.py -v

# Indexing Agent
pytest tests/agents/test_indexing.py -v  # (to be created)
```

### Test Coverage

```bash
pytest tests/ --cov=backend --cov-report=html

start htmlcov/index.html     # Windows
open htmlcov/index.html      # macOS
xdg-open htmlcov/index.html  # Linux
```

---

## Project Structure

```
digi-biz/
├── backend/
│   ├── agents/
│   │   ├── file_discovery.py       ✅ Complete
│   │   ├── document_parsing.py     ✅ Complete
│   │   ├── table_extraction.py     ✅ Complete
│   │   ├── media_extraction.py     ✅ Complete
│   │   ├── vision_agent.py         ✅ Complete
│   │   └── indexing.py             ✅ Complete
│   ├── models/
│   │   ├── schemas.py              ✅ Complete
│   │   └── enums.py                ✅ Complete
│   └── utils/
│       ├── storage_manager.py
│       ├── file_classifier.py
│       ├── logger.py
│       └── groq_vision_client.py
├── tests/
│   └── agents/
│       ├── test_file_discovery.py
│       ├── test_document_parsing.py
│       ├── test_table_extraction.py
│       ├── test_media_extraction.py
│       └── test_vision_agent.py
├── app.py                          ✅ Streamlit App
├── requirements.txt
├── .env.example
└── docs/
    └── DOCUMENTATION.md            ✅ This file
```

---

## Performance Benchmarks

| Agent | Processing Time | Test Data |
|-------|----------------|-----------|
| File Discovery | ~1-2s | 10 files ZIP |
| Document Parsing | ~50ms/doc | PDF 10 pages |
| Table Extraction | ~100ms/doc | 5 tables |
| Media Extraction | ~200ms/image | 5 images |
| Vision Analysis | ~2s/image | Groq API |
| Indexing | ~500ms | 50 pages |

**End-to-End:** <2 minutes for a typical business folder (10 documents, 5 images)

---

## License

MIT License - See LICENSE file for details

---

## Support

- **GitHub Issues:** Report bugs and feature requests
- **Documentation:** This file + inline code comments
- **Email:** [Your contact here]

---

**Status:** Production Ready ✅
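
The inverted index behind the Indexing Agent's "vectorless RAG" can be illustrated with a minimal sketch. It is not the code in `backend/agents/indexing.py`: `PageRef`, `STOP_WORDS`, `tokenize`, `build_inverted_index`, and `search` are hypothetical names chosen for this example, and the sketch assumes plain term counts as the relevance score. N-grams, entity extraction, and synonym expansion are left out.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRef:
    """Hypothetical stand-in for the PageReference model."""
    doc_id: str
    page: int


# Hypothetical stop-word list; real keyword extraction would be richer.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "for", "in", "on", "to"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(pages: dict[PageRef, str]) -> dict[str, dict[PageRef, int]]:
    """Map each keyword to the pages containing it, with a per-page term count."""
    index: dict[str, dict[PageRef, int]] = defaultdict(dict)
    for ref, text in pages.items():
        for token in tokenize(text):
            index[token][ref] = index[token].get(ref, 0) + 1
    return index


def search(index: dict[str, dict[PageRef, int]], query: str, max_pages: int = 3) -> list[PageRef]:
    """Rank pages by the summed term counts of the query keywords."""
    scores: dict[PageRef, int] = defaultdict(int)
    for token in tokenize(query):
        for ref, count in index.get(token, {}).items():
            scores[ref] += count
    # Highest score first; ties broken by document id and page for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0].doc_id, item[0].page))
    return [ref for ref, _ in ranked[:max_pages]]


pages = {
    PageRef("menu.pdf", 1): "Classic burger with fries. Burger price: $8",
    PageRef("menu.pdf", 2): "Desserts and drinks",
    PageRef("catalog.pdf", 1): "Price list for catering packages",
}
index = build_inverted_index(pages)
results = search(index, "burger price")
for ref in results:
    print(ref.doc_id, ref.page)
```

Because lookups are dictionary hits on keywords, no embedding model or vector store is involved; the trade-off is that a query matches only on shared tokens, which is presumably why the agent adds query expansion with synonyms.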