Digi-Biz / docs /DOCUMENTATION.md
Deployment Bot
Automated deployment to Hugging Face
255cbd1
# Digi-Biz Documentation
## Agentic Business Digitization Framework
**Version:** 1.0.0
**Last Updated:** March 17, 2026
---
## πŸ“‹ Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Agents](#agents)
4. [Installation](#installation)
5. [Usage](#usage)
6. [API Reference](#api-reference)
7. [Troubleshooting](#troubleshooting)
---
## Overview
**Digi-Biz** is an AI-powered agentic framework that automatically converts unstructured business documents into structured digital business profiles.
### What It Does
- Accepts ZIP files containing mixed business documents (PDF, DOCX, Excel, images, videos)
- Intelligently extracts and structures information using multi-agent workflows
- Generates comprehensive digital business profiles with product/service inventories
- Provides dynamic UI for viewing and editing results
### Key Features
βœ… **Multi-Agent Pipeline** - 5 specialized agents working together
βœ… **Vectorless RAG** - Fast document retrieval without embeddings
βœ… **Groq Vision** - Image analysis with Llama-4-Scout (17B)
βœ… **Production-Ready** - Error handling, validation, logging
βœ… **Streamlit UI** - Interactive web interface
---
## Architecture
### High-Level Overview
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User Interface (Streamlit) β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ ZIP Upload β”‚ β”‚ Results View β”‚ β”‚ Vision Tab β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Agent Pipeline β”‚
β”‚ 1. File Discovery β†’ 2. Document Parsing β†’ 3. Table Extract β”‚
β”‚ 4. Media Extraction β†’ 5. Vision (Groq) β†’ 6. Indexing (RAG) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Data Layer β”‚
β”‚ File Storage (FileSystem) β€’ Index (In-Memory) β€’ Profiles β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Technology Stack
| Component | Technology |
|-----------|-----------|
| **Backend** | Python 3.10+ |
| **Document Parsing** | pdfplumber, python-docx, openpyxl |
| **Image Processing** | Pillow, pdf2image |
| **Vision AI** | Groq API (Llama-4-Scout-17B) |
| **LLM (Text)** | Groq API (gpt-oss-120b) |
| **Validation** | Pydantic |
| **Frontend** | Streamlit |
| **Storage** | Local Filesystem |
---
## Agents
### 1. File Discovery Agent
**Purpose:** Extract ZIP files and classify all contained files
**Input:**
```python
FileDiscoveryInput(
zip_file_path="/path/to/upload.zip",
job_id="job_123",
max_file_size=524288000, # 500MB
max_files=100
)
```
**Output:**
```python
FileDiscoveryOutput(
job_id="job_123",
success=True,
documents=[...], # PDFs, DOCX
spreadsheets=[...], # XLSX, CSV
images=[...], # JPG, PNG
videos=[...], # MP4, AVI
total_files=10,
extraction_dir="/storage/extracted/job_123"
)
```
**Features:**
- ZIP bomb detection (1000:1 ratio limit)
- Path traversal prevention
- File type classification (3-strategy approach)
- Directory structure preservation
**File:** `backend/agents/file_discovery.py`
---
### 2. Document Parsing Agent
**Purpose:** Extract text and structure from PDF/DOCX files
**Input:**
```python
DocumentParsingInput(
documents=[...], # From File Discovery
job_id="job_123",
enable_ocr=True
)
```
**Output:**
```python
DocumentParsingOutput(
job_id="job_123",
success=True,
parsed_documents=[...],
total_pages=56,
processing_time=2.5
)
```
**Features:**
- PDF parsing (pdfplumber primary, PyPDF2 fallback, OCR final)
- DOCX parsing with structure preservation
- Table extraction
- Embedded image extraction
**File:** `backend/agents/document_parsing.py`
---
### 3. Table Extraction Agent
**Purpose:** Detect and classify tables from parsed documents
**Input:**
```python
TableExtractionInput(
parsed_documents=[...],
job_id="job_123"
)
```
**Output:**
```python
TableExtractionOutput(
job_id="job_123",
success=True,
tables=[...],
total_tables=42,
tables_by_type={
"itinerary": 33,
"pricing": 6,
"general": 3
}
)
```
**Table Types:**
| Type | Detection Criteria |
|------|-------------------|
| **PRICING** | Headers: price/cost/rate; Currency: $, €, β‚Ή |
| **ITINERARY** | Headers: day/time/date; Patterns: "Day 1", "9:00 AM" |
| **SPECIFICATIONS** | Headers: spec/feature/dimension/weight |
| **MENU** | Headers: menu/dish/food/meal |
| **INVENTORY** | Headers: stock/quantity/available |
| **GENERAL** | Fallback |
**File:** `backend/agents/table_extraction.py`
---
### 4. Media Extraction Agent
**Purpose:** Extract embedded and standalone media
**Input:**
```python
MediaExtractionInput(
parsed_documents=[...],
standalone_files=[...],
job_id="job_123"
)
```
**Output:**
```python
MediaExtractionOutput(
job_id="job_123",
success=True,
media=MediaCollection(
images=[...],
total_count=15,
extraction_summary={...}
),
duplicates_removed=3
)
```
**Features:**
- PDF embedded image extraction (xref method)
- DOCX embedded image extraction (ZIP method)
- Perceptual hashing for deduplication
- Quality assessment
**File:** `backend/agents/media_extraction.py`
---
### 5. Vision Agent (Groq)
**Purpose:** Analyze images using Groq Vision API
**Input:**
```python
VisionAnalysisInput(
image=ExtractedImage(...),
context="Restaurant menu with burgers",
job_id="job_123"
)
```
**Output:**
```python
ImageAnalysis(
image_id="img_001",
description="A delicious burger with lettuce...",
category=ImageCategory.FOOD,
tags=["burger", "food", "restaurant"],
is_product=False,
is_service_related=True,
confidence=0.92,
metadata={
'provider': 'groq',
'model': 'llama-4-scout-17b',
'processing_time': 1.85
}
)
```
**Features:**
- Groq API integration (Llama-4-Scout-17B)
- Ollama fallback
- Context-aware prompts
- JSON response parsing
- Batch processing
- Automatic image resizing (<4MB)
**File:** `backend/agents/vision_agent.py`
---
### 6. Indexing Agent (Vectorless RAG)
**Purpose:** Build inverted index for fast document retrieval
**Input:**
```python
IndexingInput(
parsed_documents=[...],
tables=[...],
images=[...],
job_id="job_123"
)
```
**Output:**
```python
IndexingOutput(
job_id="job_123",
success=True,
page_index=PageIndex(
documents={...},
page_index={
"burger": [PageReference(...)],
"price": [PageReference(...)]
},
table_index={...},
media_index={...}
),
total_keywords=1250
)
```
**Features:**
- Keyword extraction (tokenization, N-grams, entities)
- Inverted index creation
- Query expansion with synonyms
- Context-aware retrieval
- Relevance scoring
**File:** `backend/agents/indexing.py`
---
## Installation
### Prerequisites
- Python 3.10+
- Git (for cloning)
- Groq API account (free at https://console.groq.com)
### Step 1: Clone Repository
```bash
cd D:\Viswam_Projects\digi-biz
```
### Step 2: Install Dependencies
```bash
pip install -r requirements.txt
```
### Step 3: Configure Environment
Create `.env` file:
```bash
# Groq API (required for vision and text LLM)
GROQ_API_KEY=gsk_your_actual_key_here
GROQ_MODEL=gpt-oss-120b
GROQ_VISION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
# Optional: Ollama for local fallback
OLLAMA_HOST=http://localhost:11434
OLLAMA_VISION_MODEL=qwen3.5:0.8b
# Application settings
APP_ENV=development
LOG_LEVEL=INFO
MAX_FILE_SIZE=524288000 # 500MB
MAX_FILES_PER_ZIP=100
# Storage
STORAGE_BASE=./storage
```
### Step 4: Get Groq API Key
1. Visit https://console.groq.com
2. Sign up / Log in
3. Go to "API Keys"
4. Create new key
5. Copy to `.env` file
### Step 5: Verify Installation
```bash
# Test Groq connection
python test_groq_vision.py
# Run tests
pytest tests/ -v
# Start Streamlit app
streamlit run app.py
```
---
## Usage
### Quick Start
1. **Start the app:**
```bash
streamlit run app.py
```
2. **Open browser:** http://localhost:8501
3. **Upload ZIP** containing:
- Business documents (PDF, DOCX)
- Spreadsheets (XLSX, CSV)
- Images (JPG, PNG)
- Videos (MP4, AVI)
4. **Click "Start Processing"**
5. **View results** in tabs:
- Results (documents, tables)
- Vision Analysis (image descriptions)
### Command Line Usage
```python
from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput
# Initialize agent
agent = FileDiscoveryAgent()
# Create input
input_data = FileDiscoveryInput(
zip_file_path="business_docs.zip",
job_id="job_001"
)
# Run discovery
output = agent.discover(input_data)
print(f"Discovered {output.total_files} files")
```
### Batch Processing
```python
from backend.agents.vision_agent import VisionAgent
# Initialize with Groq
agent = VisionAgent(provider="groq")
# Analyze multiple images
analyses = agent.analyze_batch(images, context="Product catalog")
for analysis in analyses:
print(f"{analysis.category.value}: {analysis.description}")
```
---
## API Reference
### File Discovery Agent
```python
class FileDiscoveryAgent:
def discover(self, input: FileDiscoveryInput) -> FileDiscoveryOutput:
"""Extract ZIP and classify files"""
pass
```
### Document Parsing Agent
```python
class DocumentParsingAgent:
def parse(self, input: DocumentParsingInput) -> DocumentParsingOutput:
"""Parse documents and extract text/tables/images"""
pass
```
### Vision Agent
```python
class VisionAgent:
def analyze(self, input: VisionAnalysisInput) -> ImageAnalysis:
"""Analyze single image"""
pass
def analyze_batch(self, images: List[ExtractedImage], context: str) -> List[ImageAnalysis]:
"""Analyze multiple images"""
pass
```
### Indexing Agent
```python
class IndexingAgent:
def build_index(self, input: IndexingInput) -> PageIndex:
"""Build inverted index"""
pass
def retrieve_context(self, query: str, page_index: PageIndex, max_pages: int) -> Dict:
"""Retrieve relevant context"""
pass
```
---
## Troubleshooting
### Groq API Issues
**Error:** `Groq API Key Missing`
**Solution:**
```bash
# Check .env file
cat .env | grep GROQ_API_KEY
# Should show your actual key, not placeholder
GROQ_API_KEY=gsk_xxxxx
```
**Error:** `Request Entity Too Large (413)`
**Solution:** Images are automatically resized. If still failing, compress images before uploading.
---
### Ollama Issues
**Error:** `Cannot connect to Ollama`
**Solution:**
```bash
# Start Ollama server
ollama serve
# Verify running
ollama list
```
---
### Memory Issues
**Error:** `Out of memory`
**Solution:**
```bash
# Reduce concurrent processing
# In .env:
MAX_CONCURRENT_PARSING=3
MAX_CONCURRENT_VISION=2
```
---
### Performance Issues
**Slow processing:**
1. Check internet connection (Groq API requires internet)
2. Reduce image sizes before upload
3. Process fewer files at once
4. Check Groq API status: https://status.groq.com
---
## Testing
### Run All Tests
```bash
pytest tests/ -v
```
### Run Specific Agent Tests
```bash
# File Discovery
pytest tests/agents/test_file_discovery.py -v
# Document Parsing
pytest tests/agents/test_document_parsing.py -v
# Vision Agent
pytest tests/agents/test_vision_agent.py -v
# Indexing Agent
pytest tests/agents/test_indexing.py -v # (to be created)
```
### Test Coverage
```bash
pytest tests/ --cov=backend --cov-report=html
start htmlcov/index.html # Windows
open htmlcov/index.html # macOS/Linux
```
---
## Project Structure
```
digi-biz/
β”œβ”€β”€ backend/
β”‚ β”œβ”€β”€ agents/
β”‚ β”‚ β”œβ”€β”€ file_discovery.py βœ… Complete
β”‚ β”‚ β”œβ”€β”€ document_parsing.py βœ… Complete
β”‚ β”‚ β”œβ”€β”€ table_extraction.py βœ… Complete
β”‚ β”‚ β”œβ”€β”€ media_extraction.py βœ… Complete
β”‚ β”‚ β”œβ”€β”€ vision_agent.py βœ… Complete
β”‚ β”‚ └── indexing.py βœ… Complete
β”‚ β”œβ”€β”€ models/
β”‚ β”‚ β”œβ”€β”€ schemas.py βœ… Complete
β”‚ β”‚ └── enums.py βœ… Complete
β”‚ └── utils/
β”‚ β”œβ”€β”€ storage_manager.py
β”‚ β”œβ”€β”€ file_classifier.py
β”‚ β”œβ”€β”€ logger.py
β”‚ └── groq_vision_client.py
β”œβ”€β”€ tests/
β”‚ └── agents/
β”‚ β”œβ”€β”€ test_file_discovery.py
β”‚ β”œβ”€β”€ test_document_parsing.py
β”‚ β”œβ”€β”€ test_table_extraction.py
β”‚ β”œβ”€β”€ test_media_extraction.py
β”‚ └── test_vision_agent.py
β”œβ”€β”€ app.py βœ… Streamlit App
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .env.example
└── docs/
└── DOCUMENTATION.md βœ… This file
```
---
## Performance Benchmarks
| Agent | Processing Time | Test Data |
|-------|----------------|-----------|
| File Discovery | ~1-2s | 10 files ZIP |
| Document Parsing | ~50ms/doc | PDF 10 pages |
| Table Extraction | ~100ms/doc | 5 tables |
| Media Extraction | ~200ms/image | 5 images |
| Vision Analysis | ~2s/image | Groq API |
| Indexing | ~500ms | 50 pages |
**End-to-End:** <2 minutes for typical business folder (10 documents, 5 images)
---
## License
MIT License - See LICENSE file for details
---
## Support
- **GitHub Issues:** Report bugs and feature requests
- **Documentation:** This file + inline code comments
- **Email:** [Your contact here]
---
**Last Updated:** March 17, 2026
**Version:** 1.0.0
**Status:** Production Ready βœ