Spaces:

Divs0910
/

Digi-Biz

Sleeping

App Files Files Community

Digi-Biz / docs /DOCUMENTATION.md

Deployment Bot

Automated deployment to Hugging Face

255cbd1 16 days ago

preview code

raw

history blame contribute delete

15 kB

Digi-Biz Documentation

Agentic Business Digitization Framework

Version: 1.0.0
Last Updated: March 17, 2026

Overview

Digi-Biz is an AI-powered agentic framework that automatically converts unstructured business documents into structured digital business profiles.

What It Does

Accepts ZIP files containing mixed business documents (PDF, DOCX, Excel, images, videos)
Intelligently extracts and structures information using multi-agent workflows
Generates comprehensive digital business profiles with product/service inventories
Provides dynamic UI for viewing and editing results

Key Features

✅ Multi-Agent Pipeline - 5 specialized agents working together
✅ Vectorless RAG - Fast document retrieval without embeddings
✅ Groq Vision - Image analysis with Llama-4-Scout (17B)
✅ Production-Ready - Error handling, validation, logging
✅ Streamlit UI - Interactive web interface

Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────────┐
│                     User Interface (Streamlit)               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ ZIP Upload   │  │ Results View │  │ Vision Tab   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                  Agent Pipeline                              │
│  1. File Discovery → 2. Document Parsing → 3. Table Extract │
│  4. Media Extraction → 5. Vision (Groq) → 6. Indexing (RAG) │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                    Data Layer                                │
│  File Storage (FileSystem) • Index (In-Memory) • Profiles   │
└─────────────────────────────────────────────────────────────┘

Technology Stack

Component	Technology
Backend	Python 3.10+
Document Parsing	pdfplumber, python-docx, openpyxl
Image Processing	Pillow, pdf2image
Vision AI	Groq API (Llama-4-Scout-17B)
LLM (Text)	Groq API (gpt-oss-120b)
Validation	Pydantic
Frontend	Streamlit
Storage	Local Filesystem

Agents

1. File Discovery Agent

Purpose: Extract ZIP files and classify all contained files

Input:

FileDiscoveryInput(
    zip_file_path="/path/to/upload.zip",
    job_id="job_123",
    max_file_size=524288000,  # 500MB
    max_files=100
)

Output:

FileDiscoveryOutput(
    job_id="job_123",
    success=True,
    documents=[...],      # PDFs, DOCX
    spreadsheets=[...],   # XLSX, CSV
    images=[...],         # JPG, PNG
    videos=[...],         # MP4, AVI
    total_files=10,
    extraction_dir="/storage/extracted/job_123"
)

Features:

ZIP bomb detection (1000:1 ratio limit)
Path traversal prevention
File type classification (3-strategy approach)
Directory structure preservation

File: backend/agents/file_discovery.py

2. Document Parsing Agent

Purpose: Extract text and structure from PDF/DOCX files

Input:

DocumentParsingInput(
    documents=[...],  # From File Discovery
    job_id="job_123",
    enable_ocr=True
)

Output:

DocumentParsingOutput(
    job_id="job_123",
    success=True,
    parsed_documents=[...],
    total_pages=56,
    processing_time=2.5
)

Features:

PDF parsing (pdfplumber primary, PyPDF2 fallback, OCR final)
DOCX parsing with structure preservation
Table extraction
Embedded image extraction

File: backend/agents/document_parsing.py

3. Table Extraction Agent

Purpose: Detect and classify tables from parsed documents

Input:

TableExtractionInput(
    parsed_documents=[...],
    job_id="job_123"
)

Output:

TableExtractionOutput(
    job_id="job_123",
    success=True,
    tables=[...],
    total_tables=42,
    tables_by_type={
        "itinerary": 33,
        "pricing": 6,
        "general": 3
    }
)

Table Types:

Type	Detection Criteria
PRICING	Headers: price/cost/rate; Currency: $, €, ₹
ITINERARY	Headers: day/time/date; Patterns: "Day 1", "9:00 AM"
SPECIFICATIONS	Headers: spec/feature/dimension/weight
MENU	Headers: menu/dish/food/meal
INVENTORY	Headers: stock/quantity/available
GENERAL	Fallback

File: backend/agents/table_extraction.py

4. Media Extraction Agent

Purpose: Extract embedded and standalone media

Input:

MediaExtractionInput(
    parsed_documents=[...],
    standalone_files=[...],
    job_id="job_123"
)

Output:

MediaExtractionOutput(
    job_id="job_123",
    success=True,
    media=MediaCollection(
        images=[...],
        total_count=15,
        extraction_summary={...}
    ),
    duplicates_removed=3
)

Features:

PDF embedded image extraction (xref method)
DOCX embedded image extraction (ZIP method)
Perceptual hashing for deduplication
Quality assessment

File: backend/agents/media_extraction.py

5. Vision Agent (Groq)

Purpose: Analyze images using Groq Vision API

Input:

VisionAnalysisInput(
    image=ExtractedImage(...),
    context="Restaurant menu with burgers",
    job_id="job_123"
)

Output:

ImageAnalysis(
    image_id="img_001",
    description="A delicious burger with lettuce...",
    category=ImageCategory.FOOD,
    tags=["burger", "food", "restaurant"],
    is_product=False,
    is_service_related=True,
    confidence=0.92,
    metadata={
        'provider': 'groq',
        'model': 'llama-4-scout-17b',
        'processing_time': 1.85
    }
)

Features:

Groq API integration (Llama-4-Scout-17B)
Ollama fallback
Context-aware prompts
JSON response parsing
Batch processing
Automatic image resizing (<4MB)

File: backend/agents/vision_agent.py

6. Indexing Agent (Vectorless RAG)

Purpose: Build inverted index for fast document retrieval

Input:

IndexingInput(
    parsed_documents=[...],
    tables=[...],
    images=[...],
    job_id="job_123"
)

Output:

IndexingOutput(
    job_id="job_123",
    success=True,
    page_index=PageIndex(
        documents={...},
        page_index={
            "burger": [PageReference(...)],
            "price": [PageReference(...)]
        },
        table_index={...},
        media_index={...}
    ),
    total_keywords=1250
)

Features:

Keyword extraction (tokenization, N-grams, entities)
Inverted index creation
Query expansion with synonyms
Context-aware retrieval
Relevance scoring

File: backend/agents/indexing.py

Installation

Prerequisites

Python 3.10+
Git (for cloning)
Groq API account (free at https://console.groq.com)

Step 1: Clone Repository

cd D:\Viswam_Projects\digi-biz

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Configure Environment

Create .env file:

# Groq API (required for vision and text LLM)
GROQ_API_KEY=gsk_your_actual_key_here
GROQ_MODEL=gpt-oss-120b
GROQ_VISION_MODEL=meta-llama/llama-4-scout-17b-16e-instruct

# Optional: Ollama for local fallback
OLLAMA_HOST=http://localhost:11434
OLLAMA_VISION_MODEL=qwen3.5:0.8b

# Application settings
APP_ENV=development
LOG_LEVEL=INFO
MAX_FILE_SIZE=524288000    # 500MB
MAX_FILES_PER_ZIP=100

# Storage
STORAGE_BASE=./storage

Step 4: Get Groq API Key

Visit https://console.groq.com
Sign up / Log in
Go to "API Keys"
Create new key
Copy to .env file

Step 5: Verify Installation

# Test Groq connection
python test_groq_vision.py

# Run tests
pytest tests/ -v

# Start Streamlit app
streamlit run app.py

Usage

Quick Start

Start the app:
```
streamlit run app.py
```
Open browser: http://localhost:8501
Upload ZIP containing:
- Business documents (PDF, DOCX)
- Spreadsheets (XLSX, CSV)
- Images (JPG, PNG)
- Videos (MP4, AVI)
Click "Start Processing"
View results in tabs:
- Results (documents, tables)
- Vision Analysis (image descriptions)

Command Line Usage

from backend.agents.file_discovery import FileDiscoveryAgent, FileDiscoveryInput

# Initialize agent
agent = FileDiscoveryAgent()

# Create input
input_data = FileDiscoveryInput(
    zip_file_path="business_docs.zip",
    job_id="job_001"
)

# Run discovery
output = agent.discover(input_data)

print(f"Discovered {output.total_files} files")

Batch Processing

from backend.agents.vision_agent import VisionAgent

# Initialize with Groq
agent = VisionAgent(provider="groq")

# Analyze multiple images
analyses = agent.analyze_batch(images, context="Product catalog")

for analysis in analyses:
    print(f"{analysis.category.value}: {analysis.description}")

API Reference

File Discovery Agent

class FileDiscoveryAgent:
    def discover(self, input: FileDiscoveryInput) -> FileDiscoveryOutput:
        """Extract ZIP and classify files"""
        pass

Document Parsing Agent

class DocumentParsingAgent:
    def parse(self, input: DocumentParsingInput) -> DocumentParsingOutput:
        """Parse documents and extract text/tables/images"""
        pass

Vision Agent

class VisionAgent:
    def analyze(self, input: VisionAnalysisInput) -> ImageAnalysis:
        """Analyze single image"""
        pass
    
    def analyze_batch(self, images: List[ExtractedImage], context: str) -> List[ImageAnalysis]:
        """Analyze multiple images"""
        pass

Indexing Agent

class IndexingAgent:
    def build_index(self, input: IndexingInput) -> PageIndex:
        """Build inverted index"""
        pass
    
    def retrieve_context(self, query: str, page_index: PageIndex, max_pages: int) -> Dict:
        """Retrieve relevant context"""
        pass

Troubleshooting

Groq API Issues

Error: Groq API Key Missing

Solution:

# Check .env file
cat .env | grep GROQ_API_KEY

# Should show your actual key, not placeholder
GROQ_API_KEY=gsk_xxxxx

Error: Request Entity Too Large (413)

Solution: Images are automatically resized. If still failing, compress images before uploading.

Ollama Issues

Error: Cannot connect to Ollama

Solution:

# Start Ollama server
ollama serve

# Verify running
ollama list

Memory Issues

Error: Out of memory

Solution:

# Reduce concurrent processing
# In .env:
MAX_CONCURRENT_PARSING=3
MAX_CONCURRENT_VISION=2

Performance Issues

Slow processing:

Check internet connection (Groq API requires internet)
Reduce image sizes before upload
Process fewer files at once
Check Groq API status: https://status.groq.com

Testing

Run All Tests

pytest tests/ -v

Run Specific Agent Tests

# File Discovery
pytest tests/agents/test_file_discovery.py -v

# Document Parsing
pytest tests/agents/test_document_parsing.py -v

# Vision Agent
pytest tests/agents/test_vision_agent.py -v

# Indexing Agent
pytest tests/agents/test_indexing.py -v  # (to be created)

Test Coverage

pytest tests/ --cov=backend --cov-report=html
start htmlcov/index.html  # Windows
open htmlcov/index.html   # macOS/Linux

Project Structure

digi-biz/
├── backend/
│   ├── agents/
│   │   ├── file_discovery.py      ✅ Complete
│   │   ├── document_parsing.py    ✅ Complete
│   │   ├── table_extraction.py    ✅ Complete
│   │   ├── media_extraction.py    ✅ Complete
│   │   ├── vision_agent.py        ✅ Complete
│   │   └── indexing.py            ✅ Complete
│   ├── models/
│   │   ├── schemas.py             ✅ Complete
│   │   └── enums.py               ✅ Complete
│   └── utils/
│       ├── storage_manager.py
│       ├── file_classifier.py
│       ├── logger.py
│       └── groq_vision_client.py
├── tests/
│   └── agents/
│       ├── test_file_discovery.py
│       ├── test_document_parsing.py
│       ├── test_table_extraction.py
│       ├── test_media_extraction.py
│       └── test_vision_agent.py
├── app.py                         ✅ Streamlit App
├── requirements.txt
├── .env.example
└── docs/
    └── DOCUMENTATION.md           ✅ This file

Performance Benchmarks

Agent	Processing Time	Test Data
File Discovery	~1-2s	10 files ZIP
Document Parsing	~50ms/doc	PDF 10 pages
Table Extraction	~100ms/doc	5 tables
Media Extraction	~200ms/image	5 images
Vision Analysis	~2s/image	Groq API
Indexing	~500ms	50 pages

End-to-End: <2 minutes for typical business folder (10 documents, 5 images)

License

MIT License - See LICENSE file for details

Support

GitHub Issues: Report bugs and feature requests
Documentation: This file + inline code comments
Email: [Your contact here]

Last Updated: March 17, 2026
Version: 1.0.0
Status: Production Ready ✅