SPARKNET Changelog

All notable changes to the SPARKNET project are documented in this file.

[1.2.0] - 2026-01-20

Added (Phase 1B Continuation)

Table Extraction Preservation (FG-002) - HIGH PRIORITY

  • Enhanced SemanticChunker (src/document/chunking/chunker.py)

    • Table structure reconstruction from OCR regions
    • Markdown table generation with proper formatting
    • Header row detection using heuristics
    • Structured data storage in extra.table_structure
    • Cell positions preserved for evidence highlighting
    • Searchable text includes header context for better embedding
    • Configurable row/column thresholds
  • ChunkerConfig enhancements

    • preserve_table_structure - Enable markdown conversion
    • table_row_threshold - Y-coordinate grouping threshold
    • table_col_threshold - X-coordinate clustering threshold
    • detect_table_headers - Automatic header detection
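
The row grouping and markdown generation described above can be illustrated with a minimal sketch. It is not the SemanticChunker implementation: the `Region` class, the function names, and the default threshold are hypothetical. Columns are ordered by x-coordinate rather than clustered, and the first row is assumed to be the header instead of being detected heuristically.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(regions, row_threshold=10.0):
    """Group regions into rows. A region joins the current row when its
    y-coordinate is within row_threshold of that row's first region."""
    rows = []
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(region.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(region)
        else:
            rows.append([region])
    # Order cells left to right inside each row.
    return [sorted(row, key=lambda r: r.x) for row in rows]


def to_markdown(rows):
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows with empty cells so every line has the same column count.
    cells = [[r.text for r in row] + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(cells[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in cells[1:]]
    return "\n".join(lines)
```

Calling `to_markdown(group_rows(regions))` on the regions of one detected table yields text that can be stored alongside the original cell positions, which is roughly the role the changelog assigns to extra.table_structure.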

Nginx Configuration (TG-005)

  • Nginx Reverse Proxy (nginx/nginx.conf)
    • Production-ready reverse proxy configuration
    • Rate limiting (30 req/s API, 5 req/s uploads)
    • WebSocket support for Streamlit
    • SSE support for RAG streaming
    • Gzip compression
    • Security headers (XSS, CSRF protection)
    • SSL/TLS configuration (commented out, ready to enable in production)
    • Connection limits and timeout tuning

Integration Tests (TG-006)

  • API Integration Tests (tests/integration/test_api_v2.py)

    • TestClient-based testing without a running server
    • Health/status endpoint tests
    • Authentication flow tests
    • Document upload/process/index workflow
    • RAG query and search tests
    • Error handling verification
    • Concurrency tests
    • Performance benchmarks (marked slow)
  • Table Chunker Unit Tests (tests/unit/test_table_chunker.py)

    • Table structure reconstruction tests
    • Markdown generation tests
    • Header detection tests
    • Column detection tests
    • Edge case handling

Cross-Module State Synchronization (Phase 1B)

  • Enhanced State Manager (demo/state_manager.py)
    • Event system with pub/sub pattern
    • EventType enum for type-safe events
    • Evidence highlighting synchronization
    • Page/chunk selection sync across modules
    • RAG query/response sharing
    • Module-specific state storage
    • Sync version tracking for change detection
    • Helper components: render_evidence_panel(), render_document_selector()
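
As an illustration of the pub/sub pattern and sync version tracking listed above, here is a minimal sketch. The `EventBus` class and the `EventType` member names are assumptions made for the example and may not match demo/state_manager.py.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    # Member names are illustrative; the real enum may define different events.
    PAGE_SELECTED = auto()
    CHUNK_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()


class EventBus:
    """Hypothetical event bus: modules subscribe to an EventType and are
    called back whenever another module publishes that event."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0  # bumped on every publish for change detection

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        self.sync_version += 1
        # Iterate over a copy so callbacks may subscribe during dispatch.
        for callback in list(self._subscribers[event_type]):
            callback(payload)
```

A module that only cares whether anything changed can compare the `sync_version` it last saw against the current value instead of subscribing to every event type.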

[1.1.0] - 2026-01-20

Added

REST API (Phase 1B - TG-003)

  • Document API (api/routes/documents.py)

    • POST /api/documents/upload - Upload and process documents
    • GET /api/documents - List all documents with filtering
    • GET /api/documents/{doc_id} - Get document by ID
    • GET /api/documents/{doc_id}/detail - Get detailed document info
    • GET /api/documents/{doc_id}/chunks - Get document chunks
    • POST /api/documents/{doc_id}/process - Trigger processing
    • POST /api/documents/{doc_id}/index - Index to RAG
    • POST /api/documents/batch-index - Batch index multiple documents
    • DELETE /api/documents/{doc_id} - Delete a document
  • RAG API (api/routes/rag.py)

    • POST /api/rag/query - Execute RAG query with 5-agent pipeline
    • POST /api/rag/query/stream - Stream RAG response (SSE)
    • POST /api/rag/search - Semantic search without synthesis
    • GET /api/rag/store/status - Get vector store status
    • DELETE /api/rag/store/collection/{name} - Clear collection
    • GET /api/rag/cache/stats - Get cache statistics
    • DELETE /api/rag/cache - Clear query cache
  • API Schemas (api/schemas.py)

    • Request/response models for all endpoints
    • Document, Query, Search, Citation schemas
    • Pydantic validation with comprehensive field definitions

Authentication (Phase 1C - TG-002)

  • JWT Authentication (api/auth.py)
    • OAuth2 password bearer scheme
    • POST /api/auth/token - Get access token
    • POST /api/auth/register - Register new user
    • GET /api/auth/me - Get current user info
    • GET /api/auth/users - List users (admin only)
    • DELETE /api/auth/users/{username} - Delete user (admin only)
    • Password hashing with bcrypt
    • Default admin user creation on startup
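
The token flow above can be exercised from Python with only the standard library. This is a sketch under assumptions: the base URL is taken from the API Quick Reference examples, the helper names are hypothetical, and the `access_token` response field follows the usual OAuth2 convention rather than anything confirmed in this changelog.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port as in the Quick Reference


def build_token_request(username, password):
    """Build the form-encoded POST that the OAuth2 password flow expects."""
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def bearer_headers(token):
    """Headers for an authenticated call such as GET /api/auth/me."""
    return {"Authorization": f"Bearer {token}"}


def fetch_current_user(username, password):
    """Needs a running server. Assumes the token response carries an
    'access_token' field, which this changelog does not confirm."""
    with urllib.request.urlopen(build_token_request(username, password)) as resp:
        token = json.load(resp)["access_token"]
    request = urllib.request.Request(
        f"{BASE_URL}/api/auth/me", headers=bearer_headers(token)
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```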

Extended Document Support (Phase 1B - FG-001)

  • Added support for new document formats in document processing:
    • Word (.docx) - Full text and table extraction
    • Excel (.xlsx, .xls) - Multi-sheet extraction
    • PowerPoint (.pptx) - Slide-by-slide text extraction
    • Text (.txt) - Plain text processing
    • Markdown (.md) - Markdown file support

Caching (Phase 1B - TG-004)

  • Cache Manager (src/utils/cache_manager.py)
    • Redis-based caching with in-memory fallback
    • QueryCache - Cache RAG query results (1-hour TTL)
    • EmbeddingCache - Cache embeddings (24-hour TTL)
    • @cached decorator for function-level caching
    • Automatic cache cleanup and size limits
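
To show how function-level caching with a TTL and a size limit can work, here is a minimal in-memory sketch. It illustrates only the in-memory fallback idea, not the Redis path, and it is not the code in src/utils/cache_manager.py. The decorator name matches the changelog, but its parameters (`ttl_seconds`, `max_size`, `clock`) are assumptions.

```python
import functools
import time


def cached(ttl_seconds, max_size=128, clock=time.monotonic):
    """Cache a function's results in memory for ttl_seconds.
    The clock parameter exists only to make expiry testable."""

    def decorator(func):
        store = {}  # key -> (expires_at, value), kept in insertion order

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            entry = store.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]  # fresh hit
            value = func(*args, **kwargs)
            # Cleanup: drop expired entries, then evict the oldest
            # inserted entry if the cache is still at its size limit.
            for stale in [k for k, (exp, _) in store.items() if exp <= now]:
                del store[stale]
            if len(store) >= max_size:
                del store[next(iter(store))]
            store[key] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator
```

Decorating a query function with `@cached(ttl_seconds=3600)` would mirror the 1-hour lifetime given for QueryCache, and `ttl_seconds=86400` the 24-hour lifetime given for EmbeddingCache.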

Docker Containerization (Phase 1C - TG-007)

  • Dockerfile - Multi-stage build

    • Production stage with optimized image
    • Development stage with hot reload
    • Health checks and proper dependencies
  • docker-compose.yml - Full stack deployment

    • SPARKNET API service
    • Streamlit Demo service
    • Ollama LLM service with GPU support
    • ChromaDB vector store
    • Redis cache
    • Optional Nginx reverse proxy
  • docker-compose.dev.yml - Development configuration

    • Volume mounts for code changes
    • Hot reload enabled
    • Connects to host Ollama
  • .dockerignore - Optimized build context

Changed

API Main (api/main.py)

  • Enhanced lifespan initialization with graceful degradation
  • Added RAG component initialization
  • Improved health check with component status
  • New /api/status endpoint for comprehensive system status
  • Better error handling allowing partial functionality
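
Graceful degradation at startup can be sketched as follows. This is an assumption about the general approach, not the code in api/main.py: the function names, the component names, and the status strings are all hypothetical.

```python
def initialize_components(factories):
    """Run each component factory. A failure is recorded in the status map
    instead of aborting startup, so the remaining components still load."""
    components, status = {}, {}
    for name, factory in factories.items():
        try:
            components[name] = factory()
            status[name] = "ok"
        except Exception as exc:  # deliberately broad: any failure degrades
            status[name] = f"unavailable: {exc}"
    return components, status


def health(status):
    """Summarize component status the way a health endpoint might."""
    all_ok = all(state == "ok" for state in status.values())
    return {
        "status": "healthy" if all_ok else "degraded",
        "components": status,
    }
```

With this shape, an unreachable cache or vector store leaves the API running with partial functionality, and the health response reports which component is missing.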

Technical Details

New Files Created

api/
├── auth.py              # Authentication module
├── schemas.py           # Pydantic models
└── routes/
    ├── documents.py     # Document endpoints
    └── rag.py           # RAG endpoints

src/utils/
└── cache_manager.py     # Redis/memory caching

docker/
├── Dockerfile           # Multi-stage build
├── docker-compose.yml   # Production stack
├── docker-compose.dev.yml # Development stack
└── .dockerignore        # Build optimization

Dependencies Added

  • python-jose[cryptography] - JWT tokens
  • passlib[bcrypt] - Password hashing
  • python-multipart - Form data handling
  • redis - Redis client (optional)
  • python-docx - Word document support
  • openpyxl - Excel support
  • python-pptx - PowerPoint support

Configuration

  • SPARKNET_SECRET_KEY - JWT secret (environment variable)
  • REDIS_URL - Redis connection string
  • OLLAMA_HOST - Ollama server URL
  • CHROMA_HOST / CHROMA_PORT - ChromaDB connection
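
One way to read these variables in Python is sketched below. The `Settings` class and `load_settings` function are hypothetical, and every default is an assumption: the Ollama and ChromaDB values are those tools' usual defaults, not values confirmed by this changelog, and failing on a missing secret is a design choice made for the example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    secret_key: str
    redis_url: Optional[str]
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env=os.environ):
    """Read SPARKNET settings from a mapping of environment variables."""
    secret = env.get("SPARKNET_SECRET_KEY")
    if not secret:
        # Assumed policy: refuse to start without a JWT secret.
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    return Settings(
        secret_key=secret,
        redis_url=env.get("REDIS_URL"),  # None means no Redis is configured
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to test without touching the process environment.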

API Quick Reference

# Health check
curl http://localhost:8000/api/health

# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'

# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"

Docker Quick Start

# Production deployment
docker-compose up -d

# Development with hot reload
docker-compose -f docker-compose.dev.yml up

# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest

[1.0.0] - 2026-01-19

Initial Release

  • Multi-Agent RAG Pipeline (5 agents)
  • Document Processing Pipeline (OCR, Layout, Chunking)
  • Streamlit Demo Application (5 modules)
  • ChromaDB Vector Store
  • Ollama LLM Integration