# SPARKNET Changelog

All notable changes to the SPARKNET project are documented in this file.

## [1.2.0] - 2026-01-20

### Added (Phase 1B Continuation)

#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context for better embedding
  - Configurable row/column thresholds
- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection

#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented, ready for production)
  - Connection limits and timeout tuning

#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)
- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling

#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`

---

## [1.1.0] - 2026-01-20

### Added

#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index to RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document
- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache
- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions

#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup

#### Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support

#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits

#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies
- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy
- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to host Ollama
- **.dockerignore** - Optimized build context

### Changed

#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New `/api/status` endpoint for comprehensive system status
- Better error handling allowing partial functionality

### Technical Details

#### New Files Created
```
api/
├── auth.py                  # Authentication module
├── schemas.py               # Pydantic models
└── routes/
    ├── documents.py         # Document endpoints
    └── rag.py               # RAG endpoints

src/utils/
└── cache_manager.py         # Redis/memory caching

docker/
├── Dockerfile               # Multi-stage build
├── docker-compose.yml       # Production stack
├── docker-compose.dev.yml   # Development stack
└── .dockerignore            # Build optimization
```

#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support

#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection

### API Quick Reference

```bash
# Health check
curl http://localhost:8000/api/health

# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'

# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
```

### Docker Quick Start

```bash
# Production deployment
docker-compose up -d

# Development with hot reload
docker-compose -f docker-compose.dev.yml up

# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```

---

## [1.0.0] - 2026-01-19

### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration
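
---

The table reconstruction listed under 1.2.0 (grouping OCR regions by Y-coordinate into rows, clustering X-coordinates into columns, then emitting markdown) can be illustrated roughly as follows. This is a minimal sketch, not the `SemanticChunker` code: `Region`, `group_rows`, `column_anchors`, and `to_markdown` are hypothetical names, the two threshold parameters only mirror `table_row_threshold` and `table_col_threshold` in spirit, and treating the first row as the header is an assumed stand-in for the real header heuristics.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(regions, row_threshold=10.0):
    """Group regions into rows: a region joins the current row when its y
    is within row_threshold of that row's first region."""
    rows = []
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(region.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(region)
        else:
            rows.append([region])
    return rows


def column_anchors(regions, col_threshold=20.0):
    """Cluster x-coordinates: a new column starts when the gap to the
    previous column anchor exceeds col_threshold."""
    anchors = []
    for x in sorted(r.x for r in regions):
        if not anchors or x - anchors[-1] > col_threshold:
            anchors.append(x)
    return anchors


def to_markdown(regions, row_threshold=10.0, col_threshold=20.0):
    """Rebuild a grid from regions and render it as a markdown table.
    Assumption: the first row is the header (a deliberately naive heuristic)."""
    if not regions:
        return ""
    anchors = column_anchors(regions, col_threshold)
    grid = []
    for row in group_rows(regions, row_threshold):
        cells = [""] * len(anchors)
        for region in row:
            # Assign the region to the nearest column anchor.
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - region.x))
            cells[col] = (cells[col] + " " + region.text).strip()
        grid.append(cells)
    header, body = grid[0], grid[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Calling `to_markdown` on a list of `Region` objects returns the table as a single markdown string. Anchoring each column at the leftmost x of its cluster keeps the sketch short; a real implementation would likely also use box widths and keep cell positions for evidence highlighting.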