# SPARKNET Changelog
All notable changes to the SPARKNET project are documented in this file.
## [1.2.0] - 2026-01-20
### Added (Phase 1B Continuation)
#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context for better embedding
  - Configurable row/column thresholds
- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection
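
The row-grouping idea behind `table_row_threshold` can be illustrated with a minimal sketch. This is not the `SemanticChunker` implementation: the `Cell` type, the function names, and the default threshold are hypothetical, and column clustering (`table_col_threshold`) and header detection are left out, so the first row is simply treated as the header.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(cells, row_threshold=10.0):
    """Group cells into rows: a cell joins the current row when its y
    coordinate is within row_threshold of that row's first cell."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        if rows and abs(cell.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right.
    return [sorted(row, key=lambda c: c.x) for row in rows]


def to_markdown(rows):
    """Render rows as a markdown table, treating the first row as the header."""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    grid = [[c.text for c in row] + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(grid[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)


cells = [
    Cell("Name", 10, 100), Cell("Qty", 120, 102),
    Cell("Bolt", 10, 130), Cell("40", 120, 131),
]
print(to_markdown(group_rows(cells)))
```

Cells whose y coordinates differ by a pixel or two, as is common in OCR output, still land in the same row, which is the point of using a threshold instead of exact equality.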
#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented, ready for production)
  - Connection limits and timeout tuning
#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without a running server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)
- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling
#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`
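
The pub/sub pattern with an enum-typed event and a sync version counter might look roughly like the sketch below. Only the name `EventType` comes from the entry above. Its members, the `EventBus` class, and the payload shapes are hypothetical, and the real state manager presumably sits on top of Streamlit session state, which this sketch omits.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    """Illustrative members only; the real enum's members are not listed here."""
    CHUNK_SELECTED = auto()
    RAG_RESPONSE = auto()


class EventBus:
    """Hypothetical minimal publisher/subscriber registry."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0  # bumped on every publish for change detection

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.sync_version += 1
        # Iterate over a copy so handlers may subscribe while being notified.
        for handler in list(self._subscribers[event_type]):
            handler(payload)


bus = EventBus()
seen = []
bus.subscribe(EventType.CHUNK_SELECTED, seen.append)
bus.publish(EventType.CHUNK_SELECTED, {"chunk_id": "c-1", "page": 3})
bus.publish(EventType.RAG_RESPONSE, {"answer": "..."})  # no subscribers yet
print(seen, bus.sync_version)
```

A module that only needs to know whether anything changed can compare `sync_version` against the value it last saw instead of subscribing to every event type.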
---
## [1.1.0] - 2026-01-20
### Added
#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index to RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document
- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache
- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions
#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup
#### Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support
#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits
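
To make the function-level caching concrete, here is a minimal in-memory sketch of a TTL decorator with a size limit. It shares only the name `cached` with the real decorator. The parameters `ttl_seconds`, `max_size`, and `clock` are assumptions for illustration, and the Redis backend is not modelled.

```python
import functools
import time


def cached(ttl_seconds=3600, max_size=128, clock=time.monotonic):
    """Cache results in a per-function dict; entries expire after ttl_seconds.

    Hypothetical signature: arguments must be hashable, and `clock` is
    injectable only so expiry can be exercised without sleeping.
    """
    def decorator(func):
        store = {}  # key -> (expires_at, value)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            if len(store) >= max_size:
                # Size limit: drop expired entries, else the oldest insert.
                expired = [k for k, (exp, _) in store.items() if exp <= now]
                for k in expired or [next(iter(store))]:
                    del store[k]
            store[key] = (now + ttl_seconds, value)
            return value

        wrapper.cache_clear = store.clear
        return wrapper
    return decorator


calls = []


@cached(ttl_seconds=3600)  # one hour, matching the QueryCache TTL above
def answer(query):
    calls.append(query)
    return query.upper()


answer("findings")
answer("findings")  # served from the cache, so answer() runs only once
print(len(calls))
```

Using a monotonic clock for expiry keeps entries from living too long or expiring early when the system clock is adjusted.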
#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies
- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy
- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to host Ollama
- **.dockerignore** - Optimized build context
### Changed
#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New `/api/status` endpoint for comprehensive system status
- Better error handling allowing partial functionality
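
Graceful degradation at startup can be sketched without any framework: each component is initialized independently, and a failure is recorded in the status report instead of aborting the whole application. The helper `init_components`, the component names, and the status strings below are hypothetical and are not taken from `api/main.py`.

```python
def init_components(factories):
    """Run each factory; record failures instead of aborting startup."""
    components, status = {}, {}
    for name, factory in factories.items():
        try:
            components[name] = factory()
            status[name] = "ok"
        except Exception as exc:  # broad on purpose: any failure degrades
            status[name] = f"unavailable: {exc}"
    return components, status


def connect_redis():
    # Simulated outage, standing in for an unreachable cache backend.
    raise ConnectionError("redis not reachable")


components, status = init_components({
    "vector_store": lambda: object(),  # placeholder for a real client
    "cache": connect_redis,
})
overall = "healthy" if all(s == "ok" for s in status.values()) else "degraded"
print(overall, status)
```

A health or status endpoint can then return `overall` together with the per-component `status` mapping, so a client sees which features are unavailable while the rest of the API keeps serving.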
### Technical Details
#### New Files Created
```
api/
├── auth.py                  # Authentication module
├── schemas.py               # Pydantic models
└── routes/
    ├── documents.py         # Document endpoints
    └── rag.py               # RAG endpoints
src/utils/
└── cache_manager.py         # Redis/memory caching
docker/
├── Dockerfile               # Multi-stage build
├── docker-compose.yml       # Production stack
├── docker-compose.dev.yml   # Development stack
└── .dockerignore            # Build optimization
```
#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support
#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection
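
One way to read these variables in a single place is sketched below. The `Settings` class and `load_settings` function are hypothetical, and the fallback values are assumptions (common local defaults for Ollama and ChromaDB), not values confirmed by SPARKNET. Failing when the secret is missing is also a design choice of this sketch, not documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    secret_key: str
    redis_url: Optional[str]
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    # Fail fast on the JWT secret rather than fall back to a guessable value.
    secret = env.get("SPARKNET_SECRET_KEY")
    if not secret:
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    return Settings(
        secret_key=secret,
        redis_url=env.get("REDIS_URL"),  # None means: use the in-memory cache
        # Assumed defaults below; adjust to the actual deployment.
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
    )


settings = load_settings({"SPARKNET_SECRET_KEY": "change-me", "CHROMA_PORT": "8001"})
print(settings.chroma_host, settings.chroma_port, settings.redis_url)
```

Passing the environment as a plain mapping keeps the loader easy to test, since no process-level variables need to be modified.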
### API Quick Reference
```bash
# Health check
curl http://localhost:8000/api/health
# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload
# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'
# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
```
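
For scripted use, the token and query calls above can be mirrored with the Python standard library. The sketch below only builds the requests, and `send()` performs the network call. The helper names are hypothetical, and it is an assumption that `/api/rag/query` accepts a bearer token in the `Authorization` header.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def token_request(username, password):
    """Form-encoded POST to the token endpoint, like `curl -d`."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/token",
        data=body.encode(),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def rag_query_request(query, token=None):
    """JSON POST to the RAG query endpoint, optionally with a bearer token."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: protected endpoints expect a standard bearer header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/api/rag/query",
        data=json.dumps({"query": query}).encode(),
        method="POST",
        headers=headers,
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the API running, `send(token_request("admin", "admin123"))` should return the token response. The name of the field that holds the token (commonly `access_token` in OAuth2 password flows) is not stated in this changelog, so check the actual response before relying on it.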
### Docker Quick Start
```bash
# Production deployment
docker-compose up -d
# Development with hot reload
docker-compose -f docker-compose.dev.yml up
# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```
---
## [1.0.0] - 2026-01-19
### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration