# SPARKNET Changelog
All notable changes to the SPARKNET project are documented in this file.
## [1.2.0] - 2026-01-20
### Added (Phase 1B Continuation)
#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context for better embedding
  - Configurable row/column thresholds
- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection
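
The role of the row threshold can be pictured with a minimal sketch. This is not the code in `chunker.py`: `Region`, `group_rows`, `to_markdown`, and the threshold value are hypothetical names chosen for illustration, and cells are simply ordered left to right within each row instead of being clustered with a column threshold.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(regions, row_threshold=10.0):
    """Group regions into rows: a region joins the current row when its
    y-coordinate is within row_threshold of that row's first region."""
    rows = []
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(region.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(region)
        else:
            rows.append([region])
    # Order cells left to right inside each row.
    return [sorted(row, key=lambda r: r.x) for row in rows]


def to_markdown(rows, has_header=True):
    """Render grouped rows as a Markdown table, padding short rows."""
    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        cells = [r.text for r in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0 and has_header:
            # Separator line that marks the first row as the header.
            lines.append("|" + " --- |" * width)
    return "\n".join(lines)


regions = [
    Region("Name", 10, 100), Region("Score", 120, 102),
    Region("Alice", 10, 130), Region("91", 120, 131),
]
print(to_markdown(group_rows(regions)))
```

In this sketch a larger `row_threshold` tolerates more skew in scanned pages but risks merging tightly spaced rows.
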
#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented out, ready to enable for production)
  - Connection limits and timeout tuning
#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without a running server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)
- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling
#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`
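
A rough sketch of the pub/sub pattern with an enum of event types and a sync version counter follows. Only the name `EventType` comes from the changelog; its members, the `EventBus` class, and the payload shape are assumptions made for illustration and may not match `demo/state_manager.py`.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    """Illustrative members; the real enum may define different events."""
    PAGE_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()


class EventBus:
    """Hypothetical publisher that fans events out to subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0  # incremented on every published event

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        # Bump the version first so subscribers can detect the change.
        self.sync_version += 1
        for callback in self._subscribers[event_type]:
            callback(payload)
        return self.sync_version


bus = EventBus()
seen = []
bus.subscribe(EventType.PAGE_SELECTED, seen.append)
bus.publish(EventType.PAGE_SELECTED, {"doc_id": "doc-1", "page": 3})
```

A module that polls rather than subscribes could compare its last known `sync_version` against the current one to decide whether to re-render.
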
---
## [1.1.0] - 2026-01-20
### Added
#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index to RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document
- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache
- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions
#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup
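
From a client's point of view, the password bearer flow amounts to two requests, sketched here with the standard library only. The helper names are hypothetical, the base URL assumes a local deployment on port 8000, and the `access_token` field mentioned in the comments is the conventional OAuth2 name and not confirmed by this changelog.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local deployment


def build_token_request(username, password):
    """Form-encoded POST to the token endpoint (OAuth2 password flow)."""
    body = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/token",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_me_request(access_token):
    """GET the current user, passing the token as a Bearer credential."""
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/me",
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_token_request(user, pw)) as resp:
#       token = json.load(resp)["access_token"]  # field name is an assumption
token_request = build_token_request("alice", "s3cret")
me_request = build_me_request("example-token")
```
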
#### Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support
#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits
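
To illustrate function-level caching with a TTL, cleanup, and a size limit, here is a minimal in-memory sketch. It is not the decorator shipped in `cache_manager.py`: the signature (`ttl_seconds`, `max_size`, `clock`) is assumed for illustration, and the Redis backend is left out entirely.

```python
import functools
import time


def cached(ttl_seconds, max_size=1024, clock=time.monotonic):
    """In-memory TTL cache decorator (illustrative signature)."""
    def decorator(func):
        store = {}  # key -> (expires_at, value)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            # Cleanup: drop expired entries, then evict the oldest
            # inserted entry if the cache is still at its size limit.
            for stale in [k for k, (exp, _) in store.items() if exp <= now]:
                del store[stale]
            if len(store) >= max_size:
                del store[next(iter(store))]
            store[key] = (now + ttl_seconds, value)
            return value

        return wrapper
    return decorator


calls = []


@cached(ttl_seconds=3600)  # one hour, matching the query cache TTL above
def answer(query):
    calls.append(query)
    return f"answer to {query!r}"


answer("main findings")
answer("main findings")  # second call is served from the cache
```

The injectable `clock` is a design choice made here so that expiry can be tested without sleeping.
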
#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies
- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy
- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to host Ollama
- **.dockerignore** - Optimized build context
### Changed
#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New `/api/status` endpoint for comprehensive system status
- Better error handling allowing partial functionality
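
Graceful degradation at startup can be sketched as initializing each component independently and recording its status instead of aborting on the first failure. The function and component names below are hypothetical and do not reflect the actual contents of `api/main.py`.

```python
def init_components(factories):
    """Start each component on its own; a failure marks that component
    unavailable instead of stopping the whole application."""
    components, status = {}, {}
    for name, factory in factories.items():
        try:
            components[name] = factory()
            status[name] = "ok"
        except Exception as exc:  # keep serving with whatever did start
            status[name] = f"unavailable: {exc}"
    return components, status


def connect_redis():
    """Stand-in for a cache connection that fails (simulated outage)."""
    raise ConnectionError("redis not reachable")


components, status = init_components({
    "vector_store": lambda: object(),  # placeholder for a real client
    "cache": connect_redis,
})

# A status endpoint could report the per-component map plus a summary.
overall = "healthy" if all(v == "ok" for v in status.values()) else "degraded"
print(overall, status)
```
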
### Technical Details
#### New Files Created
```
api/
├── auth.py              # Authentication module
├── schemas.py           # Pydantic models
└── routes/
    ├── documents.py     # Document endpoints
    └── rag.py           # RAG endpoints
src/utils/
└── cache_manager.py     # Redis/memory caching
docker/
├── Dockerfile               # Multi-stage build
├── docker-compose.yml       # Production stack
├── docker-compose.dev.yml   # Development stack
└── .dockerignore            # Build optimization
```
#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support
#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection
### API Quick Reference
```bash
# Health check
curl http://localhost:8000/api/health
# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload
# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'
# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
```
### Docker Quick Start
```bash
# Production deployment
docker-compose up -d
# Development with hot reload
docker-compose -f docker-compose.dev.yml up
# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```
---
## [1.0.0] - 2026-01-19
### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration