	# SPARKNET Changelog

	All notable changes to the SPARKNET project are documented in this file.

	## [1.2.0] - 2026-01-20

	### Added (Phase 1B Continuation)

	#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
	- Enhanced SemanticChunker (`src/document/chunking/chunker.py`)
	  - Table structure reconstruction from OCR regions
	  - Markdown table generation with proper formatting
	  - Header row detection using heuristics
	  - Structured data storage in `extra.table_structure`
	  - Cell positions preserved for evidence highlighting
	  - Searchable text includes header context for better embedding
	  - Configurable row/column thresholds

	- ChunkerConfig enhancements
	  - `preserve_table_structure` - Enable markdown conversion
	  - `table_row_threshold` - Y-coordinate grouping threshold
	  - `table_col_threshold` - X-coordinate clustering threshold
	  - `detect_table_headers` - Automatic header detection
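
The row grouping and Markdown generation described above can be sketched roughly as follows. This is a minimal illustration, not the SemanticChunker code: `OcrRegion`, `group_rows`, and `to_markdown` are hypothetical names, the first row is simply treated as the header, and column clustering is reduced to a left-to-right sort.

```python
from dataclasses import dataclass


@dataclass
class OcrRegion:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(regions, row_threshold=10.0):
    """Group regions into rows by y-coordinate.

    A region joins the current row while its y stays within
    row_threshold of the first region of that row.
    """
    rows = []
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(region.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(region)
        else:
            rows.append([region])
    # Order cells left to right inside each row.
    return [sorted(row, key=lambda r: r.x) for row in rows]


def to_markdown(rows):
    """Render rows as a Markdown table; the first row is assumed to be the header."""
    width = max(len(row) for row in rows)
    lines = []
    for index, row in enumerate(rows):
        # Pad short rows so every line has the same number of cells.
        cells = [r.text for r in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(cells) + " |")
        if index == 0:
            lines.append("|" + " --- |" * width)
    return "\n".join(lines)
```

The `row_threshold` argument plays the role of `table_row_threshold`; a real implementation would also cluster x-coordinates (`table_col_threshold`) so that empty cells land in the right column.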

	#### Nginx Configuration (TG-005)
	- Nginx Reverse Proxy (`nginx/nginx.conf`)
	  - Production-ready reverse proxy configuration
	  - Rate limiting (30 req/s API, 5 req/s uploads)
	  - WebSocket support for Streamlit
	  - SSE support for RAG streaming
	  - Gzip compression
	  - Security headers (XSS, CSRF protection)
	  - SSL/TLS configuration (commented out, ready to enable for production)
	  - Connection limits and timeout tuning

	#### Integration Tests (TG-006)
	- API Integration Tests (`tests/integration/test_api_v2.py`)
	  - TestClient-based testing without server
	  - Health/status endpoint tests
	  - Authentication flow tests
	  - Document upload/process/index workflow
	  - RAG query and search tests
	  - Error handling verification
	  - Concurrency tests
	  - Performance benchmarks (marked slow)

	- Table Chunker Unit Tests (`tests/unit/test_table_chunker.py`)
	  - Table structure reconstruction tests
	  - Markdown generation tests
	  - Header detection tests
	  - Column detection tests
	  - Edge case handling

	#### Cross-Module State Synchronization (Phase 1B)
	- Enhanced State Manager (`demo/state_manager.py`)
	  - Event system with pub/sub pattern
	  - `EventType` enum for type-safe events
	  - Evidence highlighting synchronization
	  - Page/chunk selection sync across modules
	  - RAG query/response sharing
	  - Module-specific state storage
	  - Sync version tracking for change detection
	  - Helper components: `render_evidence_panel()`, `render_document_selector()`
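
For readers unfamiliar with the pattern, a pub/sub event system keyed by an enum might look like the sketch below. The member names of `EventType`, the `EventBus` class, and its methods are assumptions made for illustration; the actual interface lives in `demo/state_manager.py`.

```python
from collections import defaultdict
from enum import Enum, auto
from typing import Callable, Dict, List


class EventType(Enum):
    # Member names are illustrative assumptions, not the project's actual enum.
    PAGE_SELECTED = auto()
    CHUNK_SELECTED = auto()
    RAG_RESPONSE = auto()


Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-process event bus with a sync version counter."""

    def __init__(self) -> None:
        self._subscribers: Dict[EventType, List[Handler]] = defaultdict(list)
        self.sync_version = 0  # bumped on every publish for change detection

    def subscribe(self, event_type: EventType, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: EventType, payload: dict) -> None:
        self.sync_version += 1
        # Copy the list so handlers may subscribe or unsubscribe while we iterate.
        for handler in list(self._subscribers.get(event_type, [])):
            handler(payload)
```

A module can compare the `sync_version` it last saw with the current value to decide whether it needs to re-render.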

	---

	## [1.1.0] - 2026-01-20

	### Added

	#### REST API (Phase 1B - TG-003)
	- Document API (`api/routes/documents.py`)
	  - `POST /api/documents/upload` - Upload and process documents
	  - `GET /api/documents` - List all documents with filtering
	  - `GET /api/documents/{doc_id}` - Get document by ID
	  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
	  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
	  - `POST /api/documents/{doc_id}/process` - Trigger processing
	  - `POST /api/documents/{doc_id}/index` - Index to RAG
	  - `POST /api/documents/batch-index` - Batch index multiple documents
	  - `DELETE /api/documents/{doc_id}` - Delete a document

	- RAG API (`api/routes/rag.py`)
	  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
	  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
	  - `POST /api/rag/search` - Semantic search without synthesis
	  - `GET /api/rag/store/status` - Get vector store status
	  - `DELETE /api/rag/store/collection/{name}` - Clear collection
	  - `GET /api/rag/cache/stats` - Get cache statistics
	  - `DELETE /api/rag/cache` - Clear query cache
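
The streaming endpoint uses Server-Sent Events, where each event is one or more `field: value` lines followed by a blank line. The sketch below shows that framing only; the `token` payload key and the closing `done` event are assumptions, not the documented response format of `/api/rag/query/stream`.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE event (payload shape is hypothetical)."""
    for token in tokens:
        payload = json.dumps({"token": token})
        # A blank line terminates the event.
        yield f"data: {payload}\n\n"
    # Assumed convention: a named event tells the client the stream is complete.
    yield "event: done\ndata: {}\n\n"
```

JSON-encoding the payload keeps newlines inside a token from breaking the framing, since a raw newline would otherwise end the `data:` line.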

	- API Schemas (`api/schemas.py`)
	  - Request/response models for all endpoints
	  - Document, Query, Search, Citation schemas
	  - Pydantic validation with comprehensive field definitions

	#### Authentication (Phase 1C - TG-002)
	- JWT Authentication (`api/auth.py`)
	  - OAuth2 password bearer scheme
	  - `POST /api/auth/token` - Get access token
	  - `POST /api/auth/register` - Register new user
	  - `GET /api/auth/me` - Get current user info
	  - `GET /api/auth/users` - List users (admin only)
	  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
	  - Password hashing with bcrypt
	  - Default admin user creation on startup
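
To show what issuing and checking a signed token involves, here is a standard-library sketch of an HS256-signed JWT. The project itself relies on `python-jose` (see Dependencies Added), so this is not the code in `api/auth.py`; the function names, the claim set, and the choice of HS256 are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256)
    return _b64url(digest.digest())


def create_token(subject: str, secret: str, ttl_seconds: int = 3600, now=None) -> str:
    """Build header.payload.signature with an expiry claim."""
    issued = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": subject, "exp": issued + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(f'{header}.{payload}', secret)}"


def verify_token(token: str, secret: str, now=None) -> dict:
    """Return the claims, or raise ValueError if the token is forged or expired."""
    header, payload, signature = token.split(".")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(_sign(f"{header}.{payload}", secret), signature):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload))
    current = time.time() if now is None else now
    if claims["exp"] < current:
        raise ValueError("token expired")
    return claims
```

In practice a maintained library should do this work; the sketch only makes the moving parts (secret, expiry, signature check) visible.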

	#### Extended Document Support (Phase 1B - FG-001)
	- Added support for new document formats in document processing:
	  - Word (.docx) - Full text and table extraction
	  - Excel (.xlsx, .xls) - Multi-sheet extraction
	  - PowerPoint (.pptx) - Slide-by-slide text extraction
	  - Text (.txt) - Plain text processing
	  - Markdown (.md) - Markdown file support
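
One common way to route these formats is a lookup on the file extension. The sketch below assumes hypothetical extractor names and covers only the extensions added in this release; it is not taken from the document processing module.

```python
from pathlib import Path

# Extensions listed above, mapped to hypothetical extractor names.
EXTRACTORS = {
    ".docx": "word",
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "powerpoint",
    ".txt": "text",
    ".md": "markdown",
}


def pick_extractor(filename: str) -> str:
    """Return the extractor name for a file, matching the extension case-insensitively."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported document format: {filename!r}")
    return EXTRACTORS[suffix]
```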

	#### Caching (Phase 1B - TG-004)
	- Cache Manager (`src/utils/cache_manager.py`)
	  - Redis-based caching with in-memory fallback
	  - `QueryCache` - Cache RAG query results (1 hour TTL)
	  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
	  - `@cached` decorator for function-level caching
	  - Automatic cache cleanup and size limits
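
As an illustration of function-level caching with a TTL and a size limit, an in-memory decorator could be sketched like this. The signature (`ttl_seconds`, `max_size`, `clock`) is an assumption and may differ from the `@cached` decorator in `src/utils/cache_manager.py`, which also supports Redis.

```python
import functools
import time


def cached(ttl_seconds: float, max_size: int = 128, clock=time.monotonic):
    """Cache results in memory for ttl_seconds, keeping at most max_size entries."""

    def decorator(func):
        store = {}  # key -> (expires_at, value), in insertion order

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            # Cleanup: drop expired entries, then evict the oldest while full.
            for stale in [k for k, (expires, _) in store.items() if expires <= now]:
                del store[stale]
            while len(store) >= max_size:
                store.pop(next(iter(store)))
            store[key] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator
```

With this shape, the TTLs listed above would correspond to `ttl_seconds=3600` for query results and `ttl_seconds=86400` for embeddings. Arguments must be hashable for the key to work.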

	#### Docker Containerization (Phase 1C - TG-007)
	- Dockerfile - Multi-stage build
	  - Production stage with optimized image
	  - Development stage with hot reload
	  - Health checks and proper dependencies

	- docker-compose.yml - Full stack deployment
	  - SPARKNET API service
	  - Streamlit Demo service
	  - Ollama LLM service with GPU support
	  - ChromaDB vector store
	  - Redis cache
	  - Optional Nginx reverse proxy

	- docker-compose.dev.yml - Development configuration
	  - Volume mounts for code changes
	  - Hot reload enabled
	  - Connects to host Ollama

	- .dockerignore - Optimized build context

	### Changed

	#### API Main (`api/main.py`)
	- Enhanced lifespan initialization with graceful degradation
	- Added RAG component initialization
	- Improved health check with component status
	- New `/api/status` endpoint for comprehensive system status
	- Better error handling allowing partial functionality
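
Graceful degradation at startup usually means initializing each component independently and recording failures instead of aborting. A minimal sketch under that assumption follows; `init_components` and the status strings are hypothetical and not the code in `api/main.py`.

```python
from typing import Callable, Dict, Tuple


def init_components(
    factories: Dict[str, Callable[[], object]]
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Try every factory; keep what works and report what does not."""
    components: Dict[str, object] = {}
    status: Dict[str, str] = {}
    for name, factory in factories.items():
        try:
            components[name] = factory()
            status[name] = "ok"
        except Exception as exc:  # one failing component must not stop startup
            status[name] = f"unavailable: {exc}"
    return components, status
```

The `status` mapping is the kind of per-component information a health or status endpoint can return, while request handlers check `components` before using an optional service.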

	### Technical Details

	#### New Files Created
	```
	api/
	├── auth.py                  # Authentication module
	├── schemas.py               # Pydantic models
	└── routes/
	    ├── documents.py         # Document endpoints
	    └── rag.py               # RAG endpoints

	src/utils/
	└── cache_manager.py         # Redis/memory caching

	docker/
	├── Dockerfile               # Multi-stage build
	├── docker-compose.yml       # Production stack
	├── docker-compose.dev.yml   # Development stack
	└── .dockerignore            # Build optimization
	```

	#### Dependencies Added
	- `python-jose[cryptography]` - JWT tokens
	- `passlib[bcrypt]` - Password hashing
	- `python-multipart` - Form data handling
	- `redis` - Redis client (optional)
	- `python-docx` - Word document support
	- `openpyxl` - Excel support
	- `python-pptx` - PowerPoint support

	#### Configuration
	- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
	- `REDIS_URL` - Redis connection string
	- `OLLAMA_HOST` - Ollama server URL
	- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection
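
These variables can be read once at startup into a typed settings object. The sketch below is an assumption about how that might be done: the `Settings` class, the default values, and the choice to require `SPARKNET_SECRET_KEY` are illustrative, not the project's actual behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    secret_key: str
    redis_url: Optional[str]
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables; defaults here are placeholders."""
    secret = env.get("SPARKNET_SECRET_KEY")
    if not secret:
        # Assumed policy: refuse to start without an explicit JWT secret.
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    return Settings(
        secret_key=secret,
        redis_url=env.get("REDIS_URL"),  # None means: use the in-memory cache
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.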

	### API Quick Reference

	```bash
	# Health check
	curl http://localhost:8000/api/health

	# Upload document
	curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

	# Query RAG
	curl -X POST http://localhost:8000/api/rag/query \
	  -H "Content-Type: application/json" \
	  -d '{"query": "What are the main findings?"}'

	# Get token
	curl -X POST http://localhost:8000/api/auth/token \
	  -d "username=admin&password=admin123"
	```

	### Docker Quick Start

	```bash
	# Production deployment
	docker-compose up -d

	# Development with hot reload
	docker-compose -f docker-compose.dev.yml up

	# Pull Ollama models
	docker exec sparknet-ollama ollama pull llama3.2:latest
	docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
	```

	---

	## [1.0.0] - 2026-01-19

	### Initial Release
	- Multi-Agent RAG Pipeline (5 agents)
	- Document Processing Pipeline (OCR, Layout, Chunking)
	- Streamlit Demo Application (5 modules)
	- ChromaDB Vector Store
	- Ollama LLM Integration