SPARKNET Changelog
All notable changes to the SPARKNET project are documented in this file.
[1.2.0] - 2026-01-20
Added (Phase 1B Continuation)
Table Extraction Preservation (FG-002) - HIGH PRIORITY
Enhanced SemanticChunker (src/document/chunking/chunker.py)
- Table structure reconstruction from OCR regions
- Markdown table generation with proper formatting
- Header row detection using heuristics
- Structured data storage in extra.table_structure
- Cell positions preserved for evidence highlighting
- Searchable text includes header context for better embedding
- Configurable row/column thresholds
ChunkerConfig enhancements
- preserve_table_structure - Enable markdown conversion
- table_row_threshold - Y-coordinate grouping threshold
- table_col_threshold - X-coordinate clustering threshold
- detect_table_headers - Automatic header detection
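The row grouping and markdown generation listed above can be sketched roughly as follows. This is an illustration only, not the SemanticChunker code: the `Cell` type, the function names, and the default threshold of 10 units are assumptions, and the first row is simply assumed to be the header instead of being detected by heuristics.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(cells, row_threshold=10.0):
    """Group cells into rows by Y-coordinate.

    A cell joins the current row when its y is within row_threshold
    of the first cell of that row; otherwise it starts a new row.
    """
    rows = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        if rows and abs(cell.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right.
    return [sorted(row, key=lambda c: c.x) for row in rows]


def to_markdown(rows):
    """Render rows as a markdown table, assuming the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = []
    for index, row in enumerate(rows):
        # Pad short rows so every line has the same number of columns.
        texts = [cell.text for cell in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(texts) + " |")
        if index == 0:
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)
```

A column threshold would be applied the same way along the X axis; it is left out here to keep the sketch short.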
Nginx Configuration (TG-005)
- Nginx Reverse Proxy (nginx/nginx.conf)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented, ready for production)
  - Connection limits and timeout tuning
Integration Tests (TG-006)
API Integration Tests (tests/integration/test_api_v2.py)
- TestClient-based testing without server
- Health/status endpoint tests
- Authentication flow tests
- Document upload/process/index workflow
- RAG query and search tests
- Error handling verification
- Concurrency tests
- Performance benchmarks (marked slow)
Table Chunker Unit Tests (tests/unit/test_table_chunker.py)
- Table structure reconstruction tests
- Markdown generation tests
- Header detection tests
- Column detection tests
- Edge case handling
Cross-Module State Synchronization (Phase 1B)
- Enhanced State Manager (demo/state_manager.py)
  - Event system with pub/sub pattern
  - EventType enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: render_evidence_panel(), render_document_selector()
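As a rough picture of the pub/sub pattern with an enum of event types and a sync version counter, consider the sketch below. The `EventBus` class, its method names, and the two enum members are hypothetical; the changelog does not list the real members of EventType or the state manager's interface.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    """Illustrative members only; the real enum is not shown in this changelog."""
    PAGE_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()


class EventBus:
    """Minimal publish/subscribe hub with a version counter for change detection."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0

    def subscribe(self, event_type, callback):
        if not isinstance(event_type, EventType):
            raise TypeError("event_type must be an EventType")
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        if not isinstance(event_type, EventType):
            raise TypeError("event_type must be an EventType")
        # Every published event bumps the version so modules can detect changes.
        self.sync_version += 1
        for callback in self._subscribers[event_type]:
            callback(payload)
        return self.sync_version
```

A module would subscribe once, for example `bus.subscribe(EventType.PAGE_SELECTED, on_page_selected)`, and compare `bus.sync_version` with the last value it saw to decide whether to re-render.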
[1.1.0] - 2026-01-20
Added
REST API (Phase 1B - TG-003)
Document API (api/routes/documents.py)
- POST /api/documents/upload - Upload and process documents
- GET /api/documents - List all documents with filtering
- GET /api/documents/{doc_id} - Get document by ID
- GET /api/documents/{doc_id}/detail - Get detailed document info
- GET /api/documents/{doc_id}/chunks - Get document chunks
- POST /api/documents/{doc_id}/process - Trigger processing
- POST /api/documents/{doc_id}/index - Index to RAG
- POST /api/documents/batch-index - Batch index multiple documents
- DELETE /api/documents/{doc_id} - Delete a document
RAG API (api/routes/rag.py)
- POST /api/rag/query - Execute RAG query with 5-agent pipeline
- POST /api/rag/query/stream - Stream RAG response (SSE)
- POST /api/rag/search - Semantic search without synthesis
- GET /api/rag/store/status - Get vector store status
- DELETE /api/rag/store/collection/{name} - Clear collection
- GET /api/rag/cache/stats - Get cache statistics
- DELETE /api/rag/cache - Clear query cache
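On the client side, a response from the streaming endpoint arrives as Server-Sent Events. The sketch below shows one way to pull the data payloads out of such a stream using the generic SSE framing (events separated by blank lines, `data:` fields, `:` comment lines). The function name is hypothetical, and the shape of each payload sent by /api/rag/query/stream is not documented here, so payloads are returned as plain strings.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete event from decoded SSE lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive ping.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event that is not closed by a blank line is dropped as incomplete.
```

Other SSE fields such as `event:` and `id:` are ignored in this sketch.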
API Schemas (api/schemas.py)
- Request/response models for all endpoints
- Document, Query, Search, Citation schemas
- Pydantic validation with comprehensive field definitions
Authentication (Phase 1C - TG-002)
- JWT Authentication (api/auth.py)
  - OAuth2 password bearer scheme
  - POST /api/auth/token - Get access token
  - POST /api/auth/register - Register new user
  - GET /api/auth/me - Get current user info
  - GET /api/auth/users - List users (admin only)
  - DELETE /api/auth/users/{username} - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup
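To show what an access token of this kind contains, here is a standard-library sketch of signing and verifying an HS256 JWT. It is a stand-in for illustration, not the code in api/auth.py: the project lists python-jose for JWT handling, and the function names, the claim set (`sub`, `exp`), and the 30-minute default lifetime are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(subject, secret, expires_in=1800, now=None):
    """Build header.payload.signature for the given subject."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": subject, "exp": now + expires_in}).encode())
    signature = _sign(f"{header}.{payload}", secret)
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_token(token, secret, now=None):
    """Return the claims, or None if the token is malformed, forged, or expired."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, signature = token.split(".")
        expected = _sign(f"{header}.{payload}", secret)
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except ValueError:
        return None
    return claims if claims.get("exp", 0) > now else None
```

In a deployment the secret would come from SPARKNET_SECRET_KEY, and a maintained library such as python-jose should do this work; the sketch only makes the mechanics visible.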
Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - Word (.docx) - Full text and table extraction
  - Excel (.xlsx, .xls) - Multi-sheet extraction
  - PowerPoint (.pptx) - Slide-by-slide text extraction
  - Text (.txt) - Plain text processing
  - Markdown (.md) - Markdown file support
Caching (Phase 1B - TG-004)
- Cache Manager (src/utils/cache_manager.py)
  - Redis-based caching with in-memory fallback
  - QueryCache - Cache RAG query results (1 hour TTL)
  - EmbeddingCache - Cache embeddings (24 hour TTL)
  - @cached decorator for function-level caching
  - Automatic cache cleanup and size limits
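The in-memory side of such a cache, with a TTL, cleanup of expired entries, and a size limit, might look like the sketch below. It is not the code in cache_manager.py: the decorator signature, the `clock` parameter (added here so the behavior can be tested without sleeping), and the eviction order are assumptions, and the Redis path is omitted.

```python
import functools
import time


def cached(ttl_seconds=3600, max_size=1024, clock=time.monotonic):
    """Cache a function's results in memory for ttl_seconds (max_size >= 1)."""
    def decorator(func):
        store = {}  # key -> (expires_at, value), kept in insertion order

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            # Cleanup: drop expired entries, then the oldest one if still full.
            for stale in [k for k, (expires_at, _) in store.items() if expires_at <= now]:
                del store[stale]
            if len(store) >= max_size:
                del store[next(iter(store))]
            store[key] = (now + ttl_seconds, value)
            return value

        wrapper.cache = store  # exposed for inspection
        return wrapper
    return decorator
```

With the TTLs listed above, a query cache would use `@cached(ttl_seconds=3600)` and an embedding cache `@cached(ttl_seconds=86400)`.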
Docker Containerization (Phase 1C - TG-007)
Dockerfile - Multi-stage build
- Production stage with optimized image
- Development stage with hot reload
- Health checks and proper dependencies
docker-compose.yml - Full stack deployment
- SPARKNET API service
- Streamlit Demo service
- Ollama LLM service with GPU support
- ChromaDB vector store
- Redis cache
- Optional Nginx reverse proxy
docker-compose.dev.yml - Development configuration
- Volume mounts for code changes
- Hot reload enabled
- Connects to host Ollama
.dockerignore - Optimized build context
Changed
API Main (api/main.py)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New /api/status endpoint for comprehensive system status
- Better error handling allowing partial functionality
Technical Details
New Files Created
api/
├── auth.py                 # Authentication module
├── schemas.py              # Pydantic models
└── routes/
    ├── documents.py        # Document endpoints
    └── rag.py              # RAG endpoints
src/utils/
└── cache_manager.py        # Redis/memory caching
docker/
├── Dockerfile              # Multi-stage build
├── docker-compose.yml      # Production stack
├── docker-compose.dev.yml  # Development stack
└── .dockerignore           # Build optimization
Dependencies Added
- python-jose[cryptography] - JWT tokens
- passlib[bcrypt] - Password hashing
- python-multipart - Form data handling
- redis - Redis client (optional)
- python-docx - Word document support
- openpyxl - Excel support
- python-pptx - PowerPoint support
Configuration
- SPARKNET_SECRET_KEY - JWT secret (environment variable)
- REDIS_URL - Redis connection string
- OLLAMA_HOST - Ollama server URL
- CHROMA_HOST / CHROMA_PORT - ChromaDB connection
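One way to read these variables at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical, as are the fallback values and the choice to fail when the secret is missing; only the variable names come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    secret_key: str
    redis_url: Optional[str]  # None means "no Redis, use the in-memory cache"
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    secret = env.get("SPARKNET_SECRET_KEY")
    if not secret:
        # Assumption: refuse to start without a JWT secret.
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    return Settings(
        secret_key=secret,
        redis_url=env.get("REDIS_URL"),
        # Assumed fallbacks: the stock Ollama and ChromaDB ports on localhost.
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.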
API Quick Reference
# Health check
curl http://localhost:8000/api/health
# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload
# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
-H "Content-Type: application/json" \
-d '{"query": "What are the main findings?"}'
# Get token
curl -X POST http://localhost:8000/api/auth/token \
-d "username=admin&password=admin123"
Docker Quick Start
# Production deployment
docker-compose up -d
# Development with hot reload
docker-compose -f docker-compose.dev.yml up
# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
[1.0.0] - 2026-01-19
Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration