SPARKNET Changelog

All notable changes to the SPARKNET project are documented in this file.

[1.2.0] - 2026-01-20

Added (Phase 1B Continuation)

Table Extraction Preservation (FG-002) - HIGH PRIORITY

Enhanced SemanticChunker (src/document/chunking/chunker.py)
- Table structure reconstruction from OCR regions
- Markdown table generation with proper formatting
- Header row detection using heuristics
- Structured data storage in extra.table_structure
- Cell positions preserved for evidence highlighting
- Searchable text includes header context for better embedding
- Configurable row/column thresholds
ChunkerConfig enhancements
- preserve_table_structure - Enable markdown conversion
- table_row_threshold - Y-coordinate grouping threshold
- table_col_threshold - X-coordinate clustering threshold
- detect_table_headers - Automatic header detection
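
The row grouping and markdown generation listed above can be sketched roughly as follows. This is a minimal illustration and not the SemanticChunker code: the `Cell` type, the function names, and the default threshold are assumptions made for the example. It groups regions by Y-coordinate and orders them by X-coordinate; X-coordinate clustering into aligned columns and heuristic header detection are left out, and the first row is simply treated as the header.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(cells, row_threshold=10.0):
    """Group cells into rows: a cell joins the current row when its Y
    coordinate is within row_threshold of that row's first cell."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        if rows and abs(cell.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right.
    return [sorted(row, key=lambda c: c.x) for row in rows]


def to_markdown(rows):
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        # Pad short rows so every line has the same number of columns.
        texts = [c.text for c in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(texts) + " |")
        if i == 0:
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


cells = [
    Cell("Name", 10, 100), Cell("Score", 120, 102),
    Cell("Alice", 10, 130), Cell("91", 120, 131),
]
table = to_markdown(group_rows(cells))
print(table)
```

A larger `row_threshold` tolerates more skew in scanned pages but risks merging adjacent rows, which is presumably why the threshold is configurable.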

Nginx Configuration (TG-005)

Nginx Reverse Proxy (nginx/nginx.conf)
- Production-ready reverse proxy configuration
- Rate limiting (30 req/s API, 5 req/s uploads)
- WebSocket support for Streamlit
- SSE support for RAG streaming
- Gzip compression
- Security headers (XSS, CSRF protection)
- SSL/TLS configuration (commented, ready for production)
- Connection limits and timeout tuning

Integration Tests (TG-006)

API Integration Tests (tests/integration/test_api_v2.py)
- TestClient-based testing without a running server
- Health/status endpoint tests
- Authentication flow tests
- Document upload/process/index workflow
- RAG query and search tests
- Error handling verification
- Concurrency tests
- Performance benchmarks (marked slow)
Table Chunker Unit Tests (tests/unit/test_table_chunker.py)
- Table structure reconstruction tests
- Markdown generation tests
- Header detection tests
- Column detection tests
- Edge case handling

Cross-Module State Synchronization (Phase 1B)

Enhanced State Manager (demo/state_manager.py)
- Event system with pub/sub pattern
- EventType enum for type-safe events
- Evidence highlighting synchronization
- Page/chunk selection sync across modules
- RAG query/response sharing
- Module-specific state storage
- Sync version tracking for change detection
- Helper components: render_evidence_panel(), render_document_selector()
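
A pub/sub event system with an enum of event types and a sync version counter might look roughly like the sketch below. The class name `EventBus`, the enum members, and the payload shapes are assumptions for illustration; the changelog names only `EventType` and does not list its members.

```python
from collections import defaultdict
from enum import Enum


class EventType(Enum):
    """Illustrative members only; the real enum values are not listed here."""
    CHUNK_SELECTED = "chunk_selected"
    PAGE_CHANGED = "page_changed"


class EventBus:
    """Hypothetical minimal publisher/subscriber registry."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0  # bumped on every publish for change detection

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        self.sync_version += 1
        for callback in self._subscribers[event_type]:
            callback(payload)


bus = EventBus()
received = []
bus.subscribe(EventType.CHUNK_SELECTED, received.append)
bus.publish(EventType.CHUNK_SELECTED, {"chunk_id": "c-1"})
bus.publish(EventType.PAGE_CHANGED, {"page": 2})  # no subscriber, still versioned
print(received, bus.sync_version)
```

Keying subscribers by enum member rather than by string means a misspelled event name fails at attribute lookup instead of silently never firing.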

[1.1.0] - 2026-01-20

Added

REST API (Phase 1B - TG-003)

Document API (api/routes/documents.py)
- POST /api/documents/upload - Upload and process documents
- GET /api/documents - List all documents with filtering
- GET /api/documents/{doc_id} - Get document by ID
- GET /api/documents/{doc_id}/detail - Get detailed document info
- GET /api/documents/{doc_id}/chunks - Get document chunks
- POST /api/documents/{doc_id}/process - Trigger processing
- POST /api/documents/{doc_id}/index - Index to RAG
- POST /api/documents/batch-index - Batch index multiple documents
- DELETE /api/documents/{doc_id} - Delete a document
RAG API (api/routes/rag.py)
- POST /api/rag/query - Execute RAG query with 5-agent pipeline
- POST /api/rag/query/stream - Stream RAG response (SSE)
- POST /api/rag/search - Semantic search without synthesis
- GET /api/rag/store/status - Get vector store status
- DELETE /api/rag/store/collection/{name} - Clear collection
- GET /api/rag/cache/stats - Get cache statistics
- DELETE /api/rag/cache - Clear query cache
API Schemas (api/schemas.py)
- Request/response models for all endpoints
- Document, Query, Search, Citation schemas
- Pydantic validation with comprehensive field definitions

Authentication (Phase 1C - TG-002)

JWT Authentication (api/auth.py)
- OAuth2 password bearer scheme
- POST /api/auth/token - Get access token
- POST /api/auth/register - Register new user
- GET /api/auth/me - Get current user info
- GET /api/auth/users - List users (admin only)
- DELETE /api/auth/users/{username} - Delete user (admin only)
- Password hashing with bcrypt
- Default admin user creation on startup
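
To show what an HS256-signed access token consists of, here is a standard-library sketch. It is not the code in api/auth.py, which relies on python-jose according to the dependency list; the function names and the one-hour lifetime are assumptions, and password hashing with bcrypt is not covered.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"),
                    hashlib.sha256).digest()


def create_token(username, secret, ttl_seconds=3600):
    """Build header.payload.signature with a subject and expiry claim."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token, secret):
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = _sign(f"{header_b64}.{payload_b64}", secret)
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    if payload.get("exp", 0) < time.time():
        return None
    return payload


token = create_token("admin", "dev-only-secret")
print(verify_token(token, "dev-only-secret"))
```

In a deployment the secret would come from the environment rather than a literal, and a maintained library is preferable to hand-rolled token handling.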

Extended Document Support (Phase 1B - FG-001)

Added support for new document formats in document processing:
- Word (.docx) - Full text and table extraction
- Excel (.xlsx, .xls) - Multi-sheet extraction
- PowerPoint (.pptx) - Slide-by-slide text extraction
- Text (.txt) - Plain text processing
- Markdown (.md) - Markdown file support

Caching (Phase 1B - TG-004)

Cache Manager (src/utils/cache_manager.py)
- Redis-based caching with in-memory fallback
- QueryCache - Cache RAG query results (1 hour TTL)
- EmbeddingCache - Cache embeddings (24 hour TTL)
- @cached decorator for function-level caching
- Automatic cache cleanup and size limits
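
The in-memory fallback with a TTL, a size limit, and a function-level decorator could be sketched as below. The `TTLCache` class, the `cached(cache)` signature, and the eviction policy are assumptions for illustration; the actual `@cached` decorator in src/utils/cache_manager.py may take different arguments, and the Redis path is omitted.

```python
import functools
import time


class TTLCache:
    """Hypothetical in-memory cache with per-entry expiry and a size cap."""

    def __init__(self, ttl_seconds, max_size=1024):
        self.ttl_seconds = ttl_seconds
        self.max_size = max_size
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy cleanup of expired entries
            return None
        return value

    def set(self, key, value):
        if key not in self._store and len(self._store) >= self.max_size:
            # Evict the oldest inserted entry (dicts keep insertion order).
            self._store.pop(next(iter(self._store)))
        self._store[key] = (time.monotonic() + self.ttl_seconds, value)


def cached(cache):
    """Cache results by function name and positional arguments.
    A result of None is treated as a miss and is recomputed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            key = (func.__name__, args)
            hit = cache.get(key)
            if hit is not None:
                return hit
            result = func(*args)
            cache.set(key, result)
            return result
        return wrapper
    return decorator


query_cache = TTLCache(ttl_seconds=3600)  # 1 hour, matching the QueryCache TTL
calls = []


@cached(query_cache)
def answer(query):
    calls.append(query)  # records how often the real work runs
    return f"answer to {query}"


answer("q1")
answer("q1")
print(len(calls))
```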

Docker Containerization (Phase 1C - TG-007)

Dockerfile - Multi-stage build
- Production stage with optimized image
- Development stage with hot reload
- Health checks and proper dependencies
docker-compose.yml - Full stack deployment
- SPARKNET API service
- Streamlit Demo service
- Ollama LLM service with GPU support
- ChromaDB vector store
- Redis cache
- Optional Nginx reverse proxy
docker-compose.dev.yml - Development configuration
- Volume mounts for code changes
- Hot reload enabled
- Connects to host Ollama
.dockerignore - Optimized build context

Changed

API Main (`api/main.py`)

- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New /api/status endpoint for comprehensive system status
- Better error handling allowing partial functionality
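
Graceful degradation at startup generally means initializing each component independently and recording failures instead of aborting. The sketch below illustrates that idea with plain functions; the component names, the status strings, and the `init_components` helper are assumptions and are not taken from api/main.py.

```python
def init_components(factories):
    """Try each factory; record failures instead of aborting startup."""
    components, status = {}, {}
    for name, factory in factories.items():
        try:
            components[name] = factory()
            status[name] = "ok"
        except Exception as exc:  # deliberately broad: startup should continue
            status[name] = f"unavailable: {exc}"
    return components, status


def connect_vector_store():
    return object()  # stand-in for a real client


def connect_cache():
    raise ConnectionError("redis not reachable")  # simulated outage


components, status = init_components(
    {"vector_store": connect_vector_store, "cache": connect_cache}
)
# A health endpoint can report per-component status plus an overall summary.
health = "healthy" if all(s == "ok" for s in status.values()) else "degraded"
print(health, status)
```

Request handlers can then check whether a component is present in `components` and return a clear error for the affected feature while the rest of the API keeps working.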

Technical Details

New Files Created

api/
├── auth.py              # Authentication module
├── schemas.py           # Pydantic models
└── routes/
    ├── documents.py     # Document endpoints
    └── rag.py           # RAG endpoints

src/utils/
└── cache_manager.py     # Redis/memory caching

docker/
├── Dockerfile           # Multi-stage build
├── docker-compose.yml   # Production stack
├── docker-compose.dev.yml # Development stack
└── .dockerignore        # Build optimization

Dependencies Added

- python-jose[cryptography] - JWT tokens
- passlib[bcrypt] - Password hashing
- python-multipart - Form data handling
- redis - Redis client (optional)
- python-docx - Word document support
- openpyxl - Excel support
- python-pptx - PowerPoint support

Configuration

- SPARKNET_SECRET_KEY - JWT secret (environment variable)
- REDIS_URL - Redis connection string
- OLLAMA_HOST - Ollama server URL
- CHROMA_HOST / CHROMA_PORT - ChromaDB connection
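
One way to read these variables in Python is sketched below. The `Settings` class and `load_settings` helper are hypothetical, and the fallback values are assumptions based on the usual local defaults for Ollama and ChromaDB rather than values confirmed by this project. An unset `REDIS_URL` is modeled as `None`, on the assumption that this is what selects the in-memory cache fallback.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    secret_key: str
    redis_url: Optional[str]
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return Settings(
        # No default: a missing JWT secret should fail fast with KeyError.
        secret_key=env["SPARKNET_SECRET_KEY"],
        redis_url=env.get("REDIS_URL"),  # None when Redis is not configured
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),  # assumed default
        chroma_host=env.get("CHROMA_HOST", "localhost"),  # assumed default
        chroma_port=int(env.get("CHROMA_PORT", "8000")),  # assumed default
    )


settings = load_settings({"SPARKNET_SECRET_KEY": "dev-only", "CHROMA_PORT": "9000"})
print(settings)
```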

API Quick Reference

# Health check
curl http://localhost:8000/api/health

# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'

# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
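
The same query and token calls can be built from Python with only the standard library, as sketched here. The helper names are hypothetical, and the optional `Authorization: Bearer` header is an assumption: the curl example above sends no token, and this changelog does not say which endpoints require one. The requests are constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"


def rag_query_request(query, token=None):
    """Build the POST /api/rag/query request with a JSON body."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed auth scheme
    return urllib.request.Request(
        f"{BASE_URL}/api/rag/query",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def token_request(username, password):
    """Build the form-encoded POST /api/auth/token request."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/token",
        data=body.encode("ascii"),
        method="POST",
    )


req = rag_query_request("What are the main findings?")
print(req.get_method(), req.full_url)

# To send against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```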

Docker Quick Start

# Production deployment
docker-compose up -d

# Development with hot reload
docker-compose -f docker-compose.dev.yml up

# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest

[1.0.0] - 2026-01-19

Initial Release

- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration