# SPARKNET Changelog

All notable changes to the SPARKNET project are documented in this file.

## [1.2.0] - 2026-01-20

### Added (Phase 1B Continuation)

#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context to improve embedding quality
  - Configurable row/column thresholds

- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection
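
The row/column thresholds above can be illustrated with a minimal sketch. It is not the `SemanticChunker` implementation: the `Region` class, the function names, and the default threshold values are assumptions made for illustration, and the header heuristic is simply "first row is the header".

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical OCR region: text plus the top-left corner of its box."""
    text: str
    x: float
    y: float


def group_rows(regions, row_threshold=10.0):
    """Group regions into rows: a region joins the current row when its
    y-coordinate is within row_threshold of the row's first region."""
    rows = []
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        if rows and abs(region.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(region)
        else:
            rows.append([region])
    return [sorted(row, key=lambda r: r.x) for row in rows]


def cluster_columns(regions, col_threshold=20.0):
    """Cluster x-coordinates into column anchors (left edge of each column)."""
    anchors = []
    for x in sorted(r.x for r in regions):
        if not anchors or x - anchors[-1] > col_threshold:
            anchors.append(x)
    return anchors


def to_markdown(regions, row_threshold=10.0, col_threshold=20.0):
    """Rebuild a grid from regions and render it as a markdown table.
    The first row is treated as the header (a deliberately naive heuristic)."""
    anchors = cluster_columns(regions, col_threshold)
    grid = []
    for row in group_rows(regions, row_threshold):
        cells = [""] * len(anchors)
        for region in row:
            # Assign the region to the nearest column anchor.
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - region.x))
            cells[col] = region.text
        grid.append(cells)
    if not grid:
        return ""
    lines = ["| " + " | ".join(grid[0]) + " |",
             "| " + " | ".join("---" for _ in anchors) + " |"]
    lines += ["| " + " | ".join(cells) + " |" for cells in grid[1:]]
    return "\n".join(lines)
```

Larger thresholds tolerate skewed scans but risk merging adjacent rows or columns, which is presumably why both are configurable.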

#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented out; enable for production)
  - Connection limits and timeout tuning

#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without a running server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)

- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling

#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`
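
As a rough sketch of the pub/sub pattern with sync version tracking described above: the `EventBus` class, its method names, and the `EventType` members shown here are hypothetical and are not taken from `demo/state_manager.py`.

```python
from collections import defaultdict
from enum import Enum, auto
from typing import Any, Callable, Dict, List


class EventType(Enum):
    # Member names are illustrative; the real enum may define others.
    PAGE_SELECTED = auto()
    CHUNK_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()


Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Minimal in-process publish/subscribe bus with a sync version counter."""

    def __init__(self) -> None:
        self._handlers: Dict[EventType, List[Handler]] = defaultdict(list)
        self.sync_version = 0  # bumped on every publish for change detection

    def subscribe(self, event_type: EventType, handler: Handler) -> None:
        """Register a handler to be called for one event type."""
        self._handlers[event_type].append(handler)

    def publish(self, event_type: EventType, payload: Dict[str, Any]) -> int:
        """Notify subscribers of event_type and return the new sync version."""
        self.sync_version += 1
        # Iterate over a copy so handlers may subscribe while being notified.
        for handler in list(self._handlers[event_type]):
            handler(payload)
        return self.sync_version
```

A module can compare the last `sync_version` it saw with the current one to decide whether it needs to re-render.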

---

## [1.1.0] - 2026-01-20

### Added

#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index document for RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document

- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache

- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions

#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup

#### Extended Document Support (Phase 1B - FG-001)
- Document processing now supports these additional formats:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support

#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits
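
The in-memory fallback path and a function-level caching decorator could look roughly like the sketch below. The `MemoryCache` class, its parameters, and this `cached` signature are assumptions for illustration; the actual `@cached` decorator in `src/utils/cache_manager.py` may take different arguments.

```python
import functools
import time


class MemoryCache:
    """In-memory TTL cache with a size cap; stands in for the fallback path."""

    def __init__(self, ttl_seconds=3600.0, max_size=1000, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_size = max_size
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazy cleanup of expired entries
            return default
        return value

    def set(self, key, value):
        if key not in self._store and len(self._store) >= self.max_size:
            # Evict the oldest inserted entry (dicts preserve insertion order).
            del self._store[next(iter(self._store))]
        self._store[key] = (self._clock() + self.ttl_seconds, value)


_MISSING = object()  # sentinel so that cached None results still count as hits


def cached(cache):
    """Decorator: cache results keyed on the function name and its arguments.
    Arguments must be hashable."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (func.__qualname__, args, tuple(sorted(kwargs.items())))
            hit = cache.get(key, _MISSING)
            if hit is not _MISSING:
                return hit
            result = func(*args, **kwargs)
            cache.set(key, result)
            return result
        return wrapper
    return decorator
```

With `ttl_seconds=3600` this mirrors the 1 hour TTL listed for `QueryCache`; an embedding cache would use `ttl_seconds=86400`.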

#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies

- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy

- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to Ollama running on the host

- **.dockerignore** - Optimized build context

### Changed

#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check to report component status
- Added `/api/status` endpoint for comprehensive system status
- Improved error handling so the API can keep running with partial functionality

### Technical Details

#### New Files Created
```
api/
├── auth.py              # Authentication module
├── schemas.py           # Pydantic models
└── routes/
    ├── documents.py     # Document endpoints
    └── rag.py           # RAG endpoints

src/utils/
└── cache_manager.py     # Redis/memory caching

docker/
├── Dockerfile             # Multi-stage build
├── docker-compose.yml     # Production stack
├── docker-compose.dev.yml # Development stack
└── .dockerignore          # Build optimization
```

#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support

#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection

### API Quick Reference

```bash
# Health check
curl http://localhost:8000/api/health

# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'

# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
```
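
The RAG query above can also be issued from Python using only the standard library. This is a minimal sketch: the helper names `build_rag_query` and `send` are hypothetical, and it assumes the endpoint returns a JSON body, whose exact shape is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_rag_query(query, base_url=BASE_URL):
    """Build (but do not send) the POST /api/rag/query request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/rag/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a request and decode the response body, assumed to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `send(build_rag_query("What are the main findings?"))` performs the same call as the curl command.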

### Docker Quick Start

```bash
# Production deployment
docker-compose up -d

# Development with hot reload
docker-compose -f docker-compose.dev.yml up

# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```

---

## [1.0.0] - 2026-01-19

### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration