# SPARKNET Changelog
All notable changes to the SPARKNET project are documented in this file.
## [1.2.0] - 2026-01-20
### Added (Phase 1B Continuation)
#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context for better embedding
  - Configurable row/column thresholds
- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection
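
The row grouping and Markdown generation described above can be sketched roughly as follows. This is an illustration only, not the code in `chunker.py`: the `Cell` class and the `group_rows` and `to_markdown` functions are hypothetical names, columns are simply ordered by X-coordinate rather than clustered, and the first row is assumed to be the header.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One OCR text region (hypothetical); (x, y) is its top-left corner."""
    text: str
    x: float
    y: float


def group_rows(cells, row_threshold=10.0):
    """Group cells into rows: a cell joins the current row when its
    Y-coordinate is within row_threshold of that row's first cell."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        if rows and abs(cell.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right (a stand-in for real column clustering).
    return [sorted(row, key=lambda c: c.x) for row in rows]


def to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        texts = [c.text for c in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(texts) + " |")
        if i == 0:
            lines.append("|" + " --- |" * width)
    return "\n".join(lines)
```

A larger `row_threshold` tolerates skewed scans at the cost of merging tightly spaced rows, which is presumably why the threshold is configurable.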
#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented, ready for production)
  - Connection limits and timeout tuning
#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)
- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling
#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`
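
For readers unfamiliar with the pattern, a pub/sub event system with an enum of event types and a sync version counter might look like the sketch below. Only the name `EventType` comes from this changelog; the enum members, the `EventBus` class, and its methods are assumptions made for illustration and may not match `demo/state_manager.py`.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    """Hypothetical members; the real enum may define different events."""
    DOCUMENT_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()
    RAG_RESPONSE = auto()


class EventBus:
    """Minimal publish/subscribe hub keyed by EventType (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0  # incremented on every publish

    def subscribe(self, event_type, handler):
        """Register a callable that receives the payload of each event."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Bump the version, then notify subscribers of this event type."""
        self.sync_version += 1
        for handler in list(self._subscribers.get(event_type, [])):
            handler(payload)


bus = EventBus()
selected = []
bus.subscribe(EventType.DOCUMENT_SELECTED, selected.append)
bus.publish(EventType.DOCUMENT_SELECTED, {"doc_id": "doc-1"})
```

A module can compare the last `sync_version` it saw with the current one to decide whether it needs to re-render.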
---
## [1.1.0] - 2026-01-20
### Added
#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index to RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document
- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache
- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions
#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup
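
To show what an access token carries, here is a standard-library sketch of signing and verifying an HS256 JWT. It is not the code in `api/auth.py`, which relies on `python-jose`. The HS256 algorithm, the `sub` and `exp` claims, the 30-minute lifetime, and the function names `create_token` and `verify_token` are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256)
    return _b64url(digest.digest())


def create_token(username, secret, expires_in=1800, now=None):
    """Build header.payload.signature for the given user (assumed claims)."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "exp": now + expires_in}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    return signing_input + "." + _sign(signing_input, secret)


def verify_token(token, secret, now=None):
    """Return the username if the signature and expiry are valid, else None."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature_b64.encode()):
        return None
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("exp", 0) <= now:
        return None
    return payload.get("sub")
```

In the real service the secret would come from `SPARKNET_SECRET_KEY`, and a maintained library such as `python-jose` should do the encoding and validation.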
#### Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support
#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits
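
The in-memory fallback and the decorator can be pictured with the sketch below. It covers only the memory path (no Redis), and the `TTLCache` class, its parameters, and this `cached` signature are hypothetical, so the actual `cache_manager.py` API may differ.

```python
import functools
import time


class TTLCache:
    """In-memory cache with per-entry expiry and a simple size limit."""

    def __init__(self, ttl_seconds, max_size=1024, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_size = max_size
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazy cleanup of the expired entry
            return default
        return value

    def set(self, key, value):
        if key not in self._store and len(self._store) >= self.max_size:
            # Evict the oldest inserted entry (dicts keep insertion order).
            del self._store[next(iter(self._store))]
        self._store[key] = (self._clock() + self.ttl, value)


_MISSING = object()


def cached(cache):
    """Memoize a function in `cache`; all arguments must be hashable."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (func.__qualname__, args, tuple(sorted(kwargs.items())))
            hit = cache.get(key, _MISSING)
            if hit is not _MISSING:
                return hit
            result = func(*args, **kwargs)
            cache.set(key, result)
            return result
        return wrapper
    return decorator


query_cache = TTLCache(ttl_seconds=3600)       # 1 hour, as for QueryCache
embedding_cache = TTLCache(ttl_seconds=86400)  # 24 hours, as for EmbeddingCache
```

The sentinel `_MISSING` lets the decorator cache a legitimate `None` result instead of recomputing it on every call.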
#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies
- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy
- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to host Ollama
- **.dockerignore** - Optimized build context
### Changed
#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New `/api/status` endpoint for comprehensive system status
- Error handling that keeps the API partially functional when a component fails to initialize
### Technical Details
#### New Files Created
```
api/
├── auth.py                 # Authentication module
├── schemas.py              # Pydantic models
└── routes/
    ├── documents.py        # Document endpoints
    └── rag.py              # RAG endpoints
src/utils/
└── cache_manager.py        # Redis/memory caching
docker/
├── Dockerfile              # Multi-stage build
├── docker-compose.yml      # Production stack
├── docker-compose.dev.yml  # Development stack
└── .dockerignore           # Build optimization
```
#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support
#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection
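
As a rough guide to how these variables could be consumed, the sketch below reads them into a settings object. The `Settings` class and `load_settings` function are hypothetical, every default value is a placeholder rather than SPARKNET's actual default, and treating `SPARKNET_SECRET_KEY` as mandatory is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    secret_key: str
    redis_url: str
    ollama_host: str
    chroma_host: str
    chroma_port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ when none is given).

    The defaults below are illustrative placeholders only.
    """
    env = os.environ if env is None else env
    secret_key = env.get("SPARKNET_SECRET_KEY")
    if not secret_key:
        # Assumption: refuse to start without a JWT secret.
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    return Settings(
        secret_key=secret_key,
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.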
### API Quick Reference
```bash
# Health check
curl http://localhost:8000/api/health
# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload
# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
-H "Content-Type: application/json" \
-d '{"query": "What are the main findings?"}'
# Get token
curl -X POST http://localhost:8000/api/auth/token \
-d "username=admin&password=admin123"
```
### Docker Quick Start
```bash
# Production deployment
docker-compose up -d
# Development with hot reload
docker-compose -f docker-compose.dev.yml up
# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```
---
## [1.0.0] - 2026-01-19
### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration