# SPARKNET Changelog

All notable changes to the SPARKNET project are documented in this file.

## [1.2.0] - 2026-01-20

### Added (Phase 1B Continuation)

#### Table Extraction Preservation (FG-002) - HIGH PRIORITY
- **Enhanced SemanticChunker** (`src/document/chunking/chunker.py`)
  - Table structure reconstruction from OCR regions
  - Markdown table generation with proper formatting
  - Header row detection using heuristics
  - Structured data storage in `extra.table_structure`
  - Cell positions preserved for evidence highlighting
  - Searchable text includes header context for better embedding
  - Configurable row/column thresholds

- **ChunkerConfig enhancements**
  - `preserve_table_structure` - Enable markdown conversion
  - `table_row_threshold` - Y-coordinate grouping threshold
  - `table_col_threshold` - X-coordinate clustering threshold
  - `detect_table_headers` - Automatic header detection
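
The row-grouping and Markdown steps above can be sketched roughly as follows. This is a minimal illustration, not the `SemanticChunker` implementation: the `Cell` class, `group_rows`, and `to_markdown` are hypothetical names, and the sketch covers only the Y-coordinate row threshold. It orders cells within a row by X position rather than clustering columns, and it always treats the first row as the header instead of applying detection heuristics.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One OCR text region (hypothetical shape, not the project's model)."""
    text: str
    x: float  # left edge of the region
    y: float  # top edge of the region


def group_rows(cells, row_threshold=10.0):
    """Group cells into rows by comparing y to the first cell of the current row."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c.y, c.x)):
        if rows and abs(cell.y - rows[-1][0].y) <= row_threshold:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Order each row left to right.
    return [sorted(row, key=lambda c: c.x) for row in rows]


def to_markdown(rows):
    """Render rows as a Markdown table, using the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        # Pad short rows so every line has the same number of columns.
        texts = [c.text for c in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(texts) + " |")
        if i == 0:
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)
```

A larger `row_threshold` tolerates skewed scans at the cost of merging tightly spaced rows, which is presumably why the threshold is configurable.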

#### Nginx Configuration (TG-005)
- **Nginx Reverse Proxy** (`nginx/nginx.conf`)
  - Production-ready reverse proxy configuration
  - Rate limiting (30 req/s API, 5 req/s uploads)
  - WebSocket support for Streamlit
  - SSE support for RAG streaming
  - Gzip compression
  - Security headers (XSS, CSRF protection)
  - SSL/TLS configuration (commented, ready for production)
  - Connection limits and timeout tuning

#### Integration Tests (TG-006)
- **API Integration Tests** (`tests/integration/test_api_v2.py`)
  - TestClient-based testing without a running server
  - Health/status endpoint tests
  - Authentication flow tests
  - Document upload/process/index workflow
  - RAG query and search tests
  - Error handling verification
  - Concurrency tests
  - Performance benchmarks (marked slow)

- **Table Chunker Unit Tests** (`tests/unit/test_table_chunker.py`)
  - Table structure reconstruction tests
  - Markdown generation tests
  - Header detection tests
  - Column detection tests
  - Edge case handling

#### Cross-Module State Synchronization (Phase 1B)
- **Enhanced State Manager** (`demo/state_manager.py`)
  - Event system with pub/sub pattern
  - `EventType` enum for type-safe events
  - Evidence highlighting synchronization
  - Page/chunk selection sync across modules
  - RAG query/response sharing
  - Module-specific state storage
  - Sync version tracking for change detection
  - Helper components: `render_evidence_panel()`, `render_document_selector()`
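
As a rough illustration of the pub/sub pattern and sync version tracking listed above, a minimal sketch might look like this. The `EventBus` class and the `EventType` member names are assumptions made for the example; the actual API in `demo/state_manager.py` may differ.

```python
from collections import defaultdict
from enum import Enum, auto


class EventType(Enum):
    # Member names are illustrative; the project's real events may differ.
    PAGE_SELECTED = auto()
    CHUNK_SELECTED = auto()
    EVIDENCE_HIGHLIGHTED = auto()


class EventBus:
    """Hypothetical in-process event bus with a monotonically increasing version."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.sync_version = 0

    def subscribe(self, event_type, callback):
        """Register a callback to run whenever event_type is published."""
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        """Bump the version, notify subscribers, and return the new version."""
        self.sync_version += 1
        for callback in list(self._subscribers.get(event_type, [])):
            callback(payload)
        return self.sync_version


# Usage: one module reacts when another selects a page.
bus = EventBus()
selected_pages = []
bus.subscribe(EventType.PAGE_SELECTED, selected_pages.append)
bus.publish(EventType.PAGE_SELECTED, {"page": 3})
```

Comparing a stored `sync_version` against the current one lets a module detect that shared state changed without inspecting every field.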

---

## [1.1.0] - 2026-01-20

### Added

#### REST API (Phase 1B - TG-003)
- **Document API** (`api/routes/documents.py`)
  - `POST /api/documents/upload` - Upload and process documents
  - `GET /api/documents` - List all documents with filtering
  - `GET /api/documents/{doc_id}` - Get document by ID
  - `GET /api/documents/{doc_id}/detail` - Get detailed document info
  - `GET /api/documents/{doc_id}/chunks` - Get document chunks
  - `POST /api/documents/{doc_id}/process` - Trigger processing
  - `POST /api/documents/{doc_id}/index` - Index to RAG
  - `POST /api/documents/batch-index` - Batch index multiple documents
  - `DELETE /api/documents/{doc_id}` - Delete a document

- **RAG API** (`api/routes/rag.py`)
  - `POST /api/rag/query` - Execute RAG query with 5-agent pipeline
  - `POST /api/rag/query/stream` - Stream RAG response (SSE)
  - `POST /api/rag/search` - Semantic search without synthesis
  - `GET /api/rag/store/status` - Get vector store status
  - `DELETE /api/rag/store/collection/{name}` - Clear collection
  - `GET /api/rag/cache/stats` - Get cache statistics
  - `DELETE /api/rag/cache` - Clear query cache

- **API Schemas** (`api/schemas.py`)
  - Request/response models for all endpoints
  - Document, Query, Search, Citation schemas
  - Pydantic validation with comprehensive field definitions

#### Authentication (Phase 1C - TG-002)
- **JWT Authentication** (`api/auth.py`)
  - OAuth2 password bearer scheme
  - `POST /api/auth/token` - Get access token
  - `POST /api/auth/register` - Register new user
  - `GET /api/auth/me` - Get current user info
  - `GET /api/auth/users` - List users (admin only)
  - `DELETE /api/auth/users/{username}` - Delete user (admin only)
  - Password hashing with bcrypt
  - Default admin user creation on startup

#### Extended Document Support (Phase 1B - FG-001)
- Added support for new document formats in document processing:
  - **Word (.docx)** - Full text and table extraction
  - **Excel (.xlsx, .xls)** - Multi-sheet extraction
  - **PowerPoint (.pptx)** - Slide-by-slide text extraction
  - **Text (.txt)** - Plain text processing
  - **Markdown (.md)** - Markdown file support

#### Caching (Phase 1B - TG-004)
- **Cache Manager** (`src/utils/cache_manager.py`)
  - Redis-based caching with in-memory fallback
  - `QueryCache` - Cache RAG query results (1 hour TTL)
  - `EmbeddingCache` - Cache embeddings (24 hour TTL)
  - `@cached` decorator for function-level caching
  - Automatic cache cleanup and size limits
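
The function-level caching idea can be sketched as a TTL decorator. This covers only an in-memory store, not the Redis backend, and the signature shown (`ttl_seconds`, `max_size`, `clock`) is an assumption for illustration rather than the signature of the project's `@cached` decorator.

```python
import functools
import time


def cached(ttl_seconds=3600, max_size=1024, clock=time.monotonic):
    """In-memory TTL cache decorator (hypothetical signature)."""

    def decorator(func):
        store = {}  # key -> (expires_at, value)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            if len(store) >= max_size:
                # Drop expired entries first, then the oldest insertion if still full.
                for stale in [k for k, (exp, _) in store.items() if exp <= now]:
                    del store[stale]
                if len(store) >= max_size:
                    del store[next(iter(store))]
            store[key] = (now + ttl_seconds, value)
            return value

        wrapper.cache_clear = store.clear
        return wrapper

    return decorator
```

Passing the clock as a parameter keeps expiry testable without sleeping; a Redis-backed variant would typically delegate expiry to the server instead.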

#### Docker Containerization (Phase 1C - TG-007)
- **Dockerfile** - Multi-stage build
  - Production stage with optimized image
  - Development stage with hot reload
  - Health checks and proper dependencies

- **docker-compose.yml** - Full stack deployment
  - SPARKNET API service
  - Streamlit Demo service
  - Ollama LLM service with GPU support
  - ChromaDB vector store
  - Redis cache
  - Optional Nginx reverse proxy

- **docker-compose.dev.yml** - Development configuration
  - Volume mounts for code changes
  - Hot reload enabled
  - Connects to host Ollama

- **.dockerignore** - Optimized build context

### Changed

#### API Main (`api/main.py`)
- Enhanced lifespan initialization with graceful degradation
- Added RAG component initialization
- Improved health check with component status
- New `/api/status` endpoint for comprehensive system status
- Better error handling allowing partial functionality

### Technical Details

#### New Files Created
```
api/
├── auth.py              # Authentication module
├── schemas.py           # Pydantic models
└── routes/
    ├── documents.py     # Document endpoints
    └── rag.py           # RAG endpoints

src/utils/
└── cache_manager.py     # Redis/memory caching

docker/
├── Dockerfile           # Multi-stage build
├── docker-compose.yml   # Production stack
├── docker-compose.dev.yml # Development stack
└── .dockerignore        # Build optimization
```

#### Dependencies Added
- `python-jose[cryptography]` - JWT tokens
- `passlib[bcrypt]` - Password hashing
- `python-multipart` - Form data handling
- `redis` - Redis client (optional)
- `python-docx` - Word document support
- `openpyxl` - Excel support
- `python-pptx` - PowerPoint support

#### Configuration
- `SPARKNET_SECRET_KEY` - JWT secret (environment variable)
- `REDIS_URL` - Redis connection string
- `OLLAMA_HOST` - Ollama server URL
- `CHROMA_HOST` / `CHROMA_PORT` - ChromaDB connection
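
One way these variables might be read at startup is sketched below. The `load_settings` helper and the returned dictionary keys are hypothetical, and treating a missing `SPARKNET_SECRET_KEY` as a fatal error is an assumption; the changelog does not say how the application handles unset values.

```python
import os


def load_settings(env=None):
    """Collect SPARKNET settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    secret = env.get("SPARKNET_SECRET_KEY")
    if not secret:
        # Assumption: refuse to start without a JWT secret.
        raise RuntimeError("SPARKNET_SECRET_KEY must be set")
    port = env.get("CHROMA_PORT")
    return {
        "secret_key": secret,
        "redis_url": env.get("REDIS_URL"),      # None when unset
        "ollama_host": env.get("OLLAMA_HOST"),  # None when unset
        "chroma_host": env.get("CHROMA_HOST"),  # None when unset
        "chroma_port": int(port) if port else None,
    }
```

Accepting an explicit mapping makes the helper easy to exercise in tests without mutating the process environment.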

### API Quick Reference

```bash
# Health check
curl http://localhost:8000/api/health

# Upload document
curl -X POST -F "file=@document.pdf" http://localhost:8000/api/documents/upload

# Query RAG
curl -X POST http://localhost:8000/api/rag/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings?"}'

# Get token
curl -X POST http://localhost:8000/api/auth/token \
  -d "username=admin&password=admin123"
```

### Docker Quick Start

```bash
# Production deployment
docker-compose up -d

# Development with hot reload
docker-compose -f docker-compose.dev.yml up

# Pull Ollama models
docker exec sparknet-ollama ollama pull llama3.2:latest
docker exec sparknet-ollama ollama pull mxbai-embed-large:latest
```

---

## [1.0.0] - 2026-01-19

### Initial Release
- Multi-Agent RAG Pipeline (5 agents)
- Document Processing Pipeline (OCR, Layout, Chunking)
- Streamlit Demo Application (5 modules)
- ChromaDB Vector Store
- Ollama LLM Integration