Spaces:

Dev-ks04
/

contexto-api

Running

App Files Files Community

contexto-api / SYSTEM_DOCUMENTATION.md

Dev-ks04

feat: Contexto FastAPI backend - intent-aware summarization engine

39028c9 2 days ago

preview code

raw

history blame contribute delete

13.9 kB

	# 📚 Complete System Documentation

	## 📁 Project Folder Structure

	```
	Backend/
	├── .venv/ # Virtual environment (isolated Python)
	├── data/ # Data folders
	│ ├── raw/ # Original documents
	│ ├── processed/ # Processed data
	│ └── processing/ # Processing scripts
	├── models/ # ML Models
	│ ├── checkpoints/ # Model checkpoints
	│ ├── tokenizers/ # Tokenizer files
	│ ├── download_models.py # Download pre-trained models
	│ └── README.md
	├── notebooks/ # Jupyter notebooks for experimentation
	├── results/ # Output summaries and results
	├── src/ # Source code (CORE MODULES)
	│ ├── __init__.py # Package initialization
	│ ├── api.py # FastAPI REST API endpoints
	│ ├── summarizer.py # Main summarization orchestrator
	│ ├── preprocessing.py # Text preprocessing & cleaning
	│ ├── models.py # Model loading & initialization
	│ ├── rag.py # Retrieval-Augmented Generation
	│ ├── model_selector.py # Intelligent model selection
	│ ├── evaluation.py # Quality metrics & ROUGE scores
	│ ├── keywords.py # Keyword extraction
	│ ├── exporters.py # Export to JSON, PDF, TXT, Markdown
	│ ├── fine_tuner.py # Fine-tuning utilities
	│ ├── utils.py # Helper functions
	│ ├── web_ui.py # Web UI (HTML/CSS/JS)
	│ └── __pycache__/ # Python compiled files
	├── main.py # Entry point (4 CLI modes)
	├── config.json # Configuration & settings
	├── requirements.txt # Python dependencies
	├── README.md # Project overview
	├── Postman_Collection.json # API test suite
	└── SYSTEM_DOCUMENTATION.md # This file
	```

	---

	## 🔧 Core Modules

	### 1. main.py (Entry Point - 213 lines)
	Purpose: CLI interface with 4 operational modes

	Functions:
	- `single_document_mode()` - Summarize one document
	- `batch_mode()` - Process multiple files
	- `api_mode()` - Launch REST API server (port 8000)
	- `web_ui_mode()` - Launch web UI (port 8001)

	How to Use:
	```bash
	python main.py
	# Select: 1, 2, 3, or 4
	```

	---

	### 2. src/summarizer.py (Core Pipeline - 390 lines)
	Purpose: Main orchestration for document summarization

	Key Classes:
	- `TechnicalDocumentSummarizer` - Main class
	- `auto_summarize(document, quality_preference)` - Intelligent model routing
	- `summarize(document, language, intent)` - Direct summarization
	- `summarize_batch(documents)` - Process multiple documents
	- `_simplify_language(summary)` - Convert jargon to simple terms

	Flow:
	```
	Input Document
	→ Preprocessing (clean, tokenize, chunk)
	→ Complexity Analysis
	→ Model Selection (T5-Small/Base/Large + Pegasus)
	→ Optional RAG (for complex docs)
	→ Quality Evaluation (ROUGE, confidence)
	→ Keyword Extraction
	→ Output (JSON/PDF/TXT)
	```

	---

	### 3. src/api.py (REST API - 220 lines)
	Purpose: FastAPI endpoints for remote/Postman access

	Endpoints:
	\| Endpoint \| Method \| Purpose \|
	\|----------\|--------\|---------\|
	\| `/health` \| GET \| Server status check \|
	\| `/languages` \| GET \| Supported languages (15) \|
	\| `/intents` \| GET \| Supported intent types (6) \|
	\| `/summarize` \| POST \| Single document summarization \|
	\| `/batch-summarize` \| POST \| Batch processing \|

	Example Request:
	```json
	POST http://localhost:8000/summarize
	{
	"document": "Your text here...",
	"language": "english",
	"intent": "technical_overview",
	"quality_preference": "balanced"
	}
	```

	Response:
	```json
	{
	"summary": "...",
	"language": "english",
	"intent": "technical_overview",
	"length": 45,
	"model": "t5-base",
	"complexity": "MODERATE",
	"use_rag": false,
	"confidence_score": 0.92
	}
	```

	---

	### 4. src/preprocessing.py (Text Processing)
	Purpose: Clean and prepare text for summarization

	Classes:
	- `TextPreprocessor` - General text cleaning
	- `clean_text()` - Remove noise
	- `normalize()` - Standardize formatting
	- `sent_tokenize()` - Split into sentences
	- `word_tokenize()` - Split into words

	- `TechnicalDocumentParser` - Parse scientific documents
	- `remove_citations()` - Strip reference citations
	- `remove_equations()` - Remove LaTeX equations

	---

	### 5. src/model_selector.py (Intelligent Selection - 299 lines)
	Purpose: Auto-select best model based on document characteristics

	Analysis Metrics:
	- Word count
	- Sentence length
	- Vocabulary richness (unique words ratio)

	Decision Tree:
	```
	Word Count Analysis:
	├─ SIMPLE (< 500 words) → T5-Small ⚡
	├─ MODERATE (500-2000 words) → T5-Base ⚖️
	├─ COMPLEX (2000-5000 words) → Pegasus-ArXiv + RAG 🧠
	└─ VERY_COMPLEX (> 5000 words) → T5-Large + RAG ✨
	```

	---

	### 6. src/rag.py (Retrieval-Augmented Generation - 360 lines)
	Purpose: Enhance summaries for complex documents using semantic search

	Components:
	- `DocumentChunker` - Split docs with overlap
	- `EmbeddingGenerator` - Create 384-dim vectors (sentence-transformers)
	- `VectorDatabase` - FAISS-based similarity search
	- `RAGPipeline` - Orchestrate: chunk → embed → index → retrieve → summarize

	How It Works:
	```
	Complex Document
	→ Chunk into overlapping segments (512 tokens)
	→ Generate embeddings for each chunk
	→ Build FAISS vector index
	→ Search for most relevant chunks
	→ Feed to summarization model
	→ Enhanced summary with context
	```

	---

	### 7. src/evaluation.py (Quality Metrics)
	Purpose: Measure summary quality and confidence

	Class: `SummaryEvaluator`
	- `calculate_rouge_scores()` - ROUGE-1, ROUGE-2, ROUGE-L
	- `get_confidence_score()` - 0-1 confidence metric
	- `evaluate_quality()` - Overall quality assessment

	Metrics:
	- ROUGE-1: Unigram overlap
	- ROUGE-2: Bigram overlap
	- ROUGE-L: Longest common subsequence

	---

	### 8. src/keywords.py (Keyword Extraction)
	Purpose: Extract important keywords and phrases

	Class: `KeywordExtractor`
	- `extract_keywords()` - TF-based extraction
	- `mine_phrases()` - Multi-word phrase detection
	- `score_keywords()` - Importance scoring

	---

	### 9. src/exporters.py (Output Formats)
	Purpose: Export summaries in multiple formats

	Class: `SummaryExporter`
	- `export_json()` - JSON format with metadata
	- `export_text()` - Plain text
	- `export_pdf()` - Formatted PDF report (reportlab)
	- `export_markdown()` - Markdown format

	---

	### 10. src/web_ui.py (Web Interface - 1148 lines)
	Purpose: Professional, feature-rich web UI

	Features:
	- ✅ Single document & batch upload
	- ✅ Document history (localStorage)
	- ✅ Language selector (15 languages)
	- ✅ Intent selector (6 types)
	- ✅ Quality preference (speed/balanced/quality)
	- ✅ Real-time progress tracking
	- ✅ Download results (TXT/JSON)
	- ✅ Copy to clipboard
	- ✅ Settings panel with persistence
	- ✅ Responsive design (sidebar + main content)

	Access: `http://localhost:8001`

	---

	### 11. src/models.py (Model Management)
	Purpose: Load and initialize pre-trained models

	Supported Models:
	```
	Speed Tier (⚡):
	├─ t5-small
	└─ distilbert

	Balanced Tier (⚖️):
	├─ t5-base
	├─ mbart-50-small
	└─ mt5-small

	Quality Tier (✨):
	├─ t5-large
	├─ google/pegasus-arxiv
	├─ google/pegasus-pubmed
	├─ facebook/bart-large-cnn
	└─ allenai/led-base-16384
	```

	---

	### 12. src/fine_tuner.py (Fine-tuning Utilities)
	Purpose: Fine-tune models on custom datasets

	Methods:
	- `prepare_dataset()` - Format custom data
	- `train()` - Fine-tune models
	- `evaluate()` - Test performance
	- `save_model()` - Save checkpoints

	---

	### 13. src/utils.py (Helper Functions)
	Purpose: Utility functions used across modules

	Functions:
	- `load_config()` - Load config.json
	- `setup_logging()` - Configure logging
	- `format_output()` - Format results
	- Device management (CPU/GPU detection)

	---

	## ⚙️ Configuration (config.json)

	```json
	{
	"model": {
	"primary_model": "t5-small",
	"max_input_length": 512,
	"max_output_length": 150,
	"supported_languages": [15 languages],
	"default_language": "english"
	},
	"summarization": {
	"intent_types": ["technical_overview", "detailed_analysis", ...],
	"chunk_size": 512,
	"chunk_overlap": 50,
	"preserve_context": true
	}
	}
	```

	---

	## 🎯 Supported Features

	### Languages (15 Total)
	English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Vietnamese, Thai

	### Intent Types (6 Total)
	1. technical_overview - High-level summary
	2. detailed_analysis - In-depth breakdown
	3. methodology - Research methods used
	4. results - Key findings
	5. conclusion - Conclusions drawn
	6. abstract - Academic abstract

	### Quality Preferences
	- Speed (⚡) - T5-Small, < 2 seconds
	- Balanced (⚖️) - T5-Base, < 5 seconds
	- Quality (✨) - T5-Large + RAG, < 10 seconds

	---

	## 🔌 How Components Work Together

	### Workflow 1: Single Document (Mode 1)
	```
	main.py
	↓ single_document_mode()
	↓
	TechnicalDocumentSummarizer.auto_summarize()
	├→ TextPreprocessor.clean_text()
	├→ ModelSelector (complexity analysis)
	├→ (Optional) RAGPipeline
	├→ T5/Pegasus model
	├→ SummaryEvaluator (ROUGE, confidence)
	├→ KeywordExtractor
	└→ Output (display or export)
	```

	### Workflow 2: REST API (Mode 3)
	```
	Postman/Web Client
	↓ HTTP POST /summarize
	↓
	FastAPI.summarize_endpoint()
	↓
	TechnicalDocumentSummarizer.auto_summarize()
	↓ (same as Workflow 1)
	↓
	JSON Response
	```

	### Workflow 3: Web UI (Mode 4)
	```
	Browser → http://localhost:8001
	↓
	web_ui.py (HTML/CSS/JS)
	↓ Form submission
	↓
	FastAPI /summarize endpoint
	↓ (same as Workflow 2)
	↓
	Display in browser + localStorage
	```

	---

	## 📊 Data Flow Summary

	```
	INPUT FORMATS:
	├─ Text (paste into UI)
	├─ Files (PDF, TXT upload)
	└─ Batch (multiple files)
	↓
	PROCESSING PIPELINE:
	├─ Text Cleaning
	├─ Tokenization & Chunking
	├─ Complexity Analysis
	├─ Model Selection
	├─ (Optional) Vector Embedding & Indexing
	├─ Summarization
	├─ Quality Evaluation
	└─ Keyword Extraction
	↓
	OUTPUT FORMATS:
	├─ JSON (with metadata)
	├─ PDF (formatted report)
	├─ TXT (plain text)
	└─ Web UI display (with localStorage)
	```

	---

	## 🚀 Quick Start Guide

	### 1. Install Dependencies
	```bash
	cd Backend
	pip install -r requirements.txt
	```

	### 2. Run in Different Modes

	Mode 1 - Single Document:
	```bash
	python main.py
	# Select: 1
	# Paste text or upload file
	```

	Mode 2 - Batch Processing:
	```bash
	python main.py
	# Select: 2
	# Upload multiple files
	```

	Mode 3 - REST API (for Postman):
	```bash
	python main.py
	# Select: 3
	# API runs on http://localhost:8000
	```

	Mode 4 - Web UI:
	```bash
	python main.py
	# Select: 4
	# Open http://localhost:8001 in browser
	```

	---

	## 🔗 API Integration

	### Using REST API with Postman

	1. Import Collection:
	- Open Postman
	- Import `Postman_Collection.json`

	2. Start API Server:
	- Run Mode 3 from main.py
	- Server starts on `http://localhost:8000`

	3. Run Tests:
	- 7 essential tests included
	- Tests health, languages, intents, summarization, batch, multi-language, speed mode

	---

	## 📈 Performance Characteristics

	\| Metric \| Speed \| Balanced \| Quality \|
	\|--------\|-------\|----------\|---------\|
	\| Model \| T5-Small \| T5-Base \| T5-Large + RAG \|
	\| Latency \| < 2s \| 2-5s \| 5-10s \|
	\| Quality Score \| 0.70 \| 0.85 \| 0.95 \|
	\| Memory Usage \| 1.5GB \| 3GB \| 6GB \|
	\| Doc Size Max \| 500w \| 2000w \| 5000w+ \|

	---

	## 🛠️ Development & Testing

	### Unit Testing
	```bash
	# Future: pytest tests/
	pytest
	```

	### Benchmarking
	```bash
	# Check performance metrics
	python benchmark.py
	```

	### Sanity Checks
	```bash
	# Verify all components working
	python sanity_check.py
	```

	---

	## 📚 Documentation Files

	\| File \| Purpose \|
	\|------\|---------\|
	\| `README.md` \| Project overview & setup \|
	\| `SYSTEM_DOCUMENTATION.md` \| This file - complete architecture \|
	\| `config.json` \| Configuration settings \|
	\| `requirements.txt` \| Python dependencies \|
	\| `Postman_Collection.json` \| API test suite \|

	---

	## 🔐 Security Considerations

	- ✅ No external API keys stored in code
	- ✅ Input validation on all endpoints
	- ✅ Error handling without exposing stack traces
	- ✅ Max input length limits (prevent DoS)
	- ✅ CORS headers properly configured

	---

	## 🎓 Key Technologies

	\| Component \| Technology \|
	\|-----------\|-----------\|
	\| API Framework \| FastAPI + Uvicorn \|
	\| NLP Models \| HuggingFace Transformers \|
	\| Deep Learning \| PyTorch \|
	\| Embeddings \| Sentence-Transformers \|
	\| Vector DB \| FAISS \|
	\| Quality Metrics \| rouge-score \|
	\| Web UI \| HTML5 + CSS3 + JS \|
	\| PDF Export \| ReportLab \|

	---

	## 📞 Support & Debugging

	### Common Issues

	Issue: ModuleNotFoundError for rouge_score
	```bash
	pip install rouge_score
	```

	Issue: CUDA/GPU not detected
	```bash
	# Will auto-fallback to CPU
	# Check config.json "device": "auto"
	```

	Issue: Model download fails
	```bash
	python models/download_models.py
	```

	---

	## 📄 License
	MIT License - See LICENSE file for details

	---

	Last Updated: February 24, 2026
	Version: 1.0.0