Complete System Documentation
Project Folder Structure
Backend/
├── .venv/  # Virtual environment (isolated Python)
├── data/  # Data folders
│   ├── raw/  # Original documents
│   ├── processed/  # Processed data
│   └── processing/  # Processing scripts
├── models/  # ML models
│   ├── checkpoints/  # Model checkpoints
│   ├── tokenizers/  # Tokenizer files
│   ├── download_models.py  # Download pre-trained models
│   └── README.md
├── notebooks/  # Jupyter notebooks for experimentation
├── results/  # Output summaries and results
├── src/  # Source code (CORE MODULES)
│   ├── __init__.py  # Package initialization
│   ├── api.py  # FastAPI REST API endpoints
│   ├── summarizer.py  # Main summarization orchestrator
│   ├── preprocessing.py  # Text preprocessing & cleaning
│   ├── models.py  # Model loading & initialization
│   ├── rag.py  # Retrieval-Augmented Generation
│   ├── model_selector.py  # Intelligent model selection
│   ├── evaluation.py  # Quality metrics & ROUGE scores
│   ├── keywords.py  # Keyword extraction
│   ├── exporters.py  # Export to JSON, PDF, TXT, Markdown
│   ├── fine_tuner.py  # Fine-tuning utilities
│   ├── utils.py  # Helper functions
│   ├── web_ui.py  # Web UI (HTML/CSS/JS)
│   └── __pycache__/  # Python compiled files
├── main.py  # Entry point (4 CLI modes)
├── config.json  # Configuration & settings
├── requirements.txt  # Python dependencies
├── README.md  # Project overview
├── Postman_Collection.json  # API test suite
└── SYSTEM_DOCUMENTATION.md  # This file
Core Modules
1. main.py (Entry Point - 213 lines)
Purpose: CLI interface with 4 operational modes
Functions:
- single_document_mode() - Summarize one document
- batch_mode() - Process multiple files
- api_mode() - Launch REST API server (port 8000)
- web_ui_mode() - Launch web UI (port 8001)
How to Use:
python main.py
# Select: 1, 2, 3, or 4
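The menu selection can be pictured as a lookup from the typed number to one of the four mode functions. The sketch below is only an illustration of that idea: the function bodies are placeholders, and `MODES` and `run_mode` are hypothetical names that may not match what main.py actually does.

```python
# Placeholder mode functions; the real ones live in main.py.
def single_document_mode():
    return "single document"

def batch_mode():
    return "batch"

def api_mode():
    return "api server on port 8000"

def web_ui_mode():
    return "web ui on port 8001"

# Hypothetical lookup table from menu choice to mode function.
MODES = {
    "1": single_document_mode,
    "2": batch_mode,
    "3": api_mode,
    "4": web_ui_mode,
}

def run_mode(choice):
    """Run the mode for a menu choice, rejecting anything outside 1-4."""
    handler = MODES.get(choice.strip())
    if handler is None:
        raise ValueError(f"Unknown mode {choice!r}; expected one of {sorted(MODES)}")
    return handler()
```

Calling `run_mode("3")` would start the API mode in this sketch; any other input raises a `ValueError` rather than silently exiting.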
2. src/summarizer.py (Core Pipeline - 390 lines)
Purpose: Main orchestration for document summarization
Key Classes:
- TechnicalDocumentSummarizer - Main class
  - auto_summarize(document, quality_preference) - Intelligent model routing
  - summarize(document, language, intent) - Direct summarization
  - summarize_batch(documents) - Process multiple documents
  - _simplify_language(summary) - Convert jargon to simple terms
Flow:
Input Document
↓ Preprocessing (clean, tokenize, chunk)
↓ Complexity Analysis
↓ Model Selection (T5-Small/Base/Large + Pegasus)
↓ Optional RAG (for complex docs)
↓ Quality Evaluation (ROUGE, confidence)
↓ Keyword Extraction
↓ Output (JSON/PDF/TXT)
3. src/api.py (REST API - 220 lines)
Purpose: FastAPI endpoints for remote/Postman access
Endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Server status check |
| /languages | GET | Supported languages (15) |
| /intents | GET | Supported intent types (6) |
| /summarize | POST | Single document summarization |
| /batch-summarize | POST | Batch processing |
Example Request:
POST http://localhost:8000/summarize
{
"document": "Your text here...",
"language": "english",
"intent": "technical_overview",
"quality_preference": "balanced"
}
Response:
{
"summary": "...",
"language": "english",
"intent": "technical_overview",
"length": 45,
"model": "t5-base",
"complexity": "MODERATE",
"use_rag": false,
"confidence_score": 0.92
}
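For scripted access without Postman, the same request can be sent from Python's standard library. This is a minimal sketch that assumes the server is running in Mode 3 on localhost:8000 and accepts the JSON body shown above; `build_summarize_request` and `request_summary` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/summarize"

def build_summarize_request(document, language="english",
                            intent="technical_overview",
                            quality_preference="balanced", url=API_URL):
    """Build (but do not send) a POST request carrying the JSON body."""
    payload = {
        "document": document,
        "language": language,
        "intent": intent,
        "quality_preference": quality_preference,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def request_summary(document, **options):
    """Send the request and decode the JSON response into a dict."""
    request = build_summarize_request(document, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `request_summary("Your text here...")["summary"]` should give the summary field from the response shown above.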
4. src/preprocessing.py (Text Processing)
Purpose: Clean and prepare text for summarization
Classes:
- TextPreprocessor - General text cleaning
  - clean_text() - Remove noise
  - normalize() - Standardize formatting
  - sent_tokenize() - Split into sentences
  - word_tokenize() - Split into words
- TechnicalDocumentParser - Parse scientific documents
  - remove_citations() - Strip reference citations
  - remove_equations() - Remove LaTeX equations
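To make the two parser steps concrete, here is a regex-based sketch of citation and equation stripping. The patterns are assumptions about what counts as a citation or an equation (bracketed numbers, author-year parentheses, `$...$`, `$$...$$`, `\[...\]`, and equation/align environments); the real TechnicalDocumentParser may use different rules.

```python
import re

# Bracketed numeric citations such as [3], [1, 2], or [4-6].
NUMERIC_CITATION = re.compile(r"\s*\[\d+(?:\s*[-,]\s*\d+)*\]")
# Parenthetical author-year citations such as (Smith et al., 2020).
AUTHOR_YEAR_CITATION = re.compile(r"\s*\([A-Z][^()]*?,\s*\d{4}[a-z]?\)")
# Display math: $$...$$, \[...\], and equation/align environments.
DISPLAY_MATH = re.compile(
    r"\$\$.*?\$\$|\\\[.*?\\\]|\\begin\{(equation\*?|align\*?)\}.*?\\end\{\1\}",
    re.DOTALL,
)
# Inline math: $...$ on a single line, ignoring escaped dollar signs.
INLINE_MATH = re.compile(r"(?<!\\)\$[^$\n]+?\$")

def remove_citations(text):
    """Strip numeric and author-year citations, including leading spaces."""
    text = NUMERIC_CITATION.sub("", text)
    return AUTHOR_YEAR_CITATION.sub("", text)

def remove_equations(text):
    """Replace LaTeX math with a space, then collapse repeated spaces."""
    text = DISPLAY_MATH.sub(" ", text)
    text = INLINE_MATH.sub(" ", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

Display math is removed before inline math so that a `$$...$$` block is not partially consumed by the single-dollar pattern.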
5. src/model_selector.py (Intelligent Selection - 299 lines)
Purpose: Auto-select best model based on document characteristics
Analysis Metrics:
- Word count
- Sentence length
- Vocabulary richness (unique words ratio)
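The three metrics can be computed with a few lines of standard-library Python. This sketch assumes simple regex tokenization and sentence splitting on terminal punctuation; `analyze_document` is a hypothetical name and the project's own tokenizer may count differently.

```python
import re

def analyze_document(text):
    """Return word count, average sentence length, and vocabulary richness."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Split after ., !, or ? followed by whitespace; drop empty pieces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_count = len(words)
    return {
        "word_count": word_count,
        # Mean number of words per sentence.
        "avg_sentence_length": word_count / len(sentences) if sentences else 0.0,
        # Unique words divided by total words (type-token ratio).
        "vocabulary_richness": len(set(words)) / word_count if word_count else 0.0,
    }
```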
Decision Tree:
Word Count Analysis:
├─ SIMPLE (< 500 words) → T5-Small ⚡
├─ MODERATE (500-2000 words) → T5-Base ⚖️
├─ COMPLEX (2000-5000 words) → Pegasus-ArXiv + RAG 🧠
└─ VERY_COMPLEX (> 5000 words) → T5-Large + RAG ✨
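The word-count tiers translate directly into a chain of threshold checks. In the sketch below, the tiers and model names come from this document, but the handling of the shared boundaries is an assumption: exactly 2000 words is treated as MODERATE and exactly 5000 as COMPLEX. `select_model` is a hypothetical name, and the real selector also weighs sentence length and vocabulary richness.

```python
def select_model(word_count):
    """Map a word count to a complexity tier, model id, and RAG flag."""
    if word_count < 500:
        return {"complexity": "SIMPLE", "model": "t5-small", "use_rag": False}
    if word_count <= 2000:  # assumption: 2000 belongs to the lower tier
        return {"complexity": "MODERATE", "model": "t5-base", "use_rag": False}
    if word_count <= 5000:  # assumption: 5000 belongs to the lower tier
        return {"complexity": "COMPLEX", "model": "google/pegasus-arxiv", "use_rag": True}
    return {"complexity": "VERY_COMPLEX", "model": "t5-large", "use_rag": True}
```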
6. src/rag.py (Retrieval-Augmented Generation - 360 lines)
Purpose: Enhance summaries for complex documents using semantic search
Components:
- DocumentChunker - Split docs with overlap
- EmbeddingGenerator - Create 384-dim vectors (sentence-transformers)
- VectorDatabase - FAISS-based similarity search
- RAGPipeline - Orchestrate: chunk → embed → index → retrieve → summarize
How It Works:
Complex Document
↓ Chunk into overlapping segments (512 tokens)
↓ Generate embeddings for each chunk
↓ Build FAISS vector index
↓ Search for most relevant chunks
↓ Feed to summarization model
↓ Enhanced summary with context
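The chunk-and-retrieve steps can be sketched without any ML dependencies. This toy version makes three loud assumptions: list items stand in for model tokens, a bag-of-words `Counter` stands in for the 384-dim sentence-transformers embedding, and a sorted linear scan stands in for the FAISS index. All function names are hypothetical; only the 512/50 defaults come from config.json.

```python
import math
from collections import Counter

def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def embed(text):
    """Toy embedding: term counts (stand-in for a dense vector)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]
```

The overlap keeps sentences that straddle a chunk boundary visible in both neighbors, which is why the step between windows is `chunk_size - overlap` rather than `chunk_size`.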
7. src/evaluation.py (Quality Metrics)
Purpose: Measure summary quality and confidence
Class: SummaryEvaluator
- calculate_rouge_scores() - ROUGE-1, ROUGE-2, ROUGE-L
- get_confidence_score() - 0-1 confidence metric
- evaluate_quality() - Overall quality assessment
Metrics:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
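To show what these overlaps measure, here is a dependency-free sketch of the F1 variants. It is a simplified stand-in for the rouge-score package the project lists: it lowercases and splits on whitespace, with no stemming, so its numbers may differ slightly from the library's. `rouge_n` and `rouge_l` are hypothetical names.

```python
from collections import Counter

def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _f1(overlap, ref_total, cand_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1: clipped n-gram overlap (n=1 unigrams, n=2 bigrams)."""
    ref = _ngrams(reference.lower().split(), n)
    cand = _ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # & keeps the minimum count per n-gram
    return _f1(overlap, sum(ref.values()), sum(cand.values()))

def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    prev = [0] * (len(cand) + 1)  # LCS lengths for the previous reference token
    for ref_token in ref:
        curr = [0]
        for j, cand_token in enumerate(cand, 1):
            if ref_token == cand_token:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(ref), len(cand))
```

For the pair "the cat sat on the mat" and "the cat lay on the mat", five of six unigrams and three of five bigrams match, and the longest common subsequence has five tokens.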
8. src/keywords.py (Keyword Extraction)
Purpose: Extract important keywords and phrases
Class: KeywordExtractor
- extract_keywords() - TF-based extraction
- mine_phrases() - Multi-word phrase detection
- score_keywords() - Importance scoring
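Term-frequency extraction amounts to counting non-stopword tokens and ranking them. The sketch below assumes a tiny illustrative stopword list and a simple regex tokenizer; the signature and scoring of the real `extract_keywords()` are not shown in this document, so treat both as assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
})

def extract_keywords(text, top_n=5):
    """Return (term, relative frequency) pairs for the most frequent terms."""
    tokens = [
        token for token in re.findall(r"[a-z][a-z0-9-]+", text.lower())
        if token not in STOPWORDS
    ]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    # most_common keeps first-seen order among terms with equal counts.
    return [(term, count / total) for term, count in counts.most_common(top_n)]
```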
9. src/exporters.py (Output Formats)
Purpose: Export summaries in multiple formats
Class: SummaryExporter
- export_json() - JSON format with metadata
- export_text() - Plain text
- export_pdf() - Formatted PDF report (reportlab)
- export_markdown() - Markdown format
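The JSON and Markdown exports can be thought of as two renderings of the same result dictionary. This sketch assumes the result has a "summary" key plus arbitrary metadata fields, as in the API response above; `render_json`, `render_markdown`, and `write_export` are hypothetical helpers, and the PDF path through ReportLab is left out.

```python
import json

def render_json(result):
    """Serialize the result, keeping non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False)

def render_markdown(result):
    """Render the summary as a heading plus a metadata bullet list."""
    lines = ["# Summary", "", result["summary"], "", "## Metadata", ""]
    lines.extend(
        f"- {key}: {value}" for key, value in result.items() if key != "summary"
    )
    return "\n".join(lines) + "\n"

def write_export(text, path):
    """Write rendered text to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

Keeping rendering separate from writing makes each format easy to test without touching the filesystem.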
10. src/web_ui.py (Web Interface - 1148 lines)
Purpose: Professional, feature-rich web UI
Features:
- Single document & batch upload
- Document history (localStorage)
- Language selector (15 languages)
- Intent selector (6 types)
- Quality preference (speed/balanced/quality)
- Real-time progress tracking
- Download results (TXT/JSON)
- Copy to clipboard
- Settings panel with persistence
- Responsive design (sidebar + main content)
Access: http://localhost:8001
11. src/models.py (Model Management)
Purpose: Load and initialize pre-trained models
Supported Models:
Speed Tier (⚡):
├─ t5-small
└─ distilbert
Balanced Tier (⚖️):
├─ t5-base
├─ mbart-50-small
└─ mt5-small
Quality Tier (✨):
├─ t5-large
├─ google/pegasus-arxiv
├─ google/pegasus-pubmed
├─ facebook/bart-large-cnn
└─ allenai/led-base-16384
12. src/fine_tuner.py (Fine-tuning Utilities)
Purpose: Fine-tune models on custom datasets
Methods:
- prepare_dataset() - Format custom data
- train() - Fine-tune models
- evaluate() - Test performance
- save_model() - Save checkpoints
13. src/utils.py (Helper Functions)
Purpose: Utility functions used across modules
Functions:
- load_config() - Load config.json
- setup_logging() - Configure logging
- format_output() - Format results
- Device management (CPU/GPU detection)
Configuration (config.json)
{
"model": {
"primary_model": "t5-small",
"max_input_length": 512,
"max_output_length": 150,
"supported_languages": [15 languages],
"default_language": "english"
},
"summarization": {
"intent_types": ["technical_overview", "detailed_analysis", ...],
"chunk_size": 512,
"chunk_overlap": 50,
"preserve_context": true
}
}
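Note that `[15 languages]` and `...` in the listing above are placeholders, so the real config.json must spell those values out to be valid JSON. As a sketch of how the summarization settings might be read, the code below merges the file's values over defaults taken from the listing and checks that the overlap is smaller than the chunk size. The function names and the validation rule are assumptions, not the project's `load_config()`.

```python
import json

# Defaults mirror the values shown in the listing above.
DEFAULTS = {"chunk_size": 512, "chunk_overlap": 50, "preserve_context": True}

def parse_summarization_settings(raw_json):
    """Merge the "summarization" section of a config string over the defaults."""
    config = json.loads(raw_json)
    settings = {**DEFAULTS, **config.get("summarization", {})}
    if not 0 <= settings["chunk_overlap"] < settings["chunk_size"]:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    return settings

def load_summarization_settings(path="config.json"):
    """Read the config file from disk and parse its summarization section."""
    with open(path, encoding="utf-8") as handle:
        return parse_summarization_settings(handle.read())
```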
Supported Features
Languages (15 Total)
English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Vietnamese, Thai
Intent Types (6 Total)
- technical_overview - High-level summary
- detailed_analysis - In-depth breakdown
- methodology - Research methods used
- results - Key findings
- conclusion - Conclusions drawn
- abstract - Academic abstract
Quality Preferences
- Speed (⚡) - T5-Small, < 2 seconds
- Balanced (⚖️) - T5-Base, < 5 seconds
- Quality (✨) - T5-Large + RAG, < 10 seconds
How Components Work Together
Workflow 1: Single Document (Mode 1)
main.py
↓ single_document_mode()
↓
TechnicalDocumentSummarizer.auto_summarize()
├─ TextPreprocessor.clean_text()
├─ ModelSelector (complexity analysis)
├─ (Optional) RAGPipeline
├─ T5/Pegasus model
├─ SummaryEvaluator (ROUGE, confidence)
├─ KeywordExtractor
└─ Output (display or export)
Workflow 2: REST API (Mode 3)
Postman/Web Client
↓ HTTP POST /summarize
↓
FastAPI.summarize_endpoint()
↓
TechnicalDocumentSummarizer.auto_summarize()
↓ (same as Workflow 1)
↓
JSON Response
Workflow 3: Web UI (Mode 4)
Browser → http://localhost:8001
↓
web_ui.py (HTML/CSS/JS)
↓ Form submission
↓
FastAPI /summarize endpoint
↓ (same as Workflow 2)
↓
Display in browser + localStorage
Data Flow Summary
INPUT FORMATS:
├─ Text (paste into UI)
├─ Files (PDF, TXT upload)
└─ Batch (multiple files)
↓
PROCESSING PIPELINE:
├─ Text Cleaning
├─ Tokenization & Chunking
├─ Complexity Analysis
├─ Model Selection
├─ (Optional) Vector Embedding & Indexing
├─ Summarization
├─ Quality Evaluation
└─ Keyword Extraction
↓
OUTPUT FORMATS:
├─ JSON (with metadata)
├─ PDF (formatted report)
├─ TXT (plain text)
└─ Web UI display (with localStorage)
Quick Start Guide
1. Install Dependencies
cd Backend
pip install -r requirements.txt
2. Run in Different Modes
Mode 1 - Single Document:
python main.py
# Select: 1
# Paste text or upload file
Mode 2 - Batch Processing:
python main.py
# Select: 2
# Upload multiple files
Mode 3 - REST API (for Postman):
python main.py
# Select: 3
# API runs on http://localhost:8000
Mode 4 - Web UI:
python main.py
# Select: 4
# Open http://localhost:8001 in browser
API Integration
Using REST API with Postman
Import Collection:
- Open Postman
- Import Postman_Collection.json
Start API Server:
- Run Mode 3 from main.py
- Server starts on http://localhost:8000
Run Tests:
- 7 essential tests included
- Tests health, languages, intents, summarization, batch, multi-language, speed mode
Performance Characteristics
| Metric | Speed | Balanced | Quality |
|---|---|---|---|
| Model | T5-Small | T5-Base | T5-Large + RAG |
| Latency | < 2s | 2-5s | 5-10s |
| Quality Score | 0.70 | 0.85 | 0.95 |
| Memory Usage | 1.5GB | 3GB | 6GB |
| Max Document Size | 500 words | 2,000 words | 5,000+ words |
Development & Testing
Unit Testing
# Future: pytest tests/
pytest
Benchmarking
# Check performance metrics
python benchmark.py
Sanity Checks
# Verify all components working
python sanity_check.py
Documentation Files
| File | Purpose |
|---|---|
| README.md | Project overview & setup |
| SYSTEM_DOCUMENTATION.md | This file - complete architecture |
| config.json | Configuration settings |
| requirements.txt | Python dependencies |
| Postman_Collection.json | API test suite |
Security Considerations
- No external API keys stored in code
- Input validation on all endpoints
- Error handling without exposing stack traces
- Max input length limits (prevent DoS)
- CORS headers properly configured
Key Technologies
| Component | Technology |
|---|---|
| API Framework | FastAPI + Uvicorn |
| NLP Models | HuggingFace Transformers |
| Deep Learning | PyTorch |
| Embeddings | Sentence-Transformers |
| Vector DB | FAISS |
| Quality Metrics | rouge-score |
| Web UI | HTML5 + CSS3 + JS |
| PDF Export | ReportLab |
Support & Debugging
Common Issues
Issue: ModuleNotFoundError for rouge_score
pip install rouge_score
Issue: CUDA/GPU not detected
# Will auto-fallback to CPU
# Check config.json "device": "auto"
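The CPU fallback can be checked by hand with a short snippet. This sketch assumes PyTorch is the backend (as listed under Key Technologies) and that a `"device": "auto"` setting means "use CUDA when available, otherwise CPU"; `resolve_device` is a hypothetical name, not the project's helper.

```python
def resolve_device(preference="auto"):
    """Return the device string to use for a given config preference."""
    if preference != "auto":
        return preference  # an explicit "cpu" or "cuda" is passed through
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so no GPU path either
    return "cuda" if torch.cuda.is_available() else "cpu"
```

If `resolve_device()` returns "cpu" on a machine with a GPU, the usual suspects are a CPU-only PyTorch build or a missing CUDA driver.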
Issue: Model download fails
python models/download_models.py
License
MIT License - See LICENSE file for details
Last Updated: February 24, 2026
Version: 1.0.0