# 📚 Complete System Documentation ## 📁 Project Folder Structure ``` Backend/ ├── .venv/ # Virtual environment (isolated Python) ├── data/ # Data folders │ ├── raw/ # Original documents │ ├── processed/ # Processed data │ └── processing/ # Processing scripts ├── models/ # ML Models │ ├── checkpoints/ # Model checkpoints │ ├── tokenizers/ # Tokenizer files │ ├── download_models.py # Download pre-trained models │ └── README.md ├── notebooks/ # Jupyter notebooks for experimentation ├── results/ # Output summaries and results ├── src/ # Source code (CORE MODULES) │ ├── __init__.py # Package initialization │ ├── api.py # FastAPI REST API endpoints │ ├── summarizer.py # Main summarization orchestrator │ ├── preprocessing.py # Text preprocessing & cleaning │ ├── models.py # Model loading & initialization │ ├── rag.py # Retrieval-Augmented Generation │ ├── model_selector.py # Intelligent model selection │ ├── evaluation.py # Quality metrics & ROUGE scores │ ├── keywords.py # Keyword extraction │ ├── exporters.py # Export to JSON, PDF, TXT, Markdown │ ├── fine_tuner.py # Fine-tuning utilities │ ├── utils.py # Helper functions │ ├── web_ui.py # Web UI (HTML/CSS/JS) │ └── __pycache__/ # Python compiled files ├── main.py # Entry point (4 CLI modes) ├── config.json # Configuration & settings ├── requirements.txt # Python dependencies ├── README.md # Project overview ├── Postman_Collection.json # API test suite └── SYSTEM_DOCUMENTATION.md # This file ``` --- ## 🔧 Core Modules ### 1. **main.py** (Entry Point - 213 lines) **Purpose:** CLI interface with 4 operational modes **Functions:** - `single_document_mode()` - Summarize one document - `batch_mode()` - Process multiple files - `api_mode()` - Launch REST API server (port 8000) - `web_ui_mode()` - Launch web UI (port 8001) **How to Use:** ```bash python main.py # Select: 1, 2, 3, or 4 ``` --- ### 2. **src/summarizer.py** (Core Pipeline - 390 lines) **Purpose:** Main orchestration for document summarization **Key Classes:** - `TechnicalDocumentSummarizer` - Main class - `auto_summarize(document, quality_preference)` - Intelligent model routing - `summarize(document, language, intent)` - Direct summarization - `summarize_batch(documents)` - Process multiple documents - `_simplify_language(summary)` - Convert jargon to simple terms **Flow:** ``` Input Document → Preprocessing (clean, tokenize, chunk) → Complexity Analysis → Model Selection (T5-Small/Base/Large + Pegasus) → Optional RAG (for complex docs) → Quality Evaluation (ROUGE, confidence) → Keyword Extraction → Output (JSON/PDF/TXT) ``` --- ### 3. **src/api.py** (REST API - 220 lines) **Purpose:** FastAPI endpoints for remote/Postman access **Endpoints:** | Endpoint | Method | Purpose | |----------|--------|---------| | `/health` | GET | Server status check | | `/languages` | GET | Supported languages (15) | | `/intents` | GET | Supported intent types (6) | | `/summarize` | POST | Single document summarization | | `/batch-summarize` | POST | Batch processing | **Example Request:** ```json POST http://localhost:8000/summarize { "document": "Your text here...", "language": "english", "intent": "technical_overview", "quality_preference": "balanced" } ``` **Response:** ```json { "summary": "...", "language": "english", "intent": "technical_overview", "length": 45, "model": "t5-base", "complexity": "MODERATE", "use_rag": false, "confidence_score": 0.92 } ``` --- ### 4. **src/preprocessing.py** (Text Processing) **Purpose:** Clean and prepare text for summarization **Classes:** - `TextPreprocessor` - General text cleaning - `clean_text()` - Remove noise - `normalize()` - Standardize formatting - `sent_tokenize()` - Split into sentences - `word_tokenize()` - Split into words - `TechnicalDocumentParser` - Parse scientific documents - `remove_citations()` - Strip reference citations - `remove_equations()` - Remove LaTeX equations --- ### 5. **src/model_selector.py** (Intelligent Selection - 299 lines) **Purpose:** Auto-select best model based on document characteristics **Analysis Metrics:** - Word count - Sentence length - Vocabulary richness (unique words ratio) **Decision Tree:** ``` Word Count Analysis: ├─ SIMPLE (< 500 words) → T5-Small ⚡ ├─ MODERATE (500-2000 words) → T5-Base ⚖️ ├─ COMPLEX (2000-5000 words) → Pegasus-ArXiv + RAG 🧠 └─ VERY_COMPLEX (> 5000 words) → T5-Large + RAG ✨ ``` --- ### 6. **src/rag.py** (Retrieval-Augmented Generation - 360 lines) **Purpose:** Enhance summaries for complex documents using semantic search **Components:** - `DocumentChunker` - Split docs with overlap - `EmbeddingGenerator` - Create 384-dim vectors (sentence-transformers) - `VectorDatabase` - FAISS-based similarity search - `RAGPipeline` - Orchestrate: chunk → embed → index → retrieve → summarize **How It Works:** ``` Complex Document → Chunk into overlapping segments (512 tokens) → Generate embeddings for each chunk → Build FAISS vector index → Search for most relevant chunks → Feed to summarization model → Enhanced summary with context ``` --- ### 7. **src/evaluation.py** (Quality Metrics) **Purpose:** Measure summary quality and confidence **Class:** `SummaryEvaluator` - `calculate_rouge_scores()` - ROUGE-1, ROUGE-2, ROUGE-L - `get_confidence_score()` - 0-1 confidence metric - `evaluate_quality()` - Overall quality assessment **Metrics:** - **ROUGE-1:** Unigram overlap - **ROUGE-2:** Bigram overlap - **ROUGE-L:** Longest common subsequence --- ### 8. **src/keywords.py** (Keyword Extraction) **Purpose:** Extract important keywords and phrases **Class:** `KeywordExtractor` - `extract_keywords()` - TF-based extraction - `mine_phrases()` - Multi-word phrase detection - `score_keywords()` - Importance scoring --- ### 9. **src/exporters.py** (Output Formats) **Purpose:** Export summaries in multiple formats **Class:** `SummaryExporter` - `export_json()` - JSON format with metadata - `export_text()` - Plain text - `export_pdf()` - Formatted PDF report (reportlab) - `export_markdown()` - Markdown format --- ### 10. **src/web_ui.py** (Web Interface - 1148 lines) **Purpose:** Professional, feature-rich web UI **Features:** - ✅ Single document & batch upload - ✅ Document history (localStorage) - ✅ Language selector (15 languages) - ✅ Intent selector (6 types) - ✅ Quality preference (speed/balanced/quality) - ✅ Real-time progress tracking - ✅ Download results (TXT/JSON) - ✅ Copy to clipboard - ✅ Settings panel with persistence - ✅ Responsive design (sidebar + main content) **Access:** `http://localhost:8001` --- ### 11. **src/models.py** (Model Management) **Purpose:** Load and initialize pre-trained models **Supported Models:** ``` Speed Tier (⚡): ├─ t5-small └─ distilbert Balanced Tier (⚖️): ├─ t5-base ├─ mbart-50-small └─ mt5-small Quality Tier (✨): ├─ t5-large ├─ google/pegasus-arxiv ├─ google/pegasus-pubmed ├─ facebook/bart-large-cnn └─ allenai/led-base-16384 ``` --- ### 12. **src/fine_tuner.py** (Fine-tuning Utilities) **Purpose:** Fine-tune models on custom datasets **Methods:** - `prepare_dataset()` - Format custom data - `train()` - Fine-tune models - `evaluate()` - Test performance - `save_model()` - Save checkpoints --- ### 13. **src/utils.py** (Helper Functions) **Purpose:** Utility functions used across modules **Functions:** - `load_config()` - Load config.json - `setup_logging()` - Configure logging - `format_output()` - Format results - Device management (CPU/GPU detection) --- ## ⚙️ Configuration (config.json) ```json { "model": { "primary_model": "t5-small", "max_input_length": 512, "max_output_length": 150, "supported_languages": [15 languages], "default_language": "english" }, "summarization": { "intent_types": ["technical_overview", "detailed_analysis", ...], "chunk_size": 512, "chunk_overlap": 50, "preserve_context": true } } ``` --- ## 🎯 Supported Features ### Languages (15 Total) English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Vietnamese, Thai ### Intent Types (6 Total) 1. **technical_overview** - High-level summary 2. **detailed_analysis** - In-depth breakdown 3. **methodology** - Research methods used 4. **results** - Key findings 5. **conclusion** - Conclusions drawn 6. **abstract** - Academic abstract ### Quality Preferences - **Speed** (⚡) - T5-Small, < 2 seconds - **Balanced** (⚖️) - T5-Base, < 5 seconds - **Quality** (✨) - T5-Large + RAG, < 10 seconds --- ## 🔌 How Components Work Together ### Workflow 1: Single Document (Mode 1) ``` main.py ↓ single_document_mode() ↓ TechnicalDocumentSummarizer.auto_summarize() ├→ TextPreprocessor.clean_text() ├→ ModelSelector (complexity analysis) ├→ (Optional) RAGPipeline ├→ T5/Pegasus model ├→ SummaryEvaluator (ROUGE, confidence) ├→ KeywordExtractor └→ Output (display or export) ``` ### Workflow 2: REST API (Mode 3) ``` Postman/Web Client ↓ HTTP POST /summarize ↓ FastAPI.summarize_endpoint() ↓ TechnicalDocumentSummarizer.auto_summarize() ↓ (same as Workflow 1) ↓ JSON Response ``` ### Workflow 3: Web UI (Mode 4) ``` Browser → http://localhost:8001 ↓ web_ui.py (HTML/CSS/JS) ↓ Form submission ↓ FastAPI /summarize endpoint ↓ (same as Workflow 2) ↓ Display in browser + localStorage ``` --- ## 📊 Data Flow Summary ``` INPUT FORMATS: ├─ Text (paste into UI) ├─ Files (PDF, TXT upload) └─ Batch (multiple files) ↓ PROCESSING PIPELINE: ├─ Text Cleaning ├─ Tokenization & Chunking ├─ Complexity Analysis ├─ Model Selection ├─ (Optional) Vector Embedding & Indexing ├─ Summarization ├─ Quality Evaluation └─ Keyword Extraction ↓ OUTPUT FORMATS: ├─ JSON (with metadata) ├─ PDF (formatted report) ├─ TXT (plain text) └─ Web UI display (with localStorage) ``` --- ## 🚀 Quick Start Guide ### 1. Install Dependencies ```bash cd Backend pip install -r requirements.txt ``` ### 2. Run in Different Modes **Mode 1 - Single Document:** ```bash python main.py # Select: 1 # Paste text or upload file ``` **Mode 2 - Batch Processing:** ```bash python main.py # Select: 2 # Upload multiple files ``` **Mode 3 - REST API (for Postman):** ```bash python main.py # Select: 3 # API runs on http://localhost:8000 ``` **Mode 4 - Web UI:** ```bash python main.py # Select: 4 # Open http://localhost:8001 in browser ``` --- ## 🔗 API Integration ### Using REST API with Postman 1. **Import Collection:** - Open Postman - Import `Postman_Collection.json` 2. **Start API Server:** - Run Mode 3 from main.py - Server starts on `http://localhost:8000` 3. **Run Tests:** - 7 essential tests included - Tests health, languages, intents, summarization, batch, multi-language, speed mode --- ## 📈 Performance Characteristics | Metric | Speed | Balanced | Quality | |--------|-------|----------|---------| | Model | T5-Small | T5-Base | T5-Large + RAG | | Latency | < 2s | 2-5s | 5-10s | | Quality Score | 0.70 | 0.85 | 0.95 | | Memory Usage | 1.5GB | 3GB | 6GB | | Doc Size Max | 500w | 2000w | 5000w+ | --- ## 🛠️ Development & Testing ### Unit Testing ```bash # Future: pytest tests/ pytest ``` ### Benchmarking ```bash # Check performance metrics python benchmark.py ``` ### Sanity Checks ```bash # Verify all components working python sanity_check.py ``` --- ## 📚 Documentation Files | File | Purpose | |------|---------| | `README.md` | Project overview & setup | | `SYSTEM_DOCUMENTATION.md` | This file - complete architecture | | `config.json` | Configuration settings | | `requirements.txt` | Python dependencies | | `Postman_Collection.json` | API test suite | --- ## 🔐 Security Considerations - ✅ No external API keys stored in code - ✅ Input validation on all endpoints - ✅ Error handling without exposing stack traces - ✅ Max input length limits (prevent DoS) - ✅ CORS headers properly configured --- ## 🎓 Key Technologies | Component | Technology | |-----------|-----------| | API Framework | FastAPI + Uvicorn | | NLP Models | HuggingFace Transformers | | Deep Learning | PyTorch | | Embeddings | Sentence-Transformers | | Vector DB | FAISS | | Quality Metrics | rouge-score | | Web UI | HTML5 + CSS3 + JS | | PDF Export | ReportLab | --- ## 📞 Support & Debugging ### Common Issues **Issue:** ModuleNotFoundError for rouge_score ```bash pip install rouge_score ``` **Issue:** CUDA/GPU not detected ```bash # Will auto-fallback to CPU # Check config.json "device": "auto" ``` **Issue:** Model download fails ```bash python models/download_models.py ``` --- ## 📄 License MIT License - See LICENSE file for details --- **Last Updated:** February 24, 2026 **Version:** 1.0.0