# Complete System Documentation

## Project Folder Structure

```
Backend/
├── .venv/                      # Virtual environment (isolated Python)
├── data/                       # Data folders
│   ├── raw/                    # Original documents
│   ├── processed/              # Processed data
│   └── processing/             # Processing scripts
├── models/                     # ML models
│   ├── checkpoints/            # Model checkpoints
│   ├── tokenizers/             # Tokenizer files
│   ├── download_models.py      # Download pre-trained models
│   └── README.md
├── notebooks/                  # Jupyter notebooks for experimentation
├── results/                    # Output summaries and results
├── src/                        # Source code (core modules)
│   ├── __init__.py             # Package initialization
│   ├── api.py                  # FastAPI REST API endpoints
│   ├── summarizer.py           # Main summarization orchestrator
│   ├── preprocessing.py        # Text preprocessing & cleaning
│   ├── models.py               # Model loading & initialization
│   ├── rag.py                  # Retrieval-Augmented Generation
│   ├── model_selector.py       # Intelligent model selection
│   ├── evaluation.py           # Quality metrics & ROUGE scores
│   ├── keywords.py             # Keyword extraction
│   ├── exporters.py            # Export to JSON, PDF, TXT, Markdown
│   ├── fine_tuner.py           # Fine-tuning utilities
│   ├── utils.py                # Helper functions
│   ├── web_ui.py               # Web UI (HTML/CSS/JS)
│   └── __pycache__/            # Python compiled files
├── main.py                     # Entry point (4 CLI modes)
├── config.json                 # Configuration & settings
├── requirements.txt            # Python dependencies
├── README.md                   # Project overview
├── Postman_Collection.json     # API test suite
└── SYSTEM_DOCUMENTATION.md     # This file
```

---

## Core Modules

### 1. **main.py** (Entry Point - 213 lines)

**Purpose:** Command-line interface with 4 operational modes

**Functions:**
- `single_document_mode()` - Summarize one document
- `batch_mode()` - Process multiple files
- `api_mode()` - Launch REST API server (port 8000)
- `web_ui_mode()` - Launch web UI (port 8001)

**How to Use:**
```bash
python main.py
# Select: 1, 2, 3, or 4
```
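
The menu selection above can be pictured as a lookup from the typed choice to one of the four mode functions. The sketch below is only an illustration of that idea: the handler bodies are stand-ins that return a label, and `MODES` and `run_mode` are hypothetical names, not taken from the real `main.py`.

```python
from typing import Callable, Dict


# Stand-in handlers: the real functions live in main.py and do the actual work.
def single_document_mode() -> str:
    return "single document"


def batch_mode() -> str:
    return "batch"


def api_mode() -> str:
    return "REST API on port 8000"


def web_ui_mode() -> str:
    return "web UI on port 8001"


# Hypothetical dispatch table keyed by the menu choice.
MODES: Dict[str, Callable[[], str]] = {
    "1": single_document_mode,
    "2": batch_mode,
    "3": api_mode,
    "4": web_ui_mode,
}


def run_mode(choice: str) -> str:
    """Look up the handler for a menu choice and call it."""
    handler = MODES.get(choice.strip())
    if handler is None:
        raise ValueError(f"Unknown mode {choice!r}; expected one of {sorted(MODES)}")
    return handler()
```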

---

### 2. **src/summarizer.py** (Core Pipeline - 390 lines)

**Purpose:** Main orchestration for document summarization

**Key Classes:**
- `TechnicalDocumentSummarizer` - Main class
  - `auto_summarize(document, quality_preference)` - Intelligent model routing
  - `summarize(document, language, intent)` - Direct summarization
  - `summarize_batch(documents)` - Process multiple documents
  - `_simplify_language(summary)` - Convert jargon to simple terms

**Flow:**
```
Input Document
↓ Preprocessing (clean, tokenize, chunk)
↓ Complexity Analysis
↓ Model Selection (T5-Small/Base/Large + Pegasus)
↓ Optional RAG (for complex docs)
↓ Summarization
↓ Quality Evaluation (ROUGE, confidence)
↓ Keyword Extraction
↓ Output (JSON/PDF/TXT)
```

---

### 3. **src/api.py** (REST API - 220 lines)

**Purpose:** FastAPI endpoints for remote/Postman access

**Endpoints:**

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/health` | GET | Server status check |
| `/languages` | GET | Supported languages (15) |
| `/intents` | GET | Supported intent types (6) |
| `/summarize` | POST | Single document summarization |
| `/batch-summarize` | POST | Batch processing |

**Example Request:** `POST http://localhost:8000/summarize`
```json
{
  "document": "Your text here...",
  "language": "english",
  "intent": "technical_overview",
  "quality_preference": "balanced"
}
```

**Response:**
```json
{
  "summary": "...",
  "language": "english",
  "intent": "technical_overview",
  "length": 45,
  "model": "t5-base",
  "complexity": "MODERATE",
  "use_rag": false,
  "confidence_score": 0.92
}
```
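
For callers that prefer a script over Postman, the request above can be sent with the Python standard library alone. This is a minimal sketch: `build_summarize_request` and `summarize` are hypothetical helper names, the default field values are copied from the example request, and the 60-second timeout is an arbitrary choice. `summarize` needs the API server from Mode 3 to be running.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/summarize"  # host and port as documented above


def build_summarize_request(document: str,
                            language: str = "english",
                            intent: str = "technical_overview",
                            quality_preference: str = "balanced") -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "document": document,
        "language": language,
        "intent": intent,
        "quality_preference": quality_preference,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(document: str, **options: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_summarize_request(document, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```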

---

### 4. **src/preprocessing.py** (Text Processing)

**Purpose:** Clean and prepare text for summarization

**Classes:**
- `TextPreprocessor` - General text cleaning
  - `clean_text()` - Remove noise
  - `normalize()` - Standardize formatting
  - `sent_tokenize()` - Split into sentences
  - `word_tokenize()` - Split into words
- `TechnicalDocumentParser` - Parse scientific documents
  - `remove_citations()` - Strip reference citations
  - `remove_equations()` - Remove LaTeX equations
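
To make the citation and equation stripping concrete, here is a regex-based sketch. It is not the project's implementation: the patterns are assumptions that cover only numeric citations such as `[3]` or `[1, 2]` and dollar-delimited LaTeX, and the functions are written as plain module-level functions rather than methods.

```python
import re

# Numeric citations such as [12], [3, 4] or [5-7] (assumed citation style).
NUMERIC_CITATION = re.compile(r"\s*\[\d+(?:\s*[-,\u2013]\s*\d+)*\]")
# Display math ($$...$$) is removed before inline math ($...$).
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)
# Caveat: this also matches text between two currency amounts like "$5 and $10".
INLINE_MATH = re.compile(r"\$[^$\n]+\$")


def remove_citations(text: str) -> str:
    """Drop bracketed numeric citations together with the space before them."""
    return NUMERIC_CITATION.sub("", text)


def remove_equations(text: str) -> str:
    """Replace dollar-delimited LaTeX with a single space."""
    text = DISPLAY_MATH.sub(" ", text)
    return INLINE_MATH.sub(" ", text)


def normalize(text: str) -> str:
    """Collapse runs of whitespace left behind by the removals."""
    return re.sub(r"\s+", " ", text).strip()
```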

---

### 5. **src/model_selector.py** (Intelligent Selection - 299 lines)

**Purpose:** Auto-select best model based on document characteristics

**Analysis Metrics:**
- Word count
- Sentence length
- Vocabulary richness (unique words ratio)

**Decision Tree:**
```
Word Count Analysis:
├─ SIMPLE (< 500 words) → T5-Small ⚡
├─ MODERATE (500-2000 words) → T5-Base ⚖️
├─ COMPLEX (2000-5000 words) → Pegasus-ArXiv + RAG 🧠
└─ VERY_COMPLEX (> 5000 words) → T5-Large + RAG ✨
```
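
The word-count branch of the decision tree translates directly into a small function. The sketch below uses only word count (sentence length and vocabulary richness are left out), and it assumes that documents of exactly 2000 or 5000 words fall into the lower tier, since the ranges above overlap at those boundaries. `ModelChoice` and `select_model` are hypothetical names.

```python
from typing import NamedTuple


class ModelChoice(NamedTuple):
    complexity: str
    model: str
    use_rag: bool


def select_model(document: str) -> ModelChoice:
    """Pick a model tier from the whitespace-delimited word count."""
    word_count = len(document.split())
    if word_count < 500:
        return ModelChoice("SIMPLE", "t5-small", False)
    if word_count <= 2000:  # assumption: boundary value goes to the lower tier
        return ModelChoice("MODERATE", "t5-base", False)
    if word_count <= 5000:  # assumption: boundary value goes to the lower tier
        return ModelChoice("COMPLEX", "google/pegasus-arxiv", True)
    return ModelChoice("VERY_COMPLEX", "t5-large", True)
```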

---

### 6. **src/rag.py** (Retrieval-Augmented Generation - 360 lines)

**Purpose:** Enhance summaries for complex documents using semantic search

**Components:**
- `DocumentChunker` - Split docs with overlap
- `EmbeddingGenerator` - Create 384-dim vectors (sentence-transformers)
- `VectorDatabase` - FAISS-based similarity search
- `RAGPipeline` - Orchestrate: chunk → embed → index → retrieve → summarize

**How It Works:**
```
Complex Document
↓ Chunk into overlapping segments (512 tokens)
↓ Generate embeddings for each chunk
↓ Build FAISS vector index
↓ Search for most relevant chunks
↓ Feed to summarization model
↓ Enhanced summary with context
```
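
The chunk, embed, and retrieve steps can be illustrated without any heavy dependencies. The sketch below deliberately swaps in simpler techniques: it chunks by whitespace-separated words rather than model tokens, uses bag-of-words counts in place of sentence-transformers embeddings, and ranks by brute-force cosine similarity in place of a FAISS index. All function names are hypothetical; only the chunk size (512) and overlap (50) come from this document.

```python
import math
from collections import Counter
from typing import List


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into word windows that share `overlap` words with their neighbour."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (stand-in for a dense vector)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks: List[str], query: str, top_k: int = 3) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vector), reverse=True)
    return ranked[:top_k]
```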

---

### 7. **src/evaluation.py** (Quality Metrics)

**Purpose:** Measure summary quality and confidence

**Class:** `SummaryEvaluator`
- `calculate_rouge_scores()` - ROUGE-1, ROUGE-2, ROUGE-L
- `get_confidence_score()` - 0-1 confidence metric
- `evaluate_quality()` - Overall quality assessment

**Metrics:**
- **ROUGE-1:** Unigram overlap
- **ROUGE-2:** Bigram overlap
- **ROUGE-L:** Longest common subsequence
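
The three metrics can be computed by hand to see what they measure. The sketch below reports F1 scores using lower-cased whitespace tokens and no stemming, so its numbers will not necessarily match the `rouge-score` package listed under Key Technologies, which applies its own tokenization. `rouge_n` and `rouge_l` are hypothetical names.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, candidate_total: int, reference_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """N-gram overlap F1 (n=1 for ROUGE-1, n=2 for ROUGE-2)."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, one DP row at a time."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            current.append(previous[j - 1] + 1 if x == y else max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """Longest-common-subsequence F1."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(cand, ref), len(cand), len(ref))
```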

---

### 8. **src/keywords.py** (Keyword Extraction)

**Purpose:** Extract important keywords and phrases

**Class:** `KeywordExtractor`
- `extract_keywords()` - Term-frequency (TF) based extraction
- `mine_phrases()` - Multi-word phrase detection
- `score_keywords()` - Importance scoring
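
Term-frequency extraction amounts to counting the non-stopword tokens and keeping the most common ones. The sketch below shows that idea as a standalone function; the stopword list is a tiny illustrative sample and the tokenization regex is an assumption, neither taken from the project.

```python
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword sample; a real list would be much longer.
STOPWORDS = frozenset({
    "the", "and", "for", "with", "that", "this", "are", "is", "in",
    "of", "to", "an", "on", "as", "by", "it", "be", "or",
})


def extract_keywords(text: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Return (term, relative frequency) pairs for the top_k most frequent terms."""
    # Tokens are lower-cased words of at least two characters.
    tokens = [t for t in re.findall(r"[a-z][a-z0-9-]+", text.lower())
              if t not in STOPWORDS]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    return [(term, count / total) for term, count in counts.most_common(top_k)]
```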

---

### 9. **src/exporters.py** (Output Formats)

**Purpose:** Export summaries in multiple formats

**Class:** `SummaryExporter`
- `export_json()` - JSON format with metadata
- `export_text()` - Plain text
- `export_pdf()` - Formatted PDF report (reportlab)
- `export_markdown()` - Markdown format
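
As a rough picture of the JSON and Markdown exporters, the sketch below renders a result dictionary shaped like the `/summarize` response into each format. The function names (`to_json`, `to_markdown`, `save`) and the Markdown layout are assumptions for illustration; the real `SummaryExporter` methods may differ, and PDF export via ReportLab is not shown.

```python
import json
from pathlib import Path
from typing import Union


def to_json(result: dict) -> str:
    """Serialize the whole result, metadata included, as indented JSON."""
    return json.dumps(result, indent=2, ensure_ascii=False)


def to_markdown(result: dict) -> str:
    """Render the summary as a heading plus a bullet list of the other fields."""
    lines = ["# Summary", "", str(result["summary"]), "", "## Metadata", ""]
    lines += [f"- **{key}:** {value}"
              for key, value in result.items() if key != "summary"]
    return "\n".join(lines) + "\n"


def save(text: str, path: Union[str, Path]) -> Path:
    """Write rendered text to disk as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path
```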

---

### 10. **src/web_ui.py** (Web Interface - 1148 lines)

**Purpose:** Professional, feature-rich web UI

**Features:**
- Single document & batch upload
- Document history (localStorage)
- Language selector (15 languages)
- Intent selector (6 types)
- Quality preference (speed/balanced/quality)
- Real-time progress tracking
- Download results (TXT/JSON)
- Copy to clipboard
- Settings panel with persistence
- Responsive design (sidebar + main content)

**Access:** `http://localhost:8001`

---

### 11. **src/models.py** (Model Management)

**Purpose:** Load and initialize pre-trained models

**Supported Models:**
```
Speed Tier (⚡):
├─ t5-small
└─ distilbert
Balanced Tier (⚖️):
├─ t5-base
├─ mbart-50-small
└─ mt5-small
Quality Tier (✨):
├─ t5-large
├─ google/pegasus-arxiv
├─ google/pegasus-pubmed
├─ facebook/bart-large-cnn
└─ allenai/led-base-16384
```

---

### 12. **src/fine_tuner.py** (Fine-tuning Utilities)

**Purpose:** Fine-tune models on custom datasets

**Methods:**
- `prepare_dataset()` - Format custom data
- `train()` - Fine-tune models
- `evaluate()` - Test performance
- `save_model()` - Save checkpoints

---

### 13. **src/utils.py** (Helper Functions)

**Purpose:** Utility functions used across modules

**Functions:**
- `load_config()` - Load config.json
- `setup_logging()` - Configure logging
- `format_output()` - Format results
- Device management (CPU/GPU detection)

---

## Configuration (config.json)

```json
{
  "model": {
    "primary_model": "t5-small",
    "max_input_length": 512,
    "max_output_length": 150,
    "supported_languages": [15 languages],
    "default_language": "english"
  },
  "summarization": {
    "intent_types": ["technical_overview", "detailed_analysis", ...],
    "chunk_size": 512,
    "chunk_overlap": 50,
    "preserve_context": true
  }
}
```
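
The listing above is abridged (`[15 languages]` and `...` are placeholders), so it will not parse as JSON exactly as shown. Assuming the real file is valid JSON with the same keys, a loader could merge it over defaults and check that the chunk settings are consistent. This sketch stands in for `load_config()` in `src/utils.py`, whose actual body is not shown here; `DEFAULTS` and `merge_config` are hypothetical names, and the default values are copied from the listing.

```python
import json
from pathlib import Path
from typing import Union

# Defaults copied from the config listing above.
DEFAULTS = {
    "model": {
        "primary_model": "t5-small",
        "max_input_length": 512,
        "max_output_length": 150,
        "default_language": "english",
    },
    "summarization": {
        "chunk_size": 512,
        "chunk_overlap": 50,
        "preserve_context": True,
    },
}


def merge_config(overrides: dict) -> dict:
    """Overlay user settings on a copy of DEFAULTS, section by section."""
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in overrides.items():
        if isinstance(values, dict):
            merged.setdefault(section, {}).update(values)
        else:
            merged[section] = values  # top-level scalar such as a device setting
    return merged


def load_config(path: Union[str, Path] = "config.json") -> dict:
    """Read config.json if it exists, fall back to defaults, then validate."""
    path = Path(path)
    overrides = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config = merge_config(overrides)
    chunking = config["summarization"]
    if chunking["chunk_overlap"] >= chunking["chunk_size"]:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return config
```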

---

## Supported Features

### Languages (15 Total)

English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Vietnamese, Thai

### Intent Types (6 Total)

1. **technical_overview** - High-level summary
2. **detailed_analysis** - In-depth breakdown
3. **methodology** - Research methods used
4. **results** - Key findings
5. **conclusion** - Conclusions drawn
6. **abstract** - Academic abstract

### Quality Preferences

- **Speed** (⚡) - T5-Small, < 2 seconds
- **Balanced** (⚖️) - T5-Base, < 5 seconds
- **Quality** (✨) - T5-Large + RAG, < 10 seconds

---

## How Components Work Together

### Workflow 1: Single Document (Mode 1)

```
main.py
↓ single_document_mode()
↓
TechnicalDocumentSummarizer.auto_summarize()
├─ TextPreprocessor.clean_text()
├─ ModelSelector (complexity analysis)
├─ (Optional) RAGPipeline
├─ T5/Pegasus model
├─ SummaryEvaluator (ROUGE, confidence)
├─ KeywordExtractor
└─ Output (display or export)
```

### Workflow 2: REST API (Mode 3)

```
Postman/Web Client
↓ HTTP POST /summarize
↓
FastAPI.summarize_endpoint()
↓
TechnicalDocumentSummarizer.auto_summarize()
↓ (same as Workflow 1)
↓
JSON Response
```

### Workflow 3: Web UI (Mode 4)

```
Browser → http://localhost:8001
↓
web_ui.py (HTML/CSS/JS)
↓ Form submission
↓
FastAPI /summarize endpoint
↓ (same as Workflow 2)
↓
Display in browser + localStorage
```

---

## Data Flow Summary

```
INPUT FORMATS:
├─ Text (paste into UI)
├─ Files (PDF, TXT upload)
└─ Batch (multiple files)
↓
PROCESSING PIPELINE:
├─ Text Cleaning
├─ Tokenization & Chunking
├─ Complexity Analysis
├─ Model Selection
├─ (Optional) Vector Embedding & Indexing
├─ Summarization
├─ Quality Evaluation
└─ Keyword Extraction
↓
OUTPUT FORMATS:
├─ JSON (with metadata)
├─ PDF (formatted report)
├─ TXT (plain text)
└─ Web UI display (with localStorage)
```

---

## Quick Start Guide

### 1. Install Dependencies

```bash
cd Backend
pip install -r requirements.txt
```

### 2. Run in Different Modes

**Mode 1 - Single Document:**
```bash
python main.py
# Select: 1
# Paste text or upload file
```

**Mode 2 - Batch Processing:**
```bash
python main.py
# Select: 2
# Upload multiple files
```

**Mode 3 - REST API (for Postman):**
```bash
python main.py
# Select: 3
# API runs on http://localhost:8000
```

**Mode 4 - Web UI:**
```bash
python main.py
# Select: 4
# Open http://localhost:8001 in browser
```

---

## API Integration

### Using REST API with Postman

1. **Import Collection:**
   - Open Postman
   - Import `Postman_Collection.json`
2. **Start API Server:**
   - Run Mode 3 from main.py
   - Server starts on `http://localhost:8000`
3. **Run Tests:**
   - 7 essential tests included
   - Tests cover health, languages, intents, summarization, batch, multi-language, and speed mode

---

## Performance Characteristics

| Metric | Speed | Balanced | Quality |
|--------|-------|----------|---------|
| Model | T5-Small | T5-Base | T5-Large + RAG |
| Latency | < 2 s | 2-5 s | 5-10 s |
| Quality Score | 0.70 | 0.85 | 0.95 |
| Memory Usage | 1.5 GB | 3 GB | 6 GB |
| Max Document Size (words) | 500 | 2000 | 5000+ |

---

## Development & Testing

### Unit Testing

```bash
# Future: pytest tests/
pytest
```

### Benchmarking

```bash
# Check performance metrics
python benchmark.py
```

### Sanity Checks

```bash
# Verify all components working
python sanity_check.py
```

---

## Documentation Files

| File | Purpose |
|------|---------|
| `README.md` | Project overview & setup |
| `SYSTEM_DOCUMENTATION.md` | This file - complete architecture |
| `config.json` | Configuration settings |
| `requirements.txt` | Python dependencies |
| `Postman_Collection.json` | API test suite |

---

## Security Considerations

- No external API keys stored in code
- Input validation on all endpoints
- Error handling without exposing stack traces
- Max input length limits (prevent DoS)
- CORS headers properly configured
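
Input validation and length limits of the kind listed above might look like the following. This is a plain-Python sketch, not the project's code: FastAPI applications usually express such checks through Pydantic request models, and the 100,000-character limit, the `ValidationError` class, and the function name are all assumptions. The allowed quality values are the three named elsewhere in this document.

```python
MAX_DOCUMENT_CHARS = 100_000  # hypothetical limit; the real value is not documented
ALLOWED_QUALITY = {"speed", "balanced", "quality"}


class ValidationError(ValueError):
    """Raised for bad input; the message is safe to return to the client."""


def validate_summarize_payload(payload: dict) -> dict:
    """Check a /summarize request body and fill in documented defaults."""
    document = payload.get("document")
    if not isinstance(document, str) or not document.strip():
        raise ValidationError("'document' must be a non-empty string")
    if len(document) > MAX_DOCUMENT_CHARS:
        raise ValidationError(f"'document' exceeds {MAX_DOCUMENT_CHARS} characters")

    quality = payload.get("quality_preference", "balanced")
    if not isinstance(quality, str) or quality not in ALLOWED_QUALITY:
        raise ValidationError(
            f"'quality_preference' must be one of {sorted(ALLOWED_QUALITY)}")

    return {
        "document": document.strip(),
        "language": payload.get("language", "english"),
        "intent": payload.get("intent", "technical_overview"),
        "quality_preference": quality,
    }
```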

---

## Key Technologies

| Component | Technology |
|-----------|-----------|
| API Framework | FastAPI + Uvicorn |
| NLP Models | HuggingFace Transformers |
| Deep Learning | PyTorch |
| Embeddings | Sentence-Transformers |
| Vector DB | FAISS |
| Quality Metrics | rouge-score |
| Web UI | HTML5 + CSS3 + JS |
| PDF Export | ReportLab |

---

## Support & Debugging

### Common Issues

**Issue:** ModuleNotFoundError for rouge_score
```bash
pip install rouge_score
```

**Issue:** CUDA/GPU not detected
```bash
# Will auto-fallback to CPU
# Check config.json "device": "auto"
```

**Issue:** Model download fails
```bash
python models/download_models.py
```

---

## License

MIT License - See LICENSE file for details

---

**Last Updated:** February 24, 2026

**Version:** 1.0.0