---
title: LexiMind
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: scripts/demo_gradio.py
pinned: false
---
## LexiMind: A Multi-Task NLP Model
LexiMind is a state-of-the-art Natural Language Processing model designed for complex document understanding. It features a **custom-built Transformer architecture** initialized with weights from Google's **FLAN-T5**, combining the flexibility of from-scratch implementation with the power of modern pre-trained models.
The model performs three sophisticated tasks simultaneously: **text summarization**, **emotion classification**, and **topic clustering**.
This project is built with industry-standard MLOps practices, including configuration management with Hydra, experiment tracking with MLflow, and containerization with Docker, making it a reproducible and scalable solution.
## Core Features
* **Abstractive Summarization:** Generates concise, coherent summaries of long-form text using encoder-decoder attention. Trained on BookSum (literary) and arXiv (academic papers).
* **Emotion Classification:** Identifies 28 emotions from Google's GoEmotions dataset (admiration, amusement, anger, joy, love, etc.).
* **Topic Classification:** Classifies documents into 8 categories (Fiction, Science, Technology, Philosophy, History, Psychology, Business, Arts).
## Model Architecture
LexiMind implements a **from-scratch Transformer** with modern architectural choices:
### Custom Transformer Features
* **Pre-Layer Normalization (Pre-LN):** RMSNorm applied before each sublayer for stable training
* **FlashAttention:** Via PyTorch 2.0's `scaled_dot_product_attention` for efficient computation
* **Learned Positional Embeddings:** Trainable position representations
* **Multi-Head Attention:** 12 heads with 768-dimensional representations
* **RMSNorm:** Modern normalization without bias (more efficient than LayerNorm)
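The Pre-LN pattern above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the repository's actual `encoder.py`; the module names and the bias-free projections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: a learned scale, no bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class PreLNSelfAttention(nn.Module):
    """Pre-LN sublayer: normalize first, then attend, then add the residual."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.norm(x)  # Pre-LN: normalization before the sublayer, not after
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for SDPA
        q, k, v = (z.view(b, t, self.num_heads, -1).transpose(1, 2) for z in (q, k, v))
        # PyTorch 2.x dispatches this to FlashAttention kernels when available
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        return x + self.out(attn)  # residual connection
```

Because normalization happens inside the residual branch, gradients flow through the skip connection unimpeded, which is what makes Pre-LN training more stable than the original Post-LN layout.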
### Pre-trained Weight Initialization
The model loads weights from **Google's FLAN-T5-base**, which provides:
* Strong language understanding from instruction-tuning
* Excellent performance on summarization and classification tasks
* Encoder-decoder architecture matching our custom implementation
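The core of weight initialization is renaming FLAN-T5's parameter keys to this model's layout and copying the tensors across. The sketch below shows the idea only; the mapping entries are hypothetical examples, and the real table lives in `src/models/factory.py`.

```python
# Hypothetical mapping from FLAN-T5 parameter names to this project's names.
# These two entries are illustrative; the actual mapping is in factory.py.
T5_TO_LEXIMIND = {
    "shared.weight": "embedding.weight",
    "encoder.block.0.layer.0.SelfAttention.q.weight": "encoder.layers.0.attn.q.weight",
}

def remap_state_dict(t5_state: dict, mapping: dict) -> dict:
    """Copy pre-trained tensors under this model's parameter names,
    skipping anything the custom architecture does not have."""
    return {ours: t5_state[theirs] for theirs, ours in mapping.items() if theirs in t5_state}
```

With a real checkpoint, the remapped dict would then be applied with `model.load_state_dict(remapped, strict=False)`, leaving any custom parameters (e.g. task heads) at their fresh initialization.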
### Multi-Task Learning
A shared encoder-decoder backbone with task-specific heads:
* **Summarization Head:** Language modeling head with weight tying
* **Emotion Head:** Mean-pooled classification with dropout
* **Topic Head:** Mean-pooled classification with dropout
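A mean-pooled classification head of the kind described above can be sketched as follows. This matches the description of the emotion head (28 labels) and topic head; the exact layer sizes and dropout rate are assumptions, not the repository's code.

```python
import torch
import torch.nn as nn

class MeanPoolHead(nn.Module):
    """Mean-pool encoder states over non-padding tokens, then classify."""
    def __init__(self, dim: int = 768, num_labels: int = 28, p_drop: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.proj = nn.Linear(dim, num_labels)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); mask: (batch, seq), 1 = real token, 0 = padding
        mask = mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(self.dropout(pooled))
```

Masking before pooling matters: averaging over padding positions would dilute the representation for short inputs in a padded batch.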
## Technical Specifications
| Component | Specification |
| --------- | -------------- |
| Architecture | Encoder-Decoder Transformer |
| Pre-trained Base | google/flan-t5-base |
| Hidden Dimension | 768 |
| Encoder Layers | 12 |
| Decoder Layers | 12 |
| Attention Heads | 12 |
| FFN Dimension | 2048 |
| Normalization | RMSNorm (Pre-LN) |
| Position Encoding | Learned Embeddings |
| Max Sequence Length | 512 tokens |
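A back-of-the-envelope parameter count follows from the table. The sketch below assumes the T5 tokenizer's 32,128-token vocabulary, a plain two-matrix FFN, bias-free projections, and a tied LM head; all of these are assumptions, so treat the result as a rough order of magnitude.

```python
def estimate_params(d=768, ffn=2048, enc_layers=12, dec_layers=12,
                    vocab=32128, max_len=512):
    """Rough parameter estimate from the specification table above."""
    embeddings = vocab * d + max_len * d        # token + learned position embeddings
    attn = 4 * d * d                            # Q, K, V, and output projections
    ffn_block = 2 * d * ffn                     # up- and down-projection
    enc_layer = attn + ffn_block + 2 * d        # plus two RMSNorm scales
    dec_layer = 2 * attn + ffn_block + 3 * d    # self-attn + cross-attn + three norms
    return embeddings + enc_layers * enc_layer + dec_layers * dec_layer + d

print(f"{estimate_params() / 1e6:.0f}M parameters")  # roughly 186M
```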
## Getting Started
### Prerequisites
* Python 3.10+
* Poetry for dependency management
* Docker (for containerized deployment)
* An NVIDIA GPU with CUDA support (for training and accelerated inference)
### Installation
1. **Clone the repository:**
```bash
git clone https://github.com/OliverPerrin/LexiMind.git
cd LexiMind
```
2. **Install dependencies:**
```bash
poetry install
```
3. **Download datasets:**
```bash
poetry run python scripts/download_data.py
```
This downloads CNN/DailyMail, BookSum, GoEmotions, AG News, and Gutenberg books.
## Usage
### Configuration
All training and model parameters are managed via Hydra. Configurations are located in the `configs/` directory.
Available configurations:
* `model=base` - FLAN-T5-base (default, 12 layers)
* `model=small` - Smaller model for testing (no pretrained weights)
* `model=large` - FLAN-T5-large (24 layers, requires more VRAM)
* `training=dev` - Quick development run (~10-15 min)
* `training=medium` - Balanced training (~45-60 min on RTX 4070)
* `training=full` - Full training run (~3-4 hours, or ~24h for max data)
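For orientation, a config group entry is a small YAML file under the matching directory. The fields below are illustrative, not copied from the repository (only `optimizer.lr` is known from the override examples):

```yaml
# configs/training/dev.yaml (hypothetical contents)
epochs: 1
batch_size: 8
optimizer:
  lr: 3.0e-4
```

Selecting `training=dev` on the command line composes this file into the final config, and any field can still be overridden individually (e.g. `training.optimizer.lr=5e-5`).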
### Training
```bash
# Default training with FLAN-T5-base
poetry run python scripts/train.py
# Quick development run
poetry run python scripts/train.py training=dev
# Medium training run (recommended for RTX 4070)
poetry run python scripts/train.py training=medium
# Override parameters
poetry run python scripts/train.py training.optimizer.lr=5e-5
# Resume from a checkpoint
poetry run python scripts/train.py training=full resume_from=checkpoints/epoch_5.pt
```
Experiments are automatically tracked with MLflow. View results with `mlflow ui`.
### Inference & Demo
```bash
# Command-line inference
poetry run python scripts/inference.py "Your text to analyze"
# Gradio web demo
poetry run python scripts/demo_gradio.py
```
## Docker
```bash
# Build
docker build -t leximind .
# Run demo
docker run -p 7860:7860 leximind
```
## Project Structure
```text
├── configs/                  # Hydra configuration files
│   ├── model/                # Model architectures (base, small, large)
│   ├── training/             # Training configs (dev, medium, full)
│   └── data/                 # Dataset paths
├── data/
│   └── processed/            # Training data (downloaded via scripts/download_data.py)
│       ├── summarization/    # CNN/DailyMail + BookSum
│       ├── emotion/          # GoEmotions (28 labels)
│       ├── topic/            # AG News (4 categories)
│       └── books/            # Gutenberg prose chunks
├── src/
│   ├── models/               # Custom Transformer implementation
│   │   ├── encoder.py        # TransformerEncoder with Pre-LN RMSNorm
│   │   ├── decoder.py        # TransformerDecoder with KV-cache
│   │   ├── attention.py      # Multi-Head Attention with FlashAttention
│   │   └── factory.py        # Model building with FLAN-T5 weight loading
│   ├── data/                 # Dataset classes and dataloaders
│   ├── training/             # Trainer with AMP and gradient accumulation
│   └── inference/            # Inference pipeline
├── scripts/
│   ├── train.py              # Main training script
│   ├── download_data.py      # Dataset downloader
│   ├── inference.py          # CLI inference
│   └── demo_gradio.py        # Web demo
└── tests/                    # Unit tests
```
## Code Quality
* **Ruff:** Fast linting and formatting
* **MyPy:** Static type checking
* **Pytest:** Full test suite covering data, models, and training
* **Pre-commit hooks:** Automated quality checks
```bash
# Install hooks
poetry run pre-commit install
# Lint
poetry run ruff check .
# Type check
poetry run mypy .
# Tests
poetry run pytest
```
## Performance Optimizations
* **torch.compile:** JIT compilation with Inductor backend
* **Mixed Precision:** bfloat16 training on Ampere/Ada GPUs
* **TF32:** Enabled for RTX 30xx/40xx series
* **KV-Cache:** Efficient autoregressive decoding
* **FlashAttention:** Memory-efficient attention via SDPA
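The KV-cache idea behind efficient autoregressive decoding can be illustrated in a few lines. This is a generic sketch, not the repository's `decoder.py`: each step computes keys/values only for the new token and appends them, so earlier positions are never recomputed.

```python
import torch

class KVCache:
    """Append-only cache of past keys/values for autoregressive decoding."""
    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, new_tokens, head_dim)
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)  # grow along the sequence axis
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```

At each step the attention for the newest token runs against the full cached sequence, turning per-step cost from quadratic in the generated length to linear.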
## License
GNU License - see [LICENSE](LICENSE) for details.