# Models Directory
This directory contains pre-trained and fine-tuned models for the Intent-Aware Context-Preserving Summarization project.
## Directory Structure
```
models/
├── README.md              # This file
├── download_models.py     # Script to download pre-trained models
├── model_configs.json     # Model configurations and metadata
├── checkpoints/           # Fine-tuned model checkpoints
└── tokenizers/            # Tokenizer files
```
## Available Models
### Pre-trained Models from Hugging Face
| Model Name | Model ID | Size | Best For |
|-----------|----------|------|----------|
| T5-Small | google-t5/t5-small | ~77MB | Quick testing, prototyping |
| T5-Base | google-t5/t5-base | ~220MB | General use, production |
| T5-Large | google-t5/t5-large | ~738MB | High-quality summaries |
| BART-Base | facebook/bart-base | ~558MB | General summarization |
| BART-Large-CNN | facebook/bart-large-cnn | ~1.6GB | News/article summarization |
| PEGASUS-arXiv | google/pegasus-arxiv | ~568MB | Scientific papers |
| PEGASUS-PubMed | google/pegasus-pubmed | ~562MB | Medical/biomedical documents |
| LED-Base | allenai/led-base-16384 | ~660MB | Long documents (up to 16,384 tokens) |
## Downloading Models
### Automatic Download
```bash
python models/download_models.py
```
### Manual Download
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Download T5-base
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")
# Save locally
tokenizer.save_pretrained("models/t5-base-tokenizer")
model.save_pretrained("models/t5-base-model")
```
## Fine-tuned Models
Fine-tuned models will be stored in `models/checkpoints/` after training:
- `intent-classifier-v1/` - Intent detection model
- `summarizer-technical-v1/` - Fine-tuned summarization model
- `summarizer-intent-aware-v1/` - Intent-aware summarization model
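
The `-v1` suffix suggests a simple versioning scheme. As a minimal sketch, assuming new checkpoints follow the same `<name>-vN` pattern, a helper like the hypothetical `next_version_name` below could pick the next free version; it is not part of the project code.

```python
import re
from pathlib import Path


def next_version_name(checkpoint_dir, base_name):
    """Return '<base_name>-vN', where N is one above the highest existing version."""
    pattern = re.compile(rf"^{re.escape(base_name)}-v(\d+)$")
    root = Path(checkpoint_dir)
    versions = []
    if root.is_dir():
        for entry in root.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                versions.append(int(match.group(1)))
    # With no existing versions, numbering starts at v1.
    return f"{base_name}-v{max(versions, default=0) + 1}"


print(next_version_name("models/checkpoints", "summarizer-technical"))
```

If the directory does not exist yet, the helper simply returns the `-v1` name.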
## Model Configuration
Edit `model_configs.json` to configure:
- Model selection
- Tokenization parameters
- Generation settings
- Evaluation metrics
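
The schema of `model_configs.json` is not shown in this README, so the following is only a sketch of how such a file might be read and checked. The section names (`model`, `tokenization`, `generation`, `evaluation`) and all values are hypothetical, chosen to mirror the four areas listed above.

```python
import json

# Hypothetical example content; the real model_configs.json may use different keys.
EXAMPLE_CONFIG = """
{
  "model": {"name": "google-t5/t5-base"},
  "tokenization": {"max_input_length": 512, "truncation": true},
  "generation": {"max_new_tokens": 128, "num_beams": 4},
  "evaluation": {"metrics": ["rouge1", "rouge2", "rougeL"]}
}
"""

REQUIRED_SECTIONS = ("model", "tokenization", "generation", "evaluation")


def parse_config(text):
    """Parse a JSON config string and check that every expected section is present."""
    config = json.loads(text)
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"missing config sections: {', '.join(missing)}")
    return config


config = parse_config(EXAMPLE_CONFIG)
print(config["model"]["name"], config["generation"]["num_beams"])
```

Failing early on a missing section tends to be easier to debug than a `KeyError` deep inside training or generation code.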
## Using Models
### Load Pre-trained Model
```python
from src.models import SummarizationModelLoader
loader = SummarizationModelLoader(model_name='t5-base')
model, tokenizer = loader.load_model()
```
### Load Fine-tuned Checkpoint
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_path = "models/checkpoints/summarizer-technical-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
```
## Storage Requirements
- **Small Models**: ~500MB total
- **Large Models**: ~3GB total
- **With Fine-tuning Data**: ~5-10GB
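
Before downloading, it can help to confirm that the target disk has room. The sketch below uses only the standard library; the profile names are made up for illustration, and the thresholds are taken from the approximate totals above (using the upper bound of the 5-10GB range).

```python
import shutil

# Approximate totals from the list above, in gigabytes (profile names are illustrative).
REQUIRED_GB = {"small": 0.5, "large": 3.0, "with_finetuning_data": 10.0}


def has_enough_space(path, profile):
    """Return True if the filesystem holding `path` has room for the given profile."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= REQUIRED_GB[profile]


for profile in REQUIRED_GB:
    print(profile, has_enough_space(".", profile))
```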
## GPU Memory Requirements
| Model | GPU Memory |
|-------|-----------|
| T5-Small | 2GB |
| T5-Base | 6GB |
| T5-Large | 12GB+ |
| BART-Base | 6GB |
| BART-Large-CNN | 12GB+ |
| LED-Base | 8GB |
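
The table can also be used programmatically to shortlist models for a given GPU. This is a small sketch, not project code: the figures are copied from the table, the "12GB+" entries are treated as a 12 GB minimum, and `models_that_fit` is a hypothetical helper name.

```python
# GPU memory figures copied from the table above, in gigabytes.
# "12GB+" entries are treated as a 12 GB minimum.
GPU_MEMORY_GB = {
    "T5-Small": 2,
    "T5-Base": 6,
    "T5-Large": 12,
    "BART-Base": 6,
    "BART-Large-CNN": 12,
    "LED-Base": 8,
}


def models_that_fit(available_gb):
    """Return the models whose listed requirement is within the available memory."""
    return sorted(name for name, need in GPU_MEMORY_GB.items() if need <= available_gb)


print(models_that_fit(8))
```

On a machine with PyTorch and a CUDA device, the available figure could be read from `torch.cuda.get_device_properties(0).total_memory` (in bytes) instead of being passed by hand.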
## Best Practices
1. **Start with small models** for testing and development
2. **Use cached models** to avoid repeated downloads
3. **Monitor GPU memory** when loading large models
4. **Save fine-tuned models** with meaningful version numbers
5. **Document model changes** in model metadata
## Troubleshooting
### Out of Memory Error
```python
import torch
torch.cuda.empty_cache()  # Release cached GPU memory that is no longer in use
# If memory is still insufficient, run on the CPU instead, e.g. model.to('cpu')
```
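
Another common mitigation is to retry with a smaller batch. The sketch below shows the idea in plain Python with a simulated failure; `run_with_backoff` and `fake_summarize` are hypothetical names. With PyTorch, one would presumably pass the out-of-memory exception class (`torch.cuda.OutOfMemoryError` in recent releases) as `oom_error` and clear the cache before retrying.

```python
def run_with_backoff(run_batch, items, batch_size, oom_error=MemoryError):
    """Process items in batches, halving the batch size whenever oom_error is raised."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # a single item does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with the smaller batch
        start += len(batch)
    return results


# Stand-in for a real summarization call: fails on batches larger than 2.
def fake_summarize(batch):
    if len(batch) > 2:
        raise MemoryError("simulated out-of-memory")
    return [text.upper() for text in batch]


print(run_with_backoff(fake_summarize, ["a", "b", "c", "d", "e"], batch_size=8))
```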
### Model Download Issues
- Check internet connection
- Verify that https://huggingface.co is reachable from your network
- Pin a specific model version with the `revision` parameter of `from_pretrained`
- Use `cache_dir` parameter to specify custom location
### Tokenizer Mismatch
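Load the tokenizer from the same path or model ID as the model so that the two stay in sync:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)  # model_path: same path or ID used to load the model
```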
## References
- Hugging Face Models: https://huggingface.co/models
- Transformers Documentation: https://huggingface.co/docs/transformers
- Model Cards: https://huggingface.co/docs/hub/models-cards
## Contributing
If you fine-tune new models:
1. Save with meaningful names and versions
2. Document performance metrics
3. Include training parameters
4. Add model cards with descriptions
5. Update this README
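
One lightweight way to cover steps 2 and 3 is to store a small metadata file next to each checkpoint. This is a sketch under assumptions: the `metadata.json` filename, the field names, and the `write_model_metadata` helper are all hypothetical, and the values in the demonstration are placeholders, not measured results.

```python
import json
import tempfile
from pathlib import Path


def write_model_metadata(checkpoint_dir, description, metrics, training_params):
    """Write a metadata.json file into a checkpoint directory and return its path."""
    path = Path(checkpoint_dir)
    path.mkdir(parents=True, exist_ok=True)
    metadata = {
        "name": path.name,
        "description": description,
        "metrics": metrics,
        "training_params": training_params,
    }
    out_file = path / "metadata.json"
    out_file.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return out_file


# Demonstration in a temporary directory; all values below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    out = write_model_metadata(
        Path(tmp) / "summarizer-technical-v2",
        description="Placeholder description",
        metrics={"rougeL": None},  # fill in measured scores
        training_params={"base_model": "google-t5/t5-base", "epochs": None},
    )
    print(out.read_text(encoding="utf-8"))
```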
---
**Last Updated**: January 15, 2024