# DistilBERT Base Cased - Text Processing Model
This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.
## Overview
DistilBERT is a smaller, faster, and lighter version of BERT that retains 97% of BERT's language understanding while being 60% faster and 40% smaller in size. This project demonstrates both the cased and uncased variants of DistilBERT.
## Features
- **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
- **Word Embeddings**: Generates contextual word embeddings for text processing
- **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
- **Easy Integration**: Simple examples using Hugging Face Transformers library
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)
## Installation
Install the required dependencies:
```bash
pip install -U transformers
```
For GPU support, ensure you have PyTorch with CUDA installed:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
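You can confirm whether PyTorch can actually see a CUDA device before running inference. A minimal check (the `pipeline` function accepts a `device` argument: `0` selects the first GPU, `-1` the CPU):

```python
import torch

# The pipelines below run on GPU when one is visible to PyTorch;
# otherwise they fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda:0" if device == 0 else "cpu")
```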
## Usage
### Fill-Mask Task
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")
for candidate in result:
    print(candidate)
```
### Generating Word Embeddings
```python
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Access the embeddings
embeddings = output.last_hidden_state
```
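The `last_hidden_state` tensor has shape `(batch, sequence_length, hidden_size)`, i.e. one 768-dimensional vector per token for DistilBERT. A common way to reduce this to a single fixed-size sentence vector is masked mean pooling. The sketch below uses a dummy tensor in place of real model output, so it runs without downloading the model; the shapes match DistilBERT's:

```python
import torch

# Dummy stand-in for output.last_hidden_state: batch of 1, 6 tokens, 768 dims.
last_hidden_state = torch.randn(1, 6, 768)
# Attention mask: 1 for real tokens, 0 for padding (last two positions padded).
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

mask = attention_mask.unsqueeze(-1).float()       # (1, 6, 1)
summed = (last_hidden_state * mask).sum(dim=1)    # (1, 768): sum of real tokens
counts = mask.sum(dim=1).clamp(min=1e-9)          # (1, 1): number of real tokens
sentence_embedding = summed / counts              # (1, 768): masked mean
print(sentence_embedding.shape)                   # torch.Size([1, 768])
```

Using the attention mask ensures padding tokens do not dilute the average.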
### Direct Model Loading
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
```
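With the lower-level API you work with raw logits rather than the pipeline's ready-made candidate list. The masked-LM head returns logits of shape `(batch, sequence_length, vocab_size)`; the fill-mask step is a softmax over the vocabulary at the `[MASK]` position followed by a top-k lookup. The sketch below illustrates just that decoding step with dummy logits over a toy five-word vocabulary, so it runs without the real model:

```python
import torch

# Toy vocabulary and dummy logits standing in for real model output.
vocab = ["fashion", "professional", "role", "celebrity", "model"]
mask_index = 3                            # position of [MASK] in the sequence
logits = torch.randn(1, 8, len(vocab))    # (batch, seq_len, vocab_size)

probs = logits[0, mask_index].softmax(dim=-1)  # distribution over the vocab
top = probs.topk(k=3)                          # three most likely fillers
for p, idx in zip(top.values, top.indices):
    print(f"{vocab[idx]}: {p.item():.2%}")
```

With the real model, `mask_index` would come from locating `tokenizer.mask_token_id` in the encoded input, and the indices would be decoded with `tokenizer.decode`.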
## Notebook Contents
The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:
1. **Installation**: Setting up the Transformers library
2. **Pipeline Usage**: High-level API for fill-mask tasks
3. **Direct Model Loading**: Lower-level API for custom implementations
4. **Embedding Generation**: Creating contextual word embeddings
5. **Token Visualization**: Inspecting tokenization results
## Models Used
- **distilbert-base-cased**: DistilBERT model trained on cased English text
- **distilbert-base-uncased**: DistilBERT model trained on lowercased English text
Model pages:
- [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
- [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
## Example Output
When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:
1. fashion (15.75%)
2. professional (6.04%)
3. role (2.56%)
4. celebrity (1.94%)
5. model (1.73%)
## Use Cases
- **Text Classification**: Sentiment analysis, topic classification
- **Named Entity Recognition**: Identifying entities in text
- **Question Answering**: Building QA systems
- **Text Embeddings**: Feature extraction for downstream tasks
- **Language Understanding**: Transfer learning for NLP tasks
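For the text-embeddings use case, a typical downstream step is comparing two sentence vectors with cosine similarity (e.g. for semantic search or clustering). A minimal sketch, with dummy 768-dimensional vectors standing in for pooled DistilBERT outputs:

```python
import torch
import torch.nn.functional as F

# Dummy sentence embeddings; in practice these would be mean-pooled
# DistilBERT hidden states for two input texts.
a = torch.randn(768)
b = torch.randn(768)

similarity = F.cosine_similarity(a, b, dim=0)  # scalar in [-1, 1]
print(float(similarity))
```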
## Performance
DistilBERT offers an excellent trade-off between performance and efficiency:
- **Speed**: 60% faster than BERT
- **Size**: 40% smaller than BERT
- **Performance**: Retains 97% of BERT's language-understanding performance (as measured on the GLUE benchmark)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Issues
If the code snippets do not work, please open an issue at:
- [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
- [Hugging Face.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **Hugging Face**: For the Transformers library and pre-trained models
- **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation
## References
- [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)
## Contact
For questions or feedback, please open an issue in this repository.