# DistilBERT Base Cased - Text Processing Model
This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.
## Overview
DistilBERT is a smaller, faster, and lighter version of BERT that retains 97% of BERT's language understanding while being 60% faster and 40% smaller in size. This project demonstrates both the cased and uncased variants of DistilBERT.
## Features
- **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
- **Word Embeddings**: Generates contextual word embeddings for text processing
- **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
- **Easy Integration**: Simple examples using Hugging Face Transformers library
## Requirements
- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)
## Installation
Install the required dependencies:
```bash
pip install -U transformers
```
For GPU support, ensure you have PyTorch with CUDA installed:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
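To verify that PyTorch can actually see the GPU before running the notebook, a quick optional check is:

```python
import torch

# True if a CUDA-capable GPU is visible to this PyTorch build
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```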
## Usage
### Fill-Mask Task
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")
for candidate in result:
    print(candidate)
```
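If a CUDA GPU is available, the same pipeline can be placed on it with the `device` argument, and `top_k` limits how many candidates are returned. A minimal variation of the snippet above:

```python
from transformers import pipeline

# device=0 selects the first CUDA GPU; omit it (or pass device=-1) to stay on the CPU
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased", device=0)

# top_k controls the number of candidate tokens returned for the masked position
for candidate in pipe("Hello I'm a [MASK] model.", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 4))
```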
### Generating Word Embeddings
```python
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Token-level embeddings: shape (batch_size, sequence_length, 768)
embeddings = output.last_hidden_state
```
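`last_hidden_state` contains one vector per token. If a single fixed-size sentence embedding is needed instead, one common approach (a sketch, not taken from the notebook) is mean pooling weighted by the attention mask:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Average the token embeddings, using the attention mask to ignore padding positions
mask = encoded_input["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
sentence_embedding = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                                # torch.Size([1, 768])
```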
### Direct Model Loading
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
```
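With the tokenizer and model loaded directly, the fill-mask prediction from the pipeline example can be reproduced by hand. A short sketch of the idea:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")

inputs = tokenizer("Hello I'm a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the highest-scoring vocabulary entries
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_index].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```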
## Notebook Contents
The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:
1. **Installation**: Setting up the Transformers library
2. **Pipeline Usage**: High-level API for fill-mask tasks
3. **Direct Model Loading**: Lower-level API for custom implementations
4. **Embedding Generation**: Creating contextual word embeddings
5. **Token Visualization**: Inspecting tokenization results
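For the token-visualization step, a minimal sketch of the kind of inspection involved (the exact notebook cells may differ) is:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

text = "Hello I'm a [MASK] model."
tokens = tokenizer.tokenize(text)              # WordPiece tokens; subwords are prefixed with ##
ids = tokenizer.convert_tokens_to_ids(tokens)

for token, token_id in zip(tokens, ids):
    print(f"{token:>10} -> {token_id}")
```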
## Models Used
- **distilbert-base-cased**: DistilBERT model trained on cased English text
- **distilbert-base-uncased**: DistilBERT model trained on lowercased English text
Model pages:
- [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
- [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
## Example Output
When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:
1. fashion (15.75%)
2. professional (6.04%)
3. role (2.56%)
4. celebrity (1.94%)
5. model (1.73%)
## Use Cases
- **Text Classification**: Sentiment analysis, topic classification
- **Named Entity Recognition**: Identifying entities in text
- **Question Answering**: Building QA systems
- **Text Embeddings**: Feature extraction for downstream tasks
- **Language Understanding**: Transfer learning for NLP tasks
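As an illustration of the text-embeddings use case, pooled DistilBERT vectors can serve as features for a lightweight downstream classifier. The sketch below is not part of the notebook; it assumes scikit-learn is installed and uses a two-sentence toy dataset only to show the shapes involved:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

texts = ["I loved this movie.", "This was a waste of time."]   # toy examples
labels = [1, 0]                                                 # toy sentiment labels

# Encode both sentences as one padded batch and mean-pool the token embeddings
enc = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state
mask = enc["attention_mask"].unsqueeze(-1)
features = ((hidden * mask).sum(1) / mask.sum(1)).numpy()       # shape (2, 768)

# Any off-the-shelf classifier can be trained on the pooled features
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```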
## Performance
DistilBERT offers an excellent trade-off between performance and efficiency:
- **Speed**: 60% faster than BERT
- **Size**: 40% smaller than BERT
- **Performance**: Retains 97% of BERT's capabilities
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Issues
If the code snippets do not work as expected, please report the problem via:
- [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
- [Hugging Face.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **Hugging Face**: For the Transformers library and pre-trained models
- **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation
## References
- [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)
## Contact
For questions or feedback, please open an issue in this repository.