# DistilBERT Base Cased - Text Processing Model

This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.

## Overview

DistilBERT is a smaller, faster, and lighter version of BERT that retains 97% of BERT's language understanding while being 60% faster and 40% smaller in size. This project demonstrates both the cased and uncased variants of DistilBERT.

## Features

- **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
- **Word Embeddings**: Generates contextual word embeddings for text processing
- **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
- **Easy Integration**: Simple examples using Hugging Face Transformers library

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)

## Installation

Install the required dependencies:

```bash
pip install -U transformers
```

For GPU support, ensure you have PyTorch with CUDA installed:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
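
To confirm the GPU is actually picked up, a quick check like the one below can help; this is a minimal sketch rather than part of the notebook, and the `device=0` argument assumes a single CUDA device is available:

```python
import torch
from transformers import pipeline

# Confirm that PyTorch can see a CUDA device
print(torch.cuda.is_available())

# device=0 runs the pipeline on the first GPU; omit it (or pass device=-1) to stay on CPU
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased", device=0)
print(pipe("Hello I'm a [MASK] model.")[0])
```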

## Usage

### Fill-Mask Task

```python
from transformers import pipeline

# Load the fill-mask pipeline with the cased DistilBERT checkpoint
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")

# Each candidate is a dict with keys such as 'score', 'token_str', and 'sequence'
for candidate in result:
    print(candidate)
```

### Generating Word Embeddings

```python
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Access the token-level embeddings: shape (batch_size, sequence_length, hidden_size=768)
embeddings = output.last_hidden_state
```
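
If you need one vector per sentence rather than per token, a common approach (sketched here as an illustration; it is not part of the notebook) is to mean-pool the token embeddings using the attention mask:

```python
import torch

# Zero out padding positions, then average the remaining token embeddings
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```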

### Direct Model Loading

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
```
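
With the tokenizer and model loaded this way, predicting the masked token by hand looks roughly like the sketch below; the decoding logic illustrates the usual approach and is not taken from the notebook:

```python
import torch

inputs = tokenizer("Hello I'm a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```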

## Notebook Contents

The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:

1. **Installation**: Setting up the Transformers library
2. **Pipeline Usage**: High-level API for fill-mask tasks
3. **Direct Model Loading**: Lower-level API for custom implementations
4. **Embedding Generation**: Creating contextual word embeddings
5. **Token Visualization**: Inspecting tokenization results (see the short sketch below)
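
For the token visualization step, the inspection usually amounts to a couple of tokenizer calls; the snippet below is an illustrative sketch using the cased tokenizer rather than the notebook's exact code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

text = "Hello I'm a [MASK] model."
# Show each subword token next to its vocabulary id
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
for token, token_id in zip(tokens, ids):
    print(f"{token!r:>12} -> {token_id}")
```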

## Models Used

- **distilbert-base-cased**: DistilBERT model trained on cased English text
- **distilbert-base-uncased**: DistilBERT model trained on lowercased English text

Model pages:
- [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
- [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

## Example Output

When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:

1. fashion (15.75%)
2. professional (6.04%)
3. role (2.56%)
4. celebrity (1.94%)
5. model (1.73%)

## Use Cases

- **Text Classification**: Sentiment analysis, topic classification
- **Named Entity Recognition**: Identifying entities in text
- **Question Answering**: Building QA systems
- **Text Embeddings**: Feature extraction for downstream tasks
- **Language Understanding**: Transfer learning for NLP tasks

## Performance

DistilBERT offers an excellent trade-off between performance and efficiency:

- **Speed**: 60% faster than BERT
- **Size**: 40% smaller than BERT
- **Performance**: Retains 97% of BERT's capabilities

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Issues

If the code snippets do not work, please report the problem at one of the following:
- [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
- [huggingface.js snippets](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **Hugging Face**: For the Transformers library and pre-trained models
- **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation

## References

- [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)

## Contact

For questions or feedback, please open an issue in this repository.