# DistilBERT Base Cased - Text Processing Model

This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.

## Overview

DistilBERT is a smaller, faster, and lighter version of BERT that retains 97% of BERT's language understanding while being 60% faster and 40% smaller in size. This project demonstrates both the cased and uncased variants of DistilBERT.

## Features

- **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
- **Word Embeddings**: Generates contextual word embeddings for text processing
- **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
- **Easy Integration**: Simple examples using Hugging Face Transformers library

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)

## Installation

Install the required dependencies:

```bash
pip install -U transformers
```

For GPU support, ensure you have PyTorch with CUDA installed:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
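
To confirm the GPU is actually picked up, a quick check like the one below can help; this is a minimal sketch rather than part of the notebook, and the `device=0` argument assumes a single CUDA device is available:

```python
import torch
from transformers import pipeline

# Confirm that PyTorch can see a CUDA device
print(torch.cuda.is_available())

# device=0 runs the pipeline on the first GPU; omit it (or pass device=-1) to stay on CPU
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased", device=0)
print(pipe("Hello I'm a [MASK] model.")[0])
```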

## Usage

### Fill-Mask Task

```python
from transformers import pipeline

# Load the fill-mask pipeline with the cased DistilBERT checkpoint
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
result = pipe("Hello I'm a [MASK] model.")

# Each candidate is a dict with keys such as 'score', 'token_str', and 'sequence'
for candidate in result:
    print(candidate)
```

### Generating Word Embeddings

```python
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Access the token-level embeddings: shape (batch_size, sequence_length, hidden_size=768)
embeddings = output.last_hidden_state
```
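
If you need one vector per sentence rather than per token, a common approach (sketched here as an illustration; it is not part of the notebook) is to mean-pool the token embeddings using the attention mask:

```python
import torch

# Zero out padding positions, then average the remaining token embeddings
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```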

### Direct Model Loading

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
```
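
With the tokenizer and model loaded this way, predicting the masked token by hand looks roughly like the sketch below; the decoding logic illustrates the usual approach and is not taken from the notebook:

```python
import torch

inputs = tokenizer("Hello I'm a [MASK] model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```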

## Notebook Contents

The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:

1. **Installation**: Setting up the Transformers library
2. **Pipeline Usage**: High-level API for fill-mask tasks
3. **Direct Model Loading**: Lower-level API for custom implementations
4. **Embedding Generation**: Creating contextual word embeddings
5. **Token Visualization**: Inspecting tokenization results (see the short sketch below)
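
For the token visualization step, the inspection usually amounts to a couple of tokenizer calls; the snippet below is an illustrative sketch using the cased tokenizer rather than the notebook's exact code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")

text = "Hello I'm a [MASK] model."
# Show each subword token next to its vocabulary id
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
for token, token_id in zip(tokens, ids):
    print(f"{token!r:>12} -> {token_id}")
```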

## Models Used

- **distilbert-base-cased**: DistilBERT model trained on cased English text
- **distilbert-base-uncased**: DistilBERT model trained on lowercased English text

Model pages:
- [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
- [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

## Example Output

When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:

1. fashion (15.75%)
2. professional (6.04%)
3. role (2.56%)
4. celebrity (1.94%)
5. model (1.73%)

## Use Cases

- **Text Classification**: Sentiment analysis, topic classification
- **Named Entity Recognition**: Identifying entities in text
- **Question Answering**: Building QA systems
- **Text Embeddings**: Feature extraction for downstream tasks
- **Language Understanding**: Transfer learning for NLP tasks

## Performance

DistilBERT offers an excellent trade-off between performance and efficiency:

- **Speed**: 60% faster than BERT
- **Size**: 40% smaller than BERT
- **Performance**: Retains 97% of BERT's capabilities

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Issues

If the code snippets do not work, please report the problem at one of the following:
- [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
- [huggingface.js snippets](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **Hugging Face**: For the Transformers library and pre-trained models
- **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation

## References

- [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)

## Contact

For questions or feedback, please open an issue in this repository.