ahczhg committed on
Commit 4fd2bc4 · verified · 1 Parent(s): ac77ab6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +148 -0
README.md ADDED
@@ -0,0 +1,148 @@
+ # DistilBERT Base Cased - Text Processing Model
+
+ This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.
+
+ ## Overview
+
+ DistilBERT is a smaller, faster version of BERT. According to its authors, it retains 97% of BERT's language understanding capabilities while being 60% faster and 40% smaller. This project demonstrates both the cased and uncased variants of DistilBERT.
+
+ ## Features
+
+ - **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
+ - **Word Embeddings**: Generates contextual word embeddings for text processing
+ - **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
+ - **Easy Integration**: Simple examples using the Hugging Face Transformers library
+
+ ## Requirements
+
+ - Python 3.7+
+ - PyTorch
+ - Transformers library
+ - CUDA-compatible GPU (optional, but recommended)
+
+ ## Installation
+
+ Install the required dependencies:
+
+ ```bash
+ pip install -U transformers
+ ```
+
+ For GPU support, ensure you have PyTorch with CUDA installed:
+
+ ```bash
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ ```
+
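Whether the CUDA build is actually usable can be checked at runtime before building a pipeline. The sketch below is one way to do that and is not taken from the notebook: `pick_device` and `build_fill_mask` are hypothetical helper names, and the code relies on the `device` argument of `pipeline`, where `-1` selects the CPU and `0` the first GPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper still works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_fill_mask(model_name: str = "distilbert/distilbert-base-cased"):
    """Create a fill-mask pipeline on the selected device (downloads the model on first use)."""
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name, device=pick_device())


print("pipeline device index:", pick_device())
```

Calling `build_fill_mask()` then gives a pipeline that runs on the GPU when one is available and falls back to the CPU otherwise.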
+ ## Usage
+
+ ### Fill-Mask Task
+
+ ```python
+ from transformers import pipeline
+
+ # Build a fill-mask pipeline backed by the cased DistilBERT checkpoint
+ pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")
+ result = pipe("Hello I'm a [MASK] model.")
+
+ # Each candidate is a dict describing one predicted token and its score
+ for candidate in result:
+     print(candidate)
+ ```
+
+ ### Generating Word Embeddings
+
+ ```python
+ from transformers import DistilBertTokenizer, DistilBertModel
+
+ # The embedding example uses the uncased checkpoint
+ tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
+ model = DistilBertModel.from_pretrained("distilbert-base-uncased")
+
+ text = "Replace me by any text you'd like."
+ encoded_input = tokenizer(text, return_tensors="pt")
+ output = model(**encoded_input)
+
+ # Contextual token embeddings, shape (batch_size, sequence_length, hidden_size)
+ embeddings = output.last_hidden_state
+ ```
+
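`last_hidden_state` holds one vector per token, so a downstream task that needs a single vector per sentence has to pool them. Mean pooling over the non-padding tokens is one common choice; it is an assumption here, not something the notebook prescribes. The sketch uses plain Python lists and made-up numbers so that it runs without downloading a model, and `mean_pool` is a hypothetical helper.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (padding is skipped)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy input: 3 tokens with hidden size 2; the last token is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_vectors, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real tensors the same computation can be written as `(output.last_hidden_state * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)`, where `mask` is `encoded_input["attention_mask"]`.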
+ ### Direct Model Loading
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
+ model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
+ ```
+
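A model loaded this way returns raw logits over the vocabulary for every position, whereas the fill-mask pipeline reports probabilities. The conversion is a softmax over the logits at the `[MASK]` position followed by a top-k selection. The sketch below shows that step on a tiny made-up vocabulary; the logit values and the `top_k_predictions` helper are illustrative assumptions, not model output.

```python
import math
from typing import Dict, List, Tuple


def top_k_predictions(logits: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Softmax the logits, then return the k most probable tokens with their probabilities."""
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exp_values = {tok: math.exp(val - max_logit) for tok, val in logits.items()}
    total = sum(exp_values.values())
    probs = {tok: val / total for tok, val in exp_values.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical logits at the masked position for a four-token vocabulary.
toy_logits = {"fashion": 2.0, "professional": 1.0, "role": 0.5, "the": -1.0}
predictions = top_k_predictions(toy_logits, k=3)
for token, prob in predictions:
    print(f"{token}: {prob:.3f}")
```

With the real model, the same two steps apply to the logits row at the index where the input ID equals `tokenizer.mask_token_id`.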
+ ## Notebook Contents
+
+ The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:
+
+ 1. **Installation**: Setting up the Transformers library
+ 2. **Pipeline Usage**: High-level API for fill-mask tasks
+ 3. **Direct Model Loading**: Lower-level API for custom implementations
+ 4. **Embedding Generation**: Creating contextual word embeddings
+ 5. **Token Visualization**: Inspecting tokenization results
+
+ ## Models Used
+
+ - **distilbert-base-cased**: DistilBERT model trained on cased English text
+ - **distilbert-base-uncased**: DistilBERT model trained on lowercased English text
+
+ Model pages:
+ - [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
+ - [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
+
+ ## Example Output
+
+ When running the fill-mask task with "Hello I'm a [MASK] model.", the model predicts:
+
+ 1. fashion (15.75%)
+ 2. professional (6.04%)
+ 3. role (2.56%)
+ 4. celebrity (1.94%)
+ 5. model (1.73%)
+
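The percentages above correspond to the `score` field of each candidate returned by the fill-mask pipeline, which yields a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. A minimal sketch of turning such a result into a ranked list follows; `format_candidates` is a hypothetical helper, and the sample data reuses the first two scores listed above rather than a fresh model run.

```python
from typing import Dict, List


def format_candidates(candidates: List[Dict]) -> List[str]:
    """Render fill-mask candidates as 'rank. token (percent%)', highest score first."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [
        f"{rank}. {c['token_str']} ({c['score'] * 100:.2f}%)"
        for rank, c in enumerate(ranked, start=1)
    ]


# Sample shaped like pipeline output; only the keys used here are included.
sample = [
    {"token_str": "professional", "score": 0.0604},
    {"token_str": "fashion", "score": 0.1575},
]
lines = format_candidates(sample)
print("\n".join(lines))
```

Passing the `result` list from the fill-mask example to `format_candidates` should produce the same layout for real predictions.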
+ ## Use Cases
+
+ - **Text Classification**: Sentiment analysis, topic classification
+ - **Named Entity Recognition**: Identifying entities in text
+ - **Question Answering**: Building QA systems
+ - **Text Embeddings**: Feature extraction for downstream tasks
+ - **Language Understanding**: Transfer learning for NLP tasks
+
+ ## Performance
+
+ DistilBERT trades a small loss in accuracy for a large gain in efficiency. The DistilBERT paper reports the following figures relative to BERT:
+
+ - **Speed**: 60% faster than BERT at inference
+ - **Size**: 40% smaller than BERT
+ - **Performance**: Retains 97% of BERT's language understanding capabilities, as measured on the GLUE benchmark
+
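The size figure can be reproduced with simple arithmetic. The sketch below assumes the approximate parameter counts given in the DistilBERT paper (about 110 million for BERT base and 66 million for DistilBERT); `relative_reduction` is a hypothetical helper written for this illustration.

```python
def relative_reduction(baseline: float, reduced: float) -> float:
    """Fraction by which `reduced` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - reduced / baseline


# Approximate parameter counts (in millions) as reported in the DistilBERT paper.
BERT_BASE_PARAMS_M = 110
DISTILBERT_PARAMS_M = 66

size_reduction = relative_reduction(BERT_BASE_PARAMS_M, DISTILBERT_PARAMS_M)
print(f"size reduction: {size_reduction:.0%}")  # size reduction: 40%
```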
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## Issues
+
+ If the code snippets do not work, please open an issue on:
+ - [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
+ - [huggingface.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgments
+
+ - **Hugging Face**: For the Transformers library and pre-trained models
+ - **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation
+
+ ## References
+
+ - [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
+ - [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
+ - [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)
+
+ ## Contact
+
+ For questions or feedback, please open an issue in this repository.