# DistilBERT Base Cased - Text Processing Model

This repository contains a Jupyter notebook demonstrating the use of DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers), for masked language modeling and text embedding generation.

## Overview

DistilBERT is a smaller, faster, and lighter version of BERT. According to the DistilBERT paper, it retains 97% of BERT's language understanding capabilities while being 60% faster and 40% smaller. This project demonstrates both the cased and uncased variants of DistilBERT.

## Features

- **Fill-Mask Pipeline**: Uses DistilBERT to predict masked tokens in sentences
- **Word Embeddings**: Generates contextual word embeddings for text processing
- **GPU Support**: Configured to run on CUDA-enabled GPUs for faster inference
- **Easy Integration**: Simple examples using the Hugging Face Transformers library

## Requirements

- Python 3.7+
- PyTorch
- Transformers library
- CUDA-compatible GPU (optional, but recommended)

## Installation

Install the required dependencies:

```bash
pip install -U transformers
```

For GPU support, ensure you have PyTorch with CUDA installed:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

## Usage

### Fill-Mask Task

```python
from transformers import pipeline

# Build a fill-mask pipeline backed by the cased DistilBERT checkpoint
pipe = pipeline("fill-mask", model="distilbert/distilbert-base-cased")

# Each candidate is a dict holding a predicted token and its score
result = pipe("Hello I'm a [MASK] model.")
for candidate in result:
    print(candidate)
```

### Generating Word Embeddings

```python
from transformers import DistilBertTokenizer, DistilBertModel

# Load the uncased tokenizer and encoder
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

# Access the embeddings: one contextual vector per input token,
# shaped (batch_size, sequence_length, hidden_size)
embeddings = output.last_hidden_state
```

### Direct Model Loading

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-language-modeling head without a pipeline
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-cased")
```

## Notebook Contents

The [Distilbert-base-cased.ipynb](Distilbert-base-cased.ipynb) notebook includes:

1. **Installation**: Setting up the Transformers library
2. **Pipeline Usage**: High-level API for fill-mask tasks
3. **Direct Model Loading**: Lower-level API for custom implementations
4. **Embedding Generation**: Creating contextual word embeddings
5. **Token Visualization**: Inspecting tokenization results

## Models Used

- **distilbert-base-cased**: DistilBERT model trained on cased English text
- **distilbert-base-uncased**: DistilBERT model trained on lowercased English text

Model pages:

- [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased)
- [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

## Example Output

When running the fill-mask task with "Hello I'm a [MASK] model.", the model's top five predictions and their scores are:

1. fashion (15.75%)
2. professional (6.04%)
3. role (2.56%)
4. celebrity (1.94%)
5. model (1.73%)

## Use Cases

- **Text Classification**: Sentiment analysis, topic classification
- **Named Entity Recognition**: Identifying entities in text
- **Question Answering**: Building QA systems
- **Text Embeddings**: Feature extraction for downstream tasks
- **Language Understanding**: Transfer learning for NLP tasks

## Performance

DistilBERT trades a small loss in accuracy for a large gain in efficiency. Relative to BERT, the DistilBERT paper reports:

- **Speed**: 60% faster
- **Size**: 40% smaller
- **Performance**: Retains 97% of BERT's language understanding capabilities

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Issues

If the code snippets do not work, please open an issue on:

- [Model Repository](https://huggingface.co/distilbert/distilbert-base-cased)
- [Hugging Face.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **Hugging Face**: For the Transformers library and pre-trained models
- **DistilBERT Authors**: Sanh et al. for the DistilBERT research and implementation

## References

- [DistilBERT Paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert/distilbert-base-cased)

## Contact

For questions or feedback, please open an issue in this repository.
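
The ranked list shown under Example Output can be produced from the raw candidates that the fill-mask pipeline returns. The following is a minimal sketch, assuming each candidate is a dict with `token_str` and `score` keys, which is the shape the pipeline yields. The helper name `format_predictions` is hypothetical, and the sample scores are simply the percentages quoted in Example Output written as fractions, so the snippet runs without downloading a model.

```python
def format_predictions(candidates, top_k=5):
    """Render fill-mask candidates as a ranked list with percentage scores."""
    # Sort defensively in case the input is not already ordered by score.
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]
    return [
        f"{rank}. {c['token_str']} ({c['score']:.2%})"
        for rank, c in enumerate(ranked, start=1)
    ]


# Illustrative stand-in for the output of pipe("Hello I'm a [MASK] model.");
# the scores are the Example Output percentages expressed as fractions.
sample = [
    {"token_str": "fashion", "score": 0.1575},
    {"token_str": "professional", "score": 0.0604},
    {"token_str": "role", "score": 0.0256},
    {"token_str": "celebrity", "score": 0.0194},
    {"token_str": "model", "score": 0.0173},
]

for line in format_predictions(sample):
    print(line)
```

To use it with live predictions, pass the list returned by the pipeline in the Fill-Mask Task section in place of `sample`.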