---
license: mit
tags:
- tokenizer
- code
- python
---
# Python Code Tokenizer (CodeSearchNet)
A domain-specific tokenizer derived from the GPT-2 tokenizer and retrained on Python source code, so that it handles code syntax, identifiers, and structure better than a general-purpose English tokenizer.
## Details
- **Base tokenizer:** GPT-2 (`gpt2`)
- **Training corpus:** [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net), Python split
- **Vocabulary size:** 52,000
- **Training method:** `train_new_from_iterator` (Hugging Face `transformers` fast tokenizers, backed by the `tokenizers` library)
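As a rough illustration of the training method listed above, the following is a minimal sketch, not the exact script used for this tokenizer. The helper names `batch_iterator` and `train_code_tokenizer` and the batch size of 1000 are assumptions made for the example; `texts` stands for any iterable of Python source strings, and loading the CodeSearchNet split is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` strings from `texts` (hypothetical helper)."""
    it = iter(texts)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def train_code_tokenizer(texts: Iterable[str], vocab_size: int = 52000):
    """Retrain the GPT-2 tokenizer's vocabulary on `texts` (hypothetical helper)."""
    # Imported lazily so the batching helper can be used without transformers installed.
    from transformers import AutoTokenizer

    # Downloads the base GPT-2 tokenizer; its algorithm and settings are reused,
    # while the vocabulary and merges are learned again from the new corpus.
    base = AutoTokenizer.from_pretrained("gpt2")
    return base.train_new_from_iterator(batch_iterator(texts), vocab_size)
```

Feeding the corpus in batches lets a lazily loaded dataset be streamed rather than materialized as one large list. The returned tokenizer could then be stored with `save_pretrained`.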
## Motivation
General-purpose tokenizers such as GPT-2's are trained mostly on natural language and tend to over-fragment code: indentation, snake_case identifiers, and Python keywords are often split into many small subword tokens. Training on a code-specific corpus lets the tokenizer learn merges for these patterns, so typical Python source is represented with fewer tokens.
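The effect of the training corpus on merges can be seen in a toy setting. The sketch below is a simplified character-level BPE on whitespace-split words, not the byte-level BPE that GPT-2 actually uses; the two tiny corpora and the function names are made up for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    # Each whitespace-delimited word starts as a tuple of single characters.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[tuple(apply_merge(list(symbols), best))] += freq
        words = updated
    return merges


def segment(word, merges):
    """Tokenize `word` by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Made-up miniature corpora, for illustration only.
code_corpus = ["def load ( self ) :", "return self . data",
               "def save ( self ) :", "return self . path"]
prose_corpus = ["the cat sat on the mat", "the dog ate the other bone"]

code_merges = learn_merges(code_corpus, 3)
prose_merges = learn_merges(prose_corpus, 3)

print(segment("self", code_merges))   # ['self']
print(segment("self", prose_merges))  # ['s', 'e', 'l', 'f']
```

With only three merges, the code-trained vocabulary already covers `self` as a single token, while the prose-trained one spends its merges on `the` and `at` and leaves `self` split into characters. Real vocabularies are far larger, but the same mechanism plausibly drives the difference described above.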
## Usage
```python
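from transformers import AutoTokenizer

# Load the retrained tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer")
# Split a short Python snippet into subword tokens.
tokens = tokenizer.tokenize("def hello_world():\n    print('Hello!')")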
```
## Notes
This tokenizer was trained as part of working through the Hugging Face LLM course (Chapter 6), as a portfolio exercise in tokenizer training and domain adaptation.