---
license: mit
tags:
- tokenizer
- code
- python
---

# Python Code Tokenizer (CodeSearchNet)

A domain-specific tokenizer derived from the GPT-2 tokenizer and retrained on Python source code, so that it handles code syntax, identifiers, and structure better than a general-purpose English tokenizer.

## Details

- **Base tokenizer:** GPT-2 (`gpt2`)
- **Training corpus:** [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net), Python split
- **Vocabulary size:** 52,000
- **Training method:** `train_new_from_iterator` (a method of Hugging Face `transformers` fast tokenizers, backed by the `tokenizers` library)
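
The setup listed above can be sketched roughly as follows. This is a minimal sketch, not the exact script used for this tokenizer: the helper names `batch_iterator` and `train_code_tokenizer` and the batch size of 1,000 are illustrative assumptions, and `texts` stands for any in-memory sequence of Python source strings (for example, functions taken from the CodeSearchNet Python split).

```python
from typing import Iterator, List, Sequence


def batch_iterator(texts: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive batches of source strings (hypothetical helper)."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def train_code_tokenizer(texts: Sequence[str], vocab_size: int = 52_000):
    """Retrain the GPT-2 tokenizer's vocabulary on `texts` (hypothetical helper)."""
    # Imported here so that batch_iterator can be used without transformers installed.
    from transformers import AutoTokenizer

    # The base tokenizer supplies the algorithm and pre-processing pipeline;
    # the vocabulary and merges are learned again from the new corpus.
    base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
    return base_tokenizer.train_new_from_iterator(batch_iterator(texts), vocab_size)
```

Calling `train_code_tokenizer(texts)` needs the `transformers` package and downloads the `gpt2` tokenizer; the result can then be stored with `save_pretrained`.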

## Motivation

General-purpose tokenizers such as GPT-2's are trained mostly on natural language and tend to over-fragment code, splitting common patterns such as indentation, snake_case identifiers, and Python keywords into many small subword tokens. Training on a code-specific corpus lets the tokenizer learn merges that match these patterns, which reduces the number of tokens needed to represent typical Python source.
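
One way to check this claim on your own snippets is to count tokens with both tokenizers. The sketch below is an illustration under assumptions, not a benchmark shipped with this repository: `token_count`, `compare_token_counts`, and `load_and_compare` are hypothetical helper names, and no particular reduction is implied.

```python
def token_count(tokenizer, source: str) -> int:
    """Number of tokens `tokenizer` produces for `source`."""
    return len(tokenizer.tokenize(source))


def compare_token_counts(general, code_specific, source: str) -> dict:
    """Token counts for both tokenizers, plus code/general ratio (lower is better)."""
    general_n = token_count(general, source)
    code_n = token_count(code_specific, source)
    # Avoid dividing by zero when the source is empty.
    ratio = code_n / general_n if general_n else float("nan")
    return {"general": general_n, "code": code_n, "ratio": ratio}


def load_and_compare(source: str) -> dict:
    """Compare the stock GPT-2 tokenizer with this repository's tokenizer."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import AutoTokenizer

    general = AutoTokenizer.from_pretrained("gpt2")
    code_specific = AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer")
    return compare_token_counts(general, code_specific, source)
```

A ratio below 1.0 means the code tokenizer needed fewer tokens than GPT-2 for that snippet; results vary with the input, so it is worth trying several representative files.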

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer")

# Split a small Python snippet into subword tokens
tokens = tokenizer.tokenize("def hello_world():\n    print('Hello!')")
print(tokens)
```

## Notes

This tokenizer was trained while working through Chapter 6 of the Hugging Face LLM course, as a portfolio exercise in tokenizer training and domain adaptation.