	---
	license: mit
	tags:
	- tokenizer
	- code
	- python
	---

	# Python Code Tokenizer (CodeSearchNet)

	A domain-specific tokenizer derived from the GPT-2 tokenizer. It keeps GPT-2's byte-level BPE pipeline but learns a new vocabulary from Python source code, so it handles code syntax, identifiers, and structure better than a general-purpose English tokenizer.

	## Details

	- Base tokenizer: GPT-2 (`gpt2`)
	- Training corpus: [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net), Python split
	- Vocabulary size: 52,000
	- Training method: `train_new_from_iterator` (a method of fast tokenizers in Hugging Face `transformers`, backed by the `tokenizers` library)
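
The details above correspond roughly to the following training recipe. This is a minimal sketch, not the exact script used for this tokenizer: the helper names `batch_iterator`, `train_code_tokenizer`, and `train_on_codesearchnet` are hypothetical, and the batch size of 1,000 and the `whole_func_string` column name are assumptions (the column name follows the Hugging Face course example).

```python
from typing import Iterator, List, Sequence


def batch_iterator(texts: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive batches of source strings (batch size is an assumed value)."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def train_code_tokenizer(texts: Sequence[str], vocab_size: int = 52000):
    """Train a new vocabulary on `texts`, reusing GPT-2's tokenization pipeline."""
    # Imported lazily so batch_iterator can be used without transformers installed.
    from transformers import AutoTokenizer

    base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # train_new_from_iterator keeps the base tokenizer's algorithm and
    # pre-processing but learns a fresh vocabulary from the given corpus.
    return base_tokenizer.train_new_from_iterator(batch_iterator(texts), vocab_size)


def train_on_codesearchnet():
    """Hypothetical end-to-end run; downloads the dataset and the GPT-2 tokenizer."""
    from datasets import load_dataset

    dataset = load_dataset("code-search-net/code_search_net", "python", split="train")
    # Assumption: each function's source lives in the "whole_func_string" column.
    return train_code_tokenizer(dataset["whole_func_string"])
```

Calling `train_on_codesearchnet()` would return a new fast tokenizer that can be saved with `save_pretrained`. Depending on the installed `datasets` version, loading the corpus may need additional arguments.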

	## Motivation

	General-purpose tokenizers such as GPT-2's are trained mostly on natural language and tend to over-fragment code, splitting common patterns such as indentation, snake_case identifiers, and Python keywords into many small subword tokens. Training on a code-specific corpus lets the tokenizer learn merges that match these patterns, which reduces the number of tokens needed to represent typical Python source.

	## Usage

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer")
	# Tokenize a small Python snippet and inspect the resulting subword tokens
	tokens = tokenizer.tokenize("def hello_world():\n    print('Hello!')")
	print(tokens)
	```
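
To check the token-count claim from the Motivation section on your own code, a small comparison helper can be useful. This is an illustrative sketch: `count_tokens` and `compare_with_gpt2` are hypothetical helpers, not part of this repository, and no particular result is implied.

```python
from typing import Dict


def count_tokens(tokenizers: Dict[str, object], code: str) -> Dict[str, int]:
    """Return the number of tokens each tokenizer produces for `code`."""
    # Each value only needs a .tokenize(str) -> list method,
    # which Hugging Face tokenizers provide.
    return {name: len(tok.tokenize(code)) for name, tok in tokenizers.items()}


def compare_with_gpt2(code: str) -> Dict[str, int]:
    """Download both tokenizers and compare their token counts for `code`."""
    from transformers import AutoTokenizer  # lazy import; needs network access

    tokenizers = {
        "gpt2": AutoTokenizer.from_pretrained("gpt2"),
        "code": AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer"),
    }
    return count_tokens(tokenizers, code)
```

For example, `compare_with_gpt2("def add(a, b):\n    return a + b")` returns a dictionary with one count per tokenizer, which makes it easy to compare the two on snippets from your own codebase.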

	## Notes

	This tokenizer was trained as part of working through the Hugging Face LLM course (Chapter 6), as a portfolio exercise in tokenizer training and domain adaptation.