---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
library_name: transformers
base_model: gpt2
---

# Tokenizer for Python Code (Trained on CodeSearchNet)

## Model Description

This is a custom Byte-Pair Encoding (BPE) tokenizer, initialized from a `gpt2` tokenizer and further trained on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). The tokenizer is designed to tokenize Python code efficiently, which is useful for downstream tasks such as code generation, code completion, and code analysis.

## Training Data

The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to Python code examples. The training corpus consisted of approximately 412,178 Python function strings.

## Training Procedure

1. **Base Tokenizer**: Started from a pre-trained `gpt2` tokenizer.
2. **Training**: The `train_new_from_iterator` method of `transformers.PreTrainedTokenizerFast` was used to learn a new vocabulary and merge rules from the CodeSearchNet Python corpus. The new vocabulary size was set to 52,000 tokens.
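The training procedure above can be sketched as follows. This is a minimal illustration, not the exact training script: the tiny in-memory `corpus` list stands in for the ~412k `whole_func_string` entries from CodeSearchNet, and the output directory name is hypothetical. Only the `train_new_from_iterator` call and the `vocab_size=52000` setting come from the card.

```python
from transformers import AutoTokenizer

# Tiny stand-in corpus; the real training iterated over the Python
# `whole_func_string` column of claudios/code_search_net.
corpus = [
    "def add(a, b):\n    return a + b",
    "def greet(name):\n    print(f'Hello, {name}')",
    "def square(x):\n    return x * x",
]

def batch_iterator(batch_size=1000):
    # Yield the corpus in batches, as train_new_from_iterator expects
    # an iterator of text batches.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Start from the pre-trained gpt2 tokenizer (a fast tokenizer).
base = AutoTokenizer.from_pretrained("gpt2")

# Learn a new vocabulary and merges from the code corpus.
# vocab_size=52000 matches the card; a tiny corpus will of course
# produce far fewer merges before stopping.
new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=52000)
new_tokenizer.save_pretrained("python-code-tokenizer")
```

Batching the iterator keeps memory bounded on large corpora while letting the underlying `tokenizers` trainer see the full dataset.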
## How to Use

You can load and use this tokenizer with the `transformers` library:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```

## License

This tokenizer is licensed under the MIT License.

## Author

[rajaykumar12959](https://huggingface.co/rajaykumar12959)
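A simple way to see the benefit of a code-specific vocabulary is to compare token counts against the base `gpt2` tokenizer on the same snippet. The sketch below defines a hypothetical `token_count` helper and runs it only against the base tokenizer; the commented-out lines show how the same comparison would be made with this tokenizer loaded from the Hub.

```python
from transformers import AutoTokenizer

def token_count(tokenizer, text):
    # Number of input IDs the tokenizer produces for the text.
    return len(tokenizer(text)["input_ids"])

code = (
    "def fibonacci(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a"
)

base = AutoTokenizer.from_pretrained("gpt2")
print("gpt2 tokens:", token_count(base, code))

# Loading this tokenizer the same way allows a direct comparison;
# a code-trained vocabulary typically yields fewer tokens per snippet:
# custom = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")
# print("custom tokens:", token_count(custom, code))
```

Fewer tokens per snippet means more code fits in a model's context window, which is the main practical payoff of retraining the vocabulary on code.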