---
language: en
license: mit
tags:
- tokenizer
- python
- code-search-net
- bpe
emoji: 📝
---

# New Tokenizer for Python Code (Trained on CodeSearchNet)

## Model Description

This is a custom Byte-Pair Encoding (BPE) tokenizer, initialized from a `gpt2` tokenizer and further trained on the Python subset of the [CodeSearchNet dataset](https://huggingface.co/datasets/claudios/code_search_net). The tokenizer is designed to efficiently tokenize Python code, which can be useful for various downstream tasks like code generation, code completion, and code analysis.

## Training Data

The tokenizer was trained on the `whole_func_string` column of the `train` split of the `claudios/code_search_net` dataset, restricted to Python code examples. The training corpus consisted of 412,178 Python function strings.
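The original training script is not included here, but `train_new_from_iterator` expects an iterator that yields batches of text. A minimal sketch of such a batched iterator over the corpus might look like this (the helper name, batch size, and toy corpus are illustrative assumptions, not the exact script):

```python
# Hypothetical sketch of the batching pattern used to feed the corpus
# to the tokenizer trainer. The real corpus would be the
# `whole_func_string` column of the CodeSearchNet Python train split.
def batch_iterator(corpus, batch_size=1000):
    """Yield successive batches of function strings from the corpus."""
    for i in range(0, len(corpus), batch_size):
        yield corpus[i:i + batch_size]

# Toy stand-in for the ~412k function strings:
corpus = ["def add(a, b):\n    return a + b"] * 2500
batches = list(batch_iterator(corpus))
print(len(batches))  # 3 batches: 1000 + 1000 + 500 strings
```

Streaming batches like this keeps memory usage flat even on a corpus of several hundred thousand functions.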

## Training Procedure

1.  **Base Tokenizer**: Started with a pre-trained `gpt2` tokenizer.
2.  **Training**: The `train_new_from_iterator` method from `transformers.PreTrainedTokenizerFast` was used to train a new vocabulary and merges from the `CodeSearchNet` Python code corpus. The new vocabulary size was set to 52,000 tokens.
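For intuition, the BPE algorithm behind this training step repeatedly finds the most frequent adjacent symbol pair in the corpus and merges it into a new vocabulary entry, until the target vocabulary size is reached. A minimal pure-Python sketch of one merge step (a toy illustration of the algorithm, not the actual `tokenizers` implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

def merge_pair(words, pair):
    """Apply one merge: fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy character-level corpus with word frequencies
words = {tuple("self"): 5, tuple("size"): 4, tuple("set"): 2}
pair, count = most_frequent_pair(words)
print(pair, count)          # ('s', 'e') 7 — "se" occurs in all three words
words = merge_pair(words, pair)
```

Repeating this loop on a Python corpus is what produces code-specific tokens such as `init`, `randn`, or `Ġself` in the vocabulary below.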

## How to Use

You can load and use this tokenizer with the `transformers` library:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rajaykumar12959/new_tokeniser")

# Example usage
example_code = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """

tokens = tokenizer.tokenize(example_code)
print(tokens)
# Output will be similar to:
# ['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weight', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']

encoded_input = tokenizer(example_code, return_tensors="pt")
print(encoded_input)
```

## License

This tokenizer is licensed under the MIT License.

## Author

[rajaykumar12959](https://huggingface.co/rajaykumar12959)