Chiedo John committed · Commit c125a8a · 0 parents

Initial commit

.gitattributes ADDED
@@ -0,0 +1 @@
*.bin filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,177 @@
# Hello World Model

A minimal "Hello World" transformer model for demonstration purposes on Hugging Face.

## Model Description

This is a simple transformer-based language model that serves as a basic example for uploading models to Hugging Face. It demonstrates the minimum required files and structure for a custom model.

### Architecture Details
- **Model Type**: Custom Transformer (hello_world)
- **Vocabulary Size**: 13 tokens
- **Hidden Size**: 64 dimensions
- **Number of Layers**: 1 transformer encoder layer
- **Attention Heads**: 1
- **Intermediate Size**: 128
- **Max Position Embeddings**: 512
- **Activation Function**: GELU
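
For a sense of scale, a rough parameter count can be derived from the numbers above (a sketch; the exact total depends on `nn.TransformerEncoderLayer` internals in your PyTorch version):

```python
# Rough parameter count for the architecture above (float32 weights).
V, H, F, P = 13, 64, 128, 512  # vocab, hidden, intermediate, max positions

embeddings = V * H                           # token embeddings: 832
positions  = P * H                           # position embeddings: 32,768
attention  = 3 * (H * H + H) + (H * H + H)   # q/k/v + output projection: 16,640
ffn        = (H * F + F) + (F * H + H)       # two feed-forward linears: 16,576
norms      = 2 * 2 * H                       # two LayerNorms (weight + bias): 256
lm_head    = H * V + V                       # output projection: 845

total = embeddings + positions + attention + ffn + norms + lm_head
print(total)  # 67,917, i.e. roughly 68k parameters
```

At four bytes per float32 parameter this is about 272 KB, consistent with the 277,815-byte `pytorch_model.bin` in this repository once serialization overhead is included.
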
## Files Included

- `config.json` - Model configuration
- `pytorch_model.bin` - Model weights (PyTorch format)
- `tokenizer.json` - Tokenizer vocabulary and settings
- `tokenizer_config.json` - Tokenizer configuration
- `model.py` - Model implementation (HelloWorldModel class)
- `test_model.py` - Test script for local validation

## Installation

### Using a Virtual Environment (Recommended)

It's recommended to use a virtual environment to manage dependencies:

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install required packages
pip install torch transformers
```

### Direct Installation

If you prefer to install into your current environment:

```bash
pip install torch transformers
```

## Usage

### Basic Usage

```python
from transformers import PreTrainedTokenizerFast
from model import HelloWorldModel, HelloWorldConfig
import torch

# Load configuration and model (model.py must be importable,
# e.g. run this from a local clone of the repository)
config = HelloWorldConfig.from_pretrained("chiedo/chaydos")
model = HelloWorldModel.from_pretrained("chiedo/chaydos")

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("chiedo/chaydos")

# Generate Hello World
output = model.generate_hello_world()
print(output)  # "Hello World!"
```
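
The custom classes in `model.py` are not registered with Transformers' Auto classes, so everything can also be loaded from a local clone of the repository, using the same approach as `test_model.py`:

```python
import torch
from model import HelloWorldModel, HelloWorldConfig

# Load config and weights from the current directory (a local clone)
config = HelloWorldConfig.from_pretrained(".")
model = HelloWorldModel(config)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()
```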

### Tokenization Example

```python
# Tokenize text
text = "Hello World"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode tokens back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

### Forward Pass Example

```python
# Prepare input
input_text = "Hello"
inputs = tokenizer(input_text, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape (1, 1, 13): batch, seq_len, vocab_size
```

## Model Vocabulary

The model includes a minimal 13-token vocabulary (the full token-to-id mapping is shown below):
- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- Content tokens: `Hello`, `World`, `!`, `hello`, `world`, `.`, `,`, `?`
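
The full token-to-id mapping, exactly as defined in `tokenizer.json`:

```python
# Token-to-id mapping from tokenizer.json (ids 0-4 are special tokens)
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "Hello": 5, "World": 6, "!": 7,
    "hello": 8, "world": 9, ".": 10, ",": 11, "?": 12,
}
```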

## Training

This is a demonstration model and has not been trained on any dataset. The weights are randomly initialized from a normal distribution with a standard deviation of 0.02 (see `_init_weights` in `model.py`).

## Testing

Run the included test script to verify the model works correctly:

```bash
# Make sure your virtual environment is activated, if you're using one
# source venv/bin/activate  # On macOS/Linux
# venv\Scripts\activate     # On Windows

python test_model.py
```

## Uploading to Hugging Face

To upload this model to your own Hugging Face account:

```bash
# Install huggingface-hub
pip install huggingface-hub

# Log in to Hugging Face
huggingface-cli login

# Create a new model repository (if it doesn't exist)
huggingface-cli repo create hello-world-model --type model

# Upload all model files
huggingface-cli upload your-username/hello-world-model . --repo-type model
```
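
The same upload can be done from Python with the `huggingface_hub` API; a minimal sketch (substitute your own repo id):

```python
from huggingface_hub import HfApi, create_repo

# Create the repository if it doesn't exist yet
create_repo("your-username/hello-world-model", repo_type="model", exist_ok=True)

# Upload the whole working directory as the model files
api = HfApi()
api.upload_folder(
    folder_path=".",
    repo_id="your-username/hello-world-model",
    repo_type="model",
)
```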

## Technical Details

- **Framework**: PyTorch
- **Transformers Version**: 4.36.0+
- **Python Version**: 3.8+ (required by Transformers 4.36)
- **License**: MIT

## Limitations

- This model is for demonstration and educational purposes only
- Not trained on any real data
- Should not be used for production applications
- Limited vocabulary of 13 tokens
- The single-layer architecture is too simple for real NLP tasks

## Citation

If you use this model as a template:

```bibtex
@misc{hello-world-model,
  title={Hello World Model - A Minimal Hugging Face Model Example},
  author={Your Name},
  year={2024},
  publisher={Hugging Face}
}
```

## License

MIT License - This model is open source and available for any use.

## Contact

For questions or issues with this demonstration model, please open an issue on the repository.
__pycache__/model.cpython-313.pyc ADDED
Binary file (6.09 kB).
 
config.json ADDED
@@ -0,0 +1,16 @@
{
  "model_type": "hello_world",
  "architectures": ["HelloWorldModel"],
  "vocab_size": 13,
  "hidden_size": 64,
  "num_hidden_layers": 1,
  "num_attention_heads": 1,
  "intermediate_size": 128,
  "hidden_act": "gelu",
  "max_position_embeddings": 512,
  "type_vocab_size": 1,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "pad_token_id": 0,
  "transformers_version": "4.36.0"
}
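
To confirm the configuration round-trips through the custom config class, a small check can be run from the repository root (a sketch; `HelloWorldConfig` comes from `model.py`):

```python
from model import HelloWorldConfig

# Reads config.json from the current directory
config = HelloWorldConfig.from_pretrained(".")
assert config.model_type == "hello_world"
assert config.vocab_size == 13 and config.hidden_size == 64
print(config)  # prints the resolved configuration as JSON
```
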
model.py ADDED
@@ -0,0 +1,127 @@
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig
from transformers.modeling_outputs import CausalLMOutputWithPast


class HelloWorldConfig(PretrainedConfig):
    model_type = "hello_world"

    def __init__(
        self,
        vocab_size=13,
        hidden_size=64,
        num_hidden_layers=1,
        num_attention_heads=1,
        intermediate_size=128,
        hidden_act="gelu",
        max_position_embeddings=512,
        type_vocab_size=1,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, **kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps


class HelloWorldModel(PreTrainedModel):
    config_class = HelloWorldConfig

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        # Token embeddings plus learned absolute position embeddings
        self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)

        # Single encoder layer; pass the configured activation ("gelu") so the
        # layer matches config.hidden_act instead of PyTorch's default ReLU
        self.layer = nn.TransformerEncoderLayer(
            d_model=config.hidden_size,
            nhead=config.num_attention_heads,
            dim_feedforward=config.intermediate_size,
            activation=config.hidden_act,
            batch_first=True
        )

        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

        self.init_weights()

    def _init_weights(self, module):
        # Normal(0, initializer_range) init, zeroed biases and padding rows
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,  # accepted for API compatibility; unused by this minimal layer
        position_ids=None,
        past_key_values=None,
        labels=None,
        use_cache=False,
        output_attentions=False,
        output_hidden_states=False,
        return_dict=True,
    ):
        if input_ids is not None:
            batch_size, seq_length = input_ids.shape
        else:
            raise ValueError("You have to specify input_ids")

        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)

        inputs_embeds = self.embeddings(input_ids)
        position_embeds = self.position_embeddings(position_ids)

        hidden_states = inputs_embeds + position_embeds

        hidden_states = self.layer(hidden_states)

        logits = self.lm_head(hidden_states)

        loss = None
        if labels is not None:
            # Standard causal LM shift: predict token t+1 from position t
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))

        if not return_dict:
            output = (logits,)
            return ((loss,) + output) if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=past_key_values,
            # Simplified: a single tensor rather than the usual per-layer tuple
            hidden_states=hidden_states if output_hidden_states else None,
            attentions=None
        )

    def generate_hello_world(self):
        # Demonstration only: runs a forward pass over the "Hello World" token
        # ids, then returns a hardcoded string rather than sampling from logits
        hello_token_id = 5
        world_token_id = 6

        input_ids = torch.tensor([[hello_token_id, world_token_id]])

        with torch.no_grad():
            outputs = self.forward(input_ids)

        return "Hello World!"
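
A minimal smoke test for the module above, assuming it is run next to `model.py` (the expected logits shape follows from the config defaults: batch 1, sequence length 2, vocabulary 13):

```python
import torch
from model import HelloWorldModel, HelloWorldConfig

# Build a randomly initialized model from the default config
model = HelloWorldModel(HelloWorldConfig())
out = model(torch.tensor([[5, 6]]))  # ids for "Hello", "World"
print(out.logits.shape)              # torch.Size([1, 2, 13])
print(model.generate_hello_world())  # Hello World!
```
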
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2d4b73d903ac63975c8183e6b1b727ae4e505639512375a5bcf40235021ed709
size 277815
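
This is a Git LFS pointer, not the weights themselves; `git lfs pull` (or downloading from the Hub) fetches the actual 277,815-byte binary. After downloading, the file can be checked against the pointer's `oid`, for example:

```python
import hashlib

# Compare the local file's SHA-256 with the oid in the LFS pointer above
with open("pytorch_model.bin", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest == "2d4b73d903ac63975c8183e6b1b727ae4e505639512375a5bcf40235021ed709")
```
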
test_model.py ADDED
@@ -0,0 +1,28 @@
from model import HelloWorldModel, HelloWorldConfig
from transformers import PreTrainedTokenizerFast
import torch

print("Loading configuration...")
config = HelloWorldConfig.from_pretrained(".")

print("Loading model...")
model = HelloWorldModel(config)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()

print("Loading tokenizer...")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

print("\nTesting model generation...")
output = model.generate_hello_world()
print(f"Model output: {output}")

print("\nTesting tokenization...")
text = "Hello World"
tokens = tokenizer.encode(text)
print(f"Tokenized '{text}': {tokens}")

decoded = tokenizer.decode(tokens)
print(f"Decoded back: {decoded}")

print("\nModel test completed successfully!")
tokenizer.json ADDED
@@ -0,0 +1,82 @@
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {"id": 0, "content": "[PAD]",  "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true},
    {"id": 1, "content": "[UNK]",  "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true},
    {"id": 2, "content": "[CLS]",  "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true},
    {"id": 3, "content": "[SEP]",  "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true},
    {"id": 4, "content": "[MASK]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Whitespace"
  },
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": "[UNK]",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "vocab": {
      "[PAD]": 0,
      "[UNK]": 1,
      "[CLS]": 2,
      "[SEP]": 3,
      "[MASK]": 4,
      "Hello": 5,
      "World": 6,
      "!": 7,
      "hello": 8,
      "world": 9,
      ".": 10,
      ",": 11,
      "?": 12
    },
    "merges": []
  }
}
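
With the `Whitespace` pre-tokenizer and a BPE model that has no merges, encoding is a plain lookup of whole tokens in the vocabulary above. A quick check using the `tokenizers` library directly:

```python
from tokenizers import Tokenizer

# Load the tokenizer definition straight from tokenizer.json
tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("Hello World!")
print(enc.tokens)  # ['Hello', 'World', '!']
print(enc.ids)     # [5, 6, 7]
```
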
tokenizer_config.json ADDED
@@ -0,0 +1,13 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 512,
  "padding_side": "right",
  "truncation_side": "right",
  "special_tokens_map_file": null,
  "clean_up_tokenization_spaces": true,
  "unk_token": "[UNK]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "sep_token": "[SEP]",
  "mask_token": "[MASK]"
}