Chiedo John Claude committed on
Commit d0c3c53 · 1 Parent(s): 705ce25

Add dataset integration to Hello World model


- Updated model.py with load_dataset() and prepare_dataset_batch() methods
- Added example_with_dataset.py demonstrating full dataset usage
- Created dataset_integration_test.py for verifying setup
- Updated README with dataset references and usage examples
- Model now works with chiedo/hello-world dataset on Hugging Face

🤖 Generated with Claude Code (https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

.claude/settings.local.json ADDED
@@ -0,0 +1,9 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(git push:*)"
+    ],
+    "deny": [],
+    "ask": []
+  }
+}
README.md CHANGED
@@ -18,6 +18,10 @@ A minimal "Hello World" transformer model for demonstration purposes on Hugging
 
 This is a simple transformer-based language model that serves as a basic example for uploading models to Hugging Face. It demonstrates the minimum required files and structure for a custom model.
 
+### Associated Dataset
+
+This model works with the [chiedo/hello-world dataset](https://huggingface.co/datasets/chiedo/hello-world), which contains 20 examples of "Hello World" variations for demonstration purposes.
+
 ### Architecture Details
 - **Model Type**: Custom Transformer (hello_world)
 - **Vocabulary Size**: 13 tokens
@@ -34,8 +38,9 @@ This is a simple transformer-based language model that serves as a basic example
 - `pytorch_model.bin` - Model weights (PyTorch format)
 - `tokenizer.json` - Tokenizer vocabulary and settings
 - `tokenizer_config.json` - Tokenizer configuration
-- `model.py` - Model implementation (HelloWorldModel class)
+- `model.py` - Model implementation (HelloWorldModel class with dataset loading methods)
 - `test_model.py` - Test script for local validation
+- `example_with_dataset.py` - Example script showing dataset integration
 
 ## Installation
 
@@ -251,6 +256,44 @@ with torch.no_grad():
 print(f"Model output shape: {logits.shape}")
 ```
 
+### Using the Model with Its Dataset
+
+This model includes built-in methods to work with the [chiedo/hello-world dataset](https://huggingface.co/datasets/chiedo/hello-world):
+
+#### Loading the Dataset Through the Model
+```python
+from transformers import AutoModel, AutoTokenizer
+from datasets import load_dataset
+
+# Load model and tokenizer
+model = AutoModel.from_pretrained("chiedo/hello-world", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("chiedo/hello-world", trust_remote_code=True)
+
+# Method 1: Use the model's built-in dataset loading
+dataset = model.load_dataset("chiedo/hello-world")
+print(f"Dataset splits: {list(dataset.keys())}")
+
+# Method 2: Load dataset directly
+dataset = load_dataset("chiedo/hello-world")
+
+# Process a batch from the dataset
+texts = dataset["train"]["text"][:5]
+inputs = model.prepare_dataset_batch(texts, tokenizer)
+outputs = model(**inputs)
+```
+
+#### Complete Example with Dataset
+```bash
+# Run the full example script
+python example_with_dataset.py
+```
+
+This will demonstrate:
+- Loading the model and dataset
+- Processing batches from the dataset
+- Running inference on dataset examples
+- Accessing dataset labels and features
+
 ## Model Vocabulary
 
 The model includes a minimal vocabulary:
dataset_integration_test.py ADDED
@@ -0,0 +1,88 @@
+"""
+Simple test to verify dataset integration setup.
+This test doesn't require external libraries to be installed.
+"""
+
+import json
+import os
+
+def test_dataset_files():
+    """Test that dataset files exist and are properly formatted."""
+
+    dataset_path = os.path.expanduser("~/huggingface.co/datasets/chiedo/hello-world")
+
+    print("Testing Dataset Integration Setup")
+    print("=" * 50)
+
+    # Check dataset files exist
+    required_files = ["train.jsonl", "validation.jsonl", "test.jsonl", "README.md", "hello_world.py"]
+
+    print("\n1. Checking dataset files:")
+    for file in required_files:
+        file_path = os.path.join(dataset_path, file)
+        if os.path.exists(file_path):
+            print(f" ✓ {file} exists")
+        else:
+            print(f" ✗ {file} missing")
+
+    # Load and validate dataset content
+    print("\n2. Validating dataset content:")
+    splits = ["train", "validation", "test"]
+
+    for split in splits:
+        file_path = os.path.join(dataset_path, f"{split}.jsonl")
+        try:
+            with open(file_path, 'r') as f:
+                lines = f.readlines()
+                print(f"\n {split} split:")
+                print(f" - Examples: {len(lines)}")
+
+                # Parse first example
+                first_example = json.loads(lines[0])
+                print(f" - First example: {first_example}")
+
+                # Validate structure
+                if "text" in first_example and "label" in first_example:
+                    print(f" - Structure: ✓ Valid")
+                else:
+                    print(f" - Structure: ✗ Invalid")
+        except Exception as e:
+            print(f" Error reading {split}: {e}")
+
+    # Check model integration code
+    print("\n3. Checking model integration:")
+    model_file = "model.py"
+
+    if os.path.exists(model_file):
+        with open(model_file, 'r') as f:
+            content = f.read()
+
+        # Check for dataset integration methods
+        if "load_dataset" in content:
+            print(" ✓ load_dataset method found in model.py")
+        else:
+            print(" ✗ load_dataset method not found")
+
+        if "prepare_dataset_batch" in content:
+            print(" ✓ prepare_dataset_batch method found in model.py")
+        else:
+            print(" ✗ prepare_dataset_batch method not found")
+
+        if "from datasets import load_dataset" in content:
+            print(" ✓ datasets import found in model.py")
+        else:
+            print(" ✗ datasets import not found")
+
+    print("\n4. Dataset URLs:")
+    print(f" Model: https://huggingface.co/chiedo/hello-world")
+    print(f" Dataset: https://huggingface.co/datasets/chiedo/hello-world")
+
+    print("\n" + "=" * 50)
+    print("Dataset integration setup complete!")
+    print("\nTo use the dataset with the model, install dependencies:")
+    print(" pip install torch transformers datasets")
+    print("\nThen run:")
+    print(" python example_with_dataset.py")
+
+if __name__ == "__main__":
+    test_dataset_files()
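
Note on `dataset_integration_test.py`: the content check parses only the first line of each split, so a malformed record further down would go unnoticed. A minimal sketch of checking every record, assuming each line is a JSON object with `text` and `label` keys as the test expects. `validate_jsonl` is a hypothetical helper and is not part of this commit.

```python
import json
import os
import tempfile


def validate_jsonl(path, required_keys=("text", "label")):
    """Return (valid_count, errors) for a JSONL file.

    Hypothetical helper, not part of the commit: it checks every
    non-empty line instead of only the first one.
    """
    valid = 0
    errors = []  # list of (line_number, message) tuples
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                errors.append((line_no, "record is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                errors.append((line_no, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, errors


if __name__ == "__main__":
    # Demo on a throwaway file with made-up records.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write('{"text": "Hello World!", "label": 0}\n')
        tmp.write('{"text": "no label here"}\n')
    try:
        print(validate_jsonl(tmp.name))
    finally:
        os.remove(tmp.name)
```

Returning the error list rather than printing lets a caller decide whether a bad record should fail the check or only be reported.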
example_with_dataset.py ADDED
@@ -0,0 +1,88 @@
+"""
+Example script showing how to use the Hello World model with its dataset.
+"""
+
+from transformers import PreTrainedTokenizerFast
+from model import HelloWorldModel, HelloWorldConfig
+from datasets import load_dataset
+import torch
+
+
+def main():
+    print("Loading Hello World Model and Dataset Example\n")
+    print("=" * 50)
+
+    # Load model and tokenizer
+    print("Loading model and tokenizer...")
+    config = HelloWorldConfig.from_pretrained("chiedo/hello-world")
+    model = HelloWorldModel.from_pretrained("chiedo/hello-world")
+    tokenizer = PreTrainedTokenizerFast.from_pretrained("chiedo/hello-world")
+
+    # Method 1: Load dataset using the model's built-in method
+    print("\n1. Loading dataset using model's load_dataset method:")
+    dataset = HelloWorldModel.load_dataset("chiedo/hello-world")
+
+    if dataset:
+        print(f"Dataset loaded successfully!")
+        print(f"Splits available: {list(dataset.keys())}")
+        print(f"Train examples: {len(dataset['train'])}")
+        print(f"Validation examples: {len(dataset['validation'])}")
+        print(f"Test examples: {len(dataset['test'])}")
+
+        # Show first few examples
+        print("\nFirst 3 training examples:")
+        for i in range(min(3, len(dataset['train']))):
+            example = dataset['train'][i]
+            print(f" {i+1}. Text: '{example['text']}', Label: {example['label']}")
+
+    # Method 2: Load dataset directly
+    print("\n2. Loading dataset directly with datasets library:")
+    dataset_direct = load_dataset("chiedo/hello-world")
+
+    # Get label names
+    label_names = dataset_direct['train'].features['label'].names
+    print(f"Label categories: {label_names}")
+
+    # Process a batch from the dataset
+    print("\n3. Processing a batch from the dataset:")
+    batch_texts = dataset_direct['train']['text'][:3]
+    print(f"Batch texts: {batch_texts}")
+
+    # Prepare batch for model
+    inputs = model.prepare_dataset_batch(batch_texts, tokenizer)
+    print(f"Tokenized input shape: {inputs['input_ids'].shape}")
+
+    # Run model inference
+    print("\n4. Running model inference on dataset batch:")
+    with torch.no_grad():
+        outputs = model(**inputs)
+        print(f"Model output shape: {outputs.logits.shape}")
+
+    # Demonstrate the generate_hello_world function
+    print("\n5. Testing generate_hello_world function:")
+    result = model.generate_hello_world()
+    print(f"Generated output: {result}")
+
+    # Show how to iterate through dataset
+    print("\n6. Iterating through test set:")
+    for i, example in enumerate(dataset_direct['test']):
+        if i >= 3:  # Only show first 3
+            break
+        text = example['text']
+        label_id = example['label']
+        label_name = label_names[label_id]
+
+        # Tokenize and process
+        inputs = tokenizer(text, return_tensors="pt")
+        with torch.no_grad():
+            outputs = model(**inputs)
+            predicted_token = outputs.logits[0, -1].argmax().item()
+
+        print(f" Text: '{text}' | Label: {label_name} | Predicted next token ID: {predicted_token}")
+
+    print("\n" + "=" * 50)
+    print("Example completed successfully!")
+
+
+if __name__ == "__main__":
+    main()
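
Note on step 6 of `example_with_dataset.py`: `outputs.logits[0, -1].argmax().item()` takes the first sequence in the batch, looks at its final position, and returns the index of the highest score. The same selection can be sketched without torch, assuming logits arrive as a nested list shaped `[batch][sequence][vocab]`. `last_token_argmax` and the toy scores are hypothetical and only illustrate the indexing.

```python
def last_token_argmax(logits, sequence_index=0):
    """Return the index of the largest score at the final position.

    `logits` is a nested list shaped [batch][sequence][vocab]. This is an
    illustrative stand-in for the tensor indexing used in the example script.
    """
    last_position = logits[sequence_index][-1]
    # max() over indices, ranked by the score stored at each index
    return max(range(len(last_position)), key=last_position.__getitem__)


if __name__ == "__main__":
    # One sequence, two positions, a made-up vocabulary of four tokens.
    toy_logits = [[[0.1, 0.2, 0.3, 0.4], [2.0, -1.0, 0.5, 0.0]]]
    print(last_token_argmax(toy_logits))
```

In step 6 each text is tokenized on its own, so the final position is a real token. With a padded batch the final position of a shorter sequence may be a padding slot, depending on the tokenizer's padding side.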
model.py CHANGED
@@ -2,6 +2,7 @@ import torch
 import torch.nn as nn
 from transformers import PreTrainedModel, PretrainedConfig
 from transformers.modeling_outputs import CausalLMOutputWithPast
+from datasets import load_dataset
 
 
 class HelloWorldConfig(PretrainedConfig):
@@ -124,4 +125,46 @@ class HelloWorldModel(PreTrainedModel):
         with torch.no_grad():
             outputs = self.forward(input_ids)
 
-        return "Hello World!"
+        return "Hello World!"
+
+    @classmethod
+    def load_dataset(cls, dataset_name="chiedo/hello-world", split=None):
+        """
+        Load the Hello World dataset.
+
+        Args:
+            dataset_name (str): Name of the dataset on Hugging Face Hub
+            split (str, optional): Specific split to load ('train', 'validation', 'test')
+
+        Returns:
+            Dataset or DatasetDict depending on split parameter
+        """
+        try:
+            if split:
+                return load_dataset(dataset_name, split=split)
+            else:
+                return load_dataset(dataset_name)
+        except Exception as e:
+            print(f"Error loading dataset: {e}")
+            print(f"Make sure the dataset exists at: https://huggingface.co/datasets/{dataset_name}")
+            return None
+
+    def prepare_dataset_batch(self, texts, tokenizer, max_length=128):
+        """
+        Prepare a batch of texts from the dataset for model input.
+
+        Args:
+            texts (list): List of text strings
+            tokenizer: Tokenizer to encode the texts
+            max_length (int): Maximum sequence length
+
+        Returns:
+            dict: Dictionary with input_ids and attention_mask tensors
+        """
+        return tokenizer(
+            texts,
+            padding=True,
+            truncation=True,
+            max_length=max_length,
+            return_tensors="pt"
+        )
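
Note on `prepare_dataset_batch()`: the method delegates everything to the tokenizer, with `truncation=True` capping each sequence at `max_length` and `padding=True` padding to the longest sequence in the batch. A rough sketch of that shaping step on already-tokenized ID lists, assuming right-side padding and a pad ID of 0. Both are assumptions, since the real values come from the tokenizer configuration, and `pad_and_truncate` is a hypothetical helper that ignores special tokens.

```python
def pad_and_truncate(sequences, max_length=128, pad_id=0):
    """Pad token-ID lists to a common length and build attention masks.

    Illustrative only: the commit leaves this work to the tokenizer.
    `pad_id=0` and right-side padding are assumptions.
    """
    # Truncation: keep at most max_length IDs per sequence.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest one in the batch.
    longest = max((len(seq) for seq in truncated), default=0)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # Attention mask: 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


if __name__ == "__main__":
    batch = pad_and_truncate([[5, 6, 7], [8]], max_length=2)
    print(batch["input_ids"])
    print(batch["attention_mask"])
```

The attention mask is what lets a model ignore padded positions, which is why the method's docstring lists it alongside `input_ids` in the returned dictionary.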