raimondskrauklis committed on
Commit
b2bb2fa
·
verified ·
1 Parent(s): bf44e57

Update model card with comprehensive training details

Files changed (1)
  1. README.md +103 -48
README.md CHANGED
@@ -10,44 +10,81 @@ tags:
10
  - python
11
  - gpt-neo
12
  - instruction-following
 
 
 
 
 
13
  metrics:
14
  - name: Training Loss (Final)
15
  type: loss
16
  value: 0.4554
17
  verified: false
18
- - name: Dataset Size
19
  type: examples
20
  value: 362059
21
  verified: false
 
 
 
 
 
 
 
 
 
 
 
 
22
  ---
23
 
24
  # GPT-Neo 1.3B Enhanced for Code and Conversation
25
 
26
- A fine-tuned version of GPT-Neo 1.3B optimized for both conversational AI and Python code generation. This model combines instruction-following capabilities with comprehensive Python programming knowledge.
27
 
28
  ## Model Description
29
 
30
- This model represents a multi-layer fine-tuning approach:
31
- - **Base**: EleutherAI's GPT-Neo 1.3B
32
- - **Layer 1**: Conversational fine-tuning for instruction-following
33
- - **Layer 2**: Python code generation using CodeSearchNet dataset (362,059 examples)
 
 
 
 
34
 
35
  ## Training Details
36
 
37
  - **Architecture**: GPT-Neo 1.3B (transformer-based autoregressive language model)
38
- - **Training Data**: High-quality Python code examples with documentation
39
  - **Training Infrastructure**: European HPC systems with AMD GPU acceleration
40
- - **Optimization**: Multi-GPU distributed training with gradient accumulation
41
  - **Final Training Loss**: 0.4554 (excellent convergence)
 
 
 
42
 
43
- ## Usage
44
 
45
  ### Code Generation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46
  ```python
47
  from transformers import GPTNeoForCausalLM, GPT2Tokenizer
48
 
49
- model = GPTNeoForCausalLM.from_pretrained("your-username/gpt-neo-1.3b-code-conversation")
50
- tokenizer = GPT2Tokenizer.from_pretrained("your-username/gpt-neo-1.3b-code-conversation")
51
  tokenizer.pad_token = tokenizer.eos_token
52
 
53
  # Code generation example
@@ -56,56 +93,66 @@ inputs = tokenizer(prompt, return_tensors="pt")
56
  outputs = model.generate(**inputs, max_length=200, temperature=0.7, do_sample=True)
57
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
58
  print(response)
59
- Conversational AI
60
- # Conversation example
61
- prompt = "Human: Explain machine learning in simple terms\nAssistant:"
 
 
 
 
 
62
  inputs = tokenizer(prompt, return_tensors="pt")
63
- outputs = model.generate(**inputs, max_length=150, temperature=0.7)
64
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
65
  print(response)
66
  Training Methodology
67
- The model was trained using a proven multi-layer approach:
68
 
69
- Conversational Foundation: Initial fine-tuning on high-quality conversation data
70
- Code Specialization: Subsequent training on curated Python programming examples
71
- Quality Filtering: Rigorous filtering for meaningful code-documentation pairs
72
- Distributed Training: Efficient scaling across multiple GPUs
73
 
74
- Performance Characteristics
75
 
76
- Code Understanding: Strong comprehension of Python syntax and patterns
77
- Documentation: Ability to explain code functionality clearly
78
- Instruction Following: Responds appropriately to programming requests
79
- Conversational Flow: Maintains context in multi-turn interactions
 
80
 
81
- Model Capabilities
82
- Code Generation
83
 
84
- Python functions with proper documentation
85
- Algorithm implementations
86
- Data structure manipulations
87
- Error handling patterns
88
-
89
- Conversational AI
90
-
91
- Technical explanations
92
- Step-by-step instructions
93
- Problem-solving discussions
94
- Educational content
95
 
96
  Limitations
97
 
98
- Primarily trained on Python code (limited other languages)
99
- May generate plausible but incorrect code for complex tasks
100
- Training data cutoff affects knowledge of recent libraries
101
- Best results with clear, specific prompts
 
102
 
103
  Ethical Considerations
104
 
105
- Model outputs should be reviewed for production use
106
- Code suggestions require testing and validation
107
- Potential for generating biased or inappropriate content
108
- Users responsible for compliance with applicable regulations
 
 
 
 
 
 
 
 
 
 
109
 
110
  Citation
111
  @misc{gpt-neo-code-conversation-2025,
@@ -113,7 +160,15 @@ bibtex@misc{gpt-neo-code-conversation-2025,
113
  author={Raimonds Krauklis},
114
  year={2025},
115
  howpublished={Hugging Face Model Hub},
116
- url={https://huggingface.co/raimondskrauklis/gpt-neo-1.3b-code-conversation}
 
117
  }
118
  Acknowledgments
119
- Training conducted using European high-performance computing infrastructure. Based on EleutherAI's GPT-Neo and CodeSearchNet dataset.
 
 
 
 
 
 
 
 
10
  - python
11
  - gpt-neo
12
  - instruction-following
13
+ - codesearchnet
14
+ base_model: EleutherAI/gpt-neo-1.3B
15
+ datasets:
16
+ - OpenAssistant/oasst1
17
+ - code_search_net
18
  metrics:
19
  - name: Training Loss (Final)
20
  type: loss
21
  value: 0.4554
22
  verified: false
23
+ - name: Dataset Size (CodeSearchNet)
24
  type: examples
25
  value: 362059
26
  verified: false
27
+ model-index:
28
+ - name: gpt-neo-1.3b-code-conversation
29
+ results:
30
+ - task:
31
+ type: text-generation
32
+ dataset:
33
+ type: code_search_net
34
+ name: CodeSearchNet Python
35
+ metrics:
36
+ - type: loss
37
+ value: 0.4554
38
+ name: Training Loss
39
  ---
40
 
41
  # GPT-Neo 1.3B Enhanced for Code and Conversation
42
 
43
+ A fine-tuned version of GPT-Neo 1.3B optimized for both conversational AI and Python code generation. This model combines instruction-following capabilities with comprehensive Python programming knowledge through a multi-layer fine-tuning approach.
44
 
45
  ## Model Description
46
 
47
+ **Base Model**: EleutherAI/gpt-neo-1.3B
48
+ **Fine-tuning Approach**: Multi-layer sequential training
49
+ **Specializations**: Conversation + Python Code Generation
50
+
51
+ ### Training Layers:
52
+ 1. **Conversational Foundation**: Fine-tuned on high-quality dialogue data for instruction-following
53
+ 2. **Code Specialization**: Enhanced with 362,059 Python code examples from the CodeSearchNet dataset
54
+ 3. **Integration**: Maintains conversational abilities while adding strong coding capabilities
55
 
56
  ## Training Details
57
 
58
  - **Architecture**: GPT-Neo 1.3B (transformer-based autoregressive language model)
 
59
  - **Training Infrastructure**: European HPC systems with AMD GPU acceleration
60
+ - **Distributed Training**: Multi-GPU setup with gradient accumulation
61
  - **Final Training Loss**: 0.4554 (excellent convergence)
62
+ - **CodeSearchNet Dataset**: 362,059 high-quality Python code-documentation pairs
63
+ - **Training Duration**: ~6 hours on 8x AMD MI250X GPUs
64
+ - **Optimization**: AdamW optimizer with cosine annealing schedule
65
 
66
+ ## Capabilities
67
 
68
  ### Code Generation
69
+ - **Python Functions**: Complete implementations with proper documentation
70
+ - **Algorithm Development**: Data structures, algorithms, and problem-solving
71
+ - **Code Explanation**: Clear explanations of functionality and logic
72
+ - **Documentation**: Automatic docstring and comment generation
73
+
74
+ ### Conversational AI
75
+ - **Instruction Following**: Responds appropriately to coding requests
76
+ - **Technical Explanations**: Breaks down complex programming concepts
77
+ - **Problem Solving**: Helps debug and optimize code solutions
78
+ - **Educational Content**: Teaches programming concepts step-by-step
79
+
80
+ ## Usage Examples
81
+
82
+ ### Python Code Generation
83
  ```python
84
  from transformers import GPTNeoForCausalLM, GPT2Tokenizer
85
 
86
+ model = GPTNeoForCausalLM.from_pretrained("raimondskrauklis/gpt-neo-1.3b-code-conversation")
87
+ tokenizer = GPT2Tokenizer.from_pretrained("raimondskrauklis/gpt-neo-1.3b-code-conversation")
88
  tokenizer.pad_token = tokenizer.eos_token
89
 
90
  # Code generation example
 
93
  outputs = model.generate(**inputs, max_length=200, temperature=0.7, do_sample=True)
94
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
95
  print(response)
96
+ Code Explanation
97
+ prompt = "Human: Explain how binary search works in Python\nAssistant:"
98
+ inputs = tokenizer(prompt, return_tensors="pt")
99
+ outputs = model.generate(**inputs, max_length=300, temperature=0.7, do_sample=True)
100
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
101
+ print(response)
102
+ Debugging Assistance
103
+ prompt = "Human: Why does this Python code give a list index error?\ncode: for i in range(len(data)+1): print(data[i])\nAssistant:"
104
  inputs = tokenizer(prompt, return_tensors="pt")
105
+ outputs = model.generate(**inputs, max_length=250, temperature=0.7, do_sample=True)
106
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)
107
  print(response)
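
The usage examples above all wrap the request in a `Human: ...\nAssistant:` prompt, and `tokenizer.decode` on the output of a causal language model returns the prompt followed by the continuation. A minimal sketch of two helpers for that pattern follows. The names `build_prompt` and `extract_reply` are hypothetical and not part of the model card or of Transformers, and the `code:` line simply mirrors the debugging example.

```python
from typing import Optional


def build_prompt(instruction: str, code: Optional[str] = None) -> str:
    """Format a request in the Human/Assistant style used by the examples.

    Hypothetical helper: the optional `code:` line mirrors the debugging example.
    """
    parts = [f"Human: {instruction}"]
    if code is not None:
        parts.append(f"code: {code}")
    return "\n".join(parts) + "\nAssistant:"


def extract_reply(prompt: str, decoded: str) -> str:
    """Remove the echoed prompt and stop at the next 'Human:' turn, if any."""
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return reply.split("\nHuman:", 1)[0].strip()
```

With these, `extract_reply(prompt, response)` would return only the assistant's text from the decoded `response` in the examples.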
108
  Training Methodology
109
+ Multi-Layer Fine-tuning Strategy
110
 
111
+ Base Selection: Started with EleutherAI's GPT-Neo 1.3B pre-trained model
112
+ Layer 1 - Conversational: Fine-tuned on dialogue data for instruction-following
113
+ Layer 2 - Code Enhancement: Specialized training on the CodeSearchNet Python dataset
114
+ Quality Assurance: Rigorous filtering for high-quality code-documentation pairs
115
 
116
+ Technical Implementation
117
 
118
+ Distributed Training: 8x AMD MI250X GPUs with proper CPU-GPU affinity
119
+ Batch Configuration: Per-device batch size of 4 with gradient accumulation
120
+ Learning Rate: 5e-6 with cosine annealing schedule
121
+ Sequence Length: 512 tokens maximum
122
+ Epochs: 2 passes over the full dataset
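
The learning-rate entry above names a cosine annealing schedule starting at 5e-6. The card does not give warmup steps, a floor value, or the total step count, so the following is only a sketch of the standard cosine curve under those assumptions (no warmup, decay to `min_lr = 0.0`). `cosine_lr` is a hypothetical name, not the author's training code.

```python
import math


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 5e-6, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`.

    Assumptions (not stated in the model card): no warmup, floor of min_lr.
    """
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 5e-6, passes through half that value at the midpoint, and reaches the floor at the final step.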
123
 
124
+ Performance Metrics
 
125
 
126
+ Training Loss Progression: 0.9556 → 0.4554 (excellent convergence)
127
+ Dataset Coverage: 362,059 Python code examples
128
+ Batches per Epoch: ~11,315
129
+ Model Size: ~5.3GB (2x safetensors files)
130
+ Training Sequence Length: 512 tokens
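
The figure of ~11,315 batches per epoch is consistent with the other numbers in the card if each batch is taken to mean one micro-batch of 4 examples on each of the 8 GPUs. That reading is an assumption. The gradient accumulation factor is not stated, so the number of optimizer steps per epoch would be smaller by that factor. A quick check of the arithmetic:

```python
import math


def batches_per_epoch(num_examples: int, per_device_batch: int, num_devices: int) -> int:
    """Number of global micro-batches needed to see every example once."""
    global_batch = per_device_batch * num_devices
    return math.ceil(num_examples / global_batch)


steps = batches_per_epoch(362_059, per_device_batch=4, num_devices=8)
print(steps)  # 11315
```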
 
 
 
 
 
 
131
 
132
  Limitations
133
 
134
+ Language Focus: Primarily trained on Python code (limited other programming languages)
135
+ Code Complexity: Best performance on functions under 100 lines
136
+ Validation Required: Generated code should be tested before production use
137
+ Knowledge Cutoff: Training data reflects pre-2024 coding practices
138
+ Context Window: Fine-tuned on sequences of up to 512 tokens, although the architecture accepts 2,048
139
 
140
  Ethical Considerations
141
 
142
+ Code Review: All generated code should be reviewed for security and correctness
143
+ Bias Awareness: May reflect biases present in training data
144
+ Responsible Use: Not intended for malicious code generation
145
+ Attribution: Based on open-source datasets and models
146
+
147
+ Technical Specifications
148
+
149
+ Model Type: Causal Language Model (GPT-Neo architecture)
150
+ Parameters: 1.3 billion
151
+ Vocabulary Size: 50,257 tokens
152
+ Hidden Size: 2,048
153
+ Attention Heads: 16
154
+ Layers: 24
155
+ Context Length: 2,048 tokens (training used 512)
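
The specifications above can be cross-checked against the stated 1.3 billion parameters and the ~5.3GB model size reported under Performance Metrics. The sketch below uses the common estimate of 12·d² weights per transformer layer (attention plus an MLP with 4x expansion) and adds the token and position embeddings. It assumes tied input and output embeddings and fp32 storage, and it ignores biases and layer-norm weights, so it is an approximation rather than an exact count. `approx_gpt_params` is a hypothetical helper.

```python
def approx_gpt_params(vocab: int, hidden: int, layers: int, context: int) -> int:
    """Rough parameter count for a GPT-style decoder (biases and norms ignored)."""
    embeddings = vocab * hidden + context * hidden  # token + position tables
    per_layer = 12 * hidden * hidden  # 4*d^2 attention + 8*d^2 MLP (4x expansion)
    return embeddings + layers * per_layer


params = approx_gpt_params(vocab=50_257, hidden=2_048, layers=24, context=2_048)
size_gb = params * 4 / 1e9  # 4 bytes per weight, assuming fp32 checkpoints

print(f"{params:,} parameters, about {size_gb:.2f} GB in fp32")
```

Both results land close to the figures in the card, which suggests the checkpoint is stored in 32-bit precision.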
156
 
157
  Citation
158
  @misc{gpt-neo-code-conversation-2025,
 
160
  author={Raimonds Krauklis},
161
  year={2025},
162
  howpublished={Hugging Face Model Hub},
163
+ url={https://huggingface.co/raimondskrauklis/gpt-neo-1.3b-code-conversation},
164
+ note={Fine-tuned on European HPC infrastructure using CodeSearchNet dataset}
165
  }
166
  Acknowledgments
167
+
168
+ Base Model: EleutherAI for GPT-Neo 1.3B
169
+ Dataset: CodeSearchNet by GitHub/Microsoft Research
170
+ Infrastructure: European high-performance computing systems
171
+ Framework: Hugging Face Transformers and PyTorch ecosystem
172
+
173
+ Model Card Contact
174
+ For questions about this model, please open an issue in the model repository or contact through Hugging Face.