kssrikar4 committed · Commit 001ef8a · verified · 1 Parent(s): cd4aad1

Update README.md

# Fine-Tuning LLaMA Model for Instruction-Based Tasks

This repository contains the code and setup for fine-tuning the pre-trained **LLaMA 3.2-1B** model on various datasets, focusing on instruction-following and conversational tasks.

---

## **Model Description**

This model is a fine-tuned version of **LLaMA 3.2-1B**, a causal language model designed for generating coherent and contextually relevant text. Fine-tuning adapts the model to instruction-based tasks using datasets curated for that purpose.

Key features:
- Utilizes state-of-the-art pre-trained weights from the Hugging Face Hub.
- Configured to handle instruction-based prompts efficiently.
- Fine-tuned for conversational and general-purpose tasks.

---

## **Intended Uses and Limitations**

### **Intended Uses**:
- Instruction-based question answering.
- Conversational AI applications.
- Text generation for general-purpose use cases.

### **Limitations**:
- May not perform well on domain-specific queries outside its training data.
- Could generate inaccurate or misleading responses for ambiguous or poorly framed prompts.
- Requires adequate computational resources (e.g., GPUs) for fine-tuning and inference.

---

## **Training and Evaluation Data**

### **Datasets Used**
The following datasets are used for training:
- **`fka/awesome-chatgpt-prompts`**: A collection of diverse prompts for chat-based models.
- **`BAAI/Infinity-Instruct`** (configuration: `3M`): Instruction-based datasets.
- **`allenai/WildChat-1M`**: Conversations and queries for language modeling.
- **`lavita/ChatDoctor-HealthCareMagic-100k`**: Medical instructions and conversational data.
- **`zjunlp/Mol-Instructions`**: Molecular and chemical-based instructions.
- **`garage-bAInd/Open-Platypus`**: General-purpose instruction datasets.

### **Preprocessing**
- Input prompts and responses are tokenized with padding and truncation (maximum length: 512 tokens).
- Labels are created by cloning the input IDs and masking padding tokens with `-100` so that they are ignored during loss computation.
- Datasets without suitable columns (`prompt`) are skipped.
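
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `pad_and_mask`, `has_prompt_column`, the toy token IDs, and `pad_token_id=0` are hypothetical, and the real script presumably relies on the tokenizer's own padding and truncation. The value `-100` is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def has_prompt_column(column_names, required="prompt"):
    """Return True when a dataset exposes the column the pipeline expects."""
    return required in column_names


def pad_and_mask(token_ids, max_length=512, pad_token_id=0):
    """Truncate or pad token IDs to max_length and build masked labels.

    pad_token_id=0 is a placeholder; the README uses the EOS token for padding.
    """
    ids = list(token_ids)[:max_length]  # truncation
    n_pad = max_length - len(ids)       # number of padding positions
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Labels copy the input IDs; padding positions are masked out.
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


example = pad_and_mask([101, 7, 42], max_length=6)
print(example["labels"])                      # [101, 7, 42, -100, -100, -100]
print(has_prompt_column(["act", "prompt"]))   # True
```

The sketch masks by position instead of comparing each ID with the pad token. When the pad token is the same as the EOS token, masking by ID would also hide genuine end-of-sequence tokens from the loss; whether the original script does this is not stated here.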

---

## **Training Procedure**

### **Setup**
1. **Model and Tokenizer**:
   - The pre-trained model and tokenizer are loaded from the Hugging Face Hub (`meta-llama/Llama-3.2-1B`).
   - The tokenizer falls back to the `eos_token` as its padding token when no padding token is defined.

2. **Training Configuration**:
   - Configured using Hugging Face's `TrainingArguments`:
     - **Output Directory**: `llama_output` for saving checkpoints and logs.
     - **Epochs**: 4 epochs for balanced training time and generalization.
     - **Batch Size**: 4 examples per device.
     - **Gradient Accumulation**: 4 steps for a larger effective batch size.
     - **Learning Rate**: 1e-4 with a warmup of 500 steps.
     - **Weight Decay**: 0.01 to reduce overfitting.
     - **Mixed Precision**: FP16 for faster training and reduced memory usage.
     - **Logging**: Logs are generated every 10 steps.
     - **Push to Hub**: Trained model is uploaded to the Hugging Face Hub (`kssrikar4/Intellecta`).

3. **Data Collation**:
   - Uses `DataCollatorForSeq2Seq` to dynamically pad batches during training.
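
As a rough sketch, the configuration above maps onto keyword arguments of `transformers.TrainingArguments` as follows. The dictionary keys are real `TrainingArguments` parameter names and the values come from this README; the helper functions and the 10,000-example dataset size are hypothetical and only illustrate the batch-size arithmetic.

```python
import math

# Keyword names follow transformers.TrainingArguments; values come from this README.
training_kwargs = {
    "output_dir": "llama_output",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "fp16": True,
    "logging_steps": 10,
    "save_strategy": "epoch",  # checkpoint at the end of each epoch
    "push_to_hub": True,
    "hub_model_id": "kssrikar4/Intellecta",
}


def effective_batch_size(kwargs, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)


def optimizer_steps(num_examples, kwargs, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    per_epoch = math.ceil(num_examples / effective_batch_size(kwargs, num_devices))
    return per_epoch * kwargs["num_train_epochs"]


print(effective_batch_size(training_kwargs))     # 16
print(optimizer_steps(10_000, training_kwargs))  # 2500 (hypothetical dataset size)

# With transformers installed and a GPU available (FP16 generally needs one):
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```

On a single device, 4 examples per step times 4 accumulation steps gives an effective batch size of 16.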

### **Fine-Tuning Process**
- The Hugging Face `Trainer` class orchestrates the training:
  - Loss is computed from the model outputs (`logits`) and the prepared labels.
  - Gradients are accumulated and applied using the AdamW optimizer with a linear learning rate schedule.
  - Checkpoints are saved at the end of each epoch.
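
Two pieces of this loop can be illustrated without any dependencies: the next-token loss with masked labels, and a linear schedule with warmup. This is a simplified sketch of what the `Trainer` and the model do internally, not their actual implementation; `causal_lm_loss`, `linear_warmup_lr`, the toy logits, and the 2,500-step run length are hypothetical.

```python
import math

IGNORE_INDEX = -100  # masked label positions are skipped


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy: logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padding position, contributes nothing
        peak = max(step_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


def linear_warmup_lr(step, total_steps, base_lr=1e-4, warmup_steps=500):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Uniform logits over a 4-token vocabulary give a loss of ln(4).
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_labels = [1, 2, IGNORE_INDEX]
print(round(causal_lm_loss(toy_logits, toy_labels), 4))  # 1.3863

print(linear_warmup_lr(250, 2500))   # 5e-05, halfway through warmup
print(linear_warmup_lr(500, 2500))   # 0.0001, peak learning rate
print(linear_warmup_lr(2500, 2500))  # 0.0, end of training
```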

---

## **Post-Training**

1. **Model Upload**:
   - The fine-tuned model, tokenizer, and configuration are pushed to the Hugging Face Hub under the ID `kssrikar4/Intellecta`.

2. **Inference**:
   - The model can be directly downloaded and used for inference or further fine-tuning using the Hugging Face Transformers library.

---

## **Requirements**

### **Dependencies**:
- Python 3.8+
- Hugging Face Transformers
- Datasets
- Torch (with GPU support for fine-tuning)

### **Installation**:
```bash
pip install transformers datasets torch
```

---

## **Usage**

### **Fine-Tuning**
Run the `main` function in the provided Python script:
```bash
python llama.py
```

Ensure that the `config.json` file is present and contains your Hugging Face authentication token:
```json
{
  "hf_token": "your_huggingface_token"
}
```
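
A minimal sketch of how a script might read this file is shown here. `load_hf_token` is a hypothetical helper, not necessarily how `llama.py` does it; the only assumption about the file is the `hf_token` key shown above.

```python
import json
from pathlib import Path


def load_hf_token(path="config.json"):
    """Read the Hugging Face token from a JSON config file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    token = config.get("hf_token", "")
    # Reject a missing token or the unedited placeholder value.
    if not token or token == "your_huggingface_token":
        raise ValueError(f"Set a real 'hf_token' in {path}")
    return token


# Typical use (requires the huggingface_hub package):
# from huggingface_hub import login
# login(token=load_hf_token())
```

Keeping the token in a separate file avoids hard-coding it in the training script; the file should stay out of version control.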

### **Inference**
Use the fine-tuned model for generating text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kssrikar4/Intellecta")
model = AutoModelForCausalLM.from_pretrained("kssrikar4/Intellecta")

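# Tokenize the prompt and return PyTorch tensors.
inputs = tokenizer("Explain quantum physics in simple terms.", return_tensors="pt")
# Cap the response length explicitly; the default generation length is short.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))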
```

---

## **License**
This project uses the Apache 2.0 license. Ensure compliance with dataset and model licensing requirements when using this project.

---

Feel free to raise an issue or submit a pull request if you have suggestions or find any bugs!

Files changed (1)
1. README.md +153 -57
@@ -1,57 +1,153 @@
- ---
- library_name: transformers
- license: llama3.2
- base_model: meta-llama/Llama-3.2-1B
- tags:
- - generated_from_trainer
- model-index:
- - name: Intellecta
- results: []
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # Intellecta
-
- This model is a fine-tuned version of [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) on an unknown dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 4
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 16
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 4
- - mixed_precision_training: Native AMP
-
- ### Training results
-
-
-
- ### Framework versions
-
- - Transformers 4.48.0
- - Pytorch 2.5.1+cpu
- - Datasets 3.2.0
- - Tokenizers 0.21.0
+ ---
+ library_name: transformers
+ license: llama3.2
+ base_model: meta-llama/Llama-3.2-1B
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: Intellecta
+ results: []
+ datasets:
+ - fka/awesome-chatgpt-prompts
+ - BAAI/Infinity-Instruct
+ - allenai/WildChat-1M
+ - lavita/ChatDoctor-HealthCareMagic-100k
+ - zjunlp/Mol-Instructions
+ - garage-bAInd/Open-Platypus
+ language:
+ - en
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # Intellecta
+
+ This model is a fine-tuned version of [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) on an unknown dataset.
+
+ ## Model description
+
+ The model is based on LLaMA (Large Language Model Meta AI), a family of state-of-the-art language models developed for natural language understanding and generation. This specific implementation uses the LLaMA 3.2-1B model, which is fine-tuned for general-purpose conversational AI tasks.
+
+ Architecture: Transformer-based causal language model.
+ Tokenization: Uses the AutoTokenizer compatible with the LLaMA model, with adjustments to ensure proper padding.
+ Pre-trained Foundation: The model builds on the pre-trained weights of LLaMA, focusing on improving performance for conversational and instruction-based tasks.
+ Implementation: Developed with Hugging Face’s Transformers library for extensibility and ease of use.
+
+ ## Intended uses & limitations
+
+ Intended Uses
+ Instruction-following tasks: Can perform tasks such as answering questions, summarizing, and text generation.
+ Conversational agents: Suitable for chatbots and virtual assistants, including those in specialized domains like healthcare or education.
+ Research and Development: Fine-tuning and benchmarking against datasets for downstream tasks.
+
+ ## Training and evaluation data
+
+ Datasets Used
+ fka/awesome-chatgpt-prompts: General-purpose instruction-following and conversational dataset based on GPT-like interactions.
+ BAAI/Infinity-Instruct (3M): A large instruction dataset containing a wide variety of tasks and instructions.
+ allenai/WildChat-1M: Focused on open-ended conversational data.
+ lavita/ChatDoctor-HealthCareMagic-100k: Healthcare-specific dataset for medical conversational agents.
+ zjunlp/Mol-Instructions: Molecular biology-related instructions.
+ garage-bAInd/Open-Platypus: Dataset aimed at general-purpose, open-domain reasoning.
+ Data Preprocessing
+ Text prompts and responses are tokenized with padding and truncation.
+ Labels are derived from input tokens, masking padding tokens with -100 to exclude them from loss computation.
+
+ ## Training procedure
+ The training procedure for the model fine-tunes the pre-trained LLaMA 3.2-1B model on various datasets with a focus on instruction-following and conversational tasks. Below are the key aspects of the training process:
+
+ 1. Preprocessing
+ Tokenization:
+
+ The input prompts and their responses are tokenized using the AutoTokenizer configured for LLaMA.
+ Special considerations:
+ Padding tokens are explicitly handled using the pad_token (set to the eos_token if undefined).
+ Inputs are truncated to a maximum length of 512 tokens to fit model constraints.
+ Label Preparation:
+
+ Input IDs are cloned to create labels for supervised learning.
+ Padding tokens in labels are masked with -100 to ensure they are ignored during loss computation.
+ Dataset Mapping:
+
+ Each dataset's prompt field is tokenized and reformatted into the model’s required input-output structure.
+ Non-standard datasets without a prompt column are skipped to avoid errors.
+
+ 2. Model Setup
+ Pre-trained Model:
+
+ The base model, meta-llama/Llama-3.2-1B, is loaded with pre-trained weights.
+ It is fine-tuned for causal language modeling, focusing on instruction-based outputs.
+ Tokenizer Setup:
+
+ The tokenizer ensures consistency in encoding and decoding for the model.
+ Padding is fixed (using eos_token as a fallback).
+
+ 3. Training Configuration
+ TrainingArguments:
+
+ The Hugging Face TrainingArguments object is used to configure the training process:
+ Output Directory: llama_output stores the model checkpoints and logs.
+ Epochs: 4 epochs for a balance between training time and generalization.
+ Batch Size: 4 examples per device to handle memory constraints.
+ Gradient Accumulation: 4 steps to simulate a larger effective batch size.
+ Learning Rate: 1e-4 with a warmup phase of 500 steps for stable optimization.
+ Weight Decay: 0.01 to mitigate overfitting.
+ Mixed Precision: FP16 (half-precision) is used for faster training and reduced memory usage.
+ Logging Steps: Logs are generated every 10 steps to monitor training progress.
+ Checkpointing: Model checkpoints are saved at the end of each epoch.
+ Push to Hub: The fine-tuned model is uploaded to Hugging Face’s Hub (kssrikar4/Intellecta).
+ Data Collator:
+
+ The DataCollatorForSeq2Seq ensures that batches are dynamically padded for efficiency during training.
+
+ 4. Fine-Tuning Process
+ Trainer:
+
+ The Hugging Face Trainer class orchestrates the training process, combining the model, data, and training configuration.
+ Loss is computed for each batch using the model's outputs (e.g., logits) and the prepared labels.
+ The optimizer and learning rate scheduler are managed internally by the Trainer.
+ Training Loop:
+
+ During each epoch:
+ The model processes batches of tokenized prompts and computes the causal language modeling (CLM) loss.
+ Gradients are accumulated over multiple steps to simulate a larger batch size.
+ Optimizer updates are applied after gradient accumulation.
+ Validation:
+
+ While validation data is not explicitly defined in the code, the Trainer supports evaluation if an eval_dataset is provided.
+ Saving checkpoints at each epoch allows model evaluation post-training.
+ 5. Post-Training
+ Push to Hub:
+
+ The trained model, along with its tokenizer and configuration, is pushed to the Hugging Face Hub under the ID kssrikar4/Intellecta.
+ Usage:
+
+ The fine-tuned model can be downloaded and directly used for inference or further fine-tuning.
+
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0001
+ - train_batch_size: 4
+ - eval_batch_size: 8
+ - seed: 42
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 16
+ - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 4
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+
+
+ ### Framework versions
+
+ - Transformers 4.48.0
+ - Pytorch 2.5.1+cpu
+ - Datasets 3.2.0
+ - Tokenizers 0.21.0