---
title: R1-Distill-LLama-8b Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.17.0"
app_file: app.py
pinned: false
license: mit
---
# DeepSeek R1-Distill-Llama-8B Training

This Space is dedicated to training the DeepSeek-R1-Distill-Llama-8B model for cognitive science research. The training process uses advanced optimizations and efficient data-processing techniques.

## Features

- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
## Training Process

The training uses:

- Custom data processing pipeline
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
## Hardware Requirements

- GPU: NVIDIA L4 or better
- VRAM: 24 GB minimum
- RAM: 32 GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference.
# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
## Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Configuration parameters for training
## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset (see the sketch after this list)
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses 4-bit quantization and LoRA for memory-efficient training
4. **Checkpointing**: Saves regular checkpoints so training can resume if interrupted
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)
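Steps 1 and 3 might look roughly like the sketch below. The dataset and model names come from `transformers_config.json`; the 4-bit settings (`nf4`, bfloat16 compute) and `device_map="auto"` are illustrative assumptions, not values taken from this repository.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Step 1: load the pre-tokenized dataset (name from transformers_config.json)
dataset = load_dataset("George-API/cognitive-data", split="train")

# Step 3: load the base model in 4-bit for memory-efficient training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
```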
## Key Features

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

- Sorting the dataset by ID
- Using a `SequentialSampler` to maintain order
- Overriding the default DataLoader to disable shuffling (sketched below)
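A minimal sketch of that DataLoader override, assuming a `Trainer` subclass; the class name and exact arguments are illustrative, and the real implementation lives in `run_transformers_training.py`:

```python
from torch.utils.data import DataLoader, SequentialSampler
from transformers import Trainer

class SequentialTrainer(Trainer):
    """Trainer variant that keeps chunks from the same paper in order."""

    def get_train_dataloader(self) -> DataLoader:
        # Replace the default RandomSampler with a SequentialSampler so the
        # dataset (already sorted by ID) is never shuffled during training.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=SequentialSampler(self.train_dataset),
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
        )
```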
### Data Collator

The `SimpleDataCollator` class:

- Preserves the pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
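A rough sketch of what such a collator might look like; the field names (`input_ids`), the statistics tracked, and the assumption that the tokenizer has a pad token are all illustrative, not taken from the actual class:

```python
import logging

logger = logging.getLogger(__name__)

class SimpleDataCollator:
    """Batches pre-tokenized examples while tracking simple statistics."""

    def __init__(self, tokenizer, max_seq_length=2048):
        self.tokenizer = tokenizer  # assumes tokenizer.pad_token is set
        self.max_seq_length = max_seq_length
        self.stats = {"processed": 0, "skipped": 0}

    def __call__(self, features):
        input_ids = []
        for feature in features:
            try:
                # Data is already tokenized; just truncate to the configured length
                input_ids.append(feature["input_ids"][: self.max_seq_length])
                self.stats["processed"] += 1
            except (KeyError, TypeError) as err:
                # Skip malformed entries instead of aborting the whole run
                self.stats["skipped"] += 1
                logger.warning("Skipping malformed example: %s", err)
        batch = self.tokenizer.pad({"input_ids": input_ids}, return_tensors="pt")
        batch["labels"] = batch["input_ids"].clone()  # causal LM objective
        return batch
```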
### Checkpointing

The training process:

- Saves a checkpoint every 100 steps (configurable)
- Automatically resumes from the latest checkpoint if interrupted
- Keeps at most 3 recent checkpoints
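In `transformers`' `TrainingArguments`, these checkpointing settings correspond to something like the following; the values mirror the description above and `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",  # placeholder path
    save_strategy="steps",
    save_steps=100,              # checkpoint every 100 steps (configurable)
    save_total_limit=3,          # keep only the 3 most recent checkpoints
)

# Resuming after an interruption picks up the newest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```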
## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
- `dataset_name`: `George-API/cognitive-data`
- `learning_rate`: 3e-5
- `num_train_epochs`: 5
- `per_device_train_batch_size`: 4
- `gradient_accumulation_steps`: 8
- `max_seq_length`: 2048
- `push_to_hub`: true
- `hub_model_id`: "DeepSeek-Cognitive-Science"
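Assembled from the parameters listed above, the relevant portion of `transformers_config.json` looks roughly like this; the real file may contain additional fields:

```json
{
  "model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "dataset_name": "George-API/cognitive-data",
  "learning_rate": 3e-5,
  "num_train_epochs": 5,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 2048,
  "push_to_hub": true,
  "hub_model_id": "DeepSeek-Cognitive-Science"
}
```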
## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```
The script will:

1. Load the dataset and model
2. Configure LoRA adapters (see the sketch after this list)
3. Process the data sequentially
4. Train the model for the specified number of epochs
5. Save the resulting model and push it to the Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)
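Step 2 (configuring LoRA adapters) typically uses the `peft` library. A hedged sketch follows; the rank, alpha, dropout, and target modules are illustrative assumptions rather than this repository's actual values:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # assumed rank
    lora_alpha=32,     # assumed scaling factor
    lora_dropout=0.05, # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm only adapter weights are trainable
```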
## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/DeepSeek-Cognitive-Science"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on cognitive science tasks
- Be ready for supervised fine-tuning in Phase 2
## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning) using the [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science) model
3. Use TensorBoard to analyze training metrics