GPT-5-Distill-llama3.1-8B-Instruct

Unsloth Llama-3 Distillation

Model Summary

GPT-5-Distill-llama3.1-8B-Instruct is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct, designed to distill the capabilities of high-performance models (labeled as GPT-5 in the source datasets) into a more efficient 8B-parameter footprint.

This model was trained with Unsloth on a curated mix of approximately 164,000 high-quality instruction-response pairs, focusing on complex reasoning and on responses whose flaw label is "normal" (i.e., free of detected flaws).

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Architecture: Llama 3.1 (8B parameters)
  • Language: English (Primary)
  • Context Window: 32,768 tokens
  • Fine-tuning Framework: Unsloth (QLoRA)
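
For quick testing, a minimal inference sketch using the Hugging Face transformers API is shown below. It assumes the repo id Jackrong/GPT-5-Distill-llama3.1-8B-Instruct and a CUDA GPU with enough memory for the BF16 weights; the generation parameters are illustrative, not recommended settings.

```python
# Minimal inference sketch (repo id and generation settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Jackrong/GPT-5-Distill-llama3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 checkpoint
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain knowledge distillation in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```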

✨ Key Advantages of GPT-5 Distillation

This model represents a shift towards "Super-Knowledge Distillation", where a smaller, efficient student model learns from a significantly more capable teacher.

  • 🚀 Frontier-Level Reasoning: By training on dataset samples attributed to GPT-5, the model acquires complex reasoning patterns, nuance, and problem-solving strategies that are typically absent from standard datasets or smaller models.
  • ⚡ Efficient Intelligence: Users get high-fidelity, coherent, and detailed responses on consumer hardware (e.g., a single GPU) without the latency, privacy concerns, or cost of querying giant proprietary APIs.
  • 💎 High-Purity Signal: Strict filtering for flaw == "normal" ensures the model is fine-tuned only on the highest-confidence, error-free responses (a minimal filtering sketch follows this list). This minimizes "hallucination inheritance" and aligns the model with safe, helpful behavior.
  • 🎯 Enhanced Nuance & Tone: Unlike standard fine-tunes, which often sound robotic, this model mimics the more natural, conversational, and adaptive tone found in next-generation frontier models.
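
The flaw-level filtering mentioned above can be reproduced with the datasets library. The sketch below is a minimal example; the dataset repo id and the flaw column name are assumptions based on this card, not a verified schema.

```python
# Sketch of the flaw == "normal" filtering step (repo id and column name are assumed).
from datasets import load_dataset

raw = load_dataset("Jackrong/Chat-GPT-5-Chat-Response", split="train")  # hypothetical repo id
clean = raw.filter(lambda ex: ex["flaw"] == "normal")  # keep only flaw-free responses
print(f"Kept {len(clean)} of {len(raw)} samples")
```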

📚 Training Data

The model was trained on a high-quality blend of two datasets, totaling 163,896 samples:

  1. Chat-GPT-5-Chat-Response (160k samples)
    • Filtered specifically for normal entries to ensure high-quality, safe, and coherent responses.
    • This dataset serves as the primary distillation source, aiming to mimic the response patterns of advanced large language models.
  2. ShareGPT-Qwen3-235B-A22B-Instuct-2507 (3.9k samples)
    • "This dataset consists of approximately 3.9k examples, with an average of about 5 rounds of dialogue per scenario, designed to enhance the model’s instruction-following ability and task-completion efficiency.

All data was formatted using the standard Llama-3 Chat Template.
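
The sketch below shows how a single instruction-response pair can be rendered with the Llama-3 chat template via the base model's tokenizer; the example record and its field names are illustrative, not the actual dataset schema.

```python
# Sketch: rendering one pair with the Llama-3 chat template (example record is illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

record = {
    "instruction": "Summarize the benefits of knowledge distillation.",
    "response": "Distillation transfers a large model's behavior into a smaller, cheaper model.",
}

text = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": record["instruction"]},
        {"role": "assistant", "content": record["response"]},
    ],
    tokenize=False,  # return the templated string, including <|start_header_id|> markers
)
print(text)
```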

βš™οΈ Training Details

  • Hardware: NVIDIA H100
  • Sequence Length: 32,768 tokens (Long Context Support)
  • Batch Size: 4 per device (Effective Batch Size: 32 via Gradient Accumulation)
  • Learning Rate: 2e-5
  • Scheduler: Linear
  • Optimizer: AdamW 8-bit
  • LoRA Rank (r): 32
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
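
As a rough guide, the sketch below shows how these hyperparameters map onto an Unsloth + TRL QLoRA run. It is a minimal reconstruction, not the author's actual training script: the placeholder dataset, epoch count, and gradient-accumulation value (assuming a single GPU) are assumptions.

```python
# QLoRA sketch with Unsloth reflecting the hyperparameters above (not the original script).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Placeholder dataset; real training used texts already rendered with the Llama-3 chat template.
train_dataset = Dataset.from_dict({"text": ["<templated conversation 1>", "<templated conversation 2>"]})

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=32768,
    load_in_4bit=True,  # QLoRA: 4-bit quantized base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=32768,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,  # 4 x 8 = effective batch size 32 on one GPU (assumption)
        learning_rate=2e-5,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        bf16=True,
        num_train_epochs=1,  # epoch count not stated on this card; assumption
        output_dir="outputs",
    ),
)
trainer.train()
```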

πŸ›‘οΈ License & Limitations

  • License: This model is subject to the Llama 3.1 Community License.
  • Limitations: While this model is distilled from high-capability sources, it is still an 8B parameter model. It may hallucinate facts or struggle with extremely complex reasoning tasks compared to the original teacher models. The "GPT-5" naming refers to the source dataset labels and does not imply access to unreleased OpenAI weights.