Model Card

Model Description

This is a preference-tuned version of DeepSeek-R1-Distill-Llama-8B, optimized for telecom-related queries. The model has been fine-tuned to provide concise, factual answers while role-playing as a customer service agent and incorporating user-specific preferences.

  • Developed by: Mohamed Abdulaziz

  • Model type: Fine-tuned DeepSeek-R1-Distill-Llama-8B

  • Frameworks used: Unsloth for fine-tuning and Weights & Biases (wandb) for training monitoring

  • Preference Tuning: The model has been additionally fine-tuned using Direct Preference Optimization (DPO) to align with user preferences, enhancing response quality and relevance.

Benefits of Preference Tuning with DPO

Direct Preference Optimization (DPO) improves model performance by directly optimizing for user-preferred responses without requiring reinforcement learning from human feedback (RLHF).
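
For reference, DPO minimizes the standard preference loss of Rafailov et al. (2023), which trains the policy to rank the chosen response y_w above the rejected response y_l relative to a frozen reference model:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here β controls how strongly the tuned policy may drift from the reference model; the specific β used for this model is not reported in this card.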

Advantages of DPO in this Model:

  • Higher Quality Responses: The model generates more accurate and contextually relevant answers by aligning with preferred outputs.
  • Better User Experience: By fine-tuning responses based on preferred interactions, the model enhances satisfaction and usability.
  • Efficient Fine-Tuning: DPO allows preference alignment with reduced computational overhead compared to RLHF, making it more scalable and adaptable.
  • Improved Robustness: The model avoids inconsistencies by learning directly from preference pairs rather than from a separately trained reward model.

Uses

This model is designed for customer support automation in the telecom industry. It assists in:

  • Answering common user queries about 5G, network issues, billing, and services.
  • Providing concise and factually correct responses.
  • Reducing workload on human support agents by handling routine inquiries.
  • Adapting to user-specific preferences to improve interaction quality.

Who can use this model?

  • Telecom companies: Automate customer service via chatbots.
  • Developers & researchers: Fine-tune and adapt for different use cases.
  • Call centers: Support agents in handling user requests efficiently.

Who might be affected?

  • End-users interacting with telecom chatbots.
  • Support agents using AI-assisted tools.
  • Developers & data scientists fine-tuning and deploying the model.

How to Get Started with the Model

1️⃣ Import necessary libraries

import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

2️⃣ Define model path

model_path = "moo100/DeepSeek-R1-telecom-chatbot-v2"

3️⃣ Load the model and tokenizer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=1024,  
    dtype=None,
    device_map="auto",
    load_in_4bit=True,
    trust_remote_code=True
)

4️⃣ Optimize model for fast inference with Unsloth

model = FastLanguageModel.for_inference(model)

5️⃣ Pick the device for input tensors

device = "cuda" if torch.cuda.is_available() else "cpu"
# Note: the 4-bit model was already dispatched to the GPU by device_map="auto" during loading,
# so calling model.to(device) is unnecessary here (and moving quantized weights manually can fail);
# `device` is only needed for the tokenized inputs below.

6️⃣ Define system instruction to guide model responses

system_instruction = """You are an AI assistant. Answer user questions concisely and factually. 
Do NOT add extra details unless necessary."""

7️⃣ Define user input (Replace with any query)

user_input = "Hello"

8️⃣ Construct full prompt with instructions and user query

full_prompt = f"{system_instruction}\n\nUser: {user_input}\nAssistant:"

9️⃣ Tokenize input prompt

inputs = tokenizer(full_prompt, return_tensors="pt").to(device)

🔟 Generate model response with controlled stopping criteria

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,  
    do_sample=True,  
    temperature=0.5,  
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,  
)

1️⃣1️⃣ Decode and extract only the newly generated response

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

1️⃣2️⃣ Print the AI-generated response

print(response.split("User: ")[0])
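
For repeated queries, steps 7 through 12 can be wrapped in a small helper that reuses the model, tokenizer, system instruction, and device loaded above. The ask function below is only a convenience sketch, not part of the released model.

def ask(user_input, max_new_tokens=100):
    """Build the prompt, generate, and return only the newly generated assistant text."""
    full_prompt = f"{system_instruction}\n\nUser: {user_input}\nAssistant:"
    inputs = tokenizer(full_prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.5,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
    )
    new_tokens = outputs[0][inputs.input_ids.shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Keep only the assistant's reply, trimming any continuation where the model starts a new "User:" turn
    return response.split("User: ")[0].strip()

print(ask("What is 5G?"))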

Training Details

Training Data

A preference dataset was created using Phi-4, with the chosen responses generated via evaluation prompts. Fine-tuning on this dataset aligns the model's responses closely with user preferences.
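
For context, DPO training expects preference triples. The snippet below sketches the expected layout, assuming the conventional prompt / chosen / rejected column names used by DPO tooling; the card only confirms a chosen column, and the example row is purely illustrative.

from datasets import Dataset

# Hypothetical example row; the real dataset contains telecom support exchanges.
preference_dataset = Dataset.from_dict({
    "prompt": ["Why is my 5G connection slow indoors?"],
    "chosen": ["Higher-frequency 5G signals penetrate walls poorly, so indoor speeds can drop. "
               "Moving closer to a window or enabling Wi-Fi calling usually helps."],
    "rejected": ["5G is always slow indoors and there is nothing you can do about it."],
})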

Training Procedure

  • Training Loss: Decreased steadily from 0.6997 at step 10 to 0.0966 at step 360, indicating stable model convergence.
  • Rewards / Chosen: Improved consistently, reaching 1.4318, showcasing enhanced preference alignment.
  • Rewards / Rejected: Decreased over time, confirming the model's ability to distinguish between optimal and suboptimal responses.
  • Logits / Chosen & Rejected: Demonstrated increasing separation, ensuring stronger differentiation between desirable and undesirable outputs.
  • Gradient Norm & Learning Rate Schedule: Maintained stability with linear decay, preventing sudden fluctuations and ensuring smooth convergence.
  • Preference Alignment: Fine-tuned using DPO, incorporating preference feedback to optimize responses; a minimal training sketch follows below.
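
The sketch below shows how such a DPO run can be set up with trl on top of the Unsloth-loaded model. The hyperparameter values and output path are illustrative, and argument names can differ between trl versions (for example, newer releases use processing_class in place of tokenizer), so treat this as an outline rather than the exact training script.

from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model=model,                       # the Unsloth-loaded model
    ref_model=None,                    # trl falls back to a frozen reference model when None is given
    args=DPOConfig(
        output_dir="dpo-telecom",      # illustrative path
        per_device_train_batch_size=2, # illustrative value
        learning_rate=5e-6,            # illustrative value
        beta=0.1,                      # assumed preference-strength hyperparameter; not reported in this card
        report_to="wandb",             # the card mentions wandb for monitoring
    ),
    train_dataset=preference_dataset,  # prompt/chosen/rejected dataset sketched above
    tokenizer=tokenizer,
)
trainer.train()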

Evaluation

Methodology

The chatbot was evaluated using Meta-Llama-3.3-70B-Instruct as an LLM judge, assessing the relevance, correctness, and fluency of its responses.
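
The exact judging prompt is not included in this card. The snippet below is only an illustration, assuming a simple LLM-as-a-judge setup in which Meta-Llama-3.3-70B-Instruct is asked to score each criterion from 1 to 10; the variable names and placeholder text are hypothetical.

# Illustrative judge prompt; fill in a real exchange before sending it to the judge model.
user_message = "Hello"
chatbot_response = response  # e.g. the reply produced in the generation example above

judge_prompt = f"""You are evaluating a telecom customer-support chatbot.

User message: {user_message}
Chatbot response: {chatbot_response}

Score the response from 1 to 10 for each criterion below and briefly justify each score:
1. Relevance
2. Correctness
3. Fluency
"""
# Send judge_prompt to Meta-Llama-3.3-70B-Instruct through your preferred serving stack
# and parse the three scores from its reply.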

Results

Meta-Llama-3.3-70B-Instruct Evaluation:

  • Relevance: 10. The chatbot's response is highly relevant to the user's greeting, as it acknowledges the greeting and opens the conversation for further discussion.

  • Correctness: 10. The chatbot's response is grammatically correct and uses proper language, making it an appropriate reply to the user's greeting.

  • Fluency: 10. The chatbot's response is fluent and natural-sounding, with a friendly and approachable tone that invites the user to engage in conversation.

Overall, the chatbot's response is a great example of a well-crafted greeting response that effectively sets the tone for a helpful and engaging conversation.
