---
library_name: transformers
tags: []
---

# Model Card


## Model Description

This is a **preference-tuned** version of **DeepSeek-R1-Distill-Llama-8B**, optimized for **telecom-related queries**. The model has been fine-tuned to provide **concise and factual answers** and to **act as a customer service agent** that incorporates user-specific preferences.

- **Developed by:** Mohamed Abdulaziz
- **Model type:** Fine-tuned DeepSeek-R1-Distill-Llama-8B
- **Frameworks used:** Unsloth for fine-tuning and Weights & Biases (wandb) for performance monitoring
- **Preference tuning:** The model has additionally been fine-tuned with **Direct Preference Optimization (DPO)** to align with user preferences, enhancing response quality and relevance.

## Benefits of Preference Tuning with DPO

Direct Preference Optimization (DPO) improves model performance by directly optimizing the model on user-preferred responses, without requiring reinforcement learning from human feedback (RLHF).
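
For reference, the standard DPO objective (Rafailov et al., 2023) is trained directly on preference pairs, using the frozen starting model as a reference policy:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

Here $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how strongly the policy is pushed toward the preferred responses.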

### **Advantages of DPO in this Model:**

- **Higher Quality Responses:** The model generates more accurate and contextually relevant answers by aligning with preferred outputs.
- **Better User Experience:** By fine-tuning responses based on preferred interactions, the model enhances satisfaction and usability.
- **Efficient Fine-Tuning:** DPO allows preference alignment with reduced computational overhead compared to RLHF, making it more scalable and adaptable.
- **Improved Robustness:** The model avoids inconsistencies by learning directly from preference pairs rather than from a separately trained reward model.

## Uses

This model is designed for **customer support automation in the telecom industry**. It assists in:

- Answering common user queries about **5G, network issues, billing, and services**.
- Providing **concise and factually correct responses**.
- Reducing **workload on human support agents** by handling routine inquiries.
- **Adapting to user-specific preferences** to improve interaction quality.

### **Who can use this model?**

- **Telecom companies**: Automate customer service via chatbots.
- **Developers & researchers**: Fine-tune and adapt for different use cases.
- **Call centers**: Support agents in handling user requests efficiently.

### **Who might be affected?**

- **End-users** interacting with telecom chatbots.
- **Support agents** using AI-assisted tools.
- **Developers & data scientists** fine-tuning and deploying the model.

## How to Get Started with the Model

### **1️⃣ Import necessary libraries**

```python
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
```

### **2️⃣ Define model path**

```python
model_path = "moo100/DeepSeek-R1-telecom-chatbot-v2"
```

### **3️⃣ Load the model and tokenizer**

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=1024,     # maximum context length used at inference time
    dtype=None,              # auto-detect the best dtype for the available GPU
    device_map="auto",       # place the weights on the available device(s)
    load_in_4bit=True,       # 4-bit quantization to reduce memory usage
    trust_remote_code=True,
)
```
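
If Unsloth is not installed, the checkpoint can also be loaded with plain `transformers`. This is a minimal sketch and assumes the repository contains standalone model weights (rather than only a LoRA adapter, in which case the base model would need to be loaded first and the adapter attached with `peft`):

```python
# Alternative loading path without Unsloth, using plain transformers.
# Assumes the repo holds standalone weights loadable by AutoModelForCausalLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
```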

### **4️⃣ Optimize model for fast inference with Unsloth**

```python
model = FastLanguageModel.for_inference(model)
```

### **5️⃣ Select the device for the input tensors**

```python
# With load_in_4bit=True and device_map="auto", the model weights are already
# placed on the GPU by the loader, so the model itself does not need to be moved.
# The device is recorded here so the tokenized inputs can be sent to it later.
device = "cuda" if torch.cuda.is_available() else "cpu"
```

### **6️⃣ Define system instruction to guide model responses**

```python
system_instruction = """You are an AI assistant. Answer user questions concisely and factually. 
Do NOT add extra details unless necessary."""
```

### **7️⃣ Define user input (Replace with any query)**

```python
user_input = "Hello"
```

### **8️⃣ Construct full prompt with instructions and user query**

```python
full_prompt = f"{system_instruction}\n\nUser: {user_input}\nAssistant:"
```

### **9️⃣ Tokenize input prompt**

```python
inputs = tokenizer(full_prompt, return_tensors="pt").to(device)
```
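
Alternatively, steps 8 and 9 can be replaced by the model's own chat template. This is a hedged sketch: it assumes the tokenizer ships a chat template and that the installed `transformers` version supports `return_dict=True` in `apply_chat_template`:

```python
# Alternative to steps 8-9: build and tokenize the prompt via the chat template.
messages = [
    {"role": "system", "content": system_instruction},
    {"role": "user", "content": user_input},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker before generation
    return_dict=True,            # return input_ids and attention_mask together
    return_tensors="pt",
).to(device)
```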

### **1️⃣0️⃣ Generate model response with controlled stopping criteria**

```python
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,                    # cap the length of the generated reply
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.5,                      # moderate randomness for factual answers
    top_k=50,                             # sample only from the 50 most likely tokens
    eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
    pad_token_id=tokenizer.eos_token_id,  # Llama tokenizers define no pad token
)
```

### **1️⃣1️⃣ Decode and extract only the newly generated response**

```python
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
```

### **1️⃣2️⃣ Print the AI-generated response**

```python
# Trim anything after a hallucinated follow-up "User:" turn and print the reply
print(response.split("User: ")[0])
```



## Training Details

### Training Data

A **preference dataset** was created using **Phi-4**, where the **chosen** column was generated via evaluation prompts. This dataset ensures the model aligns closely with user preferences by fine-tuning its responses accordingly; an illustrative example of the record format expected for DPO is shown below.
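
The dataset itself is not reproduced in this card. As an illustration only, DPO-style trainers such as `trl`'s `DPOTrainer` expect preference records with `prompt`, `chosen`, and `rejected` fields; the example below is hypothetical and not taken from the actual training data:

```python
from datasets import Dataset

# Hypothetical telecom preference records illustrating the prompt/chosen/rejected
# schema expected for DPO; the real Phi-4-generated dataset is not shown here.
preference_records = [
    {
        "prompt": "Why is my 5G signal weak indoors?",
        "chosen": (
            "High-band 5G does not penetrate walls well; moving closer to a window "
            "or enabling Wi-Fi calling usually improves indoor coverage."
        ),
        "rejected": "There is nothing you can do about weak indoor 5G coverage.",
    },
]
preference_dataset = Dataset.from_list(preference_records)
```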

### Training Procedure

- **Training Loss:** Decreased steadily from **0.6997** at step 10 to **0.0966** at step 360, indicating stable model convergence.
- **Rewards / Chosen:** Improved consistently, reaching **1.4318**, showcasing enhanced preference alignment.
- **Rewards / Rejected:** Decreased over time, confirming the model's ability to distinguish between optimal and suboptimal responses.
- **Logits / Chosen & Rejected:** Demonstrated increasing separation, ensuring stronger differentiation between desirable and undesirable outputs.
- **Gradient Norm & Learning Rate Schedule:** Maintained stability with linear decay, preventing sudden fluctuations and ensuring smooth convergence.
- **Preference Alignment:** Fine-tuned using **DPO**, incorporating user feedback to optimize responses effectively; an illustrative training sketch follows below.
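
The exact training script is not published with this card. The sketch below shows the general shape of such a DPO run using the `trl` library with wandb logging; the hyperparameters, output directory, and use of plain `trl` (rather than the exact Unsloth-based pipeline described above) are illustrative assumptions:

```python
# Illustrative DPO training sketch (not the exact script used for this model).
# Assumes a `preference_dataset` with "prompt", "chosen", and "rejected" columns,
# as in the example above, and a recent version of trl.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

config = DPOConfig(
    output_dir="dpo-telecom-chatbot",   # hypothetical output directory
    beta=0.1,                           # preference strength (illustrative value)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="linear",         # linear decay, matching the schedule above
    logging_steps=10,                   # log loss and reward metrics every 10 steps
    report_to="wandb",                  # send training and reward curves to wandb
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                     # trl creates the frozen reference copy
    args=config,
    train_dataset=preference_dataset,
    processing_class=tokenizer,         # named `tokenizer` in older trl releases
)
trainer.train()
```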



## Evaluation

### Methodology

The chatbot was evaluated using Meta-Llama-3.3-70B-Instruct as a judge model, which assessed the relevance, correctness, and fluency of its responses.
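
The exact evaluation prompt is not included in this card. The sketch below shows one plausible way to score a response with a judge model through the Hugging Face `InferenceClient`; the prompt wording and the availability of the judge endpoint are assumptions:

```python
# Illustrative LLM-as-judge evaluation sketch; not the exact prompt or setup
# used for the scores reported below. Requires Hugging Face API access.
from huggingface_hub import InferenceClient

judge = InferenceClient("meta-llama/Llama-3.3-70B-Instruct")

judge_prompt = f"""Rate the chatbot reply below on relevance, correctness, and fluency,
each on a 1-10 scale, and briefly justify each score.

User message: {user_input}
Chatbot reply: {response}"""

evaluation = judge.chat_completion(
    messages=[{"role": "user", "content": judge_prompt}],
    max_tokens=300,
)
print(evaluation.choices[0].message.content)
```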

### Results

Meta-Llama-3.3-70B-Instruct Evaluation:

- **Relevance: 10**
  The chatbot's response is highly relevant to the user's greeting, as it acknowledges the greeting and opens the conversation for further discussion.

- **Correctness: 10**
  The chatbot's response is grammatically correct and uses proper language, making it a correct and appropriate response to the user's greeting.

- **Fluency: 10**
  The chatbot's response is fluent and natural-sounding, with a friendly and approachable tone that invites the user to engage in conversation.

Overall, the chatbot's reply is a well-crafted greeting that effectively sets the tone for a helpful and engaging conversation.