tuandunghcmut committed commit c2f2614 · verified · 1 Parent(s): 626a9f3

Update README.md

Files changed (1):
  1. README.md +546 -212
README.md CHANGED
@@ -1,285 +1,619 @@
- ---
- license: mit
- datasets:
- - tuandunghcmut/normal_dataset
- language:
- - en
- metrics:
- - accuracy
- - perplexity
- base_model:
- - unsloth/Qwen2.5-Coder-1.5B-Instruct
- pipeline_tag: text-generation
- ---
- # Using Unsloth to Load and Run Qwen25_Coder_MultipleChoice
-
- Unsloth offers significant inference speed improvements for the Qwen25_Coder_MultipleChoice model. Here's how to properly load and use the model with Unsloth:
-
- ## Installation
-
- First, install the required packages:

- ```bash
- pip install unsloth transformers torch accelerate
- # Flash-attention is REQUIRED for correct model behavior!
- pip install flash-attn --no-build-isolation
- ```
-
- ## Loading the Model with Unsloth
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- from unsloth import FastLanguageModel
- import os

- # Optional: Set a HuggingFace Hub token if you have one
- hf_token = os.environ.get("HF_TOKEN")  # or directly provide your token

- # Verify the flash-attention installation - REQUIRED for correct results
- try:
-     import flash_attn
- except ImportError:
-     raise ImportError(
-         "flash-attn package is required for correct model behavior.\n"
-         "Please install it with: pip install flash-attn --no-build-isolation"
-     )

- # Model ID on the HuggingFace Hub
- model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

- print(f"Loading model from HuggingFace Hub: {model_id}")

- # First load the tokenizer
- tokenizer = AutoTokenizer.from_pretrained(
-     model_id,
-     token=hf_token,
-     trust_remote_code=True
- )

- model, tokenizer = FastLanguageModel.from_pretrained(
-     model_name=model_id,
-     token=hf_token,
-     max_seq_length=2048,  # Adjust based on your memory constraints
-     dtype=None,           # Auto-detect best dtype
-     load_in_4bit=True,    # Use 4-bit quantization for efficiency
- )

- # Enable fast inference mode
- FastLanguageModel.for_inference(model)

- print("Successfully loaded model with Unsloth and flash-attention!")
  ```

- > ⚠️ **WARNING**: Using this model without flash-attention will produce incorrect results. The flash-attention package is not just for speed, but essential for proper model functionality.

- Alternatively, you can load the model with transformers first and then apply Unsloth optimization:

  ```python
- # Alternative approach (Method 2)
- # First load with transformers
  model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     token=hf_token,
      torch_dtype=torch.bfloat16,
      device_map="auto",
-     trust_remote_code=True
  )

- # Then apply Unsloth optimization with flash-attention
- FastLanguageModel.for_inference(model, use_flash_attention=True)
  ```

- ## Running Multiple-Choice Inference

- After loading the model with Unsloth, use it to answer multiple-choice questions:

  ```python
- def format_prompt(question, choices):
-     # Format choices as a lettered list
-     formatted_choices = "\n".join(
-         [f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)]
-     )
-
-     return f"""
- QUESTION:
  {question}

- CHOICES:
  {formatted_choices}

- Analyze this question step-by-step and provide a detailed explanation.
- Your response MUST be in YAML format as follows:

  understanding: |
-   <your understanding of what the question is asking>
  analysis: |
    <your analysis of each option>
  reasoning: |
-   <your step-by-step reasoning process>
  conclusion: |
    <your final conclusion>
- answer: <single letter A through {chr(64 + len(choices))}>

- The answer field MUST contain ONLY a single character letter.
  """

- def get_answer(question, choices, model, tokenizer):
-     # Create the prompt
-     prompt = format_prompt(question, choices)

-     # Format as chat for the model
-     messages = [{"role": "user", "content": prompt}]
-     chat_text = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True
-     )

-     # Tokenize and generate
-     inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

-     # Generate with the Unsloth-optimized model
-     output = model.generate(
-         inputs.input_ids,
-         max_new_tokens=512,
-         temperature=0.0,  # Use deterministic generation for multiple choice
-         do_sample=False
-     )

-     # Extract and return the response
-     response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

-     # Extract the answer using a regex
-     import re
-     answer_match = re.search(r'answer:\s*([A-Z])', response)
-     if answer_match:
-         answer = answer_match.group(1)
-     else:
-         # Default fallback if no answer is found
-         answer = "A"

-     return {
-         "answer": answer,
-         "full_response": response
-     }
-
- # Example usage
- java_example = {
-     "question": "Which of the following correctly creates a new instance of a class in Java?",
-     "choices": [
-         "MyClass obj = new MyClass();",
-         "var obj = MyClass();",
-         "MyClass obj = MyClass.new();",
-         "obj = new(MyClass);"
-     ],
-     "answer": "A"  # Optional ground truth
- }
-
- result = get_answer(
-     java_example["question"],
-     java_example["choices"],
-     model,
-     tokenizer
- )
-
- print(f"Answer: {result['answer']}")
- print(f"Full explanation:\n{result['full_response']}")
  ```

- ## Processing Multiple Questions in Batch

- For better efficiency with multiple questions, use batch processing:

  ```python
- def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
-     """Process multiple questions in efficient batches"""
-     results = []
-
-     for i in range(0, len(questions_list), batch_size):
-         batch = questions_list[i:i+batch_size]
-         batch_prompts = []

-         # Prepare all prompts in the batch
-         for item in batch:
-             prompt = format_prompt(item["question"], item["choices"])
-             messages = [{"role": "user", "content": prompt}]
-             chat_text = tokenizer.apply_chat_template(
-                 messages,
-                 tokenize=False,
-                 add_generation_prompt=True
-             )
-             batch_prompts.append(chat_text)

-         # Tokenize all inputs with padding
-         tokenizer.padding_side = "left"  # Important for causal LM generation
-         inputs = tokenizer(
-             batch_prompts,
-             return_tensors="pt",
-             padding=True
-         ).to(model.device)

-         # Generate all outputs
-         outputs = model.generate(
-             inputs.input_ids,
-             attention_mask=inputs.attention_mask,
-             max_new_tokens=2048,
-             temperature=0.0,
-             do_sample=False,
-             pad_token_id=tokenizer.pad_token_id
          )

-         # Process each response
-         for j, output_ids in enumerate(outputs):
-             # Calculate where the generated text begins
-             input_length = inputs.input_ids[j].ne(tokenizer.pad_token_id).sum().item()

-             # Decode only the generated part
-             response = tokenizer.decode(
-                 output_ids[input_length:],
                  skip_special_tokens=True
              )

-             # Extract the answer
-             import re
-             answer_match = re.search(r'answer:\s*([A-Z])', response)
-             answer = answer_match.group(1) if answer_match else "A"

-             # Store the result
-             results.append({
-                 "question": batch[j]["question"],
-                 "answer": answer,
-                 "full_response": response
-             })
-
-     return results
  ```

- ## Performance Tips for Unsloth

- 1. **Flash Attention REQUIRED**: Flash Attention is not just a performance option but a requirement for this model to function correctly:
  ```bash
- pip install flash-attn --no-build-isolation
  ```

- 2. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
  ```python
- model, tokenizer = FastLanguageModel.from_pretrained(
-     model_name=model_id,
-     max_seq_length=1024,  # Reduced from 2048
-     load_in_4bit=True,
-     use_flash_attention=True  # Always enable
  )
  ```

- 3. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.
-
- 4. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.

- 5. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
  ```python
- os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU
  ```

- <!-- With these optimizations, Qwen25_Coder_MultipleChoice will run correctly while maintaining the high-quality multiple-choice reasoning and answers. -->
- ```

+ # Using tuandunghcmut/Qwen25_Coder_MultipleChoice

+ This document provides everything you need to get started with the `tuandunghcmut/Qwen25_Coder_MultipleChoice` model for multiple-choice coding questions.

+ ## Installation and Setup

+ ### Prerequisites

+ Make sure you have Python 3.8+ installed. Then install the required packages:

+ ```bash
+ # Install core dependencies
+ pip install transformers torch pandas

+ # For faster inference (important)
+ pip install unsloth accelerate bitsandbytes

+ # Flash Attention (highly recommended for speed)
+ pip install flash-attn --no-build-isolation

+ # For dataset handling and YAML parsing
+ pip install datasets pyyaml
  ```

+ ### Flash Attention Setup
+
+ Flash Attention provides a significant speedup for transformer models. To use it with the Qwen model:

+ 1. Install Flash Attention as shown above
+ 2. Enable it when loading the model:

  ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Enable Flash Attention during model loading
  model = AutoModelForCausalLM.from_pretrained(
+     "tuandunghcmut/Qwen25_Coder_MultipleChoice",
      torch_dtype=torch.bfloat16,
      device_map="auto",
+     trust_remote_code=True,
+     use_flash_attention_2=True  # Enable Flash Attention
  )
+ ```
+
+ Flash Attention provides:
+ - 2-3x faster inference speed
+ - Lower memory usage
+ - Compatibility with 4-bit quantization for even more efficiency
+
+ ### Environment Variables

+ If you're using Hugging Face Hub models, you may want to set up your access token:
+
+ ```bash
+ # Set an environment variable for your Hugging Face token
+ export HF_TOKEN="your_huggingface_token_here"
+ ```
+
+ Or in Python:
+
+ ```python
+ import os
+ os.environ["HF_TOKEN"] = "your_huggingface_token_here"
  ```

+ ### GPU Setup

+ For optimal performance, you'll need a CUDA-compatible GPU. Check your installation:
+
+ ```bash
+ # Verify CUDA is available
+ python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
+
+ # Print CUDA device info
+ python -c "import torch; print('CUDA device count:', torch.cuda.device_count()); print('CUDA device name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"
+ ```
+
+ ## Required Classes
+
+ Below are the essential classes needed to work with the model. Copy these into your Python files to use them in your project.
+
+ ### PromptCreator
+
+ This class formats prompts for multiple-choice questions:

  ```python
+ class PromptCreator:
+     """
+     Creates and formats prompts for multiple-choice questions.
+     Supports different prompt styles for training and inference.
+     """
+
+     # Prompt types
+     BASIC = "basic"               # Simple answer-only format
+     YAML_REASONING = "yaml"       # YAML-formatted reasoning
+     TEACHER_REASONED = "teacher"  # Same YAML format as YAML_REASONING, but uses teacher completions for training
+
+     def __init__(self, prompt_type=BASIC):
+         self.prompt_type = prompt_type
+         # Initialize the parser mode based on prompt type
+         if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
+             self.parser_mode = "yaml"
+         else:
+             self.parser_mode = "basic"
+
+     def format_choices(self, choices):
+         """Format choices with letter prefixes"""
+         return "\n".join([f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)])
+
+     def get_max_letter(self, choices):
+         """Get the last valid letter based on choice count"""
+         return chr(65 + len(choices) - 1)
+
+     def create_inference_prompt(self, question, choices):
+         """Create a prompt for inference based on the configured prompt type"""
+         formatted_choices = self.format_choices(choices)
+         max_letter = self.get_max_letter(choices)
+
+         if self.prompt_type == self.BASIC:
+             return self._create_basic_prompt(question, formatted_choices, max_letter)
+         elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
+             return self._create_yaml_prompt(question, formatted_choices, max_letter)
+         else:
+             return self._create_basic_prompt(question, formatted_choices, max_letter)
+
+     def _create_basic_prompt(self, question, formatted_choices, max_letter):
+         """Create a basic prompt that just asks for an answer letter"""
+         return f"""
  {question}

  {formatted_choices}

+ Select the correct answer from A through {max_letter}:
+ """
+
+     def _create_yaml_prompt(self, question, formatted_choices, max_letter):
+         """Create a prompt with a YAML-formatted reasoning structure"""
+         return f"""
+ {question}
+
+ {formatted_choices}

+ Think through this step-by-step:
+ - Understand what the question is asking
+ - Analyze each option carefully
+ - Reason about why each option might be correct or incorrect
+ - Select the most appropriate answer
+
+ Your response should be in YAML format:
  understanding: |
+   <your understanding of the question>
  analysis: |
    <your analysis of each option>
  reasoning: |
+   <your reasoning about the correct answer>
  conclusion: |
    <your final conclusion>
+ answer: <single letter A through {max_letter} representing your final answer>
+ """
+
+     def create_training_prompt(self, question, choices):
+         """Create a prompt for training based on the configured prompt type"""
+         formatted_choices = self.format_choices(choices)
+         max_letter = self.get_max_letter(choices)
+
+         if self.prompt_type == self.BASIC:
+             return self._create_basic_training_prompt(question, formatted_choices, max_letter)
+         elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
+             return self._create_yaml_training_prompt(question, formatted_choices, max_letter)
+         else:
+             return self._create_basic_training_prompt(question, formatted_choices, max_letter)
+
+     def _create_basic_training_prompt(self, question, formatted_choices, max_letter):
+         """Create a basic training prompt"""
+         return f"""
+ {question}
+
+ {formatted_choices}
+
+ Select the correct answer from A through {max_letter}:
+ """
+
+     def _create_yaml_training_prompt(self, question, formatted_choices, max_letter):
+         """Create a training prompt with a YAML-formatted reasoning structure"""
+         return f"""
+ {question}
+
+ {formatted_choices}
+
+ Think through this step-by-step:
+ - Understand what the question is asking
+ - Analyze each option carefully
+ - Reason about why each option might be correct or incorrect
+ - Select the most appropriate answer

+ Your response should be in YAML format:
+ understanding: |
+   <your understanding of the question>
+ analysis: |
+   <your analysis of each option>
+ reasoning: |
+   <your reasoning about the correct answer>
+ conclusion: |
+   <your final conclusion>
+ answer: <single letter A through {max_letter} representing your final answer>
  """

+     def set_prompt_type(self, prompt_type):
+         """Set the prompt type and update the parser mode accordingly"""
+         self.prompt_type = prompt_type
+         if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
+             self.parser_mode = "yaml"
+         else:
+             self.parser_mode = "basic"
+
+     def is_teacher_mode(self):
+         """Check if the prompt type is teacher mode"""
+         return self.prompt_type == self.TEACHER_REASONED
+ ```
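The letter-prefix helpers above are plain Python, so they can be sanity-checked in isolation. Below is a minimal standalone sketch mirroring `format_choices` and `get_max_letter` (written as module-level functions purely for illustration):

```python
def format_choices(choices):
    # Mirrors PromptCreator.format_choices: prefix each choice with A., B., C., ...
    # (65 is ord("A"))
    return "\n".join(f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices))

def get_max_letter(choices):
    # Mirrors PromptCreator.get_max_letter: last valid answer letter for this choice count
    return chr(65 + len(choices) - 1)

choices = ["list", "tuple", "dict", "set"]
print(format_choices(choices))
print(get_max_letter(choices))  # D
```

With four choices the valid answers run A through D, which is exactly what the `{max_letter}` placeholder in the prompt templates enforces.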
+
+ ### ResponseParser
+
+ This class extracts answers from model responses:
+
+ ```python
+ class ResponseParser:
+     """
+     Parser for model responses with support for different formats.
+     Extracts answers and reasoning from model outputs.
+     """

+     # Parser modes
+     BASIC = "basic"  # Extract a single-letter answer
+     YAML = "yaml"    # Parse a YAML-formatted response with reasoning

+     def __init__(self, parser_mode=BASIC):
+         """Initialize with a parser mode (basic or yaml)"""
+         self.parser_mode = parser_mode
+
+     def parse(self, response_text):
+         """Parse the response text and extract the answer and reasoning"""
+         if self.parser_mode == self.YAML:
+             return self._parse_yaml_response(response_text)
+         else:
+             return self._parse_basic_response(response_text)

+     def _parse_basic_response(self, response_text):
+         """
+         Parse a basic response to extract the answer letter
+
+         Returns:
+             tuple: (answer_letter, None)
+         """
+         import re
+
+         # Try to find the last occurrence of a letter A-Z by itself
+         matches = re.findall(r'\b([A-Z])\b', response_text)
+         if matches:
+             return matches[-1], None  # Return the last matching letter
+
+         # Try to find a "The answer is X" pattern
+         answer_match = re.search(r'[Tt]he answer is[:\s]+([A-Z])', response_text)
+         if answer_match:
+             return answer_match.group(1), None
+
+         # If nothing else works, just take the last uppercase letter
+         uppercase_letters = re.findall(r'[A-Z]', response_text)
+         if uppercase_letters:
+             return uppercase_letters[-1], None
+
+         return None, None  # No answer found

+     def _parse_yaml_response(self, response_text):
+         """
+         Parse a YAML-formatted response to extract the answer and reasoning
+
+         Returns:
+             tuple: (answer_letter, reasoning_dict)
+         """
+         import re
+         import yaml
+
+         # First try to extract just the answer field
+         answer_match = re.search(r'answer:\s*([A-Z])', response_text)
+         answer = answer_match.group(1) if answer_match else None
+
+         # Then try to extract the entire YAML document
+         try:
+             # Remove potential code block markers
+             yaml_text = response_text
+             if "```yaml" in yaml_text:
+                 yaml_text = yaml_text.split("```yaml")[1]
+                 if "```" in yaml_text:
+                     yaml_text = yaml_text.split("```")[0]
+             elif "```" in yaml_text:
+                 # Assume the whole thing is a code block
+                 parts = yaml_text.split("```")
+                 if len(parts) >= 3:
+                     yaml_text = parts[1]
+
+             # Parse the YAML
+             parsed_yaml = yaml.safe_load(yaml_text)
+
+             # If successful, use the answer from the YAML and return the parsed structure
+             if isinstance(parsed_yaml, dict) and "answer" in parsed_yaml:
+                 return parsed_yaml.get("answer"), parsed_yaml
+         except Exception:
+             # If YAML parsing fails, we already have the answer from the regex
+             pass
+
+         return answer, None

+     def set_parser_mode(self, parser_mode):
+         """Set the parser mode"""
+         self.parser_mode = parser_mode

+     @classmethod
+     def from_prompt_type(cls, prompt_type):
+         """
+         Create a ResponseParser with the appropriate mode based on prompt type
+
+         Args:
+             prompt_type: The prompt type (e.g., PromptCreator.YAML_REASONING)
+
+         Returns:
+             ResponseParser: A parser configured for the prompt type
+         """
+         if prompt_type in ["yaml", "teacher"]:
+             return cls("yaml")
+         else:
+             return cls("basic")
  ```
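To see what the parser consumes, here is a minimal standalone check of the regex fast path that `_parse_yaml_response` tries before attempting a full YAML parse (standard library only, no `pyyaml` needed; the sample completion is illustrative):

```python
import re

# An illustrative YAML-style completion, shaped like the prompt template above
sample_response = """understanding: |
  The question asks which snippet is a valid list comprehension.
analysis: |
  Option A uses the correct bracket syntax.
conclusion: |
  Option A is correct.
answer: A
"""

# The same pattern ResponseParser uses to grab the answer field
match = re.search(r'answer:\s*([A-Z])', sample_response)
answer = match.group(1) if match else None
print(answer)  # A
```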

+ ### QwenModelHandler

+ This class handles model loading and inference:

  ```python
+ class QwenModelHandler:
+     def __init__(self, model_name="unsloth/Qwen2.5-7B", max_seq_length=768,
+                  quantization=None, device_map="auto", cache_dir=None,
+                  use_flash_attention=True):
+         """
+         Initialize a handler for Qwen models

+         Args:
+             model_name: Model identifier (local path or Hugging Face model ID)
+             max_seq_length: Maximum sequence length
+             quantization: Quantization method ("4bit", "8bit", or None)
+             device_map: Device mapping strategy
+             cache_dir: Directory to cache downloaded models
+             use_flash_attention: Whether to use Flash Attention 2 for faster inference
+         """
+         self.model_name = model_name
+         self.max_seq_length = max_seq_length
+         self.quantization = quantization
+         self.device_map = device_map
+         self.cache_dir = cache_dir
+         self.use_flash_attention = use_flash_attention
+
+         self.model = None
+         self.tokenizer = None
+
+         # Load the model and tokenizer
+         self._load_model()
+
+     def _load_model(self):
+         """Load the model and tokenizer with appropriate settings"""
+         from transformers import AutoModelForCausalLM, AutoTokenizer
+         import torch
+
+         # Load the tokenizer
+         self.tokenizer = AutoTokenizer.from_pretrained(
+             self.model_name,
+             trust_remote_code=True,
+             cache_dir=self.cache_dir
+         )
+
+         # Prepare model loading kwargs
+         model_kwargs = {
+             "trust_remote_code": True,
+             "cache_dir": self.cache_dir,
+             "device_map": self.device_map,
+         }
+
+         # Add Flash Attention if requested and available
+         if self.use_flash_attention:
+             try:
+                 import flash_attn
+                 model_kwargs["use_flash_attention_2"] = True
+                 print("Flash Attention 2 enabled!")
+             except ImportError:
+                 print("Flash Attention not available. For faster inference, install with: pip install flash-attn")

+         # Add quantization if specified
+         if self.quantization == "4bit":
+             try:
+                 from transformers import BitsAndBytesConfig
+                 model_kwargs["quantization_config"] = BitsAndBytesConfig(
+                     load_in_4bit=True,
+                     bnb_4bit_compute_dtype=torch.bfloat16
+                 )
+             except ImportError:
+                 print("bitsandbytes not available, loading without 4-bit quantization")
+         elif self.quantization == "8bit":
+             model_kwargs["load_in_8bit"] = True
+         else:
+             model_kwargs["torch_dtype"] = torch.bfloat16

+         # Load the model
+         self.model = AutoModelForCausalLM.from_pretrained(
+             self.model_name,
+             **model_kwargs
          )

+     def generate_with_streaming(self, prompt, temperature=0.7, max_tokens=1024, stream=True):
+         """
+         Generate text from the model with optional streaming
+
+         Args:
+             prompt: Input text prompt
+             temperature: Temperature for sampling (0 for deterministic)
+             max_tokens: Maximum number of tokens to generate
+             stream: Whether to stream the output

+         Returns:
+             String containing the generated text
+         """
+         import torch
+
+         # Tokenize the prompt
+         inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
+         input_ids = inputs.input_ids
+         attention_mask = inputs.attention_mask
+
+         # Set generation parameters
+         generation_config = {
+             "max_new_tokens": max_tokens,
+             "temperature": temperature,
+             "do_sample": temperature > 0,
+             "top_p": 0.95 if temperature > 0 else 1.0,
+             "repetition_penalty": 1.1,
+             "pad_token_id": self.tokenizer.eos_token_id,
+         }
+
+         # If not streaming, do normal generation
+         if not stream:
+             with torch.no_grad():
+                 outputs = self.model.generate(
+                     input_ids=input_ids,
+                     attention_mask=attention_mask,
+                     **generation_config
+                 )
+
+             # Decode the generated text (skip the prompt)
+             generated_text = self.tokenizer.decode(
+                 outputs[0][input_ids.shape[1]:],
                  skip_special_tokens=True
              )

+             return generated_text
+
+         # If streaming was requested, generate and decode the full sequence
+         else:
+             with torch.no_grad():
+                 generated_ids = self.model.generate(
+                     input_ids=input_ids,
+                     attention_mask=attention_mask,
+                     **generation_config,
+                     streamer=None  # A custom streamer could be plugged in here
+                 )
+
+             # Decode the entire sequence at once (not truly streaming, but simpler)
+             full_text = self.tokenizer.decode(
+                 generated_ids[0][input_ids.shape[1]:],
+                 skip_special_tokens=True
+             )
+
+             return full_text
+ ```

+ ## Hardware Requirements and Optimization
+
+ ### Flash Attention Benefits
+
+ Flash Attention is a highly optimized implementation of the attention mechanism that:
+
+ 1. **Speeds up inference by 2-3x** compared to standard attention
+ 2. **Reduces memory usage** by avoiding materializing large attention matrices
+ 3. **Works well with 4-bit quantization** for further optimization
+ 4. **Scales better with sequence length**, which is important for complex coding questions
+
+ For the best performance, make sure to:
+ - Install Flash Attention (`pip install flash-attn`)
+ - Enable it when loading the model (see the QwenModelHandler class)
+ - Use a CUDA-compatible NVIDIA GPU
+
+ ### Hardware Recommendations
+
+ For optimal performance, we recommend:
+
+ - **GPU**: NVIDIA GPU with at least 8GB VRAM (16GB+ recommended for larger models)
+ - **RAM**: 16GB+ system RAM
+ - **Storage**: At least 10GB of free disk space for model files
+ - **CPU**: Modern multi-core processor (for preprocessing)
+
+ ### Reducing Memory Usage
+
+ If you're facing memory constraints:
+
+ ```python
+ # Use 4-bit quantization with Flash Attention for optimal memory efficiency
+ model_handler = QwenModelHandler(
+     model_name="tuandunghcmut/Qwen25_Coder_MultipleChoice",
+     quantization="4bit",
+     use_flash_attention=True
+ )
+
+ # Further optimize with unsloth
+ try:
+     from unsloth.models import FastLanguageModel
+     FastLanguageModel.for_inference(model_handler.model)
+     print("Using unsloth for additional optimization")
+ except ImportError:
+     print("unsloth not available")
+ ```
+
+ ## Usage Example
+
+ Here's how to use these classes with Flash Attention enabled:
+
+ ```python
+ # 1. Load the model with Flash Attention and 4-bit quantization
+ hub_model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"
+
+ # Create a model handler with Flash Attention and 4-bit quantization
+ model_handler = QwenModelHandler(
+     model_name=hub_model_id,
+     max_seq_length=2048,
+     quantization="4bit",
+     use_flash_attention=True
+ )
+
+ # Optional: Use unsloth for even faster inference
+ try:
+     from unsloth.models import FastLanguageModel
+     FastLanguageModel.for_inference(model_handler.model)
+     print("Using unsloth for faster inference")
+ except ImportError:
+     print("unsloth not available, using standard inference")
+
+ # 2. Create a prompt creator with the YAML reasoning format
+ prompt_creator = PromptCreator(PromptCreator.YAML_REASONING)
+
+ # 3. Example question
+ question = "Which of the following correctly defines a list comprehension in Python?"
+ choices = [
+     "[x**2 for x in range(10)]",
+     "for(x in range(10)) { return x**2; }",
+     "map(lambda x: x**2, range(10))",
+     "[for x in range(10): x**2]"
+ ]
+
+ # 4. Create the prompt and generate an answer
+ prompt = prompt_creator.create_inference_prompt(question, choices)
+ response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
+
+ # 5. Parse the response
+ parser = ResponseParser(prompt_creator.parser_mode)
+ answer, reasoning = parser.parse(response)
+
+ print(f"Question: {question}")
+ print(f"Answer: {answer}")
+ if reasoning:
+     print(f"Reasoning: {reasoning}")
  ```
 
583
+ ## Troubleshooting
584
+
585
+ ### Common Issues
586
 
587
+ 1. **Flash Attention Installation Issues**: If you encounter problems installing `flash-attn`:
588
  ```bash
589
+ # Try with specific CUDA version (e.g., for CUDA 11.8)
590
+ pip install flash-attn==2.3.4+cu118 --no-build-isolation
591
+
592
+ # For older GPUs
593
+ pip install flash-attn==2.3.4 --no-build-isolation
594
  ```
595
 
596
+ 2. **CUDA Out of Memory**: Try combining 4-bit quantization with Flash Attention.
597
  ```python
598
+ model_handler = QwenModelHandler(
599
+ model_name=hub_model_id,
600
+ quantization="4bit",
601
+ use_flash_attention=True
 
602
  )
603
  ```
604
 
605
+ 3. **Module Not Found Errors**: Make sure you've installed all required packages.
606
+ ```bash
607
+ pip install transformers torch unsloth datasets pyyaml bitsandbytes flash-attn
608
+ ```
609
 
610
+ 4. **Parsing Errors**: If the model isn't producing valid YAML responses, try adjusting the temperature:
611
  ```python
612
+ response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
613
  ```
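If deterministic decoding still leaves the occasional malformed response, a loose regex fallback in the spirit of `ResponseParser._parse_basic_response` can usually recover the letter anyway. A minimal sketch (the helper name `recover_answer` is illustrative, not part of the classes above):

```python
import re

def recover_answer(text):
    # Prefer an explicit "answer: X" field...
    m = re.search(r'answer:\s*([A-Z])', text)
    if m:
        return m.group(1)
    # ...then fall back to the last standalone capital letter, if any
    letters = re.findall(r'\b([A-Z])\b', text)
    return letters[-1] if letters else None

print(recover_answer("reasoning was cut off... answer: C"))  # C
print(recover_answer("The best option is B"))                # B
```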

+ ### Getting Help
+
+ If you encounter issues, check the [model repository on Hugging Face](https://huggingface.co/tuandunghcmut/Qwen25_Coder_MultipleChoice) for updates and community discussions.
+
+ This guide provides the code and optimization techniques you need to use the model effectively for multiple-choice coding questions.