tuandunghcmut committed commit c2f2614 · verified · 1 Parent(s): 626a9f3

Update README.md

Files changed (1):
  1. README.md +546 -212
README.md CHANGED
@@ -1,285 +1,619 @@
- ---
- license: mit
- datasets:
- - tuandunghcmut/normal_dataset
- language:
- - en
- metrics:
- - accuracy
- - perplexity
- base_model:
- - unsloth/Qwen2.5-Coder-1.5B-Instruct
- pipeline_tag: text-generation
- ---
- # Using Unsloth to Load and Run Qwen25_Coder_MultipleChoice
-
- Unsloth offers significant inference speed improvements for the Qwen25_Coder_MultipleChoice model. Here's how to properly load and use the model with Unsloth:
-
- ## Installation
-
- First, install the required packages:

- ```bash
- pip install unsloth transformers torch accelerate
- # Flash-attention is REQUIRED for correct model behavior!
- pip install flash-attn --no-build-isolation
- ```
-
- ## Loading the Model with Unsloth
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- from unsloth import FastLanguageModel
- import os

- # Optional: Set a HuggingFace Hub token if you have one
- hf_token = os.environ.get("HF_TOKEN")  # or directly provide your token

- # Verify the flash-attention installation - REQUIRED for correct results
- try:
-     import flash_attn
- except ImportError:
-     raise ImportError(
-         "flash-attn package is required for correct model behavior.\n"
-         "Please install it with: pip install flash-attn --no-build-isolation"
-     )

- # Model ID on the HuggingFace Hub
- model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

- print(f"Loading model from HuggingFace Hub: {model_id}")

- # First load the tokenizer
- tokenizer = AutoTokenizer.from_pretrained(
-     model_id,
-     token=hf_token,
-     trust_remote_code=True
- )

- model, tokenizer = FastLanguageModel.from_pretrained(
-     model_name=model_id,
-     token=hf_token,
-     max_seq_length=2048,  # Adjust based on your memory constraints
-     dtype=None,           # Auto-detect best dtype
-     load_in_4bit=True,    # Use 4-bit quantization for efficiency
- )

- # Enable fast inference mode
- FastLanguageModel.for_inference(model)

- print("Successfully loaded model with Unsloth and flash-attention!")
  ```

- > ⚠️ **WARNING**: Using this model without flash-attention will produce incorrect results. The flash-attention package is not just for speed, but essential for proper model functionality.

- Alternatively, you can load the model with transformers first and then apply Unsloth optimization:

  ```python
- # Alternative approach (Method 2)
- # First load with transformers
  model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     token=hf_token,
      torch_dtype=torch.bfloat16,
      device_map="auto",
-     trust_remote_code=True
  )

- # Then apply Unsloth optimization with flash-attention
- FastLanguageModel.for_inference(model, use_flash_attention=True)
  ```

- ## Running Multiple-Choice Inference

- After loading the model with Unsloth, use it to answer multiple-choice questions:

  ```python
- def format_prompt(question, choices):
-     # Format choices as a lettered list
-     formatted_choices = "\n".join(
-         [f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)]
-     )
-
-     return f"""
- QUESTION:
  {question}

- CHOICES:
  {formatted_choices}

- Analyze this question step-by-step and provide a detailed explanation.
- Your response MUST be in YAML format as follows:

  understanding: |
-   <your understanding of what the question is asking>
  analysis: |
    <your analysis of each option>
  reasoning: |
-   <your step-by-step reasoning process>
  conclusion: |
    <your final conclusion>
- answer: <single letter A through {chr(64 + len(choices))}>

- The answer field MUST contain ONLY a single character letter.
  """

- def get_answer(question, choices, model, tokenizer):
-     # Create the prompt
-     prompt = format_prompt(question, choices)

-     # Format as chat for the model
-     messages = [{"role": "user", "content": prompt}]
-     chat_text = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True
-     )

-     # Tokenize and generate
-     inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

-     # Generate with the Unsloth-optimized model
-     output = model.generate(
-         inputs.input_ids,
-         max_new_tokens=512,
-         temperature=0.0,  # Use deterministic generation for multiple choice
-         do_sample=False
-     )

-     # Extract and return the response
-     response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

-     # Extract the answer using a regex
-     import re
-     answer_match = re.search(r'answer:\s*([A-Z])', response)
-     if answer_match:
-         answer = answer_match.group(1)
-     else:
-         # Default fallback if no answer is found
-         answer = "A"

-     return {
-         "answer": answer,
-         "full_response": response
-     }
-
- # Example usage
- java_example = {
-     "question": "Which of the following correctly creates a new instance of a class in Java?",
-     "choices": [
-         "MyClass obj = new MyClass();",
-         "var obj = MyClass();",
-         "MyClass obj = MyClass.new();",
-         "obj = new(MyClass);"
-     ],
-     "answer": "A"  # Optional ground truth
- }
-
- result = get_answer(
-     java_example["question"],
-     java_example["choices"],
-     model,
-     tokenizer
- )
-
- print(f"Answer: {result['answer']}")
- print(f"Full explanation:\n{result['full_response']}")
  ```

- ## Processing Multiple Questions in Batch

- For better efficiency with multiple questions, use batch processing:

  ```python
- def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
-     """Process multiple questions in efficient batches"""
-     results = []
-
-     for i in range(0, len(questions_list), batch_size):
-         batch = questions_list[i:i+batch_size]
-         batch_prompts = []

-         # Prepare all prompts in the batch
-         for item in batch:
-             prompt = format_prompt(item["question"], item["choices"])
-             messages = [{"role": "user", "content": prompt}]
-             chat_text = tokenizer.apply_chat_template(
-                 messages,
-                 tokenize=False,
-                 add_generation_prompt=True
-             )
-             batch_prompts.append(chat_text)

-         # Tokenize all inputs with padding
-         tokenizer.padding_side = "left"  # Important for causal LM generation
-         inputs = tokenizer(
-             batch_prompts,
-             return_tensors="pt",
-             padding=True
-         ).to(model.device)

-         # Generate all outputs
-         outputs = model.generate(
-             inputs.input_ids,
-             attention_mask=inputs.attention_mask,
-             max_new_tokens=2048,
-             temperature=0.0,
-             do_sample=False,
-             pad_token_id=tokenizer.pad_token_id
          )

-         # Process each response
-         for j, output_ids in enumerate(outputs):
-             # Calculate where the generated text begins
-             input_length = inputs.input_ids[j].ne(tokenizer.pad_token_id).sum().item()

-             # Decode only the generated part
-             response = tokenizer.decode(
-                 output_ids[input_length:],
                  skip_special_tokens=True
              )

-             # Extract the answer
-             import re
-             answer_match = re.search(r'answer:\s*([A-Z])', response)
-             answer = answer_match.group(1) if answer_match else "A"

-             # Store the result
-             results.append({
-                 "question": batch[j]["question"],
-                 "answer": answer,
-                 "full_response": response
-             })
-
-     return results
  ```

- ## Performance Tips for Unsloth

- 1. **Flash Attention REQUIRED**: Flash Attention is not just a performance option but a requirement for this model to function correctly:
  ```bash
- pip install flash-attn --no-build-isolation
  ```

- 2. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
  ```python
- model, tokenizer = FastLanguageModel.from_pretrained(
-     model_name=model_id,
-     max_seq_length=1024,  # Reduced from 2048
-     load_in_4bit=True,
-     use_flash_attention=True  # Always enable
  )
  ```

- 3. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.
-
- 4. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.

- 5. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
  ```python
- os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU
  ```

- <!-- With these optimizations, Qwen25_Coder_MultipleChoice will run correctly while maintaining the high-quality multiple-choice reasoning and answers. -->
- ```

+ # Using tuandunghcmut/Qwen25_Coder_MultipleChoice

+ This document provides everything you need to get started with the `tuandunghcmut/Qwen25_Coder_MultipleChoice` model for multiple-choice coding questions.

+ ## Installation and Setup

+ ### Prerequisites

+ Make sure you have Python 3.8+ installed. Then install the required packages:

+ ```bash
+ # Install core dependencies
+ pip install transformers torch pandas

+ # For faster inference (important)
+ pip install unsloth accelerate bitsandbytes

+ # Flash Attention (highly recommended for speed)
+ pip install flash-attn --no-build-isolation

+ # For dataset handling and YAML parsing
+ pip install datasets pyyaml
  ```

+ ### Flash Attention Setup
+
+ Flash Attention provides a significant speedup for transformer models. To use it with the Qwen model:

+ 1. Install Flash Attention as shown above
+ 2. Enable it when loading the model:

  ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Enable Flash Attention during model loading
  model = AutoModelForCausalLM.from_pretrained(
+     "tuandunghcmut/Qwen25_Coder_MultipleChoice",
      torch_dtype=torch.bfloat16,
      device_map="auto",
+     trust_remote_code=True,
+     use_flash_attention_2=True  # Enable Flash Attention
  )
+ ```
+
+ Flash Attention provides:
+ - 2-3x faster inference speed
+ - Lower memory usage
+ - Compatibility with 4-bit quantization for even more efficiency
+
+ ### Environment Variables

+ If you're using Hugging Face Hub models, you may want to set up your access token:
+
+ ```bash
+ # Set an environment variable for your Hugging Face token
+ export HF_TOKEN="your_huggingface_token_here"
+ ```
+
+ Or in Python:
+
+ ```python
+ import os
+ os.environ["HF_TOKEN"] = "your_huggingface_token_here"
  ```

+ ### GPU Setup

+ For optimal performance, you'll need a CUDA-compatible GPU. Check your installation:
+
+ ```bash
+ # Verify CUDA is available
+ python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
+
+ # Print CUDA device info
+ python -c "import torch; print('CUDA device count:', torch.cuda.device_count()); print('CUDA device name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"
+ ```
+
+ ## Required Classes
+
+ Below are the essential classes needed to work with the model. Copy these into your Python files to use them in your project.
+
+ ### PromptCreator
+
+ This class formats prompts for multiple-choice questions:

  ```python
+ class PromptCreator:
+     """
+     Creates and formats prompts for multiple-choice questions.
+     Supports different prompt styles for training and inference.
+     """
+
+     # Prompt types
+     BASIC = "basic"               # Simple answer-only format
+     YAML_REASONING = "yaml"       # YAML-formatted reasoning
+     TEACHER_REASONED = "teacher"  # Same YAML format as YAML_REASONING, but uses teacher completions for training
+
+     def __init__(self, prompt_type=BASIC):
+         self.prompt_type = prompt_type
+         # Initialize the parser mode based on prompt type
+         if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
+             self.parser_mode = "yaml"
+         else:
+             self.parser_mode = "basic"
+
+     def format_choices(self, choices):
+         """Format choices with letter prefixes"""
+         return "\n".join([f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)])
+
+     def get_max_letter(self, choices):
+         """Get the last valid letter based on choice count"""
+         return chr(65 + len(choices) - 1)
+
+     def create_inference_prompt(self, question, choices):
+         """Create a prompt for inference based on the configured prompt type"""
+         formatted_choices = self.format_choices(choices)
+         max_letter = self.get_max_letter(choices)
+
+         if self.prompt_type == self.BASIC:
+             return self._create_basic_prompt(question, formatted_choices, max_letter)
+         elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
+             return self._create_yaml_prompt(question, formatted_choices, max_letter)
+         else:
+             return self._create_basic_prompt(question, formatted_choices, max_letter)
+
+     def _create_basic_prompt(self, question, formatted_choices, max_letter):
+         """Create a basic prompt that just asks for an answer letter"""
+         return f"""
  {question}

  {formatted_choices}

+ Select the correct answer from A through {max_letter}:
+ """
+
+     def _create_yaml_prompt(self, question, formatted_choices, max_letter):
+         """Create a prompt with a YAML-formatted reasoning structure"""
+         return f"""
+ {question}
+
+ {formatted_choices}

+ Think through this step-by-step:
+ - Understand what the question is asking
+ - Analyze each option carefully
+ - Reason about why each option might be correct or incorrect
+ - Select the most appropriate answer
+
+ Your response should be in YAML format:
  understanding: |
+   <your understanding of the question>
  analysis: |
    <your analysis of each option>
  reasoning: |
+   <your reasoning about the correct answer>
  conclusion: |
    <your final conclusion>
+ answer: <single letter A through {max_letter} representing your final answer>
+ """
+
+     def create_training_prompt(self, question, choices):
+         """Create a prompt for training based on the configured prompt type"""
+         formatted_choices = self.format_choices(choices)
+         max_letter = self.get_max_letter(choices)
+
+         if self.prompt_type == self.BASIC:
+             return self._create_basic_training_prompt(question, formatted_choices, max_letter)
+         elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
+             return self._create_yaml_training_prompt(question, formatted_choices, max_letter)
+         else:
+             return self._create_basic_training_prompt(question, formatted_choices, max_letter)
+
+     def _create_basic_training_prompt(self, question, formatted_choices, max_letter):
+         """Create a basic training prompt"""
+         return f"""
+ {question}
+
+ {formatted_choices}
+
+ Select the correct answer from A through {max_letter}:
+ """
+
+     def _create_yaml_training_prompt(self, question, formatted_choices, max_letter):
+         """Create a training prompt with a YAML-formatted reasoning structure"""
+         return f"""
+ {question}
+
+ {formatted_choices}
+
+ Think through this step-by-step:
+ - Understand what the question is asking
+ - Analyze each option carefully
+ - Reason about why each option might be correct or incorrect
+ - Select the most appropriate answer

+ Your response should be in YAML format:
+ understanding: |
+   <your understanding of the question>
+ analysis: |
+   <your analysis of each option>
+ reasoning: |
+   <your reasoning about the correct answer>
+ conclusion: |
+   <your final conclusion>
+ answer: <single letter A through {max_letter} representing your final answer>
  """

+     def set_prompt_type(self, prompt_type):
+         """Set the prompt type and update the parser mode accordingly"""
+         self.prompt_type = prompt_type
+         if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
+             self.parser_mode = "yaml"
+         else:
+             self.parser_mode = "basic"
+
+     def is_teacher_mode(self):
+         """Check if the prompt type is teacher mode"""
+         return self.prompt_type == self.TEACHER_REASONED
+ ```
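The letter-prefix helpers above are plain Python, so they can be sanity-checked in isolation. Below is a minimal standalone sketch mirroring `format_choices` and `get_max_letter` (written as module-level functions purely for illustration):

```python
def format_choices(choices):
    # Mirrors PromptCreator.format_choices: prefix each choice with A., B., C., ...
    # (65 is ord("A"))
    return "\n".join(f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices))

def get_max_letter(choices):
    # Mirrors PromptCreator.get_max_letter: last valid answer letter for this choice count
    return chr(65 + len(choices) - 1)

choices = ["list", "tuple", "dict", "set"]
print(format_choices(choices))
print(get_max_letter(choices))  # D
```

With four choices the valid answers run A through D, which is exactly what the `{max_letter}` placeholder in the prompt templates enforces.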
+
+ ### ResponseParser
+
+ This class extracts answers from model responses:
+
+ ```python
+ class ResponseParser:
+     """
+     Parser for model responses with support for different formats.
+     Extracts answers and reasoning from model outputs.
+     """

+     # Parser modes
+     BASIC = "basic"  # Extract a single-letter answer
+     YAML = "yaml"    # Parse a YAML-formatted response with reasoning

+     def __init__(self, parser_mode=BASIC):
+         """Initialize with a parser mode (basic or yaml)"""
+         self.parser_mode = parser_mode
+
+     def parse(self, response_text):
+         """Parse the response text and extract the answer and reasoning"""
+         if self.parser_mode == self.YAML:
+             return self._parse_yaml_response(response_text)
+         else:
+             return self._parse_basic_response(response_text)

+     def _parse_basic_response(self, response_text):
+         """
+         Parse a basic response to extract the answer letter
+
+         Returns:
+             tuple: (answer_letter, None)
+         """
+         import re
+
+         # Try to find the last occurrence of a letter A-Z by itself
+         matches = re.findall(r'\b([A-Z])\b', response_text)
+         if matches:
+             return matches[-1], None  # Return the last matching letter
+
+         # Try to find a "The answer is X" pattern
+         answer_match = re.search(r'[Tt]he answer is[:\s]+([A-Z])', response_text)
+         if answer_match:
+             return answer_match.group(1), None
+
+         # If nothing else works, just take the last uppercase letter
+         uppercase_letters = re.findall(r'[A-Z]', response_text)
+         if uppercase_letters:
+             return uppercase_letters[-1], None
+
+         return None, None  # No answer found

+     def _parse_yaml_response(self, response_text):
+         """
+         Parse a YAML-formatted response to extract the answer and reasoning
+
+         Returns:
+             tuple: (answer_letter, reasoning_dict)
+         """
+         import re
+         import yaml
+
+         # First try to extract just the answer field
+         answer_match = re.search(r'answer:\s*([A-Z])', response_text)
+         answer = answer_match.group(1) if answer_match else None
+
+         # Then try to extract the entire YAML document
+         try:
+             # Remove potential code block markers
+             yaml_text = response_text
+             if "```yaml" in yaml_text:
+                 yaml_text = yaml_text.split("```yaml")[1]
+                 if "```" in yaml_text:
+                     yaml_text = yaml_text.split("```")[0]
+             elif "```" in yaml_text:
+                 # Assume the whole thing is a code block
+                 parts = yaml_text.split("```")
+                 if len(parts) >= 3:
+                     yaml_text = parts[1]
+
+             # Parse the YAML
+             parsed_yaml = yaml.safe_load(yaml_text)
+
+             # If successful, use the answer from the YAML and return the parsed structure
+             if isinstance(parsed_yaml, dict) and "answer" in parsed_yaml:
+                 return parsed_yaml.get("answer"), parsed_yaml
+         except Exception:
+             # If YAML parsing fails, we already have the answer from the regex
+             pass
+
+         return answer, None

+     def set_parser_mode(self, parser_mode):
+         """Set the parser mode"""
+         self.parser_mode = parser_mode

+     @classmethod
+     def from_prompt_type(cls, prompt_type):
+         """
+         Create a ResponseParser with the appropriate mode based on prompt type
+
+         Args:
+             prompt_type: The prompt type (e.g., PromptCreator.YAML_REASONING)
+
+         Returns:
+             ResponseParser: A parser configured for the prompt type
+         """
+         if prompt_type in ["yaml", "teacher"]:
+             return cls("yaml")
+         else:
+             return cls("basic")
  ```
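To see what the parser consumes, here is a minimal standalone check of the regex fast path that `_parse_yaml_response` tries before attempting a full YAML parse (standard library only, no `pyyaml` needed; the sample completion is illustrative):

```python
import re

# An illustrative YAML-style completion, shaped like the prompt template above
sample_response = """understanding: |
  The question asks which snippet is a valid list comprehension.
analysis: |
  Option A uses the correct bracket syntax.
conclusion: |
  Option A is correct.
answer: A
"""

# The same pattern ResponseParser uses to grab the answer field
match = re.search(r'answer:\s*([A-Z])', sample_response)
answer = match.group(1) if match else None
print(answer)  # A
```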

+ ### QwenModelHandler

+ This class handles model loading and inference:

  ```python
+ class QwenModelHandler:
+     def __init__(self, model_name="unsloth/Qwen2.5-7B", max_seq_length=768,
+                  quantization=None, device_map="auto", cache_dir=None,
+                  use_flash_attention=True):
+         """
+         Initialize a handler for Qwen models

+         Args:
+             model_name: Model identifier (local path or Hugging Face model ID)
+             max_seq_length: Maximum sequence length
+             quantization: Quantization method ("4bit", "8bit", or None)
+             device_map: Device mapping strategy
+             cache_dir: Directory to cache downloaded models
+             use_flash_attention: Whether to use Flash Attention 2 for faster inference
+         """
+         self.model_name = model_name
+         self.max_seq_length = max_seq_length
+         self.quantization = quantization
+         self.device_map = device_map
+         self.cache_dir = cache_dir
+         self.use_flash_attention = use_flash_attention
+
+         self.model = None
+         self.tokenizer = None
+
+         # Load the model and tokenizer
+         self._load_model()
+
+     def _load_model(self):
+         """Load the model and tokenizer with appropriate settings"""
+         from transformers import AutoModelForCausalLM, AutoTokenizer
+         import torch
+
+         # Load the tokenizer
+         self.tokenizer = AutoTokenizer.from_pretrained(
+             self.model_name,
+             trust_remote_code=True,
+             cache_dir=self.cache_dir
+         )
+
+         # Prepare model loading kwargs
+         model_kwargs = {
+             "trust_remote_code": True,
+             "cache_dir": self.cache_dir,
+             "device_map": self.device_map,
+         }
+
+         # Add Flash Attention if requested and available
+         if self.use_flash_attention:
+             try:
+                 import flash_attn
+                 model_kwargs["use_flash_attention_2"] = True
+                 print("Flash Attention 2 enabled!")
+             except ImportError:
+                 print("Flash Attention not available. For faster inference, install with: pip install flash-attn")

+         # Add quantization if specified
+         if self.quantization == "4bit":
+             try:
+                 from transformers import BitsAndBytesConfig
+                 model_kwargs["quantization_config"] = BitsAndBytesConfig(
+                     load_in_4bit=True,
+                     bnb_4bit_compute_dtype=torch.bfloat16
+                 )
+             except ImportError:
+                 print("bitsandbytes not available, loading without 4-bit quantization")
+         elif self.quantization == "8bit":
+             model_kwargs["load_in_8bit"] = True
+         else:
+             model_kwargs["torch_dtype"] = torch.bfloat16

+         # Load the model
+         self.model = AutoModelForCausalLM.from_pretrained(
+             self.model_name,
+             **model_kwargs
          )

+     def generate_with_streaming(self, prompt, temperature=0.7, max_tokens=1024, stream=True):
+         """
+         Generate text from the model with optional streaming
+
+         Args:
+             prompt: Input text prompt
+             temperature: Temperature for sampling (0 for deterministic)
+             max_tokens: Maximum number of tokens to generate
+             stream: Whether to stream the output

+         Returns:
+             String containing the generated text
+         """
+         import torch
+
+         # Tokenize the prompt
+         inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
+         input_ids = inputs.input_ids
+         attention_mask = inputs.attention_mask
+
+         # Set generation parameters
+         generation_config = {
+             "max_new_tokens": max_tokens,
+             "temperature": temperature,
+             "do_sample": temperature > 0,
+             "top_p": 0.95 if temperature > 0 else 1.0,
+             "repetition_penalty": 1.1,
+             "pad_token_id": self.tokenizer.eos_token_id,
+         }
+
+         # If not streaming, do normal generation
+         if not stream:
+             with torch.no_grad():
+                 outputs = self.model.generate(
+                     input_ids=input_ids,
+                     attention_mask=attention_mask,
+                     **generation_config
+                 )
+
+             # Decode the generated text (skip the prompt)
+             generated_text = self.tokenizer.decode(
+                 outputs[0][input_ids.shape[1]:],
                  skip_special_tokens=True
              )

+             return generated_text
+
+         # If streaming was requested, generate and decode the full sequence
+         else:
+             with torch.no_grad():
+                 generated_ids = self.model.generate(
+                     input_ids=input_ids,
+                     attention_mask=attention_mask,
+                     **generation_config,
+                     streamer=None  # A custom streamer could be plugged in here
+                 )
+
+             # Decode the entire sequence at once (not truly streaming, but simpler)
+             full_text = self.tokenizer.decode(
+                 generated_ids[0][input_ids.shape[1]:],
+                 skip_special_tokens=True
+             )
+
+             return full_text
+ ```

+ ## Hardware Requirements and Optimization
+
+ ### Flash Attention Benefits
+
+ Flash Attention is a highly optimized implementation of the attention mechanism that:
+
+ 1. **Speeds up inference by 2-3x** compared to standard attention
+ 2. **Reduces memory usage** by avoiding materializing large attention matrices
+ 3. **Works well with 4-bit quantization** for further optimization
+ 4. **Scales better with sequence length**, which is important for complex coding questions
+
+ For the best performance, make sure to:
+ - Install Flash Attention (`pip install flash-attn`)
+ - Enable it when loading the model (see the QwenModelHandler class)
+ - Use a CUDA-compatible NVIDIA GPU
+
+ ### Hardware Recommendations
+
+ For optimal performance, we recommend:
+
+ - **GPU**: NVIDIA GPU with at least 8GB VRAM (16GB+ recommended for larger models)
+ - **RAM**: 16GB+ system RAM
+ - **Storage**: At least 10GB of free disk space for model files
+ - **CPU**: Modern multi-core processor (for preprocessing)
+
+ ### Reducing Memory Usage
+
+ If you're facing memory constraints:
+
+ ```python
+ # Use 4-bit quantization with Flash Attention for optimal memory efficiency
+ model_handler = QwenModelHandler(
+     model_name="tuandunghcmut/Qwen25_Coder_MultipleChoice",
+     quantization="4bit",
+     use_flash_attention=True
+ )
+
+ # Further optimize with unsloth
+ try:
+     from unsloth.models import FastLanguageModel
+     FastLanguageModel.for_inference(model_handler.model)
+     print("Using unsloth for additional optimization")
+ except ImportError:
+     print("unsloth not available")
+ ```
+
+ ## Usage Example
+
+ Here's how to use these classes with Flash Attention enabled:
+
+ ```python
+ # 1. Load the model with Flash Attention and 4-bit quantization
+ hub_model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"
+
+ # Create a model handler with Flash Attention and 4-bit quantization
+ model_handler = QwenModelHandler(
+     model_name=hub_model_id,
+     max_seq_length=2048,
+     quantization="4bit",
+     use_flash_attention=True
+ )
+
+ # Optional: Use unsloth for even faster inference
+ try:
+     from unsloth.models import FastLanguageModel
+     FastLanguageModel.for_inference(model_handler.model)
+     print("Using unsloth for faster inference")
+ except ImportError:
+     print("unsloth not available, using standard inference")
+
+ # 2. Create a prompt creator with the YAML reasoning format
+ prompt_creator = PromptCreator(PromptCreator.YAML_REASONING)
+
+ # 3. Example question
+ question = "Which of the following correctly defines a list comprehension in Python?"
+ choices = [
+     "[x**2 for x in range(10)]",
+     "for(x in range(10)) { return x**2; }",
+     "map(lambda x: x**2, range(10))",
+     "[for x in range(10): x**2]"
+ ]
+
+ # 4. Create the prompt and generate an answer
+ prompt = prompt_creator.create_inference_prompt(question, choices)
+ response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
+
+ # 5. Parse the response
+ parser = ResponseParser(prompt_creator.parser_mode)
+ answer, reasoning = parser.parse(response)
+
+ print(f"Question: {question}")
+ print(f"Answer: {answer}")
+ if reasoning:
+     print(f"Reasoning: {reasoning}")
  ```
 
583
+ ## Troubleshooting
584
+
585
+ ### Common Issues
586
 
587
+ 1. **Flash Attention Installation Issues**: If you encounter problems installing `flash-attn`:
588
  ```bash
589
+ # Try with specific CUDA version (e.g., for CUDA 11.8)
590
+ pip install flash-attn==2.3.4+cu118 --no-build-isolation
591
+
592
+ # For older GPUs
593
+ pip install flash-attn==2.3.4 --no-build-isolation
594
  ```
595
 
596
+ 2. **CUDA Out of Memory**: Try combining 4-bit quantization with Flash Attention.
597
  ```python
598
+ model_handler = QwenModelHandler(
599
+ model_name=hub_model_id,
600
+ quantization="4bit",
601
+ use_flash_attention=True
 
602
  )
603
  ```
604
 
605
+ 3. **Module Not Found Errors**: Make sure you've installed all required packages.
606
+ ```bash
607
+ pip install transformers torch unsloth datasets pyyaml bitsandbytes flash-attn
608
+ ```
609
 
610
+ 4. **Parsing Errors**: If the model isn't producing valid YAML responses, try adjusting the temperature:
611
  ```python
612
+ response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
613
  ```
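If deterministic decoding still leaves the occasional malformed response, a loose regex fallback in the spirit of `ResponseParser._parse_basic_response` can usually recover the letter anyway. A minimal sketch (the helper name `recover_answer` is illustrative, not part of the classes above):

```python
import re

def recover_answer(text):
    # Prefer an explicit "answer: X" field...
    m = re.search(r'answer:\s*([A-Z])', text)
    if m:
        return m.group(1)
    # ...then fall back to the last standalone capital letter, if any
    letters = re.findall(r'\b([A-Z])\b', text)
    return letters[-1] if letters else None

print(recover_answer("reasoning was cut off... answer: C"))  # C
print(recover_answer("The best option is B"))                # B
```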

+ ### Getting Help
+
+ If you encounter issues, check the [model repository on Hugging Face](https://huggingface.co/tuandunghcmut/Qwen25_Coder_MultipleChoice) for updates and community discussions.
+
+ This guide provides the code and optimization techniques you need to use the model effectively for multiple-choice coding questions.