---
license: mit
datasets:
- LGAI-EXAONE/KoMT-Bench
- skt/kobest_v1
language:
- ko
- en
base_model:
- K-intelligence/Midm-2.0-Base-Instruct
tags:
- LLM
- Korean
- AWQ
- Quantized
- Mi:dm
- transformers
- Safetensors
---

# Midm-2.0-Base-Instruct - AWQ 4-bit Quantized Version

This repository contains the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of the **[K-intelligence/Midm-2.0-Base-Instruct](https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct)** model by KT AI.

This model is the result of a journey to solve real-world performance and cost issues encountered in a production environment. I hope this experience can be a practical guide for other developers facing similar challenges.

## Model Details

* **Base Model:** `K-intelligence/Midm-2.0-Base-Instruct`
* **Quantization Method:** AWQ (Activation-aware Weight Quantization)
* **Quantization Config:**
    * `w_bit`: 4
    * `q_group_size`: 128
    * `zero_point`: True
* **Library:** `AutoAWQ`

## ⚙️ How to Get Started

To use this model, you will need to install the `transformers`, `accelerate`, and `autoawq` libraries.

```bash
pip install transformers accelerate autoawq
```

### Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "jinkyeongk/Midm-2.0-Base-Instruct-AWQ"

# Load the tokenizer and the AWQ-quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
).eval()

# Construct the chat prompt. add_generation_prompt=True appends the
# assistant turn marker so the model answers instead of continuing the user turn.
messages = [
    {"role": "user", "content": "Who are you?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate a response
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

## 📊 Quantization Evaluation

To measure the performance degradation from quantization, the original (FP16) and quantized (AWQ) models were evaluated against two major Korean benchmarks.

* **Ko-Best**: Measures objective knowledge and reasoning skills (Accuracy).
* **Ko-MTBench**: Measures subjective conversational ability (Scores graded by GPT-4o as a judge).

### Final Evaluation Results

| Model | Benchmark | Metric | Score / Accuracy |
|---|---|---|---|
| `K-intelligence/Midm-2.0-Base-Instruct` (FP16) | skt/kobest_v1 | hellaswag (Accuracy) | 0.4900 |
| `jinkyeongk/Midm-2.0-Base-Instruct-AWQ` (AWQ) | skt/kobest_v1 | hellaswag (Accuracy) | **0.4800** |
| `K-intelligence/Midm-2.0-Base-Instruct` (FP16) | LGAI-EXAONE/KoMT-Bench | Avg. Score (by GPT-4o) | 8.50 / 10.0 |
| `jinkyeongk/Midm-2.0-Base-Instruct-AWQ` (AWQ) | LGAI-EXAONE/KoMT-Bench | Avg. Score (by GPT-4o) | **6.40 / 10.0** |

## Analysis

On the Ko-Best (hellaswag) benchmark, AWQ 4-bit quantization lowered accuracy by only 1.0 percentage point (0.4900 to 0.4800), a negligible decrease in objective reasoning ability.
However, in the Ko-MTBench subjective evaluation using GPT-4o as a judge, the average score dropped by a larger margin of 2.1 points (8.50 to 6.40). This suggests that while AWQ quantization maintains performance on well-defined, knowledge-based tasks such as multiple-choice questions (Ko-Best), it can cost some nuance, expressiveness, or sophistication of reasoning in more open-ended, conversational tasks (Ko-MTBench).

Therefore, this quantized model offers a large improvement in speed and cost-efficiency at the expense of a noticeable loss in creative or complex conversational ability. Users should weigh this trade-off against the needs of their specific application.
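
To make the trade-off easier to compare across the two benchmarks, the short sketch below recomputes the absolute and relative drops from the scores in the results table. The `results` dictionary and the `degradation` helper are illustrative names introduced here, not part of any evaluation tooling used for this model.

```python
# Scores copied from the "Final Evaluation Results" table.
# The dictionary keys are illustrative labels, not official metric names.
results = {
    "kobest_hellaswag_acc": {"fp16": 0.4900, "awq": 0.4800},
    "komt_bench_avg": {"fp16": 8.50, "awq": 6.40},
}


def degradation(fp16: float, awq: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop as a percentage of the FP16 score)."""
    absolute = fp16 - awq
    relative = 100.0 * absolute / fp16
    return absolute, relative


for name, scores in results.items():
    absolute, relative = degradation(scores["fp16"], scores["awq"])
    print(f"{name}: -{absolute:.2f} absolute, -{relative:.1f}% relative")
```

Expressed relative to the FP16 baseline, the hellaswag accuracy falls by about 2%, while the Ko-MTBench average falls by about 25%, which is why the conversational trade-off deserves attention before deployment.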