---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
library_name: transformers
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# Overview

A version of Llama 3.2 1B Instruct quantized with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).
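
For reference, a checkpoint like this can be produced with AutoAWQ's `quantize`/`save_quantized` API. Below is a minimal sketch; the exact quantization settings used for this model are not documented here, so the `quant_config` values are illustrative, not the actual ones.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "meta-llama/Llama-3.2-1B-Instruct"
out_dir = "llama-3.2-1B-Instruct-AWQ"

# Illustrative 4-bit AWQ config; the real settings for this checkpoint may differ
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Run AWQ calibration/quantization, then save the quantized weights and tokenizer
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)
```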

## Use with transformers/autoawq

Start with the following package versions (installable in one step; see the command after the list):
- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
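
Assuming `pip` and a CUDA-enabled `torch` wheel are available for your platform:

```shell
pip install transformers==4.45.1 accelerate==0.34.2 torch==2.3.1 numpy==2.0.0 autoawq==0.2.6
```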

Experimented with:
- OS: Windows
- GPU: Nvidia GeForce RTX 3080 (10 GB)
- CPU: Intel Core i5-9600K
- RAM: 32 GB

### For CUDA users

**AutoAWQ**

NOTE: this example uses `fuse_layers=True` to fuse the attention and MLP layers together for faster inference.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Declare prompt
prompt = "You're standing on the surface of the Earth. "\
         "You walk one mile south, one mile west and one mile north. "\
         "You end up exactly where you started. Where are you?"

# Tokenize the prompt and move the input ids to the GPU
tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output in a streaming fashion
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
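
Since this is an instruction-tuned checkpoint, wrapping the prompt in the model's chat template (rather than tokenizing raw text) may give better results. A minimal variant of the tokenization step above:

```python
# Build input ids via the chat template shipped with the tokenizer
messages = [{"role": "user", "content": prompt}]
tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).cuda()
```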

**Transformers**

```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = "You're standing on the surface of the Earth. "\
         "You walk one mile south, one mile west and one mile north. "\
         "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
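
The streamer prints tokens to stdout as they are generated; `generation_output` still holds the full sequence of token ids (prompt included), so the complete text can also be recovered afterwards:

```python
# Decode the full generated sequence to a string
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))
```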

#### Issue/Solution
- `torch.from_numpy` fails
  - This is likely because `torch==2.3.1` was compiled against NumPy 1.x and is incompatible with `numpy==2.0.0`. Since AutoAWQ pins torch 2.3.1 rather than the most recent release, the failure surfaces in `_get_perms()` in `marlin.py`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
  - Solution: `_get_perms()` hops from tensors to NumPy arrays (CPU) and back to tensors (GPU); those operations can be expressed directly with tensors, without NumPy, which (temporarily) resolves the `from_numpy()` issue. The patched function is below.
```python
# Patched _get_perms() for awq/modules/linear/marlin.py
# (torch is already imported at the top of that module)
def _get_perms():
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # Build the permutation directly as a torch tensor so that
    # torch.from_numpy() is never called
    # perm = np.array(perm)
    perm = torch.asarray(perm)
    # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
    perm = perm.reshape((-1, 8))[:, interleave].ravel()
    # perm = torch.from_numpy(perm)
    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
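
An alternative (untested here) workaround is to pin NumPy below 2.0, e.g. `pip install "numpy<2"`, which should avoid the incompatibility without patching `marlin.py`.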