---

library_name: peft
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---


# Model Card for WavGPT-2

<!-- Provide a quick summary of what the model is/does. -->

WavGPT-2 is a PEFT fine-tune of Qwen/Qwen2.5-7B-Instruct for multilingual text generation, released under the Apache-2.0 license.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** hack337
- **Model type:** qwen2
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/Hack337/WavGPT-2
- **Demo (WavGPT-1.0):** https://huggingface.co/spaces/Hack337/WavGPT

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Hack337/WavGPT-2",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Hack337/WavGPT-2")

# Build a chat prompt with the model's chat template
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "Вы очень полезный помощник."},  # "You are a very helpful assistant."
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate a response and strip the prompt tokens from the output
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
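
Because the metadata lists `library_name: peft` with `Qwen/Qwen2.5-7B-Instruct` as the base model, the repository can also be loaded as an adapter on top of the base checkpoint. A minimal sketch, assuming the repository hosts a PEFT (e.g. LoRA) adapter rather than fully merged weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base checkpoint listed in the card metadata
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Attach the WavGPT-2 adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Hack337/WavGPT-2")
tokenizer = AutoTokenizer.from_pretrained("Hack337/WavGPT-2")
```

Generation then proceeds exactly as in the snippet above.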

Use the code below to get started with the model on an Intel NPU via `intel_npu_acceleration_library`.

```python
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM
import torch

# Load the NPU-optimized model without LoRA
model = NPUModelForCausalLM.from_pretrained(
    "Hack337/WavGPT-2",
    use_cache=True,
    dtype=torch.float16  # use float16 on the NPU
).eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Hack337/WavGPT-2")
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Prompt handling
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "Вы очень полезный помощник."},  # "You are a very helpful assistant."
    {"role": "user", "content": prompt}
]

# Convert to a text format compatible with the model
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prefix = tokenizer([text], return_tensors="pt")["input_ids"].to("npu")

# Generation configuration
generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

# Run inference on the NPU
print("Run inference")
_ = model.generate(**generation_kwargs)
```
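
The NPU snippet loads the repository as a standalone model ("without LoRA"). If you are instead working from the base model plus the PEFT adapter, the adapter can be folded into the base weights first with peft's `merge_and_unload`. A minimal sketch, assuming a LoRA-style adapter and an output directory name of your choosing:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, "Hack337/WavGPT-2")

# Merge the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("WavGPT-2-merged")  # hypothetical local output path
AutoTokenizer.from_pretrained("Hack337/WavGPT-2").save_pretrained("WavGPT-2-merged")
```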

### Framework versions

- PEFT 0.11.1
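
To confirm your environment matches, a quick runtime check (assumes peft and transformers are already installed):

```python
# Print installed library versions; this card was produced with PEFT 0.11.1
import peft
import transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
```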