Ngixdev committed
Commit 2dad00a · verified · 1 Parent(s): d7860c8

Switch to Gradio + ZeroGPU with llama-cpp-python

Files changed (4):
  1. Dockerfile +0 -22
  2. README.md +16 -83
  3. app.py +199 -0
  4. requirements.txt +4 -0
Dockerfile DELETED
@@ -1,22 +0,0 @@
-FROM ghcr.io/ggml-org/llama.cpp:full
-
-WORKDIR /app
-
-RUN apt update && apt install -y python3-pip
-RUN pip install -U huggingface_hub
-
-RUN python3 -c 'from huggingface_hub import hf_hub_download; \
-    repo="HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive"; \
-    hf_hub_download(repo_id=repo, filename="Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf", local_dir="/app"); \
-    hf_hub_download(repo_id=repo, filename="mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf", local_dir="/app")'
-
-CMD ["--server", \
-     "-m", "/app/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf", \
-     "--mmproj", "/app/mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf", \
-     "--host", "0.0.0.0", \
-     "--port", "7860", \
-     "-t", "2", \
-     "--cache-type-k", "q8_0", \
-     "--cache-type-v", "iq4_nl", \
-     "-c", "32768", \
-     "-n", "8192"]
README.md CHANGED
@@ -3,108 +3,41 @@ title: Qwen API
 emoji: 🤖
 colorFrom: blue
 colorTo: purple
-sdk: docker
+sdk: gradio
+sdk_version: 5.29.0
+app_file: app.py
 pinned: false
 license: apache-2.0
 tags:
 - qwen
 - uncensored
 - llama-cpp
-- gguf
-- openai-compatible
-suggested_hardware: a10g-small
+- zerogpu
 ---
 
 # Qwen3.5-9B Uncensored API
 
-OpenAI-compatible API for [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive).
+API interface for [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive).
 
 ## Features
 
-- 9B parameters with 262K context window
-- Fully uncensored (0/465 refusals)
-- Multimodal capable (text, image, video)
-- Supports 201 languages
+- 9B parameters, fully uncensored (0/465 refusals)
 - Q4_K_M quantization via llama.cpp
-- OpenAI-compatible API
+- Running on ZeroGPU
 
 ## API Usage
 
-### Python (OpenAI SDK)
-
 ```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="https://ngixdev-qwen-api.hf.space/v1",
-    api_key="not-needed"
-)
+from gradio_client import Client
 
-response = client.chat.completions.create(
-    model="qwen",
-    messages=[
-        {"role": "system", "content": "You are a helpful assistant."},
-        {"role": "user", "content": "Hello, who are you?"}
-    ],
+client = Client("Ngixdev/qwen-api")
+result = client.predict(
+    prompt="Hello!",
+    system_prompt="You are helpful.",
     temperature=0.7,
-    max_tokens=1024
+    top_p=0.8,
+    max_tokens=1024,
+    api_name="/api_generate"
 )
-
-print(response.choices[0].message.content)
+print(result)
 ```
-
-### cURL
-
-```bash
-curl https://ngixdev-qwen-api.hf.space/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "qwen",
-    "messages": [
-      {"role": "system", "content": "You are a helpful assistant."},
-      {"role": "user", "content": "Hello!"}
-    ],
-    "temperature": 0.7,
-    "max_tokens": 1024
-  }'
-```
-
-### Streaming
-
-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="https://ngixdev-qwen-api.hf.space/v1",
-    api_key="not-needed"
-)
-
-stream = client.chat.completions.create(
-    model="qwen",
-    messages=[{"role": "user", "content": "Tell me a story"}],
-    stream=True
-)
-
-for chunk in stream:
-    if chunk.choices[0].delta.content:
-        print(chunk.choices[0].delta.content, end="")
-```
-
-## Endpoints
-
-| Endpoint | Description |
-|----------|-------------|
-| `/v1/chat/completions` | Chat completions (OpenAI-compatible) |
-| `/v1/completions` | Text completions |
-| `/v1/models` | List available models |
-| `/health` | Health check |
-
-## Parameters
-
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| messages | array | required | Chat messages |
-| temperature | float | 0.7 | Sampling temperature (0.0-2.0) |
-| top_p | float | 0.8 | Nucleus sampling (0.0-1.0) |
-| max_tokens | int | 1024 | Maximum tokens to generate |
-| stream | bool | false | Enable streaming response |
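Unlike the old OpenAI-compatible endpoint, the new `/api_generate` endpoint reports failures in-band through a `status` field rather than raising an HTTP error, so callers should check it before reading `response`. A minimal sketch of that check (the helper name `unwrap_result` is hypothetical; the envelope shape matches the dict that `api_generate` in app.py returns):

```python
def unwrap_result(result: dict) -> str:
    """Return generated text from an api_generate envelope, or raise on error."""
    if result.get("status") != "success":
        # api_generate puts the exception text under the "error" key
        raise RuntimeError(result.get("error", "unknown error"))
    return result["response"]

# Sample envelopes in the shape api_generate produces:
ok = {"response": "Hi there!", "status": "success"}
bad = {"response": None, "status": "error", "error": "out of memory"}

print(unwrap_result(ok))  # → Hi there!
```

Wrapping `client.predict(...)` in a helper like this keeps the error path explicit instead of silently passing `None` downstream.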
app.py ADDED
@@ -0,0 +1,199 @@
+import os
+import gradio as gr
+import spaces
+from huggingface_hub import hf_hub_download
+
+MODEL_REPO = "HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive"
+MODEL_FILE = "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf"
+
+model_path = None
+llm = None
+
+def download_model():
+    global model_path
+    if model_path is None:
+        print("Downloading model...")
+        model_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)
+        print(f"Model downloaded: {model_path}")
+    return model_path
+
+def get_llm():
+    global llm
+    if llm is None:
+        from llama_cpp import Llama
+        path = download_model()
+        print("Loading model into GPU...")
+        llm = Llama(
+            model_path=path,
+            n_ctx=8192,
+            n_gpu_layers=-1,
+            verbose=False,
+        )
+        print("Model loaded!")
+    return llm
+
+
+def format_messages(message: str, history: list, system_prompt: str = "") -> str:
+    formatted = ""
+    if system_prompt.strip():
+        formatted += f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
+    for user_msg, assistant_msg in history:
+        if user_msg:
+            formatted += f"<|im_start|>user\n{user_msg}<|im_end|>\n"
+        if assistant_msg:
+            formatted += f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n"
+    formatted += f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n"
+    return formatted
+
+
+@spaces.GPU(duration=120)
+def generate_response(
+    message: str,
+    history: list,
+    system_prompt: str = "",
+    temperature: float = 0.7,
+    top_p: float = 0.8,
+    top_k: int = 20,
+    max_tokens: int = 1024,
+) -> str:
+    model = get_llm()
+    prompt = format_messages(message, history, system_prompt)
+
+    output = model(
+        prompt,
+        max_tokens=max_tokens,
+        temperature=temperature,
+        top_p=top_p,
+        top_k=top_k,
+        stop=["<|im_end|>", "<|im_start|>"],
+    )
+    return output["choices"][0]["text"].strip()
+
+
+@spaces.GPU(duration=120)
+def api_generate(
+    prompt: str,
+    system_prompt: str = "",
+    temperature: float = 0.7,
+    top_p: float = 0.8,
+    max_tokens: int = 1024,
+) -> dict:
+    """
+    API endpoint for text generation.
+
+    Args:
+        prompt: The user prompt/question
+        system_prompt: Optional system instruction
+        temperature: Sampling temperature (0.0-2.0)
+        top_p: Nucleus sampling parameter (0.0-1.0)
+        max_tokens: Maximum tokens to generate
+
+    Returns:
+        Dictionary with 'response' key containing generated text
+    """
+    try:
+        response = generate_response(
+            message=prompt,
+            history=[],
+            system_prompt=system_prompt,
+            temperature=temperature,
+            top_p=top_p,
+            max_tokens=max_tokens,
+        )
+        return {"response": response, "status": "success"}
+    except Exception as e:
+        return {"response": None, "status": "error", "error": str(e)}
+
+
+with gr.Blocks(title="Qwen3.5-9B Uncensored API", theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        """
+# 🤖 Qwen3.5-9B Uncensored API
+
+Powered by [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive)
+
+- 9B parameters, fully uncensored (0/465 refusals)
+- Q4_K_M quantization via llama.cpp on ZeroGPU
+"""
+    )
+
+    with gr.Tab("💬 Chat"):
+        chatbot = gr.Chatbot(height=450, label="Conversation")
+
+        with gr.Row():
+            msg = gr.Textbox(label="Message", placeholder="Type here...", scale=4, lines=2)
+            submit_btn = gr.Button("Send", variant="primary", scale=1)
+
+        with gr.Accordion("⚙️ Settings", open=False):
+            system_prompt = gr.Textbox(label="System Prompt", placeholder="Optional", lines=2)
+            with gr.Row():
+                temperature = gr.Slider(0.0, 2.0, 0.7, step=0.1, label="Temperature")
+                top_p = gr.Slider(0.0, 1.0, 0.8, step=0.05, label="Top P")
+            with gr.Row():
+                top_k = gr.Slider(1, 100, 20, step=1, label="Top K")
+                max_tokens = gr.Slider(64, 2048, 1024, step=64, label="Max Tokens")
+
+        clear_btn = gr.Button("🗑️ Clear")
+
+        def user_submit(message, history):
+            return "", history + [[message, None]]
+
+        def bot_response(history, system_prompt, temperature, top_p, top_k, max_tokens):
+            if not history:
+                return history
+            message = history[-1][0]
+            history_without_last = history[:-1]
+            response = generate_response(message, history_without_last, system_prompt, temperature, top_p, top_k, max_tokens)
+            history[-1][1] = response
+            return history
+
+        msg.submit(user_submit, [msg, chatbot], [msg, chatbot]).then(
+            bot_response, [chatbot, system_prompt, temperature, top_p, top_k, max_tokens], chatbot
+        )
+        submit_btn.click(user_submit, [msg, chatbot], [msg, chatbot]).then(
+            bot_response, [chatbot, system_prompt, temperature, top_p, top_k, max_tokens], chatbot
+        )
+        clear_btn.click(lambda: [], None, chatbot)
+
+    with gr.Tab("🔌 API"):
+        gr.Markdown(
+            """
+## API Usage
+
+```python
+from gradio_client import Client
+
+client = Client("Ngixdev/qwen-api")
+result = client.predict(
+    prompt="Hello!",
+    system_prompt="You are helpful.",
+    temperature=0.7,
+    top_p=0.8,
+    max_tokens=1024,
+    api_name="/api_generate"
+)
+print(result)
+```
+"""
+        )
+
+        with gr.Row():
+            with gr.Column():
+                api_prompt = gr.Textbox(label="Prompt", lines=3)
+                api_system = gr.Textbox(label="System Prompt", lines=2)
+                with gr.Row():
+                    api_temp = gr.Slider(0.0, 2.0, 0.7, step=0.1, label="Temperature")
+                    api_top_p = gr.Slider(0.0, 1.0, 0.8, step=0.05, label="Top P")
+                api_max_tokens = gr.Slider(64, 2048, 1024, step=64, label="Max Tokens")
+                api_submit = gr.Button("Generate", variant="primary")
+            with gr.Column():
+                api_output = gr.JSON(label="Response")
+
+        api_submit.click(
+            api_generate,
+            [api_prompt, api_system, api_temp, api_top_p, api_max_tokens],
+            api_output,
+            api_name="api_generate",
+        )
+
+demo.launch()
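app.py builds its ChatML prompt by hand rather than using llama-cpp-python's built-in chat handlers, so the framing is easy to verify offline without a model. The sketch below copies `format_messages` as committed and exercises it with a one-turn history (the sample strings are illustrative):

```python
def format_messages(message: str, history: list, system_prompt: str = "") -> str:
    # Copy of the ChatML prompt builder from app.py
    formatted = ""
    if system_prompt.strip():
        formatted += f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
    for user_msg, assistant_msg in history:
        if user_msg:
            formatted += f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        if assistant_msg:
            formatted += f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n"
    # End with an open assistant turn for the model to complete
    formatted += f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n"
    return formatted

# One prior exchange plus a new user message
prompt = format_messages("How far is the Moon?", [["Hi", "Hello!"]], "Be brief.")
print(prompt)
```

Note that the prompt always ends with an unclosed `<|im_start|>assistant\n` turn, which is why `generate_response` passes `stop=["<|im_end|>", "<|im_start|>"]` to cut generation at the turn boundary.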
requirements.txt ADDED
@@ -0,0 +1,4 @@
+gradio>=4.0.0
+huggingface_hub>=0.20.0
+spaces
+llama-cpp-python>=0.3.0