# Kosmic-35B-A3B-FP8
Kosmic is Prosoft's industrial AI assistant: an FP8-quantized model, fine-tuned for industrial use on top of Qwen3.5-35B-A3B.
๋ชจ๋ธ ์ ๋ณด
| ํญ๋ชฉ | ๊ฐ |
|---|---|
| ๋ฒ ์ด์ค ๋ชจ๋ธ | Qwen/Qwen3.5-35B-A3B |
| ์ด ํ๋ผ๋ฏธํฐ | 35B (ํ์ฑ 3B, MoE 256 experts) |
| ์์ํ | FP8 E4M3, block_size [128, 128] |
| ์์ํ ํฌ๋งท | Qwen ๊ณต์ FP8๊ณผ ๋์ผ (quant_method: fp8) |
| ๋ชจ๋ธ ํฌ๊ธฐ | ~33 GB |
| ๋ผ์ด์ ์ค | Apache 2.0 |
์ฌ์ฉ ๋ฐฉ๋ฒ (vLLM)
vllm serve prosoft0405/Kosmic-35B-A3B-FP8 \
--trust-remote-code \
--language-model-only \
--gpu-memory-utilization 0.85 \
--reasoning-parser qwen3
์ฌ์ฉ ๋ฐฉ๋ฒ (Docker)
docker run -d --gpus all --ipc host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
prosoft0405/Kosmic-35B-A3B-FP8 \
--served-model-name kosmic-35b \
--language-model-only \
--gpu-memory-utilization 0.85 \
--reasoning-parser qwen3
## Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="kosmic-35b",
    messages=[
        {"role": "system", "content": "You are Kosmic, an industrial AI assistant developed by Prosoft."},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
ํ๋์จ์ด ์๊ตฌ์ฌํญ
- NVIDIA GPU 40GB+ VRAM (RTX 4090, A100, H100, RTX PRO 6000 ๋ฑ)
- DGX Spark (128GB ํตํฉ๋ฉ๋ชจ๋ฆฌ) ์ง์
- vLLM 0.17.0+ (nightly ๊ถ์ฅ)
- transformers 5.2.0+
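The 40 GB figure follows from simple arithmetic: vLLM pre-allocates `--gpu-memory-utilization` × total VRAM, and the ~33 GB of FP8 weights must fit inside that pool, with whatever is left over becoming KV-cache space. A rough sketch (activation and framework overhead are ignored; the numbers come from this card, not measurements):

```python
# Rough VRAM budget check for the serve commands in this card.
# Assumption: vLLM pre-allocates UTIL * VRAM as its memory pool and
# the ~33 GB of FP8 weights must fit inside it; the remainder is
# available for KV cache. Overhead beyond weights is ignored here.
MODEL_GB = 33.0   # approximate FP8 checkpoint size (see table above)
UTIL = 0.85       # --gpu-memory-utilization used in the commands above

def fits(vram_gb: float) -> bool:
    """True if the weights fit in vLLM's pre-allocated memory pool."""
    return UTIL * vram_gb >= MODEL_GB

for gb in (24, 40, 80, 128):
    budget = UTIL * gb
    kv = max(budget - MODEL_GB, 0.0)
    print(f"{gb:>3} GB VRAM -> {budget:5.1f} GB pool, "
          f"{'fits' if fits(gb) else 'too small'}, ~{kv:.1f} GB for KV cache")
```

By this estimate a 24 GB card cannot hold the weights at all, while 40 GB leaves only about 1 GB of KV cache, which is why larger cards (or DGX Spark's unified memory) are more comfortable in practice.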
์์ํ ๋ฐฉ์
Qwen/Qwen3.5-35B-A3B-FP8 ๊ณต์ ๋ชจ๋ธ๊ณผ ๋์ผํ ๋ค์ดํฐ๋ธ FP8 ์์ํ:
- FP8 E4M3 weight quantization with block-wise scaling (128x128)
weight_scale_invper block- GDN linear_attn, self_attn, shared_expert, mlp.gate ๋ฑ์ BF16 ์ ์ง
- MoE routed expert weights๋ง FP8 ์์ํ
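The block-wise scheme can be sketched in NumPy. This is only a simulation of the scaling side, not the checkpoint format: real FP8 storage also rounds each value to an E4M3 code, which this sketch skips. The 128x128 block size, the E4M3 max of 448, and the `weight_scale_inv` name follow the description above; the function names are illustrative.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 128            # block_size [128, 128] from the model config

def blockwise_fp8_quant(w: np.ndarray):
    """Simulate block-wise FP8 E4M3 weight quantization.

    Each 128x128 block gets its own scale so the block's absolute
    maximum maps onto the E4M3 range; the per-block dequantization
    factor is stored separately (as weight_scale_inv in the checkpoint).
    """
    rows, cols = w.shape
    n_br = (rows + BLOCK - 1) // BLOCK
    n_bc = (cols + BLOCK - 1) // BLOCK
    q = np.empty_like(w, dtype=np.float32)
    scale_inv = np.empty((n_br, n_bc), dtype=np.float32)
    for i in range(n_br):
        for j in range(n_bc):
            rs, cs = slice(i * BLOCK, (i + 1) * BLOCK), slice(j * BLOCK, (j + 1) * BLOCK)
            blk = w[rs, cs]
            amax = np.abs(blk).max()
            s = FP8_E4M3_MAX / amax if amax > 0 else 1.0
            # Scale into [-448, 448]; real FP8 would round to E4M3 here.
            q[rs, cs] = np.clip(blk * s, -FP8_E4M3_MAX, FP8_E4M3_MAX)
            scale_inv[i, j] = 1.0 / s  # stored as weight_scale_inv
    return q, scale_inv

def dequant(q: np.ndarray, scale_inv: np.ndarray) -> np.ndarray:
    """Recover the weight by multiplying each block by its scale_inv."""
    out = np.empty_like(q)
    n_br, n_bc = scale_inv.shape
    for i in range(n_br):
        for j in range(n_bc):
            rs, cs = slice(i * BLOCK, (i + 1) * BLOCK), slice(j * BLOCK, (j + 1) * BLOCK)
            out[rs, cs] = q[rs, cs] * scale_inv[i, j]
    return out
```

Per-block scaling is what lets outlier-heavy expert weights survive the narrow E4M3 range: a single large value only distorts the scale of its own 128x128 block rather than the whole tensor.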