Kosmic-35B-A3B-FP8

Kosmic is Prosoft's industrial AI assistant: Qwen3.5-35B-A3B fine-tuned for industrial use and then quantized to FP8.

Model Information

Item	Value
Base model	Qwen/Qwen3.5-35B-A3B
Total parameters	35B (3B active, MoE with 256 experts)
Quantization	FP8 E4M3, block_size [128, 128]
Quantization format	Same as the official Qwen FP8 release (quant_method: fp8)
Model size	~33 GB
License	Apache 2.0
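
To confirm that a downloaded copy matches the quantization settings in the table, the values can be compared against the checkpoint's config.json. The sketch below is illustrative only: `check_fp8_config` is a hypothetical helper, and apart from `quant_method`, the key names (`quantization_config`, `fmt`, `weight_block_size`) are assumptions based on how other FP8 checkpoints lay out their config, not something this model card states.

```python
def check_fp8_config(config):
    """Return a list of mismatches between `config` and the table values."""
    # ASSUMPTION: the key names `quantization_config`, `fmt` and
    # `weight_block_size` are guesses; only `quant_method: fp8` comes
    # from the model card itself.
    qcfg = config.get("quantization_config") or {}
    expected = {
        "quant_method": "fp8",
        "fmt": "e4m3",
        "weight_block_size": [128, 128],
    }
    problems = []
    for key, want in expected.items():
        got = qcfg.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


# Hand-written sample standing in for the parsed contents of config.json.
sample = {
    "quantization_config": {
        "quant_method": "fp8",
        "fmt": "e4m3",
        "weight_block_size": [128, 128],
    }
}
print(check_fp8_config(sample) or "quantization settings match")
```

If the real config nests these keys differently, only the lookup on the first line of the function would need to change.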

Usage (vLLM)

vllm serve prosoft0405/Kosmic-35B-A3B-FP8 \
  --trust-remote-code \
  --language-model-only \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser qwen3

Usage (Docker)

docker run -d --gpus all --ipc host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  prosoft0405/Kosmic-35B-A3B-FP8 \
  --served-model-name kosmic-35b \
  --language-model-only \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser qwen3
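
Loading a checkpoint of this size takes a while, so a client script may want to wait until the server answers before sending requests. The following is a minimal sketch, assuming the server listens on localhost:8000 as in the commands above and exposes vLLM's /health endpoint. `http_ok` and `wait_until_ready` are hypothetical helper names, and the attempt count and interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url="http://localhost:8000/health", attempts=60,
                     interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` are used up."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

Calling `wait_until_ready()` with the defaults polls for roughly five minutes before giving up. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.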

Python (OpenAI SDK)

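from openai import OpenAI

# A placeholder key is enough unless the server was configured with an API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    # "kosmic-35b" matches --served-model-name in the Docker command. If the server
    # was started with the plain `vllm serve` command, use
    # "prosoft0405/Kosmic-35B-A3B-FP8" instead.
    model="kosmic-35b",
    messages=[
        {"role": "system", "content": "You are Kosmic, an industrial AI assistant developed by Prosoft."},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)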
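
Because both server commands pass --reasoning-parser qwen3, the model's thinking is returned separately from the final answer. The sketch below shows one way to read both parts. `split_reasoning` is a hypothetical helper, and the attribute names are an assumption: vLLM releases have exposed the reasoning text as `reasoning_content`, and newer ones may use `reasoning`, so the helper tries both.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message object."""
    # ASSUMPTION: the reasoning text lives in `reasoning_content` or
    # `reasoning`, depending on the vLLM version serving the model.
    reasoning = (getattr(message, "reasoning_content", None)
                 or getattr(message, "reasoning", None)
                 or "")
    # `content` can be None when the token budget ran out during reasoning.
    answer = message.content or ""
    return reasoning.strip(), answer.strip()


# Stand-in for response.choices[0].message so the sketch runs without a server.
demo = SimpleNamespace(reasoning_content="The user asks who I am.",
                       content="I am Kosmic.")
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

If the answer comes back empty, the reasoning may have used up the whole max_tokens=256 budget, and raising the limit is the first thing to try.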

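Hardware and Software Requirements

NVIDIA GPU with 40 GB+ VRAM (A100, H100, RTX PRO 6000, etc.); 24 GB cards such as the RTX 4090 fall below this on a single GPU
DGX Spark (128 GB unified memory) is supported
vLLM 0.17.0+ (nightly recommended)
transformers 5.2.0+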
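
The 40 GB+ figure follows from simple arithmetic: vLLM may use at most VRAM × gpu-memory-utilization, and the ~33 GB of weights must fit inside that budget with room left for the KV cache. The sketch below is a rough estimate only. `kv_cache_headroom_gb` is a hypothetical helper, and it ignores activation memory, CUDA graph overhead, and the difference between GB and GiB.

```python
def kv_cache_headroom_gb(vram_gb, weights_gb=33.0, gpu_memory_utilization=0.85):
    """Memory left for the KV cache after loading the weights, in GB."""
    # Budget vLLM is allowed to use on this GPU.
    budget_gb = vram_gb * gpu_memory_utilization
    return budget_gb - weights_gb


for name, vram_gb in [("24 GB card", 24), ("40 GB card", 40), ("80 GB card", 80)]:
    headroom = kv_cache_headroom_gb(vram_gb)
    status = "fits" if headroom > 0 else "does not fit"
    print(f"{name}: {headroom:+.1f} GB headroom ({status})")
```

On a 40 GB card the headroom at 0.85 is small, so raising --gpu-memory-utilization or limiting the context length may be necessary in practice.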

Quantization Method

Native FP8 quantization identical to the official Qwen/Qwen3.5-35B-A3B-FP8 model:

FP8 E4M3 weight quantization with block-wise scaling (128x128)
One weight_scale_inv value per block
GDN linear_attn, self_attn, shared_expert, mlp.gate, etc. are kept in BF16
Only the MoE routed expert weights are quantized to FP8
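
The block-wise scheme in the list above can be illustrated with a small NumPy simulation. This is a sketch under stated assumptions, not the tool used to produce this checkpoint. NumPy has no FP8 dtype, so `round_to_e4m3` emulates E4M3 rounding in floating point, all function names are hypothetical, and the stored per-block value is assumed to be the factor that weights are multiplied by on dequantization (the convention suggested by the name weight_scale_inv).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3
BLOCK = 128


def round_to_e4m3(x):
    """Round values to the nearest E4M3-representable number (simulation)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(x)             # |x| = m * 2**exp with m in [0.5, 1)
    exp = np.maximum(exp - 1, -6)    # floor(log2|x|), clamped at the subnormal range
    step = np.ldexp(1.0, exp - 3)    # grid spacing for a 3-bit mantissa
    return np.round(x / step) * step


def quantize_blockwise(w, block=BLOCK):
    """Quantize a 2-D weight matrix with one scale per block x block tile."""
    rows, cols = w.shape
    if rows % block or cols % block:
        raise ValueError("this sketch assumes dimensions divisible by the block size")
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    # ASSUMPTION: weight_scale_inv is the multiplier applied on dequantization.
    scale_inv = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = round_to_e4m3(tiles / scale_inv)
    return q.reshape(rows, cols), scale_inv.reshape(rows // block, cols // block)


def dequantize_blockwise(q, scale_inv, block=BLOCK):
    """Expand each per-block scale over its tile and multiply."""
    full = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    return q * full


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 384)).astype(np.float32)
q, weight_scale_inv = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, weight_scale_inv)
print("scale grid shape:", weight_scale_inv.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

With a 3-bit mantissa, each reconstructed weight is off by at most about 1/16 of its own magnitude, and keeping a separate scale per 128x128 tile prevents a single large outlier from degrading the precision of the whole matrix.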
