Instructions for using prosoft0405/Kosmic-122B-A10B-FP8 with libraries and local apps.

Libraries

How to use prosoft0405/Kosmic-122B-A10B-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

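# The messages below contain an image, so the multimodal chat task is needed;
# the "text-generation" pipeline does not accept the `text=` keyword used here.
pipe = pipeline("image-text-to-text", model="prosoft0405/Kosmic-122B-A10B-FP8")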
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("prosoft0405/Kosmic-122B-A10B-FP8")
model = AutoModelForMultimodalLM.from_pretrained("prosoft0405/Kosmic-122B-A10B-FP8", device_map="auto")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prosoft0405/Kosmic-122B-A10B-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prosoft0405/Kosmic-122B-A10B-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prosoft0405/Kosmic-122B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use prosoft0405/Kosmic-122B-A10B-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prosoft0405/Kosmic-122B-A10B-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prosoft0405/Kosmic-122B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prosoft0405/Kosmic-122B-A10B-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prosoft0405/Kosmic-122B-A10B-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use prosoft0405/Kosmic-122B-A10B-FP8 with Docker Model Runner:
```
docker model run hf.co/prosoft0405/Kosmic-122B-A10B-FP8
```

Kosmic-122B-A10B-FP8

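Kosmic is a 122B-parameter AI assistant developed by Prosoft. It is based on Qwen3.5-122B-A10B, fine-tuned for industrial use, and then quantized to FP8 (E4M3) for efficient deployment.

Model Information

Item	Value
Base model	Qwen/Qwen3.5-122B-A10B
Total parameters	122B (active parameters: 10B, MoE)
Quantization	FP8 E4M3, block size [128, 128]
Model size	Approx. 118 GB
Architecture	48 hybrid layers: 36 GDN (Gated Delta Net) + 12 Full Attention, all MoE
Number of experts	256 (8 routed per token + 1 shared)
Maximum context	262,144 tokens
Supported languages	Korean, English, multilingual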
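
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below only restates the table values; the constant names and the `summarize` helper are illustrative and not part of any released tooling.

```python
# Figures copied from the model information table above.
TOTAL_PARAMS_B = 122      # total parameters, in billions
ACTIVE_PARAMS_B = 10      # parameters active per token, in billions
GDN_LAYERS = 36           # Gated Delta Net layers
FULL_ATTN_LAYERS = 12     # full-attention layers
NUM_EXPERTS = 256         # routed experts (table value)
ROUTED_PER_TOKEN = 8      # routed experts selected per token
SHARED_EXPERTS = 1        # shared expert, always applied


def summarize() -> dict:
    """Derive a few ratios from the table; pure arithmetic, no model access."""
    return {
        "total_layers": GDN_LAYERS + FULL_ATTN_LAYERS,
        "gdn_per_full_attention": GDN_LAYERS / FULL_ATTN_LAYERS,
        "active_param_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        "experts_used_per_token": ROUTED_PER_TOKEN + SHARED_EXPERTS,
        "routed_expert_fraction": ROUTED_PER_TOKEN / NUM_EXPERTS,
    }


for name, value in summarize().items():
    print(f"{name}: {value:.4g}")
```

Roughly 8% of the parameters are active for any given token, which is why the per-token compute is much closer to a 10B dense model than to a 122B one.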

Usage

Serving with vLLM

vllm serve prosoft0405/Kosmic-122B-A10B-FP8 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 32768
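
The command above caps the serving window at 32,768 tokens, well below the model's 262,144-token maximum. A client can check a request against that cap before sending it. This is a minimal sketch that assumes prompt and completion tokens count against the same limit; `fits_context` and `max_completion_budget` are hypothetical helper names, not vLLM APIs.

```python
MAX_MODEL_LEN = 32768  # value passed to --max-model-len in the command above


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """Return True if prompt plus requested completion fits the serving window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= max_model_len


def max_completion_budget(prompt_tokens: int,
                          max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest max_tokens value that still fits, or 0 if the prompt is too long."""
    return max(0, max_model_len - prompt_tokens)
```

For example, a 32,000-token prompt leaves `max_completion_budget(32000)` tokens for the reply, so a request asking for 1,024 new tokens would need to be trimmed or the server restarted with a larger `--max-model-len`.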

API Call Example

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="prosoft0405/Kosmic-122B-A10B-FP8",
    messages=[
        {"role": "system", "content": "You are Kosmic, an AI assistant developed by Prosoft."},
        # Korean prompt: "Hello! Please introduce yourself."
        {"role": "user", "content": "안녕하세요! 자기소개 해주세요."}
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
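
Where the `openai` package is not available, the same request can be sent with the Python standard library alone. The sketch below assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on port 8000, as in the example above; `build_chat_request` and `extract_reply` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same endpoint as the example above
MODEL = "prosoft0405/Kosmic-122B-A10B-FP8"


def build_chat_request(messages, max_tokens=1024, temperature=0.7,
                       base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

With the server running, passing the request to `urllib.request.urlopen(...)` and the response bytes to `extract_reply` should return the same text as the client example.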

Using Ollama

ollama run prosoft0405/kosmic-122b

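Quantization Method

FP8 (E4M3) block-wise quantization (block size 128)
Same format as the official Qwen/Qwen3.5-122B-A10B-FP8 release
In-house fine-tuning was merged into the base weights before quantization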
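
The block-wise scheme listed above can be illustrated with a small pure-Python simulation. This is a minimal sketch, not the code used to produce this checkpoint: it assumes one scale per block × block tile computed as max-abs / 448 (the largest finite E4M3 value), and it emulates E4M3 rounding in ordinary floats. The function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 (FN) format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (simulated in float)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)             # saturate instead of overflowing
    exp = max(math.frexp(mag)[1] - 1, -6)   # floor(log2(mag)), clamped at min normal
    step = 2.0 ** (exp - 3)                 # spacing with 3 mantissa bits
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_blockwise(weights, block=128):
    """Quantize a 2D list per (block x block) tile; return (quantized, scales)."""
    rows, cols = len(weights), len(weights[0])
    n_br, n_bc = math.ceil(rows / block), math.ceil(cols / block)
    scales = [[1.0] * n_bc for _ in range(n_br)]
    quant = [[0.0] * cols for _ in range(rows)]
    for br in range(n_br):
        for bc in range(n_bc):
            r_range = range(br * block, min((br + 1) * block, rows))
            c_range = range(bc * block, min((bc + 1) * block, cols))
            amax = max(abs(weights[r][c]) for r in r_range for c in c_range)
            scale = amax / E4M3_MAX if amax > 0 else 1.0
            scales[br][bc] = scale
            for r in r_range:
                for c in c_range:
                    quant[r][c] = round_to_e4m3(weights[r][c] / scale)
    return quant, scales


def dequantize_blockwise(quant, scales, block=128):
    """Invert quantize_blockwise: multiply each tile by its scale."""
    return [[q * scales[r // block][c // block] for c, q in enumerate(row)]
            for r, row in enumerate(quant)]
```

Because each tile has its own scale, a single outlier weight only reduces precision inside its own 128 × 128 block rather than across the whole tensor.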

Hardware Requirements

Component	Minimum requirement
GPU VRAM	Approx. 120 GB (TP=2: 60 GB × 2 GPUs)
Recommended GPUs	RTX PRO 6000 Blackwell × 2, A100 80GB × 2, H100 × 2, etc.
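
The per-GPU figure in the table follows from dividing the checkpoint size by the tensor-parallel degree. The sketch below is a rough estimate only: it assumes weights shard evenly across ranks and ignores KV cache, activations, and framework overhead, all of which need additional VRAM. The helper names are hypothetical.

```python
MODEL_SIZE_GB = 118.0  # approximate checkpoint size from the model information table


def per_gpu_weight_gb(model_size_gb: float, tensor_parallel: int) -> float:
    """Weight memory per GPU, assuming weights shard evenly across TP ranks."""
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be >= 1")
    return model_size_gb / tensor_parallel


def headroom_gb(gpu_vram_gb: float, model_size_gb: float, tensor_parallel: int) -> float:
    """VRAM left per GPU for KV cache and activations (negative: weights do not fit)."""
    return gpu_vram_gb - per_gpu_weight_gb(model_size_gb, tensor_parallel)


for name, vram in [("A100 80GB", 80.0), ("RTX PRO 6000 Blackwell (96 GB)", 96.0)]:
    print(f"{name} x2: {headroom_gb(vram, MODEL_SIZE_GB, 2):.0f} GB headroom per GPU")
```

The remaining headroom bounds how much KV cache the server can allocate, which is one reason the serving example limits `--max-model-len` rather than using the full context window.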

License

This model is distributed under the Apache 2.0 license.

Credits

Base model: Qwen Team, Alibaba Cloud
Fine-tuning and quantization: Prosoft

Downloads last month: 21

Safetensors
Model size: 122B params
Tensor types: F32, BF16, F8_E4M3

Model tree for prosoft0405/Kosmic-122B-A10B-FP8
Base model: Qwen/Qwen3.5-122B-A10B
Quantized (148): this model