Instructions to use kai-os/Carnice-9b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use kai-os/Carnice-9b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="kai-os/Carnice-9b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the generated reply is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("kai-os/Carnice-9b")
model = AutoModelForCausalLM.from_pretrained("kai-os/Carnice-9b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use kai-os/Carnice-9b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "kai-os/Carnice-9b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kai-os/Carnice-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
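
The same request can be built from Python with only the standard library. This is a minimal sketch, assuming the server started above is listening on `http://localhost:8000`; the helper names `build_chat_request` and `send` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumed default from the curl example
    model: str = "kai-os/Carnice-9b",
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON body (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("What is the capital of France?"))["choices"][0]["message"]["content"]` should hold the reply text. Passing `base_url="http://localhost:30000"` targets the SGLang port used further down.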

Use Docker

docker model run hf.co/kai-os/Carnice-9b

SGLang

How to use kai-os/Carnice-9b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kai-os/Carnice-9b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kai-os/Carnice-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kai-os/Carnice-9b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kai-os/Carnice-9b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use kai-os/Carnice-9b with Docker Model Runner:
```
docker model run hf.co/kai-os/Carnice-9b
```

Broken config.json for vllm v0.21.0

by Neiko2002 - opened May 16

In vLLM v0.21.0, this model crashes while loading with the following error:

(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306] 
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.21.0
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306]   █▄█▀ █     █     █     █  model   kai-os/Carnice-9b
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306] 
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:240] non-default args: {'model_tag': 'kai-os/Carnice-9b', 'chat_template': '/root/.cache/huggingface/hub/models--Lorbus--Qwen3.6-27B-int4-AutoRound/snapshots/c3aea2d531678621989e5e2db034e32b22536e79/chat_template.jinja', 'default_chat_template_kwargs': {'enable_thinking': False}, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'kai-os/Carnice-9b', 'max_model_len': 4096, 'served_model_name': ['qwen-9b'], 'override_generation_config': {'temperature': 0.7, 'top_p': 0.8, 'top_k': 20, 'min_p': 0.0, 'presence_penalty': 1.5, 'repetition_penalty': 1.0}, 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 4, 'enable_chunked_prefill': True, 'async_scheduling': True, 'reasoning_config': ReasoningConfig(reasoning_parser='', reasoning_start_str='<think>', reasoning_end_str='I have to give the solution based on the reasoning directly now.</think>')}
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) INFO 05-16 20:27:12 [model.py:568] Resolved architecture: Qwen3_5ForCausalLM
(APIServer pid=1) INFO 05-16 20:27:12 [model.py:1697] Using max model len 4096
(APIServer pid=1) INFO 05-16 20:27:12 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-16 20:27:12 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) WARNING 05-16 20:27:12 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForCausalLM by default when prefix caching is enabled
(APIServer pid=1) INFO 05-16 20:27:12 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 05-16 20:27:12 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-16 20:27:12 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
ERROR: driverInitFileInfo 578 result=11
ERROR: init 664 result=11
ERROR: init 250 result=11
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 05-16 20:27:18 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=229) INFO 05-16 20:27:24 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='kai-os/Carnice-9b', speculative_config=None, tokenizer='kai-os/Carnice-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen-9b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=229) WARNING 05-16 20:27:24 [multiproc_executor.py:1029] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=229) INFO 05-16 20:27:24 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 05-16 20:27:32 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=284) INFO 05-16 20:27:32 [parallel_state.py:1410] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:47827 backend=nccl
INFO 05-16 20:27:32 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=285) INFO 05-16 20:27:32 [parallel_state.py:1410] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:47827 backend=nccl
(Worker pid=284) INFO 05-16 20:27:33 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=284) WARNING 05-16 20:27:33 [symm_mem.py:66] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=285) WARNING 05-16 20:27:33 [symm_mem.py:66] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=284) INFO 05-16 20:27:33 [parallel_state.py:1723] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=284) INFO 05-16 20:27:33 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=284) INFO 05-16 20:27:33 [gpu_model_runner.py:4857] Starting to load model kai-os/Carnice-9b...
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 345, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4873, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     model = initialize_model(
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]             ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 61, in initialize_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 575, in __init__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     config.vision_config,
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     ^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 434, in __getattribute__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]     return super().__getattribute__(key)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] AttributeError: 'Qwen3_5TextConfig' object has no attribute 'vision_config'
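
The last frames show vLLM's `qwen3_5.py` reading `config.vision_config` from an object that was loaded as `Qwen3_5TextConfig`. A quick way to check a local `config.json` for this condition might look like the sketch below. It assumes the loader wants `vision_config` and `text_config` at the top level (the first comes from the traceback, the second from the corrected file later in this thread); the function name and the sample dictionaries are hypothetical, not copied from the repository.

```python
import json

# Keys assumed to be required at the top level of config.json.
# `vision_config` is taken from the traceback; `text_config` is an assumption
# based on the nested layout of the corrected config in this thread.
EXPECTED_TOP_LEVEL_KEYS = ("vision_config", "text_config")


def missing_top_level_keys(config: dict) -> list[str]:
    """Return the expected top-level keys that are absent from a parsed config."""
    return [key for key in EXPECTED_TOP_LEVEL_KEYS if key not in config]


# Hypothetical flat, text-only config: both keys are reported as missing.
flat = json.loads('{"model_type": "qwen3_5_text", "hidden_size": 4096}')
print(missing_top_level_keys(flat))    # ['vision_config', 'text_config']

# Nested layout with an explicit null vision tower: nothing is missing,
# because a key set to null still exists as an attribute after loading.
nested = json.loads('{"model_type": "qwen3_5", "vision_config": null, "text_config": {}}')
print(missing_top_level_keys(nested))  # []
```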

To solve the problem, the config.json needs an update. I used the original config.json from the Qwen team and modified it to remove the vision tower:

{
  "architectures": [
    "Qwen3_5ForCausalLM"
  ],
  "model_type": "qwen3_5",
  "vision_config": null,
  "text_config": {
    "model_type": "qwen3_5_text",
    "attention_bias": false,
    "attention_dropout": 0.0,
    "attn_output_gate": true,
    "bos_token_id": null,
    "eos_token_id": 248044,
    "full_attention_interval": 4,
    "head_dim": 256,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 12288,
    "layer_types": [
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention"
    ],
    "linear_conv_kernel_dim": 4,
    "linear_key_head_dim": 128,
    "linear_num_key_heads": 16,
    "linear_num_value_heads": 32,
    "linear_value_head_dim": 128,
    "mamba_ssm_dtype": "float32",
    "max_position_embeddings": 262144,
    "mlp_only_layers": [],
    "mtp_num_hidden_layers": 1,
    "mtp_use_dedicated_embeddings": false,
    "num_attention_heads": 16,
    "num_hidden_layers": 32,
    "num_key_value_heads": 4,
    "pad_token_id": null,
    "partial_rotary_factor": 0.25,
    "rms_norm_eps": 1e-06,
    "rope_parameters": {
      "mrope_interleaved": true,
      "mrope_section": [11, 11, 10],
      "partial_rotary_factor": 0.25,
      "rope_theta": 10000000,
      "rope_type": "default"
    },
    "use_cache": true,
    "vocab_size": 248320
  },
  "dtype": "bfloat16",
  "tie_word_embeddings": false,
  "transformers_version": "5.6.0"
}
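
When editing the file by hand, the long `layer_types` list is the easiest place to slip. In the config above it follows a regular pattern: with `full_attention_interval` set to 4 and `num_hidden_layers` set to 32, every fourth layer is `full_attention`. The sketch below regenerates the list so it can be compared against the file. The rule is inferred from the values in this config, not taken from vLLM or Transformers documentation, and `expected_layer_types` is an illustrative name.

```python
def expected_layer_types(num_hidden_layers: int, full_attention_interval: int) -> list[str]:
    """Rebuild layer_types, assuming every N-th layer uses full attention."""
    return [
        "full_attention" if (index + 1) % full_attention_interval == 0 else "linear_attention"
        for index in range(num_hidden_layers)
    ]


# Values taken from the text_config section of the config above.
layers = expected_layer_types(num_hidden_layers=32, full_attention_interval=4)
print(len(layers))                       # 32
print(layers.count("linear_attention"))  # 24
print(layers.count("full_attention"))    # 8
```

Comparing `layers` with `config["text_config"]["layer_types"]` after `json.load` gives a quick check that no entry was dropped or duplicated.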
