Broken config.json for vllm v0.21.0
#3
by Neiko2002 - opened
In vLLM v0.21.0, this model crashes on startup with the following error:
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306] version 0.21.0
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:306] model kai-os/Carnice-9b
(APIServer pid=1) INFO 05-16 20:27:01 [utils.py:240] non-default args: {'model_tag': 'kai-os/Carnice-9b', 'chat_template': '/root/.cache/huggingface/hub/models--Lorbus--Qwen3.6-27B-int4-AutoRound/snapshots/c3aea2d531678621989e5e2db034e32b22536e79/chat_template.jinja', 'default_chat_template_kwargs': {'enable_thinking': False}, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'kai-os/Carnice-9b', 'max_model_len': 4096, 'served_model_name': ['qwen-9b'], 'override_generation_config': {'temperature': 0.7, 'top_p': 0.8, 'top_k': 20, 'min_p': 0.0, 'presence_penalty': 1.5, 'repetition_penalty': 1.0}, 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 4, 'enable_chunked_prefill': True, 'async_scheduling': True, 'reasoning_config': ReasoningConfig(reasoning_parser='', reasoning_start_str='<think>', reasoning_end_str='I have to give the solution based on the reasoning directly now.</think>')}
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 05-16 20:27:01 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) INFO 05-16 20:27:12 [model.py:568] Resolved architecture: Qwen3_5ForCausalLM
(APIServer pid=1) INFO 05-16 20:27:12 [model.py:1697] Using max model len 4096
(APIServer pid=1) INFO 05-16 20:27:12 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-16 20:27:12 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) WARNING 05-16 20:27:12 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForCausalLM by default when prefix caching is enabled
(APIServer pid=1) INFO 05-16 20:27:12 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 05-16 20:27:12 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-16 20:27:12 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
ERROR: driverInitFileInfo 578 result=11
ERROR: init 664 result=11
ERROR: init 250 result=11
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 05-16 20:27:18 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=229) INFO 05-16 20:27:24 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='kai-os/Carnice-9b', speculative_config=None, tokenizer='kai-os/Carnice-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen-9b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=229) WARNING 05-16 20:27:24 [multiproc_executor.py:1029] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=229) INFO 05-16 20:27:24 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
[transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
INFO 05-16 20:27:32 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=284) INFO 05-16 20:27:32 [parallel_state.py:1410] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:47827 backend=nccl
INFO 05-16 20:27:32 [registry.py:134] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=285) INFO 05-16 20:27:32 [parallel_state.py:1410] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:47827 backend=nccl
(Worker pid=284) INFO 05-16 20:27:33 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=284) WARNING 05-16 20:27:33 [symm_mem.py:66] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=285) WARNING 05-16 20:27:33 [symm_mem.py:66] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=284) INFO 05-16 20:27:33 [parallel_state.py:1723] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=284) INFO 05-16 20:27:33 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=284) INFO 05-16 20:27:33 [gpu_model_runner.py:4857] Starting to load model kai-os/Carnice-9b...
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worker_main
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] self.worker.load_model()
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 345, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4873, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] self.model = model_loader.load_model(
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] model = initialize_model(
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 61, in initialize_model
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 575, in __init__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] config.vision_config,
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 434, in __getattribute__
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] return super().__getattribute__(key)
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=284) ERROR 05-16 20:27:34 [multiproc_executor.py:870] AttributeError: 'Qwen3_5TextConfig' object has no attribute 'vision_config'
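The last frames of the traceback show vLLM's qwen3_5.py reading config.vision_config on an object of type Qwen3_5TextConfig, which suggests the published config.json is a flat text-only config rather than a nested one. A minimal sketch for checking which layout a config.json uses is below. The helper name describe_config_layout and the three labels are hypothetical, and the "flat-text" rule assumes the flat file declares model_type "qwen3_5_text", which is inferred from the error message, not confirmed.

```python
import json


def describe_config_layout(config: dict) -> str:
    """Classify a parsed config.json as 'nested', 'flat-text', or 'unknown'.

    Hypothetical helper for illustration only; it inspects keys and
    does not load the model or call into vLLM or transformers.
    """
    # Nested layout: a top-level wrapper holding text_config and vision_config.
    if "text_config" in config and "vision_config" in config:
        return "nested"
    # Flat layout (assumption): the text model's fields sit at the top level.
    if config.get("model_type") == "qwen3_5_text":
        return "flat-text"
    return "unknown"


# Tiny stand-in dicts; a real check would use json.load() on config.json.
flat = json.loads('{"model_type": "qwen3_5_text", "hidden_size": 4096}')
nested = json.loads(
    '{"model_type": "qwen3_5", "vision_config": null, '
    '"text_config": {"model_type": "qwen3_5_text"}}'
)
print(describe_config_layout(flat))
print(describe_config_layout(nested))
```

A "flat-text" result would be consistent with the AttributeError above, since a flat config carries no vision_config key at all.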
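To solve the problem, config.json needs an update: vLLM's qwen3_5.py reads config.vision_config, which the Qwen3_5TextConfig loaded from the current file does not have. I took the original config.json from the Qwen team and modified it to remove the vision tower: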
{
  "architectures": [
    "Qwen3_5ForCausalLM"
  ],
  "model_type": "qwen3_5",
  "vision_config": null,
  "text_config": {
    "model_type": "qwen3_5_text",
    "attention_bias": false,
    "attention_dropout": 0.0,
    "attn_output_gate": true,
    "bos_token_id": null,
    "eos_token_id": 248044,
    "full_attention_interval": 4,
    "head_dim": 256,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 12288,
    "layer_types": [
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention",
      "linear_attention", "linear_attention", "linear_attention", "full_attention"
    ],
    "linear_conv_kernel_dim": 4,
    "linear_key_head_dim": 128,
    "linear_num_key_heads": 16,
    "linear_num_value_heads": 32,
    "linear_value_head_dim": 128,
    "mamba_ssm_dtype": "float32",
    "max_position_embeddings": 262144,
    "mlp_only_layers": [],
    "mtp_num_hidden_layers": 1,
    "mtp_use_dedicated_embeddings": false,
    "num_attention_heads": 16,
    "num_hidden_layers": 32,
    "num_key_value_heads": 4,
    "pad_token_id": null,
    "partial_rotary_factor": 0.25,
    "rms_norm_eps": 1e-06,
    "rope_parameters": {
      "mrope_interleaved": true,
      "mrope_section": [11, 11, 10],
      "partial_rotary_factor": 0.25,
      "rope_theta": 10000000,
      "rope_type": "default"
    },
    "use_cache": true,
    "vocab_size": 248320
  },
  "dtype": "bfloat16",
  "tie_word_embeddings": false,
  "transformers_version": "5.6.0"
}
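For anyone who would rather derive this layout from the existing flat file than start from Qwen's config, here is a rough sketch of the transformation. It is not the method used above: nest_text_config and TOP_LEVEL_KEYS are hypothetical names, and the split between top-level and text_config keys is an assumption copied from the config posted here. Whether the result loads in vLLM v0.21.0 is untested.

```python
import json

# Keys kept at the top level in the config posted above. Assumption: every
# other key describes the text model and therefore moves under "text_config".
TOP_LEVEL_KEYS = ("architectures", "dtype", "tie_word_embeddings", "transformers_version")


def nest_text_config(flat: dict) -> dict:
    """Wrap a flat text-only config in a nested layout with a null vision tower."""
    # Everything that is not a known top-level key goes into text_config.
    text_config = {k: v for k, v in flat.items() if k not in TOP_LEVEL_KEYS}
    text_config["model_type"] = "qwen3_5_text"

    nested = {
        "architectures": flat.get("architectures", ["Qwen3_5ForCausalLM"]),
        "model_type": "qwen3_5",
        "vision_config": None,  # serialized as JSON null
        "text_config": text_config,
    }
    # Carry over the remaining top-level keys when the flat file has them.
    for key in TOP_LEVEL_KEYS[1:]:
        if key in flat:
            nested[key] = flat[key]
    return nested


# Shortened stand-in for a flat config; a real run would json.load() the file.
flat_example = {
    "architectures": ["Qwen3_5ForCausalLM"],
    "model_type": "qwen3_5_text",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "dtype": "bfloat16",
    "tie_word_embeddings": False,
}
print(json.dumps(nest_text_config(flat_example), indent=2))
```

If the flat file uses a key such as torch_dtype instead of dtype, this sketch would move it into text_config, so compare the output against the config above before replacing anything.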