Rio 3.5 Open 397B NVFP4

NVFP4 quantization of prefeitura-rio/Rio-3.5-Open-397B (Qwen 3.5 397B post-trained with vision).

Run

huggingface-cli download mitomtuna/Rio-3.5-Open-397B-NVFP4 --local-dir ./Rio-3.5-Open-397B-NVFP4
docker pull ghcr.io/tunamitom/rio:latest
docker compose -f docker-compose.rio.yaml up -d

Requires 4Γ— RTX 6000 Blackwell (384GB+ VRAM total) and NVIDIA Container Toolkit.

docker-compose.rio.yaml

services:
  sglang:
    image: ghcr.io/tunamitom/rio:latest
    container_name: rio
    entrypoint: ["/bin/bash"]
    ipc: host
    shm_size: "16g"
    mem_limit: 200g
    memswap_limit: 200g
    restart: "no"
    cap_add: [SYS_NICE]
    ulimits:
      memlock: -1
      stack: 67108864
      nofile: { soft: 1048576, hard: 1048576 }
    ports:
      - "8001:8001"
    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8001/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 900s
    environment:
      OMP_NUM_THREADS: "8"
      SAFETENSORS_FAST_GPU: "1"
      CUTE_DSL_ARCH: "sm_120a"
      SGLANG_ENABLE_SPEC_V2: "1"
      SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION: "false"
      SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK: "1"
      SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK: "false"
      NCCL_DEBUG: WARN
      SGLANG_SET_CPU_AFFINITY: "1"
      NCCL_IB_DISABLE: "1"
      NCCL_P2P_LEVEL: SYS
      NCCL_ALLOC_P2P_NET_LL_BUFFERS: "1"
      NCCL_MIN_NCHANNELS: "8"
      NCCL_CUMEM_HOST_ENABLE: "0"
      NCCL_NET_GDR_LEVEL: "SYS"
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
      SGLANG_ENABLE_JIT_DEEPGEMM: "0"
      CUDA_VISIBLE_DEVICES: "0,1,2,3"
      SGLANG_PREVENT_THOUGHT_LOOPS: "0"
      B12X_ENABLE_DYNAMIC_DOWN_SCALE: "1"
      SGLANG_PCIE_AUTOTUNE: "1"
      TORCHINDUCTOR_CACHE_DIR: "/cache/torchinductor"
      TRITON_CACHE_DIR: "/cache/triton"
      CUTE_DSL_CACHE_DIR: "/cache/cute_dsl"
      B12X_AUTOTUNE_CACHE_DIR: "/cache/b12x_autotune"
    volumes:
      - ./Rio-3.5-Open-397B-NVFP4:/models/Rio-3.5-Open-397B-NVFP4:ro
      - rio-cache:/cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1", "2", "3"]
              capabilities: [gpu]
    command:
      - -lc
      - >-
        set -euo pipefail;
        exec python3 -m sglang.launch_server
        --model-path /models/Rio-3.5-Open-397B-NVFP4
        --tokenizer-path /models/Rio-3.5-Open-397B-NVFP4
        --served-model-name rio
        --tp-size 4
        --host 0.0.0.0
        --port 8001
        --trust-remote-code
        --quantization modelopt_fp4
        --kv-cache-dtype fp8_e4m3
        --mem-fraction-static 0.93
        --chunked-prefill-size 16384
        --cuda-graph-max-bs 64
        --cuda-graph-bs 1 2 3 4 5 6 7 8 16 24 32 48 64
        --max-running-requests 64
        --reasoning-parser qwen3
        --tool-call-parser qwen3_coder
        --attention-backend flashinfer
        --fp4-gemm-backend b12x
        --moe-runner-backend b12x
        --mamba-scheduler-strategy extra_buffer
        --enable-pcie-oneshot-allreduce
        --enable-metrics
        --sleep-on-idle
volumes:
  rio-cache:
    driver: local

Performance

Tested on 4Γ— RTX 6000 Blackwell (300W Max-Q, TP=4).

Decode tok/s (aggregate):

ctx C=1 C=10 C=20 C=32
0 130 539 823 1159
16k 128 511 β€” β€”
32k 124 495 β€” β€”
64k 120 475 β€” β€”
128k 113 435 β€” β€”

Prefill tok/s:

ctx tok/s
8k 12,614
16k 11,850
32k 11,375
64k 10,150
128k 7,943

Note: Speculative decoding (NEXTN/MTP) is intentionally disabled β€” the shipped draft head was trained for base Qwen3.5 and doesn't work well with Rio's post-trained weights. See prefeitura-rio/Rio-3.5-Open-397B for the original model.

Details

  • Quantization: NVFP4 via quant-toolkit (ModelOpt)
  • KV cache: FP8 E4M3 (2.33M tokens)
  • Attention: FlashInfer
  • MoE: B12X
  • Server: SGLang
Downloads last month
202
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mitomtuna/Rio-3.5-Open-397B-NVFP4

Quantized
(4)
this model