Hi everyone! 👋
I've been working on a fork of Nano-vLLM called Nano-vLLM-v1, which re-engineers the core architecture to closely reproduce the vLLM v1 scheduler and introduces Chunked Prefill for better performance.
The goal was to build a lightweight, readable, yet highly efficient inference engine that stays true to the original vLLM design while being easy to understand and extend.
🔥 Key Features
- ✅ Fully reproduced vLLM v1 scheduler – implements the same scheduling logic as vLLM v1.
- ✅ Chunked Prefill – improves prefill efficiency for long contexts.
- ✅ Clean codebase – the simplest way to reproduce the vLLM v1 scheduler and implement Chunked Prefill on top of Nano-vLLM.
- ✅ Fast offline & online inference – performance comparable to vLLM v1 in offline throughput and online latency (TTFT and TPOT).
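To make the Chunked Prefill idea concrete, here is a minimal illustrative sketch (not the repo's actual code; the function name and parameters are my own): the scheduler caps the number of prompt tokens processed per step at a token budget, so a long prefill is split into chunks and decode requests can be interleaved between them instead of waiting for the whole prompt.

```python
def chunk_prefill(prompt_len: int, max_num_batched_tokens: int) -> list[int]:
    """Return per-step chunk sizes for one prefill request.

    Illustrative only: each step processes at most
    max_num_batched_tokens prompt tokens, leaving room in later
    steps to interleave decode work.
    """
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_num_batched_tokens)
        chunks.append(step)
        remaining -= step
    return chunks

# A 3000-token prompt under a 1024-token budget prefills in three steps:
# two full chunks and a final partial one.
print(chunk_prefill(3000, 1024))  # [1024, 1024, 952]
```

The `--max-num-batched-tokens` flag in the benchmark command below plays exactly this budget role.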
📦 Repository
Check it out here: https://github.com/slwang-ustc/nano-vllm-v1/tree/main
I'd love for the community to try it out, give feedback, or contribute! The code is designed to be readable and modular, making it easy to experiment with new features or optimizations.
If you're interested in lightweight, high-performance LLM inference without the complexity, give it a star ⭐ and let me know what you think!
🚀 Quick Start
Offline example:
```python
from nanovllm import LLM, SamplingParams

# Load the model; enforce_eager disables CUDA graph capture.
llm = LLM("/path/to/model", enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```
Online benchmarking:
```shell
python serving_bench.py \
    --model /path/to/Qwen3-14B/ \
    --request-rate 10 \
    --num-requests 1024 \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 1024 \
    --max-num-seqs 1024 \
    --random-input-len 128 \
    --random-output-len 100 \
    --chunked-prefill \
    --enforce-eager
```