Commits · KinetoLabs/SmokeScan

Add enforce_eager=True to fix KV cache memory issue

1b7fbd7

KinetoLabs Claude Opus 4.5 committed 2 days ago

Reduce vLLM memory for A100 24GB compatibility

b85b1e0

KinetoLabs Claude Opus 4.5 committed 2 days ago

Switch to Qwen3-VL-4B-Thinking for single-GPU simplicity

14c59e5

KinetoLabs Claude Opus 4.5 committed 2 days ago

Reduce context/memory to minimize NCCL overhead on L4s

7d5c713

KinetoLabs Claude Opus 4.5 committed 3 days ago

Align vLLM config with official Qwen3-VL model card

b2fe3f4

KinetoLabs Claude Opus 4.5 committed 3 days ago

Fix vLLM multi-GPU init: explicit dtype + higher mem util + eager mode

71a896b

KinetoLabs Claude Opus 4.5 committed 3 days ago

Force vLLM V0 engine + reduce max_model_len for stability

3c9a722

KinetoLabs Claude Opus 4.5 committed 3 days ago

Fix vLLM Engine core initialization failed on multi-GPU

ed575b1

KinetoLabs Claude Opus 4.5 committed 3 days ago

Replace dual 8B with single 30B-A3B FP8 vision model

706520f

KinetoLabs Claude Opus 4.5 committed 3 days ago

Replace 30B MoE with dual 8B models (Thinking + Instruct)

333c083

KinetoLabs Claude Opus 4.5 committed 3 days ago

Implement lazy model loading to prevent CUDA OOM on 4xL4 GPUs

5f0db1e

KinetoLabs Claude Opus 4.5 committed 3 days ago

Add property accessors to RealModelStack for interface parity

c190082

KinetoLabs Claude Opus 4.5 committed 3 days ago

Fix multi-GPU compatibility issues (6 locations)

d1901ae

KinetoLabs Claude Opus 4.5 committed 3 days ago

Fix embedding/reranker loading with official Qwen3-VL classes

455c786

KinetoLabs Claude Opus 4.5 committed 3 days ago

Fix critical model implementations and add sample scenarios

f3ebc82

KinetoLabs Claude Opus 4.5 committed 3 days ago

Initial commit: FDAM AI Pipeline v4.0.1

88bdcff

KinetoLabs Claude Opus 4.5 committed 4 days ago
