QORA - Native Rust LLM Inference Engine. Base model: SmolLM3-3B (HuggingFaceTB/SmolLM3-3B)

#47
by drdraq - opened

Pure Rust inference engine for the SmolLM3-3B language model. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable AI on any machine.
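The post doesn't document which quantization scheme the `.qora` weights use, but a minimal sketch of symmetric int8 quantization (a common choice for shrinking f32 weights) illustrates the idea behind "quantized weights":

```rust
// Illustrative only: QORA's actual on-disk format is not documented here.
// Symmetric int8 quantization stores one f32 scale per tensor plus i8 values.

/// Quantize a slice of f32 weights to i8 with a per-tensor scale.
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights.iter().map(|&w| (w / scale).round() as i8).collect();
    (q, scale)
}

/// Recover approximate f32 weights (lossy, within one quantization step).
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let (q, scale) = quantize_i8(&[1.0, -0.5, 0.25]);
    let d = dequantize_i8(&q, scale);
    // Round-trip error is bounded by half a quantization step.
    assert!((d[0] - 1.0).abs() < 0.01);
    assert!((d[1] + 0.5).abs() < 0.01);
}
```

This is the trade-off behind the small download size: 4x smaller than f32 at the cost of a little precision.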

Now with GPU acceleration! Auto-detects Vulkan-compatible GPUs for ~4.8x faster inference, with intelligent fallback to CPU.
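The auto-detect-with-fallback behavior described above can be sketched as a simple backend selector; the real Vulkan device enumeration is replaced by a boolean probe here, and the `--cpu` flag (shown further down) forces the fallback path:

```rust
// Sketch of GPU auto-detection with CPU fallback. The actual Vulkan probe
// in QORA is not shown in the post; `vulkan_available` stands in for it.

#[derive(Debug, PartialEq)]
enum Backend {
    Vulkan,
    Cpu,
}

/// Prefer the GPU when the probe succeeds and the user has not passed
/// `--cpu`; otherwise fall back to the CPU path transparently.
fn select_backend(vulkan_available: bool, force_cpu: bool) -> Backend {
    if force_cpu || !vulkan_available {
        Backend::Cpu
    } else {
        Backend::Vulkan
    }
}

fn main() {
    assert_eq!(select_backend(true, false), Backend::Vulkan);
    assert_eq!(select_backend(true, true), Backend::Cpu); // --cpu wins
    assert_eq!(select_backend(false, false), Backend::Cpu); // no GPU found
}
```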

Try: https://huggingface.co/qoranet/QORA-LLM

Fastest: direct answer, no thinking, deterministic

qora.exe --prompt "What is the capital of France?" --no-think --greedy
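Why `--greedy` makes output deterministic: greedy decoding always picks the single highest-logit token at each step, instead of sampling from the distribution. A minimal sketch (not QORA's internal code) of that step:

```rust
// Greedy decoding step: argmax over the logits. Given the same prompt and
// weights, this always yields the same token, hence deterministic replies.
// Sampling-based decoding (the default, without --greedy) draws from the
// distribution instead, so replies vary between runs.

/// Return the index of the maximum logit. Assumes logits contain no NaN.
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    // Token 1 has the highest logit, so it is always chosen.
    assert_eq!(greedy_pick(&[0.1, 2.0, 0.5]), 1);
}
```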

Fast: direct answer with some randomness

qora.exe --prompt "Tell me about Mars" --no-think

Full quality: long reply mode, no thinking

qora.exe --prompt "Tell me the story of Bitcoin and its creation" --no-think --greedy

See what the model is thinking

qora.exe --prompt "What is 2+2?" --show-think

Force CPU (skip GPU auto-detect)

qora.exe --prompt "Hello" --cpu

Control output length

qora.exe --prompt "Tell me a story" --max-tokens 512
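What `--max-tokens` controls, sketched as a generation loop (hypothetical code, not QORA's internals): decoding stops at the token budget or at an end-of-sequence token, whichever comes first. The `next_token` closure stands in for a real model forward pass, and the EOS id here is a placeholder:

```rust
// Sketch of a token-budgeted generation loop. EOS is a placeholder id;
// the real end-of-sequence id depends on the model's tokenizer.
const EOS: u32 = 0;

/// Generate until `max_tokens` tokens are produced or EOS appears.
fn generate(mut next_token: impl FnMut() -> u32, max_tokens: usize) -> Vec<u32> {
    let mut out = Vec::new();
    for _ in 0..max_tokens {
        let t = next_token();
        if t == EOS {
            break; // model finished early
        }
        out.push(t);
    }
    out
}

fn main() {
    // EOS ends generation before the budget is reached.
    let mut it = [5u32, 6, 0, 7].into_iter();
    assert_eq!(generate(|| it.next().unwrap(), 10), vec![5, 6]);
    // Otherwise the budget caps the output length.
    assert_eq!(generate(|| 1, 3).len(), 3);
}
```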

Raw text completion (no chat template)

qora.exe --prompt "Once upon a time" --raw --max-tokens 128
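The difference `--raw` makes can be sketched in a few lines: normally the prompt is wrapped in the model's chat template before tokenization, while `--raw` feeds the text verbatim for plain completion. The template string below is a generic illustration, not SmolLM3's actual format:

```rust
// Sketch of chat-template wrapping vs. raw completion. The markers used
// here are illustrative placeholders, NOT the real SmolLM3 template.
fn build_input(prompt: &str, raw: bool) -> String {
    if raw {
        // --raw: pass the prompt through untouched; the model just continues it.
        prompt.to_string()
    } else {
        // Default: wrap the prompt as a user turn and cue an assistant reply.
        format!("<|user|>\n{prompt}\n<|assistant|>\n")
    }
}

fn main() {
    assert_eq!(build_input("Once upon a time", true), "Once upon a time");
    assert!(build_input("Once upon a time", false).contains("Once upon a time"));
}
```

This is why `--raw` suits story continuation ("Once upon a time"), while the default chat mode suits question answering.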

Custom model path

qora.exe --load path/to/model.qora --prompt "Hello"
