Evrmind EVR-1 Maano-8b
Llama 3.1 8B compressed using EVR-1 (Evrmind Reconstruction), a novel compression method developed independently by Evrmind. The compressed weights average approximately 3 bits per parameter; the total GGUF file (3.93 GiB) includes additional metadata and structure overhead.
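The size figures above can be sanity-checked with a little arithmetic. The sketch below converts the quoted GiB figure to GB and computes a whole-file average in bits per parameter. The parameter count (about 8.03 billion) is an assumption, not a number stated in this README. The whole-file average comes out higher than the weight-level figure of roughly 3 bits because it counts every byte in the GGUF, including the overhead mentioned above.

```python
# Assumption: Llama 3.1 8B has roughly 8.03 billion parameters
# (not stated in this README).
PARAMS = 8.03e9
FILE_SIZE_GIB = 3.93  # GGUF file size quoted above


def gib_to_bytes(gib: float) -> float:
    """Convert gibibytes (2**30 bytes) to bytes."""
    return gib * 2**30


def bits_per_parameter(file_gib: float, params: float) -> float:
    """Average bits per parameter over the whole file, overhead included."""
    return gib_to_bytes(file_gib) * 8 / params


size_gb = gib_to_bytes(FILE_SIZE_GIB) / 1e9
file_bpp = bits_per_parameter(FILE_SIZE_GIB, PARAMS)

print(f"{FILE_SIZE_GIB} GiB = {size_gb:.2f} GB")
print(f"whole-file average: {file_bpp:.2f} bits per parameter")
```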
At roughly 3.9 GiB, standard quantizations like Q3_K_M (3.83 GiB) collapse into repetition. EVR-1 (3.93 GiB) produces far more coherent long-form text at a similar size: 5.83% rep4 repetition vs 76.79% for Q3_K_M at 500 tokens, and 19.68% vs 87.65% at 1000 tokens.
3.93 GiB | Llama 3.1 8B base | Runs on laptops, desktops, and Android (Termux)
Note: HuggingFace may display an incorrect parameter count in the sidebar due to the custom compression format. EVR-1 is not a standard quantization (not Q2, Q3, Q4, etc).
Setup
You need two things: the model files (from this HuggingFace repo) and a platform binary (from GitHub).
Step 1: Clone this repo or download the files:
# Option A: Clone everything (~4.2 GB, requires git-lfs)
git lfs install
git clone https://huggingface.co/evrmind/evr-1-maano-8b
cd evr-1-maano-8b
# Option B: Or download individual files from the "Files" tab above
Step 2: Download the binary for your platform from the Downloads table. Save the archive into the evr-1-maano-8b directory, then extract it:
# Linux + NVIDIA
mkdir -p linux-cuda && tar xzf evrmind-linux-cuda.tar.gz -C linux-cuda
# Linux + Vulkan
mkdir -p linux-vulkan && tar xzf evrmind-linux-vulkan.tar.gz -C linux-vulkan
# macOS (Apple Silicon)
mkdir -p metal && tar xzf evrmind-macos-metal.tar.gz -C metal
# Android (Termux)
mkdir -p android-vulkan && tar xzf evrmind-android-vulkan.tar.gz -C android-vulkan
For Windows, extract the .zip into a folder with the matching name (e.g., extract evrmind-windows-cuda.zip into a folder called windows-cuda).
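The extraction step can also be scripted in a platform-independent way. This is a minimal sketch, assuming the archive names from the Downloads table and the folder names used in this README. The helper names `target_dir_for` and `extract_binary` are hypothetical. Note that the macOS archive extracts into `metal`, not `macos-metal`.

```python
import shutil
from pathlib import Path

# Folder names follow this README: strip the "evrmind-" prefix,
# except macOS, which the README extracts into "metal".
SPECIAL_DIRS = {"evrmind-macos-metal": "metal"}


def target_dir_for(archive_name: str) -> str:
    """Map an archive file name to the folder name the README expects."""
    for suffix in (".tar.gz", ".zip"):
        if archive_name.endswith(suffix):
            stem = archive_name[: -len(suffix)]
            break
    else:
        raise ValueError(f"unsupported archive type: {archive_name}")
    if stem in SPECIAL_DIRS:
        return SPECIAL_DIRS[stem]
    if not stem.startswith("evrmind-"):
        raise ValueError(f"unexpected archive name: {archive_name}")
    return stem[len("evrmind-"):]


def extract_binary(archive: Path) -> Path:
    """Extract a platform archive next to itself and return the folder."""
    dest = archive.parent / target_dir_for(archive.name)
    dest.mkdir(exist_ok=True)
    shutil.unpack_archive(str(archive), str(dest))
    return dest
```

For example, `extract_binary(Path("evrmind-windows-cuda.zip"))` would create and fill `windows-cuda` in the current directory.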
After completing both steps, your directory should look like this:
evr-1-maano-8b/
  evr-llama-3.1-8b.gguf    <-- model weights
  start-server.sh          <-- Linux/macOS/Android launcher
  start-server.bat         <-- Windows launcher
  webui/                   <-- browser interface
  linux-cuda/              <-- extracted platform binary (example)
    llama-server
    llama-completion
    ...
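A quick way to confirm the layout before launching anything is to check for the entries listed above. The sketch below is illustrative only: `missing_entries` is a hypothetical helper, and the `exe_suffix` parameter assumes Windows binaries carry an `.exe` extension (this README shows that for `llama-completion.exe`; it is assumed for `llama-server`).

```python
from pathlib import Path

REQUIRED_FILES = ["evr-llama-3.1-8b.gguf", "start-server.sh", "start-server.bat"]
REQUIRED_DIRS = ["webui"]
BINARIES = ["llama-server", "llama-completion"]


def missing_entries(root, platform_dir="linux-cuda", exe_suffix=""):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    expected = [root / name for name in REQUIRED_FILES + REQUIRED_DIRS]
    expected += [root / platform_dir / f"{name}{exe_suffix}" for name in BINARIES]
    return [str(path.relative_to(root)) for path in expected if not path.exists()]
```

An empty list from `missing_entries(".")` means every listed file and folder is in place; on Windows you would pass something like `platform_dir="windows-cuda", exe_suffix=".exe"`.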
Web UI
Linux, macOS, Android (Termux):
./start-server.sh
# Open http://localhost:8080
Windows:
Double-click start-server.bat, or from Command Prompt:
start-server.bat
Then open http://localhost:8080 in your browser.
Network access (phone, tablet, other devices on the same WiFi):
./start-server.sh --network
The script will print the URL to open on other devices. The model runs on your computer; other devices just connect to the web UI. Note: the --network and --cpu flags are only available in start-server.sh (Linux/macOS/Android).
See WEB_UI.md for more options and troubleshooting.
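Once the server is running, it may also be reachable from a script. The sketch below assumes the bundled `llama-server` keeps the upstream llama.cpp `/completion` endpoint (JSON in, JSON out with a `content` field) and listens on port 8080 as shown above. Neither assumption is confirmed by this README, so check WEB_UI.md if the request fails.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # port taken from the Web UI instructions


def build_completion_request(prompt, n_predict=128, base_url=BASE_URL):
    """Build a POST request for the (assumed) upstream /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=128, base_url=BASE_URL, timeout=120):
    """Send the prompt to a running server and return the generated text."""
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["content"]


# Usage, with the server already started:
#   print(complete("The main causes of the French Revolution were", 64))
```

Because this is a base model, the prompt should be the start of a text to continue rather than an instruction.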
Quick Start (CLI)
These examples assume you have completed Setup and are in the evr-1-maano-8b directory.
Linux + NVIDIA GPU:
cd linux-cuda
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "The main causes of the French Revolution were" -n 500 -ngl 99
macOS (Apple Silicon):
cd metal
./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99
Linux + Vulkan:
cd linux-vulkan
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99
Android (Termux):
cd android-vulkan
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99
Windows + NVIDIA (Command Prompt):
cd windows-cuda
llama-completion.exe -m ..\evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99
Windows + Vulkan (Command Prompt):
cd windows-vulkan
llama-completion.exe -m ..\evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99
CPU-only (no GPU):
Use -ngl 0 instead of -ngl 99 on any platform. This is roughly 5-10x slower but works on any machine.
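The commands above differ only in the platform folder and the `-ngl` value, so they are easy to wrap in a script. The following is a sketch for Linux and Android, using only the flags shown above (`-m`, `-p`, `-n`, `-ngl`). The function names are hypothetical. On Windows you would pass `exe="llama-completion.exe"` and a backslash model path instead.

```python
import os
import subprocess


def build_command(prompt, n_tokens=500, use_gpu=True,
                  exe="./llama-completion", model="../evr-llama-3.1-8b.gguf"):
    """Assemble the llama-completion argument list used in the examples above."""
    gpu_layers = "99" if use_gpu else "0"  # -ngl 0 selects CPU-only
    return [exe, "-m", model, "-p", prompt, "-n", str(n_tokens), "-ngl", gpu_layers]


def run_completion(platform_dir, prompt, **kwargs):
    """Run the binary from inside its platform folder and return its stdout."""
    # Mirrors the LD_LIBRARY_PATH=. prefix from the Linux/Android examples.
    env = dict(os.environ, LD_LIBRARY_PATH=".")
    result = subprocess.run(
        build_command(prompt, **kwargs),
        cwd=platform_dir,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage, from the evr-1-maano-8b directory:
#   print(run_completion("linux-cuda", "Your prompt here", n_tokens=200))
```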
Downloads
| Platform | Download | GPU |
|---|---|---|
| Linux + NVIDIA | evrmind-linux-cuda.tar.gz | CUDA 12 |
| Linux + Any GPU | evrmind-linux-vulkan.tar.gz | Vulkan |
| Windows + NVIDIA | evrmind-windows-cuda.zip | CUDA 12 |
| Windows + Any GPU | evrmind-windows-vulkan.zip | Vulkan |
| macOS (Apple Silicon) | evrmind-macos-metal.tar.gz | Apple Silicon |
| Android (Termux) | evrmind-android-vulkan.tar.gz | Vulkan |
The model weights (evr-llama-3.1-8b.gguf, 4.22 GB) are available from the Files tab on this HuggingFace page. Platform binaries are hosted on GitHub Releases. You can verify downloads with SHA256SUMS.txt.
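Checksum verification can be done without extra tools. This sketch assumes SHA256SUMS.txt uses the common `sha256sum` layout (a hex digest, whitespace, then the file name on each line); the actual file format is not shown in this README, so treat the parser as an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte downloads fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse '<hex digest>  <file name>' lines into {file name: digest}."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # sha256sum marks binary-mode entries with a leading '*'.
            sums[parts[1].strip().lstrip("*")] = parts[0].lower()
    return sums


def verify(path, sums_file="SHA256SUMS.txt"):
    """Return True if the file's digest matches its entry in the sums file."""
    sums = parse_sums(Path(sums_file).read_text())
    name = Path(path).name
    return name in sums and sums[name] == sha256_of(path)
```

For example, `verify("evrmind-linux-cuda.tar.gz")` returns `False` if the archive is missing from the list or was corrupted in transit.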
Why EVR-1 Maano-8b?
Standard quantizations at 3-4 GiB collapse into repetition during extended generation. Here is one example from our test set (see BENCHMARK_RESULTS.md for all results):
Q3_K_M (3.83 GiB):
"The process of nuclear fusion in stars begins when the core of a star is hot enough to start fusing hydrogen into helium... The process of nuclear fusion in stars is a complex process... The process of nuclear fusion in stars is a complex process..." (loops indefinitely)
EVR-1 Maano-8b (3.93 GiB):
"The process of nuclear fusion in stars begins when the core of the star is made up of hydrogen... In stars like our sun, hydrogen atoms fuse together to form helium atoms. The helium atoms then fuse together forming carbon-12..." (continues coherently for 500+ words)
Benchmarks
Head-to-head, same base model (Llama 3.1 8B), different compression methods. All models tested with the same binary, temperature 0, no repeat penalty.
Coherence (rep4 repetition, lower is better)
| Generation length | EVR-1 Maano (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|---|---|---|---|
| 5 prompts @ 500 tokens | 5.83% | 76.79% | 79.45% |
| 5 prompts @ 1000 tokens | 19.68% | 87.65% | 89.69% |
EVR-1 maintains coherent output where Q3_K_M and Q4_K_M degrade into repetition. This is the core advantage of EVR-1 at this compression level.
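This README does not define rep4; the exact definition is in BENCHMARK_RESULTS.md. As a rough illustration of what a metric like this measures, the sketch below assumes rep4 is the fraction of 4-grams in the output that repeat an earlier 4-gram, and it splits on whitespace rather than using the model tokenizer. Both are assumptions, so the numbers it produces will not match the table.

```python
def rep4(text, n=4):
    """Fraction of n-grams that already appeared earlier in the text.

    Assumed definition for illustration; whitespace tokens, not model tokens.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    seen = set()
    repeated = 0
    for gram in ngrams:
        if gram in seen:
            repeated += 1
        else:
            seen.add(gram)
    return repeated / len(ngrams)


looping = "the process is complex . " * 20
print(f"looping text: {rep4(looping):.1%}")
```

Text that loops on one sentence scores close to 100% under this definition, while text with no repeated 4-word sequence scores 0%.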
Accuracy
| Benchmark | EVR-1 Maano (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|---|---|---|---|
| ARC-Challenge (25-shot, 1172q) | 59.8% | 60.8% | 61.3% |
| Perplexity (wikitext-2, ctx=512) | 6.70 | 7.02 | 6.58 |
| Perplexity (wikitext-2, ctx=2048) | 6.19 | 6.13 | 5.74 |
Perplexity varies with context size. At the default context (512), EVR-1 (6.70) outperforms Q3_K_M (7.02) and trails Q4_K_M (6.58). At extended context (2048), EVR-1 (6.19) falls slightly behind Q3_K_M (6.13), and Q4_K_M (5.74) leads both. See BENCHMARK_RESULTS.md for full methodology, raw outputs, and additional evaluations.
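For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model found the test text less surprising. The sketch below shows the standard formula on a list of natural-log token probabilities. It does not reproduce how the benchmark binary windows wikitext-2 into 512- or 2048-token contexts.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.5 to every token has perplexity 2:
# it is as uncertain as a fair coin flip at each step.
coin_flip = perplexity([math.log(0.5)] * 4)
print(coin_flip)
```

Longer contexts give the model more preceding text to condition on, which is one reason the ctx=2048 figures in the table are lower than the ctx=512 ones for all three files.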
Limitations
- This is a base model (not instruction-tuned). It completes text, but does not follow instructions or engage in conversation without additional prompting. The web UI works for open-ended generation and experimentation.
- Context window has been tested up to 2048 tokens. Longer contexts may work but have not been validated at 3-bit compression.
- Perplexity at default context (512) is 6.70, outperforming Q3_K_M (7.02) but higher than Q4_K_M (6.58). At extended context (2048): 6.19 vs Q3_K_M 6.13, Q4_K_M 5.74.
- Occasional minor character-level artefacts due to 3-bit compression.
- As with all heavily quantized models, generated text may contain factual inaccuracies (e.g., incorrect numbers, dates, or scientific details). Always verify factual claims independently.
System Requirements
- Storage: ~4 GiB for model weights + ~50 MB for binaries
- RAM: 6 GiB minimum (8 GiB recommended)
- GPU (recommended): NVIDIA (CUDA 12), Apple Silicon, or any Vulkan GPU
- CPU-only: Supported but slower (use -ngl 0 or the --cpu flag)
- OS: Linux, macOS (Apple Silicon), Windows, Android (Termux)
- Not supported: iOS, 32-bit systems
Safety and Responsible Use
This model can generate incorrect, biased, or harmful content. It has not been safety-tuned or RLHF-aligned. Users should apply appropriate content filtering for user-facing applications. See MODEL_CARD.md for details.
Derivative Works
If you create derivative works, credit "EVR-1 Maano" in your model name and documentation. Commercial use is permitted subject to the Llama 3.1 Community License Agreement.
License
This model is dual-licensed:
- Evrmind Free License 1.0: Covers the EVR-1 compression and distribution. Permits personal, research, and commercial use with attribution.
- Llama 3.1 Community License: Covers the underlying Llama 3.1 weights. Permits commercial use for entities with fewer than 700 million monthly active users.
Both licenses apply. See LICENSE.md and META_LLAMA_LICENSE.md for full terms.
Also Available
| Model | Base | Use Case |
|---|---|---|
| EVR-1 Maano-8b-Instruct | Llama 3.1 8B Instruct | Chat, instruction following, assistants |
| EVR-1 Bafethu-8b-Reasoning | DeepSeek-R1-Distill-Llama-8B | Chain-of-thought reasoning, maths, code |
All EVR-1 models use the same binaries from GitHub Releases.
Contact
- Email: hello@evrmind.io
- Issues: GitHub