Evrmind EVR-1 Maano-8b

Llama 3.1 8B compressed to 3 bits per weight using EVR-1 (Evrmind Reconstruction), a novel compression method developed independently by Evrmind.

At this size (3.93 GiB), standard quantizations like Q3_K_M collapse into repetition. EVR-1 produces coherent text at 1000+ tokens where others fail: 0.64% repetition vs. 53% for Q3_K_M at the same size.

3.93 GiB | Llama 3.1 8B base | Runs on laptops, desktops, and Android (Termux)

Note: HuggingFace may display an incorrect parameter count in the sidebar due to the custom compression format. EVR-1 is not a standard quantization (not Q2, Q3, Q4, etc).

Setup

You need two things: the model files (from this HuggingFace repo) and a platform binary (from GitHub).

Step 1: Clone this repo or download the files:

# Option A: Clone everything (~4.2 GB, requires git-lfs)
git lfs install
git clone https://huggingface.co/evrmind/evr-1-maano-8b
cd evr-1-maano-8b

# Option B: Or download individual files from the "Files" tab above

Step 2: Download the binary for your platform from the Downloads table. Save the archive into the evr-1-maano-8b directory, then extract it:

# Linux + NVIDIA
mkdir -p linux-cuda && tar xzf evrmind-linux-cuda.tar.gz -C linux-cuda

# Linux + Vulkan
mkdir -p linux-vulkan && tar xzf evrmind-linux-vulkan.tar.gz -C linux-vulkan

# macOS (Apple Silicon)
mkdir -p metal && tar xzf evrmind-macos-metal.tar.gz -C metal

# Android (Termux)
mkdir -p android-vulkan && tar xzf evrmind-android-vulkan.tar.gz -C android-vulkan

For Windows, extract the .zip into a folder with the matching name (e.g., extract evrmind-windows-cuda.zip into a folder called windows-cuda).

After completing both steps, your directory should look like this:

evr-1-maano-8b/
  evr-llama-3.1-8b.gguf    <-- model weights
  start-server.sh           <-- Linux/macOS/Android launcher
  start-server.bat          <-- Windows launcher
  webui/                    <-- browser interface
  linux-cuda/               <-- extracted platform binary (example)
    llama-server
    llama-completion
    ...

Web UI

Linux, macOS, Android (Termux):

./start-server.sh
# Open http://localhost:8080

Windows:

Double-click start-server.bat, or from Command Prompt:

start-server.bat

Then open http://localhost:8080 in your browser.

Network access (phone, tablet, other devices on the same WiFi):

./start-server.sh --network

The script will print the URL to open on other devices. The model runs on your computer; other devices just connect to the web UI. Note: the --network and --cpu flags are only available in start-server.sh (Linux/macOS/Android).

See WEB_UI.md for more options and troubleshooting.
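
Once the server is running, you can also query it over HTTP instead of using the browser. The snippet below assumes the bundled server is llama.cpp's llama-server, whose standard /completion endpoint accepts a JSON body; treat it as a sketch, not a documented API of this package.

```shell
# Hypothetical direct API call, assuming the bundled server is llama.cpp's
# llama-server listening on port 8080 (start it first with ./start-server.sh)
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The main causes of the French Revolution were", "n_predict": 128}' \
  || echo "server not reachable on localhost:8080"
```

The response is JSON containing the generated continuation; n_predict caps the number of tokens generated per request.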

Quick Start (CLI)

These examples assume you have completed Setup and are in the evr-1-maano-8b directory.

Linux + NVIDIA GPU:

cd linux-cuda
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "The main causes of the French Revolution were" -n 500 -ngl 99

macOS (Apple Silicon):

cd metal
./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99

Linux + Vulkan:

cd linux-vulkan
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99

Android (Termux):

cd android-vulkan
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99

Windows + NVIDIA (Command Prompt):

cd windows-cuda
llama-completion.exe -m ..\evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99

Windows + Vulkan (Command Prompt):

cd windows-vulkan
llama-completion.exe -m ..\evr-llama-3.1-8b.gguf -p "Your prompt here" -n 500 -ngl 99

CPU-only (no GPU):

Use -ngl 0 instead of -ngl 99 on any platform. Roughly 5-10x slower but works on any machine.

Downloads

| Platform | Download | GPU backend |
|---|---|---|
| Linux + NVIDIA | evrmind-linux-cuda.tar.gz | CUDA 12 |
| Linux + any GPU | evrmind-linux-vulkan.tar.gz | Vulkan |
| Windows + NVIDIA | evrmind-windows-cuda.zip | CUDA 12 |
| Windows + any GPU | evrmind-windows-vulkan.zip | Vulkan |
| macOS (Apple Silicon) | evrmind-macos-metal.tar.gz | Metal |
| Android (Termux) | evrmind-android-vulkan.tar.gz | Vulkan |

The model weights (evr-llama-3.1-8b.gguf, 4.22 GB) are available from the Files tab on this HuggingFace page. Platform binaries are hosted on GitHub Releases. You can verify downloads with SHA256SUMS.txt.
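
Verification can be done with standard tools; a minimal sketch assuming GNU coreutils (on macOS, `shasum -a 256 --check` is the equivalent):

```shell
# Verify whichever files you downloaded against the published checksums.
# --ignore-missing skips entries for archives you did not download.
sha256sum --check --ignore-missing SHA256SUMS.txt \
  || echo "verification failed (is SHA256SUMS.txt in the current directory?)"
```

Run this inside the evr-1-maano-8b directory after downloading; each verified file is reported with an "OK" line.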

Why EVR-1 Maano-8b?

Standard quantizations at 3-4 GiB can produce repetition during extended generation. Here is one example from our 6-prompt test set (see BENCHMARK_RESULTS.md for all results):

Q3_K_M (3.83 GiB):

"The French monarchy was bankrupt. The French monarchy was bankrupt. The French monarchy was bankrupt. The French monarchy was bankrupt..." (repeats 70 times)

EVR-1 Maano-8b (3.93 GiB):

"The main causes of the French Revolution were social, political and economic. The French Revolution was one of the most significant events in European history. The French revolution brought the new concepts of modern democracy..." (continues coherently for 500+ words)

Benchmarks

Head-to-head comparison: same base model (Llama 3.1 8B), different compression methods. All models were tested with the same binary and no repeat penalty.

Coherence (lower is better)

| Generation length | EVR-1 Maano (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|---|---|---|---|
| 6 prompts @ 500 tokens | 0.64% rep4 | 52.92% | 36.92% |
| 2 prompts @ 1000 tokens | 1.53% rep4 | 36.15% | 54.87% |

EVR-1 maintains coherent output where Q3_K_M and Q4_K_M degrade into repetition. This is the core advantage of EVR-1 at this compression level.
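
The rep4 metric is not defined in this section; one plausible reading is the fraction of word-level 4-grams in a generation that repeat an earlier 4-gram. A minimal sketch under that assumption (the toy text and the metric definition are illustrative, not taken from BENCHMARK_RESULTS.md):

```shell
# Illustrative 4-gram repetition rate: fraction of 4-grams that
# duplicate a 4-gram seen earlier in the same text (assumed definition)
text="the cat sat the cat sat the cat sat on the mat"
echo "$text" | tr ' ' '\n' | awk '
  { w[NR] = $1 }
  END {
    total = 0; dup = 0
    for (i = 1; i <= NR - 3; i++) {
      g = w[i] " " w[i+1] " " w[i+2] " " w[i+3]
      total++
      if (g in seen) dup++
      seen[g] = 1
    }
    printf "rep4 = %.2f%%\n", 100 * dup / total
  }'
# prints: rep4 = 33.33%
```

A heavily repetitive output pushes this figure toward 100%, which is why a value near zero indicates sustained coherence.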

Accuracy

| Benchmark | EVR-1 Maano (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|---|---|---|---|
| ARC-Challenge (25-shot, 1172q) | 59.8% | 60.8% | 61.3% |
| Perplexity (wikitext-2, full eval) | 6.19 | 6.13 | 5.74 |

Q3_K_M and Q4_K_M are industry-standard quantizations refined over years by the open-source llama.cpp community. EVR-1 is a novel compression method developed independently by Evrmind. On accuracy benchmarks, EVR-1 achieves near-parity (within 1.5pp on ARC-Challenge, near-identical perplexity to Q3_K_M).

See BENCHMARK_RESULTS.md for full methodology, raw outputs, and additional evaluations.

Limitations

  • This is a base model (not instruction-tuned). It completes text, but does not follow instructions or engage in conversation without additional prompting. The web UI works for open-ended generation and experimentation.
  • Context window is 2048 tokens (reduced from base Llama 3.1's 128K due to compression constraints).
  • Perplexity is comparable to Q3_K_M (6.19 vs 6.13) and higher than Q4_K_M (6.19 vs 5.74).
  • Occasional minor character-level artifacts due to 3-bit compression.

System Requirements

  • Storage: ~4 GiB for model weights + ~50 MB for binaries
  • RAM: 6 GiB minimum (8 GiB recommended)
  • GPU (recommended): NVIDIA (CUDA 12), Apple Silicon, or any Vulkan GPU
  • CPU-only: Supported but slower (use -ngl 0 or --cpu flag)
  • OS: Linux, macOS (Apple Silicon), Windows, Android (Termux)
  • Not supported: iOS, 32-bit systems

Safety and Responsible Use

This model can generate incorrect, biased, or harmful content. It has not been safety-tuned or RLHF-aligned. Users should apply appropriate content filtering for user-facing applications. See MODEL_CARD.md for details.

Derivative Works

If you create derivative works, credit "Evrmind EVR-1 Maano-8b" in your model name and documentation. Commercial use is permitted subject to the Llama 3.1 Community License Agreement.

License

This model is dual-licensed:

  1. Evrmind Free License 1.0: Covers the EVR-1 compression and distribution. Permits personal, research, and commercial use with attribution.
  2. Llama 3.1 Community License: Covers the underlying Llama 3.1 weights. Permits commercial use for entities with fewer than 700 million monthly active users.

Both licenses apply. See LICENSE.md and META_LLAMA_LICENSE.md for full terms.

Contact
