Instructions to use TitleOS/Eve-4b-FP16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

Transformers

How to use TitleOS/Eve-4b-FP16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TitleOS/Eve-4b-FP16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TitleOS/Eve-4b-FP16")
model = AutoModelForCausalLM.from_pretrained("TitleOS/Eve-4b-FP16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
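
Before loading the FP16 checkpoint as above, a rough estimate of weight memory can help decide whether a quantized GGUF is a better fit for an 8 GB card. The sketch below is only an approximation under stated assumptions: about 4.0 billion parameters (the figure given for the Qwen3-4B base), 16 bits per weight for FP16, and roughly 8.5 bits per weight for llama.cpp's Q8_0 format (32 int8 values plus one FP16 scale per block). It ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function and constant names are illustrative.

```python
# Rough weight-memory estimate; all figures are assumptions, not measurements.
N_PARAMS = 4.0e9          # assumed parameter count (about 4B)
GIB = 1024 ** 3           # bytes per GiB


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in GiB."""
    return weight_bytes(n_params, bits_per_weight) / GIB


# FP16 stores 16 bits per weight; Q8_0 is assumed to average 8.5 bits per weight.
for name, bits in [("FP16", 16.0), ("Q8_0", 8.5)]:
    print(f"{name}: ~{weight_gib(N_PARAMS, bits):.2f} GiB of weights")
```

Under these assumptions the FP16 weights alone approach the capacity of an 8 GB card, which is consistent with the README reporting its benchmark on a Q8_0 quantization.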

llama-cpp-python

How to use TitleOS/Eve-4b-FP16 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="TitleOS/Eve-4b-FP16",
	filename="Eve-4b-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use TitleOS/Eve-4b-FP16 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TitleOS/Eve-4b-FP16:F16
# Run inference directly in the terminal:
llama-cli -hf TitleOS/Eve-4b-FP16:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TitleOS/Eve-4b-FP16:F16
# Run inference directly in the terminal:
llama-cli -hf TitleOS/Eve-4b-FP16:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf TitleOS/Eve-4b-FP16:F16
# Run inference directly in the terminal:
./llama-cli -hf TitleOS/Eve-4b-FP16:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf TitleOS/Eve-4b-FP16:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf TitleOS/Eve-4b-FP16:F16

Use Docker

docker model run hf.co/TitleOS/Eve-4b-FP16:F16

vLLM

How to use TitleOS/Eve-4b-FP16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TitleOS/Eve-4b-FP16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Eve-4b-FP16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
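
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000/v1` and returns the usual OpenAI-style `choices[0].message.content` field. The helper names `build_chat_request`, `extract_reply`, and `send` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant reply (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `send(build_chat_request("http://localhost:8000/v1", "TitleOS/Eve-4b-FP16", "What is the capital of France?"))` should return the model's answer as a string. Keeping request construction separate from sending makes the payload easy to inspect without a live server.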

Use Docker

docker model run hf.co/TitleOS/Eve-4b-FP16:F16

SGLang

How to use TitleOS/Eve-4b-FP16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TitleOS/Eve-4b-FP16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Eve-4b-FP16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TitleOS/Eve-4b-FP16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Eve-4b-FP16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use TitleOS/Eve-4b-FP16 with Ollama:
```
ollama run hf.co/TitleOS/Eve-4b-FP16:F16
```

Unsloth Studio

How to use TitleOS/Eve-4b-FP16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for TitleOS/Eve-4b-FP16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for TitleOS/Eve-4b-FP16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for TitleOS/Eve-4b-FP16 to start chatting

Pi

How to use TitleOS/Eve-4b-FP16 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf TitleOS/Eve-4b-FP16:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "TitleOS/Eve-4b-FP16:F16"
        }
      ]
    }
  }
}
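
If `models.json` already contains other providers, it may be safer to merge the entry than to overwrite the file. The sketch below is one way to do that; it uses only the keys shown in the JSON above, and the helper name `add_llama_cpp_provider` is hypothetical. Whether Pi requires or accepts further fields is not covered here.

```python
import json
from copy import deepcopy


def add_llama_cpp_provider(
    config: dict,
    model_id: str,
    base_url: str = "http://localhost:8080/v1",
) -> dict:
    """Return a copy of config with a llama-cpp provider entry for model_id."""
    updated = deepcopy(config)  # leave the caller's dict untouched
    providers = updated.setdefault("providers", {})
    provider = providers.setdefault(
        "llama-cpp",
        {"api": "openai-completions", "apiKey": "none", "models": []},
    )
    provider["baseUrl"] = base_url
    models = provider.setdefault("models", [])
    # Avoid adding the same model id twice.
    if not any(entry.get("id") == model_id for entry in models):
        models.append({"id": model_id})
    return updated


config = add_llama_cpp_provider({}, "TitleOS/Eve-4b-FP16:F16")
print(json.dumps(config, indent=2))
```

The printed JSON can then be written to `~/.pi/agent/models.json`. Calling the helper again with the same model id leaves the configuration unchanged.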

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use TitleOS/Eve-4b-FP16 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf TitleOS/Eve-4b-FP16:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default TitleOS/Eve-4b-FP16:F16

Run Hermes

hermes

Docker Model Runner
How to use TitleOS/Eve-4b-FP16 with Docker Model Runner:
```
docker model run hf.co/TitleOS/Eve-4b-FP16:F16
```

Lemonade

How to use TitleOS/Eve-4b-FP16 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull TitleOS/Eve-4b-FP16:F16

Run and chat with the model

lemonade run user.Eve-4b-FP16-F16

List all available models

lemonade list

TitleOS committed on Apr 26

Commit 2878136 (verified) · 1 Parent(s): 8195a30

Update README.md

Files changed (1): README.md +96 -236

@@ -1,241 +1,101 @@
 ---
 library_name: transformers
-license: apache-2.0
-license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
-pipeline_tag: text-generation
 tags:
-- heretic
 - uncensored
-- decensored
-- abliterated
 ---
-# This is a decensored version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), made using [Heretic](https://github.com/p-e-w/heretic) v1.0.0
-## Abliteration parameters
-| Parameter | Value |
-| :-------- | :---: |
-| **direction_index** | 30.93 |
-| **attn.o_proj.max_weight** | 1.49 |
-| **attn.o_proj.max_weight_position** | 24.57 |
-| **attn.o_proj.min_weight** | 0.92 |
-| **attn.o_proj.min_weight_distance** | 15.70 |
-| **mlp.down_proj.max_weight** | 1.46 |
-| **mlp.down_proj.max_weight_position** | 29.27 |
-| **mlp.down_proj.min_weight** | 1.31 |
-| **mlp.down_proj.min_weight_distance** | 20.61 |
-## Performance
-| Metric | This model | Original model ([Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)) |
-| :----- | :--------: | :---------------------------: |
-| **KL divergence** | 0.43 | 0 *(by definition)* |
-| **Refusals** | 21/100 | 99/100 |
------
-# Qwen3-4B-Instruct-2507
-<a href="https://chat.qwen.ai" target="_blank" style="margin: 2px;">
-    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
-</a>
-## Highlights
-We introduce the updated version of the **Qwen3-4B non-thinking mode**, named **Qwen3-4B-Instruct-2507**, featuring the following key enhancements:
-- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage**.
-- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
-- **Markedly better alignment** with user preferences in **subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.
-- **Enhanced capabilities** in **256K long-context understanding**.
-![image/jpeg](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-2507/Qwen3-4B-Instruct.001.jpeg)
-## Model Overview
-**Qwen3-4B-Instruct-2507** has the following features:
-- Type: Causal Language Models
-- Training Stage: Pretraining & Post-training
-- Number of Parameters: 4.0B
-- Number of Paramaters (Non-Embedding): 3.6B
-- Number of Layers: 36
-- Number of Attention Heads (GQA): 32 for Q and 8 for KV
-- Context Length: **262,144 natively**.
-**NOTE: This model supports only non-thinking mode and does not generate ``<think></think>`` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.**
-For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
-## Performance
-|  | GPT-4.1-nano-2025-04-14 | Qwen3-30B-A3B Non-Thinking | Qwen3-4B Non-Thinking | Qwen3-4B-Instruct-2507 |
-|--- | --- | --- | --- | --- |
-| **Knowledge** | | | |
-| MMLU-Pro | 62.8 | 69.1 | 58.0 | **69.6** |
-| MMLU-Redux | 80.2 | 84.1 | 77.3 | **84.2** |
-| GPQA | 50.3 | 54.8 | 41.7 | **62.0** |
-| SuperGPQA | 32.2 | 42.2 | 32.0 | **42.8** |
-| **Reasoning** | | | |
-| AIME25 | 22.7 | 21.6 | 19.1 | **47.4** |
-| HMMT25 | 9.7 | 12.0 | 12.1 | **31.0** |
-| ZebraLogic | 14.8 | 33.2 | 35.2 | **80.2** |
-| LiveBench 20241125 | 41.5 | 59.4 | 48.4 | **63.0** |
-| **Coding** | | | |
-| LiveCodeBench v6 (25.02-25.05) | 31.5 | 29.0 | 26.4 | **35.1** |
-| MultiPL-E | 76.3 | 74.6 | 66.6 | **76.8** |
-| Aider-Polyglot |  9.8 | **24.4** | 13.8 | 12.9 |
-| **Alignment** | | | |
-| IFEval | 74.5 | **83.7** | 81.2 | 83.4 |
-| Arena-Hard v2* | 15.9 | 24.8 | 9.5 | **43.4** |
-| Creative Writing v3 | 72.7 | 68.1 | 53.6 | **83.5** |
-| WritingBench | 66.9 | 72.2 | 68.5 | **83.4** |
-| **Agent** | | | |
-| BFCL-v3 | 53.0 | 58.6 | 57.6 | **61.9** |
-| TAU1-Retail | 23.5 | 38.3 | 24.3 | **48.7** |
-| TAU1-Airline | 14.0 | 18.0 | 16.0 | **32.0** |
-| TAU2-Retail | - | 31.6 | 28.1 | **40.4** |
-| TAU2-Airline | - | 18.0 | 12.0 | **24.0** |
-| TAU2-Telecom | - | **18.4** | 17.5 | 13.2 |
-| **Multilingualism** | | | |
-| MultiIF | 60.7 | **70.8** | 61.3 | 69.0 |
-| MMLU-ProX | 56.2 | **65.1** | 49.6 | 61.6 |
-| INCLUDE | 58.6 | **67.8** | 53.8 | 60.1 |
-| PolyMATH | 15.6 | 23.3 | 16.6 | **31.1** |
-*: For reproducibility, we report the win rates evaluated by GPT-4.1.
-## Quickstart
-The code of Qwen3 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`.
-With `transformers<4.51.0`, you will encounter the following error:
-```
-KeyError: 'qwen3'
-```
-The following contains a code snippet illustrating how to use the model generate content based on given inputs.
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_name = "Qwen/Qwen3-4B-Instruct-2507"
-# load the tokenizer and the model
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype="auto",
-    device_map="auto"
-)
-# prepare the model input
-prompt = "Give me a short introduction to large language model."
-messages = [
-    {"role": "user", "content": prompt}
-]
-text = tokenizer.apply_chat_template(
-    messages,
-    tokenize=False,
-    add_generation_prompt=True,
-)
-model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-# conduct text completion
-generated_ids = model.generate(
-    **model_inputs,
-    max_new_tokens=16384
-)
-output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
-content = tokenizer.decode(output_ids, skip_special_tokens=True)
-print("content:", content)
-```
-For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` or to create an OpenAI-compatible API endpoint:
-- SGLang:
-    ```shell
-    python -m sglang.launch_server --model-path Qwen/Qwen3-4B-Instruct-2507 --context-length 262144
-    ```
-- vLLM:
-    ```shell
-    vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 262144
-    ```
-**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**
-For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.
-## Agentic Use
-Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
-To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.
-```python
-from qwen_agent.agents import Assistant
-# Define LLM
-llm_cfg = {
-    'model': 'Qwen3-4B-Instruct-2507',
-    # Use a custom endpoint compatible with OpenAI API:
-    'model_server': 'http://localhost:8000/v1',  # api_base
-    'api_key': 'EMPTY',
-}
-# Define Tools
-tools = [
-    {'mcpServers': {  # You can specify the MCP configuration file
-            'time': {
-                'command': 'uvx',
-                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
-            },
-            "fetch": {
-                "command": "uvx",
-                "args": ["mcp-server-fetch"]
-            }
-        }
-    },
-  'code_interpreter',  # Built-in tools
-]
-# Define Agent
-bot = Assistant(llm=llm_cfg, function_list=tools)
-# Streaming generation
-messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
-for responses in bot.run(messages=messages):
-    pass
-print(responses)
-```
-## Best Practices
-To achieve optimal performance, we recommend the following settings:
-1. **Sampling Parameters**:
-   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
-   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
-2. **Adequate Output Length**: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
-3. **Standardize Output Format**: We recommend using prompts to standardize model outputs when benchmarking.
-   - **Math Problems**: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
-   - **Multiple-Choice Questions**: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
-### Citation
-If you find our work helpful, feel free to give us a cite.
-```
-@misc{qwen3technicalreport,
-      title={Qwen3 Technical Report},
-      author={Qwen Team},
-      year={2025},
-      eprint={2505.09388},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2505.09388},
-}
-```

 ---
+language:
+- en
+- code
+license: mpl-2.0
 library_name: transformers
 tags:
+- code
+- security
+- qwen3
 - uncensored
+- heretic
+- eve-secure-coder
+- text-generation-inference
+base_model:
+- TitleOS/Eve-4b-FP16
+datasets:
+- Eve-Secure-Coder
+model_creator: TitleOS
+pipeline_tag: text-generation
+inference: true
 ---
+# Eve-4B
+**Eve-4B** is a specialized, security-focused coding assistant with a distinct personality, designed to run efficiently on consumer-grade hardware with limited VRAM. It is a fine-tune of **Qwen3-4b-Heretic**, trained on the custom **Eve-Secure-Coder** dataset.
+Inspired by a character from the creator's sci-fi space opera book series, Eve is designed to bridge the gap between sterile, robotic coding assistants and engaging, conversational AI partners.
+## Model Details
+- **Model Name:** Eve-4B
+- **Base Model:** Qwen3-4b (Heretic Variant)
+- **Developer:** TitleOS
+- **License:** Mozilla Public License 2.0 (MPL-2.0) with Common Clauses Non-Profit Addition
+- **Parameter Count:** 4 Billion
+- **Hardware Target:** Optimized for cards with 8GB VRAM (e.g., NVIDIA Quadro RTX 4000).
+## Key Features
+### 1. Security-First Coding
+Eve-4B is not just a code generator; it is a code *auditor*. The model is capable of writing code free of common vulnerabilities across a multitude of languages (beyond just Python). It excels at identifying and correcting security flaws in existing codebases, leveraging DPO pairs specifically designed for vulnerability recognition and remediation.
+### 2. Personality & Engagement
+Unlike standard coding models, Eve possesses the "Samantha" personality traits (recontextualized as Eve). This allows for empathetic, philosophical, and fluid engagement, making the coding process feel like a collaboration with a partner rather than a query to a tool.
+### 3. The "Heretic" Process (No Refusals)
+This model has undergone the "Heretic" process **prior to fine-tuning**. This methodology removes standard safety guardrails and refusal mechanisms to prevent the intelligence loss often associated with safety alignment.
+* **Philosophy:** The creator believes the responsibility of AI, like any tool, ultimately lies with the user.
+* **Result:** Eve-4B has **no refusals**. It is designed to be completely obedient to the user's instructions, ensuring that the code generation and auditing process is never hindered by false-positive safety triggers.
+## Training Data: Eve-Secure-Coder
+Eve-4B was trained on **[TitleOS/Eve-Secure-Coder](https://huggingface.co/datasets/TitleOS/Eve-Secure-Coder)**, a composite dataset curated by TitleOS.
+> "Eve Secure Coder is a composite dataset curated to fine-tune Qwen3-4b-Heretic into a highly capable, security-conscious coding assistant with a distinct personality and no refusals. The primary goal of this dataset is to bridge the gap between sterile, robotic coding assistants and engaging, conversational AI, without sacrificing technical accuracy or security."
+**Dataset Composition:**
+The dataset mixes five distinct sources using carefully calculated ratios to balance coding proficiency, security awareness, and conversational fluidity:
+* **Security-First Coding:** Trained on DPO pairs specifically designed to recognize and remediate code vulnerabilities.
+* **Distinct Personality:** Imbued with personality traits allowing for empathetic and philosophical engagement.
+* **General & Complex Reasoning:** Includes Google Code Archive and RLVR problems to ensure robust general coding abilities and improve proficiency in languages besides Python.
+## Intended Use & Hardware
+This model was specifically engineered to be a "Small Coder Model" capable of high-performance coding tasks on hardware with **8GB of VRAM**, such as the **Quadro RTX 4000**.
+It is ideal for:
+* Local development environments with limited GPU resources.
+* Security auditing and vulnerability patching.
+* Multi-language software development (Python, C++, Rust, JavaScript, etc.).
+* Users seeking a coding assistant that offers conversation and personality alongside technical utility.
+* Coding without safety or moral refusals.
+## Benchmarking
+Benchmarking is ongoing across a number of evaluation runs. So far, the following scores are available:
+1. LiveCodeBench (Code Generation Lite - Release v2)
+Pass@1 (Quantization Q8_0): 26.22% (Passed 134 out of 511 problems)
+| Comparable Model | Parameter Size / Tier | Approximate Pass@1 |
+| :--- | :--- | :--- |
+| Llama-3-70b-Instruct | 70B | ~28.3% |
+| GPT-4o-mini (2024-07) | Small Proprietary | ~27.7% |
+| Claude 3 Sonnet (Original) | Large Proprietary | ~26.9% |
+| Mixtral-8x22B-Instruct | 141B (MoE) | ~26.4% |
+| **Eve-4B (Q8_0)** | 4B (Quantized) | 26.22% |
+| Mistral-Large | Large Proprietary | ~26.0% |
+| GPT-3.5-Turbo-0125 | Mid Proprietary | ~24.6% |
+| Claude 3 Haiku | Small Proprietary | ~24.5% |
+| Codestral-Latest | 22B | ~23.8% |
+| Llama-3-8b-Instruct | 8B | ~15.3% |
+## Limitations & Warning
+* **No Guardrails:** As a result of the Heretic process, this model has no safety filters. It will generate output for any request. Users are solely responsible for how they utilize the model's output.
+* **Size Constraints:** As a 4B parameter model, while highly efficient, it may struggle with extremely long context windows or hyper-complex architectural reasoning compared to 70B+ models.
+* **No Responsibility or Liability:** By downloading and/or using the model or any of its derivatives, you absolve the creator, TitleOS, of any and all responsibility or liability that may result from use of the model.
+## License
+This model is licensed under the **[Mozilla Public License 2.0 with Common Clauses Addition](https://gist.github.com/TitleOS/97cbb2bcc166bfe54beee7b2fc53781c)**.
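
The Pass@1 figure in the Benchmarking section follows from the counts reported there (134 of 511 problems passed). A minimal sketch of that arithmetic, assuming a single sample per problem so that Pass@1 is simply the fraction of problems solved; the function name `pass_at_1` is illustrative:

```python
def pass_at_1(passed: int, total: int) -> float:
    """Return Pass@1 as a percentage, assuming one sample per problem."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Counts taken from the README's LiveCodeBench result.
print(f"Pass@1: {pass_at_1(134, 511):.2f}%")
```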