Instructions for using Neura-Tech-AI/Neuron-V1-3B-Instruct with libraries and local apps.

Libraries

How to use Neura-Tech-AI/Neuron-V1-3B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Neura-Tech-AI/Neuron-V1-3B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Neura-Tech-AI/Neuron-V1-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Neura-Tech-AI/Neuron-V1-3B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Neura-Tech-AI/Neuron-V1-3B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Neura-Tech-AI/Neuron-V1-3B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Neura-Tech-AI/Neuron-V1-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
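
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the default `base_url` assumes the vLLM server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Neura-Tech-AI/Neuron-V1-3B-Instruct",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` should return the reply string. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.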

Use Docker

docker model run hf.co/Neura-Tech-AI/Neuron-V1-3B-Instruct

SGLang

How to use Neura-Tech-AI/Neuron-V1-3B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Neura-Tech-AI/Neuron-V1-3B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Neura-Tech-AI/Neuron-V1-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Neura-Tech-AI/Neuron-V1-3B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Neura-Tech-AI/Neuron-V1-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Neura-Tech-AI/Neuron-V1-3B-Instruct with Docker Model Runner:
```
docker model run hf.co/Neura-Tech-AI/Neuron-V1-3B-Instruct
```

	---
	language:
	- en
	- hi
	tags:
	- neuron
	- neura-tech-ai
	- 3B
	- text-generation
	license: other
	license_name: qwen-research
	license_link: https://github.com/QwenLM/Qwen2.5/blob/main/LICENSE
	datasets:
	- custom-neura-tech-data
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen2.5-3B-Instruct
	pipeline_tag: text-generation
	library_name: transformers
	---

	# 🧠 Neura-Tech-AI/Neuron-V1-3B-Instruct: The Official Intelligence of Neura Tech AI

	Neuron-V1-3B-Instruct is a fine-tuned Large Language Model (LLM) developed by Neura Tech AI. It is a standalone checkpoint intended to run locally as an assistant for reasoning, creative writing, and structured multilingual communication.

	The LoRA adapters are permanently merged into the base weights, so the checkpoint loads like any ordinary model. This avoids external package dependencies and adapter-loading validation errors during production inference.

	---

	## 🏢 Organization Identity
	* Company: Neura Tech AI
	* Project Name: Neuron-V1-3B-Instruct
	* Lead Architect: Samarth Anand Pathak

	## 📊 Model Specifications
	* Architecture: Causal Language Model (Fine-tuned and permanently fused from Qwen2.5-3B-Instruct)
	* Parameters: ~3.09 Billion
	* Precision: FP16 (Float16)
	* Context Window: 32K tokens
	* Format: ChatML Compatible (Native padding and Chat Templates pre-configured)
	* License: Subject to the Qwen Research License Agreement (Inherited from the base Qwen2.5 architecture)
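
To make "ChatML Compatible" concrete, here is a minimal sketch of the ChatML turn layout. `render_chatml` is a hypothetical helper written for illustration. The real template ships with the tokenizer and is applied by `tokenizer.apply_chat_template`, and it may add details this sketch omits, such as a default system prompt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Neuron, an advanced AI system developed by Neura Tech AI."},
    {"role": "user", "content": "Who are you?"},
]
print(render_chatml(messages))
```

For real inference, prefer the tokenizer's own template so the prompt matches what the model saw during training.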

	---

	## 🎯 Core Capabilities
	* Multilingual Proficiency: Tuned for contextual understanding across English, Hindi, and code-switched Hindi-English (Hinglish).
	* Native Identity Alignment: Includes identity and safety alignment so the model consistently presents itself as an agent of Neura Tech AI.
	* Production Edge Readiness: Low memory footprint (~10 GB VRAM in standard Float16 execution), making it viable on consumer-grade local hardware.
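
As a rough cross-check of the memory figure, the weights alone can be estimated from the parameter count and precision listed in the specifications. This is back-of-the-envelope arithmetic, not a measurement. `weight_memory_gb` is a hypothetical helper, and the remainder it prints is simply what is left of the quoted ~10 GB for runtime state such as the KV cache, activations, and the CUDA context.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(3.09e9)   # FP16 stores 2 bytes per parameter
headroom_gb = 10.0 - weights_gb         # what is left of the ~10 GB quoted above
print(f"weights: {weights_gb:.2f} GB, remaining for runtime state: {headroom_gb:.2f} GB")
```

Actual usage depends on sequence length, batch size, and the attention implementation, so treat the split as an estimate only.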

	---

	## 📈 Standard Benchmark & Evaluation Setup

	To assess Project Neuron's generation stability, execution latency, and instruction-following consistency, use the baseline quantitative evaluation pipeline below.

	### 1. Benchmark Testing Pipeline (`benchmark_eval.py`)
	```python
	import time

	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_path = "Neura-Tech-AI/Neuron-V1-3B-Instruct"

	print("🎯 Initializing Project Neuron Evaluation Suite...")
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

	eval_prompts = [
	    "Tell me about Project Neuron in short. What is its scale?",
	    "Explain quantum computing in simple Hindi lyrics.",
	    "Write a secure python API routing block for model inference.",
	]


	def run_performance_test(prompt):
	    """Generate one reply and return (latency in seconds, tokens/sec, reply text)."""
	    messages = [
	        {"role": "system", "content": "You are Neuron, an advanced AI system developed by Neura Tech AI."},
	        {"role": "user", "content": prompt},
	    ]
	    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	    inputs = tokenizer([text], return_tensors="pt").to("cuda")
	    input_len = inputs.input_ids.shape[1]

	    start_time = time.perf_counter()
	    with torch.no_grad():
	        outputs = model.generate(
	            **inputs,
	            max_new_tokens=150,
	            do_sample=False,  # greedy decoding; temperature is ignored in this mode, so it is not passed
	            pad_token_id=tokenizer.eos_token_id,
	        )
	    latency = time.perf_counter() - start_time

	    # Keep only the newly generated tokens (everything after the prompt).
	    generated_tokens = outputs[0][input_len:]
	    token_count = len(generated_tokens)
	    tokens_per_second = token_count / latency

	    response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
	    return latency, tokens_per_second, response


	print("\n--- Running Quantitative Evaluation Matrix ---")
	for i, prompt in enumerate(eval_prompts, 1):
	    lat, tps, resp = run_performance_test(prompt)
	    print(f"\n📊 Test Case #{i}: '{prompt}'")
	    print(f"⏱️ Latency: {lat:.2f}s | ⚡ Speed: {tps:.2f} tokens/sec")
	    print(f"🤖 Output:\n{resp}\n" + "-" * 40)
	```

	### 2. Operational Thresholds
	* Throughput Speed: Averages ~40-50 tokens/sec under a stable CUDA configuration.
	* VRAM Overhead: Peak VRAM consumption is approximately 10.5 GB to 12 GB during batched text generation.
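
To relate the throughput figure to wall-clock time, the sketch below converts a decoding rate into an expected generation time and shows one way to aggregate several benchmark runs. Both helper names are hypothetical, and the arithmetic assumes a steady decoding rate, ignoring prompt processing time.

```python
def generation_time_s(num_tokens, tokens_per_second):
    """Seconds needed to emit num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


def aggregate_throughput(runs):
    """Overall tokens/sec for a list of (token_count, latency_s) pairs.

    Dividing total tokens by total time weights long runs properly,
    unlike averaging the per-run rates.
    """
    total_tokens = sum(tokens for tokens, _ in runs)
    total_time = sum(latency for _, latency in runs)
    return total_tokens / total_time


# 150 new tokens (the max_new_tokens used in benchmark_eval.py) at the quoted rates:
print(generation_time_s(150, 50), generation_time_s(150, 40))  # 3.0 3.75
```

At the quoted 40-50 tokens/sec, a full 150-token reply therefore takes roughly 3 to 3.75 seconds of decoding.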

	## 🛠️ Quick Start & Native Slicing Inference

	To keep the prompt (including the system message) out of the decoded reply, slice off the input tokens before decoding, as shown below:
	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "Neura-Tech-AI/Neuron-V1-3B-Instruct"

	# Load Standalone Tokenizer & Fused Core Weights
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Standard Query Payload
	messages = [
	    {"role": "system", "content": "You are Project Neuron, an advanced AI system developed by Neura Tech AI."},
	    {"role": "user", "content": "tu kon hai be."},  # Hinglish, roughly "who are you?"
	]

	# Apply Native Tokenization Layout
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	# Move inputs to the device chosen by device_map instead of hard-coding "cuda"
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Run Stable Token Generation
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=100,
	    temperature=0.3,
	    do_sample=True,
	    top_p=0.9,
	    pad_token_id=tokenizer.eos_token_id,
	)

	# Input-Length Slicing for explicit assistant reply isolation
	input_len = inputs.input_ids.shape[1]
	clean_reply = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True).strip()

	print(f"🤖 Project Neuron Reply:\n{clean_reply}")

	```
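
The slicing step works because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. A toy illustration with plain lists follows; the token IDs are made up and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


prompt_ids = [101, 7592, 2088]               # made-up prompt token IDs
output_ids = prompt_ids + [2023, 2003, 102]  # generate() echoes the prompt, then appends the reply
reply_ids = strip_prompt(output_ids, len(prompt_ids))
print(reply_ids)  # [2023, 2003, 102]
```

Decoding `reply_ids` instead of `output_ids` is what keeps the system and user messages out of the printed reply.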

	## 📜 License & Usage Limitations

	### 1. Developer Custom Copyright
	Copyright © 2026, Samarth Anand Pathak & Neura Tech AI. All rights reserved.
	The fine-tuning methods, dataset processing pipelines, and merged checkpoint weights remain proprietary implementations managed by the Neura Tech AI Research Divisions.

	### 2. Base Model Inherited License
	Because this checkpoint is built on the open-weights release of Qwen2.5-3B-Instruct, any downstream deployment, distribution, or commercial use must comply with the terms, conditions, and safety restrictions of the Qwen Research License Agreement issued by Alibaba Cloud.

	---

	© 2026 Neura Tech AI. All Rights Reserved.