Instructions to use TensorVizion/Vecti-AI with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use TensorVizion/Vecti-AI with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TensorVizion/Vecti-AI",
    filename="quantization/vector-ai-q4.gguf",
)

# messages must be a list of role/content dicts, not a bare string
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello! What can you help me with?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use TensorVizion/Vecti-AI with llama.cpp:
Install (macOS, Linux)
```bash
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf TensorVizion/Vecti-AI
# Run inference directly in the terminal:
llama cli -hf TensorVizion/Vecti-AI
```
Install from WinGet (Windows)
```powershell
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf TensorVizion/Vecti-AI
# Run inference directly in the terminal:
llama cli -hf TensorVizion/Vecti-AI
```
Use pre-built binary
```bash
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf TensorVizion/Vecti-AI
# Run inference directly in the terminal:
./llama-cli -hf TensorVizion/Vecti-AI
```
Build from source code
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf TensorVizion/Vecti-AI
# Run inference directly in the terminal:
./build/bin/llama-cli -hf TensorVizion/Vecti-AI
```
Use Docker
```bash
docker model run hf.co/TensorVizion/Vecti-AI
```
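Once one of the server commands above is running, the OpenAI-compatible endpoint can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. It assumes the server listens on `http://localhost:8080/v1` (the address used in the agent configurations on this page); the helper names `build_chat_request` and `chat`, the system prompt, and the sampling values are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: the local server listens on localhost:8080, the same address
# used by the agent configurations elsewhere on this page.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(user_message, system_prompt="You are Vector AI, a helpful assistant."):
    """Build the JSON payload for an OpenAI-style chat completion request."""
    return {
        "model": "TensorVizion/Vecti-AI",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
    }


def chat(user_message, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` only works while the server is running; `build_chat_request` can be inspected on its own to see what would be sent.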
- LM Studio
- Jan
- Ollama
How to use TensorVizion/Vecti-AI with Ollama:
ollama run hf.co/TensorVizion/Vecti-AI
- Unsloth Studio
How to use TensorVizion/Vecti-AI with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for TensorVizion/Vecti-AI to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for TensorVizion/Vecti-AI to start chatting
```
Using HuggingFace Spaces for Unsloth
```bash
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for TensorVizion/Vecti-AI to start chatting
```
- Pi
How to use TensorVizion/Vecti-AI with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf TensorVizion/Vecti-AI
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "TensorVizion/Vecti-AI" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use TensorVizion/Vecti-AI with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf TensorVizion/Vecti-AI
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default TensorVizion/Vecti-AI
```
Run Hermes
```bash
hermes
```
- Atomic Chat
- Docker Model Runner
How to use TensorVizion/Vecti-AI with Docker Model Runner:
docker model run hf.co/TensorVizion/Vecti-AI
- Lemonade
How to use TensorVizion/Vecti-AI with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull TensorVizion/Vecti-AI
```
Run and chat with the model
```bash
lemonade run user.Vecti-AI-{{QUANT_TAG}}
```
List all available models
```bash
lemonade list
```
license: llama3.2

# Vector AI

**A fine-tuned general-purpose chat model built on Meta's Llama 3.2 3B Instruct.**

Developed by [LiquidVizion](https://liquidvizion.me) · Based on `meta-llama/Llama-3.2-3B-Instruct`

* * *
## Model Overview

Vector AI is a conversational language model fine-tuned from Llama 3.2 3B Instruct using QLoRA (4-bit quantized low-rank adaptation). It is designed for general chat and assistant tasks, with training focused on improving conversational coherence, instruction following, and response quality at the 3B parameter scale.

| Property | Value |
| --- | --- |
| Base Model | meta-llama/Llama-3.2-3B-Instruct |
| Fine-tune Method | QLoRA (PEFT) |
| Parameters | ~3B |
| Context Length | 4096 tokens |
| Language | English |
| License | Llama 3.2 Community License |

* * *

## Repository Contents

This repository includes four files for different use cases:

| File | Description | Use case |
| --- | --- | --- |
| `model.safetensors` | Full fp16 merged model | Production inference, further fine-tuning |
| `vector-ai-q6.gguf` | Q6_K GGUF quantization | High-quality local inference (llama.cpp, LM Studio) |
| `vector-ai-q4.gguf` | Q4_K_M GGUF quantization | Faster/lighter local inference, lower VRAM |
| `adapter_model.safetensors` | PEFT LoRA adapter weights | Apply on top of the base Llama 3.2 3B Instruct |

* * *

## Quickstart

### PEFT adapter (load on top of base model)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
adapter_id = "liquidvizion/vector-ai"  # update with your HF repo path

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()  # optional: merge for faster inference
```

### llama.cpp / LM Studio (GGUF)

Download either GGUF file and load it directly in LM Studio, Ollama, or llama.cpp:

```bash
# Q6: recommended for quality (requires ~3.5 GB RAM)
llama-cli -m vector-ai-q6.gguf -p "You are Vector AI." --chat-template llama3

# Q4: recommended for speed / lower memory (~2.5 GB RAM)
llama-cli -m vector-ai-q4.gguf -p "You are Vector AI." --chat-template llama3
```

* * *

## Chat Template

Vector AI uses the standard Llama 3.2 Instruct chat template:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are Vector AI, a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{your message here}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
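
For runtimes that take a raw prompt string, the template can be rendered by hand. The sketch below is a simplified illustration, with `format_llama3_prompt` as a hypothetical helper name. It assumes the Llama 3 layout in which each header is followed by a blank line and each turn ends with `<|eot_id|>`, and it omits anything else the tokenizer's bundled template may add. When the tokenizer is available, `tokenizer.apply_chat_template` is the safer route.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Render role/content messages into a Llama 3 style prompt string.

    Simplified sketch: assumes each header is followed by a blank line and
    each turn is closed with <|eot_id|>.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are Vector AI, a helpful assistant."},
    {"role": "user", "content": "What is QLoRA?"},
])
```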

* * *

## Training Details

| Setting | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.2-3B-Instruct |
| Fine-tune method | QLoRA |
| Quantization (training) | 4-bit NF4 (bitsandbytes) |
| Training framework | Unsloth + HuggingFace PEFT |
| Training data | General conversation / chat |
| Hardware | NVIDIA RTX 4060 8GB |
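
A rough calculation shows why 4-bit quantization matters on an 8 GB card. The sketch below estimates memory for the base weights alone, assuming exactly 3e9 parameters (the card only says "~3B") and 1 GB = 1e9 bytes. It ignores quantization constants, LoRA adapter weights, optimizer state, and activations, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 3e9  # assumption: "~3B" from the model card, exact count not stated

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 6.0 GB for fp16 weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 1.5 GB for 4-bit weights
```

Under these assumptions, fp16 weights would take about 6 GB of the 8 GB available, while 4-bit weights take about 1.5 GB and leave room for the rest of the training state.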

* * *

## Recommended Inference Settings

These settings work well for general chat use:

```python
temperature = 0.7          # balanced creativity vs coherence
top_p = 0.9                # nucleus sampling
top_k = 50                 # vocabulary diversity
repetition_penalty = 1.1   # reduces looping
max_new_tokens = 512
```

For more deterministic / factual responses, lower temperature to `0.3–0.5`.
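
These values can be collected into keyword arguments for a generation call. The sketch below is one way to do that. The helper `generation_kwargs` is hypothetical, and `do_sample=True` is an added assumption: in Hugging Face Transformers, sampling has to be enabled for `temperature`, `top_p`, and `top_k` to take effect.

```python
RECOMMENDED = {
    "do_sample": True,          # assumption: needed in Transformers for sampling settings to apply
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}


def generation_kwargs(deterministic=False):
    """Return a copy of the recommended settings, with a lower temperature if requested."""
    kwargs = dict(RECOMMENDED)
    if deterministic:
        kwargs["temperature"] = 0.3  # lower end of the 0.3-0.5 range suggested above
    return kwargs
```

With a model and tokenized inputs loaded as in the Quickstart, the result could be passed as `model.generate(**inputs, **generation_kwargs())`.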

* * *

## Limitations

* English only. Performance on other languages is untested.
* 3B parameter scale: larger models will outperform it on complex reasoning tasks.
* Not trained for code generation, mathematics, or domain-specific professional tasks.
* Like all language models, Vector AI can produce inaccurate or hallucinated responses. Always verify important information.
* Not aligned for safety-critical or high-stakes applications.

* * *

## License

This model is released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)**. Use is subject to Meta's acceptable use policy. Commercial use is permitted under the terms of that license.

Base model: © Meta Platforms, Inc.

Fine-tune and adapter weights: © LiquidVizion

* * *

## About TensorVizion

[LiquidVizion](https://liquidvizion.me) is a creative AI and design studio publishing open models, LoRA adapters, and generative AI tools. Find more models and resources on [HuggingFace](https://huggingface.co/liquidvizion) and [CivitAI](https://civitai.com).