Instructions to use abetlen/replit-code-v1_5-3b-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use abetlen/replit-code-v1_5-3b-GGUF with llama-cpp-python:

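# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the quantized GGUF file from the Hugging Face Hub (cached locally) and load it.
llm = Llama.from_pretrained(
    repo_id="abetlen/replit-code-v1_5-3b-GGUF",
    filename="replit-code-v1_5-3b.Q4_0.gguf",
)

# Generate up to 512 tokens; echo=True includes the prompt in the returned text.
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)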
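
The call above returns a completion dictionary rather than a plain string. The following is a minimal sketch of pulling out the generated text, assuming the OpenAI-style layout that llama-cpp-python normally returns (`choices[0]["text"]`). The helper name `completion_text` and the `sample_output` dictionary are hypothetical stand-ins for illustration, not real model output.

```python
def completion_text(output: dict) -> str:
    """Return the text of the first choice in a completion-style response."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Hand-written stand-in for the dict returned by llm(...); all values are made up.
sample_output = {
    "object": "text_completion",
    "choices": [
        {"text": "def add(a, b):\n    return a + b", "index": 0, "finish_reason": "stop"}
    ],
}

print(completion_text(sample_output))
```

With a loaded model, the same helper could be applied to `output` instead of `sample_output`.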

Local Apps

llama.cpp

How to use abetlen/replit-code-v1_5-3b-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0
# Run inference directly in the terminal:
llama-cli -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0
# Run inference directly in the terminal:
llama-cli -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0
# Run inference directly in the terminal:
./llama-cli -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf abetlen/replit-code-v1_5-3b-GGUF:Q4_0

Use Docker

docker model run hf.co/abetlen/replit-code-v1_5-3b-GGUF:Q4_0
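
Once `llama-server` is running, it can be queried over HTTP. The sketch below builds a request for the OpenAI-compatible completions endpoint using only the standard library. It assumes the server listens on `http://localhost:8080` (adjust `base_url` if it was started with a different host or port) and that the response follows the OpenAI completion layout; the function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8080", max_tokens=64):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request; this only works while llama-server is running."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        # Assumes an OpenAI-style body: {"choices": [{"text": ...}, ...]}
        return json.loads(resp.read())["choices"][0]["text"]


req = build_completion_request("def fibonacci(n):")
print(req.get_method(), req.full_url)
```

Calling `complete("def fibonacci(n):")` would then return the generated continuation, provided the server is reachable at the assumed address.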

Ollama
How to use abetlen/replit-code-v1_5-3b-GGUF with Ollama:
```
ollama run hf.co/abetlen/replit-code-v1_5-3b-GGUF:Q4_0
```

Unsloth Studio

How to use abetlen/replit-code-v1_5-3b-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for abetlen/replit-code-v1_5-3b-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for abetlen/replit-code-v1_5-3b-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for abetlen/replit-code-v1_5-3b-GGUF to start chatting

Docker Model Runner
How to use abetlen/replit-code-v1_5-3b-GGUF with Docker Model Runner:
```
docker model run hf.co/abetlen/replit-code-v1_5-3b-GGUF:Q4_0
```

Lemonade

How to use abetlen/replit-code-v1_5-3b-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull abetlen/replit-code-v1_5-3b-GGUF:Q4_0

Run and chat with the model

lemonade run user.replit-code-v1_5-3b-GGUF-Q4_0

List all available models

lemonade list

Not working with llama-cpp-python

by hassan404 - opened Mar 4, 2024

Discussion

hassan404

Mar 4, 2024 (edited Mar 4, 2024)

I followed the instructions given here: https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion

Command used: python3 -m llama_cpp.server --model replit-code-v1_5-3b.f16.gguf --n_ctx 16192

Console output

llama_model_loader: loaded meta data with 17 key-value pairs and 195 tensors from replit-code-v1_5-3b.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mpt
llama_model_loader: - kv   1:                               general.name str              = replit-code-v1_5-3b
llama_model_loader: - kv   2:                         mpt.context_length u32              = 4096
llama_model_loader: - kv   3:                       mpt.embedding_length u32              = 3072
llama_model_loader: - kv   4:                            mpt.block_count u32              = 32
llama_model_loader: - kv   5:                    mpt.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:                   mpt.attention.head_count u32              = 24
llama_model_loader: - kv   7:                mpt.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:           mpt.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:               mpt.attention.max_alibi_bias f32              = 8.000000
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,32768]   = ["<|endoftext|>", "<|padding|>", "<fi...
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,32768]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,32494]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  130 tensors
llm_load_vocab: special tokens definition check successful ( 18/32768 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mpt
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 32494
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 8.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = -1
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16 (guessed)
llm_load_print_meta: model params     = 3.42 B
llm_load_print_meta: model size       = 6.38 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = replit-code-v1_5-3b
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 146 'Ä'
llm_load_tensors: ggml ctx size =    0.07 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 195, got 194
llama_load_model_from_file: failed to load model
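
Two details in this log are easy to miss among the metadata lines: the loader reports that it expected 195 tensors but got 194, and the command requests `--n_ctx 16192` while the model metadata reports a training context of 4096. The sketch below pulls both facts out of a log like the one above. The function name and regular expressions are illustrative only and are not part of llama.cpp or llama-cpp-python.

```python
import re


def summarize_load_log(log: str) -> dict:
    """Extract the tensor-count mismatch and training context from a loader log."""
    summary = {}
    mismatch = re.search(r"wrong number of tensors; expected (\d+), got (\d+)", log)
    if mismatch:
        summary["tensors_expected"] = int(mismatch.group(1))
        summary["tensors_got"] = int(mismatch.group(2))
    ctx = re.search(r"n_ctx_train\s*=\s*(\d+)", log)
    if ctx:
        summary["n_ctx_train"] = int(ctx.group(1))
    return summary


# Two lines copied from the console output above.
log_excerpt = """\
llm_load_print_meta: n_ctx_train      = 4096
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 195, got 194
"""

summary = summarize_load_log(log_excerpt)
print(summary)

requested_n_ctx = 16192  # value passed via --n_ctx in the command above
if requested_n_ctx > summary.get("n_ctx_train", requested_n_ctx):
    print("requested context exceeds the training context")
```

The sketch only surfaces what the log already states; it does not determine why the tensor counts differ.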

goodasdgood

Sep 4, 2024

llama_cpp.server did not work with any model for me.
