Llama-3-Karnak-70B-v1.0

Llama-3-Karnak-70B-v1.0 is an Arabic–English causal language model built with Meta Llama 3 70B Instruct and further adapted for bilingual generation, instruction following, and Arabic-focused use cases.

Karnak is designed to provide strong Arabic and English responses for tasks such as question answering, explanation, summarization, content generation, research assistance, and general-purpose dialogue. The model is intended for local or private deployment using common inference frameworks such as Transformers and vLLM.

Built with Meta Llama 3.


Model Summary

Llama-3-Karnak-70B-v1.0 is a 70B-parameter autoregressive transformer model optimized for Arabic and English text generation.

The model builds on the Llama 3 70B Instruct architecture and was further improved through a multi-stage adaptation pipeline focused on:

  • Arabic and English instruction following
  • High-quality bilingual generation
  • Arabic fluency and style
  • Robust response formatting
  • General assistant-style behavior
  • Compatibility with standard Llama/Transformers/vLLM deployment tools

Key Features

  • Arabic–English Generation
    Supports Arabic and English prompts, with an emphasis on producing fluent, useful Arabic responses.

  • Instruction Following
    Adapted to follow user instructions across general QA, explanation, writing, summarization, and reasoning-style tasks.

  • Llama 3 70B Foundation
    Built on top of Meta Llama 3 70B Instruct, enabling compatibility with the broader Llama ecosystem.

  • Production-Friendly Inference
    Compatible with Hugging Face Transformers and vLLM for local and server-based deployment.

  • Local Deployment
    Suitable for private infrastructure where organizations need control over data, inference, and fine-tuning workflows.

  • Arabic-Optimized Tokenizer
    Improved Arabic tokenization efficiency, resulting in reduced token fragmentation and higher-quality generation.
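The tokenization claim can be checked empirically. The sketch below defines a small `fertility` helper (our own illustration, not part of the model repository) that measures average tokens per word; comparing the value on Arabic text against the base Llama 3 tokenizer gives a rough picture of fragmentation.

```python
def fertility(tokenizer, text: str) -> float:
    """Average number of tokens per whitespace-separated word.

    Lower fertility on Arabic text means less token fragmentation,
    which generally translates to cheaper and better generation.
    """
    words = text.split()
    tokens = tokenizer.encode(text)
    return len(tokens) / max(len(words), 1)

# With a real tokenizer (downloads the tokenizer files):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0")
#   print(fertility(tok, "اشرح لي نظرية النسبية بشكل مبسط."))
```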


Model Details

Field                   Value
Model name              Llama-3-Karnak-70B-v1.0
Base model              meta-llama/Meta-Llama-3-70B-Instruct
Architecture            Llama 3 causal language model
Parameters              70B
Languages               Arabic, English
Task                    Text generation / chat completion
Training type           Continued adaptation and supervised fine-tuning
Inference frameworks    Transformers, vLLM
License                 Meta Llama 3 Community License

Usage

1. Install Dependencies

pip install -U "transformers>=4.40.0" torch accelerate sentencepiece

For large-model inference, you may also need:

pip install -U bitsandbytes
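Whether quantization is needed can be estimated with back-of-the-envelope arithmetic: the weights alone take roughly parameter count times bytes per parameter (activations and the KV cache add more on top). A quick sketch for a 70B model:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3

N = 70e9  # 70B parameters

print(f"bf16:  {weight_memory_gb(N, 2):.0f} GiB")    # ~130 GiB
print(f"int8:  {weight_memory_gb(N, 1):.0f} GiB")    # ~65 GiB
print(f"4-bit: {weight_memory_gb(N, 0.5):.0f} GiB")  # ~33 GiB
```

In practice this means bf16 inference needs multiple GPUs (hence the tensor-parallel settings below), while 4-bit quantization via bitsandbytes brings the weights within reach of a smaller setup.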

2. Hugging Face Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

prompt = "اشرح لي نظرية النسبية بشكل مبسط."

messages = [
    {"role": "system", "content": "You are a helpful bilingual Arabic-English assistant."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

generated_ids = generated_ids[:, model_inputs.input_ids.shape[1]:]

response = tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]

print(response)
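For reference, the string that apply_chat_template builds for the two messages above follows the standard Llama 3 chat format. A hardcoded sketch is shown below (this assumes the model ships the stock Llama 3 template; the template bundled with the tokenizer is authoritative):

```python
# Approximate shape of the prompt apply_chat_template produces for a
# system + user conversation with add_generation_prompt=True.
system = "You are a helpful bilingual Arabic-English assistant."
user = "اشرح لي نظرية النسبية بشكل مبسط."  # "Explain relativity simply."

text = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
    # Trailing assistant header cues the model to generate the reply.
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(text)
```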

3. vLLM Inference

vLLM is recommended for high-throughput inference.

Install vLLM

pip install -U vllm

Offline Inference

from vllm import LLM, SamplingParams

model_id = "Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0"

llm = LLM(
    model=model_id,
    tensor_parallel_size=4,
    dtype="bfloat16",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "ما هي عاصمة مصر؟",
    "Explain the difference between supervised and unsupervised learning.",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print("Prompt:", output.prompt)
    print("Generated:", output.outputs[0].text)
    print("-" * 80)

4. vLLM Server Mode

You can serve the model through vLLM's OpenAI-compatible API server.

vllm serve "Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0" \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000

Then call the server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful bilingual Arabic-English assistant."},
        {"role": "user", "content": "اكتب فقرة قصيرة عن أهمية اللغة العربية في البحث العلمي."},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

print(response.choices[0].message.content)

Recommended Generation Settings

A general starting point:

temperature = 0.7
top_p = 0.9
max_new_tokens = 512

For more deterministic outputs:

temperature = 0.2
top_p = 0.8

For creative writing:

temperature = 0.8
top_p = 0.95
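These presets can be kept as plain dictionaries and unpacked into `model.generate(...)` or vLLM's `SamplingParams(...)`; note that Transformers names the length limit `max_new_tokens` while vLLM uses `max_tokens`. A minimal sketch:

```python
# Sampling presets from the table above, ready to unpack into either
# model.generate(**preset) or SamplingParams(**preset).
PRESETS = {
    "general":       {"temperature": 0.7, "top_p": 0.9},
    "deterministic": {"temperature": 0.2, "top_p": 0.8},
    "creative":      {"temperature": 0.8, "top_p": 0.95},
}

params = {"max_tokens": 512, **PRESETS["deterministic"]}
print(params)  # {'max_tokens': 512, 'temperature': 0.2, 'top_p': 0.8}
```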

License

This model is built with Meta Llama 3 and is released under the terms of the Meta Llama 3 Community License.

Users must comply with:

  • The Meta Llama 3 Community License
  • The Meta Llama 3 Acceptable Use Policy
  • Any applicable laws and regulations

This model is not released under Apache-2.0 because it is derived from Meta Llama 3.


Attribution

Built with Meta Llama 3.

Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.


Citation

If you use this model in research or applications, please cite:

@misc{karnak_70b_llama_2026,
  title        = {Llama-3-Karnak-70B-v1.0: An Arabic-English Large Language Model Built with Meta Llama 3},
  author       = {{Applied Innovation Center}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Applied-Innovation-Center/Karnak-70B-LLAMA-v1.0}},
  note         = {Built with Meta Llama 3}
}

Contact

For questions, feedback, or collaboration requests, please contact the Applied Innovation Center or open a discussion on the model repository.
