Instructions for using cemig-temp/llama3.2-3b-base-data-nemotron with libraries and local apps.

Libraries

How to use cemig-temp/llama3.2-3b-base-data-nemotron with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cemig-temp/llama3.2-3b-base-data-nemotron")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the generated text is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cemig-temp/llama3.2-3b-base-data-nemotron")
model = AutoModelForCausalLM.from_pretrained("cemig-temp/llama3.2-3b-base-data-nemotron")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
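
The axolotl config further down sets chat_template: chatml, so apply_chat_template in the snippet above is expected to render the messages in ChatML form. The following is a minimal hand-rolled sketch of that layout, assuming the standard ChatML markers <|im_start|> and <|im_end|>. The helper format_chatml is hypothetical and only for illustration; the template shipped with the tokenizer remains authoritative.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts as a ChatML prompt string.

    Assumption: standard ChatML markers; the tokenizer's own template may differ.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>ROLE\nCONTENT<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Who are you?"}]
print(format_chatml(messages))
```

Comparing this string with tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) is a quick way to check which template the uploaded tokenizer actually carries.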

Local Apps

vLLM

How to use cemig-temp/llama3.2-3b-base-data-nemotron with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cemig-temp/llama3.2-3b-base-data-nemotron"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cemig-temp/llama3.2-3b-base-data-nemotron",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
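
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion JSON. The helper names build_chat_payload, extract_reply, and chat_completion are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL = "cemig-temp/llama3.2-3b-base-data-nemotron"


def build_chat_payload(model, prompt):
    """Build the JSON body used by the curl example: one user message."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def chat_completion(base_url, model, prompt, timeout=60):
    """POST a single-turn chat request and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Example call (requires the server to be running):
# print(chat_completion("http://localhost:8000", MODEL, "What is the capital of France?"))
```

Because the SGLang server described later exposes the same /v1/chat/completions route, pointing base_url at http://localhost:30000 should work there as well.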

Use Docker

docker model run hf.co/cemig-temp/llama3.2-3b-base-data-nemotron

SGLang

How to use cemig-temp/llama3.2-3b-base-data-nemotron with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cemig-temp/llama3.2-3b-base-data-nemotron" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cemig-temp/llama3.2-3b-base-data-nemotron",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cemig-temp/llama3.2-3b-base-data-nemotron" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cemig-temp/llama3.2-3b-base-data-nemotron",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use cemig-temp/llama3.2-3b-base-data-nemotron with Docker Model Runner:

```
docker model run hf.co/cemig-temp/llama3.2-3b-base-data-nemotron
```

See axolotl config

axolotl version: 0.13.0.dev0

# ===== Model =====
base_model: meta-llama/Llama-3.2-3B
tokenizer_type: AutoTokenizer
trust_remote_code: true

# Llama 3.2 is derived from Llama; this helps Axolotl apply the correct optimizations
is_llama_derived_model: true

# Chat template
chat_template: chatml

plugins:
  - axolotl.integrations.liger.LigerPlugin

special_tokens:
  pad_token: "<|eot_id|>"

# ===== Dataset (Nemotron Post-Training SFT) =====
datasets:
  - path: nvidia/Llama-Nemotron-Post-Training-Dataset
    name: SFT           # HF subset
    split: chat         # you can duplicate this block for math_v1.1, science, etc.
    type: chat_template
    field_messages: input            # column holding the list of {role, content}
    # If the fields are already "role" and "content", the mapping below is not needed.
    message_property_mappings:
      role: role
      content: content
    # The "output" column is the response; Axolotl converts input+output into an internal conversation.
    field_output: output

# Do not train on user/system tokens
train_on_inputs: false

# ===== Context length =====
sequence_len: 8192
eval_sequence_len: 8192
pad_to_sequence_len: true
sample_packing: true
sample_packing_group_size: 100000
sample_packing_bin_size: 200
group_by_length: true

# ===== Batch / epochs: hyperparameters from the paper =====
micro_batch_size: 1               # per-device batch size
gradient_accumulation_steps: 8    # with 4 GPUs: effective batch = 1 x 8 x 4 = 32
num_epochs: 2

# (optional) to make it explicit that you have 4 GPUs for data parallelism
# dp_shard_size: 4

# ===== Optimizer / LR =====
learning_rate: 2.0e-5
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-8

lr_scheduler: cosine
warmup_steps: 100
weight_decay: 0.0   # the paper does not specify it, so left at 0.0 (the default)

# ===== Precision / memory =====
bf16: true          # or "auto" if you prefer
tf32: true
gradient_checkpointing: true
activation_offloading: false

# ===== Eval / logging / checkpoints =====
val_set_size: 0.01          # 1% of the dataset for validation (adjust as needed)
eval_strategy: steps
eval_steps: 100

save_strategy: steps
save_steps: 100
save_total_limit: 3
save_only_model: false
save_safetensors: true
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false

logging_steps: 10

# ===== Output / reproducibility / tracking =====
output_dir: ./outputs/llama31_3b_nemotron_full_sft
seed: 42

use_wandb: true
wandb_project: "llama32_nemotron_sft"
wandb_name: "llama32-3b-base-full-sft-chatml"
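
The comment on gradient_accumulation_steps in the config above assumes 4 GPUs, while the hyperparameter list later in this card reports total_train_batch_size: 8. A small sketch of the arithmetic, assuming the usual rule that the effective batch is the per-device batch times the accumulation steps times the number of devices. The function names are hypothetical, and the card does not state how many devices were actually used.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_devices):
    """Sequences consumed per optimizer step across all devices."""
    return micro_batch_size * gradient_accumulation_steps * num_devices


def max_tokens_per_step(effective_batch, sequence_len):
    """Upper bound on tokens per optimizer step when every sequence is full length."""
    return effective_batch * sequence_len


# Values from the config: micro_batch_size=1, gradient_accumulation_steps=8
print(effective_batch_size(1, 8, 4))  # 32, the figure in the config comment
print(effective_batch_size(1, 8, 1))  # 8, the total_train_batch_size reported in the card
print(max_tokens_per_step(8, 8192))   # 65536, with sequence_len=8192
```

Under this rule the reported value of 8 is what a single device would produce, so the comment about 4 GPUs may describe the intended setup and not the run that generated this card.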

outputs/llama31_3b_nemotron_full_sft

This model is a fine-tuned version of meta-llama/Llama-3.2-3B on the nvidia/Llama-Nemotron-Post-Training-Dataset dataset. It achieves the following results on the evaluation set:

Loss: 1.2638
Memory/max Active (gib): 30.79
Memory/max Allocated (gib): 30.79
Memory/device Reserved (gib): 45.32

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 8
optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
training_steps: 570
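
To see what these scheduler settings imply, the sketch below evaluates the learning rate at a few steps. It assumes the common form of a cosine schedule with linear warmup (linear ramp from 0 to the peak rate, then a half-cosine decay to 0), written out in plain Python. The function lr_at_step is hypothetical, and the schedule used by the actual run may differ in details.

```python
import math


def lr_at_step(step, base_lr=2e-5, warmup_steps=100, total_steps=570):
    """Learning rate under linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 50, 100, 335, 570):
    print(step, lr_at_step(step))
```

With these values the rate peaks at 2e-05 at step 100, is back to half of that at step 335 (the midpoint of the decay phase), and reaches zero at step 570.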

Training results

Training Loss	Epoch	Step	Validation Loss	Max Active (GiB)	Max Allocated (GiB)	Device Reserved (GiB)
No log	0	0	3.5478	17.34	17.34	17.62
1.6148	0.3498	100	1.5968	30.79	30.79	44.95
1.3739	0.6996	200	1.3959	30.79	30.79	45.32
1.1706	1.0490	300	1.3182	30.79	30.79	45.32
1.1392	1.3988	400	1.2717	30.79	30.79	45.32
1.1131	1.7486	500	1.2638	30.79	30.79	45.32
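
Validation loss is easier to interpret as perplexity. The sketch below converts the values in the table above, assuming the reported loss is the mean per-token cross-entropy in nats, which the card does not state explicitly. The helper perplexity is hypothetical.

```python
import math

# (step, validation loss) pairs copied from the training results table
validation_loss = [
    (0, 3.5478),
    (100, 1.5968),
    (200, 1.3959),
    (300, 1.3182),
    (400, 1.2717),
    (500, 1.2638),
]


def perplexity(loss):
    """Perplexity is exp(loss) when loss is mean cross-entropy in nats."""
    return math.exp(loss)


for step, loss in validation_loss:
    print(f"step {step:>3}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

Under that assumption the final validation loss of 1.2638 corresponds to a perplexity of about 3.54.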

Framework versions

Transformers 4.57.1
Pytorch 2.9.0+cu130
Datasets 4.3.0
Tokenizers 0.22.1

Downloads last month: 1

Safetensors

Model size: 3B params
Tensor type: F32, BF16

Model tree for cemig-temp/llama3.2-3b-base-data-nemotron

Base model: meta-llama/Llama-3.2-3B
Finetuned (453): this model
