# NBR-1B Base

## Model Description
NBR-1B is a 1 billion parameter language model trained from scratch specifically for Brazilian Portuguese.
## Key Features
- Parameters: ~1B (968M)
- Architecture: LLaMA-style Transformer with GQA (Grouped Query Attention)
- Context Length: 4,096 tokens
- Training Tokens: 25.17B
- Final Loss: 2.15
- Tokenizer: Custom SentencePiece BPE (32K vocab)
## Architecture Details
| Parameter | Value |
|---|---|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |
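The ratio of 16 attention heads to 4 KV heads means each KV head is shared by a group of 4 query heads, shrinking the KV cache fourfold versus standard multi-head attention. A minimal sketch of this grouping (the standard LLaMA-style layout is assumed; the exact grouping is not stated on this card):

```python
# How 16 query heads map onto 4 KV heads under GQA (LLaMA-style grouping assumed)
num_q_heads = 16
num_kv_heads = 4
group_size = num_q_heads // num_kv_heads  # 4 query heads share one KV head

kv_head_for = [q // group_size for q in range(num_q_heads)]
print(kv_head_for)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# KV-cache size per token, per layer (K and V, counted in floats):
hidden = 2048
head_dim = hidden // num_q_heads      # 128
mha_kv = 2 * num_q_heads * head_dim   # full multi-head attention
gqa_kv = 2 * num_kv_heads * head_dim  # grouped-query attention
print(gqa_kv / mha_kv)  # 0.25 -> 4x smaller KV cache
```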
## Training Data
Curated Portuguese corpus (~25B tokens):
- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina
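MinHash deduplication (used for BlogSet-BR above) estimates the Jaccard similarity between documents from compact signatures, so near-duplicates can be dropped without pairwise set comparisons. A minimal sketch with seeded MD5 hashes standing in for the hash family (the actual shingle size, hash count, and similarity threshold used for this corpus are not stated on the card):

```python
import hashlib

def shingles(text, n=3):
    # Character n-grams as the comparison unit (n=3 is illustrative).
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One signature entry per seeded hash function: the minimum hash
    # value over all shingles in the set.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a, b):
    # Fraction of matching signature positions approximates Jaccard similarity.
    sa = minhash_signature(shingles(a))
    sb = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(estimated_jaccard("o gato subiu no telhado", "o gato subiu no telhado"))  # 1.0
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as duplicates and only one copy kept.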
## Training Configuration
- Optimizer: AdamW
- Learning Rate: 1e-4 with cosine decay
- Batch Size: ~524K tokens/update
- Precision: BFloat16
- Hardware: NVIDIA H200 (143GB)
- Training Time: ~130 hours
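The cosine-decay schedule above can be sketched as follows. The peak LR (1e-4) and batch size come from the card; the final LR, the absence of warmup, and the step count (~25.17B tokens / ~524K tokens per update ≈ 48,000 steps) are illustrative assumptions:

```python
import math

PEAK_LR = 1e-4
MIN_LR = 0.0          # final LR assumed; not stated on the card
TOTAL_STEPS = 48_000  # ~25.17B tokens / ~524K tokens per update (approximate)

def cosine_lr(step):
    # Cosine decay from PEAK_LR down to MIN_LR over TOTAL_STEPS (no warmup).
    progress = min(step / TOTAL_STEPS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))                # 1e-4 (peak)
print(cosine_lr(TOTAL_STEPS // 2)) # 5e-5 (halfway)
print(cosine_lr(TOTAL_STEPS))      # 0.0 (end of training)
```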
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

# Portuguese prompt: "Brazil is a country"
text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
## License
Apache 2.0