NBR-1B Base

Model Description

NBR-1B is a 1-billion-parameter language model trained from scratch for Brazilian Portuguese.

Key Features

  • Parameters: ~1B (968M)
  • Architecture: LLaMA-style Transformer with GQA (Grouped Query Attention)
  • Context Length: 4,096 tokens
  • Training Tokens: 25.17B
  • Final Loss: 2.15
  • Tokenizer: Custom SentencePiece BPE (32K vocab)

Architecture Details

Parameter            Value
Hidden Size          2048
Layers               24
Attention Heads      16
KV Heads             4 (GQA)
Intermediate Size    5504
Vocab Size           32000
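The dimensions above imply a head size of 2048 / 16 = 128 and a GQA group of 16 / 4 = 4 query heads per KV head. A minimal NumPy sketch of grouped query attention with these shapes (illustrative only, not the model's actual implementation):

```python
import numpy as np

# Dimensions from the table above
HIDDEN, N_HEADS, N_KV_HEADS = 2048, 16, 4
HEAD_DIM = HIDDEN // N_HEADS      # 128
GROUP = N_HEADS // N_KV_HEADS     # 4 query heads share each KV head

def gqa(q, k, v):
    """q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim)."""
    # Expand the 4 KV heads so each of the 16 query heads has a partner
    k = np.repeat(k, GROUP, axis=0)
    v = np.repeat(v, GROUP, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    # Numerically stable softmax over the key axis
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

seq = 8
rng = np.random.default_rng(0)
q = rng.standard_normal((N_HEADS, seq, HEAD_DIM))
k = rng.standard_normal((N_KV_HEADS, seq, HEAD_DIM))
v = rng.standard_normal((N_KV_HEADS, seq, HEAD_DIM))
out = gqa(q, k, v)
print(out.shape)  # (16, 8, 128)
```

The point of GQA is the KV cache: only 4 KV heads are stored per layer instead of 16, cutting cache memory by 4x at inference time.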

Training Data

Curated Portuguese corpus (~25B tokens):

  • monoHPLT-PT (GigaVerbo filtered)
  • FineWeb-2 PT (filtered)
  • BlogSet-BR (MinHash deduplicated)
  • LegalPT
  • Corpus Carolina
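BlogSet-BR was deduplicated with MinHash. The following is a generic MinHash sketch of how near-duplicate documents can be detected via shingle signatures; it is not the pipeline actually used here, and the shingle size and signature length are arbitrary choices:

```python
import hashlib

def shingles(text, n=3):
    # Set of overlapping n-word windows
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(sh, num_hashes=64):
    # One min-hash per seeded hash function
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "o brasil é um país de dimensões continentais na américa do sul"
doc_b = "o brasil é um país de dimensões continentais na américa latina"
doc_c = "receita de bolo de cenoura com cobertura de chocolate caseira"

sig_a, sig_b, sig_c = (minhash(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(est_jaccard(sig_a, sig_b), est_jaccard(sig_a, sig_c))
```

Documents whose estimated similarity exceeds a threshold are treated as duplicates and collapsed to one copy.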

Training Configuration

  • Optimizer: AdamW
  • Learning Rate: 1e-4 with cosine decay
  • Batch Size: ~524K tokens/update
  • Precision: BFloat16
  • Hardware: NVIDIA H200 (143GB)
  • Training Time: ~130 hours
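At ~524K tokens per update, the 25.17B training tokens work out to roughly 48K optimizer steps. A sketch of the cosine-decay schedule under those numbers (the warmup length is an assumption, not stated in the card):

```python
import math

PEAK_LR = 1e-4
TOTAL_STEPS = 48_000   # ≈ 25.17e9 tokens / 524,288 tokens per update
WARMUP = 1_000         # assumed warmup length, illustrative only

def lr_at(step):
    # Linear warmup to the peak, then cosine decay to zero
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

for s in (0, WARMUP, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, lr_at(s))
```

The schedule peaks at 1e-4 right after warmup and reaches zero at the final step.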

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

Apache 2.0
