# NBR-1B Base

## Model Description
NBR-1B is a 1 billion parameter language model trained from scratch specifically for Brazilian Portuguese.
## Key Features
- Parameters: ~1B (968M)
- Architecture: LLaMA-style Transformer with GQA (Grouped Query Attention)
- Context Length: 4,096 tokens
- Training Tokens: 25.17B
- Final Loss: 2.15
- Tokenizer: Custom SentencePiece BPE (32K vocab)
## Architecture Details
| Parameter | Value |
|---|---|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |
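The ratio of 16 attention heads to 4 KV heads means each KV head is shared by a group of 4 query heads, shrinking the KV cache fourfold versus standard multi-head attention. A minimal sketch of this grouping (the standard LLaMA-style layout is assumed; the exact grouping is not stated on this card):

```python
# How 16 query heads map onto 4 KV heads under GQA (LLaMA-style grouping assumed)
num_q_heads = 16
num_kv_heads = 4
group_size = num_q_heads // num_kv_heads  # 4 query heads share one KV head

kv_head_for = [q // group_size for q in range(num_q_heads)]
print(kv_head_for)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# KV-cache size per token, per layer (K and V, counted in floats):
hidden = 2048
head_dim = hidden // num_q_heads      # 128
mha_kv = 2 * num_q_heads * head_dim   # full multi-head attention
gqa_kv = 2 * num_kv_heads * head_dim  # grouped-query attention
print(gqa_kv / mha_kv)  # 0.25 -> 4x smaller KV cache
```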
## Training Data
Curated Portuguese corpus (~25B tokens):
- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina
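MinHash deduplication (used for BlogSet-BR above) estimates the Jaccard similarity between documents from compact signatures, so near-duplicates can be dropped without pairwise set comparisons. A minimal sketch with seeded MD5 hashes standing in for the hash family (the actual shingle size, hash count, and similarity threshold used for this corpus are not stated on the card):

```python
import hashlib

def shingles(text, n=3):
    # Character n-grams as the comparison unit (n=3 is illustrative).
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One signature entry per seeded hash function: the minimum hash
    # value over all shingles in the set.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a, b):
    # Fraction of matching signature positions approximates Jaccard similarity.
    sa = minhash_signature(shingles(a))
    sb = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(estimated_jaccard("o gato subiu no telhado", "o gato subiu no telhado"))  # 1.0
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as duplicates and only one copy kept.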
## Training Configuration
- Optimizer: AdamW
- Learning Rate: 1e-4 with cosine decay
- Batch Size: ~524K tokens/update
- Precision: BFloat16
- Hardware: NVIDIA H200 (143GB)
- Training Time: ~130 hours
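The cosine-decay schedule above can be sketched as follows. The peak LR (1e-4) and batch size come from the card; the final LR, the absence of warmup, and the step count (~25.17B tokens / ~524K tokens per update ≈ 48,000 steps) are illustrative assumptions:

```python
import math

PEAK_LR = 1e-4
MIN_LR = 0.0          # final LR assumed; not stated on the card
TOTAL_STEPS = 48_000  # ~25.17B tokens / ~524K tokens per update (approximate)

def cosine_lr(step):
    # Cosine decay from PEAK_LR down to MIN_LR over TOTAL_STEPS (no warmup).
    progress = min(step / TOTAL_STEPS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))                # 1e-4 (peak)
print(cosine_lr(TOTAL_STEPS // 2)) # 5e-5 (halfway)
print(cosine_lr(TOTAL_STEPS))      # 0.0 (end of training)
```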
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

# Portuguese prompt: "Brazil is a country"
text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
## License
Apache 2.0