---
language:
- pt
license: apache-2.0
tags:
- portuguese
- brazilian
- llama
- causal-lm
- text-generation
pipeline_tag: text-generation
---

# NBR-1B Base

## Model Description

NBR-1B is a **1 billion parameter** language model trained from scratch specifically for **Brazilian Portuguese**.

### Key Features

- **Parameters**: ~1B (968M)
- **Architecture**: LLaMA-style Transformer with GQA (Grouped Query Attention)
- **Context Length**: 4,096 tokens
- **Training Tokens**: 25.17B
- **Final Loss**: 2.15
- **Tokenizer**: Custom SentencePiece BPE (32K vocab)

### Architecture Details

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |

### Training Data

Curated Brazilian Portuguese corpus (~25B tokens):

- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina

### Training Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: ~524K tokens/update
- **Precision**: BFloat16
- **Hardware**: NVIDIA H200 (143GB)
- **Training Time**: ~130 hours

### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Apache 2.0
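
The main practical benefit of the GQA configuration above (4 KV heads instead of 16) is a smaller KV cache at inference time. The sketch below estimates the cache footprint at full context from the architecture table, assuming bfloat16 storage (2 bytes per value) and head_dim = hidden size / attention heads = 128; the exact numbers in a real deployment depend on the inference framework.

```python
# KV-cache footprint sketch from the architecture table (bfloat16 assumed).
hidden, layers, heads, kv_heads, ctx = 2048, 24, 16, 4, 4096
head_dim = hidden // heads  # 128
bytes_per_val = 2           # bfloat16

def kv_cache_bytes(n_kv_heads):
    # One K and one V tensor per layer, cached for every context position.
    return 2 * n_kv_heads * head_dim * bytes_per_val * layers * ctx

mha = kv_cache_bytes(heads)     # hypothetical full multi-head baseline
gqa = kv_cache_bytes(kv_heads)  # this model's grouped-query attention
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB "
      f"({heads // kv_heads}x smaller)")
# → MHA: 768 MiB, GQA: 192 MiB (4x smaller)
```

At the full 4,096-token context this is a 4x reduction in cache memory per sequence, which directly raises the batch size that fits on a single GPU.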
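
The training figures above imply a rough optimizer-step count and sustained throughput. A back-of-the-envelope check, assuming the "~524K tokens/update" batch size means 2**19 = 524,288 tokens and taking the reported 130-hour wall clock at face value:

```python
# Back-of-the-envelope check of the reported training budget.
total_tokens = 25.17e9       # reported training tokens
tokens_per_update = 524_288  # "~524K tokens/update" read as 2**19 (assumption)
wall_clock_hours = 130       # reported training time

steps = total_tokens / tokens_per_update
throughput = total_tokens / (wall_clock_hours * 3600)  # tokens per second

print(f"~{steps:,.0f} optimizer steps")       # ~48,008 optimizer steps
print(f"~{throughput / 1e3:.0f}K tokens/s")   # ~54K tokens/s
```

These derived figures are estimates only; the actual step count depends on the exact batch size and any warmup or checkpoint-resume overhead.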