---
language:
- pt
license: apache-2.0
tags:
- portuguese
- brazilian
- llama
- causal-lm
- text-generation
pipeline_tag: text-generation
---
# NBR-1B Base
## Model Description
NBR-1B is a **1 billion parameter** language model trained from scratch specifically for **Brazilian Portuguese**.
### Key Features
- **Parameters**: ~1B (968M)
- **Architecture**: LLaMA-style Transformer with GQA (Grouped Query Attention)
- **Context Length**: 4,096 tokens
- **Training Tokens**: 25.17B
- **Final Loss**: 2.15
- **Tokenizer**: Custom SentencePiece BPE (32K vocab)
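A final cross-entropy loss of 2.15 (nats/token) corresponds to a perplexity of roughly e^2.15 ≈ 8.6; a quick check:

```python
import math

final_loss = 2.15  # mean cross-entropy in nats/token, as reported above
perplexity = math.exp(final_loss)
print(f"perplexity ≈ {perplexity:.2f}")  # ≈ 8.58
```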
### Architecture Details
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |
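With 16 query heads but only 4 KV heads, each KV head is shared by a group of 4 query heads, shrinking the KV cache 4×. A minimal NumPy sketch of the GQA shape bookkeeping implied by the table (head dim = 2048 / 16 = 128; not the model's actual code):

```python
import numpy as np

hidden, n_heads, n_kv_heads = 2048, 16, 4
head_dim = hidden // n_heads      # 128
group = n_heads // n_kv_heads     # 4 query heads share each KV head

seq = 8
q = np.random.randn(seq, n_heads, head_dim)     # (8, 16, 128)
k = np.random.randn(seq, n_kv_heads, head_dim)  # (8, 4, 128)  -- 4x smaller KV cache

# Expand KV heads so every group of 4 query heads sees the same keys
k_expanded = np.repeat(k, group, axis=1)        # (8, 16, 128)
scores = np.einsum("qhd,khd->hqk", q, k_expanded) / np.sqrt(head_dim)
print(scores.shape)  # (16, 8, 8): per-head attention logits
```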
### Training Data
Curated Portuguese corpus (~25B tokens):
- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina
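BlogSet-BR was deduplicated with MinHash, which estimates Jaccard similarity between documents from small signatures instead of full shingle sets. A toy sketch of the idea (hash scheme, shingle size, and signature length are illustrative, not the card's actual pipeline):

```python
import hashlib

def shingles(text, n=5):
    """Set of n-token shingles for a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash(text, num_hashes=64):
    """Signature: for each seed, the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def jaccard_estimate(a, b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

doc1 = "o brasil é um país de dimensões continentais com grande diversidade"
doc2 = "o brasil é um país de dimensões continentais com enorme diversidade"
print(jaccard_estimate(doc1, doc2))  # high value flags near-duplicates
```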
### Training Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: ~524K tokens/update
- **Precision**: BFloat16
- **Hardware**: NVIDIA H200 (143GB)
- **Training Time**: ~130 hours
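At ~524K tokens per update, 25.17B training tokens works out to roughly 48,000 optimizer steps. A sketch of a warmup-plus-cosine-decay schedule peaking at 1e-4 (the warmup length is an assumption, not stated in this card):

```python
import math

peak_lr = 1e-4
total_steps = round(25.17e9 / 524_288)   # ~48,000 optimizer updates
warmup_steps = 1_000                     # assumed; not reported above

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(total_steps)         # 48008
print(lr_at(warmup_steps)) # peak: 1e-4
print(lr_at(total_steps))  # decayed to ~0 at the end of training
```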
### Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

# Greedy generation from a Portuguese prompt
text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## License
This model is released under the Apache 2.0 license.