---
language:
- pt
license: apache-2.0
tags:
- portuguese
- brazilian
- llama
- causal-lm
- text-generation
pipeline_tag: text-generation
---

# NBR-1B Base
## Model Description

NBR-1B is a **1-billion-parameter** language model trained from scratch specifically for **Brazilian Portuguese**.
### Key Features

- **Parameters**: ~1B (968M)
- **Architecture**: LLaMA-style Transformer with GQA (Grouped Query Attention)
- **Context Length**: 4,096 tokens
- **Training Tokens**: 25.17B
- **Final Loss**: 2.15
- **Tokenizer**: Custom SentencePiece BPE (32K vocab)
### Architecture Details

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |
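The head arithmetic behind GQA can be seen from the table: 16 query heads share 4 KV heads, so each KV head serves a group of 16 / 4 = 4 query heads. A minimal NumPy sketch of that sharing (toy sequence length and head dimension chosen for illustration, not the model's real values):

```python
import numpy as np

# Toy dims mirroring the card's head counts: 16 query heads, 4 KV heads.
n_q_heads, n_kv_heads, head_dim, seq = 16, 4, 8, 5
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Replicate each KV head across its group of query heads
k_rep = np.repeat(k, group, axis=0)  # (16, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per query head
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep  # (16, seq, head_dim)
```

The KV projections only produce 4 heads' worth of keys and values, which is why GQA shrinks the KV cache roughly 4x here relative to full multi-head attention.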
### Training Data

Curated Portuguese corpus (~25B tokens):

- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina
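The MinHash deduplication mentioned for BlogSet-BR estimates Jaccard similarity between documents from small fixed-size signatures, so near-duplicates can be found without comparing full texts. A minimal sketch of the idea; the shingle size, the 64 hash functions, and the salted-MD5 hashing are illustrative choices, not the actual pipeline's settings:

```python
import hashlib

def shingles(text, n=3):
    # Character n-grams as the document's shingle set
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One salted hash per "permutation"; keep the minimum per salt
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching minima approximates the true Jaccard similarity
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("o brasil é um país tropical"))
b = minhash_signature(shingles("o brasil é um pais tropical"))
print(estimated_jaccard(a, b))  # near-duplicates score high
```

A real dedup pass would bucket signatures with locality-sensitive hashing instead of comparing every pair.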
### Training Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: ~524K tokens/update
- **Precision**: BFloat16
- **Hardware**: NVIDIA H200 (143GB)
- **Training Time**: ~130 hours
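The schedule above can be sketched as follows. The card only states a 1e-4 peak with cosine decay; the warmup length, the minimum learning rate, and reading "~524K tokens/update" as 2^19 = 524,288 are all assumptions made here for illustration:

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_steps=1000, min_lr=1e-5):
    # warmup_steps and min_lr are illustrative assumptions, not from the card
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup to peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # cosine decay from peak_lr down to min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# ~25.17B training tokens at ~524,288 tokens/update → roughly 48K optimizer steps
total = 25_170_000_000 // 524_288
print(total)
```

Dividing the 25.17B training tokens by the batch size gives the total step count the decay is stretched over.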
### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## License

Apache 2.0