---
language:
- pt
license: apache-2.0
tags:
- portuguese
- brazilian
- llama
- causal-lm
- text-generation
pipeline_tag: text-generation
---

# NBR-1B Base

## Model Description

NBR-1B is a **1 billion parameter** language model trained from scratch specifically for **Brazilian Portuguese**.

### Key Features

- **Parameters**: ~1B (968M)
- **Architecture**: LLaMA-style Transformer with GQA (Grouped Query Attention)
- **Context Length**: 4,096 tokens
- **Training Tokens**: 25.17B
- **Final Loss**: 2.15
- **Tokenizer**: Custom SentencePiece BPE (32K vocab)

### Architecture Details

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |

### Training Data

Curated Brazilian Portuguese corpus (~25B tokens):

- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina

### Training Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: ~524K tokens/update
- **Precision**: BFloat16
- **Hardware**: NVIDIA H200 (143GB)
- **Training Time**: ~130 hours

### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Apache 2.0
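
The main practical benefit of the GQA configuration above (4 KV heads instead of 16) is a smaller KV cache at inference time. The sketch below estimates the cache footprint at full context from the architecture table, assuming bfloat16 storage (2 bytes per value) and head_dim = hidden size / attention heads = 128; the exact numbers in a real deployment depend on the inference framework.

```python
# KV-cache footprint sketch from the architecture table (bfloat16 assumed).
hidden, layers, heads, kv_heads, ctx = 2048, 24, 16, 4, 4096
head_dim = hidden // heads  # 128
bytes_per_val = 2           # bfloat16

def kv_cache_bytes(n_kv_heads):
    # One K and one V tensor per layer, cached for every context position.
    return 2 * n_kv_heads * head_dim * bytes_per_val * layers * ctx

mha = kv_cache_bytes(heads)     # hypothetical full multi-head baseline
gqa = kv_cache_bytes(kv_heads)  # this model's grouped-query attention
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB "
      f"({heads // kv_heads}x smaller)")
# → MHA: 768 MiB, GQA: 192 MiB (4x smaller)
```

At the full 4,096-token context this is a 4x reduction in cache memory per sequence, which directly raises the batch size that fits on a single GPU.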
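
The training figures above imply a rough optimizer-step count and sustained throughput. A back-of-the-envelope check, assuming the "~524K tokens/update" batch size means 2**19 = 524,288 tokens and taking the reported 130-hour wall clock at face value:

```python
# Back-of-the-envelope check of the reported training budget.
total_tokens = 25.17e9       # reported training tokens
tokens_per_update = 524_288  # "~524K tokens/update" read as 2**19 (assumption)
wall_clock_hours = 130       # reported training time

steps = total_tokens / tokens_per_update
throughput = total_tokens / (wall_clock_hours * 3600)  # tokens per second

print(f"~{steps:,.0f} optimizer steps")       # ~48,008 optimizer steps
print(f"~{throughput / 1e3:.0f}K tokens/s")   # ~54K tokens/s
```

These derived figures are estimates only; the actual step count depends on the exact batch size and any warmup or checkpoint-resume overhead.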