---
language:
- pt
license: apache-2.0
tags:
- portuguese
- brazilian
- llama
- causal-lm
- text-generation
pipeline_tag: text-generation
---

# NBR-1B Base
## Model Description

NBR-1B is a **1-billion-parameter** language model trained from scratch specifically for **Brazilian Portuguese**.
### Key Features

- **Parameters**: ~1B (968M)
- **Architecture**: LLaMA-style Transformer with GQA (Grouped Query Attention)
- **Context Length**: 4,096 tokens
- **Training Tokens**: 25.17B
- **Final Loss**: 2.15
- **Tokenizer**: Custom SentencePiece BPE (32K vocab)
### Architecture Details

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Layers | 24 |
| Attention Heads | 16 |
| KV Heads | 4 (GQA) |
| Intermediate Size | 5504 |
| Vocab Size | 32000 |
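The head arithmetic behind GQA can be seen from the table: 16 query heads share 4 KV heads, so each KV head serves a group of 16 / 4 = 4 query heads. A minimal NumPy sketch of that sharing (toy sequence length and head dimension chosen for illustration, not the model's real values):

```python
import numpy as np

# Toy dims mirroring the card's head counts: 16 query heads, 4 KV heads.
n_q_heads, n_kv_heads, head_dim, seq = 16, 4, 8, 5
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Replicate each KV head across its group of query heads
k_rep = np.repeat(k, group, axis=0)  # (16, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per query head
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep  # (16, seq, head_dim)
```

The KV projections only produce 4 heads' worth of keys and values, which is why GQA shrinks the KV cache roughly 4x here relative to full multi-head attention.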
### Training Data

Curated Portuguese corpus (~25B tokens):

- monoHPLT-PT (GigaVerbo filtered)
- FineWeb-2 PT (filtered)
- BlogSet-BR (MinHash deduplicated)
- LegalPT
- Corpus Carolina
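The MinHash deduplication mentioned for BlogSet-BR estimates Jaccard similarity between documents from small fixed-size signatures, so near-duplicates can be found without comparing full texts. A minimal sketch of the idea; the shingle size, the 64 hash functions, and the salted-MD5 hashing are illustrative choices, not the actual pipeline's settings:

```python
import hashlib

def shingles(text, n=3):
    # Character n-grams as the document's shingle set
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # One salted hash per "permutation"; keep the minimum per salt
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching minima approximates the true Jaccard similarity
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("o brasil é um país tropical"))
b = minhash_signature(shingles("o brasil é um pais tropical"))
print(estimated_jaccard(a, b))  # near-duplicates score high
```

A real dedup pass would bucket signatures with locality-sensitive hashing instead of comparing every pair.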
### Training Configuration

- **Optimizer**: AdamW
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: ~524K tokens/update
- **Precision**: BFloat16
- **Hardware**: NVIDIA H200 (143GB)
- **Training Time**: ~130 hours
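The schedule above can be sketched as follows. The card only states a 1e-4 peak with cosine decay; the warmup length, the minimum learning rate, and reading "~524K tokens/update" as 2^19 = 524,288 are all assumptions made here for illustration:

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_steps=1000, min_lr=1e-5):
    # warmup_steps and min_lr are illustrative assumptions, not from the card
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup to peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # cosine decay from peak_lr down to min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# ~25.17B training tokens at ~524,288 tokens/update → roughly 48K optimizer steps
total = 25_170_000_000 // 524_288
print(total)
```

Dividing the 25.17B training tokens by the batch size gives the total step count the decay is stretched over.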
### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("limajr/nbr-1b-base")
tokenizer = AutoTokenizer.from_pretrained("limajr/nbr-1b-base")

text = "O Brasil é um país"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## License

Apache 2.0