limajr committed
Commit 85e000e · verified · 1 Parent(s): 5d5d1ae

Upload NBR-1B: Brazilian Portuguese 1.13B model

Files changed (5)
  1. README.md +52 -0
  2. config.json +17 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +12 -0
README.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ license: apache-2.0
+ language:
+ - pt
+ library_name: transformers
+ tags:
+ - portuguese
+ - brazilian
+ - llama
+ - causal-lm
+ - text-generation
+ datasets:
+ - uonlp/CulturaX
+ - HuggingFaceFW/fineweb-2
+ - eduagarcia/cc_news_pt_v2
+ pipeline_tag: text-generation
+ ---
+
+ # NBR-1B: Brazilian Portuguese Language Model
+
+ **NBR-1B** is a 1.13 billion parameter language model trained from scratch for Brazilian Portuguese.
+
+ ## Model Details
+
+ | Attribute | Value |
+ |-----------|-------|
+ | **Parameters** | 1.13B |
+ | **Architecture** | LLaMA-style (GQA, RMSNorm, SwiGLU, RoPE) |
+ | **Hidden Size** | 2048 |
+ | **Layers** | 24 |
+ | **Attention Heads** | 16 |
+ | **KV Heads** | 4 |
+ | **Vocabulary** | 32,000 (BPE) |
+ | **Context Length** | 2048 |
+ | **Training Tokens** | 3.12B |
+ | **Final Loss** | ~2.8 |
+
+ ## Training Data
+
+ - CulturaX PT (40%)
+ - FineWeb-2 PT (52%)
+ - mC4 PT (5%)
+ - CC-News PT v2 (2%)
+ - Books PT (1%)
+
+ ## Usage
+
+ This is a base model for text completion; it has not been instruction-tuned. Load it with the Hugging Face Transformers library.
+
+ ## License
+
+ Apache 2.0
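The README's usage note can be sketched in code. A minimal sketch, assuming the model is published under the repo id `limajr/NBR-1B` (hypothetical; substitute the actual Hub path):

```python
MODEL_ID = "limajr/NBR-1B"  # assumed repo id; replace with the actual Hub path

def complete(prompt: str, max_new_tokens: int = 40) -> str:
    """Greedy text completion with the base (non-instruct) model."""
    # Imported inside the function so the sketch can be inspected
    # without downloading the ~4.5 GB checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         pad_token_id=tok.pad_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

# complete("O Brasil é um país")  # downloads the weights on first call
```

Because this is a base model, expect raw continuations rather than answers to instructions.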
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "model_type": "llama",
+ "vocab_size": 32000,
+ "hidden_size": 2048,
+ "intermediate_size": 5504,
+ "num_hidden_layers": 24,
+ "num_attention_heads": 16,
+ "num_key_value_heads": 4,
+ "max_position_embeddings": 2048,
+ "rms_norm_eps": 1e-05,
+ "rope_theta": 10000.0,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.40.0"
+ }
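As a sanity check, the README's 1.13B figure can be reproduced from these config values. A sketch, assuming tied input/output embeddings (an assumption, but one that matches the checkpoint size):

```python
# Reproduce the 1.13B parameter count from the config values above.
# Assumes the lm_head is tied to the token embeddings.
vocab, hidden, inter, layers = 32000, 2048, 5504, 24
heads, kv_heads = 16, 4
head_dim = hidden // heads                  # 128

embed = vocab * hidden                      # token embeddings (tied with lm_head)
attn = (hidden * hidden                     # q_proj
        + 2 * hidden * kv_heads * head_dim  # k_proj + v_proj (GQA: 4 KV heads)
        + hidden * hidden)                  # o_proj
mlp = 3 * hidden * inter                    # gate, up, down (SwiGLU)
norms = 2 * hidden                          # input + post-attention RMSNorm
total = embed + layers * (attn + mlp + norms) + hidden  # + final norm

print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# → 1,128,892,416 parameters (~1.13B)
```

With an untied lm_head the total would land near 1.19B, so the 1.13B headline implies weight tying.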
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c7a40eff6f949ff68369baeaf3f33adcaec6621c6352ecee5e426a7db74d6b61
+ size 4515653991
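The pointer's size field lines up with ~1.13B parameters stored as fp32 (4 bytes each), even though config.json declares `torch_dtype: bfloat16` — which suggests the checkpoint was saved in full precision. A quick check, treating the small remainder as serialization overhead:

```python
# Sanity-check the checkpoint size against the parameter count.
params = 1_128_892_416   # 1.13B, derived from config.json (tied embeddings)
payload = params * 4     # fp32 tensor data, 4 bytes per parameter
reported = 4_515_653_991 # size field of the pytorch_model.bin LFS pointer

print(payload, reported - payload)
# → 4515569664 84327  (remainder is pickle/serialization metadata)
```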
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "model_type": "llama",
+ "vocab_size": 32000,
+ "bos_token": "<s>",
+ "eos_token": "</s>",
+ "pad_token": "<pad>",
+ "unk_token": "<unk>",
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "pad_token_id": 3,
+ "unk_token_id": 0
+ }
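Unlike many LLaMA-style tokenizers, this one ships with an explicit pad token (`pad_token_id: 3`), which matters for batched generation. A small offline check of the id layout above:

```python
import json

# tokenizer_config.json as shipped in this commit
raw = """
{
  "model_type": "llama",
  "vocab_size": 32000,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "unk_token": "<unk>",
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "unk_token_id": 0
}
"""
cfg = json.loads(raw)

# Map each special token string to its id.
special = {cfg[f"{name}_token"]: cfg[f"{name}_token_id"]
           for name in ("unk", "bos", "eos", "pad")}
print(special)
# → {'<unk>': 0, '<s>': 1, '</s>': 2, '<pad>': 3}
```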