revana committed on
Commit 8fe8e0a · verified · 1 Parent(s): 51a7c42

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,139 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - sv
+ - en
+ - code
+ license: apache-2.0
+ tags:
+ - causal-lm
+ - llama
+ - pretrained
+ - swedish
+ - gqa
+ - sungpt
+ pipeline_tag: text-generation
+ ---
+
+ # sungpt-swe-410m
+
+ A 410M-parameter causal language model trained from scratch on Swedish text, English web text, math, and code.
+ Built with the [sungpt](https://github.com/your-org/sungpt) training framework — a Llama-style architecture
+ (RoPE + RMSNorm + SwiGLU + GQA) with weights exported directly to `LlamaForCausalLM` for zero-friction HF compatibility.
+
+ > **Base model only.** This is a raw pretrained model — it continues text; it does not follow instructions.
+ > For chat/instruction use, fine-tune with SFT on an instruction dataset.
+
+ ---
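The GQA part of that stack is easy to picture: the 16 query heads share only 8 KV heads, so each KV head serves a group of 2 query heads. A minimal numpy sketch of the shapes involved (illustrative only, not the training code; dimensions taken from the model card below):

```python
import numpy as np

# Head layout from the model card: 16 query heads, 8 KV heads, head_dim = 1024 / 16 = 64.
n_heads, n_kv_heads, head_dim, seq = 16, 8, 64, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Each KV head is repeated to serve its group of query heads.
group = n_heads // n_kv_heads                # 2 query heads per KV head
k_rep = np.repeat(k, group, axis=0)          # (16, 5, 64)
v_rep = np.repeat(v, group, axis=0)          # (16, 5, 64)

# Standard scaled dot-product attention over the expanded KV heads.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep                        # (16, 5, 64)
```

The payoff is at inference: the KV cache stores 8 heads instead of 16, halving cache memory per token.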
+
+ ## Model details
+
+ | Hyperparameter | Value |
+ |----------------------|--------------------------------------------|
+ | Architecture | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
+ | Hidden size | 1024 |
+ | Layers | 24 |
+ | Attention heads | 16 |
+ | KV heads (GQA) | 8 |
+ | FFN intermediate | 4096 (SwiGLU) |
+ | Max sequence length | 4096 |
+ | Vocab size | 32,000 |
+ | Parameters | ~410M |
+ | Precision | bfloat16 |
+ | Tied embeddings | Yes |
+
+ ---
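As a rough sanity check (my arithmetic, not from the repo), the dimensions in the table with tied embeddings work out to about 410M parameters, matching the model name:

```python
# Dimensions from the model-card table above.
vocab, hidden, layers, inter = 32_000, 1_024, 24, 4_096
n_kv, head_dim = 8, 64

embed = vocab * hidden                                  # tied with lm_head, counted once
attn = 2 * hidden * hidden + 2 * hidden * (n_kv * head_dim)  # q/o full-size, k/v GQA-sized
mlp = 3 * hidden * inter                                # gate + up + down (SwiGLU)
norms = 2 * hidden                                      # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # + final norm

print(f"{total / 1e6:.1f}M")  # → 410.3M
```

This counts only the weight matrices and norm scales; biases are off per the config (`attention_bias`/`mlp_bias` false), so nothing else contributes.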
+
+ ## Quick start
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "your-hf-username/sungpt-swe-410m"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ prompts = {
+     "code": "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
+     "math": "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
+     "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
+     "swedish": "Sverige är känt för sin starka välfärdsmodell och",
+ }
+
+ for domain, prompt in prompts.items():
+     print(f"\n--- {domain} ---")
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+     out = model.generate(
+         **inputs,
+         max_new_tokens=150,
+         do_sample=True,
+         temperature=0.8,
+         top_p=0.95,
+         repetition_penalty=1.1,
+     )
+     print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+
+ **CPU / low-VRAM:**
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
+ ```
+
+ Default generation settings (`generation_config.json`): `temperature=0.8`, `top_p=0.95`, `top_k=50`,
+ `repetition_penalty=1.1`, `max_new_tokens=512` — so a bare `model.generate(**inputs)` already samples.
+
+ ---
+
+ ## Training
+
+ | Property | Value |
+ |-------------|-------|
+ | Framework | [sungpt](https://github.com/your-org/sungpt) (custom, Llama-style) |
+ | Hardware | 1× H200 80 GB |
+ | Precision | bfloat16, gradient checkpointing, `torch.compile` |
+ | Optimizer | AdamW — lr 2e-4, β=(0.9, 0.95), cosine decay |
+ | Batch size | 64 sequences × 4096 tokens = ~262K tokens/step |
+ | Throughput | ~48K tokens/sec at plateau |
+
+ **Data mix (~1.2B tokens):**
+
+ | Dataset | Samples | Notes |
+ |---------|---------|-------|
+ | [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | 200,000 | English web |
+ | [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) | 400,000 | Code |
+ | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 200,000 | Educational web |
+ | [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 | Math reasoning |
+
+ Data was pre-tokenized into memmap shards before training for maximum GPU throughput.
+
+ ---
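The memmap pattern mentioned above can be sketched in a few lines (a generic version, not the repo's actual loader; the shard path and helpers are illustrative). `uint16` suffices here because the vocab is 32,000 < 65,536, halving shard size versus `int32`:

```python
import numpy as np
import os
import tempfile

def write_shard(token_ids, path):
    """Write one pre-tokenized shard to disk as raw uint16."""
    arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(len(token_ids),))
    arr[:] = token_ids
    arr.flush()

def sample_batch(path, seq_len, batch_size, rng):
    """Memory-map the shard and slice random sequences without loading it all into RAM."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)
    return np.stack([data[s : s + seq_len] for s in starts])

# Tiny demo shard: 10K fake token ids.
path = os.path.join(tempfile.mkdtemp(), "shard_000.bin")
write_shard(np.arange(10_000) % 32_000, path)
batch = sample_batch(path, seq_len=4096, batch_size=2, rng=np.random.default_rng(0))
# batch.shape == (2, 4096)
```

Because the OS pages data in on demand, the GPU input pipeline never stalls on tokenization and the full corpus never has to fit in memory.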
+
+ ## Tokenizer
+
+ Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text.
+ Special tokens: `[UNK]` (id 0), `[PAD]` (id 1), `[BOS]` (id 2), `[EOS]` (id 3).
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("your-hf-username/sungpt-swe-410m")
+ tokens = tokenizer("Hej världen!", return_tensors="pt")
+ ```
+
+ ---
+
+ ## Limitations
+
+ - **Base model** — does not follow instructions or hold a chat; fine-tune for that.
+ - **Swedish skew** — stronger at Swedish and code than at general English.
+ - **No RLHF / safety alignment** — outputs may be biased or inappropriate.
+ - **410M parameters** — capacity is limited; expect repetition on long contexts without `repetition_penalty`.
+
+ ---
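For context on that last point, `repetition_penalty` follows the standard scheme used by `transformers` text generation (a simplified sketch, not the library code): logits of tokens already present in the output are divided by the penalty when positive and multiplied by it when negative, making repeats uniformly less likely:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that have already appeared in the generated sequence."""
    logits = logits.astype(float).copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty   # shrink a positive logit toward zero
        else:
            logits[tok] *= penalty   # push a negative logit further down
    return logits

penalized = apply_repetition_penalty(np.array([2.0, -1.0, 0.5]), generated_ids=[0, 1], penalty=2.0)
# penalized → [1.0, -2.0, 0.5]; token 2 was never generated, so it is untouched
```

A penalty of 1.1 is mild; it nudges sampling away from loops without visibly distorting fluent text.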
+
+ ## License
+
+ Apache 2.0 — see [LICENSE](LICENSE).
config.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "model_type": "llama",
+   "vocab_size": 32000,
+   "hidden_size": 1024,
+   "intermediate_size": 4096,
+   "num_hidden_layers": 24,
+   "num_attention_heads": 16,
+   "num_key_value_heads": 8,
+   "max_position_embeddings": 4096,
+   "hidden_act": "silu",
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "rope_scaling": null,
+   "tie_word_embeddings": true,
+   "attention_bias": false,
+   "mlp_bias": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.40.0"
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token_id": 2,
+   "eos_token_id": 3,
+   "pad_token_id": 1,
+   "do_sample": true,
+   "temperature": 0.8,
+   "top_p": 0.95,
+   "top_k": 50,
+   "repetition_penalty": 1.1,
+   "max_new_tokens": 512,
+   "transformers_version": "4.40.0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:175c08e1f6b45545532797be5294963d2bc66ee33af54a14a2c6600adbceff00
+ size 1772319024
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "[BOS]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "eos_token": {
+     "content": "[EOS]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "model_max_length": 4096,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "bos_token": "[BOS]",
+   "eos_token": "[EOS]",
+   "pad_token": "[PAD]",
+   "unk_token": "[UNK]",
+   "clean_up_tokenization_spaces": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "1": {
+       "content": "[PAD]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "2": {
+       "content": "[BOS]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "3": {
+       "content": "[EOS]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     }
+   }
+ }