revana committed on
Commit 8fe8e0a · verified · 1 Parent(s): 51a7c42

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,139 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - sv
+ - en
+ - code
+ license: apache-2.0
+ tags:
+ - causal-lm
+ - llama
+ - pretrained
+ - swedish
+ - gqa
+ - sungpt
+ pipeline_tag: text-generation
+ ---
+
+ # sungpt-swe-410m
+
+ A 410M-parameter causal language model trained from scratch on Swedish text, English web text, math, and code.
+ Built with the [sungpt](https://github.com/your-org/sungpt) training framework — a Llama-style architecture
+ (RoPE + RMSNorm + SwiGLU + GQA) with weights exported directly to `LlamaForCausalLM` for zero-friction HF compatibility.
+
+ > **Base model only.** This is a raw pretrained model — it continues text; it does not follow instructions.
+ > For chat/instruction use, fine-tune with SFT on an instruction dataset.
+
+ ---
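The GQA part of that stack is easy to picture: the 16 query heads share only 8 KV heads, so each KV head serves a group of 2 query heads. A minimal numpy sketch of the shapes involved (illustrative only, not the training code; dimensions taken from the model card below):

```python
import numpy as np

# Head layout from the model card: 16 query heads, 8 KV heads, head_dim = 1024 / 16 = 64.
n_heads, n_kv_heads, head_dim, seq = 16, 8, 64, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# Each KV head is repeated to serve its group of query heads.
group = n_heads // n_kv_heads                # 2 query heads per KV head
k_rep = np.repeat(k, group, axis=0)          # (16, 5, 64)
v_rep = np.repeat(v, group, axis=0)          # (16, 5, 64)

# Standard scaled dot-product attention over the expanded KV heads.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep                        # (16, 5, 64)
```

The payoff is at inference: the KV cache stores 8 heads instead of 16, halving cache memory per token.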
+
+ ## Model details
+
+ | Hyperparameter | Value |
+ |----------------------|--------------------------------------------|
+ | Architecture | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
+ | Hidden size | 1024 |
+ | Layers | 24 |
+ | Attention heads | 16 |
+ | KV heads (GQA) | 8 |
+ | FFN intermediate | 4096 (SwiGLU) |
+ | Max sequence length | 4096 |
+ | Vocab size | 32,000 |
+ | Parameters | ~410M |
+ | Precision | bfloat16 |
+ | Tied embeddings | Yes |
+
+ ---
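As a rough sanity check (my arithmetic, not from the repo), the dimensions in the table with tied embeddings work out to about 410M parameters, matching the model name:

```python
# Dimensions from the model-card table above.
vocab, hidden, layers, inter = 32_000, 1_024, 24, 4_096
n_kv, head_dim = 8, 64

embed = vocab * hidden                                  # tied with lm_head, counted once
attn = 2 * hidden * hidden + 2 * hidden * (n_kv * head_dim)  # q/o full-size, k/v GQA-sized
mlp = 3 * hidden * inter                                # gate + up + down (SwiGLU)
norms = 2 * hidden                                      # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # + final norm

print(f"{total / 1e6:.1f}M")  # → 410.3M
```

This counts only the weight matrices and norm scales; biases are off per the config (`attention_bias`/`mlp_bias` false), so nothing else contributes.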
+
+ ## Quick start
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "your-hf-username/sungpt-swe-410m"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ prompts = {
+     "code": "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
+     "math": "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
+     "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
+     "swedish": "Sverige är känt för sin starka välfärdsmodell och",
+ }
+
+ for domain, prompt in prompts.items():
+     print(f"\n--- {domain} ---")
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+     out = model.generate(
+         **inputs,
+         max_new_tokens=150,
+         do_sample=True,
+         temperature=0.8,
+         top_p=0.95,
+         repetition_penalty=1.1,
+     )
+     print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+
+ **CPU / low-VRAM:**
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
+ ```
+
+ Default generation settings (`generation_config.json`): `temperature=0.8`, `top_p=0.95`, `top_k=50`,
+ `repetition_penalty=1.1`, `max_new_tokens=512` — so a bare `model.generate(**inputs)` already samples.
+
+ ---
+
+ ## Training
+
+ | Property | Value |
+ |-------------|-------|
+ | Framework | [sungpt](https://github.com/your-org/sungpt) (custom, Llama-style) |
+ | Hardware | 1× H200 80 GB |
+ | Precision | bfloat16, gradient checkpointing, `torch.compile` |
+ | Optimizer | AdamW — lr 2e-4, β=(0.9, 0.95), cosine decay |
+ | Batch size | 64 sequences × 4096 tokens = ~262K tokens/step |
+ | Throughput | ~48K tokens/sec at plateau |
+
+ **Data mix (~1.2B tokens):**
+
+ | Dataset | Samples | Notes |
+ |---------|---------|-------|
+ | [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | 200,000 | English web |
+ | [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) | 400,000 | Code |
+ | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 200,000 | Educational web |
+ | [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 | Math reasoning |
+
+ Data was pre-tokenized into memmap shards before training for maximum GPU throughput.
+
+ ---
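The memmap pattern mentioned above can be sketched in a few lines (a generic version, not the repo's actual loader; the shard path and helpers are illustrative). `uint16` suffices here because the vocab is 32,000 < 65,536, halving shard size versus `int32`:

```python
import numpy as np
import os
import tempfile

def write_shard(token_ids, path):
    """Write one pre-tokenized shard to disk as raw uint16."""
    arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(len(token_ids),))
    arr[:] = token_ids
    arr.flush()

def sample_batch(path, seq_len, batch_size, rng):
    """Memory-map the shard and slice random sequences without loading it all into RAM."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)
    return np.stack([data[s : s + seq_len] for s in starts])

# Tiny demo shard: 10K fake token ids.
path = os.path.join(tempfile.mkdtemp(), "shard_000.bin")
write_shard(np.arange(10_000) % 32_000, path)
batch = sample_batch(path, seq_len=4096, batch_size=2, rng=np.random.default_rng(0))
# batch.shape == (2, 4096)
```

Because the OS pages data in on demand, the GPU input pipeline never stalls on tokenization and the full corpus never has to fit in memory.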
+
+ ## Tokenizer
+
+ Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text.
+ Special tokens: `[UNK]` (id 0), `[PAD]` (id 1), `[BOS]` (id 2), `[EOS]` (id 3).
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("your-hf-username/sungpt-swe-410m")
+ tokens = tokenizer("Hej världen!", return_tensors="pt")
+ ```
+
+ ---
+
+ ## Limitations
+
+ - **Base model** — does not follow instructions or hold a chat; fine-tune for that.
+ - **Swedish skew** — stronger at Swedish and code than at general English.
+ - **No RLHF / safety alignment** — outputs may be biased or inappropriate.
+ - **410M parameters** — capacity is limited; expect repetition on long contexts without `repetition_penalty`.
+
+ ---
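For context on that last point, `repetition_penalty` follows the standard scheme used by `transformers` text generation (a simplified sketch, not the library code): logits of tokens already present in the output are divided by the penalty when positive and multiplied by it when negative, making repeats uniformly less likely:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that have already appeared in the generated sequence."""
    logits = logits.astype(float).copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty   # shrink a positive logit toward zero
        else:
            logits[tok] *= penalty   # push a negative logit further down
    return logits

penalized = apply_repetition_penalty(np.array([2.0, -1.0, 0.5]), generated_ids=[0, 1], penalty=2.0)
# penalized → [1.0, -2.0, 0.5]; token 2 was never generated, so it is untouched
```

A penalty of 1.1 is mild; it nudges sampling away from loops without visibly distorting fluent text.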
+
+ ## License
+
+ Apache 2.0 — see [LICENSE](LICENSE).
config.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "model_type": "llama",
+   "vocab_size": 32000,
+   "hidden_size": 1024,
+   "intermediate_size": 4096,
+   "num_hidden_layers": 24,
+   "num_attention_heads": 16,
+   "num_key_value_heads": 8,
+   "max_position_embeddings": 4096,
+   "hidden_act": "silu",
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "rope_scaling": null,
+   "tie_word_embeddings": true,
+   "attention_bias": false,
+   "mlp_bias": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.40.0"
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token_id": 2,
+   "eos_token_id": 3,
+   "pad_token_id": 1,
+   "do_sample": true,
+   "temperature": 0.8,
+   "top_p": 0.95,
+   "top_k": 50,
+   "repetition_penalty": 1.1,
+   "max_new_tokens": 512,
+   "transformers_version": "4.40.0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:175c08e1f6b45545532797be5294963d2bc66ee33af54a14a2c6600adbceff00
+ size 1772319024
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "[BOS]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "eos_token": {
+     "content": "[EOS]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "model_max_length": 4096,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "bos_token": "[BOS]",
+   "eos_token": "[EOS]",
+   "pad_token": "[PAD]",
+   "unk_token": "[UNK]",
+   "clean_up_tokenization_spaces": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "1": {
+       "content": "[PAD]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "2": {
+       "content": "[BOS]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     },
+     "3": {
+       "content": "[EOS]",
+       "single_word": false,
+       "lstrip": false,
+       "rstrip": false,
+       "normalized": false,
+       "special": true
+     }
+   }
+ }