Rohanify committed on
Commit 1b2c914 · verified · 1 Parent(s): 7b9b38e

Update README.md

Files changed (1)
  1. README.md +110 -0
README.md CHANGED
@@ -1,3 +1,113 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - code
9
+ - python
10
+ - gguf
11
+ - small-model
12
+ - pretrained-from-scratch
13
+ - gpt2
14
+ - from-scratch
15
+ - coding
16
+ - SLM
17
  ---
18
+
19
+ # PyBlissa-Coder-50M
20
+
21
+ A 50M-parameter Python code generation model trained from scratch on a single RTX 5080. Built as part of the **PRIME** lineup of small, locally-runnable AI systems.
22
+
23
+ Despite its size, PyBlissa punches well above its weight on Python instruction-following tasks. It was trained at roughly 13 tokens per parameter, approaching the Chinchilla-optimal ratio (commonly cited as about 20 tokens per parameter), to make full use of its capacity.
24
+
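The ~13 tokens/parameter figure can be checked with a little arithmetic. The sketch below uses numbers stated elsewhere in this README (50.2M parameters, 166M training tokens, 4 epochs) and assumes the ratio counts tokens seen across all epochs rather than unique tokens; that reading is an interpretation, not something the README states.

```python
# Figures taken from the Stats and Training data sections of this README.
params = 50.2e6        # model parameters
train_tokens = 166e6   # unique training tokens after BPE tokenization
epochs = 4             # passes over the corpus

# ASSUMPTION: the ~13 tokens/parameter figure counts tokens seen over all epochs.
unique_ratio = train_tokens / params
seen_ratio = train_tokens * epochs / params

print(f"unique tokens per parameter: {unique_ratio:.1f}")
print(f"tokens seen per parameter:   {seen_ratio:.1f}")
```

Under that assumption the numbers line up: about 3.3 unique tokens per parameter, and about 13.2 once the four epochs are counted.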
25
+ [loss_visualization.png]
26
+
27
+ ## Stats
28
+
29
+ | Property | Value |
30
+ |---|---|
31
+ | Parameters | 50.2M |
32
+ | Architecture | Decoder-only transformer (GPT-2 style) |
33
+ | Context length | 1024 tokens |
34
+ | Vocab size | 16,000 (custom ByteLevel BPE) |
35
+ | Train tokens | 166M |
36
+ | Final val loss | 0.474 |
37
+ | Training time | 73 minutes (RTX 5080) |
38
+
39
+ ## Architecture
40
+
41
+ ```
42
+ d_model: 640
43
+ n_layer: 8
44
+ n_head: 8
45
+ d_ff: 2560
46
+ block_size: 1024
47
+ tied embeddings, pre-LN, no bias, GELU MLP, SDPA attention
48
+ ```
49
+
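As a sanity check, the hyperparameters above can be turned into a parameter count. This is a rough sketch, not the author's training code: it assumes learned absolute position embeddings (as in GPT-2) and a single final LayerNorm before the output head, neither of which is spelled out in the config.

```python
# Hyperparameters copied from the architecture block above.
d_model, n_layer, d_ff = 640, 8, 2560
vocab_size, block_size = 16_000, 1024

# Token embedding; shared with the output head because embeddings are tied.
tok_emb = vocab_size * d_model
# ASSUMPTION: learned absolute position embeddings, as in GPT-2.
pos_emb = block_size * d_model

# Per-block weights (no bias terms anywhere).
attn = 4 * d_model * d_model   # Q, K, V and output projections
mlp = 2 * d_model * d_ff       # up- and down-projection
norms = 2 * d_model            # two pre-LN scale vectors
per_block = attn + mlp + norms

# ASSUMPTION: one final LayerNorm before the output head.
final_norm = d_model

total = tok_emb + pos_emb + n_layer * per_block + final_norm
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total is 50,227,840, which agrees with the 50.2M listed under Stats.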
50
+ ## Training data
51
+
52
+ Two-source code-instruction corpus, 425k samples → 166M tokens after BPE tokenization:
53
+
54
+ - **`nvidia/OpenCodeInstruct`**: 400k high-quality instruction-code pairs
55
+ - **`flytech/python-codes-25k`**: 25k Python-focused instruction-code pairs
56
+
57
+ Trained for 4 epochs with a cosine LR schedule (3e-4 → 3e-5), bf16 autocast, batch size 20.
58
+
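For readers unfamiliar with cosine decay, here is a minimal sketch of a schedule that falls from 3e-4 to 3e-5. The function name, the `total_steps` argument, and the absence of a warmup phase are assumptions made for illustration; the README does not say how the actual schedule was implemented.

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps.

    ASSUMPTION: no warmup; the real training run may have used one.
    """
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 1000-step run, sampled at the start, middle, and end.
for s in (0, 500, 1000):
    print(s, f"{cosine_lr(s, 1000):.2e}")
```

The rate starts at the peak, passes through the midpoint of the two values halfway through training, and ends at the floor.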
59
+ ## Prompt format
60
+
61
+ Trained on a strict prefix structure:
62
+
63
+ ```
64
+ PROMPT: <your instruction>
65
+ CODE:
66
+ <generated code>
67
+ ```
68
+
69
+ Anything else is out-of-distribution. The Modelfile in this repo handles the formatting automatically.
70
+
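When calling the model outside Ollama, the prefix structure has to be built by hand. The helpers below are a minimal sketch: `build_prompt` and `extract_code` are hypothetical names, and cutting the completion at the next `PROMPT:` marker is an assumed stopping rule, not necessarily what the Modelfile does.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the PROMPT:/CODE: prefix structure."""
    return f"PROMPT: {instruction.strip()}\nCODE:\n"

def extract_code(completion: str) -> str:
    """Keep only the text before the model starts a new PROMPT: block.

    ASSUMPTION: a fresh 'PROMPT:' line marks the end of the generated code.
    """
    return completion.split("\nPROMPT:", 1)[0].rstrip()

prompt = build_prompt("write a function to reverse a string")
print(repr(prompt))
```

The resulting string matches the one passed to `-p` in the llama.cpp example in this README.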
71
+ ## Usage: Ollama (recommended)
72
+
73
+ ```bash
74
+ ollama run hf.co/Rohanify/PyBlissa-Coder-50M
75
+ ```
76
+
77
+ Or download the GGUF and the Modelfile from this repo and create the model locally:
78
+
79
+ ```bash
80
+ ollama create pyblissa-coder -f Modelfile
81
+ ollama run pyblissa-coder "write a function to merge two sorted lists"
82
+ ```
83
+
84
+ ## Usage: llama.cpp
85
+
86
+ ```bash
87
+ ./llama-cli -m PyBlissa-Coder-50M-F32.gguf \
88
+ -p "PROMPT: write a function to reverse a string\nCODE:\n" \
89
+ --temp 0.8 --top-k 50 --top-p 0.95 -n 256
90
+ ```
91
+
92
+ ## Files
93
+
94
+ | File | Purpose |
95
+ |---|---|
96
+ | `PyBlissa-Coder-50M-F32.gguf` | Full-precision GGUF weights |
97
+ | `Modelfile` | Ollama config (prompt template, stop tokens, sampling) |
98
+ | `tokenizer.json` | Custom 16k BPE tokenizer |
99
+
100
+ ## Limitations
101
+
102
+ - Python-only: other languages weren't in the training data
103
+ - 1024-token context: longer programs get truncated
104
+ - The small flytech subset (~6% of training data) contains code with unescaped-quote bugs; the model occasionally inherits this pattern
105
+ - No safety tuning, no RLHF: base model only
106
+
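The context limit means the prompt and the generated code share one 1024-token budget. The sketch below shows that bookkeeping; `max_new_tokens` is a hypothetical helper, and the prompt's token count is assumed to come from the repo's `tokenizer.json`, which is not loaded here.

```python
BLOCK_SIZE = 1024  # context length from the Stats table

def max_new_tokens(prompt_token_count: int, block_size: int = BLOCK_SIZE) -> int:
    """Tokens left for generation once the prompt is in the context window."""
    return max(block_size - prompt_token_count, 0)

# A 768-token prompt leaves room for the 256 new tokens requested
# by `-n 256` in the llama.cpp example.
print(max_new_tokens(768))
```

A prompt at or beyond the block size leaves no room for generated code at all.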
107
+ ## Acknowledgments
108
+
109
+ Datasets by NVIDIA and flytech. Built using a nanoGPT-style training recipe with custom tokenization. Tooling: PyTorch, HuggingFace `tokenizers`, llama.cpp for GGUF conversion.
110
+
111
+ ---
112
+
113
+ Made by Rohan, also known as ElectroPlayin on YouTube.