Pinkstack committed on
Commit f0fd979 · verified · 1 Parent(s): ef4ac4f

Update README.md

Files changed (1): README.md (+103, -41)

README.md CHANGED
@@ -1,59 +1,121 @@
  ---
- base_model: Pinkstack/thenew-oe
- library_name: transformers
- model_name: '12347'
  tags:
- - generated_from_trainer
- - trl
- - sft
- - unsloth
- licence: license
  ---

- # Model Card for 12347

- This model is a fine-tuned version of [Pinkstack/thenew-oe](https://huggingface.co/Pinkstack/thenew-oe).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="Pinkstack/12347", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```

- ## Training procedure

- This model was trained with SFT.

- ### Framework versions

- - TRL: 0.25.1
- - Transformers: 4.57.2
- - Pytorch: 2.9.0+cu126
- - Datasets: 4.0.0
- - Tokenizers: 0.22.1

- ## Citations

- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
 
  ---
+ license: apache-2.0
+
  tags:
+ - cot
+ - code
+ - mixtral
+ - generalist
+ - moe
+ - base
+ language:
+ - en
+ library_name: transformers
+ datasets:
+ - Private
  ---
+ This is the base model!
+
+ # Fijik-1.5 2.6B
+
+ Trained on H200, A100, and some RTX 2000 Ada GPUs.
+ Fijik-1.5 2.6B boasts serious performance at a fraction of the price, while keeping incredible inference speeds. The model runs at about 300 tokens/s on a single RTX 3080 (bf16 GGUF) and supports 32k context (in theory scalable up to 128k with minimal quality issues), while keeping a memory footprint low enough that many users can share one deployment, or a single user can run it on an edge device.
+ ![banner](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/w50uTbGedNBKnaOsWgOYZ.png)
+
+ # What it is
+ Fijik 1.5 is a generalist LLM with a knowledge cutoff of March 2025, though with limited information after July 2024. The original model was pre-trained on 2T tokens by Hugging Face (this model is based on SmolLM2 135M: https://huggingface.co/HuggingFaceTB/SmolLM2-135M); we then turned the original model into a 32-expert "Franken"-MoE.
+ Obviously, after that stage it was nowhere near finished, so heavy CPT (continual pre-training) was done. This also allowed us to scale the context from 8,192 tokens to 32k, and technically the model should work up to 128k tokens.
+
+ This model is completely uncensored, and thus is not ideal for production use cases where safety is a must.
+
+ **The model should be used for:**
+ - General chat applications
+ - A fun, fast local model
+ - Code suggestions / generation
+ - Fine-tuning for domain-specific tasks (e.g. front-end-only generation, title generation, tool calling, etc.)
+
+ **The model should NOT be used for:**
+ - Anything that needs lots of knowledge (the model is too small for that)
+ - Medical, legal, or other high-risk fields
+ - Math (from internal testing, the model is not good at math, though it could be fine-tuned to excel at it)
+
+
+ Overall, it is a special little model: it has a different style compared to other similarly sized LLMs, is completely uncensored, and is a very small MoE.
+
+ # Model information
+ | Feature | Amount/other |
+ |:------------------------|:----------------:|
+ | Chat model? | **No** |
+ | architecture | **Mixtral** |
+ | max_position_embeddings | **32,768** |
+ | intermediate_size | **1,536** |
+ | num_hidden_layers | **30** |
+ | hidden_size | **576** |
+ | num_experts_per_tok | **4** |
+ | num_attention_heads | **9** |
+ | vocab_size | **49,166** |
+ | rope_theta | **500,000** |
+
+ # CPT (continual pre-training)
+
+ To make a proper, decent base model for the size, CPT had to be done, both to make the experts actual experts and to improve the model's context length and knowledge.
+
+ The CPT data was ~60% synthetic and ~40% non-synthetic (across all CPT stages combined).
+
+ Five stages of continual pre-training were done.
+
+ We started with small batches, high noise, and forced diversion. The dataset included slightly lower-quality general 2025 wiki articles, older wiki articles, synthetic math from DeepSeek R1, a mix of synthetic and non-synthetic code, and some synthetic web datasets (like Cosmopedia). Later stages were similar with larger batches, and at stage 3 gpt-oss reasoning traces were added. Through stage 4 (inclusive) we used overall cleaner datasets and slightly higher learning rates (full training, not LoRA, for all CPT stages). At stage 5 we used a dataset very similar to stage 4's, but with added DeepSeek R1 reasoning traces, fewer sources, more data focused on code generation (from Qwen3 480B and DeepSeek R1), gpt-oss-generated and -cleaned articles, and more 2025 data, at a 32k context length.
+
+ By doing this, the model got an effective knowledge cutoff of March 2025, but with limited information past July 2024.
+
+ # SFT (supervised fine-tuning)
+
+ For SFT, a high-quality, diverse dataset of ~549M tokens was used. It was almost completely synthetic, with many examples generated by DeepSeek R1 and Qwen3 80B.
+
+ Estimated data mix:
+ - ~12% tool/JSON
+ - ~27% code generation (front-end, back-end, competitive coding)
+ - ~43% general chat / instruction following
+ - ~18% math
+
+ These percentages are estimated from the raw dataset mix; the real proportions are unknown.
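As a rough illustration (the card itself notes the real percentages are unknown), the estimated mix translates into approximate per-category token budgets:

```python
# Approximate SFT token budget per category, using the card's ~549M
# total and the estimated (not exact) mix percentages above.
total_tokens = 549_000_000
mix = {
    "tool/json": 0.12,
    "code generation": 0.27,
    "general chat / instruction following": 0.43,
    "math": 0.18,
}
assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares cover the full dataset

budgets = {name: share * total_tokens for name, share in mix.items()}
for name, tokens in budgets.items():
    print(f"{name}: ~{tokens / 1e6:.0f}M tokens")
```

So code generation alone accounts for roughly 148M of the ~549M tokens.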
 
+ # RL (reinforcement learning)
+
+ SFT was not enough, especially these days. After SFT, three different rounds of DPO (direct preference optimization) were done, which improved instruction following significantly. Yet that was still not enough, and more RL was done.
+
+ After the three DPO stages, DeepSeek-R1-like GRPO was done. (Note: DPO and GRPO were done with LoRA, except for the final DPO stage described below.) The GRPO used very hard rewards, so the model had a hard time earning good reward, but this actually helped it: before the GRPO stage(s), the model had significant looping issues, more incoherent outputs, and worse instruction following. This GRPO helped it think for less time, loop less, and be better overall.
+
+ But still, a little more was done. After this, two final stages were run:
+ - DPO (final): a different DPO dataset with more coding, stricter instruction following, and generalist chat (e.g. "Hi! What are you?"), with full fine-tuning enabled (no LoRA).
+ - GRPO (final): two epochs with the same dataset and rewards as the previous GRPO stages, as a last push.
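The card does not publish its reward functions. Purely as a hypothetical illustration of a "very hard" anti-looping reward of the kind a GRPO trainer consumes (TRL's GRPOTrainer, for example, accepts callables that map completions to per-sample scores), one might score a completion only when it never repeats an n-gram:

```python
# Hypothetical anti-looping reward sketch; the actual Fijik rewards are
# not published. Gives full credit only when no 5-gram repeats, in the
# spirit of the "very hard rewards" the card describes.

def repetition_reward(completions, n=5):
    scores = []
    for text in completions:
        words = text.split()
        ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        # Hard reward: 1.0 only when every n-gram is unique, else 0.0.
        scores.append(1.0 if len(ngrams) == len(set(ngrams)) else 0.0)
    return scores

print(repetition_reward(["the cat sat on the mat quietly",
                         "go go go go go go go go"]))  # → [1.0, 0.0]
```

A binary all-or-nothing reward like this is what makes "good reward" hard to earn, which the card credits with reducing the model's looping.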
 
+ # Benchmarks
+
+ None done yet; coming soon.
+
+ # How to run
+
+ This model should ideally be run with a system prompt, but it works perfectly fine without one.
+ It uses standard Qwen3 tool calls, but it should be fine-tuned to excel at tool calling, as it currently has some issues with it.
+
+ **Recommended sampling parameters:**
+ - Temperature: ```0.35```
+ - Top-k: ```35```
+ - Repetition penalty: ```1.1```
+ - Top-p: ```0.85```
+ - Min-p: ```0.1``` (optional)
+
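As a sketch of what the optional min-p setting does (a generic description of min-p filtering, not Fijik-specific code): tokens whose probability falls below `min_p` times the top token's probability are dropped before sampling.

```python
import math

# Generic min-p filtering sketch over raw next-token logits, using the
# card's recommended min_p=0.1 as the default.

def min_p_filter(logits, min_p=0.1):
    # Softmax the logits into probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens with at least min_p * (top token's probability).
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# Token 0 dominates; tokens under 10% of its probability are dropped.
print(min_p_filter([5.0, 4.5, 1.0, -2.0]))  # → [0, 1]
```

The other recommended values (temperature, top-k, top-p, repetition penalty) are applied by your inference stack's standard samplers.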
+ Test it out with a simple prompt, like "Why is the sky blue?"
+ Keep in mind, this model **does** support multi-turn conversations, but be aware that it expects previous responses to also contain their reasoning; removing the reasoning from a previous response could save compute and context, but would break the model.
+
+ When fine-tuning, you need at minimum 8 GB of memory for basic QLoRA at low context; ideally, 16 GB.
+
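A multi-turn history would therefore keep each assistant turn intact. In the sketch below, the `<think>...</think>` tag format is an assumption for illustration only; the card does not specify the reasoning markup.

```python
# Multi-turn prompting sketch. The card says previous assistant turns
# must keep their reasoning; the <think>...</think> tags here are an
# assumed format, not confirmed by the card.

history = [
    {"role": "user", "content": "Why is the sky blue?"},
    # Keep the assistant's full output, reasoning included -- stripping
    # it would save context but, per the card, breaks the model.
    {"role": "assistant",
     "content": "<think>Rayleigh scattering favors short wavelengths."
                "</think>Because air scatters blue light more than red."},
    {"role": "user", "content": "And why are sunsets red?"},
]

# Sanity check: prior reasoning is still present in the history.
print(all("<think>" in m["content"]
          for m in history if m["role"] == "assistant"))  # → True
```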
+ # Special thanks
+ This wouldn't have been possible without HuggingFaceTB (they trained SmolLM2 135M), Unsloth, MergeKit, and Transformers.
+
+ For questions, open a community discussion.