Pinkstack committed (verified) · Commit ec26fc6 · Parent: 6301388

Update README.md
 
---
tags:
- code
- mixtral
- conversational
- generalist
- moe
language:
- en
library_name: transformers
datasets:
- Private
---
This is the final, fully finished, chat-ready checkpoint.

# Fijik-1.5 2.6B

Trained on H200 and A100 GPUs, with some use of RTX 2000 Ada GPUs.
Fijik-1.5 2.6B boasts serious performance at a fraction of the price, while keeping incredible inference speeds. The model runs at about 300 tokens/s on a single RTX 3080 (bf16 GGUF), supports 32k context (which in theory can be scaled up to 128k with minimal quality loss), and keeps a low memory footprint, so many users can share it at the same time, or a single user can run it on an edge device.
![banner](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/w50uTbGedNBKnaOsWgOYZ.png)

# What it is
Fijik 1.5 is a generalist LLM with a knowledge cutoff of March 2025, yet with limited information after July 2024. The original model was pre-trained on 2T tokens by Hugging Face (this model is based on SmolLM2-135M: https://huggingface.co/HuggingFaceTB/SmolLM2-135M); we then turned the original model into a 32-expert "Franken"-MoE.
Obviously, after that stage it was nowhere near finished, so heavy CPT (continual pre-training) was done. This also allowed us to scale the context from 8,192 tokens to 32k, and technically the model should work up to 128k tokens.

This model is completely uncensored, and thus is not ideal for production use cases where safety is a must.

**The model should be used for:**
- General chat applications
- A fun, quick local model
- Code suggestions / generation
- Fine-tuning for domain-specific tasks (e.g. front-end-only generation, title generation, tool calling, etc.)

**The model should NOT be used for:**
- Anything that needs lots of knowledge (the model is too small for that)
- Medical, legal, or other high-risk fields
- Math (from internal testing, the model is not good at math, though it could be fine-tuned to excel at it)

Overall, it is a special little model: it has a different style compared to other similarly sized LLMs, is completely uncensored, and is a very small MoE.

# Model information
| Feature | Value |
|:-------------------|:----------------:|
| Chat model? | **Yes** |
| Architecture | **Mixtral** |
| max_position_embeddings | **32,768** |
| intermediate_size | **1,536** |
| num_hidden_layers | **30** |
| hidden_size | **576** |
| num_experts_per_tok | **4** |
| num_attention_heads | **9** |
| vocab_size | **49,166** |
| rope_theta | **500,000** |
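
As a sanity check, the 2.6B parameter count can be reproduced from the table above. This is a back-of-the-envelope sketch; the expert count (32, from the text), the number of KV heads (3, assumed from SmolLM2's grouped-query attention), and tied input/output embeddings are assumptions, not values from the table.

```python
# Rough parameter count for a Mixtral-style MoE from the config table.
# Assumptions (not in the table): 32 experts per layer, 3 KV heads
# (SmolLM2-style GQA), tied input/output embeddings.
hidden, inter, layers, heads, vocab = 576, 1536, 30, 9, 49_166
kv_heads, experts = 3, 32
head_dim = hidden // heads  # 64

embed = vocab * hidden                          # token embeddings (tied)
attn = (hidden * hidden) * 2 \
     + (hidden * kv_heads * head_dim) * 2       # q,o + k,v projections
router = hidden * experts                       # MoE gate per layer
expert_mlp = experts * 3 * hidden * inter       # gate/up/down per expert
norms = 2 * hidden                              # two RMSNorms per layer

total = embed + layers * (attn + router + expert_mlp + norms) + hidden
print(f"{total / 1e9:.2f}B parameters")         # ≈ 2.60B
```

The expert MLPs dominate: roughly 85M of the ~86M parameters per layer sit in the 32 experts, which is why only 4 active experts per token keep inference cheap relative to the total size.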

# CPT (continual pre-training)

To make a properly decent base model for this size, CPT had to be done, both to make the experts actual experts and to improve the model's context length and knowledge.

The CPT data was ~60% synthetic and ~40% non-synthetic (across all CPT stages combined).

5 stages of continual pre-training were done.

We started with low batch sizes, high noise, and forced diversity. The dataset included slightly lower-quality general 2025 wiki articles, older wiki articles, synthetic math from DeepSeek R1, a mix of synthetic and non-synthetic code, and some synthetic web datasets (like Cosmopedia). Later stages used similar data at higher batch sizes, and at stage 3 gpt-oss reasoning traces were added. Up to and including stage 4 we used overall cleaner datasets and slightly higher learning rates (full training, not LoRA, for all stages of CPT). At stage 5 we used a dataset very similar to stage 4's, but with added DeepSeek R1 reasoning traces, fewer sources, more focused code-generation data (from Qwen3 480B and DeepSeek R1), gpt-oss-generated and cleaned articles, and more 2025 data, at a 32k context length.

By doing this, the model got an effective knowledge cutoff of March 2025, but with limited information past July 2024.

# SFT (supervised fine-tuning)

For SFT, a ~549M-token, high-quality, diverse dataset was used. It was almost completely synthetic, with many examples generated by DeepSeek R1 and Qwen3 80B.

Estimated data mix:
- ~12% tool/JSON
- ~27% code generation (front-end, back-end, competitive coding)
- ~43% general chats / instruction following
- ~18% math

Estimated from the raw dataset mix; the real percentages are unknown.
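
For a rough sense of scale, those shares translate into token counts like this (the percentages are the card's own estimates, so these are ballpark figures only):

```python
# Ballpark token counts implied by the estimated SFT data mix.
total_tokens = 549_000_000  # ~549M SFT tokens
mix = {
    "tool/json": 0.12,
    "code generation": 0.27,
    "general chat / instruction following": 0.43,
    "math": 0.18,
}
assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares cover the whole mix
tokens = {name: round(share * total_tokens) for name, share in mix.items()}
for name, n in tokens.items():
    print(f"{name}: ~{n / 1e6:.0f}M tokens")
```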

# RL (reinforcement learning)

SFT was not enough, especially these days. After SFT, 3 different "rounds" of DPO (direct preference optimization) were done, which improved instruction following significantly; yet that was still not enough, and more RL was done.

After the 3 DPO stages, DeepSeek-R1-like GRPO was done (note: DPO and GRPO were done with LoRA, except for the final DPO stage described below). The GRPO used very hard rewards, so the model had a "hard time" getting good reward, but this actually helped it: before these GRPO stages the model had significant looping issues, more incoherent outputs, and worse instruction following. The GRPO helped it think for less time, go into loops less, and be better overall.

But still, a little more was done. After this, two final stages were run:
- DPO (final): a different DPO dataset with more coding, stricter instruction following, and generalist chat (e.g. "Hi! What are you?"), with full fine-tuning enabled (no LoRA).
- GRPO (final): two epochs of the same dataset and rewards as the previous GRPO stages, just as a last push.

# Benchmarks

None done yet; coming soon.

# How to run

Ideally, this model should be run with a system prompt, though it works perfectly fine without one.
It uses standard Qwen3 tool calls, but it should be fine-tuned to excel at them, as it currently has some issues with tool calling.
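
As an illustration of the Qwen3-style tool-call markup mentioned above, here is a minimal parser sketch. The `<tool_call>` tag format follows Qwen3's convention; the `get_weather` tool and the sample reply are hypothetical.

```python
import json
import re

# Hypothetical assistant reply using Qwen3-style <tool_call> markup.
reply = """Let me check that for you.
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>"""

def extract_tool_calls(text: str) -> list[dict]:
    """Pull each JSON object out of <tool_call>...</tool_call> blocks."""
    blocks = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    return [json.loads(b) for b in blocks]

print(extract_tool_calls(reply))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```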

**Recommended sampling parameters:**
- Temperature: ```0.35```
- Top-k: ```35```
- Repetition penalty: ```1.1```
- Top-p: ```0.85```
- Min-p: ```0.1``` (optional)
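
To show how those knobs interact, here is a minimal, dependency-free sketch of one sampling step with the recommended values. This assumes the standard definitions of each sampler; real inference engines may apply them in a different order.

```python
import math

def sample_step_probs(logits, prev_tokens, temperature=0.35, top_k=35,
                      top_p=0.85, min_p=0.1, rep_penalty=1.1):
    """Apply repetition penalty, temperature, top-k, top-p and min-p to
    raw logits and return the filtered, renormalized distribution."""
    logits = list(logits)
    # Repetition penalty: dampen tokens that were already generated.
    for t in set(prev_tokens):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Temperature scaling, then softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Top-k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep = set(order[:top_k])
    # Top-p (nucleus): smallest prefix of `order` with mass >= top_p.
    nucleus, mass = set(), 0.0
    for i in order:
        nucleus.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    keep &= nucleus
    # Min-p: drop tokens below min_p times the top probability.
    cutoff = min_p * probs[order[0]]
    keep = {i for i in keep if probs[i] >= cutoff}
    # Renormalize over the survivors.
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}
```

At a temperature of 0.35 the distribution is sharpened considerably before the cutoffs even apply, which fits the model's preference for focused, low-variance decoding.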

Test it out with a simple prompt, like "Why is the sky blue?"
Keep in mind, this model **does** support multi-turn chats, but be aware that it expects the previous response to also contain its reasoning; removing reasoning from previous responses could save compute and context, but would break the model.
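
For example, a stored history for the second turn would keep the earlier reasoning intact. The `<think>` tags are an assumption about how the model marks its reasoning, and the message contents are made up:

```python
# Multi-turn history: the earlier assistant turn keeps its reasoning,
# since the model expects previous responses to include it.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant",
     "content": "<think>Shorter wavelengths scatter more...</think>"
                "Because of Rayleigh scattering."},
    {"role": "user", "content": "And why are sunsets red?"},
]
# Do NOT strip the <think>...</think> span before the next generation.
```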

When fine-tuning, you need at minimum 8 GB of memory for basic QLoRA at low context; ideally, 16 GB.

# Special thanks
This wouldn't have been possible without HuggingFaceTB (they trained SmolLM2 135M), Unsloth, MergeKit, and Transformers.

For questions, open a community discussion.