Pinkstack committed on
Commit f0fd979 · verified · 1 Parent(s): ef4ac4f

Update README.md

Files changed (1): README.md (+103, -41)

README.md CHANGED
@@ -1,59 +1,121 @@
  ---
- base_model: Pinkstack/thenew-oe
- library_name: transformers
- model_name: '12347'
  tags:
- - generated_from_trainer
- - trl
- - sft
- - unsloth
- licence: license
  ---

- # Model Card for 12347

- This model is a fine-tuned version of [Pinkstack/thenew-oe](https://huggingface.co/Pinkstack/thenew-oe).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="Pinkstack/12347", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```

- ## Training procedure

- This model was trained with SFT.

- ### Framework versions

- - TRL: 0.25.1
- - Transformers: 4.57.2
- - Pytorch: 2.9.0+cu126
- - Datasets: 4.0.0
- - Tokenizers: 0.22.1

- ## Citations

- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
 
  ---
+ license: apache-2.0
+
  tags:
+ - cot
+ - code
+ - mixtral
+ - generalist
+ - moe
+ - base
+ language:
+ - en
+ library_name: transformers
+ datasets:
+ - Private
  ---
+ This is the base model!
+
+ # Fijik-1.5 2.6B
+
+ Trained on H200, A100, and some RTX 2000 Ada GPUs.
+ Fijik-1.5 2.6B boasts serious performance at a fraction of the price, while keeping incredible inference speeds. The model runs at about 300 tokens/s on a single RTX 3080 (bf16 GGUF) and supports 32k context (in theory scalable up to 128k with minimal quality issues), while keeping a memory footprint low enough that many users can share one deployment, or a single user can run it on an edge device.
+ ![banner](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/w50uTbGedNBKnaOsWgOYZ.png)
+
+ # What it is
+ Fijik 1.5 is a generalist LLM with a knowledge cutoff of March 2025, though with limited information after July 2024. The original model was pre-trained on 2T tokens by Hugging Face (this model is based on SmolLM2 135M: https://huggingface.co/HuggingFaceTB/SmolLM2-135M); we then turned the original model into a 32-expert "Franken"-MoE.
+ Obviously, after that stage it was nowhere near finished, so heavy CPT (continual pre-training) was done. This also allowed us to scale the context from 8,192 tokens to 32k, and technically the model should work up to 128k tokens.
+
+ This model is completely uncensored, and thus is not ideal for production use cases where safety is a must.
+
+ **The model should be used for:**
+ - General chat applications
+ - A fun, fast local model
+ - Code suggestions / generation
+ - Fine-tuning for domain-specific tasks (e.g. front-end-only generation, title generation, tool calling, etc.)
+
+ **The model should NOT be used for:**
+ - Anything that needs lots of knowledge (the model is too small for that)
+ - Medical, legal, or other high-risk fields
+ - Math (from internal testing, the model is not good at math, though it could be fine-tuned to excel at it)
+
+
+ Overall, it is a special little model: it has a different style compared to other similarly sized LLMs, is completely uncensored, and is a very small MoE.
+
+ # Model information
+ | Feature | Amount/other |
+ |:------------------------|:----------------:|
+ | Chat model? | **No** |
+ | architecture | **Mixtral** |
+ | max_position_embeddings | **32,768** |
+ | intermediate_size | **1,536** |
+ | num_hidden_layers | **30** |
+ | hidden_size | **576** |
+ | num_experts_per_tok | **4** |
+ | num_attention_heads | **9** |
+ | vocab_size | **49,166** |
+ | rope_theta | **500,000** |
+
+ # CPT (continual pre-training)
+
+ To make a proper, decent base model for the size, CPT had to be done, both to make the experts actual experts and to improve the model's context length and knowledge.
+
+ The CPT data was ~60% synthetic and ~40% non-synthetic (across all CPT stages combined).
+
+ Five stages of continual pre-training were done.
+
+ We started with small batches, high noise, and forced diversion. The dataset included slightly lower-quality general 2025 wiki articles, older wiki articles, synthetic math from DeepSeek R1, a mix of synthetic and non-synthetic code, and some synthetic web datasets (like Cosmopedia). Later stages were similar with larger batches, and at stage 3 gpt-oss reasoning traces were added. Through stage 4 (inclusive) we used overall cleaner datasets and slightly higher learning rates (full training, not LoRA, for all CPT stages). At stage 5 we used a dataset very similar to stage 4's, but with added DeepSeek R1 reasoning traces, fewer sources, more data focused on code generation (from Qwen3 480B and DeepSeek R1), gpt-oss-generated and -cleaned articles, and more 2025 data, at a 32k context length.
+
+ By doing this, the model got an effective knowledge cutoff of March 2025, but with limited information past July 2024.
+
+ # SFT (supervised fine-tuning)
+
+ For SFT, a high-quality, diverse dataset of ~549M tokens was used. It was almost completely synthetic, with many examples generated by DeepSeek R1 and Qwen3 80B.
+
+ Estimated data mix:
+ - ~12% tool/JSON
+ - ~27% code generation (front-end, back-end, competitive coding)
+ - ~43% general chat / instruction following
+ - ~18% math
+
+ These percentages are estimated from the raw dataset mix; the real proportions are unknown.
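As a rough illustration (the card itself notes the real percentages are unknown), the estimated mix translates into approximate per-category token budgets:

```python
# Approximate SFT token budget per category, using the card's ~549M
# total and the estimated (not exact) mix percentages above.
total_tokens = 549_000_000
mix = {
    "tool/json": 0.12,
    "code generation": 0.27,
    "general chat / instruction following": 0.43,
    "math": 0.18,
}
assert abs(sum(mix.values()) - 1.0) < 1e-9  # shares cover the full dataset

budgets = {name: share * total_tokens for name, share in mix.items()}
for name, tokens in budgets.items():
    print(f"{name}: ~{tokens / 1e6:.0f}M tokens")
```

So code generation alone accounts for roughly 148M of the ~549M tokens.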
 
+ # RL (reinforcement learning)
+
+ SFT was not enough, especially these days. After SFT, three different rounds of DPO (direct preference optimization) were done, which improved instruction following significantly. Yet that was still not enough, and more RL was done.
+
+ After the three DPO stages, DeepSeek-R1-like GRPO was done. (Note: DPO and GRPO were done with LoRA, except for the final DPO stage described below.) The GRPO used very hard rewards, so the model had a hard time earning good reward, but this actually helped it: before the GRPO stage(s), the model had significant looping issues, more incoherent outputs, and worse instruction following. This GRPO helped it think for less time, loop less, and be better overall.
+
+ But still, a little more was done. After this, two final stages were run:
+ - DPO (final): a different DPO dataset with more coding, stricter instruction following, and generalist chat (e.g. "Hi! What are you?"), with full fine-tuning enabled (no LoRA).
+ - GRPO (final): two epochs with the same dataset and rewards as the previous GRPO stages, as a last push.
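The card does not publish its reward functions. Purely as a hypothetical illustration of a "very hard" anti-looping reward of the kind a GRPO trainer consumes (TRL's GRPOTrainer, for example, accepts callables that map completions to per-sample scores), one might score a completion only when it never repeats an n-gram:

```python
# Hypothetical anti-looping reward sketch; the actual Fijik rewards are
# not published. Gives full credit only when no 5-gram repeats, in the
# spirit of the "very hard rewards" the card describes.

def repetition_reward(completions, n=5):
    scores = []
    for text in completions:
        words = text.split()
        ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        # Hard reward: 1.0 only when every n-gram is unique, else 0.0.
        scores.append(1.0 if len(ngrams) == len(set(ngrams)) else 0.0)
    return scores

print(repetition_reward(["the cat sat on the mat quietly",
                         "go go go go go go go go"]))  # → [1.0, 0.0]
```

A binary all-or-nothing reward like this is what makes "good reward" hard to earn, which the card credits with reducing the model's looping.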
 
+ # Benchmarks
+
+ None done yet; coming soon.
+
+ # How to run
+
+ This model should ideally be run with a system prompt, but it works perfectly fine without one.
+ It uses standard Qwen3 tool calls, but it should be fine-tuned to excel at tool calling, as it currently has some issues with it.
+
+ **Recommended sampling parameters:**
+ - Temperature: ```0.35```
+ - Top-k: ```35```
+ - Repetition penalty: ```1.1```
+ - Top-p: ```0.85```
+ - Min-p: ```0.1``` (optional)
+
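As a sketch of what the optional min-p setting does (a generic description of min-p filtering, not Fijik-specific code): tokens whose probability falls below `min_p` times the top token's probability are dropped before sampling.

```python
import math

# Generic min-p filtering sketch over raw next-token logits, using the
# card's recommended min_p=0.1 as the default.

def min_p_filter(logits, min_p=0.1):
    # Softmax the logits into probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens with at least min_p * (top token's probability).
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# Token 0 dominates; tokens under 10% of its probability are dropped.
print(min_p_filter([5.0, 4.5, 1.0, -2.0]))  # → [0, 1]
```

The other recommended values (temperature, top-k, top-p, repetition penalty) are applied by your inference stack's standard samplers.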
+ Test it out with a simple prompt, like "Why is the sky blue?"
+ Keep in mind, this model **does** support multi-turn conversations, but be aware that it expects previous responses to also contain their reasoning; removing the reasoning from a previous response could save compute and context, but would break the model.
+
+ When fine-tuning, you need at minimum 8 GB of memory for basic QLoRA at low context; ideally, 16 GB.
+
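A multi-turn history would therefore keep each assistant turn intact. In the sketch below, the `<think>...</think>` tag format is an assumption for illustration only; the card does not specify the reasoning markup.

```python
# Multi-turn prompting sketch. The card says previous assistant turns
# must keep their reasoning; the <think>...</think> tags here are an
# assumed format, not confirmed by the card.

history = [
    {"role": "user", "content": "Why is the sky blue?"},
    # Keep the assistant's full output, reasoning included -- stripping
    # it would save context but, per the card, breaks the model.
    {"role": "assistant",
     "content": "<think>Rayleigh scattering favors short wavelengths."
                "</think>Because air scatters blue light more than red."},
    {"role": "user", "content": "And why are sunsets red?"},
]

# Sanity check: prior reasoning is still present in the history.
print(all("<think>" in m["content"]
          for m in history if m["role"] == "assistant"))  # → True
```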
+ # Special thanks
+ This wouldn't have been possible without HuggingFaceTB (they trained SmolLM2 135M), Unsloth, MergeKit, and Transformers.
+
+ For questions, open a community discussion.