MiniReasoner-70M

Model Description

MiniReasoner-70M is a 70M-parameter transformer language model, built in PyTorch and trained on 1.4B tokens from the synthetic dataset nvidia/OpenCodeReasoning (reasoning traces generated with DeepSeek R1). The primary objective of this project is educational: to explore the full lifecycle of designing and developing a transformer-based language model, rather than to compete with state-of-the-art large language models.

This work provides hands-on experience with tokenization, model architecture, optimization strategies, and training dynamics, while also emphasizing efficiency and modern improvements over the original Transformer of "Attention Is All You Need" (Vaswani et al., 2017).


Features

  • RMSNorm replacing LayerNorm → rescales activations by their root mean square only (no mean subtraction), which is cheaper to compute and trains stably.
  • PreNorm transformer → enables better gradient flow in deeper networks.
  • SwiGLU activation (as in PaLM, LLaMA) → gated feed-forward layer that adds expressiveness over a plain ReLU/GELU MLP.
  • Removed bias terms and dropout → makes the model leaner and small-model friendly.
  • 2.66× (≈ 8/3) rule for the SwiGLU feed-forward hidden size → keeps the parameter count close to that of a standard 4× feed-forward layer.
  • FlashAttention → efficient and scalable attention with reduced memory usage.
  • RoPE (Rotary Positional Embeddings) → better handling of positional information and extrapolation.
  • Weight tying (input ↔ output embeddings) → parameter efficiency and improved generalization.
  • Cosine learning rate decay with warmup (1000 steps) → smooth optimization and convergence.
  • Mixed precision (bfloat16) + gradient clipping → faster training while preventing instability.
  • Custom tokenizer trained from scratch (vocab = 8192) → independence from existing tokenizers.
  • Streaming DataLoader → dynamically loads data batches as needed.
  • Checkpointing every 500 steps → reliable pause/resume capability.
  • TensorBoard integration → tracked loss and perplexity with real-time visualization.
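
The cosine learning rate decay with warmup listed above can be sketched in a few lines. This is a minimal illustration, not the project's actual scheduler: only the 1000 warmup steps come from this card, while the function name `lr_at_step` and the `max_lr`, `min_lr`, and `total_steps` values are assumptions chosen for the example.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine factor goes smoothly from 1 (start of decay) to 0 (end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


# Hypothetical hyperparameters; only warmup_steps=1000 is taken from the card.
schedule = [lr_at_step(s, 3e-4, 3e-5, 1000, 10000) for s in (0, 999, 5500, 10000)]
print(schedule)
```

In a PyTorch training loop, the returned value would typically be written into each `param_group["lr"]` of the optimizer before every step.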

Results

Disclaimer: The model does not produce good results yet.

  • Responses often fail to solve tasks or reason properly.
  • This is expected, since the goal was educational: implementing a training pipeline, not producing a competitive LLM.

Example Outputs

Input : Given a string s. Can you make it a palindrome by deleting exactly one character? Note that size of the string after deletion would be one less than it was before. Input First line of the input contains a single integer T denoting number of test cases. For each test case, you are given a single line containing string s. Output For each test case, print YES or NO depending on the answer of the problem. Constraints Example Input: 4 aaa abc abdbca abba Output: YES NO YES YES Explanation Example case 1. Delete any one 'a', resulting string is "aa" which is a palindrome. Example case 2. It is not possible to delete exactly one character and having a palindrome. Example case 3. Delete 'c', resulting string is "abdba" which is a palindrome. Example case 4. Delete 'b', resulting string is "aba" which is a palindrome.

Output
Okay, I need to solve this problem where I have to determine if a given string can be made into a palindrome by deleting exactly one character. Let's think about how to approach this.

First, the problem says that after deletion, the string must be a palindrome. So, the approach should be to check for each possible character in the string whether it can be removed to make it a palindrome. But checking every possible character would be O(n^2) per test case, which is way too slow for n up to 1e5. So that's not feasible. So I need a smarter way.

Hmm. Let's think about the conditions. For a palindrome to be possible, the string must have a structure where each character appears at most twice, and the structure must be such that the maximum count of any character is at most half. Wait, maybe not. Let's think: for a palindrome to be possible, the maximum count of any character must be <= half of the length. Because if the maximum count is <= half, then we can remove all characters. But wait, maybe not. Let's see.

Wait, for example, in the first sample input: 'abdca'. The maximum count is 2. 2 <= 2.5. So yes. The second sample: 'abdca' has counts 2 each. 2 <= 2.5. So yes. The third sample: 'abdca' has counts 3 each. 3 <= 2.5. So yes. The fourth sample: 'abdca' has counts 3 each. 3 <= 2.5. So yes. So the condition is that the maximum count of any character must be <= half of the length.

So the condition is that the maximum count of any character must be <= half of the length. Because then, removing all characters would make the string a palindrome.

So the problem reduces to checking for each character in the string, how many times it appears in the string. Because each occurrence of that character must be removed. So for example, if the maximum count is 3, then we need at least 3 occurrences. So the condition is that the maximum count must be <= half of the length.

Wait, but in the first sample, the maximum count is 2, which is <= half of 3. So yes. The second sample's maximum count is 3, which is more than half. So the condition is not sufficient. Because in that case, removing all characters would leave a palindrome. So the condition is not just that the maximum count is <= half of the length. So how can we model this?

Hmm. Let's think about the maximum possible number of occurrences of a character. For example, if the maximum count is 3, then we can remove 3 occurrences. But if the maximum count is 2, then we can remove 2 occurrences. Wait, but in that case, the maximum count would be 2, which is <= half of the length. So yes. So the condition is that the maximum count of any character is <= half of the length. Wait, but how?

Wait, the maximum count of any character is <= half of the length. So the condition is that the maximum count of any character is <= half of the length. Because if that's true, then removing all characters would leave a palindrome. So the condition is that the maximum count of any character is <= half of the length. Wait, no. Let's see. Suppose the maximum count is 3, half is 1.5. Then removing 3 occurrences leaves a palindrome. So yes. So the condition is that the maximum count of any character is <= half of the length. Wait, but that's not always true. For example, if the maximum count is 3, half is 1.5. Then removing 3 occurrences leaves a palindrome. So yes. So the condition is that the maximum count of any character is <= half of the length. Because if that's true, then removing all characters leaves a palindrome. So the condition is that the maximum count of any character is <= half of the length.

Wait, but that's not possible. For example, if the maximum count is 3, half is 1.5. Then removing 3 occurrences leaves a palindrome. So yes. So the condition is that the maximum count of any character is <= half of the length. Because then, removing all characters leaves a palindrome.

So the condition is that the maximum count of any character is <= half of the length. Because if that's the case, then removing all characters leaves a palindrome. So the condition is that the maximum count of any character is <= half of the length. Wait, but that's not necessarily true. For example, if the maximum count is 3, half is 1.5. Then removing all 3 would leave a palindrome. So yes. So the condition is that the maximum count of any character is <= half of the length.

So the problem reduces to: for each character in the string, compute the maximum count of any character. Then check if this maximum count is <= half of the length. If yes, then answer is YES, else NO.

Wait, but in the first sample, the maximum count is 3. So 3 <= half is true, so the condition is satisfied. So answer is YES. Which matches the first sample.

Second sample: the string is "abc". The maximum count is 2. So 2 <= half is true, so condition is satisfied. But the answer is NO. So the condition is not sufficient. Wait, but in the second sample, the maximum count is 2, which is exactly half of the length. So condition is satisfied. But the answer is NO. So the condition is not sufficient.

So the initial approach is wrong. So what's the correct way to model this?

Hmm. So what's the correct way to model this?

Alternative approach: the maximum number of occurrences of any character is <= half of the length. So the condition is that the maximum count of any character is <= half of the length. Because if that's true, then removing all characters leaves a palindrome. So the condition is satisfied.

Wait, but that's not true. For example, if the maximum count is 3, half is 1.5. Then removing 3 occurrences leaves a palindrome. So yes. So the condition is satisfied. So the condition is that the maximum count of any character is <= half of the length.

So the problem is to check if the maximum count of any character is <= half of the length.

So the code should be:

Read T.

For each T:

s = input().strip()

if s == s[::-1]:

   print YES

else:

   for each c in s:

       if s[i] != c and s[i+1] != c:

           print YES

   else:

       print NO

That's the code.

So the code is straightforward.

T = int(input())
for _ in range(T):
    s = input().strip()
    if s == s[::-1]:
        print("YES")
    else:
        for c in s:
            if s[i] != c and s[i+1] != c:
                print("YES")
        else:
            print("NO")
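
The sample above does not solve the task: the reasoning fixates on character counts, and the final code references an undefined index `i`. For comparison, here is a minimal sketch of one standard two-pointer approach. It was written for this card and was not produced by the model; the function names are illustrative.

```python
def is_palindrome_range(s, lo, hi):
    """Return True if s[lo..hi] (inclusive) reads the same both ways."""
    while lo < hi:
        if s[lo] != s[hi]:
            return False
        lo += 1
        hi -= 1
    return True


def can_delete_one_for_palindrome(s):
    """Return True if deleting exactly one character leaves a palindrome."""
    lo, hi = 0, len(s) - 1
    while lo < hi:
        if s[lo] != s[hi]:
            # The deleted character must be one of the two mismatched ends.
            return (is_palindrome_range(s, lo + 1, hi)
                    or is_palindrome_range(s, lo, hi - 1))
        lo += 1
        hi -= 1
    # Already a palindrome: deleting a middle character keeps it one.
    return True


# Sample cases from the problem statement.
for word in ["aaa", "abc", "abdbca", "abba"]:
    print("YES" if can_delete_one_for_palindrome(word) else "NO")
```

Each string is scanned a constant number of times, so the check runs in linear time per test case.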

Training Curves

Cross Entropy Loss

[Figure: cross-entropy loss curve]

Perplexity

[Figure: perplexity curve]
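
The two curves are directly related: perplexity is the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm), which is the PyTorch default; the function name `perplexity` is illustrative and not taken from the repository.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Baseline: a model guessing uniformly over the 8192-token vocabulary
# has loss ln(8192), which corresponds to a perplexity of 8192.
uniform_loss = math.log(8192)
print(uniform_loss, perplexity(uniform_loss))
```

This gives a useful sanity check when reading the curves: an untrained model should start near the uniform baseline, and lower perplexity means the model is spreading probability over fewer plausible next tokens.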


Usage (via Hugging Face)

Note: For better results, use the sample_output.py script provided in the repository.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Jainil2502/minireasoner-70m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Jainil2502/minireasoner-70m", trust_remote_code=True)

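# Example prompt (replace with your own problem statement)
prompt = "Given a string s. Can you make it a palindrome by deleting exactly one character?"
inputs = tokenizer(prompt, return_tensors="pt")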

# Generate output
with torch.no_grad():
    outputs = model.generate(
        inputs['input_ids'],
        max_new_tokens=1000,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode and print
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {result}")

Repository

The full source code, examples, and additional resources for MiniReasoner-70M are available on GitHub.

Acknowledgements

A special thanks to Andrej Karpathy for his Zero to Hero playlist — a true masterpiece that inspired and guided this project.

Model size: 74.3M params (Safetensors, tensor type F32)