---
license: mit
language:
- en
tags:
- tiny
- slm
- tlm
- llm
- small
- question-generator
- harley-ml
- small-language-model
- experiment
- experimental
- text-generation
- question-generation
- questions
- question
---
# StopAskingQuestionsMini-656k
This model is small. Well, that's an understatement. But welcome to the world of tiny language models.
StopAskingQuestionsMini is a 656-thousand-parameter language model trained on roughly 23 million tokens of questions without answers. That may sound counterintuitive:
> What is the point of generating questions with no answer?
There is no practical reason for doing so. However, this model wasn't built for practical use; it was built to chip away at the question I keep returning to:
> How much intellect can you stuff into a tiny model before it collapses?
Neither this project nor any of our others truly answers that question, because there is always a new advancement. For example, DeepSeek created [Engram](https://arxiv.org/pdf/2601.07372), a novel architectural component that increases knowledge storage at very low compute cost.
Furthermore, you might ask:
> What can this model even do?
Not much. It can generate partially coherent questions, and that's pretty much it.
## Architecture
StopAskingQuestionsMini uses a scaled-down version of the [Qwen3](https://arxiv.org/abs/2505.09388) architecture.
| Parameter | Value |
|-----------|-------|
| Hidden Layers | 2 |
| Hidden Size | 128 |
| Attention Heads | 2 |
| KV Heads | 2 |
| Intermediate Size | 512 |
| RoPE Theta | 10000.0 |
| Max Position Embeddings | 96 |
| Tie Word Embeddings | True |
| Vocab Size | 1024 |
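For reference, here is a minimal sketch of how the table above could be expressed as a `transformers` `Qwen3Config`. This is an illustration rather than the shipped config file; in particular, `head_dim=64` is an assumption (it is not listed in the table), chosen because it makes the total parameter count land near 656K.
```python
# Minimal sketch: the architecture table rebuilt as a Qwen3 config.
# NOTE: head_dim=64 is an assumption; everything else comes from the table.
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=1024,
    hidden_size=128,
    intermediate_size=512,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
    head_dim=64,                  # assumption, see note above
    max_position_embeddings=96,
    rope_theta=10000.0,
    tie_word_embeddings=True,
)

model = Qwen3ForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~656K under these assumptions
```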
## Training
StopAskingQuestionsMini was trained on roughly 23 million tokens of questions for two epochs with a batch size of 16.
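The training script is not published here; the sketch below only shows how the two documented hyperparameters (2 epochs, batch size 16) would map onto a standard `transformers` `Trainer` setup. The learning rate, logging cadence, and dataset handling are assumptions, not the values actually used.
```python
# Illustrative only: maps the documented settings onto TrainingArguments.
# Only num_train_epochs and per_device_train_batch_size come from the card.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=2,              # documented
    per_device_train_batch_size=16,  # documented
    learning_rate=3e-4,              # assumption
    logging_steps=100,               # assumption
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   data_collator=collator)
# trainer.train()
```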
### Training Results
| Epoch | Train Loss | Eval Loss | Train PPL | Eval PPL |
|-------|------------|-----------|-----------|----------|
| 0.07 | 4.0797 | 3.0011 | 59.05 | 20.11 |
| 0.22 | 2.6331 | 2.5703 | 13.92 | 13.07 |
| 0.37 | 2.4906 | 2.4586 | 12.07 | 11.68 |
| 0.52 | 2.4213 | 2.3989 | 11.26 | 11.01 |
| 0.66 | 2.3700 | 2.3552 | 10.70 | 10.54 |
| 0.81 | 2.3375 | 2.3242 | 10.35 | 10.22 |
| 0.96 | 2.3094 | 2.2949 | 10.07 | 9.92 |
| 1.11 | 2.2720 | 2.2746 | 9.70 | 9.72 |
| 1.26 | 2.2527 | 2.2533 | 9.51 | 9.52 |
| 1.40 | 2.2345 | 2.2367 | 9.34 | 9.36 |
| 1.55 | 2.2239 | 2.2212 | 9.24 | 9.22 |
| 1.70 | 2.2043 | 2.2044 | 9.06 | 9.06 |
| 1.85 | 2.1885 | 2.1930 | 8.92 | 8.96 |
| 1.99 | 2.1843 | 2.1854 | 8.88 | 8.90 |
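The perplexity columns are simply the exponentiated losses, so the table can be sanity-checked in one line:
```python
# Perplexity = exp(cross-entropy loss); checking the final eval row:
import math
print(math.exp(2.1854))  # ~8.89, consistent with the reported 8.90
                         # (the small gap comes from the loss being rounded)
```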
## Benchmarks
We benchmarked our model against GPT-2, SmolLM2-135M, and Qwen3-0.6B-Base on a question generation task:
| Model | Params | Avg Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent |
|-------|--------|-----------|----------|-----------------|--------------------|------------|
| **StopAskingQuestionsMini** (this) | 656K | 0.4395 | 42 | 60 | 37 | 161 |
| GPT-2 | 117M | 0.3874 | 16 | 50 | 49 | 185 |
| SmolLM2-135M | 135M | 0.5193 | 36 | 98 | 40 | 111 |
| Qwen3-0.6B-Base | 600M | 0.7359 | 165 | 79 | 16 | 40 |
Each model generated two to three hundred continuations of the prefix `Question:`. [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) scored each one using a decimal grading system (0.0 to 1.0).
Our model generated the second-highest number of coherent questions while having fewer parameters than most character-level RNNs.
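The evaluation loop itself is not published; the sketch below only illustrates its shape: sample continuations of `Question:` from the small model, then have a judge model grade each one from 0.0 to 1.0. The judge prompt wording, sampling settings, and sample count here are assumptions.
```python
# Illustrative sketch of the benchmark loop (not the exact script used).
from transformers import pipeline

generator = pipeline("text-generation", model="harley-ml/StopAskingQuestionsMini-656k")
samples = generator("Question:", max_new_tokens=32, do_sample=True,
                    num_return_sequences=10)  # the real run used a few hundred
questions = [s["generated_text"] for s in samples]

judge_prompt = (
    "Grade the following question for coherence on a scale from 0.0 to 1.0. "
    "Reply with only the number.\n\nQuestion: {q}"
)
# Each generated question would then be sent to Qwen/Qwen3-32B with
# judge_prompt.format(q=...) and the returned scores averaged per model.
```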
## Generations
Prompt: **`Question:`**
Generation 1:
```text
what legal reforms faced rafer leadership during ww1?
```
Generation 2:
```text
How many emissions should a frather?
```
Generation 3:
```text
What do foreigners do?
```
Generation 4:
```text
What is the best appropriate way to learn Japanese?
```
Generation 5:
```text
How much is the MDU and JavaScript to the new UK?
```
## Use Cases
Unfortunately, as stated earlier, there is no practical use case, but here are some interesting ideas:
1. Test model for pipelines, code, and training
2. Educational research on language models
3. Experimentation on constrained hardware
4. Or, more simply, for fun.
## Limitations
Everything.
But more specifically,
1. Cannot generate sentences, paragraphs, code, or anything other than questions
2. Cannot reason
3. Short context
4. Incoherent
## Inference
```python
# =============================================================================
# Inference
# =============================================================================
MODEL_DIR = "harley-ml/StopAskingQuestionsMini-656k"       # Hub repo id or local path
TOKENIZER_PATH = "harley-ml/StopAskingQuestionsMini-656k"  # repo id, directory, or tokenizer.json file

# --- Generation settings ---
PROMPT = "Question:"  # prompt
MAX_NEW_TOKENS = 96
TEMPERATURE = 1.0
TOP_P = 0.95
TOP_K = 50
REPETITION_PENALTY = 1.1
DO_SAMPLE = True
# =============================================================================
import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------
def load_tokenizer(path: str):
    p = Path(path)
    if p.is_file():
        # A local tokenizer.json file, as used during training
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # A local directory or a Hub repo id
        tok = AutoTokenizer.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------
def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt
    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k
    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
```
## Citation
```bibtex
@misc{stopaskingquestionsmini-656k,
title = {StopAskingQuestionsMini-656k: Questions with No Answers},
author = {Harley-ml},
year = {2026},
url = {https://huggingface.co/Harley-ml/StopAskingQuestionsMini-656k}
}
``` |