---
license: mit
language:
- en
tags:
- tiny
- slm
- tlm
- llm
- small
- question-generator
- harley-ml
- small-language-model
- experiment
- experimental
- text-generation
- question-generation
- questions
- question
---

# StopAskingQuestionsMini-656k
This model is small. Well, that's an understatement. But welcome to the world of tiny language models.
StopAskingQuestionsMini is a 656,000-parameter language model trained on roughly 23 million tokens of questions without answers. That may sound counterintuitive:
> What is the point of generating questions with no answer?

There is no practical reason for doing so. However, this model wasn't built for practical use; it was built to chip away at the question I keep coming back to:
> How much intellect can you stuff into a tiny model before it collapses?

No single project of ours truly answers this, because every day there is a new advancement. For example, DeepSeek created [Engram](https://arxiv.org/pdf/2601.07372), a novel architecture component that increases knowledge storage at very low compute cost.

Which raises the next question:

> What can this model even do?

Not much. It can generate partially coherent questions, and that's pretty much it.

## Architecture

StopAskingQuestionsMini uses a scaled-down version of the [Qwen3](https://arxiv.org/abs/2505.09388) architecture.

| | Parameter | Value | |
| |-----------|-------| |
| | Hidden Layers | 2 | |
| | Hidden Size | 128 | |
| | Attention Heads | 2 | |
| | KV Heads | 2 | |
| | Intermediate Size | 512 | |
| | RoPE Theta | 10000.0 | |
| | Max Position Embeddings | 96 | |
| | Tie Word Embeddings | True | |
| | Vocab Size | 1024 | |
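
For reference, here is a minimal sketch of this configuration using the Hugging Face `Qwen3Config`. The explicit `head_dim` of 64 is an assumption (hidden size divided by attention heads), but with it the parameter count lands right around 656K:

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=1024,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
    head_dim=64,                  # assumed: hidden_size / num_attention_heads
    intermediate_size=512,
    max_position_embeddings=96,
    rope_theta=10000.0,
    tie_word_embeddings=True,
)

model = Qwen3ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~656K parameters
```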

## Training

StopAskingQuestionsMini was trained on 23 million tokens of questions for two epochs with a batch size of 16.
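
Beyond the epoch count and batch size, the training recipe isn't published; the sketch below is only a plausible reconstruction with the Hugging Face `Trainer`, where the learning rate, the stand-in corpus, and everything else are assumptions:

```python
# Hypothetical training sketch: epochs and batch size come from this card,
# the rest (optimizer defaults, learning rate, stand-in data) is assumed.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("harley-ml/StopAskingQuestionsMini-656k")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumes an EOS token exists

corpus = Dataset.from_dict({"text": ["Question: Why do we ask questions?"]})  # stand-in
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=96),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="saqm-656k",
    num_train_epochs=2,              # from this card
    per_device_train_batch_size=16,  # from this card
    learning_rate=3e-4,              # assumed
    report_to="none",
)

Trainer(
    model=model,                     # the Qwen3ForCausalLM sketched above
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```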

### Training Results

| | Epoch | Train Loss | Eval Loss | Train PPL | Eval PPL | |
| |-------|------------|-----------|-----------|----------| |
| | 0.07 | 4.0797 | 3.0011 | 59.05 | 20.11 | |
| | 0.22 | 2.6331 | 2.5703 | 13.92 | 13.07 | |
| | 0.37 | 2.4906 | 2.4586 | 12.07 | 11.68 | |
| | 0.52 | 2.4213 | 2.3989 | 11.26 | 11.01 | |
| | 0.66 | 2.3700 | 2.3552 | 10.70 | 10.54 | |
| | 0.81 | 2.3375 | 2.3242 | 10.35 | 10.22 | |
| | 0.96 | 2.3094 | 2.2949 | 10.07 | 9.92 | |
| | 1.11 | 2.2720 | 2.2746 | 9.70 | 9.72 | |
| | 1.26 | 2.2527 | 2.2533 | 9.51 | 9.52 | |
| | 1.40 | 2.2345 | 2.2367 | 9.34 | 9.36 | |
| | 1.55 | 2.2239 | 2.2212 | 9.24 | 9.22 | |
| | 1.70 | 2.2043 | 2.2044 | 9.06 | 9.06 | |
| | 1.85 | 2.1885 | 2.1930 | 8.92 | 8.96 | |
| | 1.99 | 2.1843 | 2.1854 | 8.88 | 8.90 | |

## Benchmarks

We benchmarked our model against GPT-2, SmolLM2-135M, and Qwen3-0.6B-Base on a question-generation task:

| | Model | Params | Avg Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent | |
| |-------|--------|-----------|----------|-----------------|--------------------|------------| |
| | **StopAskingQuestionsMini** (this) | 656K | 0.4395 | 42 | 60 | 37 | 161 | |
| | GPT-2 | 117M | 0.3874 | 16 | 50 | 49 | 185 | |
| | SmolLM2-135M | 135M | 0.5193 | 36 | 98 | 40 | 111 | |
| | Qwen3-0.6B-Base | 600M | 0.7359 | 165 | 79 | 16 | 40 | |

Each model generated roughly 300 continuations of the prefix `Question:`, and [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) scored each one on a decimal scale from 0.0 to 1.0.
Our model produced the second-highest number of coherent questions, with fewer parameters than most character-level RNNs.
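
The exact judging prompt wasn't published, so the following is only a sketch of what the LLM-as-judge scoring step could look like; the prompt wording and the parsing of Qwen3's thinking output are assumptions:

```python
# Hypothetical LLM-as-judge scoring loop; prompt and parsing are illustrative.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen3-32B", device_map="auto")

def score_question(question: str) -> float:
    messages = [{
        "role": "user",
        "content": (
            "Grade the following question for coherence on a scale from "
            "0.0 to 1.0. Reply with only the number.\n\n" + question
        ),
    }]
    reply = judge(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]
    # Qwen3 may emit a <think>...</think> block first; keep what follows it.
    return float(reply.rsplit("</think>", 1)[-1].strip())

print(score_question("Question: What do foreigners do?"))
```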

## Generations

Prompt: **`Question:`**

Generation 1:
```text
what legal reforms faced rafer leadership during ww1?
```

Generation 2:
```text
How many emissions should a frather?
```

Generation 3:
```text
What do foreigners do?
```

Generation 4:
```text
What is the best appropriate way to learn Japanese?
```

Generation 5:
```text
How much is the MDU and JavaScript to the new UK?
```

## Use Cases

As stated earlier, there is no practical use case, but here are some interesting ideas:

1. Test model for pipelines, code, and training (see the sketch below)
2. Educational research on language models
3. Experimentation on constrained hardware
4. Or, more simply, for fun
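
For the first idea, a minimal smoke test might look like this, assuming the repo loads cleanly through the standard `pipeline` API:

```python
# CI smoke test: a 656K-parameter model keeps this fast even on CPU.
from transformers import pipeline

gen = pipeline("text-generation", model="harley-ml/StopAskingQuestionsMini-656k")
out = gen("Question:", max_new_tokens=16, do_sample=True)

assert out[0]["generated_text"].startswith("Question:")
print("smoke test passed")
```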

## Limitations

Everything.
But more specifically:

1. Cannot generate sentences, paragraphs, code, or anything other than questions
2. Cannot reason
3. Short context (96 tokens)
4. Frequently incoherent output

## Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR = "harley-ml/StopAskingQuestionsMini-656k"       # Hub repo id or local path
TOKENIZER_PATH = "harley-ml/StopAskingQuestionsMini-656k"  # repo id, directory, or tokenizer.json

# --- Generation settings ---
PROMPT = "Question:"
MAX_NEW_TOKENS = 96
TEMPERATURE = 1.0
TOP_P = 0.95
TOP_K = 50
REPETITION_PENALTY = 1.1
DO_SAMPLE = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------

def load_tokenizer(path: str):
    p = Path(path)
    if p.is_file():
        # A raw tokenizer.json file, as used during training
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # A Hub repo id or a local directory
        tok = AutoTokenizer.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
```
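
To sample several questions in one run, you can call the helper in a loop:

```python
# Draw a handful of samples with the generate() helper defined above.
for i in range(5):
    print(f"Generation {i + 1}: {generate('Question:')}")
```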

## Citation

```bibtex
@misc{stopaskingquestionsmini-656k,
    title = {StopAskingQuestionsMini-656k: Questions with No Answers},
    author = {Harley-ml},
    year = {2026},
    url = {https://huggingface.co/Harley-ml/StopAskingQuestionsMini-656k}
}
```