Update README.md

README.md +231 -3

@@ -1,3 +1,231 @@
----
-license: mit
----

+---
+license: mit
+language:
+- en
+tags:
+- tiny
+- slm
+- tlm
+- llm
+- small
+- question-generator
+- harley-ml
+- small-language-model
+- experiment
+- experimental
+- text-generation
+- question-generation
+- questions
+- question
+---
+# StopAskingQuestionsMini-656k
+This model is small. Well, that's an understatement. But welcome to the world of tiny language models.
+StopAskingQuestionsMini is a 656-thousand-parameter language model trained on roughly 23 million tokens of questions without answers. That may sound counterintuitive:
+> What is the point of generating questions with no answer?
+There is no practical reason for doing so. However, this model wasn't built for practical use; it was built to explore an ongoing question that I am trying to answer:
+> How much intellect can you stuff into a tiny model before it collapses?
+Neither this project nor any of our other projects truly answers this, because new advancements keep arriving. For example, DeepSeek created [Engram](https://arxiv.org/pdf/2601.07372), a novel architecture component that increases knowledge storage at very low compute cost.
+A more immediate question is:
+> What can this model even do?
+Not much. It can generate partially coherent questions, and that's pretty much it.
+## Architecture
+StopAskingQuestionsMini uses a scaled-down version of the [Qwen3](https://arxiv.org/abs/2505.09388) architecture.
+| Parameter | Value |
+|-----------|-------|
+| Hidden Layers | 2 |
+| Hidden Size | 128 |
+| Attention Heads | 2 |
+| KV Heads | 2 |
+| Intermediate Size | 512 |
+| RoPE Theta | 10000.0 |
+| Max Position Embeddings | 96 |
+| Tie Word Embeddings | True |
+| Vocab Size | 1024 |
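As a sanity check on the model's name, the table above can be turned into a rough parameter count. The sketch below assumes the usual Qwen3 block layout (bias-free attention and gated-MLP projections, RMSNorm weights, per-head QK-norm) and a head dimension of hidden size divided by attention heads, i.e. 64. Neither assumption is stated in the table, and the function name is purely illustrative.

```python
def qwen3_param_count(
    vocab_size: int = 1024,
    hidden_size: int = 128,
    num_layers: int = 2,
    num_heads: int = 2,
    num_kv_heads: int = 2,
    intermediate_size: int = 512,
    tie_word_embeddings: bool = True,
) -> int:
    """Estimate the parameter count of a Qwen3-style decoder (no biases)."""
    # Assumption: head_dim is derived from hidden_size, not set separately.
    head_dim = hidden_size // num_heads

    # Attention projections (bias-free).
    q_proj = hidden_size * num_heads * head_dim
    k_proj = hidden_size * num_kv_heads * head_dim
    v_proj = hidden_size * num_kv_heads * head_dim
    o_proj = num_heads * head_dim * hidden_size
    # Assumption: one RMSNorm weight vector each for queries and keys.
    qk_norm = 2 * head_dim
    attention = q_proj + k_proj + v_proj + o_proj + qk_norm

    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size

    # Two RMSNorm weight vectors per layer (pre-attention, pre-MLP).
    layer_norms = 2 * hidden_size

    per_layer = attention + mlp + layer_norms

    embeddings = vocab_size * hidden_size
    # With tied embeddings the output head reuses the embedding matrix.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size

    return embeddings + lm_head + num_layers * per_layer + final_norm


print(f"{qwen3_param_count():,}")
```

Under these assumptions the total comes to 656,256, which is consistent with the "656k" in the model name. Roughly a fifth of that (131,072 parameters) is the tied embedding matrix.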
+## Benchmarks
+We benchmarked our model against GPT-2, SmolLM2-135M, and Qwen3-0.6B-Base on a question generation task:
+| Model | Params | Avg Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent |
+|-------|--------|-----------|----------|-----------------|--------------------|------------|
+| **StopAskingQuestionsMini** (this) | 656K | 0.4395 | 42 | 60 | 37 | 161 |
+| GPT-2 | 117M | 0.3874 | 16 | 50 | 49 | 185 |
+| SmolLM2-135M | 135M | 0.5193 | 36 | 98 | 40 | 111 |
+| Qwen3-0.6B-Base | 600M | 0.7359 | 165 | 79 | 16 | 40 |
+Each model generated three hundred continuations of the prefix `Question:`. [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) scored each one using a decimal grading system (0.0 to 1.0).
+Our model generated the second-highest number of coherent questions (42, behind only Qwen3-0.6B-Base) with fewer parameters than most character-level RNNs.
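The ranking claim can be checked directly from the table. This short sketch hard-codes the Coherent column and divides by the 300 generations per model stated above; the dictionary and variable names are only illustrative and are not part of the benchmark code.

```python
# Number of generations each model produced, as stated in the README.
GENERATIONS_PER_MODEL = 300

# "Coherent" column copied from the benchmark table.
coherent_counts = {
    "StopAskingQuestionsMini": 42,
    "GPT-2": 16,
    "SmolLM2-135M": 36,
    "Qwen3-0.6B-Base": 165,
}

# Fraction of generations graded fully coherent.
coherent_rate = {
    model: count / GENERATIONS_PER_MODEL
    for model, count in coherent_counts.items()
}

# Models ordered from most to least coherent.
ranking = sorted(coherent_rate, key=coherent_rate.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {coherent_rate[model]:.1%} coherent")
```

This only looks at the top grading bucket. The Avg Score column, which weighs all buckets, places SmolLM2-135M ahead of this model.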
+## Use Cases
+Unfortunately, as stated earlier, there is no practical use case, but here are some interesting ideas:
+1. Test model for pipelines, code, and training
+2. Educational research on language models
+3. Experimentation on constrained hardware
+4. Or, more simply, for fun.
+## Limitations
+Everything.
+But more specifically,
+1. Cannot generate sentences, paragraphs, code, or anything other than questions
+2. Cannot reason
+3. Short context
+4. Incoherent
+## Inference
+```python
+# =============================================================================
+#  Inference
+# =============================================================================
+MODEL_DIR      = "harley-ml/StopAskingQuestionsMini-656k"   # Hub repo id or local directory
+TOKENIZER_PATH = "harley-ml/StopAskingQuestionsMini-656k"
+# --- Generation settings ---
+PROMPT             = "Question:"   # prompt
+MAX_NEW_TOKENS     = 96
+TEMPERATURE        = 1.0
+TOP_P              = 0.95
+TOP_K              = 50
+REPETITION_PENALTY = 1.1
+DO_SAMPLE          = True
+# =============================================================================
+import torch
+from pathlib import Path
+from transformers import (
+    AutoModelForCausalLM,
+    PreTrainedTokenizerFast,
+    AddedToken,
+)
+# ---------------------------------------------------------------------------
+# Device
+# ---------------------------------------------------------------------------
+device = (
+    "cuda" if torch.cuda.is_available() else
+    "mps"  if torch.backends.mps.is_available() else
+    "cpu"
+)
+print(f"Device : {device}")
+# ---------------------------------------------------------------------------
+# Tokenizer  (mirrors training setup)
+# ---------------------------------------------------------------------------
+def load_tokenizer(path: str):
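+    p = Path(path)
+    if p.is_file():
+        # Local tokenizer.json file
+        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
+    else:
+        # Otherwise treat the value as a local directory or a Hub repo id
+        # (assumes the repo ships its tokenizer files)
+        tok = PreTrainedTokenizerFast.from_pretrained(path)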
+    specials = {}
+    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
+    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
+    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
+    if tok.pad_token is None:
+        if tok.eos_token is not None:
+            tok.pad_token = tok.eos_token
+        else:
+            specials["pad_token"] = AddedToken("<|pad|>", special=True)
+    if specials:
+        tok.add_special_tokens(specials)
+    tok.padding_side = "left"  # left-pad for batched generation
+    return tok
+print("Loading tokenizer...")
+tokenizer = load_tokenizer(TOKENIZER_PATH)
+print(f"  Vocab size : {tokenizer.vocab_size}")
+print(f"  BOS        : {tokenizer.bos_token!r}")
+print(f"  EOS        : {tokenizer.eos_token!r}")
+print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")
+# ---------------------------------------------------------------------------
+# Model
+# ---------------------------------------------------------------------------
+print(f"\nLoading model from {MODEL_DIR} ...")
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_DIR,
+    dtype=torch.float16 if device == "cuda" else torch.float32,
+    low_cpu_mem_usage=True,
+)
+model.eval()
+model.to(device)
+total_params = sum(p.numel() for p in model.parameters())
+print(f"  Parameters : {total_params:,}")
+# ---------------------------------------------------------------------------
+# Generation helper
+# ---------------------------------------------------------------------------
+def generate(
+    prompt: str             = PROMPT,
+    max_new_tokens: int     = MAX_NEW_TOKENS,
+    temperature: float      = TEMPERATURE,
+    top_p: float            = TOP_P,
+    top_k: int              = TOP_K,
+    repetition_penalty: float = REPETITION_PENALTY,
+    do_sample: bool         = DO_SAMPLE,
+) -> str:
+    bos         = tokenizer.bos_token or ""
+    full_prompt = bos + prompt
+    inputs = tokenizer(
+        full_prompt,
+        return_tensors="pt",
+        add_special_tokens=False,
+    ).to(device)
+    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this
+    gen_kwargs = dict(
+        max_new_tokens     = max_new_tokens,
+        do_sample          = do_sample,
+        repetition_penalty = repetition_penalty,
+        eos_token_id       = tokenizer.eos_token_id,
+        pad_token_id       = tokenizer.pad_token_id,
+    )
+    if do_sample:
+        gen_kwargs["temperature"] = temperature
+        gen_kwargs["top_p"]       = top_p
+        gen_kwargs["top_k"]       = top_k
+    with torch.inference_mode():
+        output_ids = model.generate(**inputs, **gen_kwargs)
+    # Strip the prompt tokens so we only return what was generated
+    prompt_len = inputs["input_ids"].shape[-1]
+    new_ids    = output_ids[0][prompt_len:]
+    return tokenizer.decode(new_ids, skip_special_tokens=True)
+# ---------------------------------------------------------------------------
+# Run
+# ---------------------------------------------------------------------------
+if __name__ == "__main__":
+    print(f"\nPrompt : {PROMPT!r}")
+    print("-" * 60)
+    output = generate(PROMPT)
+    print("Generated:")
+    print(output)
+```