---
license: mit
language:
- en
tags:
- tiny
- slm
- tlm
- llm
- small
- question-generator
- harley-ml
- small-language-model
- experiment
- experimental
- text-generation
- question-generation
- questions
- question
---

# StopAskingQuestionsMini-656k
This model is small. Well, that's an understatement. But welcome to the world of tiny language models.
StopAskingQuestionsMini is a 656,000-parameter language model trained on roughly 23 million tokens of questions without answers. That may sound counterintuitive:
> What is the point of generating questions with no answer?

There is no practical reason for doing so. However, this model wasn't built for practical use; it was built to chip away at the question I keep coming back to:
> How much intellect can you stuff into a tiny model before it collapses?

No single project of ours truly answers this, because every day there is a new advancement. For example, DeepSeek created [Engram](https://arxiv.org/pdf/2601.07372), a novel architecture component that increases knowledge storage at very low compute cost.

Which raises the next question:

> What can this model even do?

Not much. It can generate partially coherent questions, and that's pretty much it.

## Architecture

StopAskingQuestionsMini uses a scaled-down version of the [Qwen3](https://arxiv.org/abs/2505.09388) architecture.

| | Parameter | Value | |
| |-----------|-------| |
| | Hidden Layers | 2 | |
| | Hidden Size | 128 | |
| | Attention Heads | 2 | |
| | KV Heads | 2 | |
| | Intermediate Size | 512 | |
| | RoPE Theta | 10000.0 | |
| | Max Position Embeddings | 96 | |
| | Tie Word Embeddings | True | |
| | Vocab Size | 1024 | |
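
For reference, here is a minimal sketch of this configuration using the Hugging Face `Qwen3Config`. The explicit `head_dim` of 64 is an assumption (hidden size divided by attention heads), but with it the parameter count lands right around 656K:

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=1024,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
    head_dim=64,                  # assumed: hidden_size / num_attention_heads
    intermediate_size=512,
    max_position_embeddings=96,
    rope_theta=10000.0,
    tie_word_embeddings=True,
)

model = Qwen3ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~656K parameters
```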

## Training

StopAskingQuestionsMini was trained on 23 million tokens of questions for two epochs with a batch size of 16.
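
Beyond the epoch count and batch size, the training recipe isn't published; the sketch below is only a plausible reconstruction with the Hugging Face `Trainer`, where the learning rate, the stand-in corpus, and everything else are assumptions:

```python
# Hypothetical training sketch: epochs and batch size come from this card,
# the rest (optimizer defaults, learning rate, stand-in data) is assumed.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("harley-ml/StopAskingQuestionsMini-656k")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumes an EOS token exists

corpus = Dataset.from_dict({"text": ["Question: Why do we ask questions?"]})  # stand-in
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=96),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="saqm-656k",
    num_train_epochs=2,              # from this card
    per_device_train_batch_size=16,  # from this card
    learning_rate=3e-4,              # assumed
    report_to="none",
)

Trainer(
    model=model,                     # the Qwen3ForCausalLM sketched above
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```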

### Training Results

| | Epoch | Train Loss | Eval Loss | Train PPL | Eval PPL | |
| |-------|------------|-----------|-----------|----------| |
| | 0.07 | 4.0797 | 3.0011 | 59.05 | 20.11 | |
| | 0.22 | 2.6331 | 2.5703 | 13.92 | 13.07 | |
| | 0.37 | 2.4906 | 2.4586 | 12.07 | 11.68 | |
| | 0.52 | 2.4213 | 2.3989 | 11.26 | 11.01 | |
| | 0.66 | 2.3700 | 2.3552 | 10.70 | 10.54 | |
| | 0.81 | 2.3375 | 2.3242 | 10.35 | 10.22 | |
| | 0.96 | 2.3094 | 2.2949 | 10.07 | 9.92 | |
| | 1.11 | 2.2720 | 2.2746 | 9.70 | 9.72 | |
| | 1.26 | 2.2527 | 2.2533 | 9.51 | 9.52 | |
| | 1.40 | 2.2345 | 2.2367 | 9.34 | 9.36 | |
| | 1.55 | 2.2239 | 2.2212 | 9.24 | 9.22 | |
| | 1.70 | 2.2043 | 2.2044 | 9.06 | 9.06 | |
| | 1.85 | 2.1885 | 2.1930 | 8.92 | 8.96 | |
| | 1.99 | 2.1843 | 2.1854 | 8.88 | 8.90 | |

## Benchmarks

We benchmarked our model against GPT-2, SmolLM2-135M, and Qwen3-0.6B-Base on a question-generation task:

| | Model | Params | Avg Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent | |
| |-------|--------|-----------|----------|-----------------|--------------------|------------| |
| | **StopAskingQuestionsMini** (this) | 656K | 0.4395 | 42 | 60 | 37 | 161 | |
| | GPT-2 | 117M | 0.3874 | 16 | 50 | 49 | 185 | |
| | SmolLM2-135M | 135M | 0.5193 | 36 | 98 | 40 | 111 | |
| | Qwen3-0.6B-Base | 600M | 0.7359 | 165 | 79 | 16 | 40 | |

Each model generated roughly 300 continuations of the prefix `Question:`, and [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) scored each one on a decimal scale from 0.0 to 1.0.
Our model produced the second-highest number of coherent questions, with fewer parameters than most character-level RNNs.
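
The exact judging prompt wasn't published, so the following is only a sketch of what the LLM-as-judge scoring step could look like; the prompt wording and the parsing of Qwen3's thinking output are assumptions:

```python
# Hypothetical LLM-as-judge scoring loop; prompt and parsing are illustrative.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen3-32B", device_map="auto")

def score_question(question: str) -> float:
    messages = [{
        "role": "user",
        "content": (
            "Grade the following question for coherence on a scale from "
            "0.0 to 1.0. Reply with only the number.\n\n" + question
        ),
    }]
    reply = judge(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]
    # Qwen3 may emit a <think>...</think> block first; keep what follows it.
    return float(reply.rsplit("</think>", 1)[-1].strip())

print(score_question("Question: What do foreigners do?"))
```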

## Generations

Prompt: **`Question:`**

Generation 1:
```text
what legal reforms faced rafer leadership during ww1?
```

Generation 2:
```text
How many emissions should a frather?
```

Generation 3:
```text
What do foreigners do?
```

Generation 4:
```text
What is the best appropriate way to learn Japanese?
```

Generation 5:
```text
How much is the MDU and JavaScript to the new UK?
```

## Use Cases

As stated earlier, there is no practical use case, but here are some interesting ideas:

1. Test model for pipelines, code, and training (see the sketch below)
2. Educational research on language models
3. Experimentation on constrained hardware
4. Or, more simply, for fun
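
For the first idea, a minimal smoke test might look like this, assuming the repo loads cleanly through the standard `pipeline` API:

```python
# CI smoke test: a 656K-parameter model keeps this fast even on CPU.
from transformers import pipeline

gen = pipeline("text-generation", model="harley-ml/StopAskingQuestionsMini-656k")
out = gen("Question:", max_new_tokens=16, do_sample=True)

assert out[0]["generated_text"].startswith("Question:")
print("smoke test passed")
```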

## Limitations

Everything.
But more specifically:

1. Cannot generate sentences, paragraphs, code, or anything other than questions
2. Cannot reason
3. Short context (96 tokens)
4. Frequently incoherent output

## Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR = "harley-ml/StopAskingQuestionsMini-656k"       # Hub repo id or local path
TOKENIZER_PATH = "harley-ml/StopAskingQuestionsMini-656k"  # repo id, directory, or tokenizer.json

# --- Generation settings ---
PROMPT = "Question:"
MAX_NEW_TOKENS = 96
TEMPERATURE = 1.0
TOP_P = 0.95
TOP_K = 50
REPETITION_PENALTY = 1.1
DO_SAMPLE = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------

def load_tokenizer(path: str):
    p = Path(path)
    if p.is_file():
        # A raw tokenizer.json file, as used during training
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # A Hub repo id or a local directory
        tok = AutoTokenizer.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
```
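
To sample several questions in one run, you can call the helper in a loop:

```python
# Draw a handful of samples with the generate() helper defined above.
for i in range(5):
    print(f"Generation {i + 1}: {generate('Question:')}")
```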

## Citation

```bibtex
@misc{stopaskingquestionsmini-656k,
    title = {StopAskingQuestionsMini-656k: Questions with No Answers},
    author = {Harley-ml},
    year = {2026},
    url = {https://huggingface.co/Harley-ml/StopAskingQuestionsMini-656k}
}
```