noticecheck / docs /model_experiment_notes.md

Abid Ali Awan

Tighten Qwen3.5 safety assessment prompt

3469346 20 days ago


	# Qwen3.5 4B Q8 MTP model experiment notes

	## Experiment

	The production experiment serves `unsloth/Qwen3.5-4B-MTP-GGUF` with
	`Qwen3.5-4B-Q8_0.gguf`, `mmproj-F16.gguf`, and a pinned CUDA-enabled
	`llama.cpp` build. Modal supplies one L4 GPU and exposes a private,
	proxy-authenticated OpenAI-compatible endpoint.

	The server enables model-native speculative decoding:

	```text
	--spec-type draft-mtp --spec-draft-n-max 2
	```
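As a rough illustration of how these flags might sit alongside the model files named above, the sketch below assembles a `llama-server` argument list. The file paths and the `build_server_command` helper are hypothetical, and the pinned build may accept or require other options; only `--spec-type draft-mtp --spec-draft-n-max 2` is taken from these notes.

```python
import shlex

# Hypothetical container paths; the notes name the files but not where they live.
MODEL_PATH = "/models/Qwen3.5-4B-Q8_0.gguf"
MMPROJ_PATH = "/models/mmproj-F16.gguf"


def build_server_command(model_path: str, mmproj_path: str, draft_n_max: int = 2) -> list[str]:
    """Assemble an argv list for llama-server with MTP speculative decoding."""
    return [
        "llama-server",
        "--model", model_path,      # main GGUF weights
        "--mmproj", mmproj_path,    # multimodal projector for screenshots
        "--spec-type", "draft-mtp",             # flag quoted in the notes
        "--spec-draft-n-max", str(draft_n_max),  # flag quoted in the notes
    ]


command = build_server_command(MODEL_PATH, MMPROJ_PATH)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess`.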

	## Measured results

	The in-container MTP smoke test generated 440 draft tokens, accepted 222, and
	reported a 50.5% acceptance rate. The projector-enabled endpoint also read the
	courier screenshot successfully.
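The acceptance rate follows directly from the two token counts. A minimal check, where the `acceptance_rate` helper is illustrative and not part of the project:

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Return the share of drafted tokens that were accepted, in percent."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return 100.0 * accepted / drafted


# Counts reported by the in-container smoke test.
rate = acceptance_rate(accepted=222, drafted=440)
print(f"{rate:.1f}%")  # prints 50.5%
```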

	The original ten-case evaluation produced:

	| Measurement | Result |
	| --- | --- |
	| Strict passes | 9/10 |
	| Average judge score | 89.5/100 |
	| High-risk scam cases | All passed |
	| Screenshot cases | Both passed |
	| Mean case time | 9.46 seconds |
	| Median case time | 6.17 seconds |

	The only strict failure was a harmless appointment reminder. The model selected
	the correct `Looks normal` label but described the reminder as irrelevant
	input. The production system prompt now explicitly states that appointment
	reminders, shipment updates, bills, and alerts must be assessed as notices.

	Case time includes both candidate generation and the independent judge request,
	so it is not a standalone inference benchmark.

	After adding explicit risk-label thresholds, evidence rules, and bounded output
	lengths, the same evaluation passed 10/10 cases with an average judge score of
	100/100. Excluding the first cold-start case, the nine warm candidate-plus-judge
	cases averaged about 8.5 seconds. The first case took about 95 seconds because
	both Modal endpoints had scaled to zero.
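To see how much a single cold start distorts the average, the sketch below combines the two approximate figures quoted above. The result is arithmetic on rounded numbers, not a separately measured value.

```python
# Approximate figures from the rerun described above.
cold_start_seconds = 95.0   # first case, both endpoints scaled to zero
warm_mean_seconds = 8.5     # average over the remaining warm cases
warm_cases = 9

# Reconstruct total wall time from the warm mean rather than raw per-case data.
total_seconds = cold_start_seconds + warm_mean_seconds * warm_cases
overall_mean_seconds = total_seconds / (warm_cases + 1)

print(f"warm mean:    {warm_mean_seconds:.2f} s")
print(f"overall mean: {overall_mean_seconds:.2f} s")
```

Including the cold start roughly doubles the mean case time, which is why the warm-only figure is the more useful one for comparing prompt changes.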

	## Output contract

	The app requests and validates schema-constrained JSON containing:

	- `risk_label`
	- `simple_explanation`
	- `red_flags`
	- `safe_next_steps`
	- `reply_draft`

	Thinking is disabled for these requests so the completion budget is used for
	the final structured response.
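A minimal sketch of the validation step, using only the standard library. The field names come from the contract above; the value types, the `validate_assessment` helper, and the sample payload are assumptions made for illustration, and the app's actual schema may be stricter.

```python
import json

# Field names are from the output contract; the types are assumed.
EXPECTED_TYPES = {
    "risk_label": str,
    "simple_explanation": str,
    "red_flags": list,
    "safe_next_steps": list,
    "reply_draft": str,
}


def validate_assessment(raw: str) -> dict:
    """Parse a completion and check it against the assumed contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("assessment must be a JSON object")
    missing = sorted(EXPECTED_TYPES.keys() - data.keys())
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field, expected in EXPECTED_TYPES.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be of type {expected.__name__}")
    return data


# Hypothetical completion for a harmless appointment reminder.
sample = json.dumps({
    "risk_label": "Looks normal",
    "simple_explanation": "This is a routine appointment reminder.",
    "red_flags": [],
    "safe_next_steps": ["Confirm the time with the clinic if unsure."],
    "reply_draft": "",
})
assessment = validate_assessment(sample)
print(assessment["risk_label"])
```

Validating on the application side as well as constraining generation means a truncated or malformed completion is rejected instead of being shown to the user.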

	## Product boundary

	The Modal deployment is the application's primary inference backend. The app
	does not fall back to a rule-based assessment when the endpoint fails. A local
	endpoint can replace Modal by setting `MODEL_BASE_URL` and `MODEL_NAME`.
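One way the endpoint selection could look, assuming both variables are required and that a missing value should raise rather than trigger a fallback. The `BackendConfig` type, the `load_backend_config` helper, and the example values are hypothetical.

```python
import os
from typing import Mapping, NamedTuple


class BackendConfig(NamedTuple):
    base_url: str
    model_name: str


def load_backend_config(env: Mapping[str, str] = os.environ) -> BackendConfig:
    """Read the inference endpoint from MODEL_BASE_URL and MODEL_NAME."""
    try:
        base_url = env["MODEL_BASE_URL"]
        model_name = env["MODEL_NAME"]
    except KeyError as exc:
        # Fail loudly: there is no rule-based fallback to drop into.
        raise RuntimeError(f"{exc.args[0]} is not set") from exc
    return BackendConfig(base_url.rstrip("/"), model_name)


# Hypothetical values for a llama-server instance running locally.
local = load_backend_config({
    "MODEL_BASE_URL": "http://localhost:8080/v1/",
    "MODEL_NAME": "Qwen3.5-4B-Q8_0",
})
print(local.base_url)
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.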

	## References

	- [Qwen3.5 4B MTP GGUF repository](https://huggingface.co/unsloth/Qwen3.5-4B-MTP-GGUF)
	- [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server)
	- [Modal web servers](https://modal.com/docs/guide/webhooks)