noticecheck

Running on Zero

App Files Files Community

noticecheck / docs /model_experiment_notes.md

Abid Ali Awan

Tighten Qwen3.5 safety assessment prompt

3469346 19 days ago

preview code

Raw

History Blame Contribute Delete

2.48 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Qwen3.5 4B Q8 MTP model experiment notes

Experiment

The production experiment serves unsloth/Qwen3.5-4B-MTP-GGUF with Qwen3.5-4B-Q8_0.gguf, mmproj-F16.gguf, and a pinned CUDA-enabled llama.cpp build. Modal supplies one L4 GPU and exposes a private, proxy-authenticated OpenAI-compatible endpoint.

The server enables model-native speculative decoding:

--spec-type draft-mtp --spec-draft-n-max 2

Measured results

The in-container MTP smoke test generated 440 draft tokens, accepted 222, and reported a 50.5% acceptance rate. The projector-enabled endpoint also read the courier screenshot successfully.

The original ten-case evaluation produced:

Measurement	Result
Strict passes	9/10
Average judge score	89.5/100
High-risk scam cases	All passed
Screenshot cases	Both passed
Mean case time	9.46 seconds
Median case time	6.17 seconds

The only strict failure was a harmless appointment reminder. The model selected the correct Looks normal label, but described it as irrelevant input. The production system prompt now explicitly states that appointment reminders, shipment updates, bills, and alerts must be assessed as notices.

Case time includes both candidate generation and the independent judge request, so it is not a standalone inference benchmark.

After adding explicit risk-label thresholds, evidence rules, and bounded output lengths, the same evaluation passed 10/10 cases with an average judge score of 100/100. Excluding the first cold-start case, the nine warm candidate-plus-judge cases averaged about 8.5 seconds. The first case took about 95 seconds because both Modal endpoints had scaled to zero.

Output contract

The app requests and validates schema-constrained JSON containing:

risk_label
simple_explanation
red_flags
safe_next_steps
reply_draft

Thinking is disabled for these requests so the completion budget is used for the final structured response.

Product boundary

The Modal deployment is the application's primary inference backend. It does not produce a rule-based assessment when the endpoint fails. A local endpoint can replace Modal through MODEL_BASE_URL and MODEL_NAME.

Qwen3.5 4B Q8 MTP model experiment notes

Experiment

Measured results

Output contract

Product boundary

References