CMGUI Screen-Grounded Summarizer
This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.
The model reads a real mobile screenshot together with OCR/UI elements, bounding boxes, and optional task context. It generates a natural Chinese screen summary and predicts structured UI function entries and key visual evidence ids.
Checkpoint
Source checkpoint:
runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
Export date: 2026-05-19
Architecture:
SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence
This is a custom PyTorch checkpoint, not a plain AutoModel.from_pretrained package. The code snapshot in code/ shows the loader and inference path used by the project.
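Because the checkpoint is a raw state dict rather than a standard package, grouping its parameter names by prefix is a quick way to see which components it holds before wiring up the loader. The sketch below is only an illustration: the function name `count_by_prefix` and the example parameter names are hypothetical, and the real names are whatever the model class in code/ defines.

```python
from collections import Counter
from typing import Iterable


def count_by_prefix(keys: Iterable[str], depth: int = 1) -> Counter:
    """Count parameter names by their first `depth` dotted components."""
    counts = Counter()
    for key in keys:
        prefix = ".".join(key.split(".")[:depth])
        counts[prefix] += 1
    return counts


# Hypothetical parameter names, for illustration only.
example_keys = [
    "vision_encoder.layer0.weight",
    "vision_encoder.layer0.bias",
    "decoder.block0.weight",
    "function_head.weight",
]
summary = count_by_prefix(example_keys)
for prefix, n in sorted(summary.items()):
    print(f"{prefix}: {n} tensors")
```

With PyTorch installed, the keys would typically come from `torch.load("pytorch_model.bin", map_location="cpu").keys()`, assuming the file holds a plain state dict as the file list describes.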
Files
| File | Purpose |
|---|---|
| pytorch_model.bin | Custom model state dict |
| rich_config.json | Model, data, decoding, and structured-head config |
| decoder_tokenizer/ | mT5 tokenizer |
| image_processor/ | SigLIP2 image processor |
| checkpoint_metrics.json | Best-checkpoint validation metrics saved during training |
| checkpoint_manifest.json | Export metadata and recommended runtime settings |
| reports/eval_report_20260512_titlefix_s1e2.md | Full valid/test evaluation report |
| code/ | Loader, CLI inference, and GUI code snapshot |
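A small check like the following can confirm that a downloaded copy of the repository contains every entry listed above before attempting to load it. This is a minimal sketch: the helper name `missing_entries` is hypothetical, and the demo runs against a temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

# Entries taken from the file list in this README.
EXPECTED = [
    "pytorch_model.bin",
    "rich_config.json",
    "decoder_tokenizer",
    "image_processor",
    "checkpoint_metrics.json",
    "checkpoint_manifest.json",
    "reports/eval_report_20260512_titlefix_s1e2.md",
    "code",
]


def missing_entries(root) -> list:
    """Return the expected files or directories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


# Demo on a throwaway directory that contains only two of the entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pytorch_model.bin").touch()
    (Path(tmp) / "code").mkdir()
    demo_missing = missing_entries(tmp)

print("missing:", demo_missing)
```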
Recommended Inference Settings
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
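The settings above can be kept as a plain dictionary and compared against whatever a stored config contains, which makes it easy to see which values need overriding at inference time. The sketch assumes the stored config is a flat dictionary using the same key names; the actual layout of rich_config.json may differ, and `overrides_needed` is a hypothetical helper.

```python
# Recommended values copied from the list above.
RECOMMENDED = {
    "num_beams": 1,
    "max_new_tokens": 384,
    "generation_no_repeat_ngram_size": 3,
    "generation_repetition_penalty": 1.1,
    "generation_block_extra_ids": True,
    "generation_block_title_prefix": True,
    "generation_force_json_start": False,
    "structured_function_mode": "heads",
    "structured_function_threshold": 0.20,
    "structured_search_threshold": 0.20,
    "structured_evidence_mode": "heads",
    "structured_evidence_threshold": 0.50,
    "structured_max_functions": 8,
    "structured_max_evidence": 8,
    "structured_evidence_fallback_top1": False,
    "allow_template_fallback": False,
}


def overrides_needed(stored: dict) -> dict:
    """Map each differing key to (stored_value, recommended_value)."""
    return {
        key: (stored.get(key), value)
        for key, value in RECOMMENDED.items()
        if stored.get(key) != value
    }


# Assumed stored config: identical except for the two thresholds that the
# checkpoint config keeps at 0.50 (see the Evaluation section).
stored = dict(RECOMMENDED, structured_function_threshold=0.50,
              structured_search_threshold=0.50)
changes = overrides_needed(stored)
print(changes)
```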
Evaluation
Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.
| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
|---|---|---|---|---|---|
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
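For readers unfamiliar with the column names, the sketch below shows one common way to compute a character-level ROUGE-L score and set-based precision, recall, and F1. It assumes an F1-style ROUGE-L over characters and exact-match sets of ids; the project's own evaluation script may define these metrics differently, and the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(candidate: str, reference: str) -> float:
    """Character-level ROUGE-L as the harmonic mean of LCS precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def set_prf(predicted, gold):
    """Precision, recall, and F1 between two collections of ids."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(rouge_l_char("搜索商品", "搜索结果"))
print(set_prf({"e1", "e2"}, {"e2", "e3"}))
```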
Generation health checks:
title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
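Rates like these can be computed as the fraction of outputs that satisfy a simple predicate. The sketch below is an assumption about how such checks might look, not the project's implementation: the title prefix string `标题：` is hypothetical, and JSON validity is tested with `json.loads` on whatever structured string the pipeline emits. The `<extra_id_N>` pattern matches the sentinel tokens used by T5-family tokenizers such as mT5.

```python
import json
import re

# Sentinel tokens of T5-family tokenizers look like <extra_id_0>, <extra_id_1>, ...
EXTRA_ID = re.compile(r"<extra_id_\d+>")


def rate(items, predicate) -> float:
    """Fraction of items for which predicate is true (0.0 for an empty input)."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


def has_extra_id(text: str) -> bool:
    return bool(EXTRA_ID.search(text))


def starts_with_title(text: str, prefix: str = "标题：") -> bool:
    # The prefix is a hypothetical example of a leaked title prefix.
    return text.lstrip().startswith(prefix)


def is_valid_json(text) -> bool:
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


summaries = ["页面展示商品列表。", "标题：设置页 <extra_id_0>"]
structured = ['{"functions": []}', "{oops"]
print(rate(summaries, starts_with_title))
print(rate(summaries, has_extra_id))
print(rate(structured, is_valid_json))
```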
Set the function and search thresholds to 0.20 explicitly at inference time. The checkpoint config stores 0.50, but the full threshold sweep found that 0.20 gives the best function-count balance and a higher Function F1.
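To illustrate why the threshold matters, the sketch below selects elements whose head score clears a threshold and caps the result, in the spirit of `structured_function_threshold` and `structured_max_functions`. The scores are made up, and the comparison (`>=`) and ordering are assumptions; the project's heads may apply the threshold differently.

```python
def select_elements(scores: dict, threshold: float, max_items: int) -> list:
    """Return ids scoring at or above threshold, highest first, capped at max_items."""
    kept = [(eid, score) for eid, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [eid for eid, _ in kept[:max_items]]


# Made-up per-element function scores, for illustration only.
scores = {"e1": 0.81, "e2": 0.34, "e3": 0.22, "e4": 0.07}

at_050 = select_elements(scores, threshold=0.50, max_items=8)
at_020 = select_elements(scores, threshold=0.20, max_items=8)
print(at_050, at_020)
```

In this toy case the lower threshold keeps three elements instead of one, which is the kind of change in function count the sweep was balancing.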
Usage Notes
The model is intended for research and demo use on mobile screenshot summarization. It is useful for:
- Chinese mobile screen summarization
- screenshot-grounded UI evidence selection
- function-entry detection from OCR/UI layout
- comparing a deployable student model against Qwen teacher models
Known limitations:
- Page intent can still be confused on ecommerce and search-result pages.
- Function heads under-predict dense icon grids and category/navigation pages.
- Evidence ids are useful but not perfectly localized.
- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.
Loading
Use the project code snapshot in code/ together with the original project data preprocessing format. In the original workspace, batch inference is run from PowerShell (backticks are line continuations) as:
python scripts\infer_rich.py `
--checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
--input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
--batch_size 2 `
--num_beams 1 `
--max_new_tokens 384 `
--structured_function_mode heads `
--structured_function_threshold 0.20 `
--structured_search_threshold 0.20 `
--structured_evidence_mode heads `
--structured_evidence_threshold 0.50 `
--allow_template_fallback false
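The command above uses Windows path separators and PowerShell line continuations. On other platforms, the same argument list could be assembled in Python, as in this sketch. It uses only the flags shown above; the helper name `build_infer_command` is hypothetical, and the script is assumed to accept the same arguments regardless of platform.

```python
import sys
from pathlib import Path


def build_infer_command(checkpoint, input_file, batch_size: int = 2) -> list:
    """Build the argument list for scripts/infer_rich.py with the settings above."""
    return [
        sys.executable, str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
        "--batch_size", str(batch_size),
        "--num_beams", "1",
        "--max_new_tokens", "384",
        "--structured_function_mode", "heads",
        "--structured_function_threshold", f"{0.20:.2f}",
        "--structured_search_threshold", f"{0.20:.2f}",
        "--structured_evidence_mode", "heads",
        "--structured_evidence_threshold", f"{0.50:.2f}",
        "--allow_template_fallback", "false",
    ]


checkpoint = (Path("runs") / "rich_cmgui_20260512_titlefix_s1e2"
              / "stage3_vision_adapter" / "checkpoint-best")
input_file = (Path("data") / "rich_cmgui" / "processed"
              / "test_rich_teacher500_natural_qwen8000.jsonl")
cmd = build_infer_command(checkpoint, input_file)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the workspace root.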
Base model: google/mt5-large