---
language:
- zh
tags:
- image-to-text
- screenshot-understanding
- mobile-ui
- ocr
- siglip2
- mt5
- cmgui
library_name: pytorch
pipeline_tag: image-to-text
base_model:
- google/mt5-large
---
# CMGUI Screen-Grounded Summarizer
This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.
The model reads a real mobile screenshot together with OCR/UI elements, bounding boxes, and optional task context. It generates a natural Chinese screen summary and predicts structured UI function entries and key visual evidence ids.
## Checkpoint
Source checkpoint:
```text
runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
```
Export date: 2026-05-19
Architecture:
```text
SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence
```
This is a custom PyTorch checkpoint and cannot be loaded with a plain `AutoModel.from_pretrained` call. The code snapshot in `code/` contains the loader and inference path used by the project.
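Before wiring up the project loader, it can help to check what the state dict actually contains. The following is a minimal sketch, not the project's loader: it assumes `pytorch_model.bin` is a flat mapping from parameter names to tensors, and the helper names `group_keys_by_prefix` and `load_state_dict_keys` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def group_keys_by_prefix(keys):
    """Count parameter names by their first dotted component."""
    return Counter(key.split(".", 1)[0] for key in keys)


def load_state_dict_keys(checkpoint_dir):
    """Load pytorch_model.bin on CPU and return its parameter names.

    Assumption: the file holds a flat state dict (name -> tensor).
    """
    import torch  # imported lazily so group_keys_by_prefix works without PyTorch

    state_dict = torch.load(
        Path(checkpoint_dir) / "pytorch_model.bin",
        map_location="cpu",   # no GPU needed just to inspect names
        weights_only=True,    # refuse arbitrary pickled objects
    )
    return list(state_dict.keys())


def describe_checkpoint(checkpoint_dir):
    """Return one line per top-level module, largest groups first."""
    groups = group_keys_by_prefix(load_state_dict_keys(checkpoint_dir))
    return [f"{prefix}: {count} tensors" for prefix, count in groups.most_common()]
```

Calling `describe_checkpoint(".")` from the repository root would list the top-level module names, which should correspond to the components in the architecture summary if the assumption about the file layout holds.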
## Files
| File | Purpose |
| --- | --- |
| `pytorch_model.bin` | Custom model state dict |
| `rich_config.json` | Model, data, decoding, and structured-head config |
| `decoder_tokenizer/` | mT5 tokenizer |
| `image_processor/` | SigLIP2 image processor |
| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
| `code/` | Loader, CLI inference, and GUI code snapshot |
## Recommended Inference Settings
```text
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
```
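As a sketch of how these settings might be layered over a loaded config, the snippet below keeps them in one dictionary and reports which values differ from what the config already holds. It assumes `rich_config.json` can be treated as a flat dictionary using the same key names, which may not match the real file; `apply_recommended` is a hypothetical helper, not part of the project code.

```python
# Recommended runtime settings, copied from the list above.
RECOMMENDED = {
    "num_beams": 1,
    "max_new_tokens": 384,
    "generation_no_repeat_ngram_size": 3,
    "generation_repetition_penalty": 1.1,
    "generation_block_extra_ids": True,
    "generation_block_title_prefix": True,
    "generation_force_json_start": False,
    "structured_function_mode": "heads",
    "structured_function_threshold": 0.20,
    "structured_search_threshold": 0.20,
    "structured_evidence_mode": "heads",
    "structured_evidence_threshold": 0.50,
    "structured_max_functions": 8,
    "structured_max_evidence": 8,
    "structured_evidence_fallback_top1": False,
    "allow_template_fallback": False,
}


def apply_recommended(config, overrides=RECOMMENDED):
    """Return (merged, changed) without mutating the input config.

    changed maps each key whose value differs to (old_value, new_value);
    keys absent from the config show up with old_value None.
    """
    merged = dict(config)
    changed = {}
    for key, new_value in overrides.items():
        old_value = merged.get(key)
        if old_value != new_value:
            changed[key] = (old_value, new_value)
        merged[key] = new_value
    return merged, changed
```

Printing `changed` before inference makes it visible when a stored default, such as a `0.50` function threshold, is being overridden.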
## Evaluation
Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.
| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | ---: | ---: | ---: | ---: | ---: |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
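
The exact metric definitions live in the evaluation report and are not reproduced here. As a rough illustration only, the sketch below assumes that "ROUGE-L char" is the LCS-based F-measure computed over characters and that Function F1 is a set-level F1 over predicted entries; the function names are hypothetical and the project's implementation may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(prediction, reference):
    """Character-level ROUGE-L F-measure (equal weight on precision and recall)."""
    if not prediction or not reference:
        return 0.0
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def set_f1(predicted, gold):
    """Set-level F1; returns 0.0 when there is no overlap."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Character-level matching avoids the need for a Chinese word segmenter, which is one common reason to report ROUGE-L this way for Chinese text.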
Generation health checks:
```text
title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
```
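Rates like these can be computed with a small post-hoc pass over the generated outputs. The sketch below is an assumption about what each check means, not the project's evaluation code: the title prefixes are supplied by the caller because the source does not state them, `<extra_id_N>` is the mT5 sentinel-token pattern, and JSON validity is checked on whatever serialized structured output is passed in.

```python
import json
import re

# mT5 sentinel tokens look like <extra_id_0>, <extra_id_1>, ...
EXTRA_ID = re.compile(r"<extra_id_\d+>")


def is_valid_json(text):
    """True if text parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return False
    return True


def health_rates(summaries, structured_outputs, title_prefixes):
    """Fraction of outputs that hit each health check.

    summaries: generated summary strings.
    structured_outputs: serialized structured predictions (assumed JSON strings).
    title_prefixes: hypothetical prefixes to flag, chosen by the caller.
    """
    prefixes = tuple(title_prefixes)
    n_summaries = len(summaries)
    n_structured = len(structured_outputs)
    title_hits = sum(s.lstrip().startswith(prefixes) for s in summaries)
    extra_hits = sum(bool(EXTRA_ID.search(s)) for s in summaries)
    json_ok = sum(is_valid_json(t) for t in structured_outputs)
    return {
        "title_prefix_rate": title_hits / n_summaries if n_summaries else 0.0,
        "extra_id_rate": extra_hits / n_summaries if n_summaries else 0.0,
        "json_valid_rate": json_ok / n_structured if n_structured else 0.0,
    }
```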
Pass the function and search thresholds as `0.20` explicitly. The checkpoint config stores `0.50`, but the full threshold sweep found that `0.20` gives the best function-count balance and a higher Function F1.
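To see why the threshold matters, consider how per-element head scores might be turned into a final list. This is a minimal sketch under the assumption that each head emits one probability per UI element and that selection is "keep everything at or above the threshold, highest first, capped at a maximum count". The actual decoding logic lives in `code/`; `select_elements` and the scores below are made up for illustration.

```python
def select_elements(probabilities, threshold, max_items):
    """Pick element ids whose score clears the threshold.

    probabilities: dict mapping element id -> head probability.
    Returns ids sorted by descending score (ties broken by id),
    truncated to max_items.
    """
    kept = [(score, element_id)
            for element_id, score in probabilities.items()
            if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [element_id for _, element_id in kept[:max_items]]


# Illustrative scores only, not real model output.
scores = {"e1": 0.62, "e2": 0.31, "e3": 0.12, "e4": 0.24}
strict = select_elements(scores, threshold=0.50, max_items=8)   # one element kept
relaxed = select_elements(scores, threshold=0.20, max_items=8)  # three elements kept
```

With scores distributed like this, a `0.50` cutoff keeps a single function entry while `0.20` keeps three, which matches the direction of the sweep result: a lower cutoff recovers more function entries.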
## Usage Notes
The model is intended for research and demo use on mobile screenshot summarization. It is useful for:
- Chinese mobile screen summarization
- screenshot-grounded UI evidence selection
- function-entry detection from OCR/UI layout
- comparing a deployable student model against Qwen teacher models
Known limitations:
- Page intent can still be confused on ecommerce and search-result pages.
- Function heads under-predict entries on dense icon grids and on category/navigation pages.
- Evidence ids are useful but not perfectly localized.
- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.
## Loading
Load the checkpoint with the project code snapshot in `code/`, and supply inputs in the project's original preprocessed data format. In the original workspace, batch inference is run as:
```powershell
python scripts\infer_rich.py `
--checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
--input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
--batch_size 2 `
--num_beams 1 `
--max_new_tokens 384 `
--structured_function_mode heads `
--structured_function_threshold 0.20 `
--structured_search_threshold 0.20 `
--structured_evidence_mode heads `
--structured_evidence_threshold 0.50 `
--allow_template_fallback false
```
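The same invocation can be assembled from Python, which avoids shell-specific line continuations and path separators. This is a sketch that reuses only the flags shown above; `build_infer_command` and `run_inference` are hypothetical helpers, and the script is assumed to be run from the workspace root.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(checkpoint, input_file, python=sys.executable):
    """Build the argument list for scripts/infer_rich.py.

    Values mirror the PowerShell command above; paths are passed through
    str() so pathlib.Path objects work on any OS.
    """
    return [
        python, str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
        "--batch_size", "2",
        "--num_beams", "1",
        "--max_new_tokens", "384",
        "--structured_function_mode", "heads",
        "--structured_function_threshold", "0.20",
        "--structured_search_threshold", "0.20",
        "--structured_evidence_mode", "heads",
        "--structured_evidence_threshold", "0.50",
        "--allow_template_fallback", "false",
    ]


def run_inference(checkpoint, input_file):
    """Run batch inference and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_infer_command(checkpoint, input_file), check=True)
```

Passing the arguments as a list rather than a single string means no quoting is needed for paths that contain spaces.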