---
language:
  - zh
tags:
  - image-to-text
  - screenshot-understanding
  - mobile-ui
  - ocr
  - siglip2
  - mt5
  - cmgui
library_name: pytorch
pipeline_tag: image-to-text
base_model:
  - google/mt5-large
---

# CMGUI Screen-Grounded Summarizer

This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

The model reads a real mobile screenshot together with its OCR/UI elements, their bounding boxes, and optional task context. It generates a natural-language Chinese summary of the screen and predicts structured UI function entries along with the ids of the key visual evidence elements.

## Checkpoint

Source checkpoint:

```text
runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
```

Export date: 2026-05-19

Architecture:

```text
SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence
```

This is a custom PyTorch checkpoint and cannot be loaded as a plain `AutoModel.from_pretrained` package. The code snapshot in `code/` contains the loader and inference path used by the project.

## Files

| File | Purpose |
| --- | --- |
| `pytorch_model.bin` | Custom model state dict |
| `rich_config.json` | Model, data, decoding, and structured-head config |
| `decoder_tokenizer/` | mT5 tokenizer |
| `image_processor/` | SigLIP2 image processor |
| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
| `code/` | Loader, CLI inference, and GUI code snapshot |
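
As a quick sanity check after downloading, the table above can be turned into a small file check. This is a minimal sketch, not part of the project code: `missing_entries` is a hypothetical helper, and it only assumes the paths listed in the table.

```python
from pathlib import Path

# Entries copied from the file table; a trailing "/" marks a directory.
EXPECTED = [
    "pytorch_model.bin",
    "rich_config.json",
    "decoder_tokenizer/",
    "image_processor/",
    "checkpoint_metrics.json",
    "checkpoint_manifest.json",
    "reports/eval_report_20260512_titlefix_s1e2.md",
    "code/",
]


def missing_entries(root):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = []
    for entry in EXPECTED:
        path = root / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_entries("path/to/download")` returns an empty list when every file and directory from the table is present.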

## Recommended Inference Settings

```text
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
```
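
The settings above can also be kept as a Python mapping and overlaid on whatever a stored config provides. The sketch below assumes the config is a flat dictionary that uses the same key names as the list above; the actual layout of `rich_config.json` may differ, and `apply_recommended` is a hypothetical helper rather than a project function.

```python
# Recommended runtime settings, copied from the list above.
RECOMMENDED = {
    "num_beams": 1,
    "max_new_tokens": 384,
    "generation_no_repeat_ngram_size": 3,
    "generation_repetition_penalty": 1.1,
    "generation_block_extra_ids": True,
    "generation_block_title_prefix": True,
    "generation_force_json_start": False,
    "structured_function_mode": "heads",
    "structured_function_threshold": 0.20,
    "structured_search_threshold": 0.20,
    "structured_evidence_mode": "heads",
    "structured_evidence_threshold": 0.50,
    "structured_max_functions": 8,
    "structured_max_evidence": 8,
    "structured_evidence_fallback_top1": False,
    "allow_template_fallback": False,
}


def apply_recommended(config, overrides=RECOMMENDED):
    """Return a copy of `config` with overrides applied, plus a map of changed keys.

    Assumes `config` is a flat dict keyed like RECOMMENDED (an assumption).
    """
    merged = dict(config)
    changed = {}
    for key, value in overrides.items():
        if merged.get(key) != value:
            # Record (old, new) so silent overrides are visible to the caller.
            changed[key] = (merged.get(key), value)
        merged[key] = value
    return merged, changed
```

Returning the changed keys makes it easy to log which stored values were replaced at runtime.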

## Evaluation

Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | ---: | ---: | ---: | ---: | ---: |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
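
For readers unfamiliar with the "ROUGE-L char" column, the sketch below shows one common way to compute a character-level ROUGE-L score from the longest common subsequence (LCS). It is an illustration under assumptions: the project's exact variant (F-measure weighting, text normalization) is not documented here, and this version uses a plain balanced F1 over raw characters.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(prediction, reference):
    """Character-level ROUGE-L as a balanced F1 (assumed variant)."""
    if not prediction or not reference:
        return 0.0
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Scoring at the character level avoids the need for a Chinese word segmenter, which is why character-based variants are common for Chinese text.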

Generation health checks:

```text
title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
```
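
The three rates above could be computed along the following lines. This is a sketch under assumptions, not the project's evaluation code: it treats `extra_id_rate` as the share of summaries containing an mT5 sentinel token such as `<extra_id_0>`, `json_valid_rate` as the share of structured outputs that parse as JSON, and `title_prefix_rate` as the share of summaries starting with one of the caller-supplied prefixes. The actual prefix strings are not documented here, so they are a parameter.

```python
import json
import re

# mT5 sentinel tokens have the form <extra_id_N>.
EXTRA_ID = re.compile(r"<extra_id_\d+>")


def rate(flags):
    """Fraction of truthy values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def is_valid_json(text):
    """True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def health_checks(summaries, structured_outputs, title_prefixes):
    """Compute the three generation health rates (assumed definitions)."""
    prefixes = tuple(title_prefixes)
    return {
        "title_prefix_rate": rate(s.lstrip().startswith(prefixes) for s in summaries),
        "extra_id_rate": rate(bool(EXTRA_ID.search(s)) for s in summaries),
        "json_valid_rate": rate(is_valid_json(t) for t in structured_outputs),
    }
```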

The `0.20` function/search threshold matters. The checkpoint config stores `0.50`, so set `0.20` explicitly at inference time: the full threshold sweep found that `0.20` gives the best function-count balance and a higher Function F1.
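
To make the effect of the threshold concrete, here is a toy sketch of threshold-based selection over per-element head probabilities. The numbers are invented for illustration and are not project data; `select_ids` and `f1` are hypothetical helpers, and the use of `>=` and of a highest-probability-first cap are assumptions about how the heads are decoded.

```python
def select_ids(probabilities, threshold, max_items):
    """Keep ids with probability >= threshold, highest first, capped at max_items."""
    kept = [(p, i) for i, p in probabilities.items() if p >= threshold]
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in kept[:max_items]]


def f1(predicted, gold):
    """Set-based F1 between predicted and gold ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up probabilities): three true function elements,
# two of which the head scores below 0.50.
probs = {"e1": 0.62, "e2": 0.31, "e3": 0.24, "e4": 0.07}
gold = {"e1", "e2", "e3"}

strict = select_ids(probs, threshold=0.50, max_items=8)  # only "e1" survives
loose = select_ids(probs, threshold=0.20, max_items=8)   # "e1", "e2", "e3" survive
```

In this toy case the stricter threshold drops two true entries and recall falls, which is the kind of under-prediction a lower threshold is meant to counter.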

## Usage Notes

The model is intended for research and demo use on mobile screenshot summarization. It is useful for:

- Chinese mobile screen summarization
- screenshot-grounded UI evidence selection
- function-entry detection from OCR/UI layout
- comparing a deployable student model against Qwen teacher models

Known limitations:

- Page intent can still be confused on ecommerce and search-result pages.
- Function heads under-predict dense icon grids and category/navigation pages.
- Evidence ids are useful but not perfectly localized (evidence precision is about 0.55 to 0.58 on valid/test).
- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.

## Loading

Load the checkpoint with the project code snapshot in `code/`, and supply inputs in the project's original preprocessed data format. In the original workspace, batch inference is run as:

```powershell
python scripts\infer_rich.py `
  --checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
  --input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
  --batch_size 2 `
  --num_beams 1 `
  --max_new_tokens 384 `
  --structured_function_mode heads `
  --structured_function_threshold 0.20 `
  --structured_search_threshold 0.20 `
  --structured_evidence_mode heads `
  --structured_evidence_threshold 0.50 `
  --allow_template_fallback false
```
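
On systems without PowerShell, the same invocation can be assembled in Python so that path separators are handled by `pathlib`. This is a sketch that only reuses the script path and flags shown in the command above; `build_infer_command` is a hypothetical helper, and the script is assumed to accept the same arguments on every platform.

```python
import sys
from pathlib import Path


def build_infer_command(checkpoint, input_file):
    """Build the argument list for scripts/infer_rich.py (flags copied from the command above)."""
    return [
        sys.executable, str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
        "--batch_size", "2",
        "--num_beams", "1",
        "--max_new_tokens", "384",
        "--structured_function_mode", "heads",
        "--structured_function_threshold", "0.20",
        "--structured_search_threshold", "0.20",
        "--structured_evidence_mode", "heads",
        "--structured_evidence_threshold", "0.50",
        "--allow_template_fallback", "false",
    ]


checkpoint = Path("runs") / "rich_cmgui_20260512_titlefix_s1e2" / "stage3_vision_adapter" / "checkpoint-best"
input_file = Path("data") / "rich_cmgui" / "processed" / "test_rich_teacher500_natural_qwen8000.jsonl"
command = build_infer_command(checkpoint, input_file)
# To run it from the project root: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell-specific quoting and line-continuation characters.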