	---
	language:
	- zh
	tags:
	- image-to-text
	- screenshot-understanding
	- mobile-ui
	- ocr
	- siglip2
	- mt5
	- cmgui
	library_name: pytorch
	pipeline_tag: image-to-text
	base_model:
	- google/mt5-large
	---

	# CMGUI Screen-Grounded Summarizer

	This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

	The model takes a real mobile screenshot together with its OCR/UI elements, their bounding boxes, and optional task context. It generates a natural-language Chinese summary of the screen and predicts structured UI function entries plus the ids of the key visual evidence elements.

	## Checkpoint

	Source checkpoint:

	```text
	runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
	```

	Export date: 2026-05-19

	Architecture:

	```text
	SigLIP2 visual encoder
	+ OCR/UI/layout element memory
	+ task/context memory
	+ mT5-large decoder
	+ element-level structured heads for UI functions and evidence
	```

	This is a custom PyTorch checkpoint and cannot be loaded with a plain `AutoModel.from_pretrained` call. The code snapshot in `code/` contains the loader and inference path used by the project.
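
Because the export is a raw state dict plus a JSON config, a first inspection does not need the project's model class. The following is a minimal sketch, assuming the file names listed under Files. `summarize_state_dict_keys` and `load_export` are hypothetical helper names that are not part of the `code/` snapshot, and the top-level parameter prefixes depend on the project's own module names.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_state_dict_keys(keys):
    """Count parameters per top-level module prefix (text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


def load_export(export_dir):
    """Load the config and raw weights from an export directory.

    Hypothetical helper: it returns plain data and does not build the model,
    because the model class lives in the project's own code.
    """
    import torch  # imported lazily so the key summary works without PyTorch

    export_dir = Path(export_dir)
    with open(export_dir / "rich_config.json", encoding="utf-8") as f:
        config = json.load(f)
    # weights_only=True restricts unpickling to tensors and plain containers.
    state_dict = torch.load(
        export_dir / "pytorch_model.bin", map_location="cpu", weights_only=True
    )
    return config, state_dict
```

Calling `summarize_state_dict_keys(state_dict.keys())` on the loaded weights gives a quick view of which submodules the file contains before wiring it to the loader in `code/`.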

	## Files

	| File | Purpose |
	| --- | --- |
	| `pytorch_model.bin` | Custom model state dict |
	| `rich_config.json` | Model, data, decoding, and structured-head config |
	| `decoder_tokenizer/` | mT5 tokenizer |
	| `image_processor/` | SigLIP2 image processor |
	| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
	| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
	| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
	| `code/` | Loader, CLI inference, and GUI code snapshot |

	## Recommended Inference Settings

	```text
	num_beams=1
	max_new_tokens=384
	generation_no_repeat_ngram_size=3
	generation_repetition_penalty=1.1
	generation_block_extra_ids=true
	generation_block_title_prefix=true
	generation_force_json_start=false
	structured_function_mode=heads
	structured_function_threshold=0.20
	structured_search_threshold=0.20
	structured_evidence_mode=heads
	structured_evidence_threshold=0.50
	structured_max_functions=8
	structured_max_evidence=8
	structured_evidence_fallback_top1=false
	allow_template_fallback=false
	```

	## Evaluation

	Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

	| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
	| --- | ---: | ---: | ---: | ---: | ---: |
	| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
	| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
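
For readers unfamiliar with the column names, the sketch below shows one common way to compute a character-level ROUGE-L score and set-based precision/F1 over predicted ids. It is an illustration under assumptions, not the project's evaluation code: the exact definitions behind the table (for example the F-measure weighting, or how function entries are matched) are in the evaluation report, and the function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(prediction, reference):
    """ROUGE-L F1 over characters, which suits unsegmented Chinese text."""
    if not prediction or not reference:
        return 0.0
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def set_precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 between two collections of ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

As a small check, `rouge_l_char("打开设置页面", "打开设置")` shares a 4-character subsequence, giving precision 4/6, recall 1, and an F1 of 0.8.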

	Generation health checks:

	```text
	title_prefix_rate=0.0
	extra_id_rate=0.0
	json_valid_rate=1.0
	```
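
Two of these checks can be approximated with a few lines of Python. This is a sketch under assumptions: it treats `extra_id_rate` as the fraction of generated texts that contain an mT5 sentinel token such as `<extra_id_0>`, and `json_valid_rate` as the fraction of serialized structured outputs that parse as JSON. The project's actual definitions may differ, and `title_prefix_rate` is omitted because the prefix pattern is not described here.

```python
import json
import re

# mT5 sentinel tokens look like <extra_id_0>, <extra_id_1>, ...
EXTRA_ID_PATTERN = re.compile(r"<extra_id_\d+>")


def extra_id_rate(texts):
    """Fraction of texts containing at least one sentinel token."""
    if not texts:
        return 0.0
    return sum(bool(EXTRA_ID_PATTERN.search(text)) for text in texts) / len(texts)


def json_valid_rate(serialized_outputs):
    """Fraction of strings that parse as JSON."""
    if not serialized_outputs:
        return 0.0
    valid = 0
    for raw in serialized_outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(serialized_outputs)
```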

	The `0.20` function/search threshold matters. The checkpoint config stores `0.50`, but the full threshold sweep found that `0.20` gives the best function-count balance and a higher Function F1, so the stored value should be overridden at inference time.
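
To make the effect of the threshold concrete, here is a minimal sketch of threshold-plus-cap selection over per-element head scores. It assumes the heads emit one probability per UI element and that entries at or above the threshold are kept, highest score first, up to `structured_max_functions`. The helper name and the scores are invented for illustration; the real decoding logic is in `code/`.

```python
def select_by_threshold(scores, threshold, max_items):
    """Keep element ids scoring at or above the threshold, best first, capped."""
    kept = [(element_id, s) for element_id, s in scores.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [element_id for element_id, _ in kept[:max_items]]


# Invented scores for four UI elements on one screen.
example_scores = {"e1": 0.62, "e2": 0.31, "e3": 0.18, "e4": 0.45}

at_stored_default = select_by_threshold(example_scores, 0.50, 8)  # ["e1"]
at_recommended = select_by_threshold(example_scores, 0.20, 8)     # ["e1", "e4", "e2"]
```

With these toy numbers the stored `0.50` keeps a single entry while `0.20` keeps three, which is the kind of shift in function count that the sweep traded off against precision.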

	## Usage Notes

	The model is intended for research and demo use on mobile screenshot summarization. It is useful for:

	- Chinese mobile screen summarization
	- screenshot-grounded UI evidence selection
	- function-entry detection from OCR/UI layout
	- comparing a deployable student model against Qwen teacher models

	Known limitations:

	- Page intent can still be confused on ecommerce and search-result pages.
	- The function heads under-predict entries on dense icon grids and on category/navigation pages.
	- Evidence ids are useful but imperfectly localized: evidence precision is about 0.55 to 0.58 on the valid/test splits.
	- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.

	## Loading

	Load the checkpoint with the project code snapshot in `code/`, and feed it inputs in the project's original preprocessed data format. In the original workspace, batch inference is run as:

	```powershell
	python scripts\infer_rich.py `
	--checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
	--input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
	--batch_size 2 `
	--num_beams 1 `
	--max_new_tokens 384 `
	--structured_function_mode heads `
	--structured_function_threshold 0.20 `
	--structured_search_threshold 0.20 `
	--structured_evidence_mode heads `
	--structured_evidence_threshold 0.50 `
	--allow_template_fallback false
	```
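
On systems without PowerShell, the same invocation can be assembled in Python. The sketch below is a convenience wrapper and not part of the project: `build_infer_command` and `run_inference` are hypothetical names, and only the flags that appear in the command above are used, since the other recommended settings are not confirmed here as CLI options.

```python
import subprocess
from pathlib import Path

# Flag values copied from the command above; kept as strings so "0.20" is passed verbatim.
RECOMMENDED_FLAGS = {
    "--batch_size": "2",
    "--num_beams": "1",
    "--max_new_tokens": "384",
    "--structured_function_mode": "heads",
    "--structured_function_threshold": "0.20",
    "--structured_search_threshold": "0.20",
    "--structured_evidence_mode": "heads",
    "--structured_evidence_threshold": "0.50",
    "--allow_template_fallback": "false",
}


def build_infer_command(checkpoint, input_file, python="python"):
    """Return the argv list for scripts/infer_rich.py with the recommended flags."""
    cmd = [
        python,
        str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
    ]
    for flag, value in RECOMMENDED_FLAGS.items():
        cmd += [flag, value]
    return cmd


def run_inference(checkpoint, input_file):
    """Run batch inference from the project root; raises if the script fails."""
    subprocess.run(build_infer_command(checkpoint, input_file), check=True)
```

Passing the arguments as a list avoids shell quoting differences between platforms, and building paths with `pathlib` keeps the separators correct on both Windows and POSIX systems.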