---
language:
- zh
tags:
- image-to-text
- screenshot-understanding
- mobile-ui
- ocr
- siglip2
- mt5
- cmgui
library_name: pytorch
pipeline_tag: image-to-text
base_model:
- google/mt5-large
---

# CMGUI Screen-Grounded Summarizer

This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

The model reads a real mobile screenshot together with its OCR/UI elements, their bounding boxes, and optional task context. It generates a natural-language Chinese summary of the screen and predicts structured UI function entries along with the ids of the key visual evidence elements.

## Checkpoint

Source checkpoint:

```text
runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
```

Export date: 2026-05-19

Architecture:

```text
SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence
```

This is a custom PyTorch checkpoint, not a package that `AutoModel.from_pretrained` can load directly. The code snapshot in `code/` contains the loader and inference path used by the project.

## Files

| File | Purpose |
| --- | --- |
| `pytorch_model.bin` | Custom model state dict |
| `rich_config.json` | Model, data, decoding, and structured-head config |
| `decoder_tokenizer/` | mT5 tokenizer |
| `image_processor/` | SigLIP2 image processor |
| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
| `code/` | Loader, CLI inference, and GUI code snapshot |
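
Before running the loader, it can help to confirm that a downloaded copy of the repository is complete. The following is a minimal sketch using only the file names listed in the table; the helper names `missing_entries` and `load_rich_config` are hypothetical, and the keys inside `rich_config.json` are not assumed.

```python
import json
from pathlib import Path

# File and directory names are copied from the Files table.
EXPECTED_FILES = [
    "pytorch_model.bin",
    "rich_config.json",
    "checkpoint_metrics.json",
    "checkpoint_manifest.json",
    "reports/eval_report_20260512_titlefix_s1e2.md",
]
EXPECTED_DIRS = ["decoder_tokenizer", "image_processor", "code"]


def missing_entries(repo_dir):
    """Return the expected files and directories that are absent from repo_dir."""
    root = Path(repo_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


def load_rich_config(repo_dir):
    """Read rich_config.json as a plain dict without assuming any of its keys."""
    with open(Path(repo_dir) / "rich_config.json", encoding="utf-8") as handle:
        return json.load(handle)
```

An empty list from `missing_entries(path)` means every entry in the table was found under `path`.
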

## Recommended Inference Settings

```text
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
```
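
If these settings need to be passed around in Python, the `key=value` block above can be parsed into a typed dictionary. This is a minimal sketch; `parse_settings` is a hypothetical helper and not part of the project code, and how the project itself consumes these values is not shown here.

```python
RECOMMENDED_SETTINGS = """
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
"""


def _coerce(raw):
    """Convert a raw string to bool, int, or float when possible, else keep it."""
    if raw in ("true", "false"):
        return raw == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_settings(text):
    """Parse key=value lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, raw = line.partition("=")
        settings[key] = _coerce(raw)
    return settings


settings = parse_settings(RECOMMENDED_SETTINGS)
```
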

## Evaluation

The full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | ---: | ---: | ---: | ---: | ---: |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |

Generation health checks:

```text
title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
```
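
The exact definitions of these rates live in the evaluation report, not in this card. One plausible reading is sketched below under loud assumptions: `title_prefix_rate` is the share of summaries that start with an unwanted title prefix, `extra_id_rate` is the share that contain an mT5 sentinel token such as `<extra_id_0>`, and `json_valid_rate` is the share of structured outputs that parse as JSON. The function name `health_rates`, its arguments, and the prefix strings are all hypothetical.

```python
import json
import re

# mT5 sentinel tokens have the form <extra_id_N>.
EXTRA_ID_PATTERN = re.compile(r"<extra_id_\d+>")


def _is_valid_json(text):
    """Return True when text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


def health_rates(summaries, structured_outputs, title_prefixes):
    """Compute the three rates over parallel lists of generated strings.

    title_prefixes is a tuple of strings; the real prefixes blocked by the
    project are not documented here, so the caller must supply them.
    """
    if not summaries or not structured_outputs:
        raise ValueError("need at least one summary and one structured output")
    with_prefix = sum(text.startswith(title_prefixes) for text in summaries)
    with_extra_id = sum(bool(EXTRA_ID_PATTERN.search(text)) for text in summaries)
    valid_json = sum(_is_valid_json(text) for text in structured_outputs)
    return {
        "title_prefix_rate": with_prefix / len(summaries),
        "extra_id_rate": with_extra_id / len(summaries),
        "json_valid_rate": valid_json / len(structured_outputs),
    }
```
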

Set the function and search thresholds to `0.20` explicitly at inference time. The checkpoint config stores `0.50`, but the full threshold sweep found that `0.20` gives the best function-count balance and a higher Function F1.
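
To illustrate why a lower threshold can raise F1, the sketch below assumes that the structured heads emit one probability per UI element and that Function F1 is a set-level F1 over predicted and reference element ids. Both are assumptions made for illustration. The helper names `select_by_threshold` and `set_f1` are hypothetical, and the scores are toy values, not model output.

```python
def select_by_threshold(scores, threshold, max_items):
    """Keep element ids scoring at or above threshold, best first, capped at max_items."""
    kept = [(element_id, prob) for element_id, prob in scores.items() if prob >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [element_id for element_id, _ in kept[:max_items]]


def set_f1(predicted, reference):
    """Set-level F1 between predicted and reference ids."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy per-element probabilities and reference labels (illustrative only).
toy_scores = {"e1": 0.62, "e2": 0.35, "e3": 0.21, "e4": 0.08}
toy_reference = {"e1", "e2", "e3"}

strict = select_by_threshold(toy_scores, threshold=0.50, max_items=8)
relaxed = select_by_threshold(toy_scores, threshold=0.20, max_items=8)
strict_f1 = set_f1(strict, toy_reference)
relaxed_f1 = set_f1(relaxed, toy_reference)
```

In this toy case the `0.50` threshold keeps only one of three reference entries, so recall suffers; the `0.20` threshold recovers all three. The cap of 8 mirrors `structured_max_functions=8` from the recommended settings.
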

## Usage Notes

The model is intended for research and demo use on mobile screenshot summarization. It is useful for:

- Chinese mobile screen summarization
- screenshot-grounded UI evidence selection
- function-entry detection from OCR/UI layout
- comparing a deployable student model against Qwen teacher models

Known limitations:

- Page intent can still be confused on ecommerce and search-result pages.
- Function heads under-predict on dense icon grids and category/navigation pages.
- Evidence ids are useful but not perfectly localized: evidence precision is 0.5542 on valid and 0.5831 on test.
- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.

## Loading

Load the checkpoint with the project code snapshot in `code/`, and prepare inputs in the project's original data preprocessing format. In the original workspace, batch inference is run as:

```powershell
python scripts\infer_rich.py `
  --checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
  --input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
  --batch_size 2 `
  --num_beams 1 `
  --max_new_tokens 384 `
  --structured_function_mode heads `
  --structured_function_threshold 0.20 `
  --structured_search_threshold 0.20 `
  --structured_evidence_mode heads `
  --structured_evidence_threshold 0.50 `
  --allow_template_fallback false
```
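
On platforms without PowerShell, the same invocation can be assembled in Python. The flags and paths below are copied from the command above. `build_infer_command` and `run_inference` are hypothetical helper names, and the sketch assumes the script is run from the workspace root.

```python
import subprocess
import sys
from pathlib import Path

# Relative paths from the original workspace, built portably with pathlib.
SCRIPT = Path("scripts") / "infer_rich.py"
CHECKPOINT = (
    Path("runs") / "rich_cmgui_20260512_titlefix_s1e2"
    / "stage3_vision_adapter" / "checkpoint-best"
)
INPUT_FILE = (
    Path("data") / "rich_cmgui" / "processed"
    / "test_rich_teacher500_natural_qwen8000.jsonl"
)


def build_infer_command():
    """Return the argument list equivalent to the PowerShell command."""
    return [
        sys.executable, str(SCRIPT),
        "--checkpoint", str(CHECKPOINT),
        "--input_file", str(INPUT_FILE),
        "--batch_size", "2",
        "--num_beams", "1",
        "--max_new_tokens", "384",
        "--structured_function_mode", "heads",
        "--structured_function_threshold", "0.20",
        "--structured_search_threshold", "0.20",
        "--structured_evidence_mode", "heads",
        "--structured_evidence_threshold", "0.50",
        "--allow_template_fallback", "false",
    ]


def run_inference(workspace):
    """Run batch inference with the workspace root as the working directory."""
    subprocess.run(build_infer_command(), check=True, cwd=workspace)
```

Passing the arguments as a list avoids shell-specific quoting and line-continuation rules.
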