---
language:
- zh
tags:
- image-to-text
- screenshot-understanding
- mobile-ui
- ocr
- siglip2
- mt5
- cmgui
library_name: pytorch
pipeline_tag: image-to-text
base_model:
- google/mt5-large
---

# CMGUI Screen-Grounded Summarizer

This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

The model reads a real mobile screenshot together with OCR/UI elements, bounding boxes, and optional task context. It generates a natural Chinese screen summary and predicts structured UI function entries and key visual evidence ids.

## Checkpoint

Source checkpoint:

```text
runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best
```

Export date: 2026-05-19

Architecture:

```text
SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence
```

This is a custom PyTorch checkpoint, not a plain `AutoModel.from_pretrained` package. The code snapshot in `code/` shows the loader and inference path used by the project.

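Because the weights are not a `from_pretrained` package, a quick sanity check before wiring up the project loader is to list which top-level modules the checkpoint contains. The sketch below is a minimal illustration, not the project's loader: it assumes `pytorch_model.bin` is a flat state dict of tensors, and the helper names and the example prefixes in the comments are hypothetical.

```python
from collections import Counter


def count_top_level_prefixes(keys):
    """Count parameter tensors per top-level module prefix.

    The prefix is the text before the first dot in each state-dict key,
    e.g. a key "decoder.block.0.weight" is counted under "decoder".
    (Those names are illustrative; the real key names are not documented here.)
    """
    return Counter(key.split(".", 1)[0] for key in keys)


def load_state_dict_keys(path="pytorch_model.bin"):
    """Load the checkpoint on CPU and return its parameter names.

    Assumes the file holds a plain state dict. torch is imported lazily so
    that count_top_level_prefixes can be used without PyTorch installed.
    """
    import torch

    state = torch.load(path, map_location="cpu", weights_only=True)
    return list(state.keys())


# Example usage (requires PyTorch and the downloaded checkpoint):
#   for prefix, n in count_top_level_prefixes(load_state_dict_keys()).most_common():
#       print(prefix, n)
```

The prefix counts only show which submodules have saved weights. Building the model and loading those weights into it still goes through the loader in `code/`.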
## Files

| File | Purpose |
| --- | --- |
| `pytorch_model.bin` | Custom model state dict |
| `rich_config.json` | Model, data, decoding, and structured-head config |
| `decoder_tokenizer/` | mT5 tokenizer |
| `image_processor/` | SigLIP2 image processor |
| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
| `code/` | Loader, CLI inference, and GUI code snapshot |

## Recommended Inference Settings

```text
num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
```

## Evaluation

Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | ---: | ---: | ---: | ---: | ---: |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |

Generation health checks:

```text
title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
```

The `0.20` function/search threshold is important. The checkpoint config stores `0.50`, but the full threshold sweep found that `0.20` gives the best function-count balance and higher Function F1.

## Usage Notes

The model is intended for research and demo use on mobile screenshot summarization.
It is useful for:

- Chinese mobile screen summarization
- screenshot-grounded UI evidence selection
- function-entry detection from OCR/UI layout
- comparing a deployable student model against Qwen teacher models

Known limitations:

- Page intent can still be confused on ecommerce and search-result pages.
- Function heads under-predict dense icon grids and category/navigation pages.
- Evidence ids are useful but not perfectly localized.
- The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.

## Loading

Use the project code snapshot in `code/` together with the original project data preprocessing format.

In the original workspace, batch inference is run as:

```powershell
python scripts\infer_rich.py `
  --checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
  --input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
  --batch_size 2 `
  --num_beams 1 `
  --max_new_tokens 384 `
  --structured_function_mode heads `
  --structured_function_threshold 0.20 `
  --structured_search_threshold 0.20 `
  --structured_evidence_mode heads `
  --structured_evidence_threshold 0.50 `
  --allow_template_fallback false
```
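For shells other than PowerShell, the same invocation can be assembled in Python. The sketch below is a convenience wrapper, not part of the project code: `build_infer_command` and `RECOMMENDED_FLAGS` are hypothetical names, and the only flags used are the ones shown in the PowerShell command above. It assumes the script is launched from the workspace root.

```python
import shlex
import sys
from pathlib import Path

# Flag values copied from the PowerShell command above. The function/search
# thresholds are passed explicitly because the stored config uses 0.50.
RECOMMENDED_FLAGS = {
    "batch_size": "2",
    "num_beams": "1",
    "max_new_tokens": "384",
    "structured_function_mode": "heads",
    "structured_function_threshold": "0.20",
    "structured_search_threshold": "0.20",
    "structured_evidence_mode": "heads",
    "structured_evidence_threshold": "0.50",
    "allow_template_fallback": "false",
}


def build_infer_command(checkpoint, input_file, overrides=None):
    """Return an argv list for scripts/infer_rich.py (hypothetical helper).

    `overrides` maps flag names (without the leading dashes) to replacement
    values and takes precedence over RECOMMENDED_FLAGS.
    """
    flags = {**RECOMMENDED_FLAGS, **(overrides or {})}
    cmd = [
        sys.executable,
        str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
    ]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_infer_command(
    Path("runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best"),
    Path("data/rich_cmgui/processed/test_rich_teacher500_natural_qwen8000.jsonl"),
)
# Print the command for inspection instead of running it.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the workspace root would launch the same batch inference. Keeping the flags in one dictionary makes it easy to try a different value for a single setting without retyping the rest.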