Saidgurbuz committed on
Commit 5502c07 · verified · 1 Parent(s): 4f79cd7

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -219,7 +219,7 @@ print(f"\nTotal: {time.time() - start:.1f}s for {len(batched_inputs)} images")
 
  ## Training
 
- ScreenVLM was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework on IBM's Blue Vela supercomputing cluster with NVIDIA H100 GPUs.
+ ScreenVLM was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework with 16 NVIDIA H100 GPUs.
 
  **Training data**: [ScreenParse](https://huggingface.co/docling-project/screenparse) — 771K web page screenshots with dense annotations across 55 UI element classes, including bounding boxes, semantic labels, text content, interactability flags, and reading order. Annotations were generated through automated DOM extraction, IoU-based filtering, and VLM-based refinement (Qwen3-VL-8B).
 
@@ -228,7 +228,6 @@ ScreenVLM was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM
  - Optimized for **web page screenshots**; performance on mobile or desktop application UIs may vary
  - Coordinate predictions are approximate — fine-grained pixel-level precision is not guaranteed
  - May struggle with very dense or highly dynamic UIs (e.g., complex dashboards with hundreds of elements)
- - Not designed for general image understanding — use [Granite Vision](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800) for general-purpose vision tasks
 
  ## Citation
 