| # Multimodal Dataset Profiling | |
| Analyze an image-text dataset and generate automated quality insights: | |
| resolution statistics, caption analysis, thumbnail gallery, and a | |
| composite health score. | |
| ## What this example shows | |
| - Accessing `PIL.Image` objects from a Hub dataset via `dataset["image"]` | |
| - Computing per-image resolution, aspect ratio, and color mode statistics | |
| - Analyzing caption token counts and vocabulary size across 5 captions per image | |
| - Writing a thumbnail gallery to disk with a JSON manifest | |
| - Combining image and caption signals into a health score with flagging logic | |
| - Returning plain `dict` results for report-style assets that don't need IO manager persistence | |
| ## Dataset | |
| [`nlphuji/flickr30k`](https://huggingface.co/datasets/nlphuji/flickr30k) — 31,783 Flickr images each paired with | |
| 5 human-written captions. Images are available as PIL Image objects in the `image` | |
| column, and captions are stored as lists in the `caption` column. | |
| > **Note:** `nlphuji/flickr30k` requires a Hub login. Set `HF_TOKEN` | |
| > or pass `token=` to `HuggingFaceResource`. | |
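The caption analysis over the `caption` lists described above can be sketched as follows. This is a minimal illustration rather than the example's actual code: it assumes whitespace tokenization and a lowercased vocabulary, and `caption_stats_for_example` and `vocabulary_size` are hypothetical helper names.

```python
from statistics import mean


def caption_stats_for_example(captions: list[str]) -> dict:
    """Per-example caption statistics, using whitespace tokens (an assumption)."""
    token_counts = [len(c.split()) for c in captions]
    return {
        "num_captions": len(captions),
        "avg_tokens": mean(token_counts) if token_counts else 0.0,
        "min_tokens": min(token_counts, default=0),
        "max_tokens": max(token_counts, default=0),
    }


def vocabulary_size(all_captions: list[list[str]]) -> int:
    """Count distinct lowercased tokens across every caption list."""
    vocab: set[str] = set()
    for captions in all_captions:
        for caption in captions:
            vocab.update(caption.lower().split())
    return len(vocab)


example = ["A dog runs on grass", "Brown dog running outside"]
stats = caption_stats_for_example(example)
vocab = vocabulary_size([example])
```

A real tokenizer (for instance the one used by the target model) would give different counts, so the token-based thresholds should be read relative to whichever tokenizer is chosen.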
| ## Asset graph | |
| ``` | |
|               flickr30k_raw | |
|          /          |          \ | |
| image_stats   caption_stats   sample_gallery | |
|       \           / | |
|   dataset_health_report | |
| ``` | |
| ## Key implementation details | |
| **Accessing images:** PIL Images are returned directly from dataset iteration: | |
| ```python | |
| for example in dataset: | |
|     img = example["image"]  # PIL.Image.Image | |
|     width, height = img.size | |
| ``` | |
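Once `img.size` and `img.mode` have been read in a loop like the one above, the per-image and aggregate statistics might be computed along these lines. This is a sketch under assumptions: aspect ratio is taken as width divided by height, and `image_record` and `summarize` are hypothetical names. It works on plain integers and strings so it can run without Pillow.

```python
from collections import Counter
from statistics import mean


def image_record(width: int, height: int, mode: str) -> dict:
    """Per-image stats built from `img.size` and `img.mode`."""
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,  # assumed definition: width / height
        "mode": mode,
    }


def summarize(records: list[dict]) -> dict:
    """Aggregate resolution, aspect ratio, and color mode statistics."""
    return {
        "count": len(records),
        "mean_width": mean(r["width"] for r in records),
        "mean_height": mean(r["height"] for r in records),
        "mean_aspect_ratio": mean(r["aspect_ratio"] for r in records),
        "mode_counts": dict(Counter(r["mode"] for r in records)),
    }


records = [
    image_record(500, 375, "RGB"),
    image_record(375, 500, "RGB"),
    image_record(640, 480, "L"),
]
summary = summarize(records)
```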
| **Thumbnail generation:** `Image.thumbnail()` modifies the image in place and | |
| preserves its aspect ratio: | |
| ```python | |
| img = example["image"].copy() # copy before mutating | |
| img.thumbnail((128, 128)) | |
| img.save(out_path, format="JPEG") | |
| ``` | |
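To make the aspect-ratio behaviour concrete, the size a thumbnail ends up with can be approximated by hand. The helper below is a simplified sketch, not Pillow's own algorithm (Pillow's rounding may differ by a pixel in edge cases), and `thumbnail_size` is a hypothetical name.

```python
def thumbnail_size(width: int, height: int, max_size: tuple[int, int]) -> tuple[int, int]:
    """Approximate the size of an aspect-preserving thumbnail.

    The image is scaled by one factor so it fits inside `max_size`,
    and it is never enlarged (scale is capped at 1.0).
    """
    max_w, max_h = max_size
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = thumbnail_size(500, 375, (128, 128))
portrait = thumbnail_size(375, 500, (128, 128))
already_small = thumbnail_size(100, 50, (128, 128))
```

With a 128x128 bounding box, only the longer side reaches 128 pixels; the shorter side shrinks proportionally.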
| **Health score formula:** | |
| ``` | |
| health_score = 100 | |
| - (extreme_aspect_pct × 0.30) | |
| - (short_caption_pct × 0.40) | |
| - (missing_caption_pct × 0.30) | |
| ``` | |
| The two caption terms together carry 70% of the weight, since captions are the primary text signal. | |
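The formula above translates directly into a small function. This sketch assumes each input is a percentage between 0 and 100; the function name and the sample inputs are illustrative only.

```python
def health_score(
    extreme_aspect_pct: float,
    short_caption_pct: float,
    missing_caption_pct: float,
) -> float:
    """Composite health score: 100 minus the weighted issue percentages."""
    return (
        100.0
        - extreme_aspect_pct * 0.30
        - short_caption_pct * 0.40
        - missing_caption_pct * 0.30
    )


# Illustrative inputs: 2% extreme aspect ratios, 5% short captions, 1% missing.
score = health_score(2.0, 5.0, 1.0)  # 100 - 0.6 - 2.0 - 0.3
```

Because the weights sum to 1.0, the score stays within 0 to 100 whenever the inputs are valid percentages, so no clamping is needed.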
| ## Flagged quality issues | |
| | Issue | Threshold | Weight in score | | |
| |-------|-----------|-----------------| | |
| | Extreme aspect ratio | < 0.2 or > 5.0 | 30% | | |
| | Short captions | avg < 4 tokens | 40% | | |
| | Missing captions | empty list | 30% | | |
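The thresholds in the table could be applied per example roughly as follows. This is a sketch, assuming aspect ratio means width divided by height and that tokens are whitespace-separated words; `flag_example` and `issue_percentages` are hypothetical names.

```python
def flag_example(width: int, height: int, captions: list[str]) -> list[str]:
    """Return the quality flags that apply to one image-caption example."""
    flags = []
    aspect = width / height
    if aspect < 0.2 or aspect > 5.0:
        flags.append("extreme_aspect_ratio")
    if not captions:
        flags.append("missing_captions")
    else:
        avg_tokens = sum(len(c.split()) for c in captions) / len(captions)
        if avg_tokens < 4:
            flags.append("short_captions")
    return flags


def issue_percentages(all_flags: list[list[str]]) -> dict:
    """Share of examples (in percent) carrying each flag."""
    total = len(all_flags)
    names = ("extreme_aspect_ratio", "short_captions", "missing_captions")
    return {
        name: (100.0 * sum(name in f for f in all_flags) / total) if total else 0.0
        for name in names
    }


flags = [
    flag_example(1200, 200, ["A dog"]),
    flag_example(500, 375, []),
    flag_example(500, 375, ["A dog runs on the grass"]),
    flag_example(640, 480, ["Two children play in a park"]),
]
percentages = issue_percentages(flags)
```

The resulting percentages are the inputs the health score formula expects.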
| ## Storage layout | |
| ``` | |
| .dagster_hf_storage/ | |
| ├── flickr30k_raw/ | |
| ├── image_stats/ # Dataset with per-image width/height/aspect/mode | |
| ├── caption_stats/ # Dataset with per-example caption length stats | |
| └── sample_gallery/ # Written directly by asset (not via IO manager) | |
|     ├── sample_0000.jpg | |
|     ├── sample_0001.jpg | |
|     ... | |
|     ├── sample_0015.jpg | |
|     └── manifest.json | |
| ``` | |
| `sample_gallery` and `dataset_health_report` return plain `dict` values | |
| and are not persisted by the IO manager. | |
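Writing the gallery manifest directly from the asset might look like the sketch below. The manifest schema (`count`, `thumbnails`, `file`, `index`) is an assumption made for illustration and is not taken from the example's code; only the file names follow the layout shown above.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(gallery_dir: Path, entries: list[dict]) -> Path:
    """Write a JSON manifest describing the thumbnails in `gallery_dir`."""
    gallery_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = gallery_dir / "manifest.json"
    manifest = {"count": len(entries), "thumbnails": entries}  # hypothetical schema
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


# Sixteen entries, matching sample_0000.jpg through sample_0015.jpg.
entries = [{"file": f"sample_{i:04d}.jpg", "index": i} for i in range(16)]

with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(Path(tmp) / "sample_gallery", entries)
    loaded = json.loads(path.read_text(encoding="utf-8"))
```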
| ## How to run | |
| ```bash | |
| pip install dagster dagster-hf-datasets Pillow | |
| cd dagster_hf_datasets_examples | |
| dagster dev -m multi_modal_data_profiling.definitions | |
| ``` | |
| Materialize `flickr30k_raw` first, then `image_stats`, `caption_stats`, | |
| and `sample_gallery` in parallel, then `dataset_health_report` last. | |
| --- | |
| ## LLaVA-150K: Modern Instruction-Tuned Vision-Language Data | |
| This example now includes **LLaVA-150K** alongside Flickr30K, showcasing modern instruction-tuning data for vision-language models. | |
| ### Why LLaVA? | |
| | Aspect | Flickr30K | LLaVA-150K | | |
| |--------|-----------|-----------| | |
| | **Data Type** | Raw image-caption pairs | Instruction-response pairs (Q&A) | | |
| | **Use Case** | Image captioning pretraining | Instruction-tuning VLMs | | |
| | **Structure** | 5 captions per image | Instruction + response dialogue | | |
| | **Size** | 31.8K images | 150K examples | | |
| | **Modernity** | Classic benchmark (2014) | Modern instruction-tuning (2023+) | | |
| | **Model Fit** | Generic image understanding | Following visual instructions | | |
| ### LLaVA Asset Graph | |
| ``` | |
| llava_instruct_raw (150K examples, sampled to 5K) | |
| | | |
| llava_instruction_stats (analyze Q&A structure) | |
| | | |
| llava_quality_profile (instruction-tuning quality metrics) | |
| ``` | |
| ### Key Assets | |
| #### 1. `llava_instruct_raw` → `MaterializeResult` | |
| Ingests [liuhaotian/llava-instruct-150k](https://huggingface.co/datasets/liuhaotian/llava-instruct-150k): | |
| - 150K image-instruction-response triplets | |
| - Sampled to 5K for dev (or adjust as needed) | |
| - Metadata: row count, column names, dataset info | |
| **Columns**: | |
| ```python | |
| { | |
|     "image": PIL.Image, | |
|     "conversations": [ | |
|         {"from": "human", "value": "What is in the image?"}, | |
|         {"from": "gpt", "value": "The image shows a dog..."} | |
|     ] | |
| } | |
| ``` | |
| #### 2. `llava_instruction_stats` → `Dataset` | |
| Analyzes instruction-response pair structure: | |
| - **Instruction tokens**: How complex are the questions? | |
| - **Response tokens**: How detailed are the answers? | |
| - **Is question**: Binary flag (ends with `?`) | |
| - Metrics logged: token distributions, question %, response diversity | |
| **Output per example**: | |
| ```json | |
| { | |
|     "idx": 123, | |
|     "instruction_tokens": 12, | |
|     "response_tokens": 45, | |
|     "instruction_length": 142, | |
|     "response_length": 520, | |
|     "is_question": true | |
| } | |
| ``` | |
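A record with these fields could be derived from a `conversations` list as sketched below. The assumptions are that only the first `human` and first `gpt` turns are used, that tokens are whitespace-separated words, and that the `*_length` fields count characters; `instruction_stats` is a hypothetical name.

```python
def instruction_stats(idx: int, conversations: list[dict]) -> dict:
    """Build one stats record from the first human/gpt turn pair."""
    instruction = next((t["value"] for t in conversations if t["from"] == "human"), "")
    response = next((t["value"] for t in conversations if t["from"] == "gpt"), "")
    return {
        "idx": idx,
        "instruction_tokens": len(instruction.split()),
        "response_tokens": len(response.split()),
        "instruction_length": len(instruction),  # characters (assumed)
        "response_length": len(response),        # characters (assumed)
        "is_question": instruction.rstrip().endswith("?"),
    }


record = instruction_stats(
    0,
    [
        {"from": "human", "value": "What is in the image?"},
        {"from": "gpt", "value": "The image shows a dog..."},
    ],
)
```

Multi-turn conversations would need a policy for the later turns, such as averaging or emitting one record per turn pair.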
| #### 3. `llava_quality_profile` → `dict` | |
| Computes quality metrics for instruction-tuning: | |
| ```json | |
| { | |
|     "total_examples": 5000, | |
|     "valid_instruction_response_pairs": 4950, | |
|     "very_short_responses_count": 50, | |
|     "very_long_responses_count": 120, | |
|     "balanced_responses": 4830, | |
|     "balanced_response_pct": 96.6, | |
|     "question_percentage": 68.5, | |
|     "instruction_complexity_score": 1.2 | |
| } | |
| ``` | |
| **Quality Thresholds**: | |
| - ✅ **Balanced response**: 5–500 tokens | |
| - ❌ **Very short**: < 5 tokens (likely incomplete answers) | |
| - ❌ **Very long**: > 500 tokens (possibly off-topic or rambling) | |
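Applying these thresholds to the per-example records might look like the following sketch. It assumes a "valid" pair is one where both instruction and response are non-empty, and it leaves out `instruction_complexity_score` because its definition is not given here; `quality_profile` is a hypothetical name.

```python
def quality_profile(stats: list[dict], short_below: int = 5, long_above: int = 500) -> dict:
    """Aggregate per-example stats into instruction-tuning quality metrics."""
    total = len(stats)

    def pct(n: int) -> float:
        return round(100.0 * n / total, 1) if total else 0.0

    valid = sum(
        1 for s in stats if s["instruction_tokens"] > 0 and s["response_tokens"] > 0
    )
    very_short = sum(1 for s in stats if s["response_tokens"] < short_below)
    very_long = sum(1 for s in stats if s["response_tokens"] > long_above)
    balanced = total - very_short - very_long
    questions = sum(1 for s in stats if s["is_question"])
    return {
        "total_examples": total,
        "valid_instruction_response_pairs": valid,  # assumed: both sides non-empty
        "very_short_responses_count": very_short,
        "very_long_responses_count": very_long,
        "balanced_responses": balanced,
        "balanced_response_pct": pct(balanced),
        "question_percentage": pct(questions),
    }


sample_stats = [
    {"instruction_tokens": 12, "response_tokens": 3, "is_question": True},
    {"instruction_tokens": 12, "response_tokens": 45, "is_question": True},
    {"instruction_tokens": 12, "response_tokens": 600, "is_question": False},
    {"instruction_tokens": 0, "response_tokens": 0, "is_question": False},
]
profile = quality_profile(sample_stats)
```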
| ### Updated Asset Graph | |
| ``` | |
|               flickr30k_raw                      llava_instruct_raw | |
|          /          |          \                        | | |
| image_stats   caption_stats   sample_gallery   llava_instruction_stats | |
|       \           /                                     | | |
|   dataset_health_report                        llava_quality_profile | |
| | |
|               Both available in Dagster UI | |
| ``` | |
| ### Comparing Image-Caption vs. Instruction-Tuned Data | |
| | Metric | Flickr30K | LLaVA | | |
| |--------|-----------|-------| | |
| | Dataset size | 31.8K images | 150K examples | | |
| | Text type | Captions (descriptive) | Instructions + Responses (interactive) | | |
| | Avg. text length | ~15 tokens per caption | Instruction: ~12 tokens, Response: ~45 tokens | | |
| | Primary use | Image description pretraining | Visual QA & instruction-following | | |
| ### Running Both | |
| ```bash | |
| dagster dev -f definitions.py | |
| ``` | |
| Materialize order: | |
| 1. `flickr30k_raw` + `llava_instruct_raw` (in parallel) | |
| 2. `image_stats` + `caption_stats` + `sample_gallery` + `llava_instruction_stats` (in parallel) | |
| 3. `dataset_health_report` + `llava_quality_profile` (in parallel) | |
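The ordering follows from the asset dependencies. Dagster resolves this itself, so the sketch below is only an illustration of why the waves fall out this way; it uses the standard-library `graphlib` module, with the dependency map transcribed by hand from the asset graphs in this README. Since `sample_gallery` depends only on `flickr30k_raw`, it is ready as early as the second wave.

```python
from graphlib import TopologicalSorter

# Each asset maps to the set of assets it depends on.
ASSET_DEPS = {
    "image_stats": {"flickr30k_raw"},
    "caption_stats": {"flickr30k_raw"},
    "sample_gallery": {"flickr30k_raw"},
    "dataset_health_report": {"image_stats", "caption_stats"},
    "llava_instruction_stats": {"llava_instruct_raw"},
    "llava_quality_profile": {"llava_instruction_stats"},
}


def materialization_waves(deps: dict) -> list[list[str]]:
    """Group assets into waves that can each be materialized in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = materialization_waves(ASSET_DEPS)
```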
| ### Use Cases | |
| **Flickr30K**: Image-to-text pretraining, image understanding, captioning models | |
| **LLaVA**: | |
| - Instruction-tuning VLMs (LLaVA, Qwen-VL, etc.) | |
| - Visual question answering (VQA) | |
| - Building custom instruction-tuned vision-language models | |
| ### Customization | |
| **Adjust LLaVA sample size**: | |
| ```python | |
| # In llava_instruct_raw(): select() returns a new Dataset, so reassign it | |
| dataset = dataset.select(range(min(20000, len(dataset))))  # Larger sample | |
| ``` | |
| **Add your own instruction-tuned dataset**: | |
| ```python | |
| @hf_dataset_asset( | |
|     path="your-org/your-vl-dataset", | |
|     split="train", | |
|     group_name="multimodal_profiling", | |
| ) | |
| def custom_instruct_raw(dataset: Dataset) -> MaterializeResult: | |
|     ... | |
| ``` | |
| **Combine both datasets for joint training**: | |
| ```python | |
| @asset(group_name="multimodal_profiling") | |
| def combined_multimodal_data( | |
|     flickr30k_raw: Dataset, | |
|     llava_instruct_raw: Dataset, | |
| ) -> Dataset: | |
|     """Mix image-caption and instruction-tuned data.""" | |
|     # Convert Flickr30K to instruction format | |
|     # Concatenate with LLaVA | |
|     # Return combined dataset | |
| ``` |
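The "convert Flickr30K to instruction format" step could be sketched on plain dictionaries as below. The prompt text, the one-record-per-caption policy, and the helper names are all assumptions for illustration. A real implementation would apply the same mapping to the `Dataset` and make sure both datasets share identical features before concatenating them.

```python
def caption_to_conversation(caption: str, prompt: str = "Describe the image.") -> list[dict]:
    """Wrap a caption as a single human/gpt exchange (prompt text is hypothetical)."""
    return [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": caption},
    ]


def flickr_to_instruction_records(example: dict) -> list[dict]:
    """Expand one Flickr30K example into one record per non-empty caption."""
    return [
        {"image": example["image"], "conversations": caption_to_conversation(c)}
        for c in example["caption"]
        if c.strip()
    ]


# A string stands in for the PIL image so the sketch runs without Pillow.
example = {"image": "<image placeholder>", "caption": ["A dog runs.", "", "Two kids play."]}
records = flickr_to_instruction_records(example)
```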