the-hf-stack/dagster-hf-datasets-examples / multi_modal_data_profiling /README.md

AINovice2005

17 days ago

	# Multimodal Dataset Profiling

	Analyze an image-text dataset and generate automated quality insights:
	resolution statistics, caption analysis, thumbnail gallery, and a
	composite health score.

	## What this example shows

	- Accessing `PIL.Image` objects from a Hub dataset via `dataset["image"]`
	- Computing per-image resolution, aspect ratio, and color mode statistics
	- Analyzing caption token counts and vocabulary size across 5 captions per image
	- Writing a thumbnail gallery to disk with a JSON manifest
	- Combining image and caption signals into a health score with flagging logic
	- Returning plain `dict` results for report-style assets that don't need IO manager persistence
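The caption analysis listed above can be sketched in a few lines. This is a minimal illustration, not the example's actual asset code: the `caption_stats` helper name, the lowercase whitespace tokenization, and the output keys are all assumptions made for this sketch.

```python
from statistics import mean


def caption_stats(captions: list[str]) -> dict:
    """Summarize the captions of one example.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    # Skip blank captions so they do not drag the averages down.
    token_lists = [c.lower().split() for c in captions if c.strip()]
    token_counts = [len(tokens) for tokens in token_lists]
    vocab = {tok for tokens in token_lists for tok in tokens}
    return {
        "num_captions": len(token_lists),
        "avg_tokens": mean(token_counts) if token_counts else 0.0,
        "min_tokens": min(token_counts, default=0),
        "max_tokens": max(token_counts, default=0),
        "vocab_size": len(vocab),
    }


stats = caption_stats(["A dog runs.", "Two dogs play in the park."])
print(stats)
```

Punctuation stays attached to tokens here (`"runs."`), so a real tokenizer would report a slightly different vocabulary size.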

	## Dataset

	[`nlphuji/flickr30k`](https://huggingface.co/datasets/nlphuji/flickr30k): 31,783 Flickr images, each paired with
	5 human-written captions. Images are stored as PIL Image objects in the `image`
	column and captions as lists in the `caption` column.

	> Note: `nlphuji/flickr30k` requires a Hub login. Set `HF_TOKEN`
	> or pass `token=` to `HuggingFaceResource`.

	## Asset graph

	```
	               flickr30k_raw
	            /        |         \
	image_stats    caption_stats    sample_gallery
	        \         /
	   dataset_health_report
	```

	## Key implementation details

	Accessing images: PIL Images are returned directly from dataset iteration:
	```python
	for example in dataset:
	    img = example["image"]  # PIL.Image.Image
	    width, height = img.size
	```

	Thumbnail generation: `PIL.Image.Image.thumbnail()` resizes in place and
	preserves aspect ratio, so copy the image before calling it:
	```python
	img = example["image"].copy() # copy before mutating
	img.thumbnail((128, 128))
	img.save(out_path, format="JPEG")
	```

	Health score formula:
	```
	health_score = 100
	    - (extreme_aspect_pct × 0.30)
	    - (short_caption_pct × 0.40)
	    - (missing_caption_pct × 0.30)
	```
	Captions are weighted most heavily as they are the primary text signal.
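As a sketch, the formula translates directly into a small function. The weights come from the formula above; the function name and the assumption that each input is a percentage on a 0 to 100 scale are illustrative and may differ from the example's code.

```python
def health_score(
    extreme_aspect_pct: float,
    short_caption_pct: float,
    missing_caption_pct: float,
) -> float:
    """Composite health score on a 0-100 scale.

    Assumption: each argument is a percentage of examples (0-100).
    The weights sum to 1.0, so the penalty can never exceed 100.
    """
    penalty = (
        extreme_aspect_pct * 0.30
        + short_caption_pct * 0.40
        + missing_caption_pct * 0.30
    )
    return 100.0 - penalty


# 2% extreme aspect ratios, 5% short captions, 1% missing captions
score = health_score(2.0, 5.0, 1.0)
print(round(score, 1))
```

With those inputs the penalty is 0.6 + 2.0 + 0.3 = 2.9 points, so the score is about 97.1.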

	## Flagged quality issues

	| Issue | Threshold | Weight in score |
	|-------|-----------|-----------------|
	| Extreme aspect ratio | < 0.2 or > 5.0 | 30% |
	| Short captions | avg < 4 tokens | 40% |
	| Missing captions | empty list | 30% |
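The thresholds in the table can be applied per example roughly as follows. This is a hedged sketch: the `flag_example` helper, the flag names, the width/height definition of aspect ratio, and whitespace tokenization are assumptions, not the example's confirmed implementation.

```python
def flag_example(width: int, height: int, captions: list[str]) -> list[str]:
    """Return the quality flags that apply to one image-caption example."""
    flags = []

    # Assumption: aspect ratio is width / height.
    aspect = width / height if height else 0.0
    if aspect < 0.2 or aspect > 5.0:
        flags.append("extreme_aspect_ratio")

    if not captions:
        flags.append("missing_captions")
    else:
        # Assumption: tokens are whitespace-separated words.
        avg_tokens = sum(len(c.split()) for c in captions) / len(captions)
        if avg_tokens < 4:
            flags.append("short_captions")

    return flags


print(flag_example(500, 375, ["A man rides a bike down the street."]))
print(flag_example(1200, 200, []))
print(flag_example(640, 480, ["A dog."]))
```

An example with no captions is flagged only as missing, not also as short, so the two caption penalties never count the same example twice.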

	## Storage layout

	```
	.dagster_hf_storage/
	├── flickr30k_raw/
	├── image_stats/       # Dataset with per-image width/height/aspect/mode
	├── caption_stats/     # Dataset with per-example caption length stats
	└── sample_gallery/    # Written directly by asset (not via IO manager)
	    ├── sample_0000.jpg
	    ├── sample_0001.jpg
	    ...
	    ├── sample_0015.jpg
	    └── manifest.json
	```

	`sample_gallery` and `dataset_health_report` return plain `dict` values
	and are not persisted by the IO manager.

	## How to run

	```bash
	pip install dagster dagster-hf-datasets Pillow
	cd dagster_hf_datasets_examples

	dagster dev -m multi_modal_data_profiling.definitions
	```

	Materialize `flickr30k_raw` first, then `image_stats`, `caption_stats`,
	and `sample_gallery` in parallel, then `dataset_health_report` last.

	---

	## LLaVA-150K: Modern Instruction-Tuned Vision-Language Data

	This example now includes LLaVA-150K alongside Flickr30K, showcasing modern instruction-tuning data for vision-language models.

	### Why LLaVA?

	| Aspect | Flickr30K | LLaVA-150K |
	|--------|-----------|------------|
	| Data Type | Raw image-caption pairs | Instruction-response pairs (Q&A) |
	| Use Case | Image captioning pretraining | Instruction-tuning VLMs |
	| Structure | 5 captions per image | Instruction + response dialogue |
	| Size | 31K images | 150K examples |
	| Modernity | Classic benchmark (2014) | Modern instruction-tuning (2023+) |
	| Model Fit | Generic image understanding | Following visual instructions |

	### LLaVA Asset Graph

	```
	llava_instruct_raw (150K examples, sampled to 5K)
	        |
	llava_instruction_stats (analyze Q&A structure)
	        |
	llava_quality_profile (instruction-tuning quality metrics)
	```

	### Key Assets

	#### 1. `llava_instruct_raw` → `MaterializeResult`

	Ingests [liuhaotian/llava-instruct-150k](https://huggingface.co/datasets/liuhaotian/llava-instruct-150k):
	- 150K image-instruction-response triplets
	- Sampled to 5K for dev (or adjust as needed)
	- Metadata: row count, column names, dataset info

	Columns:
	```python
	{
	    "image": PIL.Image,
	    "conversations": [
	        {"from": "human", "value": "What is in the image?"},
	        {"from": "gpt", "value": "The image shows a dog..."}
	    ]
	}
	```

	#### 2. `llava_instruction_stats` → `Dataset`

	Analyzes instruction-response pair structure:
	- Instruction tokens: How complex are the questions?
	- Response tokens: How detailed are the answers?
	- Is question: Binary flag (ends with `?`)
	- Metrics logged: token distributions, question %, response diversity

	Output per example:
	```json
	{
	    "idx": 123,
	    "instruction_tokens": 12,
	    "response_tokens": 45,
	    "instruction_length": 142,
	    "response_length": 520,
	    "is_question": true
	}
	```
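A row like the one above could be derived from a `conversations` list along these lines. This is only a sketch under stated assumptions: it uses the first `human` and first `gpt` turn, counts whitespace-separated tokens, and treats the `*_length` fields as character counts. The `instruction_stats` helper name is hypothetical.

```python
def instruction_stats(idx: int, conversations: list[dict]) -> dict:
    """Build one stats row from the first human/gpt turn pair."""
    # Assumption: only the first turn from each speaker is analyzed.
    instruction = next(
        (turn["value"] for turn in conversations if turn["from"] == "human"), ""
    )
    response = next(
        (turn["value"] for turn in conversations if turn["from"] == "gpt"), ""
    )
    return {
        "idx": idx,
        "instruction_tokens": len(instruction.split()),
        "response_tokens": len(response.split()),
        "instruction_length": len(instruction),  # characters (assumed)
        "response_length": len(response),        # characters (assumed)
        "is_question": instruction.strip().endswith("?"),
    }


row = instruction_stats(
    0,
    [
        {"from": "human", "value": "What is in the image?"},
        {"from": "gpt", "value": "The image shows a dog..."},
    ],
)
print(row)
```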

	#### 3. `llava_quality_profile` → `dict`

	Computes quality metrics for instruction-tuning:
	```json
	{
	    "total_examples": 5000,
	    "valid_instruction_response_pairs": 4950,
	    "very_short_responses_count": 50,
	    "very_long_responses_count": 120,
	    "balanced_responses": 4830,
	    "balanced_response_pct": 96.6,
	    "question_percentage": 68.5,
	    "instruction_complexity_score": 1.2
	}
	```

	Quality Thresholds:
	- ✅ Balanced response: 5–500 tokens
	- ❌ Very short: < 5 tokens (likely incomplete answers)
	- ❌ Very long: > 500 tokens (possibly rambling or off-topic)
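The thresholds above map onto a simple bucketing pass over response token counts. The sketch below is illustrative: the `quality_profile` function name is hypothetical, and it reproduces only the length-based fields of the profile, using the key names from the JSON sample.

```python
def quality_profile(response_token_counts: list[int]) -> dict:
    """Bucket responses by length using the 5 and 500 token thresholds."""
    total = len(response_token_counts)
    very_short = sum(1 for n in response_token_counts if n < 5)
    very_long = sum(1 for n in response_token_counts if n > 500)
    balanced = total - very_short - very_long  # 5 to 500 tokens inclusive
    return {
        "total_examples": total,
        "very_short_responses_count": very_short,
        "very_long_responses_count": very_long,
        "balanced_responses": balanced,
        "balanced_response_pct": round(100 * balanced / total, 1) if total else 0.0,
    }


profile = quality_profile([3, 5, 45, 500, 501])
print(profile)
```

Plugging in the sample figures (50 very short and 120 very long out of 5,000) leaves 4,830 balanced responses, which matches the 96.6% shown in the profile.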

	### Updated Asset Graph

	```
	               flickr30k_raw                        llava_instruct_raw
	            /        |         \                             |
	image_stats    caption_stats    sample_gallery    llava_instruction_stats
	        \         /                                          |
	   dataset_health_report                           llava_quality_profile

	Both pipelines are available in the Dagster UI
	```

	### Comparing Image-Caption vs. Instruction-Tuned Data

	| Metric | Flickr30K | LLaVA |
	|--------|-----------|-------|
	| Size | 31.8K images | 150K examples |
	| Text type | Captions (descriptive) | Instructions + Responses (interactive) |
	| Avg text length | Caption: ~15 tokens | Instruction: ~12 tokens, Response: ~45 tokens |
	| Primary use | Image description pretraining | Visual QA & instruction-following |

	### Running Both

	```bash
	dagster dev -f definitions.py
	```

	Materialize order:
	1. `flickr30k_raw` + `llava_instruct_raw` (in parallel)
	2. `image_stats` + `caption_stats` + `sample_gallery` + `llava_instruction_stats` (in parallel)
	3. `dataset_health_report` + `llava_quality_profile` (in parallel)

	### Use Cases

	Flickr30K: Image-to-text pretraining, image understanding, captioning models

	LLaVA:
	- Instruction-tuning VLMs (LLaVA, Qwen-VL, etc.)
	- Visual question answering (VQA)
	- Building custom instruction-tuned vision-language models

	### Customization

	Adjust LLaVA sample size:
	```python
	# In llava_instruct_raw()
	dataset = dataset.select(range(min(20000, len(dataset))))  # Larger sample; select() returns a new Dataset
	```

	Add your own instruction-tuned dataset:
	```python
	@hf_dataset_asset(
	    path="your-org/your-vl-dataset",
	    split="train",
	    group_name="multimodal_profiling",
	)
	def custom_instruct_raw(dataset: Dataset) -> MaterializeResult:
	    ...
	```

	Combine both datasets for joint training:
	```python
	@asset(group_name="multimodal_profiling")
	def combined_multimodal_data(
	    flickr30k_raw: Dataset,
	    llava_instruct_raw: Dataset,
	) -> Dataset:
	    """Mix image-caption and instruction-tuned data."""
	    # Convert Flickr30K to instruction format
	    # Concatenate with LLaVA
	    # Return combined dataset
	```
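One possible way to handle the "convert Flickr30K to instruction format" step is to wrap each caption in a single-turn conversation that matches the `conversations` schema shown earlier. This is a sketch only: the `captions_to_conversations` helper and the fixed prompt text are hypothetical, and a real pipeline would likely vary the prompt.

```python
def captions_to_conversations(
    captions: list[str],
    prompt: str = "Describe this image.",  # hypothetical fixed instruction
) -> list[list[dict]]:
    """Turn each non-empty caption into a human/gpt conversation."""
    return [
        [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": caption.strip()},
        ]
        for caption in captions
        if caption.strip()
    ]


conversations = captions_to_conversations(
    ["A dog runs across a field.", "   ", "Two children play soccer."]
)
print(len(conversations))
print(conversations[0])
```

Each Flickr30K image would then yield up to five instruction-response examples, one per caption, which changes the balance between the two sources when they are concatenated.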
