Multimodal Dataset Profiling
Analyze an image-text dataset and generate automated quality insights: resolution statistics, caption analysis, thumbnail gallery, and a composite health score.
What this example shows
- Accessing PIL.Image objects from a Hub dataset via dataset["image"]
- Computing per-image resolution, aspect ratio, and color mode statistics
- Analyzing caption token counts and vocabulary size across 5 captions per image
- Writing a thumbnail gallery to disk with a JSON manifest
- Combining image and caption signals into a health score with flagging logic
- Returning plain dict results for report-style assets that don't need IO manager persistence
Dataset
nlphuji/flickr30k — 31,783 Flickr images each paired with
5 human-written captions. Stored with PIL Image objects in the image
column and caption lists in the caption column.
Note: nlphuji/flickr30k requires a Hub login. Set HF_TOKEN or pass token= to HuggingFaceResource.
Asset graph
           flickr30k_raw
          /             \
image_stats caption_stats sample_gallery
          \             /
       dataset_health_report
Key implementation details
Accessing images: PIL Images are returned directly from dataset iteration:
for example in dataset:
    img = example["image"]  # PIL.Image.Image
    width, height = img.size
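The per-image statistics can be derived from those two attributes alone. The following is a minimal sketch, assuming only that each image exposes size and mode the way PIL.Image.Image does. The helper name image_record and its output keys are hypothetical and not taken from assets.py; stand-in objects replace real images so the sketch runs without downloading the dataset.

```python
from collections import Counter
from types import SimpleNamespace


def image_record(img):
    """Return width, height, aspect ratio, and color mode for one image.

    Works with any object exposing PIL-style `size` and `mode` attributes.
    """
    width, height = img.size
    return {
        "width": width,
        "height": height,
        # Aspect ratio as width / height; guard against zero-height images.
        "aspect_ratio": width / height if height else 0.0,
        "mode": img.mode,
    }


# Stand-ins for PIL images (hypothetical sizes and modes).
images = [
    SimpleNamespace(size=(500, 375), mode="RGB"),
    SimpleNamespace(size=(333, 500), mode="RGB"),
    SimpleNamespace(size=(640, 480), mode="L"),
]
records = [image_record(img) for img in images]
mode_counts = Counter(r["mode"] for r in records)
```

With real data, the same function can be applied to example["image"] inside the loop above.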
Thumbnail generation: PIL.Image.Image.thumbnail() resizes the image in place
and preserves its aspect ratio, so copy the image first:
img = example["image"].copy() # copy before mutating
img.thumbnail((128, 128))
img.save(out_path, format="JPEG")
Health score formula:
health_score = 100
- (extreme_aspect_pct × 0.30)
- (short_caption_pct × 0.40)
- (missing_caption_pct × 0.30)
Captions are weighted most heavily as they are the primary text signal.
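As a sketch of the formula in code, assuming the three inputs are percentages in the range 0 to 100 (the source does not state the scale explicitly) and using a hypothetical function name:

```python
def health_score(extreme_aspect_pct, short_caption_pct, missing_caption_pct):
    """Apply the weighted penalty formula.

    Inputs are assumed to be percentages in [0, 100].
    """
    return (
        100.0
        - extreme_aspect_pct * 0.30
        - short_caption_pct * 0.40
        - missing_caption_pct * 0.30
    )


clean = health_score(0.0, 0.0, 0.0)
noisy = health_score(10.0, 20.0, 5.0)  # penalties of 3.0, 8.0, and 1.5
```

Because the weights sum to 1.0, the score stays between 0 and 100 whenever the inputs do.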
Flagged quality issues
| Issue | Threshold | Weight in score |
|---|---|---|
| Extreme aspect ratio | < 0.2 or > 5.0 | 30% |
| Short captions | avg < 4 tokens | 40% |
| Missing captions | empty list | 30% |
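The thresholds in this table could be turned into the percentages used by the score roughly as follows. This is a sketch under stated assumptions: tokens are counted by whitespace splitting, aspect ratio is width divided by height, an example with no captions counts as missing rather than short, and the function names and tuple layout are hypothetical.

```python
def flag_example(width, height, captions):
    """Return the three quality flags from the table for one example."""
    aspect = width / height if height else 0.0
    missing = len(captions) == 0
    # Average whitespace-token count over the example's captions.
    avg_tokens = (
        sum(len(c.split()) for c in captions) / len(captions) if captions else 0.0
    )
    return {
        "extreme_aspect": aspect < 0.2 or aspect > 5.0,
        # An example with no captions counts as missing, not short.
        "short_caption": (not missing) and avg_tokens < 4,
        "missing_caption": missing,
    }


def flag_percentages(examples):
    """Percentage of examples raising each flag.

    `examples` is a list of (width, height, captions) tuples.
    """
    flags = [flag_example(w, h, caps) for (w, h, caps) in examples]
    total = len(flags) or 1  # avoid dividing by zero on an empty sample
    return {
        name: 100.0 * sum(f[name] for f in flags) / total
        for name in ("extreme_aspect", "short_caption", "missing_caption")
    }


# Hypothetical sample: one clean example and one example per issue.
sample = [
    (500, 375, ["A dog runs across a grassy field."] * 5),
    (1200, 200, ["A wide panorama of a mountain range."] * 5),  # aspect 6.0
    (400, 400, ["Dog.", "A dog."]),  # average of 1.5 tokens
    (640, 480, []),  # no captions
]
percentages = flag_percentages(sample)
```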
Storage layout
.dagster_hf_storage/
├── flickr30k_raw/
├── image_stats/ # Dataset with per-image width/height/aspect/mode
├── caption_stats/ # Dataset with per-example caption length stats
└── sample_gallery/ # Written directly by asset (not via IO manager)
    ├── sample_0000.jpg
    ├── sample_0001.jpg
    ...
    ├── sample_0015.jpg
    └── manifest.json
sample_gallery and dataset_health_report return plain dict values
and are not persisted by the IO manager.
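Writing the manifest directly from the asset might look like the sketch below, which uses only the standard library. The manifest fields (count, samples, index, file, caption) are hypothetical, since the actual schema in assets.py is not shown here, and the image-saving step is omitted. File names follow the sample_NNNN.jpg pattern from the layout above.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(out_dir, captions):
    """Write manifest.json with one gallery entry per caption."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = [
        {"index": i, "file": f"sample_{i:04d}.jpg", "caption": caption}
        for i, caption in enumerate(captions)
    ]
    manifest_path = out_dir / "manifest.json"
    payload = {"count": len(entries), "samples": entries}
    manifest_path.write_text(json.dumps(payload, indent=2))
    return manifest_path


# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(
        Path(tmp) / "sample_gallery", ["A dog runs.", "Two kids play."]
    )
    manifest = json.loads(path.read_text())
```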
How to run
pip install dagster dagster-hf-datasets Pillow
cd dagster_hf_datasets_examples
dagster dev -m multi_modal_data_profiling.definitions
Materialize flickr30k_raw first, then image_stats, caption_stats,
and sample_gallery in parallel, then dataset_health_report last.
LLaVA-150K: Modern Instruction-Tuned Vision-Language Data
This example now includes LLaVA-150K alongside Flickr30K, showcasing modern instruction-tuning data for vision-language models.
Why LLaVA?
| Aspect | Flickr30K | LLaVA-150K |
|---|---|---|
| Data Type | Raw image-caption pairs | Instruction-response pairs (Q&A) |
| Use Case | Image captioning pretraining | Instruction-tuning VLMs |
| Structure | 5 captions per image | Instruction + response dialogue |
| Size | 31.8K images | 150K examples |
| Modernity | Classic benchmark (2014) | Modern instruction-tuning (2023+) |
| Model Fit | Generic image understanding | Following visual instructions |
LLaVA Asset Graph
llava_instruct_raw (150K examples, sampled to 5K)
|
llava_instruction_stats (analyze Q&A structure)
|
llava_quality_profile (instruction-tuning quality metrics)
Key Assets
1. llava_instruct_raw → MaterializeResult
Ingests liuhaotian/llava-instruct-150k:
- 150K image-instruction-response triplets
- Sampled to 5K for dev (or adjust as needed)
- Metadata: row count, column names, dataset info
Columns:
{
    "image": PIL.Image,
    "conversations": [
        {"from": "human", "value": "What is in the image?"},
        {"from": "gpt", "value": "The image shows a dog..."}
    ]
}
2. llava_instruction_stats → Dataset
Analyzes instruction-response pair structure:
- Instruction tokens: How complex are the questions?
- Response tokens: How detailed are the answers?
- Is question: Binary flag (ends with ?)
- Metrics logged: token distributions, question %, response diversity
Output per example:
{
    "idx": 123,
    "instruction_tokens": 12,
    "response_tokens": 45,
    "instruction_length": 142,
    "response_length": 520,
    "is_question": true
}
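A record with these fields could be produced along the following lines. This is a sketch, not the code from assets.py: it assumes whitespace tokenization, character counts for the length fields, and that only the first human turn and first gpt turn are analyzed. The function name is hypothetical.

```python
def instruction_stats(idx, conversations):
    """Compute per-example stats from the first human/gpt turn pair."""
    instruction = next(
        (t["value"] for t in conversations if t["from"] == "human"), ""
    )
    response = next(
        (t["value"] for t in conversations if t["from"] == "gpt"), ""
    )
    return {
        "idx": idx,
        "instruction_tokens": len(instruction.split()),
        "response_tokens": len(response.split()),
        "instruction_length": len(instruction),
        "response_length": len(response),
        # Binary flag: does the instruction end with a question mark?
        "is_question": instruction.strip().endswith("?"),
    }


example = [
    {"from": "human", "value": "What is in the image?"},
    {"from": "gpt", "value": "The image shows a dog..."},
]
stats = instruction_stats(0, example)
```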
3. llava_quality_profile → dict
Computes quality metrics for instruction-tuning:
{
    "total_examples": 5000,
    "valid_instruction_response_pairs": 4950,
    "very_short_responses_count": 50,
    "very_long_responses_count": 120,
    "balanced_responses": 4830,
    "balanced_response_pct": 96.6,
    "question_percentage": 68.5,
    "instruction_complexity_score": 1.2
}
Quality Thresholds:
- ✅ Balanced response: 5–500 tokens
- ❌ Very short: < 5 tokens (incomplete answers)
- ❌ Very long: > 500 tokens (off-topic rambling)
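These thresholds translate into a simple bucketing of response token counts. The sketch below covers only the length-based fields of the profile; how valid_instruction_response_pairs and instruction_complexity_score are computed is not described here, so they are left out. The function name and parameter names are hypothetical.

```python
def quality_profile(response_token_counts, short_below=5, long_above=500):
    """Bucket responses as very short, very long, or balanced (5 to 500 tokens)."""
    counts = list(response_token_counts)
    total = len(counts)
    very_short = sum(1 for n in counts if n < short_below)
    very_long = sum(1 for n in counts if n > long_above)
    balanced = total - very_short - very_long
    return {
        "total_examples": total,
        "very_short_responses_count": very_short,
        "very_long_responses_count": very_long,
        "balanced_responses": balanced,
        "balanced_response_pct": round(100.0 * balanced / total, 1) if total else 0.0,
    }


# Both boundary values (5 and 500) fall in the balanced bucket.
profile = quality_profile([3, 5, 50, 500, 501])
```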
Updated Asset Graph
       flickr30k_raw                 llava_instruct_raw
        /    |    \                          |
       /     |     \                         |
image_stats  caption_stats        llava_instruction_stats
       \     |     /                         |
        \ health_report            llava_quality_profile
         \                        /
        Both available in Dagster UI
Comparing Image-Caption vs. Instruction-Tuned Data
| Metric | Flickr30K | LLaVA |
|---|---|---|
| Dataset size | 31.8K images | 150K examples |
| Text type | Captions (descriptive) | Instructions + Responses (interactive) |
| Average text length | ~15 tokens per caption | Instruction: ~12 tokens, Response: ~45 tokens |
| Primary use | Image description pretraining | Visual QA & instruction-following |
Running Both
dagster dev -f definitions.py
Materialize order:
1. flickr30k_raw + llava_instruct_raw (in parallel)
2. image_stats + caption_stats + llava_instruction_stats (in parallel)
3. sample_gallery + dataset_health_report + llava_quality_profile (in parallel)
Use Cases
Flickr30K: Image-to-text pretraining, image understanding, captioning models
LLaVA:
- Instruction-tuning VLMs (LLaVA, Qwen-VL, etc.)
- Visual question answering (VQA)
- Building custom instruction-tuned vision-language models
Customization
Adjust LLaVA sample size:
# In llava_instruct_raw()
# select() returns a new Dataset, so keep the result
dataset = dataset.select(range(min(20000, len(dataset))))  # Larger sample
Add your own instruction-tuned dataset:
@hf_dataset_asset(
    path="your-org/your-vl-dataset",
    split="train",
    group_name="multimodal_profiling",
)
def custom_instruct_raw(dataset: Dataset) -> MaterializeResult:
    ...
Combine both datasets for joint training:
@asset(group_name="multimodal_profiling")
def combined_multimodal_data(
    flickr30k_raw: Dataset,
    llava_instruct_raw: Dataset,
) -> Dataset:
    """Mix image-caption and instruction-tuned data."""
    # Convert Flickr30K to instruction format
    # Concatenate with LLaVA
    # Return combined dataset
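The conversion step is left open above. One possible approach, sketched here in plain Python, pairs each caption with a fixed instruction to form a two-turn conversation in the LLaVA format. The prompt text and function name are hypothetical, and the sketch expands one Flickr30K record into one conversation per caption rather than picking a single caption.

```python
def captions_to_conversations(captions, prompt="Describe the image."):
    """Turn a list of captions into LLaVA-style two-turn conversations.

    Returns one [human, gpt] conversation per caption; `prompt` is a
    placeholder instruction, not one used by either dataset.
    """
    return [
        [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": caption},
        ]
        for caption in captions
    ]


# Hypothetical Flickr30K-style record with two of its captions.
record = {"caption": ["A dog runs on grass.", "A brown dog is playing outside."]}
converted = captions_to_conversations(record["caption"])
```

Before concatenating the converted records with LLaVA, both datasets would need the same column names and types.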