Everything is open-sourced: datasets, adapters, and code.
https://huggingface.co/blog/OpenMed/synthvision
Interesting work. The part that stands out isn’t just the cost efficiency, but the discipline in pipeline design.
A lot of teams are still chasing “perfect” datasets with heavy manual annotation, while this approach shows that synthetic data + cross-model validation can already reach production-grade quality when done carefully.
A few takeaways that feel increasingly hard to ignore:
- Synthetic data is no longer the bottleneck if validation is handled properly.
- The real leverage is in data curation pipelines, not raw data collection.
- Smaller models (2–3B parameters) can exceed expectations when trained on clean, consistent signals.
The dual-VLM agreement (~93%) is particularly interesting. It’s a pragmatic way to approximate label reliability without introducing significant human cost.
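To make the idea concrete, here is a minimal sketch of what dual-model agreement filtering could look like. This is my own illustration, not the post's actual pipeline; the function name and the toy labels are assumptions, and the real system presumably compares structured VLM outputs rather than plain strings.

```python
# Hypothetical sketch of dual-VLM agreement filtering: keep only samples
# where two independently prompted VLMs produce the same label, and report
# the agreement rate as a cheap proxy for label reliability.

def agreement_filter(labels_a, labels_b):
    """Return (kept samples, agreement rate) for two label lists."""
    assert len(labels_a) == len(labels_b), "one label per sample from each model"
    agreed = [(i, a) for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a == b]
    rate = len(agreed) / len(labels_a)
    return agreed, rate

# Toy example: the two models agree on 14 of 15 samples (~93%).
labels_a = ["cat"] * 14 + ["dog"]
labels_b = ["cat"] * 15
kept, rate = agreement_filter(labels_a, labels_b)
print(f"kept {len(kept)} samples, agreement {rate:.1%}")
```

The design choice worth noting: disagreements are simply dropped rather than adjudicated, trading a little yield for label quality at essentially zero human cost.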
Also worth noting: achieving this for under $500 challenges a lot of assumptions about “necessary” infrastructure and annotation budgets.
Overall, this feels less like a modeling breakthrough and more like a well-executed data engineering strategy, which, in practice, is where most real gains come from.
Congrats!