---
title: Dowser
emoji: ⏱️
colorFrom: red
colorTo: yellow
sdk: docker
pinned: true
tags:
  - llm
  - language-models
  - training-data
  - dataset-analysis
  - data-selection
  - data-efficiency
  - model-evaluation
  - benchmarking
  - fine-tuning
  - machine-learning
  - deep-learning
  - nlp
  - transformers
  - research
  - mlops
  - model-training
  - evaluation
---

## Problem

AI teams are data-constrained, not model-constrained, and waste millions retraining models on data with little or even negative impact.

They spend most of their budget collecting, processing, and labeling data without knowing what actually improves performance.

The result is repeated failed retraining cycles, wasted GPU runs, and slow iteration, because teams lack insight into which datasets improve the model and which degrade it.

## Solution

Influence-guided training has been shown to halve convergence time. [*Dowser* by Durinn](http://durinn.ai/) tells AI teams which training data improves model performance and which data hurts it, democratizing what the big model providers are already doing.

## Product

[*Dowser*](https://durinn-concept-explorer.azurewebsites.net/) doesn't just recommend data or provide infrastructure: it directly benchmarks models to produce confident influence scores, with sub-**2-minute** cached results and **10–30 minute** fresh evaluations across **100 open-source datasets** on an 8 GB RAM, 2 vCPU host.

## How it works

Teams define a target capability or task → *Dowser* identifies high-impact datasets from [Hugging Face](https://huggingface.co/) and suggests optimized training directions.
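Dowser's internals are not public, but the general influence-scoring idea behind data selection can be sketched in a few lines: score each candidate dataset by how much adding it to the training pool changes validation loss on the target task. The toy "model" (a mean predictor), the dataset names, and the numbers below are all hypothetical, purely to illustrate the mechanic.

```python
# Illustrative sketch of influence-guided data selection (NOT Dowser's
# actual method). A dataset's influence score is the drop in validation
# loss when that dataset is added to the training pool.

def fit_mean(train):
    """Toy 'model': predict the mean of the training targets."""
    return sum(train) / len(train)

def val_loss(model, val):
    """Mean squared error of the constant predictor on validation targets."""
    return sum((y - model) ** 2 for y in val) / len(val)

def influence_scores(base_pool, candidate_datasets, val):
    """Score each candidate by how much it reduces validation loss."""
    base = val_loss(fit_mean(base_pool), val)
    scores = {}
    for name, data in candidate_datasets.items():
        loss = val_loss(fit_mean(base_pool + data), val)
        scores[name] = base - loss  # positive = helpful, negative = harmful
    return scores

# Hypothetical example: the target task's validation labels center near 1.0
base_pool = [0.0, 2.0]
val = [1.0, 1.1, 0.9]
candidates = {
    "clean_set": [1.0, 1.0, 1.0],    # matches the target distribution
    "noisy_set": [10.0, -8.0, 9.0],  # off-distribution, pulls the model away
}
scores = influence_scores(base_pool, candidates, val)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # clean_set ranks above noisy_set
```

In practice the proxy model would be a small LM and the loss a benchmark metric, but the ranking step is the same: keep datasets with positive influence, drop those that degrade the target capability.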

## Why now?

- Training costs are exploding while performance gains are flattening
- Synthetic data is increasingly contaminating training pipelines
- Teams need precision, not more data
- Influence methods are now viable via proxy models and distillation

## Market

- Every company training or fine-tuning LLMs
- 59% of AI budgets go to training data
- 40% of firms spend over 70% of their AI budget on data
- The initial wedge is small and mid-sized model teams