Update README.md
README.md
CHANGED
@@ -1,10 +1,74 @@
---
title: README
-emoji:
colorFrom: yellow
colorTo: blue
sdk: gradio
pinned: false
---

-
---
title: README
+emoji: 💻
colorFrom: yellow
colorTo: blue
sdk: gradio
pinned: false
---
+# DataCreator AI

+**DataCreator AI** focuses on generating high-quality synthetic datasets for training and evaluating AI systems, particularly for Natural Language Processing (NLP) tasks.
+
+Our goal is to make high-quality training data accessible to researchers, developers, and organizations building AI applications.
+
+---
+
+## What We Do
+
+- Generate synthetic datasets for LLM training and evaluation
+- Create datasets for tasks such as:
+  - Question Answering
+  - Instruction Tuning
+  - Text Classification
+  - Dialogue
+  - Preference datasets (DPO / alignment)
+- Support multilingual dataset generation, with a growing focus on **Indic languages**
+
+---
+
+## Why Synthetic Data?
+
+Synthetic data helps solve several common challenges in AI development:
+
+- **Data scarcity** – generate datasets when real data is unavailable
+- **Privacy concerns** – avoid using sensitive or proprietary data
+- **Class imbalance** – create balanced training datasets
+- **Rapid experimentation** – quickly prototype datasets for model testing
+
+---
+
+## Focus Areas
+
+Current dataset development focuses on:
+
+- Instruction tuning datasets
+- NLP datasets
+- Conversational datasets
+- Alignment datasets (chosen/rejected pairs)
+- Educational AI datasets
+- Indic language datasets
+
+---
+
+## Example Dataset Types
+
+Datasets published in this organization include:
+
+- Question–Answer datasets
+- Instruction–Response datasets
+- Preference datasets for RLHF / DPO
+- Educational datasets
+- Multilingual NLP datasets
+
+---
+
+## Vision
+
+We believe AI should be accessible to everyone. High-quality data should not be limited to organizations with large budgets. Synthetic data combined with human expertise can help democratize AI development.
+
+---
+
+## Links
+
+- Website: https://datacreatorai.com