Priyanka72 committed on
Commit 6a3bee0 · verified · 1 Parent(s): ee3d1d1

Update README.md

Files changed (1)
  1. README.md +66 -2
README.md CHANGED
@@ -1,10 +1,74 @@
  ---
  title: README
- emoji: 🐠
  colorFrom: yellow
  colorTo: blue
  sdk: gradio
  pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card.

  ---
  title: README
+ emoji: 💻
  colorFrom: yellow
  colorTo: blue
  sdk: gradio
  pinned: false
  ---
+ # DataCreator AI
 
+ **DataCreator AI** focuses on generating high-quality synthetic datasets for training and evaluating AI systems, particularly for Natural Language Processing (NLP) tasks.
+
+ Our goal is to make high-quality training data accessible to researchers, developers, and organizations building AI applications.
+
+ ---
+
+ ## What We Do
+
+ - Generate synthetic datasets for LLM training and evaluation
+ - Create datasets for tasks such as:
+ - Question Answering
+ - Instruction Tuning
+ - Text Classification
+ - Dialogue
+ - Preference datasets (DPO / alignment)
+ - Support multilingual dataset generation, with a growing focus on **Indic languages**
+
+ ---
+
+ ## Why Synthetic Data?
+
+ Synthetic data helps solve several common challenges in AI development:
+
+ - **Data scarcity** – generate datasets when real data is unavailable
+ - **Privacy concerns** – avoid using sensitive or proprietary data
+ - **Class imbalance** – create balanced training datasets
+ - **Rapid experimentation** – quickly prototype datasets for model testing
+
+ ---
+
+ ## Focus Areas
+
+ Current dataset development focuses on:
+
+ - Instruction tuning datasets
+ - NLP Datasets
+ - Conversational Datasets
+ - Alignment datasets (chosen/rejected pairs)
+ - Educational AI datasets
+ - Indic language datasets
+
+ ---
+
+ ## Example Dataset Types
+
+ Datasets published in this organization include:
+
+ - Question–Answer datasets
+ - Instruction–Response datasets
+ - Preference datasets for RLHF / DPO
+ - Educational datasets
+ - Multilingual NLP datasets
+
+ ---
+
+ ## Vision
+
+ We believe AI should be accessible to everyone. High-quality data should not be limited to organizations with large budgets. Synthetic data combined with human expertise can help democratize AI development.
+
+ ---
+
+ ## Links
+
+ - Website: https://datacreatorai.com