4nkh committed (verified)
Commit 82b4e1c · Parent(s): 37447ae

Update README.md

Files changed (1):
  1. README.md +92 -30
README.md CHANGED
@@ -1,35 +1,97 @@
  ---
- license: apache-2.0
  language:
  - en
- base_model:
- - google-bert/bert-base-uncased
- pipeline_tag: text-classification
  tags:
- - multi-label
- - theme
- - theme_detection
- - mentorship
- - entrepreneurship
- - startup_success
- - community
- - AI
- - JSON
- - Json_automation
- metrics:
- - f1
- - precision
- - recall
- - accuracy
- library_name: transformers
  ---
- 3---
- license: apache-2.0
- language:
- - en
- base_model:
- - google-bert/bert-base-uncased
- pipeline_tag: text-classification
- tags:
- - multi-label
- ---
  ---
+ license: mit
+ task_categories:
+ - text-classification
  language:
  - en
  tags:
+ - text-classification
+ - multi-label-classification
+ - theme-detection
+ - tone-classification
+ - cultural-knowledge
+ pretty_name: Knowledge Theme Training Model
+ size_categories:
+ - n<1K
  ---
+
+ # Knowledge Theme Training Model
+
+ A curated set of short “knowledge submissions” paired with multi-label theme tags (for a theme model) and an optional single-label tone (for a tone model). Built to support automated tagging of Knowledge Sample JSON submissions for the Kuumba Agent / Cultural Remix Engine.
+
+ ## Dataset Details
+
+ ### Dataset Description
+
+ This dataset contains short narrative passages (`original_text`) with associated metadata and labels. The primary target is `themes`, a multi-label list of theme tags used to train a theme classification model. A secondary label, `tone`, may be used to train a tone classifier.
+
+ ### Direct Use
+
+ - Train a multi-label text classification model that predicts `themes` from `original_text`
+ - Train a single-label text classifier that predicts `tone` from `original_text`
+ - Evaluate/benchmark tagging pipelines for structured Knowledge Sample JSON submissions
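For the multi-label task, each record's `themes` list is typically binarized into a fixed-length 0/1 vector over the theme vocabulary before training. A minimal, stdlib-only sketch of that preprocessing step (the records and theme tags below are invented for illustration, not drawn from the dataset):

```python
# Illustrative records mimicking the dataset's `original_text`/`themes` fields.
records = [
    {"original_text": "A mentor helped our startup find its first customers.",
     "themes": ["mentorship", "entrepreneurship"]},
    {"original_text": "The community pooled resources to launch the project.",
     "themes": ["community", "entrepreneurship"]},
]

# Build a stable, sorted theme vocabulary from the training records.
vocab = sorted({t for r in records for t in r["themes"]})

def binarize(themes, vocab):
    """Map a list of theme tags to a 0/1 vector over the vocabulary."""
    present = set(themes)
    return [1 if t in present else 0 for t in vocab]

# X is the model input text; Y is the multi-label target matrix.
X = [r["original_text"] for r in records]
Y = [binarize(r["themes"], vocab) for r in records]
```

The resulting `Y` matrix is the target format most multi-label classifiers (e.g. a BCE-loss classification head) expect; the single-label `tone` task needs only an ordinary label-to-index mapping instead.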
+
+ ### Out-of-Scope Use
+
+ - High-stakes decision-making (medical, legal, employment, housing, finance)
+ - Inferring sensitive personal attributes or identity traits
+ - Treating predictions as ground truth without human review
+ - Broad “general web” theme classification (the dataset is project-scoped and may not generalize)
+
+ ## Dataset Structure
+
+ Each data point is a JSON object with these fields:
+
+ - `knowledge_submission_id` (string): unique record id
+ - `original_text` (string): model input text
+ - `summary` (string): short summary of the passage
+ - `category` (string): high-level category label (e.g., “Business & Culture”)
+ - `themes` (list[string]): multi-label theme tags (primary training target)
+ - `tone` (string): single-label tone (optional training target)
+ - `knowledge_type` (string): type label such as “story”
+
+ Recommended splits (if/when added): train, validation, test.
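A hypothetical record matching the field schema above; every value is invented for illustration and is not taken from the dataset:

```python
import json

# Example record with all seven schema fields (values are illustrative only).
record = {
    "knowledge_submission_id": "ks-0001",
    "original_text": "A local mentor guided two founders through their first pitch.",
    "summary": "Mentorship shaping an early-stage startup.",
    "category": "Business & Culture",
    "themes": ["mentorship", "entrepreneurship"],
    "tone": "encouraging",
    "knowledge_type": "story",
}

# Round-trip through JSON to confirm the record is valid, serializable JSON.
assert json.loads(json.dumps(record)) == record
```

Note that `themes` is the only list-valued field; all other fields are single strings.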
+
+ ## Dataset Creation
+
+ Created to support automated theme tagging for Knowledge Sample JSON submissions, enabling consistent multi-label theme assignment and optional tone labeling for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.
+
+ #### Data Collection and Processing
+
+ - Examples are curated/authored knowledge submissions intended for training and evaluation
+ - Stored as normalized JSON with consistent keys across records
+ - Theme tags are assigned as a list to support multi-label learning
+ - Optional tone labels are assigned as a single categorical value
+
+ #### Who are the source data producers?
+
+ The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.
+
+ #### Annotation process
+
+ - `themes` are assigned per record as multi-label tags based on the main idea(s) of the passage
+ - `tone` is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
+ - Labels are curated to be consistent across the dataset and may evolve as the taxonomy expands
+
+ ## Citation
+
+ **BibTeX:**
+
+     @dataset{henson_kuumba_theme_dataset_2025,
+       author    = {Henson, James},
+       title     = {Kuumba Knowledge Theme Training Data},
+       year      = {2025},
+       publisher = {Hugging Face},
+     }