---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- text-classification
- multi-label-classification
- theme-detection
- tone-classification
- cultural-knowledge
pretty_name: Knowledge Theme Training Model
size_categories:
- n<1K
datasets:
- 4nkh/theme_data
metrics:
- bertscore
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

## Dataset Details

### Dataset Description

<!-- Provide a longer summary of what this dataset is. -->
This dataset contains short narrative passages (`original_text`) with associated metadata and labels.
The primary target is `themes`, a multi-label list of theme tags used to train a theme classification model. Theme tags include:

1. Startup Success
2. Mentorship
3. Entrepreneurship

A secondary label, `tone`, may be used to train a tone classifier.

### Direct Use

<!-- This section describes suitable use cases for the dataset. -->
- Train a multi-label text classification model that predicts `themes` from `original_text`
- Train a single-label text classifier that predicts `tone` from `original_text`
- Evaluate or benchmark tagging pipelines for structured Knowledge Sample JSON submissions
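
As a minimal sketch of the first use case, the `themes` lists can be turned into multi-hot target vectors before training a multi-label classifier. The two records below are hypothetical; only the field names and theme tags come from this card, and the helper names `build_label_index` and `multi_hot` are illustrative.

```python
# Hypothetical records; only the field names and theme tags come from this card.
records = [
    {
        "original_text": "A founder recalls her first mentor.",
        "themes": ["Mentorship", "Entrepreneurship"],
    },
    {
        "original_text": "A small shop grows into a company.",
        "themes": ["Startup Success"],
    },
]


def build_label_index(rows):
    """Map each distinct theme tag to a column index, sorted for a stable order."""
    tags = sorted({tag for row in rows for tag in row["themes"]})
    return {tag: i for i, tag in enumerate(tags)}


def multi_hot(themes, label_index):
    """Return a 0/1 vector with a 1 in the column of every theme present."""
    vector = [0] * len(label_index)
    for tag in themes:
        vector[label_index[tag]] = 1
    return vector


label_index = build_label_index(records)
targets = [multi_hot(row["themes"], label_index) for row in records]
```

Each row of `targets` lines up with one `original_text`, so a record tagged with several themes simply has several columns set to 1.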

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

- High-stakes decision-making (medical, legal, employment, housing, finance)
- Inferring sensitive personal attributes or identity traits
- Treating predictions as ground truth without human review
- Broad “general web” theme classification (the dataset is project-scoped and may not generalize)

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

Each data point is a JSON object with these fields:

- `knowledge_submission_id` (string): unique record id
- `original_text` (string): model input text
- `summary` (string): short summary of the passage
- `category` (string): high-level category label (e.g., “Business & Culture”)
- `themes` (list[string]): multi-label theme tags (primary training target)
- `tone` (string): single-label tone (optional training target)
- `knowledge_type` (string): type label such as “story”

Recommended splits (if/when added): train, validation, test.
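
The field list above can be checked mechanically before training. The following is a rough sketch, assuming every field is present in every record (the card does not state which fields are required); the sample record, its id, and its `tone` value are hypothetical.

```python
# Field names and types as listed in this card.
EXPECTED_FIELDS = {
    "knowledge_submission_id": str,
    "original_text": str,
    "summary": str,
    "category": str,
    "themes": list,
    "tone": str,
    "knowledge_type": str,
}


def validate_record(record):
    """Return a list of problems found in one record (an empty list means it passed)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # themes must be a list of strings, not just any list.
    if isinstance(record.get("themes"), list):
        if not all(isinstance(tag, str) for tag in record["themes"]):
            problems.append("themes should contain only strings")
    return problems


# Hypothetical record for illustration only.
sample = {
    "knowledge_submission_id": "ks-001",
    "original_text": "A founder recalls her first mentor.",
    "summary": "A story about early mentorship.",
    "category": "Business & Culture",
    "themes": ["Mentorship", "Entrepreneurship"],
    "tone": "reflective",
    "knowledge_type": "story",
}

problems = validate_record(sample)
```

Running `validate_record` over every record before training surfaces missing keys or a `themes` value stored as a plain string instead of a list.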

## Dataset Creation

This dataset was created to support automated theme tagging for Knowledge Sample JSON submissions. It enables consistent multi-label theme assignment, with optional tone labeling, for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

- Examples are curated or authored knowledge submissions intended for training and evaluation
- Records are stored as normalized JSON with consistent keys across records
- Theme tags are assigned as a list to support multi-label learning
- Optional tone labels are assigned as a single categorical value

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

- `themes` are assigned per record as multi-label tags based on the main idea(s) of the passage
- `tone` is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
- Labels are curated to be consistent across the dataset and may evolve as the taxonomy expands
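
One way to spot-check label consistency as the taxonomy grows is to look for theme tags that differ only by case or surrounding whitespace. This is an illustrative sketch, not the project's actual curation tooling; the helper `find_inconsistent_tags` and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_tags(rows):
    """Group theme tags that differ only by case or surrounding whitespace."""
    variants = defaultdict(set)
    for row in rows:
        for tag in row["themes"]:
            variants[tag.strip().lower()].add(tag)
    # Keep only normalized tags that appear in more than one spelling.
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


# Hypothetical rows; the second one contains a lowercase tag with a trailing space.
rows = [
    {"themes": ["Mentorship", "Startup Success"]},
    {"themes": ["mentorship ", "Entrepreneurship"]},
]

inconsistent = find_inconsistent_tags(rows)
```

Any key returned in `inconsistent` points at spellings that probably should be merged into a single tag.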


**BibTeX:**

@dataset{henson_kuumba_theme_dataset_2025,
  author = {Henson, James},
  title = {Kuumba Knowledge Theme Training Data},
  year = {2025},
  publisher = {Hugging Face},
}