Dataset Details
Dataset Description
This dataset contains short narrative passages (original_text) with associated metadata and labels. The primary target is themes, a multi-label list of theme tags used to train a theme classification model. Example theme tags include:
- Startup Success
- Mentorship
- Entrepreneurship
A secondary label tone may be used to train a tone classifier.
Direct Use
- Train a multi-label text classification model that predicts themes from original_text
- Train a single-label text classifier that predicts tone from original_text
- Evaluate/benchmark tagging pipelines for structured Knowledge Sample JSON submissions
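As a minimal sketch of how the two training targets might be prepared, the snippet below encodes themes as multi-hot vectors and tone as class indices using only the standard library. The sample records, the tone values, and the helper names (build_vocab, multi_hot) are illustrative assumptions and are not taken from the dataset or from any published training script.

```python
from typing import Dict, List

# Hypothetical records; only the field names come from the dataset card.
records = [
    {"original_text": "A founder credits her mentor for an early win.",
     "themes": ["Startup Success", "Mentorship"], "tone": "inspirational"},
    {"original_text": "He opened a shop with borrowed tools.",
     "themes": ["Entrepreneurship"], "tone": "reflective"},
]

def build_vocab(values: List[str]) -> Dict[str, int]:
    """Map each distinct label to an index (sorted so the mapping is reproducible)."""
    return {label: i for i, label in enumerate(sorted(set(values)))}

def multi_hot(tags: List[str], vocab: Dict[str, int]) -> List[float]:
    """Encode a list of theme tags as a 0/1 vector over the theme vocabulary."""
    vec = [0.0] * len(vocab)
    for tag in tags:
        vec[vocab[tag]] = 1.0
    return vec

theme_vocab = build_vocab([t for r in records for t in r["themes"]])
tone_vocab = build_vocab([r["tone"] for r in records])

# Multi-label target: one vector per record, one slot per theme.
theme_targets = [multi_hot(r["themes"], theme_vocab) for r in records]
# Single-label target: one class index per record.
tone_targets = [tone_vocab[r["tone"]] for r in records]
```

Float vectors of this shape are what a per-label sigmoid output with a binary cross-entropy loss typically expects, while the integer tone indices suit a standard softmax classifier. Whether the project uses these exact encodings is not stated in this card.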
Out-of-Scope Use
- High-stakes decision-making (medical, legal, employment, housing, finance)
- Inferring sensitive personal attributes or identity traits
- Treating predictions as ground-truth without human review
- Broad “general web” theme classification (dataset is project-scoped and may not generalize)
Dataset Structure
Each data point is a JSON object with these fields:
- knowledge_submission_id (string): unique record id
- original_text (string): model input text
- summary (string): short summary of the passage
- category (string): high-level category label (e.g., “Business & Culture”)
- themes (list[string]): multi-label theme tags (primary training target)
- tone (string): single-label tone (optional training target)
- knowledge_type (string): type label such as “story”
Recommended splits (if/when added): train, validation, test.
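The field list above can be checked mechanically. The sketch below validates one record against the documented field names and types, and shows one possible way to assign records to the recommended splits. It assumes every listed field is present on every record. The 80/10/10 ratio, the hash-based assignment, the helper names, and the example record (including its id format and its summary and tone values) are hypothetical and not defined by this dataset.

```python
import hashlib
from typing import Any, Dict, List

# Field names and types as listed in the dataset card.
STRING_FIELDS = ["knowledge_submission_id", "original_text", "summary",
                 "category", "tone", "knowledge_type"]

def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in STRING_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"{field} should be a string")
    themes = record.get("themes")
    if not isinstance(themes, list) or not all(isinstance(t, str) for t in themes):
        problems.append("themes should be a list of strings")
    return problems

def assign_split(record_id: str) -> str:
    """Hypothetical deterministic 80/10/10 split keyed on the record id."""
    bucket = int(hashlib.sha256(record_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "validation" if bucket == 8 else "test"

# Hypothetical example record; all values are made up for illustration.
example = {
    "knowledge_submission_id": "ks-0001",
    "original_text": "A founder credits her mentor for an early win.",
    "summary": "A mentor helps a founder succeed.",
    "category": "Business & Culture",
    "themes": ["Startup Success", "Mentorship"],
    "tone": "inspirational",
    "knowledge_type": "story",
}

problems = validate_record(example)
split = assign_split(example["knowledge_submission_id"])
```

Hashing the record id keeps a record in the same split across runs and as new records are added, which a random shuffle would not guarantee.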
Dataset Creation
Created to support automated theme tagging for Knowledge Sample JSON submissions, enabling consistent multi-label theme assignment and optional tone labeling for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.
Data Collection and Processing
- Examples are curated/authored knowledge submissions intended for training and evaluation
- Stored as normalized JSON with consistent keys across records
- Theme tags are assigned as a list to support multi-label learning
- Optional tone labels are assigned as a single categorical value
Who are the source data producers?
The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.
Annotation process
- themes are assigned per record as multi-label tags based on the main idea(s) of the passage
- tone is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
- Labels are curated to be consistent across the dataset, and may evolve as the taxonomy expands
BibTeX:
@dataset{henson_kuumba_theme_dataset_2025,
  author = {Henson, James},
  title = {Kuumba Knowledge Theme Training Data},
  year = {2025},
  publisher = {Hugging Face},
}