4nkh committed (verified)
Commit 82b4e1c · Parent(s): 37447ae

Update README.md

Files changed (1):
  1. README.md +92 -30
README.md CHANGED
@@ -1,35 +1,97 @@
  ---
- license: apache-2.0
  language:
  - en
- base_model:
- - google-bert/bert-base-uncased
- pipeline_tag: text-classification
  tags:
- - multi-label
- - theme
- - theme_detection
- - mentorship
- - entrepreneurship
- - startup_success
- - community
- - AI
- - JSON
- - Json_automation
- metrics:
- - f1
- - precision
- - recall
- - accuracy
- library_name: transformers
  ---
- 3---
- license: apache-2.0
- language:
- - en
- base_model:
- - google-bert/bert-base-uncased
- pipeline_tag: text-classification
- tags:
- - multi-label
- ---
  ---
+ license: mit
+ task_categories:
+ - text-classification
  language:
  - en
  tags:
+ - text-classification
+ - multi-label-classification
+ - theme-detection
+ - tone-classification
+ - cultural-knowledge
+ pretty_name: Knowledge Theme Training Model
+ size_categories:
+ - n<1K
  ---
+
+ # Knowledge Theme Training Model
+
+ A curated set of short “knowledge submissions” paired with multi-label theme tags (for a theme model) and an optional single-label tone (for a tone model). Built to support automated tagging of Knowledge Sample JSON submissions for the Kuumba Agent / Cultural Remix Engine.
+
+ ## Dataset Details
+
+ ### Dataset Description
+
+ This dataset contains short narrative passages (`original_text`) with associated metadata and labels. The primary target is `themes`, a multi-label list of theme tags used to train a theme classification model. A secondary label, `tone`, may be used to train a tone classifier.
+
+ ### Direct Use
+
+ - Train a multi-label text classification model that predicts `themes` from `original_text`
+ - Train a single-label text classifier that predicts `tone` from `original_text`
+ - Evaluate/benchmark tagging pipelines for structured Knowledge Sample JSON submissions
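For the multi-label task, each record's `themes` list is typically binarized into a fixed-length 0/1 vector over the theme vocabulary before training. A minimal, stdlib-only sketch of that preprocessing step (the records and theme tags below are invented for illustration, not drawn from the dataset):

```python
# Illustrative records mimicking the dataset's `original_text`/`themes` fields.
records = [
    {"original_text": "A mentor helped our startup find its first customers.",
     "themes": ["mentorship", "entrepreneurship"]},
    {"original_text": "The community pooled resources to launch the project.",
     "themes": ["community", "entrepreneurship"]},
]

# Build a stable, sorted theme vocabulary from the training records.
vocab = sorted({t for r in records for t in r["themes"]})

def binarize(themes, vocab):
    """Map a list of theme tags to a 0/1 vector over the vocabulary."""
    present = set(themes)
    return [1 if t in present else 0 for t in vocab]

# X is the model input text; Y is the multi-label target matrix.
X = [r["original_text"] for r in records]
Y = [binarize(r["themes"], vocab) for r in records]
```

The resulting `Y` matrix is the target format most multi-label classifiers (e.g. a BCE-loss classification head) expect; the single-label `tone` task needs only an ordinary label-to-index mapping instead.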
+
+ ### Out-of-Scope Use
+
+ - High-stakes decision-making (medical, legal, employment, housing, finance)
+ - Inferring sensitive personal attributes or identity traits
+ - Treating predictions as ground truth without human review
+ - Broad “general web” theme classification (the dataset is project-scoped and may not generalize)
+
+ ## Dataset Structure
+
+ Each data point is a JSON object with these fields:
+
+ - `knowledge_submission_id` (string): unique record id
+ - `original_text` (string): model input text
+ - `summary` (string): short summary of the passage
+ - `category` (string): high-level category label (e.g., “Business & Culture”)
+ - `themes` (list[string]): multi-label theme tags (primary training target)
+ - `tone` (string): single-label tone (optional training target)
+ - `knowledge_type` (string): type label such as “story”
+
+ Recommended splits (if/when added): train, validation, test.
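A hypothetical record matching the field schema above; every value is invented for illustration and is not taken from the dataset:

```python
import json

# Example record with all seven schema fields (values are illustrative only).
record = {
    "knowledge_submission_id": "ks-0001",
    "original_text": "A local mentor guided two founders through their first pitch.",
    "summary": "Mentorship shaping an early-stage startup.",
    "category": "Business & Culture",
    "themes": ["mentorship", "entrepreneurship"],
    "tone": "encouraging",
    "knowledge_type": "story",
}

# Round-trip through JSON to confirm the record is valid, serializable JSON.
assert json.loads(json.dumps(record)) == record
```

Note that `themes` is the only list-valued field; all other fields are single strings.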
+
+ ## Dataset Creation
+
+ Created to support automated theme tagging for Knowledge Sample JSON submissions, enabling consistent multi-label theme assignment and optional tone labeling for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.
+
+ #### Data Collection and Processing
+
+ - Examples are curated/authored knowledge submissions intended for training and evaluation
+ - Stored as normalized JSON with consistent keys across records
+ - Theme tags are assigned as a list to support multi-label learning
+ - Optional tone labels are assigned as a single categorical value
+
+ #### Who are the source data producers?
+
+ The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.
+
+ #### Annotation process
+
+ - `themes` are assigned per record as multi-label tags based on the main idea(s) of the passage
+ - `tone` is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
+ - Labels are curated to be consistent across the dataset and may evolve as the taxonomy expands
+
+ ## Citation
+
+ **BibTeX:**
+
+     @dataset{henson_kuumba_theme_dataset_2025,
+       author    = {Henson, James},
+       title     = {Kuumba Knowledge Theme Training Data},
+       year      = {2025},
+       publisher = {Hugging Face},
+     }