4nkh committed · Commit 05a8b69 · verified · 1 Parent(s): 6789740

Model save

Files changed (1)
  1. README.md +41 -91
README.md CHANGED
@@ -1,111 +1,61 @@
  ---
- license: mit
- task_categories:
- - text-classification
- language:
- - en
  tags:
- - text-classification
- - multi-label-classification
- - theme-detection
- - tone-classification
- - cultural-knowledge
- pretty_name: Knowledge Theme Training Model
- size_categories:
- - n<1K
- datasets:
- - 4nkh/theme_data
- metrics:
- - bertscore
- base_model:
- - google-bert/bert-base-uncased
- pipeline_tag: text-classification
  ---

- ## Dataset Details

- <iframe
- src="https://huggingface.co/datasets/4nkh/theme_data/embed/viewer/default/train"
- frameborder="0"
- width="100%"
- height="560px"
- ></iframe>


- ### Dataset Description

- <!-- Provide a longer summary of what this dataset is. -->
- This dataset contains short narrative passages (original_text) with associated metadata and labels.
- The primary target is themes,
- a multi-label list of theme tags used to train a theme classification model.
- 1. Startup Success
- 2. Mentorship
- 3. Entrepreneurship

- A secondary label tone may be used to train a tone classifier.

- ### Direct Use

- <!-- This section describes suitable use cases for the dataset. -->
- Train a multi-label text classification model that predicts themes from original_text
- Train a single-label text classifier that predicts tone from original_text
- Evaluate/benchmark tagging pipelines for structured Knowledge Sample JSON submissions

- ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

- High-stakes decision-making (medical, legal, employment, housing, finance)
- Inferring sensitive personal attributes or identity traits
- Treating predictions as ground-truth without human review
- Broad “general web” theme classification (dataset is project-scoped and may not generalize)

- ## Dataset Structure

- <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

- Each data point is a JSON object with these fields:
- knowledge_submission_id (string): unique record id
- original_text (string): model input text
- summary (string): short summary of the passage
- category (string): high-level category label (e.g., “Business & Culture”)
- themes (list[string]): multi-label theme tags (primary training target)
- tone (string): single-label tone (optional training target)
- knowledge_type (string): type label such as “story”
- Recommended splits (if/when added): train, validation, test.

- ## Dataset Creation
- Created to support automated theme tagging for Knowledge Sample JSON submissions, enabling consistent multi-label theme assignment and optional tone labeling for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.

- #### Data Collection and Processing

- <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
-
- Examples are curated/authored knowledge submissions intended for training and evaluation
- Stored as normalized JSON with consistent keys across records
- Theme tags are assigned as a list to support multi-label learning
- Optional tone labels are assigned as a single categorical value
-
- #### Who are the source data producers?
-
- <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
-
- The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.
-
- #### Annotation process
-
- <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
-
- themes are assigned per record as multi-label tags based on the main idea(s) of the passage
- tone is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
- Labels are curated to be consistent across the dataset, and may evolve as the taxonomy expands
-
-
- **BibTeX:**
-
- @dataset{henson_kuumba_theme_dataset_2025,
- author = {Henson, James},
- title = {Kuumba Knowledge Theme Training Data},
- year = {2025},
- publisher = {Hugging Face},
- }
 
  ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: bert-base-uncased
  tags:
+ - generated_from_trainer
+ model-index:
+ - name: theme_model
+ results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # theme_model

+ This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the None dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.1822
+ - Micro/precision: 1.0
+ - Micro/recall: 1.0
+ - Micro/f1: 1.0
+ - Macro/precision: 1.0
+ - Macro/recall: 1.0
+ - Macro/f1: 1.0

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 8
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Use adamw_torch_fused with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 5

+ ### Training results



+ ### Framework versions

+ - Transformers 4.57.3
+ - Pytorch 2.8.0
+ - Datasets 4.4.2
+ - Tokenizers 0.22.2
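
Note on the reported metrics: the added README lists micro- and macro-averaged precision, recall, and F1 for a multi-label classifier. The sketch below shows one common way such numbers are computed from multi-hot label matrices. It is not the code used in this commit: the function name `multilabel_prf`, the lowercase metric keys, the zero-division convention, and the toy label matrices are all assumptions made for illustration. The three label columns loosely mirror the example themes from the removed dataset card.

```python
from typing import Dict, List, Tuple


def multilabel_prf(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Micro- and macro-averaged precision, recall, and F1 for multi-hot labels.

    Hypothetical helper: each row is one example, each column one label (0 or 1).
    """
    n_labels = len(y_true[0])
    tp = [0] * n_labels  # true positives per label
    fp = [0] * n_labels  # false positives per label
    fn = [0] * n_labels  # false negatives per label
    for truth, pred in zip(y_true, y_pred):
        for j in range(n_labels):
            if pred[j] and truth[j]:
                tp[j] += 1
            elif pred[j] and not truth[j]:
                fp[j] += 1
            elif truth[j] and not pred[j]:
                fn[j] += 1

    def prf(t: int, p: int, n: int) -> Tuple[float, float, float]:
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.0.
        precision = t / (t + p) if t + p else 0.0
        recall = t / (t + n) if t + n else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    # Micro: pool the counts over all labels, then compute one score.
    micro = prf(sum(tp), sum(fp), sum(fn))
    # Macro: compute a score per label, then take the unweighted mean.
    per_label = [prf(tp[j], fp[j], fn[j]) for j in range(n_labels)]
    macro = [sum(scores[k] for scores in per_label) / n_labels for k in range(3)]

    names = ("precision", "recall", "f1")
    result = {f"micro/{name}": micro[k] for k, name in enumerate(names)}
    result.update({f"macro/{name}": macro[k] for k, name in enumerate(names)})
    return result


# Toy data; columns stand for: Startup Success, Mentorship, Entrepreneurship.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]  # misses "Mentorship" on the last row
for key, value in multilabel_prf(y_true, y_pred).items():
    print(f"{key}: {value:.4f}")
```

In this toy run one missed label lowers micro recall to 0.8 and macro recall to about 0.8333 while both precisions stay at 1.0. All six metrics equal 1.0 only when every predicted label matches the reference for every evaluation example.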