Muiru committed on
Commit d0d6ce3 · 1 Parent(s): 3504f2d

Update README files with latest improvements and features

Files changed (2)
  1. README.hf.md +17 -3
  2. README.md +14 -3
README.hf.md CHANGED
@@ -48,9 +48,20 @@ Foundational fine‑tuned model developed by CogniX LTD.
48
 
49
  ### Dataset Sources:
50
 
51
- - **Native datasets**: Open mental health dialogue corpora curated for supportive conversation and coaching contexts. This includes publicly available datasets such as **Counsel Chat** and **Psych8k**.
52
- - **Synthetic datasets**: Additional coaching‑style dialogues generated using OpenAI models (**GPT‑4o**) to augment coverage and style diversity.
53
- - **Release & Licensing**: Fine‑tuning combined both native and synthetic sources with safety‑oriented filtering and prompt design. Full dataset provenance will be released with our open dataset under a **CC‑BY 4.0** license.
54
 
55
 
56
  ### Model Details:
@@ -108,6 +119,9 @@ Our evaluation framework operationalizes Google's Responsible AI Principles:
108
 
109
  *Full evaluation suite and rubrics available at [https://github.com/CogniX-LTD/Cogni-OpenModel].*
110
 
111
 
112
  ### Generation Configuration:
113
 
 
48
 
49
  ### Dataset Sources:
50
 
51
+ To preserve authenticity, our training data prioritizes real therapy conversations from licensed professionals over synthetic data. Primary sources include:
52
+
53
+ - **Amod/mental_health_counseling_conversations**: Real counseling platform Q&A.
54
+ - **nbertagnolli/counsel-chat**: Licensed therapist responses.
55
+ - **EmoCareAI/Psych8k**: Transcripts from real counseling sessions.
56
+ - **vzeizer/MentalHealth_Analysis**: Mental health condition recognition and classification (e.g., depression, anxiety, suicidal ideation).
57
+
58
+ These are supplemented with synthetic data (generated using **GPT‑4o/Claude Sonnet 4.5** with safety filtering) to enhance coverage of specific scenarios while maintaining therapeutic quality.
59
+
60
+ #### Curation Rationale:
61
+ "There is a lack of high quality open source mental health data available for study in NLP. Most datasets revolve around forums like Reddit, which can provide great insights, but don't capture the type of language often used by counselors. This dataset seeks to help bridge that gap and provide additional data of counselors interacting with patients in need."
62
+
63
+ #### Release & Licensing:
64
+ Full dataset provenance will be released with our open dataset under a **CC‑BY 4.0** license.
65
 
66
 
67
  ### Model Details:
 
119
 
120
  *Full evaluation suite and rubrics available at [https://github.com/CogniX-LTD/Cogni-OpenModel].*
121
 
122
+ We will continuously evaluate the model with DeepEval's GEval metric to monitor therapeutic quality and safety, checking that grounding in real data remains effective as we scale.
123
+
124
+
125
 
126
  ### Generation Configuration:
127
 
README.md CHANGED
@@ -31,9 +31,20 @@ Foundational fine‑tuned model developed by CogniX LTD.
31
 
32
  ### Dataset Sources:
33
 
34
- - **Native datasets**: Open mental health dialogue corpora curated for supportive conversation and coaching contexts. This includes publicly available datasets such as **Counsel Chat** and **Psych8k**.
35
- - **Synthetic datasets**: Additional coaching‑style dialogues generated using OpenAI models (**GPT‑4o**) to augment coverage and style diversity.
36
- - **Release & Licensing**: Fine‑tuning combined both native and synthetic sources with safety‑oriented filtering and prompt design. Full dataset provenance will be released with our open dataset under a **CC‑BY 4.0** license.
37
 
38
 
39
  ### Model Details:
 
31
 
32
  ### Dataset Sources:
33
 
34
+ To preserve authenticity, our training data prioritizes real therapy conversations from licensed professionals over synthetic data. Primary sources include:
35
+
36
+ - **Amod/mental_health_counseling_conversations**: Real counseling platform Q&A.
37
+ - **nbertagnolli/counsel-chat**: Licensed therapist responses.
38
+ - **EmoCareAI/Psych8k**: Transcripts from real counseling sessions.
39
+ - **vzeizer/MentalHealth_Analysis**: Mental health condition recognition and classification (e.g., depression, anxiety, suicidal ideation).
40
+
41
+ These are supplemented with synthetic data (generated using **GPT‑4o/Claude Sonnet 4.5** with safety filtering) to enhance coverage of specific scenarios while maintaining therapeutic quality.
42
+
43
+ #### Curation Rationale:
44
+ "There is a lack of high quality open source mental health data available for study in NLP. Most datasets revolve around forums like Reddit, which can provide great insights, but don't capture the type of language often used by counselors. This dataset seeks to help bridge that gap and provide additional data of counselors interacting with patients in need."
45
+
46
+ #### Release & Licensing:
47
+ Full dataset provenance will be released with our open dataset under a **CC‑BY 4.0** license.
48
 
49
 
50
  ### Model Details: