
Hugging_face Directory Guide

This document captures the current layout of the Hugging_face workspace so newcomers can see where finetuning scripts, datasets, and artifacts live and what each area contains.

Structure

Hugging_face/
|-- Benchmark/
|   |-- deepeval_conversation_bench.py
|   |-- trditional_conversation_bench.py
|   `-- trditional_benchmark_mac.py
|-- Benchmark_Results/
|   |-- Deepeval_Run/
|   `-- Traditional_Run/
|-- datasets/
|   |-- CounseLLMe/
|   `-- MHQA/
|-- Finetuned models/
|   |-- Jeethu/
|   `-- Kanishkha/
|-- Finetuning/
|   |-- Jeethu/
|   `-- Kanishkha/
`-- README.md

Folder Details

Benchmark/

Scripts for scoring conversational mental-health models.

  • trditional_conversation_bench.py: runs a "traditional" automatic evaluation loop that scores generations against reference therapist replies using lexical and embedding-based metrics.

    • Datasets covered (5 total):
      1. EmpatheticDialogues (facebook/empathetic_dialogues) – pairs each odd-indexed (user) utterance with the following empathetic listener turn, rewriting each example into an Alpaca-style prompt with emotion and situation context.
      2. MentalChat16K (a.k.a. mental health 16k; ShenLab/MentalChat16K) – uses the instruction/input/output fields to evaluate coping-support responses across labeled mental-health categories.
      3. CounseLLMe (local copy under datasets/CounseLLMe/) – ingests JSON therapy transcripts exported from the CounseLLMe study, stripping reminder prompts and formatting each counselor turn as a response to the latest client message.
      4. MHQA (local TSV under datasets/MHQA/) – reads the mental-health question answering benchmark (test.txt) and compares generated answers with the curated raw responses for each labeled category.
      5. Mental Health Counseling Conversations (Amod/mental_health_counseling_conversations) – converts counselor guidance into instruction-following prompts anchored to user questions.
    • Metrics computed per example and at corpus level:
      • ROUGE-1, ROUGE-2, ROUGE-L (Hugging Face evaluate package) to quantify n-gram overlap with the gold counselor answer.
      • BLEU-1 (unigram BLEU) to measure the lexical precision of the generated reply against the reference counselor answer.
      • BERTScore F1 (with roberta-base) to capture semantic alignment between the model reply and the reference counselor output.
  • deepeval_conversation_bench.py: mirrors the generation pipeline but swaps metrics for judge-model evaluations powered by DeepEval. Each example is rated on Answer Relevancy, Coherence, Helpfulness, Readability, Faithfulness, Conciseness, and Bias, providing qualitative signal beyond lexical overlap. See the DeepEval metric cards for definitions and scoring rubrics.
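Of the lexical metrics above, BLEU-1 is the simplest to reason about: clipped unigram precision. The benchmark scripts use library implementations; the sketch below is only an illustration of the idea, assuming whitespace tokenization and omitting the brevity penalty:

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate against a single reference.

    Each candidate token is credited at most as many times as it appears
    in the reference, so repeating a matching word does not inflate the score.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped / len(cand_tokens)
```

For example, a candidate that shares three of its three tokens with the reference scores 1.0, regardless of extra words in the reference.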

Benchmark_Results/

Snapshot of evaluation runs; each subfolder keeps its generated metrics grouped by benchmarking style.

  • Traditional_Run/ and Deepeval_Run/ house time-stamped run directories containing config.json, example_metrics.csv, corpus_metrics.csv, and model_comparison.xlsx for quick post-analysis.
  • Expect per-run folders named run_YYYYMMDD_HHMMSS/, which you can archive or compare across models.

datasets/

Landing zone for local corpora that the benchmarks load from disk.

  • CounseLLMe/: drop the JSON transcript dumps from the CounseLLMe project so the loaders can format counselor turns into prompts.
  • MHQA/: store the question answering TSVs (test.txt, etc.) required for the MHQA loader.
  • Add future offline datasets here and update the Benchmark loaders/README with preprocessing expectations.
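As a rough sketch of what an offline loader here looks like, the snippet below reads a tab-separated file with the standard library. The column names are whatever the file's header row declares; check the actual header of test.txt rather than assuming specific fields:

```python
import csv
from pathlib import Path

def load_tsv(tsv_path: str) -> list[dict]:
    """Read a TSV with a header row into a list of row dicts.

    csv.DictReader keys each row by the column names in the header,
    so downstream code can address fields by name.
    """
    with Path(tsv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```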

Finetuned models/

Contributor-organized exports of fine-tuned checkpoints ready for benchmarking or downstream inference.

  • Jeethu/: contains Jeethu’s merged or adapter checkpoints (e.g., Gemma-3 270M variants) used by their scripts.
  • Kanishkha/: holds Kanishkha’s published weights, matching the repos referenced in their training and inference code.
  • Keep README files or tags alongside models to note training dates, datasets, and evaluation highlights.

Finetuning/

Working copies of training, alignment, and inference scripts maintained by each contributor (Jeethu/, Kanishkha/, etc.).

  • Jeethu/gemma3_270_FullFinetune.py: end-to-end TRL SFTTrainer run that merges multiple counseling datasets, supports W&B logging, and optionally pushes a fully fine-tuned Gemma checkpoint to the Hub.
  • Jeethu/gemma270_SFT.py: QLoRA-style supervised fine-tune (single dataset) that prepares the tokenizer/chat template, configures LoRA adapters, and uploads results.
  • Jeethu/gemma270_DPO.py: Direct Preference Optimization stage built on top of the SFT adapter checkpoint using PsychoCounsel-Preference comparisons.
  • Jeethu/gemma270_ORPO.py: placeholder for an ORPO pipeline (empty scaffold—extend from the Kanishkha version when ready).
  • Jeethu/gemma270_inference.py: Gradio demo + inference harness that loads either PEFT adapters or merged weights and exposes a simple chat UI.
  • The Kanishkha/ directory mirrors the same toolkit with contributor-specific defaults:
    • gemma3_270_FullFinetune.py: identical multi-dataset finetune for Kanishkha’s output repo.
    • gemma270_SFT.py: supervised LoRA SFT configuration targeting jkanishkha0305/gemma3_270m_sft_qlora.
    • gemma270_DPO.py: DPO trainer with HF/W&B login helpers and dataset subsampling knobs.
    • gemma270_ORPO.py: full ORPO alignment script configuring ORPOTrainer against the preference dataset.
    • gemma270_inference.py: shared inference stack pointing to the SFT LoRA repository for quick validation chats.
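All of these inference scripts ultimately feed the model a chat-formatted prompt. As a rough illustration, Gemma-style chat turns look like the following; this is a hand-rolled sketch, and in practice you should use the tokenizer's `apply_chat_template` so the exact control tokens come from the tokenizer rather than hard-coded strings:

```python
def gemma_prompt(user_message: str) -> str:
    """Format one user turn in Gemma's chat markup and open the model turn.

    The generation loop appends the model's reply after the trailing
    "<start_of_turn>model" marker.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```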

README.md

This guide. Update the tree and descriptions whenever you add or remove assets so the workspace stays self-documenting.
