webesama committed on
Commit 0356fa3 · verified · 1 Parent(s): dba0dda

Update README.md

Files changed (1)
  1. README.md +37 -47
README.md CHANGED
@@ -49,6 +49,33 @@ This model is intended for research and development use. It is not a certified m

  ## 🚀 How to Use

  ### Load model and tokenizer:

  ```python
@@ -61,64 +88,27 @@ model = AutoModelForSequenceClassification.from_pretrained(model_name)
  model.eval().to("cuda" if torch.cuda.is_available() else "cpu")
  ```

- ### 📝 Predict on a full structured interview:
  Assume you have a conversation log like this:

  ```python
- conversation_log = [
-     {"Speaker": "Interviewer", "Content": "Wie war Ihr Appetit?", "Topic": "Appetit"},
-     {"Speaker": "Patient", "Content": "Ich hatte guten Appetit.", "Topic": "Appetit"},
-     {"Speaker": "Interviewer", "Content": "Wie war Ihr Schlaf?", "Topic": "Schlaf"},
-     {"Speaker": "Patient", "Content": "Ich konnte gut schlafen.", "Topic": "Schlaf"},
-     # etc.
- ]
- topics = ["Traurigkeit", "Anspannung", "Schlaf", "Appetit", "Konzentration", "Antriebslosigkeit", "Gefühlslosigkeit", "Gedanken", "Suizid"]
- ```
-
- Use the prediction function:
-
- ```python
- def predict_scores_per_topic(conversation_log, topics, tokenizer, model):
      device = model.device
      predictions = {}
-     for topic in topics:
-         topic_dialogue = "\n".join(
-             [f"{entry['Speaker']}: {entry['Content']}" for entry in conversation_log if entry["Topic"] == topic]
-         )
-         if not topic_dialogue:
-             predictions[topic] = None
-             continue
-         inputs = tokenizer(topic_dialogue, truncation=True, padding="max_length", max_length=512, return_tensors="pt").to(device)
          with torch.no_grad():
              score = torch.round(model(**inputs).logits).clamp(0, 6).item()
          predictions[topic] = score
-     return predictions
- ```
-
- ---

- ## 🧹 Preprocessing Custom Data

- If you want to prepare your own data (e.g., from JSONL with structure: `User ID`, `Speaker`, `Transcription`, `Topic`, `Score`), use the preprocessing below:

- ```python
- from datasets import load_dataset
-
- dataset = load_dataset("json", data_files="your_data.jsonl", split="train")
-
- def preprocess_function(examples):
-     scores = [int(float(output.split(":")[1].strip())) for output in examples['output']]
-     topics = [
-         input_text.split("\n")[0].replace("Topic: ", "").strip()
-         if "Topic:" in input_text else "Unknown"
-         for input_text in examples['input']
-     ]
-     encoded = tokenizer(examples['input'], truncation=True, padding="max_length", max_length=512)
-     encoded["labels"] = scores
-     encoded["Topic"] = topics
-     return encoded
-
- tokenized_dataset = dataset.map(preprocess_function, batched=True)
  ```

  ---
 

  ## 🚀 How to Use

+ ### Preprocess Data File:
+
+ Please organize your data equivalent to the example data (synthetic data) with columns: Subject, Speaker, Transcription, Topic, Score.
+
+ ```python
+
+ import pandas as pd
+
+ def load_and_prepare_conversations(filepath):
+     df = pd.read_excel(filepath)
+     conversations = []
+
+     for topic in df['Topic'].unique():
+         topic_df = df[df['Topic'] == topic]
+         if topic_df.empty: continue
+
+         dialogue = "\n".join([
+             f"{row['Speaker']}: {row['Transcription']}"
+             for _, row in topic_df.iterrows()
+             if pd.notnull(row['Transcription'])
+         ])
+
+         conversations.append((topic, dialogue))
+     return conversations
+
+ ```
+
  ### Load model and tokenizer:

  ```python
 
  model.eval().to("cuda" if torch.cuda.is_available() else "cpu")
  ```

+ ### 📝 Predict on a full structured interview / Run inference:
  Assume you have a conversation log like this:

  ```python
+ def predict_madrs_scores(conversations, tokenizer, model):
      device = model.device
      predictions = {}
+
+     for topic, dialogue in conversations:
+         inputs = tokenizer(dialogue, truncation=True, padding="max_length", max_length=512, return_tensors="pt").to(device)
          with torch.no_grad():
              score = torch.round(model(**inputs).logits).clamp(0, 6).item()
          predictions[topic] = score

+     return predictions

+ file_path = "example_interview.xlsx"
+ conversations = load_and_prepare_conversations(file_path)
+ scores = predict_madrs_scores(conversations, tokenizer, model)
+ print(scores)

  ```

  ---
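
Note on the new data layout: the added `load_and_prepare_conversations` groups rows by `Topic` and joins each `Speaker: Transcription` pair into one dialogue string per topic. The sketch below mirrors that grouping in plain Python so the expected `(topic, dialogue)` shape can be checked without an Excel file. It is an illustration, not the README's code: `group_rows_by_topic` is a hypothetical helper, the rows reuse the sample sentences from the removed `conversation_log`, and the `Subject` and `Score` values are invented placeholders.

```python
def group_rows_by_topic(rows):
    """Group row dicts into (topic, dialogue) tuples, ordered by first appearance."""
    dialogues = {}  # dicts preserve insertion order, similar to df['Topic'].unique()
    for row in rows:
        lines = dialogues.setdefault(row["Topic"], [])
        text = row.get("Transcription")
        if text is not None:  # stand-in for the pd.notnull check
            lines.append(f"{row['Speaker']}: {text}")
    return [(topic, "\n".join(lines)) for topic, lines in dialogues.items()]


# Hypothetical rows using the columns named in the README
# (Subject, Speaker, Transcription, Topic, Score); Subject and Score are made up.
rows = [
    {"Subject": "S01", "Speaker": "Interviewer", "Transcription": "Wie war Ihr Appetit?", "Topic": "Appetit", "Score": 0},
    {"Subject": "S01", "Speaker": "Patient", "Transcription": "Ich hatte guten Appetit.", "Topic": "Appetit", "Score": 0},
    {"Subject": "S01", "Speaker": "Interviewer", "Transcription": "Wie war Ihr Schlaf?", "Topic": "Schlaf", "Score": 0},
    {"Subject": "S01", "Speaker": "Patient", "Transcription": "Ich konnte gut schlafen.", "Topic": "Schlaf", "Score": 0},
    {"Subject": "S01", "Speaker": "Patient", "Transcription": None, "Topic": "Schlaf", "Score": 0},
]

conversations = group_rows_by_topic(rows)
for topic, dialogue in conversations:
    print(topic)
    print(dialogue)
```

Each tuple in `conversations` has the shape that `predict_madrs_scores` iterates over. The row with a missing transcription is skipped, as in the pandas version.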