Kévin Yauy committed on
Commit 5472b6c · 1 Parent(s): ea2f6c6

fix(write): fix writing issue and hf compatibility

Files changed (4)
  1. .gitignore +2 -0
  2. README.md +23 -24
  3. clinfly_app_cli.py +152 -50
  4. data/test.tsv +2 -0
.gitignore CHANGED

    @@ -4,3 +4,5 @@ __pycache__/
      .vscode/
      .venv/
      poetry.lock
    + .DS*
    + output/
README.md CHANGED

    @@ -1,9 +1,9 @@
      ---
      title: ClinFly
      emoji: small_airplane
    - sdk_version: 1.21.0
    - streamlit_file: clinfly_app_st.py
    - CLI_file: clinfly_app_cli.py
    + sdk_version: 1.25.0
    + sdk: streamlit
    + app_file: clinfly_app_st.py
      pinned: true
      ---

    @@ -16,56 +16,55 @@ Contact : [kevin.yauy@chu-montpellier.fr](mailto:kevin.yauy@chu-montpellier.fr)

      ## Introduction

    - Precision medicine (PM) for rare diseases requires both precision phenotyping and data sharing. However, the majority of digital phenotyping tools only deal with the English language.
    + ClinFly is an automated framework designed to facilitate precision medicine (PM) for rare diseases. It addresses the challenge of precision phenotyping and data sharing across different languages.

    - Using French as a proof of concept, we have developed ClinFly, an automated framework to anonymize, translate and summarize clinical reports using Human Phenotype Ontology (HPO) terms compliant with medical data privacy standards. The output consists of a de-identified translated clinical report and a summary report in HPO format.
    + ClinFly can anonymize, translate, and summarize clinical reports using Human Phenotype Ontology (HPO) terms, ensuring compliance with medical data privacy standards. The output includes a de-identified translated clinical report and a summary report in HPO format.

    - By facilitating the translation and anonymization of clinical reports, ClinFly has the potential to facilitate inter-hospital data sharing, accelerate medical discoveries and open up the possibility of an international patient file without limitations due to non-English speakers.
    + By streamlining the translation and anonymization of clinical reports, ClinFly aims to enhance inter-hospital data sharing, expedite medical discoveries, and pave the way for an international patient file accessible to non-English speakers.

      ## Pipeline

      ![](img/pipeline.png)

    - ## Poetry Installation
    + ## Installation
    +
    + To install ClinFly on your local machine, you need the `poetry` package manager. Navigate to the project folder and run:

    - To install on your local machine, you need `poetry` package manager and launch in the folder:
      ```
      poetry install
      ```

    - Using requirement ?
    + If you need to generate a `requirements.txt` file, use the following command:
      ```
      poetry export --without-hashes --format=requirements.txt > requirements.txt
      ```

    - ## Run the code
    -
    - ### Graphical User Interface - Single report usage with interactive analysis
    + ## Usage

    - A webapp is accessible at https://huggingface.co/spaces/kyauy/ClinFly, please try it !
    + ### Graphical User Interface

    - It's a streamlit application, where code is accessible in `clinfly_app_st.py` file. The functions are accessible in the `utilities` folder.
    + For single report usage with interactive analysis, ClinFly provides a web application accessible at https://huggingface.co/spaces/kyauy/ClinFly.

    - To run the streamlit application on your local computer :
    + To run the Streamlit application on your local computer, activate the poetry shell and run the `clinfly_app_st.py` file:
      ```
      poetry shell
      streamlit run clinfly_app_st.py
      ```

    - ### Command Line Interface - Multiple report usage with offline options
    + ### Command Line Interface

    - The code is accessible in `clinfly_app_cli.py` file. The functions are accessible in the `utilities` folder.
    + For processing multiple reports with offline options, use the command line interface provided by `clinfly_app_cli.py`.

    - The entry file must be a TSV .txt with the informations structured like this :
    + The input should be a TSV .txt file structured as follows (see `data/test.tsv` for an example):
      ```
    - Doe John Report
    + Report_id_1 Doe John Report text
    + ...
    + Report_id_X Doe John Report text
      ```

    - The output will be placed in the `results` folder according to the file extension.
    -
    - A resume of the deidentify report will be generated and placed in the `results/Reports` folder.
    -
    - Three HPO extraction output will be generated, TSV, TXT and Json.
    + Outputs will be placed in the `results` folder according to the file extension, using first three columns in filename.
    + - The deidentify report will be generated and placed in the `results/Reports` folder.
    + - Three HPO extraction outputs will be generated in `TSV`, `TXT` and `JSON` folders.

      To run the CLI application on your local computer :
      ```
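The output layout described in this README hunk can be illustrated with a short Python sketch. It is not code from the repository: `OUTPUT_SUFFIXES` and `expected_outputs` are hypothetical names, the suffixes are copied from the file names written by `clinfly_app_cli.py` in this commit, and `Results` is assumed because it is the CLI's default `--result_dir`.

```python
import os

# Suffixes copied from the file names written by clinfly_app_cli.py in this commit.
OUTPUT_SUFFIXES = {
    "Reports": "_translated_and_deindentified_report.txt",
    "TSV": "_summarized_report.tsv",
    "JSON": "_summarized_report.json",
    "TXT": "_summarized_report.txt",
}


def expected_outputs(result_dir, report_id, last_name, first_name):
    """Hypothetical helper: map each output folder to the file the CLI would write."""
    # The first three TSV columns form the file name stem.
    stem = "_".join([report_id, last_name, first_name])
    return {
        folder: os.path.join(result_dir, folder, stem + suffix)
        for folder, suffix in OUTPUT_SUFFIXES.items()
    }


paths = expected_outputs("Results", "Report_id_1", "Doe", "John")
for folder, path in paths.items():
    print(folder, "->", path)
```

Each input row therefore yields four files, one per subfolder, all sharing the `Report_id_Last_name_First_name` stem.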
clinfly_app_cli.py CHANGED

    @@ -2,9 +2,23 @@ import csv
      import os
      import argparse
      import pandas as pd
    - from utilities.anonymize import get_cities_list,get_abbreviation_dict_correction, reformat_to_report, anonymize_analyzer, anonymize_engine, add_space_to_comma_endpoint,get_list_not_deidentify, config_deidentify
    + from utilities.anonymize import (
    +     get_cities_list,
    +     get_abbreviation_dict_correction,
    +     reformat_to_report,
    +     anonymize_analyzer,
    +     anonymize_engine,
    +     add_space_to_comma_endpoint,
    +     get_list_not_deidentify,
    +     config_deidentify,
    + )
      from utilities.translate import get_translation_dict_correction, translate_report
    - from utilities.convert import convert_df_no_header, convert_df, convert_json, convert_list_phenogenius
    + from utilities.convert import (
    +     convert_df_no_header,
    +     convert_df,
    +     convert_json,
    +     convert_list_phenogenius,
    + )
      from utilities.extract_hpo import add_biometrics, extract_hpo
      from utilities.get_model import get_models, get_nlp_marian
      import gc
    @@ -31,7 +45,9 @@ def main():
              analyzer_results_return,
              _,
              _,
    -     ) = anonymize_analyzer(MarianText_report, analyzer, proper_noun, Last_name, First_name)
    +     ) = anonymize_analyzer(
    +         MarianText_report, analyzer, proper_noun, Last_name, First_name
    +     )

          print(MarianText_anonymize_report_analyze)

    @@ -43,17 +59,32 @@ def main():
              [x for x in MarianText_anonymize_report_engine.split("\n")]
          )

    -
    -
          MarianText_anonymize_report_engine_df = MarianText_anonymize_report_engine_modif
    -     with open(os.path.join(args.result_dir,"Reports","") + Last_name + "_" + First_name + "_translated_and_deindentified_report.txt", 'w') as file:
    -         file.write(convert_df_no_header(MarianText_anonymize_report_engine_df).decode("utf-8"))
    -         print("Text file created successfully : " + Last_name + "_" + First_name + "_translated_and_deindentified_report.txt")
    +     with open(
    +         os.path.join(args.result_dir, "Reports", "")
    +         + Report_id
    +         + "_"
    +         + Last_name
    +         + "_"
    +         + First_name
    +         + "_translated_and_deindentified_report.txt",
    +         "w",
    +     ) as file:
    +         file.write(
    +             convert_df_no_header(MarianText_anonymize_report_engine_df).decode("utf-8")
    +         )
    +         print(
    +             "Text file created successfully : "
    +             + Report_id
    +             + "_"
    +             + Last_name
    +             + "_"
    +             + First_name
    +             + "_translated_and_deindentified_report.txt"
    +         )

          print("Summarization")

    -
    -
          MarianText_anonymized_reformat_space = add_space_to_comma_endpoint(
              MarianText_anonymize_report_engine, nlp_fr
          )
    @@ -88,64 +119,133 @@ def main():
          clinphen_all = clinphen_all[cols]

          clinphen_df = clinphen_all
    -     clinphen_df_without_low_confidence = clinphen_df[clinphen_df["To keep in list"]== True]
    +     clinphen_df_without_low_confidence = clinphen_df[
    +         clinphen_df["To keep in list"] == True
    +     ]
          del clinphen
          del clinphen_unsafe_check_raw
          gc.collect()

    +     with open(
    +         os.path.join(args.result_dir, "TSV", "")
    +         + Report_id
    +         + "_"
    +         + Last_name
    +         + "_"
    +         + First_name
    +         + "_summarized_report.tsv",
    +         "w",
    +     ) as file:
    +         file.write(convert_df(clinphen_df).decode("utf-8"))
    +         print(
    +             "Tsv file created successfully : "
    +             + os.path.join(args.result_dir, "TSV", "")
    +             + Report_id
    +             + "_"
    +             + Last_name
    +             + "_"
    +             + First_name
    +             + "_summarized_report.tsv"
    +         )

    -     with open(os.path.join(args.result_dir,"TSV","") + Last_name + "_" + First_name + "_summarized_report.tsv", 'w') as file:
    -         file.write(convert_df(clinphen_df).decode("utf-8"))
    -         print("Tsv file created successfully : " + Last_name + "_" + First_name + "_summarized_report.tsv")
    -
    -
    -     with open(os.path.join(args.result_dir,"JSON","") + Last_name + "_" + First_name + "_summarized_report.json", 'w') as file:
    -         file.write(convert_json(clinphen_df_without_low_confidence))
    -         print("JSON file created successfully : " + Last_name + "_" + First_name + "_summarized_report.json")
    -
    -
    -     with open(os.path.join(args.result_dir,"TXT","") + Last_name + "_" + First_name + "_summarized_report.txt", 'w') as file:
    -         file.write(convert_list_phenogenius(clinphen_df_without_low_confidence))
    -         print("Text file created successfully : " + Last_name + "_" + First_name + "_summarized_report.txt")
    +     with open(
    +         os.path.join(args.result_dir, "JSON", "")
    +         + Report_id
    +         + "_"
    +         + Last_name
    +         + "_"
    +         + First_name
    +         + "_summarized_report.json",
    +         "w",
    +     ) as file:
    +         file.write(convert_json(clinphen_df_without_low_confidence))
    +         print(
    +             "JSON file created successfully : "
    +             + os.path.join(args.result_dir, "JSON", "")
    +             + Report_id
    +             + "_"
    +             + Last_name
    +             + "_"
    +             + First_name
    +             + "_summarized_report.json"
    +         )

    +     with open(
    +         os.path.join(args.result_dir, "TXT", "")
    +         + Report_id
    +         + "_"
    +         + Last_name
    +         + "_"
    +         + First_name
    +         + "_summarized_report.txt",
    +         "w",
    +     ) as file:
    +         file.write(convert_list_phenogenius(clinphen_df_without_low_confidence))
    +         print(
    +             "Text file created successfully : "
    +             + os.path.join(args.result_dir, "TXT", "")
    +             + Report_id
    +             + "_"
    +             + Last_name
    +             + "_"
    +             + First_name
    +             + "_summarized_report.txt"
    +         )


      if __name__ == "__main__":

          print("Welcome to the Clinfly app")

    -
          parser = argparse.ArgumentParser(description="Description of clinfly arguments")
    -     parser.add_argument("--file", type=str,help="the input file which contains the visits informations", required=True)
    -     parser.add_argument("--language", choices=['fr', 'es', 'de'],type=str, help="The language of the input : fr, es , de",required=True)
    -     parser.add_argument("--model_dir",default=os.path.expanduser("~"),type=str, help="The directory where the models will be downloaded.")
    -     parser.add_argument("--result_dir",default="Results",type=str, help="The directory where the results will be placed.")
    -
    -
    +     parser.add_argument(
    +         "--file",
    +         type=str,
    +         help="the input file which contains the visits informations",
    +         required=True,
    +     )
    +     parser.add_argument(
    +         "--language",
    +         choices=["fr", "es", "de"],
    +         type=str,
    +         help="The language of the input : fr, es , de",
    +         required=True,
    +     )
    +     parser.add_argument(
    +         "--model_dir",
    +         default=os.path.expanduser("~"),
    +         type=str,
    +         help="The directory where the models will be downloaded.",
    +     )
    +     parser.add_argument(
    +         "--result_dir",
    +         default="Results",
    +         type=str,
    +         help="The directory where the results will be placed.",
    +     )

          args = parser.parse_args()

    -
          if not os.path.exists(args.model_dir):
              os.makedirs(args.model_dir)

          if not os.path.exists(args.result_dir):
              os.makedirs(args.result_dir)
    -
    -     if not os.path.exists(os.path.join(args.result_dir,"Reports")):
    -         os.makedirs(os.path.join(args.result_dir,"Reports"))

    -     if not os.path.exists(os.path.join(args.result_dir,"TSV")):
    -         os.makedirs(os.path.join(args.result_dir,"TSV"))
    +     if not os.path.exists(os.path.join(args.result_dir, "Reports")):
    +         os.makedirs(os.path.join(args.result_dir, "Reports"))
    +
    +     if not os.path.exists(os.path.join(args.result_dir, "TSV")):
    +         os.makedirs(os.path.join(args.result_dir, "TSV"))

    -     if not os.path.exists(os.path.join(args.result_dir,"JSON")):
    -         os.makedirs(os.path.join(args.result_dir,"JSON"))
    +     if not os.path.exists(os.path.join(args.result_dir, "JSON")):
    +         os.makedirs(os.path.join(args.result_dir, "JSON"))

    -     if not os.path.exists(os.path.join(args.result_dir,"TXT")):
    -         os.makedirs(os.path.join(args.result_dir,"TXT"))
    +     if not os.path.exists(os.path.join(args.result_dir, "TXT")):
    +         os.makedirs(os.path.join(args.result_dir, "TXT"))

          print("Language chosen :", args.language)
    -     models_status = get_models(args.language,args.model_dir)
    +     models_status = get_models(args.language, args.model_dir)
          dict_correction = get_translation_dict_correction()
          dict_abbreviation_correction = get_abbreviation_dict_correction()
          proper_noun = get_list_not_deidentify()
    @@ -153,16 +253,18 @@ if __name__ == "__main__":
          analyzer, engine = config_deidentify(cities_list)
          nlp_fr, marian_fr_en = get_nlp_marian(args.language)

    -     file_name = args.file
    -     Last_name :str
    -     First_name : str
    -     Report : str
    -     with open(file_name, 'r') as fichier:
    +     file_name = args.file
    +     Report_id: str
    +     Last_name: str
    +     First_name: str
    +     Report: str
    +     with open(file_name, "r") as fichier:
              for ligne in fichier:
    -             elements = ligne.strip().split('\t')
    -             Last_name, First_name, Report = elements
    +             elements = ligne.strip().split("\t")
    +             Report_id, Last_name, First_name, Report = elements
    +             print("Report_id:", Report_id)
                  print("Last_name:", Last_name)
                  print("First_name:", First_name)
                  print("Report:", Report)
                  main()
    -             print()
    +             print()
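The four `if not os.path.exists(...)` / `os.makedirs(...)` pairs in this diff could probably be collapsed with the `exist_ok=True` flag of `os.makedirs`. The sketch below is only an illustration under that assumption: `ensure_output_dirs` and `OUTPUT_FOLDERS` are hypothetical names that do not appear in the commit, and the folder names are taken from the diff.

```python
import os
import tempfile

# Subfolder names taken from the diff above.
OUTPUT_FOLDERS = ("Reports", "TSV", "JSON", "TXT")


def ensure_output_dirs(result_dir):
    """Hypothetical helper: create result_dir and its four subfolders if missing."""
    created = []
    for folder in OUTPUT_FOLDERS:
        path = os.path.join(result_dir, folder)
        # exist_ok=True makes the call a no-op when the folder already exists,
        # and intermediate directories (result_dir itself) are created as needed.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_output_dirs(os.path.join(tmp, "Results"))
    # Calling the helper a second time does not raise.
    ensure_output_dirs(os.path.join(tmp, "Results"))
    print([os.path.basename(d) for d in dirs])
```

This keeps the behaviour of the existing checks while avoiding the gap between testing for a folder and creating it.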
data/test.tsv ADDED

    @@ -0,0 +1,2 @@
    + Report_id_1 Doe John Chers collegues, j'ai recu en consultation M. John Doe né le 14/07/1789 pour une fièvre récurrente et une maladie de Crohn. Il a pour antécédent des epistaxis recurrents. Parmi les antécédants familiaux, sa maman a présenté un cancer des ovaires. Il mesure 1.90 m (+2.5 DS), pèse 93 kg (+3.6 DS) et son PC est à 57 cm (+0DS) ...
    + Report_id_X Doe John Chers collegues, j'ai recu en consultation M. John Doe né le 14/07/1789 pour une fièvre récurrente et une maladie de Crohn. Il a pour antécédent des epistaxis recurrents. Parmi les antécédants familiaux, sa maman a présenté un cancer des ovaires. Il mesure 1.90 m (+2.5 DS), pèse 93 kg (+3.6 DS) et son PC est à 57 cm (+0DS) ...
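The reading loop in `clinfly_app_cli.py` unpacks every line into exactly four tab-separated fields, so a blank or malformed line would raise a `ValueError` at the unpacking step. A minimal sketch of a more defensive reader for files shaped like `data/test.tsv` follows. `read_reports` is a hypothetical function, not part of the commit, and the sample text is shortened from the test file.

```python
import csv
import io


def read_reports(handle):
    """Yield (report_id, last_name, first_name, report) from a tab-separated stream."""
    # QUOTE_NONE keeps quotes in the clinical text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines instead of failing on unpacking
        if len(row) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(row)}"
            )
        yield tuple(field.strip() for field in row)


# Shortened sample in the same four-column layout as data/test.tsv.
sample = io.StringIO(
    "Report_id_1\tDoe\tJohn\tChers collegues, j'ai recu en consultation M. John Doe ...\n"
    "\n"
    "Report_id_X\tDoe\tJohn\tChers collegues, j'ai recu en consultation M. John Doe ...\n"
)
rows = list(read_reports(sample))
print([row[0] for row in rows])
```

Raising an error that names the offending line makes it easier to locate a badly formatted row in a large input file.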