Spaces: kyauy/ClinFly

Kévin Yauy committed on Apr 23, 2024

Commit 5472b6c (parent: ea2f6c6)

fix(write): fix writing issue and hf compatibility

Files changed (4)

.gitignore +2 -0
README.md +23 -24
clinfly_app_cli.py +152 -50
data/test.tsv +2 -0

.gitignore CHANGED

@@ -4,3 +4,5 @@ __pycache__/
 .vscode/
 .venv/
 poetry.lock

 .vscode/
 .venv/
 poetry.lock
+.DS*
+output/

README.md CHANGED

@@ -1,9 +1,9 @@
 ---
 title: ClinFly
 emoji: small_airplane
-sdk_version: 1.21.0
-streamlit_file: clinfly_app_st.py
-CLI_file: clinfly_app_cli.py
 pinned: true
 ---
@@ -16,56 +16,55 @@ Contact : [kevin.yauy@chu-montpellier.fr](mailto:kevin.yauy@chu-montpellier.fr)
 ## Introduction
-Precision medicine (PM) for rare diseases requires both precision phenotyping and data sharing. However, the majority of digital phenotyping tools only deal with the English language.
-Using French as a proof of concept, we have developed ClinFly, an automated framework to anonymize, translate and summarize clinical reports using Human Phenotype Ontology (HPO) terms compliant with medical data privacy standards. The output consists of a de-identified translated clinical report and a summary report in HPO format.
-By facilitating the translation and anonymization of clinical reports, ClinFly has the potential to facilitate inter-hospital data sharing, accelerate medical discoveries and open up the possibility of an international patient file without limitations due to non-English speakers.
 ## Pipeline
 ![](img/pipeline.png)
-## Poetry Installation
-To install on your local machine, you need `poetry` package manager and launch in the folder:
 ```
 poetry install
 ```
-Using requirement ?
 ```
 poetry export --without-hashes --format=requirements.txt > requirements.txt
 ```
-## Run the code
-### Graphical User Interface - Single report usage with interactive analysis
-A webapp is accessible at https://huggingface.co/spaces/kyauy/ClinFly, please try it !
-It's a streamlit application, where code is accessible in `clinfly_app_st.py` file. The functions are accessible in the `utilities` folder.
-To run the streamlit application on your local computer :
 ```
 poetry shell
 streamlit run clinfly_app_st.py
 ```
-### Command Line Interface - Multiple report usage with offline options
-The code is accessible in `clinfly_app_cli.py` file. The functions are accessible in the `utilities` folder.
-The entry file must be a TSV .txt with the informations structured like this :
 ```
-Doe  John  Report
 ```
-The output will be placed in the `results` folder according to the file extension.
-A resume of the deidentify report will be generated and placed in the `results/Reports` folder.
-Three HPO extraction output will be generated, TSV, TXT and Json.
 To run the CLI application on your local computer :
 ```

 ---
 title: ClinFly
 emoji: small_airplane
+sdk_version: 1.25.0
+sdk: streamlit
+app_file: clinfly_app_st.py
 pinned: true
 ---
 ## Introduction
+ClinFly is an automated framework designed to facilitate precision medicine (PM) for rare diseases. It addresses the challenge of precision phenotyping and data sharing across different languages.
+ClinFly can anonymize, translate, and summarize clinical reports using Human Phenotype Ontology (HPO) terms, ensuring compliance with medical data privacy standards. The output includes a de-identified translated clinical report and a summary report in HPO format.
+By streamlining the translation and anonymization of clinical reports, ClinFly aims to enhance inter-hospital data sharing, expedite medical discoveries, and pave the way for an international patient file accessible to non-English speakers.
 ## Pipeline
 ![](img/pipeline.png)
+## Installation
+To install ClinFly on your local machine, you need the `poetry` package manager. Navigate to the project folder and run:
 ```
 poetry install
 ```
+If you need to generate a `requirements.txt` file, use the following command:
 ```
 poetry export --without-hashes --format=requirements.txt > requirements.txt
 ```
+## Usage
+### Graphical User Interface
+For single report usage with interactive analysis, ClinFly provides a web application accessible at https://huggingface.co/spaces/kyauy/ClinFly.
+To run the Streamlit application on your local computer, activate the poetry shell and run the `clinfly_app_st.py` file:
 ```
 poetry shell
 streamlit run clinfly_app_st.py
 ```
+### Command Line Interface
+For processing multiple reports with offline options, use the command line interface provided by `clinfly_app_cli.py`.
+The input should be a TSV .txt file structured as follows (see `data/test.tsv` for an example):
 ```
+Report_id_1   Doe  John  Report text
+...
+Report_id_X   Doe  John  Report text
 ```
+Outputs will be placed in the `results` folder according to the file extension, using the first three columns in the filename.
+- The de-identified report will be generated and placed in the `results/Reports` folder.
+- Three HPO extraction outputs will be generated in the `TSV`, `TXT` and `JSON` folders.
 To run the CLI application on your local computer :
 ```
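
The output layout described in the README hunk (one sub-folder per file type, filenames built from the first three input columns) can be sketched as below. This is a minimal illustration and not code from the commit: `build_output_paths` is a hypothetical helper. The `Results` default and the file suffixes are copied from `clinfly_app_cli.py` as it appears in this commit, including the `deindentified` spelling used there.

```python
import os


def build_output_paths(result_dir, report_id, last_name, first_name):
    """Hypothetical helper: map each output folder to the file written in it."""
    # The filename stem joins the first three TSV columns with underscores.
    stem = "_".join([report_id, last_name, first_name])
    suffixes = {
        "Reports": "_translated_and_deindentified_report.txt",
        "TSV": "_summarized_report.tsv",
        "JSON": "_summarized_report.json",
        "TXT": "_summarized_report.txt",
    }
    return {
        folder: os.path.join(result_dir, folder, stem + suffix)
        for folder, suffix in suffixes.items()
    }


paths = build_output_paths("Results", "Report_id_1", "Doe", "John")
for folder, path in paths.items():
    print(folder, "->", path)
```

Including the report identifier in the stem keeps two reports for the same patient from overwriting each other, which a name-only filename would not guarantee.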

clinfly_app_cli.py CHANGED

@@ -2,9 +2,23 @@ import csv
 import os
 import argparse
 import pandas as pd
-from utilities.anonymize import get_cities_list,get_abbreviation_dict_correction, reformat_to_report, anonymize_analyzer, anonymize_engine, add_space_to_comma_endpoint,get_list_not_deidentify, config_deidentify
 from utilities.translate import get_translation_dict_correction, translate_report
-from utilities.convert import convert_df_no_header, convert_df, convert_json, convert_list_phenogenius
 from utilities.extract_hpo import add_biometrics, extract_hpo
 from utilities.get_model import get_models, get_nlp_marian
 import gc
@@ -31,7 +45,9 @@ def main():
         analyzer_results_return,
         _,
         _,
-    ) = anonymize_analyzer(MarianText_report, analyzer, proper_noun, Last_name, First_name)
     print(MarianText_anonymize_report_analyze)
@@ -43,17 +59,32 @@ def main():
         [x for x in MarianText_anonymize_report_engine.split("\n")]
     )
     MarianText_anonymize_report_engine_df = MarianText_anonymize_report_engine_modif
-    with open(os.path.join(args.result_dir,"Reports","") + Last_name + "_" + First_name + "_translated_and_deindentified_report.txt", 'w') as file:
-            file.write(convert_df_no_header(MarianText_anonymize_report_engine_df).decode("utf-8"))
-    print("Text file created successfully : " + Last_name + "_" + First_name + "_translated_and_deindentified_report.txt")
     print("Summarization")
     MarianText_anonymized_reformat_space = add_space_to_comma_endpoint(
         MarianText_anonymize_report_engine, nlp_fr
     )
@@ -88,64 +119,133 @@ def main():
     clinphen_all = clinphen_all[cols]
     clinphen_df = clinphen_all
-    clinphen_df_without_low_confidence = clinphen_df[clinphen_df["To keep in list"]== True]
     del clinphen
     del clinphen_unsafe_check_raw
     gc.collect()
-    with open(os.path.join(args.result_dir,"TSV","") + Last_name + "_" + First_name + "_summarized_report.tsv", 'w') as file:
-            file.write(convert_df(clinphen_df).decode("utf-8"))
-    print("Tsv file created successfully : " + Last_name + "_" + First_name + "_summarized_report.tsv")
-    with open(os.path.join(args.result_dir,"JSON","") + Last_name + "_" + First_name + "_summarized_report.json", 'w') as file:
-            file.write(convert_json(clinphen_df_without_low_confidence))
-    print("JSON file created successfully : " + Last_name + "_" + First_name + "_summarized_report.json")
-    with open(os.path.join(args.result_dir,"TXT","") + Last_name + "_" + First_name + "_summarized_report.txt", 'w') as file:
-            file.write(convert_list_phenogenius(clinphen_df_without_low_confidence))
-    print("Text file created successfully : " + Last_name + "_" + First_name + "_summarized_report.txt")
 if __name__ == "__main__":
     print("Welcome to the Clinfly app")
     parser = argparse.ArgumentParser(description="Description of clinfly arguments")
-    parser.add_argument("--file", type=str,help="the input file which contains the visits informations", required=True)
-    parser.add_argument("--language", choices=['fr', 'es', 'de'],type=str, help="The language of the input : fr, es , de",required=True)
-    parser.add_argument("--model_dir",default=os.path.expanduser("~"),type=str, help="The directory where the models will be downloaded.")
-    parser.add_argument("--result_dir",default="Results",type=str, help="The directory where the results will be placed.")
     args = parser.parse_args()
     if not os.path.exists(args.model_dir):
         os.makedirs(args.model_dir)
     if not os.path.exists(args.result_dir):
         os.makedirs(args.result_dir)
-    if not os.path.exists(os.path.join(args.result_dir,"Reports")):
-        os.makedirs(os.path.join(args.result_dir,"Reports"))
-    if not os.path.exists(os.path.join(args.result_dir,"TSV")):
-        os.makedirs(os.path.join(args.result_dir,"TSV"))
-    if not os.path.exists(os.path.join(args.result_dir,"JSON")):
-        os.makedirs(os.path.join(args.result_dir,"JSON"))
-    if not os.path.exists(os.path.join(args.result_dir,"TXT")):
-        os.makedirs(os.path.join(args.result_dir,"TXT"))
     print("Language chosen :", args.language)
-    models_status = get_models(args.language,args.model_dir)
     dict_correction = get_translation_dict_correction()
     dict_abbreviation_correction = get_abbreviation_dict_correction()
     proper_noun = get_list_not_deidentify()
@@ -153,16 +253,18 @@ if __name__ == "__main__":
     analyzer, engine = config_deidentify(cities_list)
     nlp_fr, marian_fr_en = get_nlp_marian(args.language)
-    file_name = args.file
-    Last_name :str
-    First_name : str
-    Report : str
-    with open(file_name, 'r') as fichier:
         for ligne in fichier:
-            elements = ligne.strip().split('\t')
-            Last_name, First_name, Report = elements
             print("Last_name:", Last_name)
             print("First_name:", First_name)
             print("Report:", Report)
             main()
-            print()

 import os
 import argparse
 import pandas as pd
+from utilities.anonymize import (
+    get_cities_list,
+    get_abbreviation_dict_correction,
+    reformat_to_report,
+    anonymize_analyzer,
+    anonymize_engine,
+    add_space_to_comma_endpoint,
+    get_list_not_deidentify,
+    config_deidentify,
+)
 from utilities.translate import get_translation_dict_correction, translate_report
+from utilities.convert import (
+    convert_df_no_header,
+    convert_df,
+    convert_json,
+    convert_list_phenogenius,
+)
 from utilities.extract_hpo import add_biometrics, extract_hpo
 from utilities.get_model import get_models, get_nlp_marian
 import gc
         analyzer_results_return,
         _,
         _,
+    ) = anonymize_analyzer(
+        MarianText_report, analyzer, proper_noun, Last_name, First_name
+    )
     print(MarianText_anonymize_report_analyze)
         [x for x in MarianText_anonymize_report_engine.split("\n")]
     )
     MarianText_anonymize_report_engine_df = MarianText_anonymize_report_engine_modif
+    with open(
+        os.path.join(args.result_dir, "Reports", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_translated_and_deindentified_report.txt",
+        "w",
+    ) as file:
+        file.write(
+            convert_df_no_header(MarianText_anonymize_report_engine_df).decode("utf-8")
+        )
+    print(
+        "Text file created successfully : "
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_translated_and_deindentified_report.txt"
+    )
     print("Summarization")
     MarianText_anonymized_reformat_space = add_space_to_comma_endpoint(
         MarianText_anonymize_report_engine, nlp_fr
     )
     clinphen_all = clinphen_all[cols]
     clinphen_df = clinphen_all
+    clinphen_df_without_low_confidence = clinphen_df[
+        clinphen_df["To keep in list"] == True
+    ]
     del clinphen
     del clinphen_unsafe_check_raw
     gc.collect()
+    with open(
+        os.path.join(args.result_dir, "TSV", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.tsv",
+        "w",
+    ) as file:
+        file.write(convert_df(clinphen_df).decode("utf-8"))
+    print(
+        "Tsv file created successfully : "
+        + os.path.join(args.result_dir, "TSV", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.tsv"
+    )
+    with open(
+        os.path.join(args.result_dir, "JSON", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.json",
+        "w",
+    ) as file:
+        file.write(convert_json(clinphen_df_without_low_confidence))
+    print(
+        "JSON file created successfully : "
+        + os.path.join(args.result_dir, "JSON", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.json"
+    )
+    with open(
+        os.path.join(args.result_dir, "TXT", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.txt",
+        "w",
+    ) as file:
+        file.write(convert_list_phenogenius(clinphen_df_without_low_confidence))
+    print(
+        "Text file created successfully : "
+        + os.path.join(args.result_dir, "TXT", "")
+        + Report_id
+        + "_"
+        + Last_name
+        + "_"
+        + First_name
+        + "_summarized_report.txt"
+    )
 if __name__ == "__main__":
     print("Welcome to the Clinfly app")
     parser = argparse.ArgumentParser(description="Description of clinfly arguments")
+    parser.add_argument(
+        "--file",
+        type=str,
+        help="the input file which contains the visits informations",
+        required=True,
+    )
+    parser.add_argument(
+        "--language",
+        choices=["fr", "es", "de"],
+        type=str,
+        help="The language of the input : fr, es , de",
+        required=True,
+    )
+    parser.add_argument(
+        "--model_dir",
+        default=os.path.expanduser("~"),
+        type=str,
+        help="The directory where the models will be downloaded.",
+    )
+    parser.add_argument(
+        "--result_dir",
+        default="Results",
+        type=str,
+        help="The directory where the results will be placed.",
+    )
     args = parser.parse_args()
     if not os.path.exists(args.model_dir):
         os.makedirs(args.model_dir)
     if not os.path.exists(args.result_dir):
         os.makedirs(args.result_dir)
+    if not os.path.exists(os.path.join(args.result_dir, "Reports")):
+        os.makedirs(os.path.join(args.result_dir, "Reports"))
+    if not os.path.exists(os.path.join(args.result_dir, "TSV")):
+        os.makedirs(os.path.join(args.result_dir, "TSV"))
+    if not os.path.exists(os.path.join(args.result_dir, "JSON")):
+        os.makedirs(os.path.join(args.result_dir, "JSON"))
+    if not os.path.exists(os.path.join(args.result_dir, "TXT")):
+        os.makedirs(os.path.join(args.result_dir, "TXT"))
     print("Language chosen :", args.language)
+    models_status = get_models(args.language, args.model_dir)
     dict_correction = get_translation_dict_correction()
     dict_abbreviation_correction = get_abbreviation_dict_correction()
     proper_noun = get_list_not_deidentify()
     analyzer, engine = config_deidentify(cities_list)
     nlp_fr, marian_fr_en = get_nlp_marian(args.language)
+    file_name = args.file
+    Report_id: str
+    Last_name: str
+    First_name: str
+    Report: str
+    with open(file_name, "r") as fichier:
         for ligne in fichier:
+            elements = ligne.strip().split("\t")
+            Report_id, Last_name, First_name, Report = elements
+            print("Report_id:", Report_id)
             print("Last_name:", Last_name)
             print("First_name:", First_name)
             print("Report:", Report)
             main()
+            print()
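
The repeated `os.path.exists` / `os.makedirs` pairs in the script could be collapsed into a single loop. The sketch below is one possible simplification, not part of the commit: `ensure_result_dirs` and `SUBFOLDERS` are hypothetical names, and the demonstration runs in a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

# The four sub-folders the script writes into, as named in the commit.
SUBFOLDERS = ("Reports", "TSV", "JSON", "TXT")


def ensure_result_dirs(result_dir):
    """Hypothetical helper: create result_dir and its sub-folders if missing."""
    created = []
    for name in SUBFOLDERS:
        path = os.path.join(result_dir, name)
        # exist_ok=True makes the call a no-op when the folder already exists;
        # makedirs also creates result_dir itself as an intermediate folder.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_result_dirs(os.path.join(tmp, "Results"))
    ensure_result_dirs(os.path.join(tmp, "Results"))  # second call does not raise
    all_exist = all(os.path.isdir(d) for d in dirs)
    print(all_exist)
```

Using `exist_ok=True` also avoids the small race between checking for a folder and creating it.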

data/test.tsv ADDED

@@ -0,0 +1,2 @@
+Report_id_1 Doe John Chers collegues, j'ai recu en consultation M. John Doe né le 14/07/1789 pour une fièvre récurrente et une maladie de Crohn. Il a pour antécédent des epistaxis recurrents. Parmi les antécédants familiaux, sa maman a présenté un cancer des ovaires. Il mesure 1.90 m (+2.5 DS), pèse 93 kg (+3.6 DS) et son PC est à 57 cm (+0DS) ...
+Report_id_X Doe John Chers collegues, j'ai recu en consultation M. John Doe né le 14/07/1789 pour une fièvre récurrente et une maladie de Crohn. Il a pour antécédent des epistaxis recurrents. Parmi les antécédants familiaux, sa maman a présenté un cancer des ovaires. Il mesure 1.90 m (+2.5 DS), pèse 93 kg (+3.6 DS) et son PC est à 57 cm (+0DS) ...
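
A file in this four-column layout can be read with a column-count check, so that a malformed line fails with a clear message instead of an unpacking error. The sketch below is an illustration under stated assumptions and is not how the commit reads its input (the script splits each line on `"\t"` directly). `ReportRow` and `read_reports` are hypothetical names, and the sample rows are abbreviated from `data/test.tsv`, assuming tab-separated columns.

```python
import csv
import io
from typing import Iterator, NamedTuple


class ReportRow(NamedTuple):
    """One input line: the four tab-separated columns, in order."""
    report_id: str
    last_name: str
    first_name: str
    report: str


def read_reports(handle) -> Iterator[ReportRow]:
    """Hypothetical reader: yield one ReportRow per non-empty line."""
    # QUOTE_NONE keeps quote characters in the report text as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 tab-separated columns, "
                f"got {len(fields)}"
            )
        yield ReportRow(*fields)


# Abbreviated sample; the real report text in data/test.tsv is longer.
sample = io.StringIO(
    "Report_id_1\tDoe\tJohn\tChers collegues, j'ai recu en consultation M. John Doe ...\n"
    "\n"
    "Report_id_X\tDoe\tJohn\tChers collegues, j'ai recu en consultation M. John Doe ...\n"
)
rows = list(read_reports(sample))
for row in rows:
    print(row.report_id, row.last_name, row.first_name)
```

Returning a named tuple makes it possible to pass each row to the processing function as an argument, as an alternative to assigning module-level variables before each call.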