	---
	license: apache-2.0
	language:
	- en
	base_model:
	- google-bert/bert-base-uncased
	---
	# Model Card for `query_intent` model

	The `query_intent` model is a fine-tuned BERT classifier intended for query augmented generation (QAG) on the Genomic Data Commons (GDC). The model accepts a natural language query as input and outputs a label, or "intent". During query augmented generation, this label maps the query to a GDC API endpoint.

	## Model Details

	### Model Description

	Model Name (model_id): `uc-ctds/query_intent`

	Model Description:

	The model is fine-tuned from the `bert-base-uncased` base model using synthetically generated pairs of queries and labels. Details of the synthetic data generation will be presented in the upcoming paper. All training data is open-source, with gene, mutation, and cancer information obtained using the `/ssm` GDC API endpoint.

	This model is used in the [GDC QAG](https://huggingface.co/spaces/uc-ctds/GDC-QAG) web app running on HuggingFace Spaces.

	- Developed by: Center for Translational Data Science
	- Model type: BERT
	- Language(s) (NLP): English
	- License: apache-2.0
	- Finetuned from model: `google-bert/bert-base-uncased`

	Model Parameters: 109M


	### Model Sources

	- Repository: `https://huggingface.co/uc-ctds/query_intent`
	- Paper: coming soon
	- Demo: `https://huggingface.co/spaces/uc-ctds/GDC-QAG`


	## Uses

	The model is intended to be used for cancer research, on select precision oncology use cases. The model accepts a natural language query as input and outputs an intent label.

	Example Input

	What is the co-occurrence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?

	Example Output

	`freq_cnv_loss_or_gain`

	The model is trained on queries concerning frequencies of simple somatic mutations, frequencies of copy number variant losses or gains, frequencies of microsatellite instability, or frequencies of combination variants. In QAG, this model classifies each query into one of the `labels` listed below.

	| Query use case supported | Label |
	|---|---|
	| frequencies of simple somatic mutations | `ssm_frequency` |
	| frequencies of copy number variant losses or gains | `freq_cnv_loss_or_gain` |
	| frequency of microsatellite instability | `msi_h_frequency` |
	| frequency of copy number variants and/or simple somatic mutations | `cnv_and_ssm` |
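
For downstream code, the four supported labels can be kept in a small lookup so that unexpected classifier output is caught early. The sketch below is illustrative only: the names `SUPPORTED_INTENTS` and `describe_intent` are hypothetical and not part of the released model or the GDC QAG app; the labels and descriptions are copied from the list above.

```python
# Labels and descriptions copied from the model card.
# SUPPORTED_INTENTS and describe_intent are hypothetical helper names.
SUPPORTED_INTENTS = {
    "ssm_frequency": "frequencies of simple somatic mutations",
    "freq_cnv_loss_or_gain": "frequencies of copy number variant losses or gains",
    "msi_h_frequency": "frequency of microsatellite instability",
    "cnv_and_ssm": "frequency of copy number variants and/or simple somatic mutations",
}


def describe_intent(label):
    """Return the use-case description for a supported intent label."""
    try:
        return SUPPORTED_INTENTS[label]
    except KeyError:
        # Anything outside the four trained labels is treated as unsupported.
        raise ValueError(f"unsupported intent label: {label!r}") from None


print(describe_intent("freq_cnv_loss_or_gain"))
```

Raising on unknown labels, rather than silently passing them through, keeps a mislabelled query from being routed to the wrong step later in the pipeline.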

	### Direct Use

	The primary use is for QAG, where the model output serves as an intermediate step towards the final result.

	### Out-of-Scope Use

	The model is trained on a limited set of use cases. It will not work well for any use case outside of those it was trained on. Please see the supported query use cases under `Uses`.
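
Because the classifier always returns one of its trained labels, an application may want a guard for queries that fall outside the supported use cases. One possible approach, sketched below, is to reject predictions whose softmax probability falls below a threshold. This is an assumption for illustration, not something the model card or the GDC QAG app is known to do: the function name `gated_intent`, the `min_confidence` value of 0.9, and the label order in the example are all hypothetical. Softmax confidence is only a rough signal for out-of-scope input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_intent(logits, labels, min_confidence=0.9):
    """Return the top label, or None if the top probability is too low.

    `labels` must be ordered by class index; `min_confidence` is a
    hypothetical threshold that would need tuning on held-out data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None
    return labels[best]


# Hypothetical label order, for illustration only.
labels = ["ssm_frequency", "freq_cnv_loss_or_gain", "msi_h_frequency", "cnv_and_ssm"]
print(gated_intent([5.0, 0.0, 0.0, 0.0], labels))
print(gated_intent([1.0, 0.9, 1.1, 1.0], labels))
```

In this sketch the first call is confident enough to return a label, while the second, with nearly uniform logits, returns `None` so the caller can fall back to a different path.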


	## How to Get Started with the Model

	Use the code below to get started with the model.

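	```
	import torch
	from transformers import AutoTokenizer, BertForSequenceClassification

	model_id = 'uc-ctds/query_intent'
	AUTH_TOKEN = None  # set to a Hugging Face access token if your environment requires one
	query = 'What is the co-occurrence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?'

	# Load the tokenizer and the fine-tuned classifier
	intent_tok = AutoTokenizer.from_pretrained(
	    model_id, trust_remote_code=True,
	    token=AUTH_TOKEN
	)
	intent_model = BertForSequenceClassification.from_pretrained(model_id)
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	intent_model.to(device)
	intent_model.eval()

	def predict_intent(query, intent_labels):
	    # intent_labels: dict mapping label name -> class index, as used in training
	    inputs = intent_tok(query, return_tensors="pt", truncation=True, padding=True)
	    inputs = {k: v.to(device) for k, v in inputs.items()}
	    with torch.no_grad():
	        outputs = intent_model(**inputs)
	    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
	    predicted_label = torch.argmax(probs, dim=1).item()
	    # Reverse lookup: class index -> label name
	    for k, v in intent_labels.items():
	        if v == predicted_label:
	            return k
	    return None
	```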

	## Training Details


	### Training Procedure

	A paired dataset of N=63756 synthetic questions and labels was generated using template questions and open-source data from the GDC API for genes and mutations observed in different cancer types. The dataset was split 70:30 into training and evaluation sets using `sklearn`, and the model was trained for two epochs.
	Please refer to our [GitHub repo](https://github.com/uc-cdis/gdc-qag/blob/feature/gradio/notebooks/synthetic_data_query_intent_train_bert.ipynb) for training details.
	Training was performed on an A100 GPU with 40 GB RAM in an on-prem GPU cluster.

	The dataset used for training is available [here](https://huggingface.co/datasets/uc-ctds/query_intent_dataset).
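
To make the 70:30 split concrete, the sketch below shows a dependency-free shuffle split over (query, label) pairs. It is a stand-in for illustration, not the authors' code: the training notebook uses `sklearn`, where a call such as `sklearn.model_selection.train_test_split(..., test_size=0.3)` would be the usual equivalent. The function name `split_70_30`, the seed, and the placeholder rows are hypothetical.

```python
import random


def split_70_30(pairs, seed=0):
    """Shuffle (query, label) pairs and split them 70:30.

    Stand-in for an sklearn-based split; the seed is an arbitrary choice.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    n_train = len(shuffled) * 7 // 10
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder rows standing in for the N=63756 synthetic pairs.
pairs = [(f"query {i}", "ssm_frequency") for i in range(63756)]
train, evaluation = split_70_30(pairs)
print(len(train), len(evaluation))
```

With N=63756 this yields 44629 training pairs and 19127 evaluation pairs.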

	### Compute Infrastructure

	The model is intended to be used for QAG, and can be run on a V100 GPU with 16 GB of GPU RAM.

	## Citation

	Coming soon

	BibTeX: