---
license: other
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- gene-set-analysis
- biomedical
- reasoning
- llama
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.2-1B-Instruct
- meta-llama/Llama-3.2-3B-Instruct
---

# Overview of Gene-R1

**Introduction**

- Gene-R1 is a data-augmented learning framework that equips lightweight, open-source LLMs with step-by-step reasoning capabilities tailored to the gene set analysis task.
- It is fine-tuned on ~270K gene sets collected from 16 genomic databases.
- Experimental results show that Gene-R1 achieves substantial performance gains, with performance matching that of commercial LLMs.
- For more details, please see our [paper](https://www.worldscientific.com/doi/abs/10.1142/9789819824755_0035) (PSB, 2026).

**Gene-R1 supports gene set analysis through fine-tuned small language models (SLMs) that can be deployed locally.**
The model is available in three versions:
- [Gene-R1-8B](https://huggingface.co/ncbi/Gene-R1-8B): fine-tuned from Llama-3.1-8B-Instruct.
- [Gene-R1-1B](https://huggingface.co/ncbi/Gene-R1-1B): fine-tuned from Llama-3.2-1B-Instruct.
- [Gene-R1-3B](https://huggingface.co/ncbi/Gene-R1-3B): fine-tuned from Llama-3.2-3B-Instruct.


# Model Deployment for Private Gene Set Analysis

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ncbi/Gene-R1-8B"
hf_token = "xxxxxxxxx"  # Your Hugging Face access token

tokenizer_test = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model_test = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    token=hf_token,
)


def complete_chat(system, prompt, model, tokenizer):
    # Build the plain-text prompt and move the tensors to the model's device.
    text = '#SYSTEM: \n' + system + '#USER: \n' + prompt + ' #Assistant: \n'
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # Greedy decoding (do_sample=False) keeps the output deterministic.
    outputs = model.generate(**inputs, max_new_tokens=4000, do_sample=False)
    # The decoded text contains the prompt followed by the generated analysis.
    return tokenizer.decode(outputs[0])


system = "You are an efficient and insightful assistant to a molecular biologist."


def users(genes):
    return f"""
Write a critical analysis of the biological processes performed by this system of interacting proteins.
Base your analysis on prior knowledge available in your training data.
After the analysis, propose a brief name for the most prominent biological process performed by the system.
Place the name at the top of the analysis in the format: "Process: <name>".
Be concise. Avoid unnecessary words.
Use plain text only. Do not include format symbols such as asterisks, dashes, or bullets.
Be specific. Avoid overly general statements such as "the proteins are involved in various cellular processes."
Be factual. Do not include editorial opinions or unsupported claims.
For each important point, clearly explain your reasoning and provide supporting information.
For each identified biological function, specify the corresponding gene names.
Here is the gene set: {genes}
"""


def llama(genes):
    # Use commas as the only separator between gene symbols.
    genes = genes.replace("/", ",").replace(" ", ",")
    prompt = users(genes)
    summary = complete_chat(system, prompt, model_test, tokenizer_test)
    return summary


if __name__ == "__main__":
    genes = "xxxxxxxxx"  # Your private gene set, separated by commas (,)
    result = llama(genes)
    print(result)
```

The expected output looks like:
```
  Process: Pancreatic development and glucose homeostasis
  
  1. PDX1 is a homeodomain transcription factor involved in the specification of the early pancreatic epithelium and its subsequent differentiation. 
  It activates the transcription of several genes including insulin, somatostatin, glucokinase and glucose transporter type 2. 
  It is essential for maintenance of the normal hormone-producing phenotype in the pancreatic beta-cell. 
  In pancreatic acinar cells, forms a complex with PBX1b and MEIS2b and mediates the activation of the ELA1 enhancer.
  
  2. NKX6-1 is also a transcription factor involved in the development of pancreatic beta-cells during the secondary transition. 
  Together with NKX2-2 and IRX3, controls the generation of motor neurons in the neural tube and belongs to the neural progenitor 
  factors induced by Sonic Hedgehog (SHH) signals.
  
  3.GCG and GLP1, respectively glucagon and glucagon-like peptide 1, are involved in glucose metabolism and homeostasis. 
  GCG raises blood glucose levels by promoting gluconeogenesis and is the counter regulatory hormone of Insulin. 
  GLP1 is a potent stimulator of Glucose-Induced Insulin Secretion (GSIS). Plays roles in gastric motility and suppresses blood glucagon levels. 
  Promotes growth of the intestinal epithelium and pancreatic islet mass both by islet neogenesis and islet cell proliferation.
  
  4. SLC2A2, also known as GLUT2, is a facilitative hexose transporter. In hepatocytes, it mediates bi-directional transport of glucose accross the plasma membranes, 
  while in the pancreatic beta-cell, it is the main transporter responsible for glucose uptake and part of the cell's glucose-sensing mechanism. 
  It is involved in glucose transport in the small intestine and kidney too.
  
  To summarize, the genes in this set are involved in the specification, differentiation, growth and functionality of the pancreas, 
  with a particular emphasis on the pancreatic beta-cell. Particularly, the architecture of the pancreatic islet ensures proper glucose sensing 
  and homeostasis via a number of different hormones and receptors that can elicit both synergistic and antagonistic effects in the pancreas itself and other peripheral tissues.
```
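Because `complete_chat` decodes the full sequence, the returned text contains the prompt as well as the analysis. The sketch below is one possible way to keep only the text after the `#Assistant:` marker and read the `Process: <name>` line from it. The helper names `extract_answer` and `extract_process_name` are hypothetical and are not part of the Gene-R1 code; the parsing assumes the output follows the format requested in the prompt.

```python
import re


def extract_answer(decoded):
    """Return the text after the last '#Assistant:' marker (hypothetical helper)."""
    _, marker, answer = decoded.rpartition("#Assistant:")
    # If the marker is missing, fall back to the whole text.
    return answer.strip() if marker else decoded.strip()


def extract_process_name(decoded):
    """Return the name on the 'Process: <name>' line, or None if it is absent."""
    answer = extract_answer(decoded)
    match = re.search(r"^[ \t]*Process:[ \t]*(.+?)[ \t]*$", answer, flags=re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    decoded = (
        "#SYSTEM: \nYou are an assistant.#USER: \nHere is the gene set: PDX1,GCG\n"
        " #Assistant: \n"
        "  Process: Pancreatic development and glucose homeostasis\n\n"
        "  1. PDX1 is a homeodomain transcription factor.\n"
    )
    print(extract_process_name(decoded))
```

A return value of `None` indicates that the output did not follow the requested format.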

⚠️ **Notice: The output does not always follow the instructions. If this happens, try running the generation again.**

For more details on model usage, see our [GitHub repository](https://github.com/ncbi-nlp/Gene-R1).
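The `llama` helper replaces `/` and spaces with commas, so an input such as `PDX1, GCG` ends up with an empty entry between two commas. The sketch below shows one possible way to clean a gene list before building the prompt. `normalize_genes` is a hypothetical helper, and removing duplicates is an assumption made here rather than something the original code does.

```python
import re


def normalize_genes(genes):
    """Return a comma-separated gene string without empty or repeated entries."""
    # Split on commas, slashes, and any whitespace in a single pass.
    symbols = [s for s in re.split(r"[,/\s]+", genes) if s]
    # dict.fromkeys keeps the first occurrence of each symbol, in input order.
    return ",".join(dict.fromkeys(symbols))


if __name__ == "__main__":
    print(normalize_genes("PDX1, NKX6-1/GCG  SLC2A2,,PDX1"))
```

The result can be passed to `users(...)` in place of the chained `replace` calls. Letter case is left untouched because gene symbols can be case-sensitive.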

# Download statistics

Hugging Face tracks downloads automatically based on requests to model query files such as `config.json`. 
To ensure downloads are counted, please load the full models directly from the Hub using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ncbi/Gene-R1-8B"
hf_token = "xxxxxxxxx"  # Your Hugging Face access token
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", token=hf_token)
```

# Acknowledgments

This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH). 
The contributions of the NIH authors are considered Works of the United States Government. 
The findings and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.

# Disclaimer

These models show the results of research conducted in the Computational Biology Branch, NCBI/NLM. 
The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. 
Individuals should not change their health behavior solely on the basis of information produced on this website. 
NIH does not independently verify the validity or utility of the information produced by this tool. 
If you have questions about the information produced on this website, please see a health care professional.
More information about NCBI's disclaimer policy is available.

# Citation

```bibtex
@inproceedings{wang2025gene,
  title={Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis},
  author={Wang, Zhizheng and Yang, Yifan and Jin, Qiao and Lu, Zhiyong},
  booktitle={Biocomputing 2026: Proceedings of the Pacific Symposium},
  pages={494--507},
  year={2025},
  organization={World Scientific}
}
```