mgtotaro committed · Commit 7d6838f · 1 Parent(s): 08446bb

README update

Files changed (3)
  1. LICENSE +1 -1
  2. header.md +7 -3
  3. instructions.md +18 -10
LICENSE CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2024, Massimo G. Totaro All rights reserved.
+ Copyright (c) 2024-2025, Massimo G. Totaro All rights reserved.
 
  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
 
header.md CHANGED
@@ -1,5 +1,9 @@
- Calculate the fitness of single amino acid substitutions on proteins, using a [zero-shot](https://doi.org/10.1101/2021.07.09.450648) [language model predictor](https://github.com/facebookresearch/esm)
+ Calculate the fitness of single amino acid substitutions on proteins, using a [zero-shot](https://doi.org/10.1101/2021.07.09.450648) protein language predictor of the [ESM model family](https://huggingface.co/facebook/esm2_t6_8M_UR50D).
 
- **WARNING:**
+ **UPDATE:**
+ [Profluent-Bio](https://huggingface.co/Profluent-Bio)'s [E1 model family](https://huggingface.co/Profluent-Bio/E1-150m) is now available for inference.
+
+ **WARNING:**
  Due to high server traffic, the tool might become slow or unresponsive.
- In this case, it is recommended to duplicate and clone the space in your personal HuggingFace account by clicking the top right menu.
+ In this case, it is recommended to duplicate and clone the space in your personal HuggingFace account by clicking [here](https://huggingface.co/spaces/thaidaev/zsp?duplicate=true).
+ In the top right corner, there are options to run the app locally or clone the repository.
instructions.md CHANGED
@@ -7,7 +7,7 @@ If the server remains idle for a period, it will enter standby mode. Running a c
  ## Input
 
  **Sequence**: Enter the full amino acid sequence to be analyzed in the **Sequence** text box.
- Note: While jolly characters (e.g., `-X.B`) can be included, they currently cannot be visualised.
+ Note: While jolly characters (e.g., `-X.B`) can be included, they currently cannot be visualized.
 
  **Substitutions**: Specify the substitutions you wish to test in the **Substitutions** box. The tool supports three running modes based on your input:
 
@@ -16,22 +16,25 @@ If the server remains idle for a period, it will enter standby mode. Running a c
  - **Same-Length Sequence**: Analyze differing amino acid substitutions one by one within sequences of equal length.
  - **Different Inputs**: For any other input format, a deep mutational scan of the full sequence will be performed.
 
- **Model Selection**: Choose an ESM model for calculations from those available on Hugging Face Model Hub.
- The model `esm2_t33_650M_UR50D` offers an optimal balance between cost and accuracy [*](https://doi.org/10.1126/science.ade2574).
+ **Model Selection**: Choose a model for calculations from those available on Hugging Face Model Hub.
+ The `esm2_t33_650M_UR50D` model offers an optimal balance between cost and accuracy [*](https://doi.org/10.1126/science.ade2574).
 
  **Accuracy Option**: The **Use higher accuracy** option applies a masked-marginals scoring strategy, which considers sequence context during inference.
- While this method is slower, it enhances accuracy. If you experience long runtimes, unchecking this option can significantly speed up calculations at the cost of some accuracy.
+ While this method is slower, it enhances accuracy.
+ If you experience long runtimes, unchecking this option can significantly speed up calculations at the cost of some accuracy.
 
- **Deep Mutational Scan Recommendations**: When performing a deep mutational scan, it is advisable to use smaller models (8M, 35M, or 150M parameters) due to significant runtime concernsespecially with longer sequences or during peak server usage times.
- For example, calculating a 300-residue-long sequence with larger models may require over 30 minutes.
- Generally, accuracy is more affected by the scoring strategy than by model size; therefore, prioritise reducing model size when optimizing for runtime.
- The computational cost of the scoring strategy scales with the number of substitutions tested, while model cost scales with wild-type sequence length.
+ **Deep Mutational Scan Recommendations**: When performing a deep mutational scan, it is advisable to use smaller models (8M, 35M, or 150M parameters) due to significant runtime concerns, especially with longer sequences or during peak server usage times.
+ For example, calculating a 300-residue-long sequence with larger models may require over 30 minutes.
+ Generally, accuracy is more affected by the scoring strategy than by model size; therefore, prioritize reducing model size when optimizing for runtime.
+ The computational cost of the scoring strategy scales with the number of substitutions tested, while model cost scales with wild-type sequence length.
 
- **Concurrent Substitutions**: To calculate the effect of multiple concurrent substitutions, you must manually change the input sequence and rerun the calculation. Accuracy is not guaranteed as this use case is yet untested.
+ **Concurrent Substitutions**:
+ To calculate the effect of multiple concurrent substitutions, you must manually change the input sequence and rerun the calculation.
+ Accuracy is not guaranteed as this use case is yet untested.
 
  ## Output
 
- Results are displayed in a color-coded table, except for deep mutational scans, which produce a heatmap.
+ Results are displayed in a colour-coded table, except for deep mutational scans, which produce a heatmap.
  In the table:
 
  - Beneficial substitutions are highlighted in green with positive values.
@@ -44,6 +47,11 @@ As a rule of thumb, score differences of *4* or more are considered significant.
 
  The **Download raw data** button lets you download the output in CSV format.
 
+ ## Debugging
+
+ A basic error message will be displayed if the tool fails to process your input, but you can also check the server's [logs](https://huggingface.co/spaces/thaidaev/zsp?logs=container) for additional information.
+ The logs also show a progress bar that indicates how far along the calculation is.
+
 
  **If you use this tool in your research, please cite**:
 
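
Reviewer note on the **Accuracy Option** and **Deep Mutational Scan Recommendations** text in instructions.md: the diff does not show the tool's scoring code, so the following is only a minimal sketch of masked-marginals scoring under stated assumptions. It assumes the usual definition (mask the substituted position, then compare the log-probability of the mutant residue with that of the wild-type residue, so that positive scores favour the mutant). `toy_model`, `MASK`, and `masked_marginal_scores` are hypothetical names introduced for illustration; `toy_model` is a deterministic stand-in, not an ESM or E1 model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # placeholder mask token (assumption, not tied to a real tokenizer)

LogProbs = List[Dict[str, float]]
Model = Callable[[Sequence[str]], LogProbs]
Substitution = Tuple[str, int, str]  # (wild-type residue, 1-based position, mutant residue)


def toy_model(tokens: Sequence[str]) -> LogProbs:
    """Hypothetical stand-in for a protein language model.

    Returns log-probabilities over the 20 amino acids for every position.
    A visible residue gets logit 1.0 at its own position and logit 0.5 at
    the position to its right; every other logit is 0.0.
    """
    output: LogProbs = []
    for i, token in enumerate(tokens):
        logits = {aa: 0.0 for aa in AMINO_ACIDS}
        if token in logits:
            logits[token] += 1.0
        if i > 0 and tokens[i - 1] in logits:
            logits[tokens[i - 1]] += 0.5
        log_z = math.log(sum(math.exp(v) for v in logits.values()))
        output.append({aa: v - log_z for aa, v in logits.items()})
    return output


def masked_marginal_scores(
    sequence: str, substitutions: Sequence[Substitution], model: Model
) -> Tuple[Dict[Substitution, float], int]:
    """Score substitutions and report how many model passes were needed."""
    scores: Dict[Substitution, float] = {}
    per_position: Dict[int, Dict[str, float]] = {}
    for wt, pos, mut in substitutions:
        idx = pos - 1
        if sequence[idx] != wt:
            raise ValueError(f"expected {wt} at position {pos}, found {sequence[idx]}")
        if idx not in per_position:
            # One forward pass per distinct masked position.
            tokens = list(sequence)
            tokens[idx] = MASK
            per_position[idx] = model(tokens)[idx]
        log_probs = per_position[idx]
        scores[(wt, pos, mut)] = log_probs[mut] - log_probs[wt]
    return scores, len(per_position)


scores, passes = masked_marginal_scores(
    "MKTA", [("K", 2, "M"), ("K", 2, "A"), ("T", 3, "K")], toy_model
)
```

In this sketch the number of model passes grows with the number of distinct positions tested, while the cost of each pass depends on the sequence length, which is in line with the cost remark in the documentation. Caching one pass per position is a design choice of the sketch, not something the diff confirms about the tool.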