johnnas12 committed · Commit 62a0c2c · verified · Parent: 90563f7

---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:68840
- loss:MultipleNegativesRankingLoss
base_model: intfloat/e5-base-v2
widget:
- source_sentence: 'query: How can I hicmergeloops?'
  sentences:
  - 'passage: WindowMasker mkcounts. Construct WindowMasker unit counts table. **What it does** This tool runs `stage 1 <https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/winmasker/>`_ of the WindowMasker analysis to produce a unit counts file for a genome assembly.'
  - 'passage: GROMACS simulation. for system equilibration or data collection. .. class:: infomark **What it does** This tool performs a molecular dynamics simulation with GROMACS. _____ .. class:: infomark **Input** - GRO structure file. - Topology (TOP) file. A variety of other options can also be specified: - MDP parameter file to take advantage of all GROMACS features. Otherwise, choose parameters through the Galaxy interface. See the `manual`_ for more information on the options. - Accepting and producing checkpoint (CPT) input/output files, which allows sequential MD simulations, e.g. when performing NVT and NPT equilibration followed by a production simulation. - Position restraint (ITP) file, useful for equilibrating solvent around a protein. - Choice of ensemble: NVT or NPT. - Whether to return trajectory (XTC or TRR) and/or structure (GRO or PDB) files. .. _`manual`: http://manual.gromacs.org/documentation/2018/user-guide/mdp-options.html _____ .. class:: infomark **Output** - Structure and/or trajectory files as specified in the input.'
  - 'passage: hicMergeLoops. merge detected loops of different resolutions.. Merge detected loops ==================== This script merges the loop locations of different resolutions. Loops need to have the following format: chr start end chr start end A merge happens if x and y position of a loop overlaps with x and y position of another loop; all loops are considered as an overlap within +/- the bin size of the lowest resolution. I.e. for a loop with coordinates x and y, the overlap to all other loops is searched for (x - lowest resolution) and (y + lowest resolution). If two or more locations should be merged, the one with the lowest resolution is taken as the merged loop. Example usage: `$ hicMergeLoops -i gm12878_10kb.bedgraph gm12878_5kb.bedgraph gm12878_25kb.bedgraph -o merged_result.bedgraph -r 25000` Please recall: We work with binned data i.e. the lowest resolution is therefore the one where we merge the most bases into one bin. In the above example the lowest resolution is 25kb, the highest resolution is 5kb. For more information about HiCExplorer please consider our documentation on readthedocs.io_ .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html'
- source_sentence: 'query: How can I featurefindercentroided?'
  sentences:
  - 'passage: ETE lineage generator. from a list of species/taxids using the ETE Toolkit. Generates a table with lineage information for a list of species (also taxids and arbitrary taxons are accepted) using the `ETE Toolkit`_. .. _ETE Toolkit: https://etetoolkit.org/ **Input** - *Species file* a single column tabular file - *(ETE3) Taxonomy Database* a sqlite database that has been created by ETE from the NCBI taxonomy dump **Options** - *Taxonomic levels* the columns to be included in the output table. There are two presets (full and primary) - *Full* contains all 29 ranks included in the NCBI taxonomy - *Primary* contains the primary ranks (kingdom, phylum, class, order, family, genus, species) - *Manual* the ranks of interest can be chosen by the user. The primary levels are chosen by default. - *Fill unnamed ranks* Get missing data from "nearby" levels: - Some nodes in the NCBI taxonomy tree have no name (no rank) these are shown by default as "NA" in the output. If the *compress* option is selected then the rank is accepted if the level name is included (e.g. superorder is accepted as order if the order is unnamed but the name of the superorder is given) - *Prefer lower ranks for filling* for compressing lower levels are preferred over higher ones **Output** Table (tab separated). The first column contains the species names. The following columns contain the rank names of the levels of interest.'
  - 'passage: FeatureFinderCentroided. Detects two-dimensional features in LC-MS data.. Detects two-dimensional features in LC-MS data. For more information, visit http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/html/TOPP_FeatureFinderCentroided.html'
  - 'passage: HyPhy-SLAC. Single Likelihood Ancestor Counting. SLAC (Single-Likelihood Ancestor Counting) uses a combination of maximum-likelihood and counting approaches to infer nonsynonymous and synonymous substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. SLAC assumes that the selection pressure for each site is constant along the entire phylogeny. See the online documentation_ for more information. .. _documentation: http://hyphy.org/methods/selection-methods/#slac'
- source_sentence: 'query: Tool for using a new batch of labeled data'
  sentences:
  - 'passage: qiime2 sample-classifier fit-classifier. Fit a supervised learning classifier.. QIIME 2: sample-classifier fit-classifier ========================================= Fit a supervised learning classifier. Outputs: -------- :sample_estimator.qza: Trained sample classifier. :feature_importance.qza: Importance of each input feature to model accuracy. | Description: ------------ Fit a supervised learning classifier. Outputs the fit estimator (for prediction of test samples and/or unknown samples) and the relative importance of each feature for model accuracy. Optionally use k-fold cross-validation for automatic recursive feature elimination and hyperparameter tuning. |'
  - 'passage: HyPhy-FEL. Fixed Effects Likelihood. FEL : Fixed effects likelihood ============================== What question does this method answer? -------------------------------------- Which site(s) in a gene are subject to pervasive, i.e. consistently across the entire phylogeny, diversifying selection? Recommended Applications ------------------------ The phenomenon of pervasive selection is generally most prevalent in pathogen evolution and any biological system influenced by evolutionary arms race dynamics (or balancing selection), including adaptive immune escape by viruses. As such, FEL is ideally suited to identify sites under positive selection which represent candidate sites subject to strong selective pressures across the entire phylogeny. FEL is our recommended method for analyzing small-to-medium size datasets when one wishes only to study pervasive selection at individual sites. Brief description ----------------- FEL (Fixed Effects Likelihood) estimates site-wise synonymous (alpha) and non-synonymous rates (beta), and uses a likelihood ratio test to determine if beta != alpha at a site. The estimates aggregate information over all branches, so the signal is derived from pervasive diversification or conservation. A subset of branches can be selected for testing as well, in which case an additional (nuisance) parameter will be inferred -- the non-synonymous rate on branches NOT selected for testing. Input ----- 1. A *FASTA* sequence alignment. 2. A phylogenetic tree in the *Newick* format Note: the names of sequences in the alignment must match the names of the sequences in the tree. Output ------ A JSON file with analysis results (http://hyphy.org/resources/json-fields.pdf). A custom visualization module for viewing these results is available (see http://vision.hyphy.org/FEL for an example) Further reading --------------- http://hyphy.org/methods/selection-methods/#FEL Tool options ------------ :: --code Which genetic code to use --branches Which branches should be tested for selection? All [default] : test all branches Internal : test only internal branches (suitable for intra-host pathogen evolution for example, where terminal branches may contain polymorphism data) Leaves: test only terminal (leaf) branches Unlabeled: if the Newick string is labeled using the {} notation, test only branches without explicit labels (see http://hyphy.org/tutorials/phylotree/) --pvalue The significance level used to determine significance --srv Include site-to-site synonymous rate variation? Yes [default] or No'
  - 'passage: Evaluate a Fitted Model. using a new batch of labeled data. **What it does** Given a fitted estimator and a labeled dataset, this tool outputs the performances of the fitted estimator on the labeled dataset with selected scorers. For the estimator, this tool supports fitted sklearn estimators and trained deep learning models. For input datasets, it supports the following: - tabular - sparse **Output** A tabular file containing performance scores, e.g.: ======== ======== ========= accuracy f1_macro precision ======== ======== ========= 0.8613 0.6759 0.7928 ======== ======== ========='
- source_sentence: 'query: How can I cellprofiler?'
  sentences:
  - 'passage: CellProfiler. run a CellProfiler pipeline. .. class:: infomark **What it does** This is the last tool in a CellProfiler workflow and runs a CellProfiler 4.2.1 pipeline file on a collection of images. .. class:: infomark **Input** - Collection of images. - Existing CellProfiler pipeline file *(.cppipe)* or generated by linking CellProfiler tools. .. class:: infomark **Output** - Images if the tool *SaveImages* was included in the workflow. - The features selected if the tool *ExportToSpreadsheet* was included in the workflow. .. class:: warningmark **IMPORTANT** Only the pipelines generated with the version 4.2.1 of CellProfiler can be run, other versions may cause problems.'
  - 'passage: Download and Generate Pileup Format. from NCBI SRA. This tool produces pileup format from sra archives using sra-pileup. The sra-pileup program is developed at NCBI, and is available at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software. Galaxy tool wrapper originally written by Matt Shirley (mdshw5 at gmail.com). Wrapper modified by Philip Mabon ( philip.mabon at phac-aspc.gc.ca ). Tool dependencies, clean-up and bug-fixes by Marius van den Beek (m.vandenbeek at gmail.com). For support and bug reports contact Matt Shirley or Marius van den Beek or go to https://github.com/galaxyproject/tools-iuc.'
  - 'passage: Seurat Export2CellBrowser. produces files for UCSC CellBrowser import.. .. class:: infomark **What it does** Seurat_ is a toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. It is developed and maintained by the `Satija Lab`_ at NYGC. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. This tool converts a Seurat object (hopefully with t-SNE results) and its accompanying marker genes file (optional) to a tar that can be feed to the UCSC CellBrowser tool. ----- **Inputs** * RDS object ----- **Outputs** * Text file .. _Seurat: https://www.nature.com/articles/nbt.4096 .. _Satija Lab: https://satijalab.org/seurat/ **Version history** 0.0.1: Initial contribution. Maria Doyle, https://github.com/mblue9. 2.3.1+galaxy0: Improved documentation and further exposition of all script''s options. Pablo Moreno, Jonathan Manning and Ni Huang, Expression Atlas team https://www.ebi.ac.uk/gxa/home at EMBL-EBI https://www.ebi.ac.uk/. Parts obtained from wrappers from Christophe Antoniewski(https://github.com/drosofff) and Lea Bellenger(https://github.com/bellenger-l).'
- source_sentence: 'query: Which tool can be used to qiime2 diversity pcoa-biplot?'
  sentences:
  - 'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.'
  - "passage: Biosigner. Molecular signature discovery from omics data. .. class:: infomark **Author**\tPhilippe Rinaudo and Etienne Thevenot (CEA, LIST, MetaboHUB Paris, etienne.thevenot@cea.fr) --------------------------------------------------- .. class:: infomark **Please cite** Rinaudo P., Boudah S., Junot C. and Thevenot E.A. (2016). *biosigner*: a new method for the discovery of significant molecular signatures from omics data. *Frontiers in Molecular Biosciences*, **3** (http://dx.doi.org/10.3389/fmolb.2016.00026). --------------------------------------------------- .. class:: infomark **R package** The *biosigner* package is available from the bioconductor repository (http://bioconductor.org/packages/biosigner). --------------------------------------------------- .. class:: infomark **Tool updates** See the **NEWS** section at the bottom of this page --------------------------------------------------- ========================================================== *biosigner*: Molecular signature discovery from omics data ========================================================== ----------- Description ----------- High-throughput, non-targeted, technologies such as transcriptomics, proteomics and metabolomics, are widely used to **discover molecules** which allow to efficiently discriminate between biological or clinical conditions of interest (e.g., disease vs control states). Powerful **machine learning** approaches such as Partial Least Square Discriminant Analysis (PLS-DA), Random Forest (RF) and Support Vector Machines (SVM) have been shown to achieve high levels of prediction accuracy. **Feature selection**, i.e., the selection of the few features (i.e., the molecular signature) which are of highest discriminating value, is a critical step in building a robust and relevant classifier (Guyon and Elisseeff, 2003): First, dimension reduction is usefull to limit the risk of overfitting and reduce the prediction variability of the model; second, intrepretation of the molecular signature is facilitated; third, in case of the development of diagnostic product, a restricted list is required for the subsequent validation steps (Rifai et al, 2006). Since the comprehensive analysis of all combinations of features is not computationally tractable, several selection techniques have been described (Saeys et al, 2007). The major challenge for such methods is to be fast and extract **restricted and stable molecular signatures** which still provide high performance of the classifier (Gromski et al, 2014; Determan, 2015). The **biosigner** module implements a new feature selection algorithm to assess the relevance of the variables for the prediction performances of the classifier (Rinaudo et al, submitted). Three binary classifiers can be run in parallel, namely **PLS-DA**, **Random Forest** and **SVM**, as the performances of each machine learning approach may vary depending on the structure of the dataset. The algorithm computes the *tier* of each feature for the selected classifer(s): tier *S* corresponds to the final signature, i.e., features which have been found significant in all the selection steps; features with tier *A* have been found significant in all but the last selection, and so on for tier *B* to *E*. It returns the **signature** (by default from the *S* tier) for each of the selected classifier as an additional column of the **variableMetadata** table. In addition the *tiers* and **individual boxplots** of the selected features are returned. The module has been successfully applied to **transcriptomics** and **metabolomics** data. Note: | 1) Only **binary** classification is currently available, | 2) If the **dataMatrix** contains **missing** values (NA), these features will be removed prior to modeling with Random Forest and SVM (in contrast, the NIPALS algorithm from PLS-DA can handle missing values), | 3) As the algorithm relies on bootstrapping, re-running the module may result in slightly different results. To ensure that returned results are exactly the same, the **seed** (advanced) parameter can be used. | --------------------------------------------------- .. class:: infomark **References** | Determan C. (2015). Optimal algorithm for metabolomics classification and feature selection varies by dataset. International *Journal of Biology* 7, 100-115. | Gromski P.S., Xu Y., Correa E., Ellis D.I., Turner M.L. and Goodacre R. (2014). A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data . *Analytica Chimica Acta* 829, 1-8. | Guyon I. and Elisseeff A. (2003). An introduction to variable and feature selection. *Journal of Machine Learning Research* 3, 1157-1182. | Rifai N., Gillette M.A. and Carr S.A. (2006). Protein biomarker discovery and validation: the long and uncertain path to clinical utility. *Nature Biotechnology* 24, 971-983. | Rinaudo P., Junot C. and Thevenot E.A. *biosigner*: A new method for the discovery of restricted and stable molecular signatures from omics data. *submitted*. | Saeys Y., Inza I. and Larranaga P. (2007). A review of feature selection techniques in bioinformatics. *Bioinformatics* 23, 2507-2517. --------------------------------------------------- ----------------- Workflow position ----------------- .. image:: biosigner_workflowPositionImage.png :width: 600 ----------- Input files ----------- +---------------------------+------------+ | File | Format | +===========================+============+ | 1) Data matrix | tabular | +---------------------------+------------+ | 2) Sample metadata | tabular | +---------------------------+------------+ | 3) Variable metadata | tabular | +---------------------------+------------+ ---------- Parameters ---------- Data matrix file \t| variable x sample **dataMatrix** tabular separated file of the numeric intensities, with . as decimal, and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Sample metadata file \t| sample x metadata **sampleMetadata** tabular separated file of the numeric and/or character sample metadata, with . as decimal and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Variable metadata file \t| variable x metadata **variableMetadata** tabular separated file of the numeric and/or character variable metadata, with . as decimal and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Classes of samples \t| Column of the sample metadata table to be used as the qualitative **binary** response to be modelled; the column should contain only two types of strings (e.g., 'case' and 'control') \t| Advanced: Classification method(s) (default = all) \t| Either one or all of the following classifiers: Partial Least Squares Discriminant Analysis (PLS-DA), or Random Forest, or Support Vector Machine (SVM) \t| Advanced: Number of bootstraps (default = 50) \t| This parameter controls the number of times the model performance is compared to the prediction on a test subset where the intensities of the candidate feature have been randomly permuted. \t| Advanced: Selection tier(s) (default = S) \t| Tier *S* corresponds to the final signature, i.e., features which have been found significant in all the backward selection steps; features with tier *A* have been found significant in all but the last selection, and so on for tier *B* to *E*. Default selection tier is *S*, meaning that the final signature only is returned; to view a larger number of candidate features, the *S+A* tiers can be selected. \t| Advanced: p-value threshold (default = 0.05) \t| This threshold controls the selection of the features at each selection round (tier): to be selected, the proportion of times the prediction on the test set with the randomized intensities of the feature is more accurate than on the original test set must be inferior to this threshold. For example, if the number of bootstraps is 50, no more than 2 out of the 50 predictions on the randomized test set must not be more accurate than on the original test set (since 1/50 = 0.02). Advanced: Seed (default = 0) \t| As the algorithm relies on resampling (bootstrap), re-running the module may result in slightly different signatures. To ensure that returned results are exactly the same, the **seed** parameter (integer) can be used; the default, 0, means that no seed is used. \t| ------------ Output files ------------ variableMetadata_out.tabular \t| When at least one feature has been selected, a **tier** column is added indicating for each feature the classifier(s) it was selected from. \t| figure-tier.pdf \t| Graphic summarizing which features were selected, with their corresponding tier (i.e., round(s) of selection) for each classifier. \t| figure-boxplot.pdf \t| Individual boxplots of the features which were selected in at least one of the signatures. Features selected for a single classifier are colored (*red* for PLS-DA, *green* for Random Forest, and *blue* for SVM) \t| \t\t\t information.txt \t| Text file with all messages and warnings generated during the computation. \t| --------------------------------------------------- --------------- Working example --------------- See the **W4M00001a_sacurine-subset-statistics** and **W4M00003_diaplasma** shared histories in the **Shared Data/Published Histories** menu (https://galaxy.workflow4metabolomics.org/history/list_published) Figure output ============= .. image:: biosigner_workingExampleImage.png :width: 600 --------------------------------------------------- ---- NEWS ---- CHANGES IN VERSION 2.2.6 ======================== INTERNAL MODIFICATIONS Minor internal modifications CHANGES IN VERSION 2.2.4 ======================== INTERNAL MODIFICATIONS Creating additional files for planemo and travis running and installation validation CHANGES IN VERSION 2.2.2 ======================== INTERNAL MODIFICATIONS Internal updates to biosigner package versions of 1.0.0 and above, and ropls versions of 1.4.0 and above (i.e. using S4 methods instead of S3) CHANGES IN VERSION 2.2.1 ======================== NEW FEATURE Creation of the tool"
  - 'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on intfloat/e5-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) <!-- at revision f52bf8ec8c7124536f0efb74aca902b2995e5bcd -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
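
Because the stack ends with a `Normalize()` module, every embedding is unit-length, so the cosine similarity this model uses reduces to a plain dot product. A minimal numpy sketch of that equivalence (the random vectors stand in for real model outputs):

```python
import numpy as np

# Toy stand-in embeddings; the real model produces 768-dim vectors.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 768))

# L2-normalize each row, as the Normalize() module does.
emb = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# For unit vectors, cosine similarity is just the dot product.
cosine = emb @ emb.T
print(np.allclose(np.diag(cosine), 1.0))  # each vector has similarity 1 with itself
```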

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'query: Which tool can be used to qiime2 diversity pcoa-biplot?',
    'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |',
    'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
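
Note the `query: ` / `passage: ` prefixes in the example above: E5-style models expect them, and the training data shown in the widget examples follows the same convention. For retrieval you embed one prefixed query against many prefixed passages and rank by similarity. The sketch below uses a toy bag-of-words encoder in place of `model.encode` so it runs without downloading the model; with the real model you would simply call `model.encode` on the same prefixed strings:

```python
import numpy as np

# E5-style convention: prefix queries and passages, as in the training data.
query = "query: How can I hicmergeloops?"
passages = [
    "passage: hicMergeLoops. merge detected loops of different resolutions.",
    "passage: Trinity. de novo assembly of RNA-Seq data.",
]

def tokenize(text):
    return [w.strip("?.:,") for w in text.lower().split()]

# Toy bag-of-words encoder standing in for model.encode, purely for illustration.
texts = [query] + passages
vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in tokenize(t)}))}

def encode(batch):
    vecs = np.zeros((len(batch), len(vocab)))
    for row, text in enumerate(batch):
        for w in tokenize(text):
            vecs[row, vocab[w]] += 1.0
    # Unit-normalize, mirroring the model's Normalize() module.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

q_emb = encode([query])
p_emb = encode(passages)
scores = (q_emb @ p_emb.T)[0]   # cosine similarity of the query to each passage
best = int(np.argmax(scores))
print(passages[best])           # the hicMergeLoops passage ranks first
```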
370
+
371
+ <!--
372
+ ### Direct Usage (Transformers)
373
+
374
+ <details><summary>Click to see the direct usage in Transformers</summary>
375
+
376
+ </details>
377
+ -->
378
+
379
+ <!--
380
+ ### Downstream Usage (Sentence Transformers)
381
+
382
+ You can finetune this model on your own dataset.
383
+
384
+ <details><summary>Click to expand</summary>
385
+
386
+ </details>
387
+ -->
388
+
389
+ <!--
390
+ ### Out-of-Scope Use
391
+
392
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
393
+ -->
394
+
395
+ <!--
396
+ ## Bias, Risks and Limitations
397
+
398
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
399
+ -->
400
+
401
+ <!--
402
+ ### Recommendations
403
+
404
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
405
+ -->
406
+
407
## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 68,840 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string | string |
  | details | <ul><li>min: 7 tokens</li><li>mean: 14.66 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 21 tokens</li><li>mean: 298.3 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | <code>query: Tool for vcf/bcf conversion, view, subset and filter vcf/bcf files</code> | <code>passage: bcftools view. VCF/BCF conversion, view, subset and filter VCF/BCF files. ===================================== bcftools view ===================================== VCF/BCF conversion, view, subset and filter VCF/BCF files. Region Selections ----------------- Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can resul...</code> |
  | <code>query: Tool for de novo assembly of rna-seq data</code> | <code>passage: Trinity. de novo assembly of RNA-Seq data. Trinity_ assembles transcript sequences from Illumina RNA-Seq data. .. _Trinity: http://trinityrnaseq.github.io</code> |
  | <code>query: I want to das tool in Galaxy</code> | <code>passage: DAS Tool. for genome-resolved metagenomics. What it does ============ DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. Inputs ====== - Bins: Tab-separated files of contig-IDs and bin-IDs. Contigs to bin file example: :: Contig_1 bin.01 Contig_8 bin.01 Contig_42 bin.02 Contig_49 bin.03 - Contigs: Assembled contigs in fasta format: :: >Contig_1 ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA... >Contig_2 GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG... - [Optional] Proteins: Predicted proteins in prodigal fasta format. The header contains contig-ID and gene number: :: >Contig_1_1 MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG... >Contig_1_2 MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN... Outputs ======= - Summary of output bins including quality and c...</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
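A minimal pure-Python sketch of what this loss computes (toy 2-D vectors for illustration, not real model embeddings): for each query in a batch, its paired passage is the positive and every other passage in the batch serves as an in-batch negative; the cosine similarities are multiplied by `scale` (20.0 here) and scored with softmax cross-entropy on the positive's index.

```python
import math

def cos_sim(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mnr_loss(query_embs, passage_embs, scale=20.0):
    # MultipleNegativesRankingLoss: for query i, passage i is the positive
    # and all other passages in the batch are negatives. The loss is the
    # mean cross-entropy over each row of scaled similarities.
    total = 0.0
    for i, q in enumerate(query_embs):
        scores = [scale * cos_sim(q, p) for p in passage_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(query_embs)

# Toy batch: each query is closest to its own passage, so the loss is near 0.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
print(mnr_loss(queries, passages))
```

Mismatched pairs (e.g. swapping the two passages) drive the loss up sharply, which is the training signal that pulls each `query:` embedding toward its own `passage:` embedding.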
433

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `num_train_epochs`: 4
- `multi_dataset_batch_sampler`: round_robin
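With `lr_scheduler_type: linear` and no warmup (see the full list below), the learning rate decays linearly from 5e-5 to 0 over the run. A sketch of that schedule, assuming the step count implied by this card's numbers (68,840 samples, batch size 16, `dataloader_drop_last: False`, 4 epochs, i.e. ceil(68840/16) × 4 = 17,212 optimizer steps — consistent with the ~17,000 steps in the training logs):

```python
import math

BATCH_SIZE = 16
EPOCHS = 4
DATASET_SIZE = 68_840
BASE_LR = 5e-5

# One optimizer step per batch; the last partial batch still counts.
STEPS_PER_EPOCH = math.ceil(DATASET_SIZE / BATCH_SIZE)  # 4303
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS                  # 17212

def linear_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR, warmup_steps=0):
    # Linear schedule with optional warmup (warmup_steps is 0 for this run):
    # ramp up during warmup, then decay linearly to zero at total_steps.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(linear_lr(0))            # full learning rate at the first step
print(linear_lr(TOTAL_STEPS))  # decayed to zero at the end
```

This is only an illustration of the schedule's shape; the actual decay is handled internally by the Hugging Face trainer.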
441

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>
562

### Training Logs
| Epoch  | Step  | Training Loss |
|:------:|:-----:|:-------------:|
| 0.1162 | 500   | 0.0921        |
| 0.2324 | 1000  | 0.0066        |
| 0.3486 | 1500  | 0.0062        |
| 0.4648 | 2000  | 0.0081        |
| 0.5810 | 2500  | 0.0073        |
| 0.6972 | 3000  | 0.0091        |
| 0.8134 | 3500  | 0.0053        |
| 0.9296 | 4000  | 0.0083        |
| 1.0458 | 4500  | 0.0073        |
| 1.1620 | 5000  | 0.0059        |
| 1.2782 | 5500  | 0.0068        |
| 1.3944 | 6000  | 0.0047        |
| 1.5106 | 6500  | 0.0077        |
| 1.6268 | 7000  | 0.0071        |
| 1.7430 | 7500  | 0.0067        |
| 1.8592 | 8000  | 0.0069        |
| 1.9754 | 8500  | 0.0077        |
| 2.0916 | 9000  | 0.0064        |
| 2.2078 | 9500  | 0.0073        |
| 2.3240 | 10000 | 0.0075        |
| 2.4402 | 10500 | 0.0049        |
| 2.5564 | 11000 | 0.0071        |
| 2.6726 | 11500 | 0.0075        |
| 2.7888 | 12000 | 0.0078        |
| 2.9050 | 12500 | 0.0086        |
| 3.0211 | 13000 | 0.0069        |
| 3.1373 | 13500 | 0.0052        |
| 3.2535 | 14000 | 0.0065        |
| 3.3697 | 14500 | 0.0066        |
| 3.4859 | 15000 | 0.0068        |
| 3.6021 | 15500 | 0.0079        |
| 3.7183 | 16000 | 0.0077        |
| 3.8345 | 16500 | 0.0066        |
| 3.9507 | 17000 | 0.0046        |

602
### Framework Versions
- Python: 3.12.8
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

611
## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->