---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:68840
- loss:MultipleNegativesRankingLoss
base_model: intfloat/e5-base-v2
widget:
- source_sentence: 'query: How can I hicmergeloops?'
  sentences:
  - 'passage: WindowMasker mkcounts. Construct WindowMasker unit counts table. **What it does** This tool runs `stage 1 <https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/winmasker/>`_ of the WindowMasker analysis to produce a unit counts file for a genome assembly.'
  - 'passage: GROMACS simulation. for system equilibration or data collection. .. class:: infomark **What it does** This tool performs a molecular dynamics simulation with GROMACS. _____ .. class:: infomark **Input** - GRO structure file. - Topology (TOP) file. A variety of other options can also be specified: - MDP parameter file to take advantage of all GROMACS features. Otherwise, choose parameters through the Galaxy interface. See the `manual`_ for more information on the options. - Accepting and producing checkpoint (CPT) input/output files, which allows sequential MD simulations, e.g. when performing NVT and NPT equilibration followed by a production simulation. - Position restraint (ITP) file, useful for equilibrating solvent around a protein. - Choice of ensemble: NVT or NPT. - Whether to return trajectory (XTC or TRR) and/or structure (GRO or PDB) files. .. _`manual`: http://manual.gromacs.org/documentation/2018/user-guide/mdp-options.html _____ .. class:: infomark **Output** - Structure and/or trajectory files as specified in the input.'
  - 'passage: hicMergeLoops. merge detected loops of different resolutions.. Merge detected loops ==================== This script merges the loop locations of different different resolutions. Loops need to have the following format: chr start end chr start end A merge happens if x and y position of a loop overlaps with x and y position of another loop; all loops are considered as an overlap within +/- the bin size of the lowest resolution. I.e. for a loop with coordinates x and y, the overlap to all other loops is searched for (x - lowest resolution) and (y + lowest resolution). If two or more locations should be merged, the one with the lowest resolution is taken as the merged loop. Example usage: `$ hicMergeLoops -i gm12878_10kb.bedgraph gm12878_5kb.bedgraph gm12878_25kb.bedgraph -o merged_result.bedgraph -r 25000` Please recall: We work with binned data i.e. the lowest resolution is therefore the one where we merge the most bases into one bin. In the above example the lowest resultion is 25kb, the highest resolution is 5kb. For more information about HiCExplorer please consider our documentation on readthedocs.io_ .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html'
- source_sentence: 'query: How can I featurefindercentroided?'
  sentences:
  - 'passage: ETE lineage generator. from a list of species/taxids using the ETE Toolkit. Generates a table with lineage information for a list of species (also taxids and arbitrary taxons are accepted) using the `ETE Toolkit`_. .. _ETE Toolkit: https://etetoolkit.org/ **Input** - *Species file* a single column tabular file - *(ETE3) Taxonomy Database* a sqlite database that has been created by ETE from the NCBI taxonomy dump **Options** - *Taxonomic levels* the columns to be incuded in the output table. There are two presets (full and primary) - *Full* contains all 29 ranks included in the NCBI taxonomy - *Primary* contains the primary ranks (kingdom, phylum, class, order, family, genus, species) - *Manual* the ranks of interest can be chosen by the user. The primary levels are chosen by default. - *Fill unnamed ranks* Get missing data from "nearby" levels: - Some nodes in the NCBI taxonomy tree have no name (no rank) these are shown by default as "NA" in the output. If the *compress* option is selected then the rank is accepted if the level name is included (e.g. superorder is accepted as order if the order is unnamed but the name of the superorder is given) - *Prefer lower ranks for filling* for compressing lower levels are prefered over higher ones **Output** Table (tab separated). The first column contains the species names. The following columns contain the rank names of the levels of interest.'
  - 'passage: FeatureFinderCentroided. Detects two-dimensional features in LC-MS data.. Detects two-dimensional features in LC-MS data. For more information, visit http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/html/TOPP_FeatureFinderCentroided.html'
  - 'passage: HyPhy-SLAC. Single Likelihood Ancestor Counting. SLAC (Single-Likelihood Ancestor Counting) uses a combination of maximum-likelihood and counting approaches to infer nonsynonymous and synonymous substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. SLAC assumes that the selection pressure for each site is constant along the entire phylogeny. See the online documentation_ for more information. .. _documentation: http://hyphy.org/methods/selection-methods/#slac'
- source_sentence: 'query: Tool for using a new batch of labeled data'
  sentences:
  - 'passage: qiime2 sample-classifier fit-classifier. Fit a supervised learning classifier.. QIIME 2: sample-classifier fit-classifier ========================================= Fit a supervised learning classifier. Outputs: -------- :sample_estimator.qza: Trained sample classifier. :feature_importance.qza: Importance of each input feature to model accuracy. | Description: ------------ Fit a supervised learning classifier. Outputs the fit estimator (for prediction of test samples and/or unknown samples) and the relative importance of each feature for model accuracy. Optionally use k-fold cross-validation for automatic recursive feature elimination and hyperparameter tuning. |'
  - 'passage: HyPhy-FEL. Fixed Effects Likelihood. FEL : Fixed effects likelihood ============================== What question does this method answer? -------------------------------------- Which site(s) in a gene are subject to pervasive, i.e. consistently across the entire phylogeny, diversifying selection? Recommended Applications ------------------------ The phenomenon of pervasive selection is generally most prevalent in pathogen evolution and any biological system influenced by evolutionary arms race dynamics (or balancing selection), including adaptive immune escape by viruses. As such, FEL is ideally suited to identify sites under positive selection which represent candidate sites subject to strong selective pressures across the entire phylogeny. FEL is our recommended method for analyzing small-to-medium size datasets when one wishes only to study pervasive selection at individual sites. Brief description ----------------- FEL (Fixed Effects Likelihood) estimates site-wise synonymous (alpha) and non-synonymous rates (beta), and uses a likelihood ratio test to determine if beta != alpha at a site. The estimates aggregate information over all branches, so the signal is derived from pervasive diversification or conservation. A subset of branches can be selected for testing as well, in which case an additional (nuisance) parameter will be inferred -- the non-synonymous rate on branches NOT selected for testing. Input ----- 1. A *FASTA* sequence alignment. 2. A phylogenetic tree in the *Newick* format Note: the names of sequences in the alignment must match the names of the sequences in the tree. Output ------ A JSON file with analysis results (http://hyphy.org/resources/json-fields.pdf). A custom visualization module for viewing these results is available (see http://vision.hyphy.org/FEL for an example) Further reading --------------- http://hyphy.org/methods/selection-methods/#FEL Tool options ------------ :: --code Which genetic code to use --branches Which branches should be tested for selection? All [default] : test all branches Internal : test only internal branches (suitable for intra-host pathogen evolution for example, where terminal branches may contain polymorphism data) Leaves: test only terminal (leaf) branches Unlabeled: if the Newick string is labeled using the {} notation, test only branches without explicit labels (see http://hyphy.org/tutorials/phylotree/) --pvalue The significance level used to determine significance --srv Include site-to-site synonymous rate variation? Yes [default] or No'
  - 'passage: Evaluate a Fitted Model. using a new batch of labeled data. **What it does** Given a fitted estimator and a labeled dataset, this tool outputs the performances of the fitted estimator on the labeled dataset with selected scorers. For the estimator, this tool supports fitted sklearn estimators and trained deep learning models. For input datasets, it supports the following: - tabular - sparse **Output** A tabular file containing performance scores, e.g.: ======== ======== ========= accuracy f1_macro precision ======== ======== ========= 0.8613 0.6759 0.7928 ======== ======== ========='
- source_sentence: 'query: How can I cellprofiler?'
  sentences:
  - 'passage: CellProfiler. run a CellProfiler pipeline. .. class:: infomark **What it does** This is the last tool in a CellProfiler workflow and runs a CellProfiler 4.2.1 pipeline file on a collection of images. .. class:: infomark **Input** - Collection of images. - Existing CellProfiler pipeline file *(.cppipe)* or generated by linking CellProfiler tools. .. class:: infomark **Output** - Images if the tool *SaveImages* was included in the workflow. - The features selected if the tool *ExportToSpreadsheet* was included in the workflow. .. class:: warningmark **IMPORTANT** Only the pipelines generated with the version 4.2.1 of CellProfiler can be run, other versions may cause problems.'
  - 'passage: Download and Generate Pileup Format. from NCBI SRA. This tool produces pileup format from sra archives using sra-pileup. The sra-pileup program is developed at NCBI, and is available at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software. Galaxy tool wrapper originally written by Matt Shirley (mdshw5 at gmail.com). Wrapper modified by Philip Mabon ( philip.mabon at phac-aspc.gc.ca ). Tool dependencies, clean-up and bug-fixes by Marius van den Beek (m.vandenbeek at gmail.com). For support and bug reports contact Matt Shirley or Marius van den Beek or go to https://github.com/galaxyproject/tools-iuc.'
  - 'passage: Seurat Export2CellBrowser. produces files for UCSC CellBrowser import.. .. class:: infomark **What it does** Seurat_ is a toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. It is developed and maintained by the `Satija Lab`_ at NYGC. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. This tool converts a Seurat object (hopefully with t-SNE results) and its accompanying marker genes file (optional) to a tar that can be feed to the UCSC CellBrowser tool. ----- **Inputs** * RDS object ----- **Outputs** * Text file .. _Seurat: https://www.nature.com/articles/nbt.4096 .. _Satija Lab: https://satijalab.org/seurat/ **Version history** 0.0.1: Initial contribution. Maria Doyle, https://github.com/mblue9. 2.3.1+galaxy0: Improved documentation and further exposition of all script''s options. Pablo Moreno, Jonathan Manning and Ni Huang, Expression Atlas team https://www.ebi.ac.uk/gxa/home at EMBL-EBI https://www.ebi.ac.uk/. Parts obtained from wrappers from Christophe Antoniewski(https://github.com/drosofff) and Lea Bellenger(https://github.com/bellenger-l).'
- source_sentence: 'query: Which tool can be used to qiime2 diversity pcoa-biplot?'
  sentences:
  - 'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.'
  - "passage: Biosigner. Molecular signature discovery from omics data. .. class:: infomark **Author**\tPhilippe Rinaudo and Etienne Thevenot (CEA, LIST, MetaboHUB Paris, etienne.thevenot@cea.fr) --------------------------------------------------- .. class:: infomark **Please cite** Rinaudo P., Boudah S., Junot C. and Thevenot E.A. (2016). *biosigner*: a new method for the discovery of significant molecular signatures from omics data. *Frontiers in Molecular Biosciences*, **3** (http://dx.doi.org/10.3389/fmolb.2016.00026). --------------------------------------------------- .. class:: infomark **R package** The *biosigner* package is available from the bioconductor repository (http://bioconductor.org/packages/biosigner). --------------------------------------------------- .. class:: infomark **Tool updates** See the **NEWS** section at the bottom of this page --------------------------------------------------- ========================================================== *biosigner*: Molecular signature discovery from omics data ========================================================== ----------- Description ----------- High-throughput, non-targeted, technologies such as transcriptomics, proteomics and metabolomics, are widely used to **discover molecules** which allow to efficiently discriminate between biological or clinical conditions of interest (e.g., disease vs control states). Powerful **machine learning** approaches such as Partial Least Square Discriminant Analysis (PLS-DA), Random Forest (RF) and Support Vector Machines (SVM) have been shown to achieve high levels of prediction accuracy. **Feature selection**, i.e., the selection of the few features (i.e., the molecular signature) which are of highest discriminating value, is a critical step in building a robust and relevant classifier (Guyon and Elisseeff, 2003): First, dimension reduction is usefull to limit the risk of overfitting and reduce the prediction variability of the model; second, intrepretation of the molecular signature is facilitated; third, in case of the development of diagnostic product, a restricted list is required for the subsequent validation steps (Rifai et al, 2006). Since the comprehensive analysis of all combinations of features is not computationally tractable, several selection techniques have been described (Saeys et al, 2007). The major challenge for such methods is to be fast and extract **restricted and stable molecular signatures** which still provide high performance of the classifier (Gromski et al, 2014; Determan, 2015). The **biosigner** module implements a new feature selection algorithm to assess the relevance of the variables for the prediction performances of the classifier (Rinaudo et al, submitted). Three binary classifiers can be run in parallel, namely **PLS-DA**, **Random Forest** and **SVM**, as the performances of each machine learning approach may vary depending on the structure of the dataset. The algorithm computes the *tier* of each feature for the selected classifer(s): tier *S* corresponds to the final signature, i.e., features which have been found significant in all the selection steps; features with tier *A* have been found significant in all but the last selection, and so on for tier *B* to *E*. It returns the **signature** (by default from the *S* tier) for each of the selected classifier as an additional column of the **variableMetadata** table. In addition the *tiers* and **individual boxplots** of the selected features are returned. The module has been successfully applied to **transcriptomics** and **metabolomics** data. Note: | 1) Only **binary** classification is currently available, | 2) If the **dataMatrix** contains **missing** values (NA), these features will be removed prior to modeling with Random Forest and SVM (in contrast, the NIPALS algorithm from PLS-DA can handle missing values), | 3) As the algorithm relies on bootstrapping, re-running the module may result in slightly different results. To ensure that returned results are exactly the same, the **seed** (advanced) parameter can be used. | --------------------------------------------------- .. class:: infomark **References** | Determan C. (2015). Optimal algorithm for metabolomics classification and feature selection varies by dataset. International *Journal of Biology* 7, 100-115. | Gromski P.S., Xu Y., Correa E., Ellis D.I., Turner M.L. and Goodacre R. (2014). A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data . *Analytica Chimica Acta* 829, 1-8. | Guyon I. and Elisseeff A. (2003). An introduction to variable and feature selection. *Journal of Machine Learning Research* 3, 1157-1182. | Rifai N., Gillette M.A. and Carr S.A. (2006). Protein biomarker discovery and validation: the long and uncertain path to clinical utility. *Nature Biotechnology* 24, 971-983. | Rinaudo P., Junot C. and Thevenot E.A. *biosigner*: A new method for the discovery of restricted and stable molecular signatures from omics data. *submitted*. | Saeys Y., Inza I. and Larranaga P. (2007). A review of feature selection techniques in bioinformatics. *Bioinformatics* 23, 2507-2517. --------------------------------------------------- ----------------- Workflow position ----------------- .. image:: biosigner_workflowPositionImage.png :width: 600 ----------- Input files ----------- +---------------------------+------------+ | File | Format | +===========================+============+ | 1) Data matrix | tabular | +---------------------------+------------+ | 2) Sample metadata | tabular | +---------------------------+------------+ | 3) Variable metadata | tabular | +---------------------------+------------+ ---------- Parameters ---------- Data matrix file \t| variable x sample **dataMatrix** tabular separated file of the numeric intensities, with . as decimal, and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Sample metadata file \t| sample x metadata **sampleMetadata** tabular separated file of the numeric and/or character sample metadata, with . as decimal and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Variable metadata file \t| variable x metadata **variableMetadata** tabular separated file of the numeric and/or character variable metadata, with . as decimal and NA for missing values; use the **Check Format** tool in the **LC-MS/Quality Control** section to check the formats of your **dataMatrix**, **sampleMetadata** and **variableMetadata** files \t| Classes of samples \t| Column of the sample metadata table to be used as the qualitative **binary** response to be modelled; the column should contain only two types of strings (e.g., 'case' and 'control') \t| Advanced: Classification method(s) (default = all) \t| Either one or all of the following classifiers: Partial Least Squares Discriminant Analysis (PLS-DA), or Random Forest, or Support Vector Machine (SVM) \t| Advanced: Number of bootstraps (default = 50) \t| This parameter controls the number of times the model performance is compared to the prediction on a test subset where the intensities of the candidate feature have been randomly permuted. \t| Advanced: Selection tier(s) (default = S) \t| Tier *S* corresponds to the final signature, i.e., features which have been found significant in all the backward selection steps; features with tier *A* have been found significant in all but the last selection, and so on for tier *B* to *E*. Default selection tier is *S*, meaning that the final signature only is returned; to view a larger number of candidate features, the *S+A* tiers can be selected. \t| Advanced: p-value threshold (default = 0.05) \t| This threshold controls the selection of the features at each selection round (tier): to be selected, the proportion of times the prediction on the test set with the randomized intensities of the feature is more accurate than on the original test set must be inferior to this threshold. For example, if the number of bootstraps is 50, no more than 2 out of the 50 predictions on the randomized test set must not be more accurate than on the original test set (since 1/50 = 0.02). Advanced: Seed (default = 0) \t| As the algorithm relies on resampling (bootstrap), re-running the module may result in slightly different signatures. To ensure that returned results are exactly the same, the **seed** parameter (integer) can be used; the default, 0, means that no seed is used. \t| ------------ Output files ------------ variableMetadata_out.tabular \t| When a least one feature has been selected, a **tier** column is added indicating for each feature the classifier(s) it was selected from. \t| figure-tier.pdf \t| Graphic summarizing which features were selected, with their corresponding tier (i.e., round(s) of selection) for each classifier. \t| figure-boxplot.pdf \t| Individual boxplots of the features which were selected in at least one of the signatures. Features selected for a single classifier are colored (*red* for PLS-DA, *green* for Random Forest, and *blue* for SVM) \t| \t\t\t information.txt \t| Text file with all messages and warnings generated during the computation. \t| --------------------------------------------------- --------------- Working example --------------- See the **W4M00001a_sacurine-subset-statistics** and **W4M00003_diaplasma** shared histories in the **Shared Data/Published Histories** menu (https://galaxy.workflow4metabolomics.org/history/list_published) Figure output ============= .. image:: biosigner_workingExampleImage.png :width: 600 --------------------------------------------------- ---- NEWS ---- CHANGES IN VERSION 2.2.6 ======================== INTERNAL MODIFICATIONS Minor internal modifications CHANGES IN VERSION 2.2.4 ======================== INTERNAL MODIFICATIONS Creating additional files for planemo and travis running and installation validation CHANGES IN VERSION 2.2.2 ======================== INTERNAL MODIFICATIONS Internal updates to biosigner package versions of 1.0.0 and above, and ropls versions of 1.4.0 and above (i.e. using S4 methods instead of S3) CHANGES IN VERSION 2.2.1 ======================== NEW FEATURE Creation of the tool"
  - 'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
|
| 306 |
+
|
| 307 |
+
# SentenceTransformer based on intfloat/e5-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/e5-base-v2](https://huggingface.co/intfloat/e5-base-v2) <!-- at revision f52bf8ec8c7124536f0efb74aca902b2995e5bcd -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
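
The `Pooling` and `Normalize` stages above amount to masked mean pooling followed by L2 normalization, so the cosine similarity between two embeddings reduces to a dot product. A minimal sketch with made-up token embeddings (not the actual `BertModel` output):

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Mean-pool token embeddings over non-padding positions, then L2-normalize."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy example: batch of 1, sequence length 3 (last token is padding), dim 4
tokens = np.array([[[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [9.0, 9.0, 9.0, 9.0]]])  # padding row, masked out below
mask = np.array([[1, 1, 0]])
emb = mean_pool_and_normalize(tokens, mask)
print(emb.shape)  # (1, 4)
# emb has unit length, so dot products between embeddings are cosine similarities
```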

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'query: Which tool can be used to qiime2 diversity pcoa-biplot?',
    'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |',
    'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
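
Note that the inputs above carry the `query: ` and `passage: ` prefixes used throughout the training data (the e5 convention), so search queries and documents should be prefixed accordingly at inference time. A pair of small helpers (hypothetical names, not part of the library) keeps this consistent:

```python
def as_query(text: str) -> str:
    """Prefix a search query, following the e5 'query: ' convention."""
    return f"query: {text}"

def as_passage(text: str) -> str:
    """Prefix a document/passage, following the e5 'passage: ' convention."""
    return f"passage: {text}"

queries = [as_query("Which tool can be used to qiime2 diversity pcoa-biplot?")]
passages = [as_passage("qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot.")]
print(queries[0][:7])   # 'query: '
print(passages[0][:9])  # 'passage: '
```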

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 68,840 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string | string |
  | details | <ul><li>min: 7 tokens</li><li>mean: 14.66 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 21 tokens</li><li>mean: 298.3 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | <code>query: Tool for vcf/bcf conversion, view, subset and filter vcf/bcf files</code> | <code>passage: bcftools view. VCF/BCF conversion, view, subset and filter VCF/BCF files. ===================================== bcftools view ===================================== VCF/BCF conversion, view, subset and filter VCF/BCF files. Region Selections ----------------- Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can resul...</code> |
  | <code>query: Tool for de novo assembly of rna-seq data</code> | <code>passage: Trinity. de novo assembly of RNA-Seq data. Trinity_ assembles transcript sequences from Illumina RNA-Seq data. .. _Trinity: http://trinityrnaseq.github.io</code> |
  | <code>query: I want to das tool in Galaxy</code> | <code>passage: DAS Tool. for genome-resolved metagenomics. What it does ============ DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. Inputs ====== - Bins: Tab-separated files of contig-IDs and bin-IDs. Contigs to bin file example: :: Contig_1 bin.01 Contig_8 bin.01 Contig_42 bin.02 Contig_49 bin.03 - Contigs: Assembled contigs in fasta format: :: >Contig_1 ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA... >Contig_2 GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG... - [Optional] Proteins: Predicted proteins in prodigal fasta format. The header contains contig-ID and gene number: :: >Contig_1_1 MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG... >Contig_1_2 MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN... Outputs ======= - Summary of output bins including quality and c...</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
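
Conceptually, MultipleNegativesRankingLoss treats passage *i* in a batch as the positive for query *i* and every other in-batch passage as a negative, then applies cross-entropy over the scaled similarity matrix. A simplified numpy sketch (not the library implementation) with the `scale=20.0` and cosine-similarity settings above:

```python
import numpy as np

def mnr_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities; the diagonal entries
    (query i vs. passage i) are the targets, off-diagonals are in-batch negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) scaled cosine similarities
    # log-softmax over each row; the target index for row i is i
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = mnr_loss(q, rng.normal(size=(4, 8)))  # unrelated pairs: high loss
loss_aligned = mnr_loss(q, q)                       # identical pairs: near-zero loss
print(round(float(loss_aligned), 4), round(float(loss_random), 4))
```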

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `num_train_epochs`: 4
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>

### Training Logs
| Epoch  | Step  | Training Loss |
|:------:|:-----:|:-------------:|
| 0.1162 | 500   | 0.0921        |
| 0.2324 | 1000  | 0.0066        |
| 0.3486 | 1500  | 0.0062        |
| 0.4648 | 2000  | 0.0081        |
| 0.5810 | 2500  | 0.0073        |
| 0.6972 | 3000  | 0.0091        |
| 0.8134 | 3500  | 0.0053        |
| 0.9296 | 4000  | 0.0083        |
| 1.0458 | 4500  | 0.0073        |
| 1.1620 | 5000  | 0.0059        |
| 1.2782 | 5500  | 0.0068        |
| 1.3944 | 6000  | 0.0047        |
| 1.5106 | 6500  | 0.0077        |
| 1.6268 | 7000  | 0.0071        |
| 1.7430 | 7500  | 0.0067        |
| 1.8592 | 8000  | 0.0069        |
| 1.9754 | 8500  | 0.0077        |
| 2.0916 | 9000  | 0.0064        |
| 2.2078 | 9500  | 0.0073        |
| 2.3240 | 10000 | 0.0075        |
| 2.4402 | 10500 | 0.0049        |
| 2.5564 | 11000 | 0.0071        |
| 2.6726 | 11500 | 0.0075        |
| 2.7888 | 12000 | 0.0078        |
| 2.9050 | 12500 | 0.0086        |
| 3.0211 | 13000 | 0.0069        |
| 3.1373 | 13500 | 0.0052        |
| 3.2535 | 14000 | 0.0065        |
| 3.3697 | 14500 | 0.0066        |
| 3.4859 | 15000 | 0.0068        |
| 3.6021 | 15500 | 0.0079        |
| 3.7183 | 16000 | 0.0077        |
| 3.8345 | 16500 | 0.0066        |
| 3.9507 | 17000 | 0.0046        |


### Framework Versions
- Python: 3.12.8
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->