Sentence Similarity
sentence-transformers
Safetensors
xlm-roberta
feature-extraction
Generated from Trainer
dataset_size:69500
loss:Infonce
text-embeddings-inference
Add new SentenceTransformer model
- .gitattributes +1 -0
- 1_Pooling/config.json +10 -0
- README.md +487 -0
- config.json +28 -0
- config_sentence_transformers.json +12 -0
- model.safetensors +3 -0
- modules.json +20 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- tokenizer.json +3 -0
- tokenizer_config.json +61 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md
ADDED
@@ -0,0 +1,487 @@
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:69500
- loss:Infonce
base_model: Snowflake/snowflake-arctic-embed-l-v2.0
widget:
- source_sentence: What aspect of human relationship to nature is omitted from the
    text
  sentences:
  - 'There are a few good ones, though. Here are the best WWE apps and WWE games for
    Android! The first five are the best games...

    Go Android Apps (blog)

    The Best Themes for Android Free Download: Hi friend we are again back with our
    new top ten best free themes for android list. This article is especially dedicated
    for those persons who want to make their smartphone...

    Paragon Software has created an app for Android that allows your device to natively
    read partitions in file systems that Android normally can''t handle, such as Microsoft''s
    NTFS, allowing immediate and easy use of... While the Sentio Desktop app can be
    used on its own, it was primarily meant to complement Sentio''s Superbook, a crowdfunded
    laptop shell for Android smartphones and tablets that''s just entering production
    after...

    ... phone then GBWhatsapp is the app for you. GBWhatsapp is basically similar
    to Whatsapp+ in terms of features. The newest available version right now is GBWhatsapp
    6.40 APK for Android devices.'
  - A true entertainer. date city state venue 11/23/2012 West Palm Beach FL Kravis
    Center 11/24/2012 Sarasota FL Van Wezel Performing Arts Hall 11/25/2012 Clearwater
    FL Capitol Theatre 11/29/2012 Durham NC Durham Performing Arts Center 12/1/2012
    Atlantic City NJ Trump Taj Mahal 12/2/2012 Staten Island NY St. George Theatre
    12/4/2012 Bethlehem PA Musikfest Cafe 12/5/2012 Verona NY Turning Stone Casino
    12/6/2012 Stamford CT Palace Theatre Stamford 12/8/2012 Shippensburg PA Luhrs
    Center 12/9/2012 Boston MA Wilbur Theatre 12/11/2012 Greensburg PA The Palace
    Theatre 12/12/2012 Easton MD Avalon Theatre 12/15/2012 Saint Charles IL Arcada
    Theater 12/16/2012 Milwaukee WI Potawatomi Bingo Casino 12/18/2012 Beaver Creek
    CO Vilar Performing Arts Center 12/20/2012 Chandler AZ Ovations Live!
  - The reader will gain a better understanding of the direction nature and culture
    is heading today by learning how connections were made in the past. It omits that
    which Raymond Williams called "a working landscape" -- the most intimate human
    relationship to nature which is people who live and work on it.
- source_sentence: Why is it recommended to contact a wedding agency or consultant
    before making a decision
  sentences:
  - Perhaps owing to this humiliation I resigned as Chief Winery Warlord, and took
    a position elsewhere. Following my resignation, we rebooked our date with axe
    throwing destiny, and converted the night from a team building exercise to a majestic
    send off in honour of my 10ish glorious years at Coffin Ridge. We arrived in our
    most impeccable vestments.
  - Therefore, those private companies increased their own rate of cash burn since
    the financial markets were willing to fund money-losing enterprises without hesitation.
    Out of the 100 largest North American-based technology companies, 16 have lost
    money over the past year.
  - Yet , it is best to contact a wedding agency or consultant before you make your
    concluding decision. This will make certain you are dealing with a respectable
    company.
- source_sentence: What is the Electronic Music Education and Preservation Project
    (EMEAPP) and what are its functions
  sentences:
  - The Electronic Music Education and Preservation Project (EMEAPP) is the steward
    of a privately held world-class curated collection of rare vintage electronic
    instruments and stage-used gear. This includes effects units, amps, organs, synthesizers,
    electro-mechanical instruments, guitars, prototypes, vintage audio/video media
    and analog studio gear. In addition, EMEAPP itself is cultivating its own humble
    collection. It is our charge to cultivate and reap excellent knowledge from these
    unique resources and return it to our members and the world. We do this as a learning
    center, through research projects, creative endeavors, media programming and tours,
    enlightening many people along the way. There is so much to be harvested from
    history; EMEAPP has a key to the vault. EMEAPP is a private museum, a critical
    learning center and a multi-media production studio nicely packed into a brick-and-mortar
    facility outside of Philadelphia, Pennsylvania. EMEAPP is a 501(c)(3) non-profit
    organization.
  - You got a problem? Yo, she'll splode it.
  - I love sex; I think sex is completely absurdly demonized in our culture. But in
    the end, however much sex you want to have, with however many people in how many
    ways, to be loved and to love is what human beings really want.
- source_sentence: What year did the Duchess die and where did it happen
  sentences:
  - 'League One


    League table


    Results summary


    Results by matchday


    Matches

    On 21 June 2018, the League One fixtures for the forthcoming season were announced.
    FA Cup


    The first round draw was made live on BBC by Dennis Wise and Dion Dublin on 22
    October.'
  - "The Duchess was widowed in 2007 and died in London in 2011. Issue \n\nThe Duke\
    \ and Duchess of Buccleuch and Queensberry had four children:\nRichard Scott,\
    \ 10th Duke of Buccleuch (b. 1954), married Lady Elizabeth Kerr, daughter of the\
    \ Marquess of Lothian, and has issue two sons and two daughters. Lord John (born\
    \ 9 August 1957), married Berrin Torolsan, and lives in Istanbul, Turkey. Lady\
    \ Charlotte-Anne (born 9 January 1966), married Count Bernard de Castellane in\
    \ 1991, and has issue two sons and a daughter. Lord Damian (born 8 October 1969),\
    \ married Elizabeth Powis, and has issue. External links\nJane in her wedding\
    \ dress \nMovie clip of Jane's wedding\n\nReferences \n\n1929 births\n2011 deaths\n\
    British duchesses by marriage\nJane\nScottish female models\nBritish cookbook\
    \ writers\nWomen cookbook writers"
  - Is this common, do other people with epilepsy have dangerously low appetites?
    So we left there and stopped and got her a bite to eat.
- source_sentence: Why is it important to keep moving over the summer
  sentences:
  - It's important to keep moving over the summer!
  - '2008. CHENG HF, LEE YM, Chu CH, Leung WK & Mok TMY. - Journal Editor (Hong Kong
    Medical Journal) 2008

    - Editor-in-Chief (Hong Kong Dental Journal) 2007

    - Editor-in-Chief (Hong Kong Dental Journal) 2006

    - Deputy Editor (Hong Kong Dental Journal) 2004'
  - Both demand collective action and shared resources. While one is distinctly egalitarian
    and the other hierarchical in nature, both speak of sublimating private goals
    for the achievement of larger, shared ones.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-l-v2.0

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-l-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0) <!-- at revision 7f311bb640ad3babc0a4e3a8873240dcba44c9d2 -->
- **Maximum Sequence Length:** 1024 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
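
To make the last two modules concrete, here is a minimal pure-Python sketch of what CLS pooling followed by L2 normalization does to one sequence of token embeddings. The 4-dimensional toy vectors and the helper names `cls_pool` and `l2_normalize` are illustrative assumptions, not part of the library; the real model works on 1024-dimensional tensors.

```python
import math

def cls_pool(token_embeddings):
    """Return the embedding of the first ([CLS]) token of one sequence."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]

# Toy stand-in for the transformer output: 3 tokens, 4 dimensions each.
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token, the only row CLS pooling keeps
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.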

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Jrinky/snowflake")
# Run inference
sentences = [
    'Why is it important to keep moving over the summer',
    "It's important to keep moving over the summer!",
    '2008. CHENG HF, LEE YM, Chu CH, Leung WK & Mok TMY. - Journal Editor (Hong Kong Medical Journal) 2008\n- Editor-in-Chief (Hong Kong Dental Journal) 2007\n- Editor-in-Chief (Hong Kong Dental Journal) 2006\n- Deputy Editor (Hong Kong Dental Journal) 2004',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
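
The similarity matrix above is most useful for ranking. The sketch below shows the idea without downloading the model: it treats the first vector as a query and ranks the remaining vectors by cosine similarity, the similarity function this card lists. The 3-dimensional vectors are hypothetical stand-ins for `model.encode(...)` output, and `cosine_similarity` and `rank_by_similarity` are illustrative helpers, not library functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: one query and two candidate passages.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 1.1],   # points in nearly the same direction as the query
    [-1.0, 0.5, 0.0],  # points away from the query
]

ranking = rank_by_similarity(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

With real embeddings, the same ranking can be read off one row of the matrix returned by `model.similarity`.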

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset


* Size: 69,500 training samples
* Columns: <code>anchor</code> and <code>positive</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor | positive |
  |:--------|:-------|:---------|
  | type    | string | string |
  | details | <ul><li>min: 6 tokens</li><li>mean: 17.47 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 113.33 tokens</li><li>max: 1024 tokens</li></ul> |
* Samples:
  | anchor | positive |
  |:-------|:---------|
  | <code>What might have been unnecessary if better emergency plans had been implemented</code> | <code>If better emergency plans had been in place, maybe chemical dipersants wouldn't be needed. And on and on.</code> |
  | <code>What was the year of publication for the 3rd Edition of 'Regular Polytopes' by H.S.M. Coxeter</code> | <code>Coxeter, Regular Polytopes, 3rd Edition, Dover New York, 1973 <br> Kaleidoscopes: Selected Writings of H.S.M. Coxeter, edited by F. Arthur Sherk, Peter McMullen, Anthony C. Thompson, Asia Ivic Weiss, Wiley-Interscience Publication, 1995, <br> (Paper 22) H.S.M.</code> |
  | <code>Who is the author of the GURPS Shapeshifters supplement</code> | <code>GURPS Shapeshifters () is a supplement by Robert M. Schroeck for the GURPS role-playing game system, third edition.</code> |
* Loss: <code>selfloss.Infonce</code> with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
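
The source of `selfloss.Infonce` is not part of this commit, so the following is only a sketch under an assumption: that it follows the standard in-batch-negatives InfoNCE formulation, the same family as the `MultipleNegativesRankingLoss` in Sentence Transformers, which takes the same `scale` and `similarity_fct` parameters. Each anchor is scored against every positive in the batch, the scores are multiplied by `scale`, and cross-entropy pushes the matching pair to the top. The function name `info_nce_loss` and the toy 2-dimensional vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each anchor matches its own positive
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other row

print(info_nce_loss(anchors, aligned))     # close to zero
print(info_nce_loss(anchors, misaligned))  # close to the scale value
```

A larger `scale` sharpens the softmax, so a well-separated batch yields a loss near zero, which is in line with the small loss values in the training logs.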

### Evaluation Dataset

#### Unnamed Dataset


* Size: 17,376 evaluation samples
* Columns: <code>anchor</code> and <code>positive</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor | positive |
  |:--------|:-------|:---------|
  | type    | string | string |
  | details | <ul><li>min: 6 tokens</li><li>mean: 16.87 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 115.36 tokens</li><li>max: 1024 tokens</li></ul> |
* Samples:
  | anchor | positive |
  |:-------|:---------|
  | <code>What impressive achievements did the Warriors accomplish during their last season in Division III</code> | <code>The Warriors were among the most lethal offensive teams in Division III this past year, posting a team batting average of .344 and averaging nearly seven runs per game, smacking 29 home runs, and collecting nearly 600 total bases. They shared the Little East Conference regular-season championship and later knocked off the top seed in the NCAA regional tournament (Montclair State) en route to their winningest season in 14 years.</code> |
  | <code>How many bars had nectar and capped honey on them</code> | <code>Eight of the bars had nectar and capped honey on them. There are eighteen bars with brood in some form on them and a mix of workers and drones.</code> |
  | <code>What idea is being requested regarding the 'triangle'</code> | <code>Next up...the "triangle". Please, seriously, if anyone could float me an idea, I would really appreciate it.</code> |
* Loss: <code>selfloss.Infonce</code> with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 3
- `per_device_eval_batch_size`: 3
- `learning_rate`: 5e-06
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `fp16`: True
- `batch_sampler`: no_duplicates
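
As a rough illustration of how `learning_rate`, `warmup_ratio`, and the linear scheduler (`lr_scheduler_type: linear`, listed in the full hyperparameters) interact, the sketch below computes the learning rate at a given optimizer step: a linear ramp from zero during warmup, then a linear decay back to zero. The helper `linear_schedule_lr` and the `total_steps=1000` value are assumptions for the example; the card does not state the total number of training steps.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=5e-06, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp down linearly from base_lr to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 1000  # hypothetical; not taken from the model card
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With these settings the peak rate of 5e-06 is reached after the first 10% of steps and the rate falls to zero at the final step.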
+
#### All Hyperparameters
|
| 302 |
+
<details><summary>Click to expand</summary>
|
| 303 |
+
|
| 304 |
+
- `overwrite_output_dir`: False
|
| 305 |
+
- `do_predict`: False
|
| 306 |
+
- `eval_strategy`: steps
|
| 307 |
+
- `prediction_loss_only`: True
|
| 308 |
+
- `per_device_train_batch_size`: 3
|
| 309 |
+
- `per_device_eval_batch_size`: 3
|
| 310 |
+
- `per_gpu_train_batch_size`: None
|
| 311 |
+
- `per_gpu_eval_batch_size`: None
|
| 312 |
+
- `gradient_accumulation_steps`: 1
|
| 313 |
+
- `eval_accumulation_steps`: None
|
| 314 |
+
- `torch_empty_cache_steps`: None
|
| 315 |
+
- `learning_rate`: 5e-06
|
| 316 |
+
- `weight_decay`: 0.0
|
| 317 |
+
- `adam_beta1`: 0.9
|
| 318 |
+
- `adam_beta2`: 0.999
|
| 319 |
+
- `adam_epsilon`: 1e-08
|
| 320 |
+
- `max_grad_norm`: 1.0
|
| 321 |
+
- `num_train_epochs`: 5
|
| 322 |
+
- `max_steps`: -1
|
| 323 |
+
- `lr_scheduler_type`: linear
|
| 324 |
+
- `lr_scheduler_kwargs`: {}
|
| 325 |
+
- `warmup_ratio`: 0.1
|
| 326 |
+
- `warmup_steps`: 0
|
| 327 |
+
- `log_level`: passive
|
| 328 |
+
- `log_level_replica`: warning
|
| 329 |
+
- `log_on_each_node`: True
|
| 330 |
+
- `logging_nan_inf_filter`: True
|
| 331 |
+
- `save_safetensors`: True
|
| 332 |
+
- `save_on_each_node`: False
|
| 333 |
+
- `save_only_model`: False
|
| 334 |
+
- `restore_callback_states_from_checkpoint`: False
|
| 335 |
+
- `no_cuda`: False
|
| 336 |
+
- `use_cpu`: False
|
| 337 |
+
- `use_mps_device`: False
|
| 338 |
+
- `seed`: 42
|
| 339 |
+
- `data_seed`: None
|
| 340 |
+
- `jit_mode_eval`: False
|
| 341 |
+
- `use_ipex`: False
|
| 342 |
+
- `bf16`: False
|
| 343 |
+
- `fp16`: True
|
| 344 |
+
- `fp16_opt_level`: O1
|
| 345 |
+
- `half_precision_backend`: auto
|
| 346 |
+
- `bf16_full_eval`: False
|
| 347 |
+
- `fp16_full_eval`: False
|
| 348 |
+
- `tf32`: None
|
| 349 |
+
- `local_rank`: 0
|
| 350 |
+
- `ddp_backend`: None
|
| 351 |
+
- `tpu_num_cores`: None
|
| 352 |
+
- `tpu_metrics_debug`: False
|
| 353 |
+
- `debug`: []
|
| 354 |
+
- `dataloader_drop_last`: True
|
| 355 |
+
- `dataloader_num_workers`: 0
|
| 356 |
+
- `dataloader_prefetch_factor`: None
|
| 357 |
+
- `past_index`: -1
|
| 358 |
+
- `disable_tqdm`: False
|
| 359 |
+
- `remove_unused_columns`: True
|
| 360 |
+
- `label_names`: None
|
| 361 |
+
- `load_best_model_at_end`: False
|
| 362 |
+
- `ignore_data_skip`: False
|
| 363 |
+
- `fsdp`: []
|
| 364 |
+
- `fsdp_min_num_params`: 0
|
| 365 |
+
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
|
| 366 |
+
- `fsdp_transformer_layer_cls_to_wrap`: None
|
| 367 |
+
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
|
| 368 |
+
- `deepspeed`: None
|
| 369 |
+
- `label_smoothing_factor`: 0.0
|
| 370 |
+
- `optim`: adamw_torch
|
| 371 |
+
- `optim_args`: None
|
| 372 |
+
- `adafactor`: False
|
| 373 |
+
- `group_by_length`: False
|
| 374 |
+
- `length_column_name`: length
|
| 375 |
+
- `ddp_find_unused_parameters`: None
|
| 376 |
+
- `ddp_bucket_cap_mb`: None
|
| 377 |
+
- `ddp_broadcast_buffers`: False
|
| 378 |
+
- `dataloader_pin_memory`: True
|
| 379 |
+
- `dataloader_persistent_workers`: False
|
| 380 |
+
- `skip_memory_metrics`: True
|
| 381 |
+
- `use_legacy_prediction_loop`: False
|
| 382 |
+
- `push_to_hub`: False
|
| 383 |
+
- `resume_from_checkpoint`: None
|
| 384 |
+
- `hub_model_id`: None
|
| 385 |
+
- `hub_strategy`: every_save
|
| 386 |
+
- `hub_private_repo`: False
|
| 387 |
+
- `hub_always_push`: False
|
| 388 |
+
- `gradient_checkpointing`: False
|
| 389 |
+
- `gradient_checkpointing_kwargs`: None
|
| 390 |
+
- `include_inputs_for_metrics`: False
|
| 391 |
+
- `eval_do_concat_batches`: True
|
| 392 |
+
- `fp16_backend`: auto
|
| 393 |
+
- `push_to_hub_model_id`: None
|
| 394 |
+
- `push_to_hub_organization`: None
|
| 395 |
+
- `mp_parameters`:
|
| 396 |
+
- `auto_find_batch_size`: False
|
| 397 |
+
- `full_determinism`: False
|
| 398 |
+
- `torchdynamo`: None
|
| 399 |
+
- `ray_scope`: last
|
| 400 |
+
- `ddp_timeout`: 1800
|
| 401 |
+
- `torch_compile`: False
|
| 402 |
+
- `torch_compile_backend`: None
|
| 403 |
+
- `torch_compile_mode`: None
|
| 404 |
+
- `dispatch_batches`: None
|
| 405 |
+
- `split_batches`: None
|
| 406 |
+
- `include_tokens_per_second`: False
|
| 407 |
+
- `include_num_input_tokens_seen`: False
|
| 408 |
+
- `neftune_noise_alpha`: None
|
| 409 |
+
- `optim_target_modules`: None
|
| 410 |
+
- `batch_eval_metrics`: False
|
| 411 |
+
- `eval_on_start`: False
|
| 412 |
+
- `eval_use_gather_object`: False
|
| 413 |
+
- `batch_sampler`: no_duplicates
|
| 414 |
+
- `multi_dataset_batch_sampler`: proportional
|
| 415 |
+
|
| 416 |
+
</details>
|
| 417 |
+
|
| 418 |
+
### Training Logs
|
| 419 |
+
| Epoch | Step | Training Loss | Validation Loss |
|
| 420 |
+
|:------:|:----:|:-------------:|:---------------:|
|
| 421 |
+
| 0.0777 | 150 | 0.0257 | 0.0134 |
|
| 422 |
+
| 0.1554 | 300 | 0.0136 | 0.0082 |
|
| 423 |
+
| 0.2332 | 450 | 0.0079 | 0.0062 |
|
| 424 |
+
| 0.3109 | 600 | 0.0065 | 0.0051 |
|
| 425 |
+
| 0.3886 | 750 | 0.0059 | 0.0045 |
|
| 426 |
+
| 0.4663 | 900 | 0.0057 | 0.0040 |
|
| 427 |
+
| 0.5440 | 1050 | 0.0064 | 0.0037 |
|
| 428 |
+
| 0.6218 | 1200 | 0.005 | 0.0034 |
|
| 429 |
+
| 0.6995 | 1350 | 0.0052 | 0.0034 |
|
| 430 |
+
| 0.7772 | 1500 | 0.0041 | 0.0032 |
|
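
The Epoch and Step columns allow a back-of-the-envelope estimate of the effective batch size. The arithmetic below is an inference from the logged numbers, not something the card states: it assumes every optimizer step consumes one batch per device, which is consistent with `gradient_accumulation_steps: 1`. The variable names are illustrative.

```python
train_samples = 69_500        # "Size: 69,500 training samples"
per_device_batch = 3          # per_device_train_batch_size
logged_step, logged_epoch = 1500, 0.7772  # last row of the training log

steps_per_epoch = logged_step / logged_epoch          # optimizer steps in one epoch
effective_batch = train_samples / steps_per_epoch     # samples consumed per step
implied_devices = effective_batch / per_device_batch  # devices needed to reach that batch

print(round(steps_per_epoch), round(effective_batch), round(implied_devices))
```

Under that assumption the log is consistent with roughly 1930 steps per epoch and an effective batch of about 36, which would correspond to about 12 devices. The log also stops at step 1500, before the end of the first of the 5 configured epochs.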
| 431 |
+
|
| 432 |
+
|
| 433 |
+
### Framework Versions
|
| 434 |
+
- Python: 3.12.3
|
| 435 |
+
- Sentence Transformers: 3.2.0
|
| 436 |
+
- Transformers: 4.44.2
|
| 437 |
+
- PyTorch: 2.6.0+cu124
|
| 438 |
+
- Accelerate: 1.3.0
|
| 439 |
+
- Datasets: 2.19.0
|
| 440 |
+
- Tokenizers: 0.19.1
|
| 441 |
+
|
| 442 |
+
## Citation
|
| 443 |
+
|
| 444 |
+
### BibTeX
|
| 445 |
+
|
| 446 |
+
#### Sentence Transformers
|
| 447 |
+
```bibtex
|
| 448 |
+
@inproceedings{reimers-2019-sentence-bert,
|
| 449 |
+
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
|
| 450 |
+
author = "Reimers, Nils and Gurevych, Iryna",
|
| 451 |
+
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
|
| 452 |
+
month = "11",
|
| 453 |
+
year = "2019",
|
| 454 |
+
publisher = "Association for Computational Linguistics",
|
| 455 |
+
url = "https://arxiv.org/abs/1908.10084",
|
| 456 |
+
}
|
| 457 |
+
```
|
| 458 |
+
|
| 459 |
+
#### Infonce
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->

config.json
ADDED
@@ -0,0 +1,28 @@
{
  "_name_or_path": "/root/autodl-tmp/work_dir/models/snowflake-arctic-embed-l-v2.0-selfloss/checkpoint-1500",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

config_sentence_transformers.json
ADDED
@@ -0,0 +1,12 @@
{
  "__version__": {
    "sentence_transformers": "3.2.0",
    "transformers": "4.44.2",
    "pytorch": "2.6.0+cu124"
  },
  "prompts": {
    "query": "query: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": null
}
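
The `prompts` entry above maps the name `query` to the prefix `query: `, and `default_prompt_name` is `null`, so no prefix is applied unless one is requested. As a rough illustration of that behaviour (not the library's actual code), the sketch below prepends the configured prefix only when a prompt name is given. The helper `apply_prompt` is hypothetical; in sentence-transformers the corresponding call is `model.encode(..., prompt_name="query")`.

```python
# Values copied from config_sentence_transformers.json above.
PROMPTS = {"query": "query: "}
DEFAULT_PROMPT_NAME = None


def apply_prompt(texts, prompt_name=DEFAULT_PROMPT_NAME, prompts=PROMPTS):
    """Hypothetical helper: prepend the named prompt to each text.

    With prompt_name=None (the configured default) texts are left unchanged.
    """
    if prompt_name is None:
        return list(texts)
    prefix = prompts[prompt_name]  # raises KeyError for unknown prompt names
    return [prefix + text for text in texts]


# Queries get the prefix; documents are encoded as-is under the default.
queries = apply_prompt(["what is a working landscape"], prompt_name="query")
documents = apply_prompt(["It omits what Raymond Williams called a working landscape."])
print(queries)
print(documents)
```

Under this configuration, forgetting to pass the prompt name for queries would silently encode them without the prefix, so it is worth checking at the call site.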

model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1c7800583766c01c099fbd028593959fbf0032a6c9e1b164a6a54389fda3d8da
size 2271064456
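
As a sanity check, the pointer's `size` can be compared with a parameter count estimated from the values in config.json. The sketch below assumes a standard BERT-style encoder layout (embeddings with LayerNorm, 24 identical layers, and a pooler head) stored as float32; that layout is an assumption, not something this commit states, and `estimate_parameters` is an illustrative name.

```python
# Values copied from config.json above.
cfg = {
    "vocab_size": 250002,
    "max_position_embeddings": 8194,
    "type_vocab_size": 1,
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_hidden_layers": 24,
}
SAFETENSORS_BYTES = 2271064456  # "size" from the LFS pointer above
BYTES_PER_PARAM = 4  # torch_dtype is float32


def estimate_parameters(c):
    """Assumed layout: standard BERT-style encoder with a pooler head."""
    h, ffn = c["hidden_size"], c["intermediate_size"]
    layer_norm = 2 * h  # weight + bias
    embeddings = (
        c["vocab_size"] + c["max_position_embeddings"] + c["type_vocab_size"]
    ) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V and output projections
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + c["num_hidden_layers"] * (attention + feed_forward) + pooler


params = estimate_parameters(cfg)
remainder = SAFETENSORS_BYTES - params * BYTES_PER_PARAM
print(f"estimated parameters: {params:,}")
print(f"bytes not explained by weights: {remainder:,}")
```

Under these assumptions the estimate comes to roughly 568 million parameters, and the few tens of kilobytes left over would be consistent with the file's metadata header, though that is an inference rather than something verified against the file.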

modules.json
ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
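
modules.json chains three modules in order: a Transformer, a Pooling layer, and a Normalize layer. The last step rescales each embedding to unit L2 length, which makes the dot product of two embeddings equal to their cosine similarity. A minimal pure-Python sketch of that final step follows; the function names are illustrative rather than taken from the library, and the pooling step is left out because its settings are not part of this section.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product and the cosine similarity agree.
print(dot(a_unit, b_unit), cosine(a, b))
```

Because the stored embeddings are already unit length, a downstream index can use plain dot-product search and still rank by cosine similarity.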

sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 1024,
  "do_lower_case": false
}
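
`max_seq_length` is set to 1024 here, well below the 8194 position embeddings declared in config.json, so longer inputs are truncated before they reach the encoder. The sketch below illustrates the idea on a list of token ids, assuming the usual XLM-RoBERTa single-sequence layout `<s> tokens </s>` with the ids 0 and 2 from config.json. `truncate_ids` is a hypothetical helper, not the tokenizer's implementation.

```python
BOS_ID, EOS_ID = 0, 2      # bos_token_id / eos_token_id from config.json
MAX_SEQ_LENGTH = 1024      # from sentence_bert_config.json


def truncate_ids(token_ids, max_len=MAX_SEQ_LENGTH):
    """Hypothetical helper: keep the leading tokens so that the sequence,
    including <s> and </s>, is at most max_len ids long."""
    keep = max_len - 2  # reserve room for the two special tokens
    return [BOS_ID] + list(token_ids[:keep]) + [EOS_ID]


short = truncate_ids([10, 11, 12])          # fits, only gains the special tokens
long = truncate_ids(range(100, 5000))       # 4900 ids, cut from the right
print(len(short), len(long))
```

Text beyond the limit does not influence the embedding at all, so long documents may need to be split into chunks before encoding.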

special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6e3b8957de04e3a4ed42b1a11381556f9adad8d0d502b9dd071c75f626b28f40
size 17083053

tokenizer_config.json
ADDED
@@ -0,0 +1,61 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "max_length": 512,
  "model_max_length": 1024,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "tokenizer_class": "XLMRobertaTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "<unk>"
}