deepvk/deberta-v1-base

falca committed on Aug 2, 2023

Commit 7b324fb · 1 Parent(s): 7506cb1

Update README.md

Files changed (1)

README.md +1 -1

@@ -49,7 +49,7 @@ A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.r
 1. Calculate shingles with size of 5
 2. Calculate MinHash with 100 seeds → for every sample (text) have a hash of size 100
 3. Split every hash into 10 buckets → every bucket, which contains (100 / 10) = 10 numbers, get hashed into 1 hash → we have 10 hashes for every sample
-4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard distance of similarity → if the similarity is >0.7 than it's a duplicate
 5. Gather duplicates from all the buckets and filter
 ### Training Hyperparameters

 1. Calculate shingles with size of 5
 2. Calculate MinHash with 100 seeds → for every sample (text) have a hash of size 100
 3. Split every hash into 10 buckets → every bucket, which contains (100 / 10) = 10 numbers, get hashed into 1 hash → we have 10 hashes for every sample
+4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard similarity → if the similarity is >0.7 than it's a duplicate
 5. Gather duplicates from all the buckets and filter
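
The five steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code used to build the dataset. Several details are assumptions the README does not specify: shingles are taken at character level, `blake2b` with a seed prefix stands in for the seeded hash family, and all function names (`shingles`, `minhash`, `bucket_hashes`, `find_duplicates`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_SEEDS = 100     # step 2: one min-hash value per seed
NUM_BUCKETS = 10    # step 3: signature split into 10 buckets of 10 values
SHINGLE_SIZE = 5    # step 1
THRESHOLD = 0.7     # step 4


def shingles(text, size=SHINGLE_SIZE):
    # Assumption: character-level shingles (the README does not say which unit is used).
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def seeded_hash(value, seed):
    # Assumption: blake2b with a seed prefix stands in for a seeded hash family.
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    # For every seed keep the smallest hash over all shingles of the sample.
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_seeds)]


def bucket_hashes(signature, num_buckets=NUM_BUCKETS):
    # Hash each group of (100 / 10) = 10 numbers into a single value.
    rows = len(signature) // num_buckets
    return [hash(tuple(signature[b * rows:(b + 1) * rows])) for b in range(num_buckets)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def find_duplicates(texts, threshold=THRESHOLD):
    shingle_sets = [shingles(t) for t in texts]
    tables = [defaultdict(list) for _ in range(NUM_BUCKETS)]
    for idx, shingle_set in enumerate(shingle_sets):
        for b, h in enumerate(bucket_hashes(minhash(shingle_set))):
            tables[b][h].append(idx)

    duplicates = set()
    for table in tables:                      # step 4: look inside each bucket
        for candidates in table.values():     # samples that share a bucket hash
            for i, j in combinations(candidates, 2):
                if jaccard(shingle_sets[i], shingle_sets[j]) > threshold:
                    duplicates.add((i, j))    # step 5: gather across buckets
    return duplicates
```

Calling `find_duplicates(texts)` on a list of strings returns a set of index pairs `(i, j)` with `i < j`. How the gathered pairs are then filtered (for example, which sample of each pair is kept) is not described in the README and is left out here.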
 ### Training Hyperparameters