---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---

### Model Description

Machine learning models like [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with *dynamic sapient technology*: it is a SentencePiece unigram model fitted on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset, and it compresses these bitstrings much better than run-length encoding (RLE); a minimal RLE baseline is sketched after the list below.

- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0
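
For a concrete baseline to compare against, here is a minimal RLE sketch (a hypothetical one-byte-per-run scheme, not the author's code): every run of zeros and every lone one costs a full byte, which is the cost the learned unigram pieces undercut.

```py
# Hypothetical RLE baseline, shown only for comparison with the model.
# Each run is packed into one byte: 7 bits of run length, 1 bit of value.
def rle_encode(bit_text):
    out = []
    i = 0
    while i < len(bit_text):
        j = i
        while j < len(bit_text) and bit_text[j] == bit_text[i]:
            j += 1
        length = j - i
        while length > 0:  # split runs longer than 127 across bytes
            chunk = min(length, 127)
            out.append((chunk << 1) | int(bit_text[i]))
            length -= chunk
        i = j
    return bytes(out).hex()

print(rle_encode("0" * 22 + "1" + "0" * 30))  # 3 runs -> 3 bytes: 2c033c
```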

### Demo

Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space saving)

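Where those numbers come from (counting the hex string as ASCII text, one byte per character):

```py
# 13 one-byte ids -> 26 hex characters; as ASCII text that is 26 * 8 = 208 bits.
compressed_bits = len("1ab2ed09d7a9617206894e0608") * 8  # 208
print(1 - compressed_bits / 384)  # 0.4583... -> 45.83% space saving
```

Stored as raw bytes rather than hex text, the same 13 ids would take only 104 bits, roughly a 72.9% saving.
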
From [the notebook](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb):
```py
import numpy as np
import sentencepiece as spm
from collections import Counter

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Split the bitstring into learned pieces, then map each piece to its
    # vocabulary id. Ids 0-2 are SentencePiece's reserved control tokens
    # (<unk>, <s>, </s>), so shift down by 3 to pack real pieces into a byte.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    # Every id must fit in one byte for the two-digit hex format below.
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Reverse the mapping: hex -> bytes -> ids (undo the -3 shift) -> pieces.
    # Widen to int32 before adding 3 so ids near 255 don't wrap around.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1').astype(np.int32) + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode the 384-bit demo string.
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
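
The decoded pieces can also be joined back into the original bitstring. This relies on standard SentencePiece behavior (`decode_pieces` strips the `▁` word-boundary marker) rather than anything model-specific:

```py
# Round-trip check: the compression is lossless if this prints True.
restored = bpe_processor.decode_pieces(decoded_tokens)
print(restored == new_sentence)
```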

## Bias, Risks, and Limitations

It doesn't have any sentient bias, only algorithmic bias; don't worry, it's not a living thing.
The model doesn't compress strings with fewer zeros (denser bitstrings) as well, as the sketch below shows.
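
A quick way to observe this (a sketch reusing `bpe_processor` from the demo; the piece counts it prints depend on the trained model and are not measured results):

```py
# Denser 384-bit strings should split into more pieces, i.e. compress worse,
# since the learned vocabulary is dominated by long runs of zeros.
import random

random.seed(0)
for ones in (4, 48, 192):
    bits = ["0"] * 384
    for i in random.sample(range(384), ones):
        bits[i] = "1"
    n_pieces = len(bpe_processor.encode_as_pieces("".join(bits)))
    print(f"{ones} ones -> {n_pieces} pieces ({n_pieces} bytes)")
```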

## Environmental Impact
- **Hardware Type:** GTX 1650 Mobile
- **Hours used:** 3