---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---

### Model Description

Machine learning models like [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with *dynamic sapient technology*: it is a SentencePiece unigram model fitted on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset, and it compresses these bitstrings much better than run-length encoding (RLE); a minimal RLE baseline is sketched after the list below.

- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0
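
For a concrete baseline to compare against, here is a minimal RLE sketch (a hypothetical one-byte-per-run scheme, not the author's code): every run of zeros and every lone one costs a full byte, which is the cost the learned unigram pieces undercut.

```py
# Hypothetical RLE baseline, shown only for comparison with the model.
# Each run is packed into one byte: 7 bits of run length, 1 bit of value.
def rle_encode(bit_text):
    out = []
    i = 0
    while i < len(bit_text):
        j = i
        while j < len(bit_text) and bit_text[j] == bit_text[i]:
            j += 1
        length = j - i
        while length > 0:  # split runs longer than 127 across bytes
            chunk = min(length, 127)
            out.append((chunk << 1) | int(bit_text[i]))
            length -= chunk
        i = j
    return bytes(out).hex()

print(rle_encode("0" * 22 + "1" + "0" * 30))  # 3 runs -> 3 bytes: 2c033c
```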

### Demo

Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space saving)

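Where those numbers come from (counting the hex string as ASCII text, one byte per character):

```py
# 13 one-byte ids -> 26 hex characters; as ASCII text that is 26 * 8 = 208 bits.
compressed_bits = len("1ab2ed09d7a9617206894e0608") * 8  # 208
print(1 - compressed_bits / 384)  # 0.4583... -> 45.83% space saving
```

Stored as raw bytes rather than hex text, the same 13 ids would take only 104 bits, roughly a 72.9% saving.
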
From [the notebook](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb):
```py
import numpy as np
import sentencepiece as spm
from collections import Counter

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Split the bitstring into learned pieces, then map each piece to its
    # vocabulary id. Ids 0-2 are SentencePiece's reserved control tokens
    # (<unk>, <s>, </s>), so shift down by 3 to pack real pieces into a byte.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    # Every id must fit in one byte for the two-digit hex format below.
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Reverse the mapping: hex -> bytes -> ids (undo the -3 shift) -> pieces.
    # Widen to int32 before adding 3 so ids near 255 don't wrap around.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1').astype(np.int32) + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode the 384-bit demo string.
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
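
The decoded pieces can also be joined back into the original bitstring. This relies on standard SentencePiece behavior (`decode_pieces` strips the `▁` word-boundary marker) rather than anything model-specific:

```py
# Round-trip check: the compression is lossless if this prints True.
restored = bpe_processor.decode_pieces(decoded_tokens)
print(restored == new_sentence)
```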

## Bias, Risks, and Limitations

It doesn't have any sentient bias, only algorithmic bias; don't worry, it's not a living thing.
The model doesn't compress strings with fewer zeros (denser bitstrings) as well, as the sketch below shows.
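
A quick way to observe this (a sketch reusing `bpe_processor` from the demo; the piece counts it prints depend on the trained model and are not measured results):

```py
# Denser 384-bit strings should split into more pieces, i.e. compress worse,
# since the learned vocabulary is dominated by long runs of zeros.
import random

random.seed(0)
for ones in (4, 48, 192):
    bits = ["0"] * 384
    for i in random.sample(range(384), ones):
        bits[i] = "1"
    n_pieces = len(bpe_processor.encode_as_pieces("".join(bits)))
    print(f"{ones} ones -> {n_pieces} pieces ({n_pieces} bytes)")
```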

## Environmental Impact
- **Hardware Type:** GTX 1650 Mobile
- **Hours used:** 3