baiango committed · Commit 1f21607 · verified · 1 Parent(s): 0efffc1

Update README.md

Files changed (1): README.md +67 -0

README.md CHANGED
---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---

### Model Description

Machine learning models like [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model is a SentencePiece unigram model trained with *dynamic sapient technology* on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset, and it compresses sparse bit strings much better than RLE.

- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0
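
For context on the RLE comparison, here is a minimal run-length baseline one could measure against. It is a sketch for illustration, not the model's method: each run is packed into one byte (1 bit for the symbol, 7 bits for the run length), and runs longer than 127 are split.

```py
def rle_cost_bits(bit_text):
    # Cost of a naive RLE that spends one byte per run:
    # 1 bit for the symbol ('0' or '1') + 7 bits for the run length.
    runs = 0
    i = 0
    while i < len(bit_text):
        j = i
        while j < len(bit_text) and bit_text[j] == bit_text[i]:
            j += 1
        run_length = j - i
        # Runs longer than 127 don't fit in 7 length bits; split them.
        runs += -(-run_length // 127)  # ceiling division
        i = j
    return runs * 8

print(rle_cost_bits("0001100000"))  # 3 runs -> 24 bits
```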

### Demo

Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space saving)

[The notebook](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb):
```py
from collections import Counter

import numpy as np
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string and shift each piece id down by 3 to skip
    # the <unk>/<s>/</s> special tokens, so every id fits in one byte.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Undo the shift and map each byte back to its SentencePiece token.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1') + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
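
Since the compression is lossless, the original bit string can be rebuilt from the decoded tokens. A minimal round-trip sketch (assuming, as in the output above, that the only markup in the pieces is SentencePiece's `▁` word-boundary marker):

```py
# Concatenate the decoded tokens and strip the '▁' marker SentencePiece
# prepends to the first piece; the result should equal the input exactly.
restored = "".join(decoded_tokens).replace("▁", "")
assert restored == new_sentence
```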

## Bias, Risks, and Limitations

The model has no sentient bias, only algorithmic bias; don't worry, it is not a living thing.
It does not compress strings with fewer zeros well, and the saving for any input can be estimated as sketched below.

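One way to estimate the saving for an arbitrary bit string, using 16 bits per token (an assumption inferred from the demo's 13-token, 208-bit figures; `bpe_processor` is the processor loaded in the notebook code above):

```py
def space_saving(bit_text):
    # Each token is counted as 16 bits, matching the demo's accounting;
    # the uncompressed input costs 1 bit per '0'/'1' character.
    n_tokens = len(bpe_processor.encode_as_pieces(bit_text))
    return 1.0 - (n_tokens * 16) / len(bit_text)

# The 384-bit demo string encodes to 13 tokens: 1 - 208/384 ≈ 45.83%.
```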

## Environmental Impact

- **Hardware Type:** GTX 1650 Mobile
- **Hours used:** 3