File size: 2,695 Bytes
28c5847
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
PS C:\Users\Lenovo\Desktop\KK_Data\ERA_V4\Stock_Market_BPE> python .\train_tokenizer.py
File size: 3.68 MB
Reading data from stock_corpus.txt...
Data size: 3,817,320 characters
Sample data:
TECH|AAPL|2020-11|MON|UNDER200|OPEN:113.9|HIGH:117.8|LOW:113.7|CLOSE:115.9|VOL:HIGH
TECH|AAPL|2020-12|TUE|UNDER200|OPEN:117.8|HIGH:120.2|LOW:116.8|CLOSE:119.5|VOL:MED
TECH|AAPL|2020-12|WED|UNDER200|OP...

Training tokenizer with vocab size 5500...
This should take 2-5 minutes...

Training Stock BPE: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5244/5244 [51:21<00:00,  1.70merge/s]

Training complete. Final vocab size: 5500

Training took 3081.27 seconds (51.35 minutes)

Saving tokenizer...
βœ“ Saved to: stock_bpe.merges and stock_bpe.vocab

======================================================================
VERIFICATION RESULTS
======================================================================
Compression Ratio: 8.44
Vocabulary Size: 5500

======================================================================
βœ… SUCCESS: Requirements met!
  βœ“ Vocabulary size: 5500 (required: > 5000)
  βœ“ Compression ratio: 8.44 (required: >= 3.0)
======================================================================

Testing encoding/decoding...

Original: TECH|AAPL|2020-11|MON|UNDER200|OPEN:113.9|HIGH:117.8|LOW:113.7|CLOSE:115.9|VOL:HIGH
Encoded:  [518, 1895, 437, 634, 626, 638, 502, 634, 513, 637, 853]... (11 tokens)
Decoded:  TECH|AAPL|2020-11|MON|UNDER200|OPEN:113.9|HIGH:117.8|LOW:113.7|CLOSE:115.9|VOL:HIGH
Match: βœ“

Original: TECH|AAPL|2020-12|TUE|UNDER200|OPEN:117.8|HIGH:120.2|LOW:116.8|CLOSE:119.5|VOL:MED
Encoded:  [518, 1686, 638, 515, 2767, 639, 503, 633, 891]... (9 tokens)
Decoded:  TECH|AAPL|2020-12|TUE|UNDER200|OPEN:117.8|HIGH:120.2|LOW:116.8|CLOSE:119.5|VOL:MED
Match: βœ“

Original: TECH|AAPL|2020-12|WED|UNDER200|OPEN:118.8|HIGH:120.1|LOW:117.7|CLOSE:119.8|VOL:MED
Encoded:  [518, 1687, 636, 515, 2620, 638, 513, 633, 880]... (9 tokens)
Decoded:  TECH|AAPL|2020-12|WED|UNDER200|OPEN:118.8|HIGH:120.1|LOW:117.7|CLOSE:119.8|VOL:MED
Match: βœ“

βœ… All encoding/decoding tests passed!

======================================================================
STATISTICS
======================================================================
Total characters: 3,817,320
Total lines: 46,472
Vocabulary size: 5,500
Compression ratio: 8.44x
Original size: 3,817,320 bytes
Compressed size: 452,474 tokens
======================================================================