Asilarknes committed
Commit a216fa7 · verified · 1 Parent(s): 58a7bd6

upload oneshot glm artifacts

README.md ADDED
@@ -0,0 +1,18 @@
+ # testoneshot
+
+ OneShot GLM analytic feature model artifacts.
+
+ Best saved readout checkpoint:
+ - stream_ce_lam10_long.pt
+ - config: d=896, r=320, L=10, vocab=8192, GLM train=3.67M tokens, valid=159k tokens
+ - pure analytic ridge lam=10 ppl: 287.93
+ - streaming CE readout tuned ppl: 92.75 after 3000 continuation steps
+
+ Important files:
+ - okenizer/glm16k.model, okenizer/glm16k.vocab
+ - scripts/torch_predictive_attn.py
+ - scripts/torch_ce_stream_readout.py
+ - scripts/glm_generate_saved.py
+ - logs/stream_ce_lam10_long.log
+
+ Generation quality is still weak despite improved PPL; this is a research artifact, not a production chat model.
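
For orientation, the "pure analytic ridge" number in the README appears to come from the closed-form solve in `fit_eval` (scripts/torch_predictive_attn.py). The block below restates that computation as a sketch. The symbols are our own shorthand and do not appear in the commit: \phi_t is the feature row at position t, e_{y_t} the one-hot next-token target, D the feature width, u the add-one-smoothed training unigram distribution, and f the `args.floor` value, which is not shown in this excerpt. The small epsilon inside the logarithm is omitted.

```latex
\begin{aligned}
A &= \sum_t \phi_t \phi_t^{\top}, \qquad
G = \sum_t \phi_t\, e_{y_t}^{\top}, \\
W_\lambda &= \Bigl(A + \lambda\,\tfrac{\operatorname{tr}(A)}{D}\, I_D\Bigr)^{-1} G, \\
p_\lambda(v \mid t) &= \frac{\max\bigl(0,\ (\phi_t^{\top} W_\lambda)_v\bigr) + f\,u_v}
                            {\sum_{v'} \Bigl[\max\bigl(0,\ (\phi_t^{\top} W_\lambda)_{v'}\bigr) + f\,u_{v'}\Bigr]}, \\
\mathrm{ppl}(\lambda) &= \exp\Bigl(-\tfrac{1}{n}\sum_t \log p_\lambda(y_t \mid t)\Bigr).
\end{aligned}
```

Because the penalty is scaled by tr(A)/D, lam is relative to the average feature energy rather than an absolute value, so lam=10 is a much heavier penalty than the raw number suggests.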
logs/generations.txt ADDED
@@ -0,0 +1,35 @@
+ [17:52:09] sv [165720.1 39223.6 29888.8 26770.9]
+ [17:52:27] sv [175787.4 39518.2 29955.8 27073.4]
+ [17:52:48] sv [186342.7 39784. 30000.3 27357.4]
+ [17:53:13] sv [197372.5 40017.1 30020.3 27622.6]
+ [17:53:41] sv [208562.8 40183.9 29996.6 27869.2]
+ [17:54:13] sv [219681. 40294. 29939. 28097.2]
+ [17:54:48] sv [230503.7 40357.3 29856. 28304.1]
+ [17:55:26] sv [240817.9 40382.5 29755.6 28485.4]
+ [17:56:08] sv [250420.1 40377.3 29644.3 28633.7]
+ [17:56:54] sv [259114.6 40346.8 29527.6 28740.9]
+ === PROMPT ===
+ Question: What is gravity?
+ Answer:
+ === OUTPUT ===
+ Question: What is gravity? Answer: - **AAFSA manufacturing this question: - During the best answer is **(D)** "praising:** If you are looking at all the light.** To understand why the same momentum, the source of a given-way the world is **cannot** (evaluation**. The Humburgry, you need a fiously and
+ === PROMPT ===
+ Question: Why is the sky blue?
+ Answer:
+ === OUTPUT ===
+ Question: Why is the sky blue? Answer: I was completed by the energy of the energy. This is not working. - **In summary: - Screen:** Agaff’s momentum is society in the University of the densities in the early it. The force is zero, the only force and the normal force. This force \( the force $F_x = \\frac{2}_}\
+ === PROMPT ===
+ Question: Explain photosynthesis in simple words.
+ Answer:
+ === OUTPUT ===
+ Question: Explain photosynthesis in simple words. Answer: **(Avviction:** Most of the tenes, the duine battlecard, and the source separation it can escape. * **Connect:** If you don't give you have sleep the source would have been a stretching, it must be a difficult that point. If it, the metal wire. The bounce offer the
+ === PROMPT ===
+ Question: If John has 3 apples and buys 2 more, how many apples does he have?
+ Answer:
+ === OUTPUT ===
+ Question: If John has 3 apples and buys 2 more, how many apples does he have? Answer: Sammerica Every, the sourceocalcalled. The correct answer is **(C) **Strongness**. If we want to see anywhere, I will see your retain: **1. Force you are standing force is the ball inside, it in a mathematical place where a living room tour the water of the following text: The Matteryone
+ === PROMPT ===
+ Question: Write a short friendly story about a robot learning to read.
+ Answer:
+ === OUTPUT ===
+ Question: Write a short friendly story about a robot learning to read. Answer: Answer: - The Many" (The Dear GBased on TSG), where you are seeing, and the world. Here are you have to go on the opposite sides.
logs/ridge_lam10_30_100.log ADDED
@@ -0,0 +1,15 @@
+ [15:54:57] mode dual_ridge_delta device cuda V 8192 train 3671502 valid 159631
+ [15:55:48] sv [165720.1 39223.6 29888.8 26770.9]
+ [15:56:07] sv [175787.4 39518.2 29955.8 27073.4]
+ [15:56:30] sv [186342.7 39784. 30000.3 27357.4]
+ [15:56:56] sv [197372.5 40017.1 30020.3 27622.6]
+ [15:57:27] sv [208562.8 40183.9 29996.6 27869.2]
+ [15:58:02] sv [219681. 40294. 29939. 28097.2]
+ [15:58:41] sv [230503.7 40357.3 29856. 28304.1]
+ [15:59:23] sv [240817.9 40382.5 29755.6 28485.4]
+ [16:00:10] sv [250420.1 40377.3 29644.3 28633.7]
+ [16:01:01] sv [259114.6 40346.8 29527.6 28740.9]
+ [16:24:15] lam 10.0 ppl 287.93
+ [16:24:26] lam 30.0 ppl 312.48
+ [16:24:37] lam 100.0 ppl 394.37
+ [16:24:37] TORCH_PRED mode=dual_ridge_delta ppl=287.93 lam=10.0 D=17409
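
The `D=17409` in the last log line can be cross-checked against the README config (d=896, r=320, L=10). The sketch below counts the feature columns that `features` and `apply_stack` in scripts/torch_predictive_attn.py appear to concatenate. It assumes the prediction features (`last_pred`, `H * last_pred`) are appended once after the final layer, which is the reading that reproduces the logged value. The function name `feature_dim` and its parameter names are ours, not part of the commit.

```python
def feature_dim(d, r, layers, extra_context=True, pred_features=True, pair_width=64):
    """Count readout feature columns for the assumed feature layout.

    d: embedding width, r: projection rank, layers: number of stacked maps.
    """
    # H, prev, H*prev, plus prev2, prev*prev2, H*prev2 when extra context is on
    base = 3 * d + (3 * d if extra_context else 0)
    # per layer: relu(q), |q|, q*q (r columns each) plus neighbour products q[i]*q[i+1]
    m = max(min(pair_width, r - 1), 0)
    per_layer = 3 * r + m
    # assumed to be added once, after the last layer: last_pred and H*last_pred
    pred = 2 * d if pred_features else 0
    # final +1 is the constant bias column
    return base + layers * per_layer + pred + 1


if __name__ == "__main__":
    print(feature_dim(896, 320, 10))
```

With these inputs the count is 5376 + 10 * 1024 + 1792 + 1, which matches the logged D.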
logs/stream_ce_lam10_long.log ADDED
@@ -0,0 +1,33 @@
+ [17:59:11] STREAM_CE device cuda V 8192 train 3671502 valid 159631
+ [18:00:00] sv [165720.1 39223.6 29888.8 26770.9]
+ [18:00:18] sv [175787.4 39518.2 29955.8 27073.4]
+ [18:00:39] sv [186342.7 39784. 30000.3 27357.4]
+ [18:01:04] sv [197372.5 40017.1 30020.3 27622.6]
+ [18:01:32] sv [208562.8 40183.9 29996.6 27869.2]
+ [18:02:03] sv [219681. 40294. 29939. 28097.2]
+ [18:02:38] sv [230503.7 40357.3 29856. 28304.1]
+ [18:03:16] sv [240817.9 40382.5 29755.6 28485.4]
+ [18:03:58] sv [250420.1 40377.3 29644.3 28633.7]
+ [18:04:43] sv [259114.6 40346.8 29527.6 28740.9]
+ [18:27:36] resumed /workspace/oneshot/logs/glm_d896_readout/stream_ce_lam10.pt ppl 136.3978129937227
+ [18:27:37] init_eval_start D 17409
+ [18:27:42] STREAM_CE init_ppl=136.40
+ /workspace/oneshot/torch_ce_stream_readout.py:138: UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
+ Consider using tensor.detach() first. (Triggered internally at /pytorch/torch/csrc/autograd/generated/python_variable_methods.cpp:836.)
+   log(f"step={step} loss={float(loss):.4f} ppl={ppl:.2f} best={best:.2f}")
+ [18:28:18] step=200 loss=4.3944 ppl=118.12 best=118.12
+ [18:28:54] step=400 loss=4.3815 ppl=113.37 best=113.37
+ [18:29:31] step=600 loss=4.3709 ppl=109.14 best=109.14
+ [18:30:08] step=800 loss=4.9107 ppl=106.80 best=106.80
+ [18:30:44] step=1000 loss=4.4804 ppl=104.45 best=104.45
+ [18:31:20] step=1200 loss=4.4660 ppl=102.33 best=102.33
+ [18:31:57] step=1400 loss=4.4507 ppl=101.31 best=101.31
+ [18:32:33] step=1600 loss=4.3290 ppl=99.47 best=99.47
+ [18:33:09] step=1800 loss=4.2714 ppl=98.49 best=98.49
+ [18:33:46] step=2000 loss=4.2895 ppl=96.73 best=96.73
+ [18:34:22] step=2200 loss=4.2743 ppl=96.07 best=96.07
+ [18:34:59] step=2400 loss=3.7439 ppl=94.83 best=94.83
+ [18:35:35] step=2600 loss=4.1255 ppl=93.89 best=93.89
+ [18:36:12] step=2800 loss=3.2027 ppl=93.24 best=93.24
+ [18:36:48] step=3000 loss=3.7933 ppl=92.75 best=92.75
+ [18:36:48] STREAM_CE best_ppl=92.75
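
The `loss` and `ppl` columns in this log are not two views of the same quantity. In scripts/torch_ce_stream_readout.py the logged loss is the cross-entropy of the single most recent training minibatch, while ppl is recomputed over the validation tokens. The small sketch below, using only the standard `math` module, converts between the two scales so the gap is visible. The helper name `perplexity` and the choice of rows are ours.

```python
import math


def perplexity(mean_nll):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


# (step, minibatch train loss, validation ppl) copied from the log above
rows = [(200, 4.3944, 118.12), (2800, 3.2027, 93.24), (3000, 3.7933, 92.75)]

if __name__ == "__main__":
    for step, loss, valid_ppl in rows:
        # ppl implied by the noisy minibatch loss vs. the validation loss implied by ppl
        print(step, round(perplexity(loss), 2), round(math.log(valid_ppl), 4))
```

The minibatch loss jumps between steps (for example 3.2027 at step 2800 and 3.7933 at step 3000) while validation ppl decreases steadily, so the ppl column is the one to track.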
scripts/glm_generate_saved.py ADDED
@@ -0,0 +1,88 @@
+ import argparse, os
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ from torch_predictive_attn import ppmi_embed, learn_map, doc_index, apply_stack, features, log
+
+
+ def build(args, device):
+     import sentencepiece as spm
+     sp = spm.SentencePieceProcessor(model_file=args.spm_model)
+     eos = sp.eos_id(); V = sp.get_piece_size()
+     train = np.fromfile(args.train_bin, dtype=np.uint16)
+     E = ppmi_embed(train, V, args.d, args.window, args.cooc_tokens, device)
+     Ps, Bs = [], []
+     for _ in range(args.layers):
+         P, B = learn_map(train, E, Ps, Bs, eos, args, device)
+         Ps.append(P); Bs.append(B)
+     return sp, eos, E, Ps, Bs
+
+
+ def next_logits(ids, E, Ps, Bs, W, b, eos, args, device):
+     x = torch.tensor(ids, device=device, dtype=torch.long)
+     _, within = doc_index(x, eos)
+     H, phis = apply_stack(x, E, Ps, Bs, within, args)
+     Phi = features(H, within, phis, args.extra_context)[-1:]
+     return (Phi @ W + b).squeeze(0)
+
+
+ def sample(logits, temp=0.8, top_k=40):
+     logits = logits / max(temp, 1e-6)
+     vals, idx = torch.topk(logits, min(top_k, logits.numel()))
+     probs = F.softmax(vals, dim=-1)
+     return int(idx[torch.multinomial(probs, 1)])
+
+
+ def main():
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--ckpt", default="/workspace/oneshot/logs/glm_d896_readout/stream_ce_lam10.pt")
+     ap.add_argument("--spm_model", default="/workspace/glm/glm16k.model")
+     ap.add_argument("--train_bin", default="/workspace/glm/glm_train.bin")
+     ap.add_argument("--d", type=int, default=896)
+     ap.add_argument("--r", type=int, default=320)
+     ap.add_argument("--layers", type=int, default=10)
+     ap.add_argument("--att_window", type=int, default=10)
+     ap.add_argument("--temp", type=float, default=0.28)
+     ap.add_argument("--window", type=int, default=10)
+     ap.add_argument("--extra_context", type=int, default=1)
+     ap.add_argument("--res_scale", type=float, default=0.07)
+     ap.add_argument("--pred_scale", type=float, default=0.035)
+     ap.add_argument("--pred_schedule", default="late")
+     ap.add_argument("--orth_delta", type=int, default=1)
+     ap.add_argument("--pred_norm", type=int, default=1)
+     ap.add_argument("--pred_features", type=int, default=1)
+     ap.add_argument("--map_lam", type=float, default=0.001)
+     ap.add_argument("--cooc_tokens", type=int, default=3_600_000)
+     ap.add_argument("--proj_tokens", type=int, default=3_600_000)
+     ap.add_argument("--chunk_docs", type=int, default=8)
+     ap.add_argument("--value_mode", default="dual_ridge_delta")
+     ap.add_argument("--max_new", type=int, default=80)
+     ap.add_argument("--sample_temp", type=float, default=0.8)
+     ap.add_argument("--top_k", type=int, default=40)
+     args = ap.parse_args()
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     sp, eos, E, Ps, Bs = build(args, device)
+     ck = torch.load(args.ckpt, map_location=device)
+     W = ck["W"].to(device); b = ck["b"].to(device)
+     prompts = [
+         "Question: What is gravity?\nAnswer:",
+         "Question: Why is the sky blue?\nAnswer:",
+         "Question: Explain photosynthesis in simple words.\nAnswer:",
+         "Question: If John has 3 apples and buys 2 more, how many apples does he have?\nAnswer:",
+         "Question: Write a short friendly story about a robot learning to read.\nAnswer:",
+     ]
+     for p in prompts:
+         ids = sp.encode(p)
+         for _ in range(args.max_new):
+             tok = sample(next_logits(ids, E, Ps, Bs, W, b, eos, args, device), args.sample_temp, args.top_k)
+             ids.append(tok)
+             if tok == eos:
+                 break
+         print("=== PROMPT ===")
+         print(p)
+         print("=== OUTPUT ===")
+         print(sp.decode(ids))
+
+
+ if __name__ == "__main__":
+     main()
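
The `sample` function above combines temperature scaling with top-k truncation. As a dependency-free illustration of the same idea, the sketch below does the equivalent steps with the standard library only. It is not the committed implementation: the name `sample_top_k` and the `rng` parameter are ours, and it works on a plain list of floats instead of a tensor.

```python
import math
import random


def sample_top_k(logits, temp=0.8, top_k=40, rng=random):
    """Sample an index from the top_k largest logits after temperature scaling."""
    # Lower temperature sharpens the distribution; guard against division by zero.
    scaled = [x / max(temp, 1e-6) for x in logits]
    k = min(top_k, len(scaled))
    # Indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    # With top_k=1 the choice is deterministic: the argmax.
    print(sample_top_k([0.1, 5.0, -2.0], top_k=1))
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which the committed script does not attempt.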
scripts/glm_prep.py ADDED
@@ -0,0 +1,170 @@
+ """Prepare GLM-5.1-Reasoning (main subset prefix) for the analytic model.
+
+   1. build a text corpus sample and train a 16k BPE SentencePiece tokenizer
+      (keeps the V x V PMI cooccurrence + SVD feasible, unlike GPT-2's 50k);
+   2. tokenize every record as input + "\n\n" + output with <eos> between
+      documents, into train.bin / valid.bin (uint16).
+ """
+ import os, sys, json, time, numpy as np
+ import sentencepiece as spm
+
+ DATA = os.environ.get("ONESHOT_DATA", "/workspace/ts")
+ SRC = os.path.join(DATA, "main_prefix.jsonl")
+ CORPUS = os.path.join(DATA, "glm_corpus.txt")
+ SPM_PREFIX = os.path.join(DATA, "glm16k")
+ VOCAB = 16384
+ HF_DATASET = "Jackrong/GLM-5.1-Reasoning-1M-Cleaned"
+
+ def log(*a): print(f"[{time.strftime('%H:%M:%S')}]", *a, flush=True)
+
+ def first_present(record, names, default=""):
+     for name in names:
+         if name in record and record[name] is not None:
+             return record[name]
+     return default
+
+ def normalize_record(record):
+     inp = first_present(record, ["input", "prompt", "instruction", "question", "query"])
+     out = first_present(record, ["output", "response", "answer", "completion"])
+     if isinstance(inp, (list, dict)):
+         inp = json.dumps(inp, ensure_ascii=False)
+     if isinstance(out, (list, dict)):
+         out = json.dumps(out, ensure_ascii=False)
+     inp = str(inp).strip()
+     out = str(out).strip()
+     if not inp or not out:
+         return None
+     return {"input": inp, "output": out}
+
+ def download_jsonl(dataset=HF_DATASET, split="train", subset=None, max_records=0):
+     from datasets import load_dataset
+     kwargs = {"split": split, "streaming": True}
+     ds = load_dataset(dataset, subset, **kwargs) if subset else load_dataset(dataset, **kwargs)
+     n = 0
+     os.makedirs(DATA, exist_ok=True)
+     with open(SRC, "w", encoding="utf-8") as out:
+         for row in ds:
+             rec = normalize_record(row)
+             if rec is None:
+                 continue
+             out.write(json.dumps(rec, ensure_ascii=False) + "\n")
+             n += 1
+             if n % 10000 == 0:
+                 log(f"downloaded {n:,} records -> {SRC}")
+             if max_records and n >= max_records:
+                 break
+     log(f"download done: {n:,} records -> {SRC}")
+
+ def answer_of(r):
+     """The actual English response: the <think> reasoning dump is stripped,
+     keep the final answer."""
+     o = r.get("output", "")
+     if "</think>" in o:
+         o = o.split("</think>")[-1]
+     return o.strip()
+
+ def is_english_answer(a):
+     """Keep natural-language answers; drop code/math/LaTeX-dominated ones so the
+     model learns to answer in plain English (the 'answer English' goal)."""
+     if not (40 <= len(a) <= 4000):
+         return False
+     if "```" in a:  # code fence
+         return False
+     alpha = sum(c.isalpha() or c.isspace() for c in a) / len(a)
+     if alpha < 0.93:
+         return False
+     sym = sum(a.count(c) for c in "{}\\$=#|<>_~^")
+     if sym / len(a) > 0.02:  # LaTeX / code punctuation density
+         return False
+     return True
+
+ def set_paths(data):
+     global DATA, SRC, CORPUS, SPM_PREFIX
+     DATA = data
+     SRC = os.path.join(DATA, "main_prefix.jsonl")
+     CORPUS = os.path.join(DATA, "glm_corpus.txt")
+     SPM_PREFIX = os.path.join(DATA, "glm16k")
+
+ def build_corpus(max_records=120_000, max_bytes=400_000_000):
+     n = 0; b = 0
+     with open(SRC, "r", encoding="utf-8", errors="ignore") as f, \
+          open(CORPUS, "w", encoding="utf-8") as out:
+         for line in f:
+             line = line.strip()
+             if not line: continue
+             try: r = json.loads(line)
+             except Exception: continue
+             txt = r["input"] + "\n" + r["output"] + "\n"
+             out.write(txt); b += len(txt); n += 1
+             if n >= max_records or b >= max_bytes: break
+     log(f"corpus: {n:,} records, {b/1e6:.1f} MB -> {CORPUS}")
+
+ def train_spm():
+     spm.SentencePieceTrainer.train(
+         input=CORPUS, model_prefix=SPM_PREFIX, vocab_size=VOCAB,
+         model_type="bpe", character_coverage=0.9995,
+         input_sentence_size=3_000_000, shuffle_input_sentence=True,
+         max_sentence_length=100000, num_threads=32,
+         unk_id=0, bos_id=1, eos_id=2, pad_id=-1,
+         byte_fallback=True,
+     )
+     log(f"trained SP -> {SPM_PREFIX}.model (vocab={VOCAB})")
+
+ def tokenize(val_frac=0.04, english_only=True):
+     sp = spm.SentencePieceProcessor(model_file=SPM_PREFIX + ".model")
+     eos = sp.eos_id()
+     log("scanning + filtering records...")
+     docs = []; seen = 0; t0 = time.time()
+     with open(SRC, "r", encoding="utf-8", errors="ignore") as f:
+         for line in f:
+             line = line.strip()
+             if not line: continue
+             try: r = json.loads(line)
+             except Exception: continue
+             seen += 1
+             a = answer_of(r)
+             if english_only and not is_english_answer(a):
+                 continue
+             docs.append(r["input"].strip() + "\n\n" + a)
+     log(f"{seen:,} records -> {len(docs):,} kept "
+         f"({100*len(docs)/max(seen,1):.1f}%) english_only={english_only}")
+     n_val = int(len(docs) * val_frac)
+     splits = {"glm_train.bin": docs[:len(docs) - n_val],
+               "glm_valid.bin": docs[len(docs) - n_val:]}
+     counts = {}
+     for fname, dlist in splits.items():
+         nt = 0
+         with open(os.path.join(DATA, fname), "wb") as fo:
+             for b in range(0, len(dlist), 1000):
+                 for ids in sp.encode(dlist[b:b + 1000]):
+                     arr = np.array(ids + [eos], dtype=np.uint16)
+                     arr.tofile(fo); nt += len(arr)
+         counts[fname] = nt
+     log(f"DONE train={counts['glm_train.bin']:,} tokens, "
+         f"valid={counts['glm_valid.bin']:,} tokens ({time.time()-t0:.0f}s)")
+
+ if __name__ == "__main__":
+     import argparse
+     ap = argparse.ArgumentParser()
+     ap.add_argument("cmd", nargs="?", default="all",
+                     choices=["download", "corpus", "spm", "tok", "all"])
+     ap.add_argument("--data", default=DATA)
+     ap.add_argument("--src", default=None)
+     ap.add_argument("--vocab", type=int, default=VOCAB)
+     ap.add_argument("--dataset", default=HF_DATASET)
+     ap.add_argument("--subset", default=None)
+     ap.add_argument("--split", default="train")
+     ap.add_argument("--max_records", type=int, default=0)
+     ap.add_argument("--english_only", type=int, default=1)
+     ap.add_argument("--val_frac", type=float, default=0.04)
+     args = ap.parse_args()
+     set_paths(args.data)
+     if args.src:
+         SRC = args.src
+     VOCAB = args.vocab
+     cmd = args.cmd
+     if cmd in ("download",):
+         download_jsonl(args.dataset, args.split, args.subset, args.max_records)
+     if cmd in ("corpus", "all"): build_corpus(max_records=args.max_records or 120_000)
+     if cmd in ("spm", "all"): train_spm()
+     if cmd in ("tok", "all"): tokenize(args.val_frac, bool(args.english_only))
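
The `tokenize` step above writes each document as its token ids followed by one `<eos>` id, all as unsigned 16-bit integers in a single flat file. The readers in the other scripts then recover document boundaries by looking for positions that follow an `<eos>`. The sketch below illustrates that layout with the standard `array` module rather than NumPy. `EOS = 2` mirrors `eos_id=2` from `train_spm`, and the helper names `pack` and `doc_starts` are ours.

```python
from array import array

EOS = 2  # assumed to match eos_id=2 passed to the SentencePiece trainer


def pack(docs):
    """Concatenate token-id lists into one uint16 buffer, appending EOS after each."""
    buf = array("H")  # "H" is an unsigned short, 16 bits on common platforms
    for ids in docs:
        buf.extend(ids + [EOS])
    return buf


def doc_starts(tokens):
    """Positions where a document begins: index 0 and every index right after an EOS."""
    return [i for i in range(len(tokens)) if i == 0 or tokens[i - 1] == EOS]


if __name__ == "__main__":
    stream = pack([[5, 6], [7]])
    print(list(stream), doc_starts(stream))
```

A 16-bit type is sufficient here because both the default vocabulary size (16384) and the 8192 reported in the logs fit below 65536.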
scripts/torch_ce_stream_readout.py ADDED
@@ -0,0 +1,147 @@
+ import argparse, math, os, random
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ from torch_predictive_attn import ppmi_embed, learn_map, doc_index, apply_stack, features, log
+
+
+ def iter_chunks(tokens, eos, args, max_tokens, shuffle=False):
+     xnp = tokens[:max_tokens]
+     starts = np.flatnonzero(np.r_[True, xnp[:-1] == eos])
+     ids = list(range(0, len(starts), args.chunk_docs))
+     if shuffle:
+         random.shuffle(ids)
+     for i in ids:
+         lo = starts[i]
+         hi = starts[i + args.chunk_docs] if i + args.chunk_docs < len(starts) else len(xnp)
+         yield xnp[lo:hi]
+
+
+ def chunk_features(xnp, E, Ps, Bs, eos, args, device):
+     x = torch.tensor(xnp.astype(np.int64), device=device)
+     seg, within = doc_index(x, eos)
+     H, phis = apply_stack(x, E, Ps, Bs, within, args)
+     Phi = features(H, within, phis, args.extra_context)
+     y = torch.empty(len(x), device=device, dtype=torch.long)
+     y[:-1] = x[1:]; y[-1] = eos
+     m = torch.ones(len(x), device=device, dtype=torch.bool)
+     m[-1] = False; m[:-1] &= seg[1:].eq(seg[:-1]); m &= x.ne(eos)
+     return Phi[m].float(), y[m]
+
+
+ def eval_ppl(tokens, E, Ps, Bs, W, b, eos, args, device):
+     nll = 0.0; n = 0
+     with torch.no_grad():
+         for xnp in iter_chunks(tokens, eos, args, args.eval_tokens, shuffle=False):
+             X, y = chunk_features(xnp, E, Ps, Bs, eos, args, device)
+             for i in range(0, len(y), args.batch):
+                 logits = X[i:i+args.batch] @ W + b
+                 nll += float(F.cross_entropy(logits, y[i:i+args.batch], reduction="sum"))
+                 n += len(y[i:i+args.batch])
+     return math.exp(nll / max(1, n))
+
+
+ def main():
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--data", default="/workspace/glm")
+     ap.add_argument("--spm_model", default="/workspace/glm/glm16k.model")
+     ap.add_argument("--train_bin", default="/workspace/glm/glm_train.bin")
+     ap.add_argument("--valid_bin", default="/workspace/glm/glm_valid.bin")
+     ap.add_argument("--vocab", type=int, default=8192)
+     ap.add_argument("--d", type=int, default=896)
+     ap.add_argument("--r", type=int, default=320)
+     ap.add_argument("--layers", type=int, default=10)
+     ap.add_argument("--att_window", type=int, default=10)
+     ap.add_argument("--temp", type=float, default=0.28)
+     ap.add_argument("--window", type=int, default=10)
+     ap.add_argument("--extra_context", type=int, default=1)
+     ap.add_argument("--res_scale", type=float, default=0.07)
+     ap.add_argument("--pred_scale", type=float, default=0.035)
+     ap.add_argument("--pred_schedule", default="late")
+     ap.add_argument("--orth_delta", type=int, default=1)
+     ap.add_argument("--pred_norm", type=int, default=1)
+     ap.add_argument("--pred_features", type=int, default=1)
+     ap.add_argument("--map_lam", type=float, default=0.001)
+     ap.add_argument("--cooc_tokens", type=int, default=3_600_000)
+     ap.add_argument("--proj_tokens", type=int, default=3_600_000)
+     ap.add_argument("--fit_tokens", type=int, default=3_600_000)
+     ap.add_argument("--eval_tokens", type=int, default=159_631)
+     ap.add_argument("--chunk_docs", type=int, default=8)
+     ap.add_argument("--value_mode", default="dual_ridge_delta")
+     ap.add_argument("--ridge_lam", type=float, default=10.0)
+     ap.add_argument("--init_scale", type=float, default=0.05)
+     ap.add_argument("--steps", type=int, default=800)
+     ap.add_argument("--batch", type=int, default=2048)
+     ap.add_argument("--lr", type=float, default=0.003)
+     ap.add_argument("--wd", type=float, default=1e-4)
+     ap.add_argument("--eval_every", type=int, default=100)
+     ap.add_argument("--save", default="")
+     ap.add_argument("--resume", default="")
+     args = ap.parse_args()
+
+     import sentencepiece as spm
+     sp = spm.SentencePieceProcessor(model_file=args.spm_model)
+     eos = sp.eos_id(); V = sp.get_piece_size()
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     train = np.fromfile(args.train_bin, dtype=np.uint16)
+     valid = np.fromfile(args.valid_bin, dtype=np.uint16)
+     log("STREAM_CE device", device, "V", V, "train", len(train), "valid", len(valid))
+     E = ppmi_embed(train, V, args.d, args.window, args.cooc_tokens, device)
+     Ps, Bs = [], []
+     for _ in range(args.layers):
+         P, B = learn_map(train, E, Ps, Bs, eos, args, device)
+         Ps.append(P); Bs.append(B)
+
+     # Build ridge init streaming stats only.
+     A = G = None
+     for xnp in iter_chunks(train, eos, args, args.fit_tokens, shuffle=False):
+         X, y = chunk_features(xnp, E, Ps, Bs, eos, args, device)
+         if A is None:
+             D = X.shape[1]
+             A = torch.zeros((D, D), device=device, dtype=torch.float64)
+             G = torch.zeros((D, V), device=device, dtype=torch.float64)
+         Xd = X.double()
+         A += Xd.T @ Xd
+         G.index_add_(1, y, Xd.T)
+     diag = torch.trace(A) / A.shape[0]
+     W0 = torch.linalg.solve(A + args.ridge_lam * diag * torch.eye(A.shape[0], device=device, dtype=torch.float64), G).float()
+     W = (args.init_scale * W0).detach().clone()
+     b = torch.zeros(V, device=device)
+     if args.resume and os.path.exists(args.resume):
+         ck = torch.load(args.resume, map_location=device)
+         W = ck["W"].to(device)
+         b = ck["b"].to(device)
+         log("resumed", args.resume, "ppl", ck.get("ppl"))
+     W = W.requires_grad_(True)
+     b = b.requires_grad_(True)
+     opt = torch.optim.AdamW([W, b], lr=args.lr, weight_decay=args.wd)
+     log("init_eval_start D", W.shape[0])
+     best = eval_ppl(valid, E, Ps, Bs, W, b, eos, args, device)
+     log(f"STREAM_CE init_ppl={best:.2f}")
+
+     step = 0
+     while step < args.steps:
+         for xnp in iter_chunks(train, eos, args, args.fit_tokens, shuffle=True):
+             X, y = chunk_features(xnp, E, Ps, Bs, eos, args, device)
+             if len(y) == 0:
+                 continue
+             idx = torch.randint(0, len(y), (min(args.batch, len(y)),), device=device)
+             loss = F.cross_entropy(X[idx] @ W + b, y[idx])
+             opt.zero_grad(set_to_none=True); loss.backward(); opt.step()
+             step += 1
+             if step % args.eval_every == 0:
+                 ppl = eval_ppl(valid, E, Ps, Bs, W, b, eos, args, device)
+                 if ppl < best:
+                     best = ppl
+                     if args.save:
+                         torch.save({"W": W.detach().cpu(), "b": b.detach().cpu(), "ppl": best, "args": vars(args)}, args.save)
+                 log(f"step={step} loss={float(loss):.4f} ppl={ppl:.2f} best={best:.2f}")
+             if step >= args.steps:
+                 break
+     log(f"STREAM_CE best_ppl={best:.2f}")
+     if args.save:
+         torch.save({"W": W.detach().cpu(), "b": b.detach().cpu(), "ppl": best, "args": vars(args)}, args.save)
+
+
+ if __name__ == "__main__":
+     main()
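
The mask built in `chunk_features` decides which positions contribute a (feature, next-token) training pair. Read closely, its three conditions reduce to one rule: a position is kept when it is not the last token of the chunk and is not itself an `<eos>`. The pure-Python sketch below spells that out on a toy token stream. The function name `training_pairs` is ours, and it returns positions and targets rather than tensors.

```python
def training_pairs(tokens, eos):
    """Return (position, next_token) pairs kept by the next-token mask.

    A position is dropped when it is the last one in the chunk (no target exists)
    or when it holds EOS (its successor belongs to the next document).
    """
    pairs = []
    for i in range(len(tokens) - 1):
        if tokens[i] == eos:
            continue
        pairs.append((i, tokens[i + 1]))
    return pairs


if __name__ == "__main__":
    # Two documents, [5, 6] and [7, 8], each terminated by EOS = 2.
    print(training_pairs([5, 6, 2, 7, 8, 2], eos=2))
```

Note that the last real token of each document is kept with `<eos>` as its target, so the readout is trained to predict document ends, which is what lets the generation loop stop early.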
scripts/torch_predictive_attn.py ADDED
@@ -0,0 +1,256 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse, math, os
2
+ import numpy as np
3
+ import torch
4
+
5
+
6
+ def log(*a):
7
+ import time
8
+ print(f"[{time.strftime('%H:%M:%S')}]", *a, flush=True)
9
+
10
+
11
+ def doc_index(x, eos):
12
+ starts = torch.zeros_like(x, dtype=torch.bool)
13
+ starts[0] = True
14
+ starts[1:] = x[:-1].eq(eos)
15
+ seg = torch.cumsum(starts.long(), 0) - 1
16
+ first = torch.nonzero(starts, as_tuple=False).flatten()
17
+ within = torch.arange(len(x), device=x.device) - first[seg]
18
+ return seg, within
19
+
20
+
21
+ def shift(M, within, s):
22
+ out = torch.zeros_like(M)
23
+ out[s:] = M[:-s]
24
+ out[within < s] = 0
25
+ return out
26
+
27
+
28
+ def ppmi_embed(tokens, V, d, window, max_tokens, device):
29
+ x = torch.tensor(tokens[:max_tokens].astype(np.int64), device=device)
30
+ C = torch.zeros((V, V), device=device)
31
+ for s in range(1, window + 1):
32
+ idx = x[:-s] * V + x[s:]
33
+ C += torch.bincount(idx, minlength=V * V).float().reshape(V, V) / s
34
+ tot = C.sum()
35
+ row = C.sum(1, keepdim=True) + 1e-6
36
+ col = C.sum(0, keepdim=True) + 1e-6
37
+ M = torch.clamp(torch.log(C * tot / row / col + 1e-12), min=0)
38
+ U, S, _ = torch.linalg.svd(M)
39
+ E = U[:, :d] * torch.sqrt(S[:d])[None, :]
40
+ E = torch.nn.functional.normalize(E, dim=1)
41
+ return E.float()
42
+
43
+
44
+ def pred_layer(H, x, E, P, B, within, args, layer_idx=0):
45
+ qk = H @ P
46
+ if args.value_mode in ("h", "dual_next", "dual_ridge_next", "dual_ridge_delta"):
47
+ Vv = H
48
+ elif args.value_mode == "next":
49
+ Vv = E[x]
50
+ Vv = torch.cat([Vv[1:], Vv[-1:]], 0)
51
+ elif args.value_mode == "delta":
52
+ Y = E[x]
53
+ Y = torch.cat([Y[1:], Y[-1:]], 0)
54
+ Vv = Y - H
55
+ elif args.value_mode == "ridge_next":
56
+ Vv = H @ B
57
+ elif args.value_mode == "ridge_delta":
58
+ Vv = H @ B - H
59
+ num = torch.zeros_like(H)
60
+ num_pred = torch.zeros_like(H)
61
+ den = torch.zeros((H.shape[0], 1), device=H.device)
62
+ if args.value_mode == "dual_next":
63
+ Yp = E[x]
64
+ Vpred = torch.cat([Yp[1:], Yp[-1:]], 0)
65
+ elif args.value_mode == "dual_ridge_next":
66
+ Vpred = H @ B
67
+ elif args.value_mode == "dual_ridge_delta":
68
+ Vpred = H @ B - H
69
+ else:
70
+ Vpred = None
71
+ for s in range(1, args.att_window + 1):
72
+ ks = shift(qk, within, s)
73
+ vs = shift(Vv, within, s)
74
+ w = torch.exp(((qk * ks).sum(1, keepdim=True) / args.temp).clamp(-30, 30))
75
+ w = torch.where((within >= s)[:, None], w, torch.zeros_like(w))
76
+ num += w * vs
77
+ if Vpred is not None:
78
+ num_pred += w * shift(Vpred, within, s)
79
+ den += w
80
+ ctx = num / (den + 1e-6)
81
+ pred_out = None
82
+ if Vpred is not None:
83
+ pred = num_pred / (den + 1e-6)
84
+ if args.orth_delta:
85
+ pred = pred - H * (pred * H).sum(1, keepdim=True)
86
+ if args.pred_norm:
87
+ pred = pred / (pred.norm(dim=1, keepdim=True) + 1e-6)
88
+ scale = args.pred_scale
89
+ if args.pred_schedule == "linear":
90
+ scale = scale * float(layer_idx + 1) / max(1, args.layers)
91
+ elif args.pred_schedule == "late":
92
+ scale = scale * max(0.0, float(layer_idx + 1 - args.layers // 3) / max(1, args.layers - args.layers // 3))
93
+ pred_out = pred
94
+ H = H + args.res_scale * ctx + scale * pred
95
+ elif "delta" in args.value_mode:
96
+ H = H + args.res_scale * ctx
97
+ else:
98
+ H = (1 - args.res_scale) * H + args.res_scale * ctx
99
+ return torch.nn.functional.normalize(H, dim=1), pred_out
100
+
101
+
102
+ def apply_stack(x, E, Ps, Bs, within, args, expose=True):
103
+ H = E[x]
104
+ phis = []
105
+ last_pred = None
106
+ for li, (P, B) in enumerate(zip(Ps, Bs)):
107
+ if expose:
108
+ q = H @ P
109
+ phis += [torch.relu(q), torch.abs(q), q * q]
110
+ m = min(64, q.shape[1] - 1)
111
+ if m > 0:
112
+ phis.append(q[:, :m] * q[:, 1:m+1])
113
+ H, last_pred = pred_layer(H, x, E, P, B, within, args, li)
114
+ if expose and args.pred_features and last_pred is not None:
115
+ phis += [last_pred, H * last_pred]
116
+ return H, phis
117
+
118
+
119
+ def features(H, within, phis, extra):
120
+ prev = shift(H, within, 1)
121
+ blocks = [H, prev, H * prev]
122
+ if extra:
123
+ prev2 = shift(H, within, 2)
124
+ blocks += [prev2, prev * prev2, H * prev2]
125
+ blocks += phis + [torch.ones((H.shape[0], 1), device=H.device)]
126
+ return torch.cat(blocks, 1)
127
+
128
+
129
+ def learn_map(tokens, E, Ps, Bs, eos, args, device):
130
+ xnp = tokens[:args.proj_tokens]
131
+ starts = np.flatnonzero(np.r_[True, xnp[:-1] == eos])
132
+ d = E.shape[1]
133
+ C = torch.zeros((d, d), device=device, dtype=torch.float64)
134
+ A = torch.zeros_like(C); G = torch.zeros_like(C)
135
+ for i in range(0, len(starts), args.chunk_docs):
136
+ lo = starts[i]; hi = starts[i+args.chunk_docs] if i+args.chunk_docs < len(starts) else len(xnp)
137
+ x = torch.tensor(xnp[lo:hi].astype(np.int64), device=device)
138
+ seg, within = doc_index(x, eos)
139
+ H, _ = apply_stack(x, E, Ps, Bs, within, args, expose=False)
140
+ Y = torch.cat([E[x][1:], E[x][-1:]], 0)
141
+ valid = torch.ones(len(x), device=device, dtype=torch.bool)
142
+ valid[-1] = False; valid[:-1] &= seg[1:].eq(seg[:-1]); valid &= x.ne(eos)
143
+ X = H[valid].double(); T = Y[valid].double()
144
+ C += X.T @ T; A += X.T @ X; G += X.T @ T
145
+ U, S, _ = torch.linalg.svd(C)
146
+ P = U[:, :args.r].float()
147
+ diag = torch.trace(A) / d
148
+ B = torch.linalg.solve(A + args.map_lam * diag * torch.eye(d, device=device, dtype=torch.float64), G).float()
149
+ log("sv", S[:4].detach().cpu().numpy().round(1))
150
+ return P, B
151
+
152
+
+def fit_eval(train, valid, E, Ps, Bs, eos, V, args, device):
+    def stats(tokens, max_tokens):
+        # Accumulate ridge sufficient statistics: A = Phi^T Phi (D x D), G = Phi^T onehot(y) (D x V).
+        xnp = tokens[:max_tokens]
+        starts = np.flatnonzero(np.r_[True, xnp[:-1] == eos])
+        A = G = None
+        for i in range(0, len(starts), args.chunk_docs):
+            lo = starts[i]; hi = starts[i+args.chunk_docs] if i+args.chunk_docs < len(starts) else len(xnp)
+            x = torch.tensor(xnp[lo:hi].astype(np.int64), device=device)
+            seg, within = doc_index(x, eos)
+            H, phis = apply_stack(x, E, Ps, Bs, within, args)
+            Phi = features(H, within, phis, args.extra_context)
+            y = torch.empty(len(x), device=device, dtype=torch.long); y[:-1] = x[1:]; y[-1] = eos
+            valid_m = torch.ones(len(x), device=device, dtype=torch.bool)
+            valid_m[-1] = False; valid_m[:-1] &= seg[1:].eq(seg[:-1]); valid_m &= x.ne(eos)
+            Phi = Phi[valid_m]; y = y[valid_m]
+            if A is None:
+                D = Phi.shape[1]
+                A = torch.zeros((D, D), device=device, dtype=torch.float64)
+                G = torch.zeros((D, V), device=device, dtype=torch.float64)
+            A += Phi.double().T @ Phi.double()
+            # Column y[n] of G receives feature row n, so the one-hot target matrix is never materialized.
+            G.index_add_(1, y, Phi.double().T)
+        return A, G, A.shape[0]
+    A, G, D = stats(train, args.fit_tokens)
+    # Add-one smoothed unigram distribution, used below as a probability floor.
+    uni = torch.tensor((np.bincount(train.astype(np.int64), minlength=V)+1), device=device).float()
+    uni = uni / uni.sum()
+    best = None
+    for lam in [float(x) for x in args.lams.split(",")]:
+        diag = torch.trace(A) / D
+        W = torch.linalg.solve(A + lam * diag * torch.eye(D, device=device, dtype=torch.float64), G).float()
+        nll = 0.0; n = 0
+        xnp = valid[:args.eval_tokens]
+        starts = np.flatnonzero(np.r_[True, xnp[:-1] == eos])
+        for i in range(0, len(starts), args.chunk_docs):
+            lo = starts[i]; hi = starts[i+args.chunk_docs] if i+args.chunk_docs < len(starts) else len(xnp)
+            x = torch.tensor(xnp[lo:hi].astype(np.int64), device=device)
+            seg, within = doc_index(x, eos)
+            H, phis = apply_stack(x, E, Ps, Bs, within, args)
+            Phi = features(H, within, phis, args.extra_context)
+            y = torch.empty(len(x), device=device, dtype=torch.long); y[:-1] = x[1:]; y[-1] = eos
+            valid_m = torch.ones(len(x), device=device, dtype=torch.bool)
+            valid_m[-1] = False; valid_m[:-1] &= seg[1:].eq(seg[:-1]); valid_m &= x.ne(eos)
+            Phi = Phi[valid_m]; y = y[valid_m]
+            S = Phi @ W
+            # Clip negative scores, add the unigram floor, and normalize each row to a distribution.
+            Pp = torch.relu(S) + args.floor * uni[None, :]
+            Pp = Pp / Pp.sum(1, keepdim=True)
+            nll += float(-torch.log(Pp[torch.arange(len(y), device=device), y] + 1e-12).sum())
+            n += len(y)
+        ppl = math.exp(nll / n)
+        log("lam", lam, "ppl", round(ppl, 2))
+        if best is None or ppl < best[1]: best = (lam, ppl)
+    log(f"TORCH_PRED mode={args.value_mode} ppl={best[1]:.2f} lam={best[0]} D={D}")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--data", default="/workspace/ts_mini")
+    ap.add_argument("--spm_model", default=None)
+    ap.add_argument("--train_bin", default=None)
+    ap.add_argument("--valid_bin", default=None)
+    ap.add_argument("--vocab", type=int, default=1024)
+    ap.add_argument("--d", type=int, default=192)
+    ap.add_argument("--r", type=int, default=64)
+    ap.add_argument("--layers", type=int, default=2)
+    ap.add_argument("--att_window", type=int, default=8)
+    ap.add_argument("--temp", type=float, default=0.3)
+    ap.add_argument("--window", type=int, default=8)
+    ap.add_argument("--extra_context", type=int, default=1)
+    ap.add_argument("--res_scale", type=float, default=0.12)
+    ap.add_argument("--pred_scale", type=float, default=0.04)
+    ap.add_argument("--pred_schedule", choices=["flat", "linear", "late"], default="flat")
+    ap.add_argument("--orth_delta", type=int, default=0)
+    ap.add_argument("--pred_norm", type=int, default=0)
+    ap.add_argument("--pred_features", type=int, default=0)
+    ap.add_argument("--map_lam", type=float, default=0.001)
+    ap.add_argument("--cooc_tokens", type=int, default=1_000_000)
+    ap.add_argument("--proj_tokens", type=int, default=500_000)
+    ap.add_argument("--fit_tokens", type=int, default=800_000)
+    ap.add_argument("--eval_tokens", type=int, default=100_000)
+    ap.add_argument("--chunk_docs", type=int, default=40)
+    ap.add_argument("--lams", default="0.003,0.01,0.03,0.1")
+    ap.add_argument("--floor", type=float, default=1e-4)
+    ap.add_argument("--value_mode", choices=["h","next","delta","ridge_next","ridge_delta",
+                                             "dual_next","dual_ridge_next","dual_ridge_delta"], default="h")
+    args = ap.parse_args()
+    import sentencepiece as spm
+    spm_model = args.spm_model or os.path.join(args.data, f"sp{args.vocab}.model")
+    train_bin = args.train_bin or os.path.join(args.data, "train.bin")
+    valid_bin = args.valid_bin or os.path.join(args.data, "valid.bin")
+    sp = spm.SentencePieceProcessor(model_file=spm_model)
+    eos = sp.eos_id(); V = sp.get_piece_size()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    train = np.fromfile(train_bin, dtype=np.uint16)
+    valid = np.fromfile(valid_bin, dtype=np.uint16)
+    log("mode", args.value_mode, "device", device, "V", V, "train", len(train), "valid", len(valid))
+    E = ppmi_embed(train, V, args.d, args.window, args.cooc_tokens, device)
+    Ps, Bs = [], []
+    for _ in range(args.layers):
+        P, B = learn_map(train, E, Ps, Bs, eos, args, device)
+        Ps.append(P); Bs.append(B)
+    fit_eval(train, valid, E, Ps, Bs, eos, V, args, device)
+
+
+if __name__ == "__main__":
+    main()
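The evaluation loop in `fit_eval` turns linear ridge scores into probabilities with `relu(S) + floor * uni`, normalizes each row, and reports perplexity as the exponential of the mean negative log-likelihood. The following is a minimal pure-Python sketch of that step on made-up toy numbers. The helper names `relu_floor_probs` and `perplexity` are hypothetical and do not appear in the script, which performs the same arithmetic on batched torch tensors.

```python
import math


def relu_floor_probs(scores, unigram, floor=1e-4):
    """Turn one row of raw linear scores into a probability distribution.

    Negative scores are clipped to zero, a small multiple of the unigram
    distribution is added so no token ends up with zero mass, and the row
    is renormalized to sum to 1.
    """
    raw = [max(s, 0.0) + floor * u for s, u in zip(scores, unigram)]
    total = sum(raw)
    return [p / total for p in raw]


def perplexity(score_rows, targets, unigram, floor=1e-4):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = 0.0
    for scores, y in zip(score_rows, targets):
        probs = relu_floor_probs(scores, unigram, floor)
        nll -= math.log(probs[y] + 1e-12)  # same epsilon guard as the script
    return math.exp(nll / len(targets))


# Toy example: 3-token vocabulary with a uniform unigram prior (illustrative values only).
unigram = [1 / 3, 1 / 3, 1 / 3]
rows = [[0.9, -0.2, 0.1], [-1.0, -1.0, -1.0]]
print(relu_floor_probs(rows[1], unigram))  # every score is clipped, so the row falls back to the unigram prior
print(perplexity(rows, [0, 2], unigram))
```

One consequence visible in the toy run: when every score in a row is negative, the prediction is exactly the unigram prior, so `--floor` determines how much probability a token with a clipped score can still receive.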
stream_ce_lam10.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a7addeb9f7d9815da4127edf10f82139f9ea54ee7047c2b21b10f40b5da4a108
+size 570493669
stream_ce_lam10_long.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5ee6c3d43c0bd51d4696b15b2706e644d81535a14b1aeefd7dcc1c2b8e7f9709
+size 570493773
tokenizer/glm16k.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:67468b3552cc095e8c25e2bae56dbf4b192ce340d704763129becaf184cf60e5
+size 366707
tokenizer/glm16k.vocab ADDED
The diff for this file is too large to render. See raw diff