SofiTesfay2010 committed on
Commit 9dc4539 · verified · 1 Parent(s): c80ff36

Update README.md

Files changed (1)
  1. README.md +1 -97
README.md CHANGED
@@ -24,100 +24,4 @@ Auto-save and push checkpoints to HuggingFace Hub.

 Compatible with any LLM, including large instruction-tuned models like Qwen/Qwen3.

- Installation
- pip install datasets torch transformers huggingface_hub
-
- How It Works
-
- Streaming Dataset: Loads large corpora (e.g., WikiText-103) for online training.
-
- Policy Network: A Transformer-based RL policy predicts segment lengths for byte sequences.
-
- Embedder: Converts segments into embedding vectors combining local bytes and context.
-
- Context Memory: Maintains a rolling memory of embeddings to inform segmentation of new text.
-
- Reward Function: Encourages embedding diversity and segment coherence while penalizing too many segments.
-
- Training Loop: Uses RL to optimize segment policy, auto-saves checkpoints, and can push to HuggingFace Hub.
-
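The reward itself is not spelled out in this README; as a rough sketch of the idea described above (the exact shape and the length_penalty value are assumptions, not the repository's implementation), a diversity-minus-length reward over the (bytes, embedding) pairs returned by sample_segmentation could look like:

import torch
import torch.nn.functional as F

def reward(segs, length_penalty=0.01):
    # segs: list of (seg_bytes, seg_emb) pairs from dct.sample_segmentation
    if len(segs) < 2:
        return torch.tensor(0.0)
    embs = F.normalize(torch.stack([emb.detach() for _, emb in segs]), dim=-1)
    sim = embs @ embs.T                          # pairwise cosine similarity
    n = len(segs)
    mean_sim = (sim.sum() - n) / (n * (n - 1))   # drop the diagonal of ones
    diversity = 1.0 - mean_sim                   # higher when embeddings differ
    return diversity - length_penalty * n        # penalize over-segmentation

Detaching the embeddings keeps the reward a plain scalar for the REINFORCE-style update used in the training loop further down.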
- Quick Start
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- from drt import DCTokenizer  # assume drt.py contains your DCTokenizer code
-
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- # Load dynamic tokenizer
- dct = DCTokenizer()
-
- text = "The quick brown fox jumps over the lazy dog."
- segs, logps = dct.sample_segmentation(text)
-
- # Display segmented byte chunks
- for seg_bytes, seg_emb in segs:
-     print(seg_bytes.tolist())
-
- Integrating with LLMs
-
- You can feed the segments into any HuggingFace model:
-
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- # Example: Qwen3-Next-80B-A3B-Instruct
- model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
-
- # Convert byte segments back to strings
- text_segments = [bytes(seg_bytes.tolist()).decode("utf-8", errors="ignore") for seg_bytes, _ in segs]
- segmented_text = " ".join(text_segments)
-
- # Tokenize for LLM
- inputs = tokenizer(segmented_text, return_tensors="pt").to(device)
- outputs = model.generate(**inputs, max_new_tokens=50)
-
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-
- ⚡ Tip: You can also embed the segments and use them for retrieval-augmented generation (RAG), semantic search, or RL-based language modeling.
-
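As an illustration of the semantic-search use mentioned in the tip above (mean-pooling and cosine scoring are illustrative choices here, not part of the library):

import torch
import torch.nn.functional as F

def embed(text):
    # Mean-pool the segment embeddings produced by the dynamic tokenizer.
    segs, _ = dct.sample_segmentation(text)
    return F.normalize(torch.stack([emb for _, emb in segs]).mean(dim=0), dim=-1)

corpus = ["Foxes are small omnivorous mammals.", "Transformers process token sequences."]
corpus_embs = torch.stack([embed(doc) for doc in corpus])

query_emb = embed("What do foxes eat?")
scores = corpus_embs @ query_emb        # cosine similarity (unit-norm vectors)
print(corpus[int(scores.argmax())])     # most similar document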
- Training
-
- The included RL loop trains the policy to segment text optimally:
-
- EPOCHS = 50
- BATCH_SIZE = 8
-
- for ep in range(EPOCHS):
-     batch = random.sample(docs, BATCH_SIZE)
-     for doc in batch:
-         segs, logps = dct.sample_segmentation(doc[:512])
-         R = reward(segs)
-         logp_sum = torch.stack(logps).sum()
-         loss = -R * logp_sum
-         loss.backward()
-         optimizer.step()
-         optimizer.zero_grad()
-
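The loop above assumes that docs, reward, and optimizer already exist; a minimal setup (the dataset config, sample count, and learning rate are assumptions) could be:

import random
import torch
from datasets import load_dataset

# Stream the corpus so it never has to fit fully in memory.
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
docs = [ex["text"] for _, ex in zip(range(1000), stream) if ex["text"].strip()]

# Optimize the segmentation policy and the embedder jointly.
optimizer = torch.optim.Adam(
    list(dct.policy.parameters()) + list(dct.embedder.parameters()), lr=1e-4
)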
- Checkpoints are auto-saved locally (dct_hf_tokenizer) and can be auto-pushed to your HuggingFace repository.
-
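Done by hand, pushing that checkpoint folder to the Hub with huggingface_hub looks roughly like this (the repo id is a placeholder):

from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
api.create_repo("your-username/dct-tokenizer", exist_ok=True)
api.upload_folder(
    folder_path="dct_hf_tokenizer",
    repo_id="your-username/dct-tokenizer",
    commit_message="DCTokenizer checkpoint: policy + embedder",
)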
- Saving & Loading Tokenizer
- torch.save(dct.policy.state_dict(), "policy.pt")
- torch.save(dct.embedder.state_dict(), "embedder.pt")
-
- # Reload
- dct.policy.load_state_dict(torch.load("policy.pt", map_location=device))
- dct.embedder.load_state_dict(torch.load("embedder.pt", map_location=device))
-
- Notes
-
- Works best with UTF-8 encoded text.
-
- Can be used as a dynamic preprocessing layer before any LLM, particularly useful for very large instruction-tuned models.
-
- Designed for streaming datasets: no need to load the full corpus in memory.
-
- This setup allows you to experiment with context-aware tokenization and integrate seamlessly with any HuggingFace model, from GPT-2 to Qwen3-Next-80B-A3B-Instruct.
+ Can be used as a dynamic preprocessing layer before any LLM, particularly useful for very large instruction-tuned models.