Update README.md

Auto-save and push checkpoints to HuggingFace Hub.
Compatible with any LLM, including large instruction-tuned models like Qwen/Qwen3.

## Installation

```bash
pip install datasets torch transformers huggingface_hub
```

## How It Works

- **Streaming Dataset**: loads large corpora (e.g., WikiText-103) for online training.
- **Policy Network**: a Transformer-based RL policy predicts segment lengths for byte sequences.
- **Embedder**: converts segments into embedding vectors combining local bytes and context.
- **Context Memory**: maintains a rolling memory of embeddings to inform segmentation of new text.
- **Reward Function**: encourages embedding diversity and segment coherence while penalizing too many segments (see the sketch after this list).
- **Training Loop**: uses RL to optimize the segmentation policy, auto-saves checkpoints, and can push them to the HuggingFace Hub.
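
The reward itself ships with the repo; as a rough illustration of the idea, here is a minimal reward sketch. The `length_penalty` weight and the `(seg_bytes, seg_emb)` pair layout are assumptions, not the project's exact implementation:

```python
import torch
import torch.nn.functional as F

def reward_sketch(segs, length_penalty=0.01):
    """Toy reward: embedding diversity minus a per-segment penalty.

    `segs` is assumed to be a list of (seg_bytes, seg_emb) pairs as
    returned by sample_segmentation; the weights are illustrative.
    """
    if len(segs) < 2:
        return torch.tensor(0.0)
    embs = torch.stack([emb.flatten() for _, emb in segs])  # (S, D)
    embs = F.normalize(embs, dim=-1)
    sims = embs @ embs.T                                    # pairwise cosine similarity
    off_diag = sims - torch.eye(len(segs))                  # zero out self-similarity
    diversity = 1.0 - off_diag.abs().mean()                 # high when segments differ
    return diversity - length_penalty * len(segs)           # penalize over-segmentation
```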

## Quick Start

```python
import torch

from drt import DCTokenizer  # assumes drt.py contains your DCTokenizer code

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the dynamic tokenizer
dct = DCTokenizer()

text = "The quick brown fox jumps over the lazy dog."
segs, logps = dct.sample_segmentation(text)

# Display the segmented byte chunks
for seg_bytes, seg_emb in segs:
    print(seg_bytes.tolist())
```

## Integrating with LLMs

You can feed the segments into any HuggingFace model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Example: Qwen3-Next-80B-A3B-Instruct (any causal LM works; a model this
# large needs multiple GPUs or quantization in practice)
model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Convert the byte segments back to strings
text_segments = [bytes(seg_bytes.tolist()).decode("utf-8", errors="ignore") for seg_bytes, _ in segs]
segmented_text = " ".join(text_segments)

# Tokenize for the LLM and generate
inputs = tokenizer(segmented_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

⚡ Tip: You can also embed the segments and use them for retrieval-augmented generation (RAG), semantic search, or RL-based language modeling.
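
For instance, a minimal semantic-search sketch over segment embeddings (the helper below is hypothetical and assumes every `seg_emb` has the same dimensionality):

```python
import torch
import torch.nn.functional as F

def top_k_segments(query_emb, segs, k=3):
    """Rank stored segments by cosine similarity to a query embedding."""
    embs = F.normalize(torch.stack([emb.flatten() for _, emb in segs]), dim=-1)
    q = F.normalize(query_emb.flatten(), dim=-1)
    scores = embs @ q                                  # (S,) similarity scores
    top = torch.topk(scores, k=min(k, len(segs)))
    return [(segs[i][0], scores[i].item()) for i in top.indices.tolist()]
```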

## Training

The included RL loop trains the policy to segment text optimally. It is REINFORCE-style: the scalar reward scales the summed log-probabilities of the sampled segmentation:

```python
import random

EPOCHS = 50
BATCH_SIZE = 8

for ep in range(EPOCHS):
    batch = random.sample(docs, BATCH_SIZE)        # `docs`: your corpus (see Notes)
    for doc in batch:
        segs, logps = dct.sample_segmentation(doc[:512])
        R = reward(segs)                           # scalar reward for this segmentation
        logp_sum = torch.stack(logps).sum()
        loss = -R * logp_sum                       # REINFORCE: maximize expected reward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Checkpoints are auto-saved locally (`dct_hf_tokenizer`) and can be auto-pushed to your HuggingFace repository.
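
As a rough sketch of the push step with `huggingface_hub` (the repo id is a placeholder; run `huggingface-cli login` first):

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/dct-tokenizer", exist_ok=True)  # placeholder repo id
api.upload_folder(
    folder_path="dct_hf_tokenizer",      # local checkpoint directory
    repo_id="your-username/dct-tokenizer",
)
```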

## Saving & Loading Tokenizer

```python
torch.save(dct.policy.state_dict(), "policy.pt")
torch.save(dct.embedder.state_dict(), "embedder.pt")

# Reload
dct.policy.load_state_dict(torch.load("policy.pt", map_location=device))
dct.embedder.load_state_dict(torch.load("embedder.pt", map_location=device))
```

## Notes

- Works best with UTF-8 encoded text.
- Can be used as a dynamic preprocessing layer before any LLM; particularly useful for very large instruction-tuned models.
- Designed for streaming datasets: no need to load the full corpus into memory (see the sketch below).
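
For example, a streaming setup with the `datasets` library (WikiText-103 is the corpus mentioned above; the config name is an assumption), which also yields the `docs` list used by the training loop:

```python
from datasets import load_dataset
from itertools import islice

# Stream WikiText-103 lazily; the full corpus is never loaded into memory
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
docs = [ex["text"] for ex in islice(stream, 1000) if ex["text"].strip()]
```
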
This setup allows you to experiment with context-aware tokenization and integrate seamlessly with any HuggingFace model, from GPT-2 to Qwen3-Next-80B-A3B-Instruct.