Update README.md

Auto-save and push checkpoints to HuggingFace Hub.
Compatible with any LLM, including large instruction-tuned models like Qwen/Qwen3.

## Installation

```bash
pip install datasets torch transformers huggingface_hub
```

## How It Works

- **Streaming Dataset**: loads large corpora (e.g., WikiText-103) for online training.
- **Policy Network**: a Transformer-based RL policy predicts segment lengths for byte sequences.
- **Embedder**: converts segments into embedding vectors combining local bytes and context.
- **Context Memory**: maintains a rolling memory of embeddings to inform segmentation of new text.
- **Reward Function**: encourages embedding diversity and segment coherence while penalizing too many segments (see the sketch after this list).
- **Training Loop**: uses RL to optimize the segmentation policy, auto-saves checkpoints, and can push them to the HuggingFace Hub.
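
The reward itself ships with the repo; as a rough illustration of the idea, here is a minimal reward sketch. The `length_penalty` weight and the `(seg_bytes, seg_emb)` pair layout are assumptions, not the project's exact implementation:

```python
import torch
import torch.nn.functional as F

def reward_sketch(segs, length_penalty=0.01):
    """Toy reward: embedding diversity minus a per-segment penalty.

    `segs` is assumed to be a list of (seg_bytes, seg_emb) pairs as
    returned by sample_segmentation; the weights are illustrative.
    """
    if len(segs) < 2:
        return torch.tensor(0.0)
    embs = torch.stack([emb.flatten() for _, emb in segs])  # (S, D)
    embs = F.normalize(embs, dim=-1)
    sims = embs @ embs.T                                    # pairwise cosine similarity
    off_diag = sims - torch.eye(len(segs))                  # zero out self-similarity
    diversity = 1.0 - off_diag.abs().mean()                 # high when segments differ
    return diversity - length_penalty * len(segs)           # penalize over-segmentation
```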

## Quick Start

```python
import torch

from drt import DCTokenizer  # assumes drt.py contains your DCTokenizer code

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the dynamic tokenizer
dct = DCTokenizer()

text = "The quick brown fox jumps over the lazy dog."
segs, logps = dct.sample_segmentation(text)

# Display the segmented byte chunks
for seg_bytes, seg_emb in segs:
    print(seg_bytes.tolist())
```

## Integrating with LLMs

You can feed the segments into any HuggingFace model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Example: Qwen3-Next-80B-A3B-Instruct (any causal LM works; a model this
# large needs multiple GPUs or quantization in practice)
model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Convert the byte segments back to strings
text_segments = [bytes(seg_bytes.tolist()).decode("utf-8", errors="ignore") for seg_bytes, _ in segs]
segmented_text = " ".join(text_segments)

# Tokenize for the LLM and generate
inputs = tokenizer(segmented_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

⚡ Tip: You can also embed the segments and use them for retrieval-augmented generation (RAG), semantic search, or RL-based language modeling.
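
For instance, a minimal semantic-search sketch over segment embeddings (the helper below is hypothetical and assumes every `seg_emb` has the same dimensionality):

```python
import torch
import torch.nn.functional as F

def top_k_segments(query_emb, segs, k=3):
    """Rank stored segments by cosine similarity to a query embedding."""
    embs = F.normalize(torch.stack([emb.flatten() for _, emb in segs]), dim=-1)
    q = F.normalize(query_emb.flatten(), dim=-1)
    scores = embs @ q                                  # (S,) similarity scores
    top = torch.topk(scores, k=min(k, len(segs)))
    return [(segs[i][0], scores[i].item()) for i in top.indices.tolist()]
```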

## Training

The included RL loop trains the policy to segment text optimally. It is REINFORCE-style: the scalar reward scales the summed log-probabilities of the sampled segmentation:

```python
import random

EPOCHS = 50
BATCH_SIZE = 8

for ep in range(EPOCHS):
    batch = random.sample(docs, BATCH_SIZE)        # `docs`: your corpus (see Notes)
    for doc in batch:
        segs, logps = dct.sample_segmentation(doc[:512])
        R = reward(segs)                           # scalar reward for this segmentation
        logp_sum = torch.stack(logps).sum()
        loss = -R * logp_sum                       # REINFORCE: maximize expected reward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Checkpoints are auto-saved locally (`dct_hf_tokenizer`) and can be auto-pushed to your HuggingFace repository.
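
As a rough sketch of the push step with `huggingface_hub` (the repo id is a placeholder; run `huggingface-cli login` first):

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/dct-tokenizer", exist_ok=True)  # placeholder repo id
api.upload_folder(
    folder_path="dct_hf_tokenizer",      # local checkpoint directory
    repo_id="your-username/dct-tokenizer",
)
```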

## Saving & Loading Tokenizer

```python
torch.save(dct.policy.state_dict(), "policy.pt")
torch.save(dct.embedder.state_dict(), "embedder.pt")

# Reload
dct.policy.load_state_dict(torch.load("policy.pt", map_location=device))
dct.embedder.load_state_dict(torch.load("embedder.pt", map_location=device))
```

## Notes

- Works best with UTF-8 encoded text.
- Can be used as a dynamic preprocessing layer before any LLM; particularly useful for very large instruction-tuned models.
- Designed for streaming datasets: no need to load the full corpus into memory (see the sketch below).
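
For example, a streaming setup with the `datasets` library (WikiText-103 is the corpus mentioned above; the config name is an assumption), which also yields the `docs` list used by the training loop:

```python
from datasets import load_dataset
from itertools import islice

# Stream WikiText-103 lazily; the full corpus is never loaded into memory
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
docs = [ex["text"] for ex in islice(stream, 1000) if ex["text"].strip()]
```
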
This setup allows you to experiment with context-aware tokenization and integrate seamlessly with any HuggingFace model, from GPT-2 to Qwen3-Next-80B-A3B-Instruct.