V2L (Vector-to-Language) – Beta
V2L is a two-stage architecture for mapping semantic embeddings into natural language. It consists of:
- Semantic Mapper – aligns source embeddings with the target embedding space.
- Embedding Decoder – generates text from target-space embeddings.
This repo provides example .pt files and training scripts to explore the architecture.
All checkpoints included here are beta examples: lightweight, but functional.
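The two stages above can be sketched as small PyTorch modules. This is a minimal illustration, not the repo's actual code: the class names, layer sizes, and 384-dim embeddings are assumptions.

```python
import torch
import torch.nn as nn

class SemanticMapper(nn.Module):
    """Projects source embeddings into the target embedding space (sizes are placeholders)."""
    def __init__(self, src_dim=384, tgt_dim=384, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, tgt_dim),
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingDecoder(nn.Module):
    """GRU text generator conditioned on a target-space embedding (layout is hypothetical)."""
    def __init__(self, emb_dim=384, vocab_size=10000, hidden=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(emb_dim, hidden)  # embedding -> initial GRU state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, cond_emb, token_ids):
        h0 = torch.tanh(self.init_proj(cond_emb)).unsqueeze(0)  # (1, B, H)
        x = self.token_emb(token_ids)                           # (B, T, H)
        out, _ = self.gru(x, h0)
        return self.out(out)                                    # (B, T, vocab)
```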
Project Structure
- main.py – trains the Semantic Mapper (x-emb → y-emb) with cosine embedding loss.
- decoder.py – trains the Embedding Decoder (y-emb → target text) using strict teacher forcing.
- formatter.py – formats raw chat JSON into (source, target) pairs.
- pre_embed.py – generates sentence embeddings and saves them to chat_embeddings.pt.
- test_chat.py – runs an interactive CLI chatbot with the trained mapper + decoder.
- test_embed.py – quick embedding throughput benchmark.
- test_full.py – end-to-end script that trains both the mapper and the decoder in a single run.
- testcuda.py – checks CUDA device availability.
Workflow
1. Format dataset: python formatter.py – produces chat_1turn.csv.
2. Precompute embeddings: python pre_embed.py – produces chat_embeddings.pt.
3. Train mapper: python main.py – produces semantic_mapper.pth.
4. Train decoder: python decoder.py – produces embedding_decoder.pth.
5. Chat interactively: python test_chat.py
Alternatively, run test_full.py to train both mapper and decoder in one script and then test generation end-to-end.
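The "strict teacher forcing" used in the decoder-training step means the GRU always sees the gold previous token and is trained to predict the next one. A minimal sketch of one such step (all sizes are placeholders, and the conditioning embedding is omitted for brevity):

```python
import torch
import torch.nn as nn

vocab, hidden, B, T = 100, 32, 4, 6
token_emb = nn.Embedding(vocab, hidden)
gru = nn.GRU(hidden, hidden, batch_first=True)
out = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (B, T))          # gold token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one: predict the next token

h, _ = gru(token_emb(inputs))
logits = out(h)                                   # (B, T-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```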
Checkpoints
- semantic_mapper.pth – maps source embeddings into the target embedding space.
- embedding_decoder.pth – GRU-based text generator conditioned on embeddings.
- chat_embeddings.pt – a pre-embedded dict of inputs and outputs.
All checkpoint files (.pt / .pth) in this repo are examples. They do work, but are not trained to full convergence.
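The checkpoints are standard PyTorch state dicts, so restoring one follows the usual save/load pattern. A minimal round-trip sketch (the layer below is a placeholder, and an in-memory buffer stands in for a .pth file on disk):

```python
import io
import torch
import torch.nn as nn

mapper = nn.Linear(384, 384)  # stand-in module; not the repo's real architecture

# In the repo this would be: torch.save(mapper.state_dict(), "semantic_mapper.pth")
buf = io.BytesIO()
torch.save(mapper.state_dict(), buf)

# ...and later: mapper.load_state_dict(torch.load("semantic_mapper.pth"))
buf.seek(0)
restored = nn.Linear(384, 384)
restored.load_state_dict(torch.load(buf))
```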
Dataset Note
Due to technical constraints, the dataset is not included in this repo. All scripts assume data derived from lmsys/lmsys-chat-1m, but filtered to English-only and cleaned of formatting artifacts.
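If you rebuild the dataset yourself, the English-only filtering and artifact cleaning could look something like the sketch below. Both helpers are illustrative stand-ins (an ASCII-ratio heuristic instead of a real language detector; a simple regex instead of the repo's actual cleaning rules):

```python
import re

def clean_text(s: str) -> str:
    """Strip simple formatting artifacts (markdown emphasis, extra whitespace)."""
    s = re.sub(r"[*_`#>]+", "", s)
    return re.sub(r"\s+", " ", s).strip()

def is_probably_english(s: str, threshold: float = 0.9) -> bool:
    """Crude ASCII-ratio heuristic standing in for proper language detection."""
    if not s:
        return False
    ascii_chars = sum(1 for c in s if ord(c) < 128)
    return ascii_chars / len(s) >= threshold

pairs = [("**Hello**  world", "Hi there!"), ("こんにちは", "やあ")]
kept = [(clean_text(src), clean_text(tgt)) for src, tgt in pairs
        if is_probably_english(src) and is_probably_english(tgt)]
# kept -> [('Hello world', 'Hi there!')]
```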
Example
```python
from test_chat import chat

chat()  # starts the interactive loop
```
Sample run:
User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.
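Under the hood, generation composes the two stages: embed the user input, map it into the target space, then decode greedily token by token. A toy sketch of that inference path (all module names, sizes, and the BOS/EOS ids are assumptions, not the repo's real components):

```python
import torch
import torch.nn as nn

vocab, hidden, emb_dim = 50, 16, 8
mapper = nn.Linear(emb_dim, emb_dim)        # stand-in Semantic Mapper
token_emb = nn.Embedding(vocab, hidden)
init_proj = nn.Linear(emb_dim, hidden)      # mapped embedding -> initial GRU state
gru = nn.GRUCell(hidden, hidden)
out = nn.Linear(hidden, vocab)

BOS, EOS, MAX_LEN = 0, 1, 10

@torch.no_grad()
def greedy_decode(src_emb):
    h = torch.tanh(init_proj(mapper(src_emb)))  # condition the hidden state
    tok = torch.tensor([BOS])
    ids = []
    for _ in range(MAX_LEN):
        h = gru(token_emb(tok), h)
        tok = out(h).argmax(dim=-1)             # greedy: pick the most likely token
        if tok.item() == EOS:
            break
        ids.append(tok.item())
    return ids

ids = greedy_decode(torch.randn(1, emb_dim))
```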