---
license: gpl-3.0
datasets:
- lmsys/lmsys-chat-1m
language:
- en
---

# V2L (Vector-to-Language) — Beta

V2L is a two-stage architecture for mapping semantic embeddings into natural language. It consists of:

1. **Semantic Mapper** — aligns source embeddings with the target embedding space.
2. **Embedding Decoder** — generates text from target-space embeddings.

This repo provides **example `.pt` files** and training scripts to explore the architecture. All checkpoints included here are *beta examples* — lightweight, but functional.

---

## Project Structure

* **`main.py`** — trains the **Semantic Mapper** (x-emb → y-emb) with cosine embedding loss.
* **`decoder.py`** — trains the **Embedding Decoder** (y-emb → target text) using strict teacher forcing.
* **`formatter.py`** — formats raw chat JSON into `(source, target)` pairs.
* **`pre_embed.py`** — generates sentence embeddings and saves them to `chat_embeddings.pt`.
* **`test_chat.py`** — runs an interactive CLI chatbot with the trained mapper + decoder.
* **`test_embed.py`** — quick embedding throughput benchmark.
* **`test_full.py`** — **end-to-end script that trains both the mapper and the decoder in a single run**.
* **`testcuda.py`** — checks CUDA device availability.

---

## Workflow

1. **Format dataset**

   ```bash
   python formatter.py
   ```

   Produces `chat_1turn.csv`.

2. **Precompute embeddings**

   ```bash
   python pre_embed.py
   ```

   Produces `chat_embeddings.pt`.

3. **Train mapper**

   ```bash
   python main.py
   ```

   Produces `semantic_mapper.pth`.

4. **Train decoder**

   ```bash
   python decoder.py
   ```

   Produces `embedding_decoder.pth`.

5. **Chat interactively**

   ```bash
   python test_chat.py
   ```

➡️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.

---

## Checkpoints

* **`semantic_mapper.pth`** — maps source embeddings into target embedding space.
* **`embedding_decoder.pth`** — GRU-based text generator conditioned on embeddings.
* **`chat_embeddings.pt`** — a pre-embedded dict of input and output embeddings.

All `.pt`/`.pth` files in this repo are example checkpoints. They work, but are not trained to full convergence.

---

## Dataset Note

Due to technical constraints, the dataset is **not included** in this repo. All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, filtered to **English-only** and cleaned of formatting artifacts.

---

## Example

```python
from test_chat import chat

chat()  # Starts interactive loop
```

Sample run:

```
User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.
```

---
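## Architecture Sketch

For orientation, the two stages described above can be sketched in PyTorch. This is a *minimal illustrative sketch*, not the repo's actual module definitions — layer sizes, hidden widths, and class names are assumptions; see `main.py` and `decoder.py` for the real implementations.

```python
# Hypothetical sketch of the two-stage V2L pipeline.
# All dimensions (384-d embeddings, 256 hidden units, 1000-token vocab)
# are illustrative assumptions, not values taken from this repo.
import torch
import torch.nn as nn


class SemanticMapper(nn.Module):
    """Stage 1: maps source embeddings (x-emb) into target embedding space (y-emb)."""

    def __init__(self, dim=384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim),
        )

    def forward(self, x):
        return self.net(x)


class EmbeddingDecoder(nn.Module):
    """Stage 2: GRU text generator conditioned on a target-space embedding."""

    def __init__(self, vocab_size=1000, emb_dim=384, hidden=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(emb_dim, hidden)  # condition the GRU state on y-emb
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, y_emb, token_ids):
        # Teacher forcing: the ground-truth target tokens are fed as inputs.
        h0 = torch.tanh(self.init_h(y_emb)).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.gru(self.token_emb(token_ids), h0)
        return self.out(out)  # logits over the vocabulary


# Wiring check: the mapper's output feeds the decoder's conditioning input.
mapper, decoder = SemanticMapper(), EmbeddingDecoder()
x_emb = torch.randn(4, 384)               # batch of source embeddings
tokens = torch.randint(0, 1000, (4, 12))  # teacher-forced target token ids
logits = decoder(mapper(x_emb), tokens)
print(logits.shape)  # torch.Size([4, 12, 1000])

# The mapper's training objective (per main.py's description) is a cosine
# embedding loss, pulling mapped vectors toward their paired y-embeddings:
y_emb_true = torch.randn(4, 384)
loss = nn.CosineEmbeddingLoss()(mapper(x_emb), y_emb_true, torch.ones(4))
```

Conditioning the decoder by initializing the GRU hidden state from the embedding is one common design choice; the actual repo may instead concatenate the embedding to each input step.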