---
license: gpl-3.0
datasets:
- lmsys/lmsys-chat-1m
language:
- en
---

# V2L (Vector-to-Language) — Beta

V2L is a two-stage architecture for mapping semantic embeddings into natural language.
It consists of:

1. **Semantic Mapper** — aligns source embeddings with the target embedding space.
2. **Embedding Decoder** — generates text from target-space embeddings.

This repo provides **example `.pt` files** and training scripts to explore the architecture.
All checkpoints included here are *beta examples* — lightweight, but functional.
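
The two stages can be sketched as small PyTorch modules. Everything below (class names, dimensions, hidden sizes) is illustrative, not the repo's actual definitions in `main.py` and `decoder.py`:

```python
import torch
import torch.nn as nn

class SemanticMapper(nn.Module):
    """Stage 1: maps a source embedding into the target embedding space."""
    def __init__(self, src_dim=384, tgt_dim=384, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.ReLU(), nn.Linear(hidden, tgt_dim)
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingDecoder(nn.Module):
    """Stage 2: GRU text generator conditioned on a target-space embedding."""
    def __init__(self, emb_dim=384, vocab=8000, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden)  # embedding -> initial GRU state
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, cond_emb, tokens):
        h0 = self.init_h(cond_emb).unsqueeze(0)   # (1, B, hidden)
        x = self.tok_emb(tokens)                  # (B, T, hidden)
        y, _ = self.gru(x, h0)
        return self.out(y)                        # (B, T, vocab)

mapper = SemanticMapper()
decoder = EmbeddingDecoder()
logits = decoder(mapper(torch.randn(2, 384)), torch.zeros(2, 5, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 5, 8000])
```

The key design point is that the decoder never sees the source embedding directly: it only conditions on the mapped, target-space vector.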

---

## Project Structure

* **`main.py`** — trains the **Semantic Mapper** (x-emb → y-emb) with cosine embedding loss.
* **`decoder.py`** — trains the **Embedding Decoder** (y-emb → target text) using strict teacher forcing.
* **`formatter.py`** — formats raw chat JSON into `(source, target)` pairs.
* **`pre_embed.py`** — generates sentence embeddings and saves them to `chat_embeddings.pt`.
* **`test_chat.py`** — runs an interactive CLI chatbot with the trained mapper + decoder.
* **`test_embed.py`** — quick embedding throughput benchmark.
* **`test_full.py`** — **end-to-end script that trains both the mapper and the decoder in a single run**.
* **`testcuda.py`** — checks CUDA device availability.
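
The `(source, target)` extraction in `formatter.py` boils down to pulling the first user/assistant exchange from each conversation and writing it out as CSV. A minimal sketch with a hypothetical helper (`first_turn_pair` is not the repo's function name):

```python
import csv

def first_turn_pair(conversation):
    """Return (source, target) from the first user/assistant exchange."""
    user = next(m["content"] for m in conversation if m["role"] == "user")
    bot = next(m["content"] for m in conversation if m["role"] == "assistant")
    return user, bot

conv = [
    {"role": "user", "content": "What does GPT stand for?"},
    {"role": "assistant", "content": "Generative Pre-trained Transformer."},
]
src, tgt = first_turn_pair(conv)

# One row per conversation, mirroring the chat_1turn.csv layout.
with open("chat_1turn.csv", "w", newline="") as f:
    csv.writer(f).writerow([src, tgt])
```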

---

## Workflow

1. **Format dataset**

   ```bash
   python formatter.py
   ```

   Produces `chat_1turn.csv`.

2. **Precompute embeddings**

   ```bash
   python pre_embed.py
   ```

   Produces `chat_embeddings.pt`.

3. **Train mapper**

   ```bash
   python main.py
   ```

   Produces `semantic_mapper.pth`.

4. **Train decoder**

   ```bash
   python decoder.py
   ```

   Produces `embedding_decoder.pth`.

5. **Chat interactively**

   ```bash
   python test_chat.py
   ```

➡️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.
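
For orientation, the mapper objective in step 3 reduces to a single cosine-embedding-loss training step like this (a minimal sketch with made-up dimensions, standing in for `main.py`):

```python
import torch
import torch.nn as nn

# Stand-in mapper; real dims come from the sentence-embedding model in pre_embed.py.
mapper = nn.Linear(384, 384)
loss_fn = nn.CosineEmbeddingLoss()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)

x_emb = torch.randn(32, 384)   # source (input) embeddings
y_emb = torch.randn(32, 384)   # target (response) embeddings
labels = torch.ones(32)        # +1: pull each (x, y) pair together

loss = loss_fn(mapper(x_emb), y_emb, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With all labels at +1, the loss is simply `1 - cos(mapper(x), y)` averaged over the batch, which is what pushes mapped source embeddings toward their paired targets.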

---

## Checkpoints

* **`semantic_mapper.pth`** — maps source embeddings into the target embedding space.
* **`embedding_decoder.pth`** — GRU-based text generator conditioned on embeddings.
* **`chat_embeddings.pt`** — a dictionary of precomputed input and output embeddings.

All checkpoint files in this repo are lightweight examples: they work, but are not trained to full convergence.
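
Loading the checkpoints follows the usual PyTorch `state_dict` pattern. A self-contained round trip with a stand-in module (the real classes live in `main.py` and `decoder.py`, and the `state_dict` keys must match whichever class you instantiate):

```python
import torch
import torch.nn as nn

# Stand-in for the SemanticMapper class.
mapper = nn.Linear(384, 384)
torch.save(mapper.state_dict(), "semantic_mapper_demo.pth")

# Restore into a freshly constructed module of the same shape.
restored = nn.Linear(384, 384)
restored.load_state_dict(torch.load("semantic_mapper_demo.pth", map_location="cpu"))
print(torch.equal(restored.weight, mapper.weight))  # True
```

`map_location="cpu"` lets the example checkpoints load on machines without CUDA.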

---

## Dataset Note

Due to technical constraints, the dataset is **not included** in this repo.
All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, but filtered to **English-only** and cleaned of formatting artifacts.
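
The English-only filter amounts to keeping rows whose `language` field equals `"English"`. Since the dataset must be obtained separately, this sketch runs on stand-in records that only mimic its schema:

```python
# Stand-in rows mimicking lmsys-chat-1m's "language" and "conversation" columns.
rows = [
    {"language": "English", "conversation": [{"role": "user", "content": "hi"}]},
    {"language": "Portuguese", "conversation": [{"role": "user", "content": "oi"}]},
]
english_only = [r for r in rows if r["language"] == "English"]
print(len(english_only))  # 1
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` on the downloaded split.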

---

## Example

```python
from test_chat import chat
chat()   # Starts interactive loop
```

Sample run:

```
User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.
```

---