---
license: gpl-3.0
datasets:
- lmsys/lmsys-chat-1m
language:
- en
---
# V2L (Vector-to-Language) – Beta
V2L is a two-stage architecture for mapping semantic embeddings into natural language.
It consists of:
1. **Semantic Mapper** – aligns source embeddings with the target embedding space.
2. **Embedding Decoder** – generates text from target-space embeddings.
This repo provides **example `.pt` files** and training scripts to explore the architecture.
All checkpoints included here are *beta examples* – lightweight, but functional.
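To make the two stages concrete, here is a minimal PyTorch sketch of the pipeline. The layer sizes, vocabulary size, and internal structure are placeholders, not the ones used in `main.py`/`decoder.py`:

```python
import torch
import torch.nn as nn

class SemanticMapper(nn.Module):
    """Stage 1: align a source embedding with the target embedding space."""
    def __init__(self, src_dim=384, tgt_dim=384, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.ReLU(), nn.Linear(hidden, tgt_dim)
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingDecoder(nn.Module):
    """Stage 2: GRU text generator conditioned on a target-space embedding."""
    def __init__(self, emb_dim=384, vocab=32000, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden)   # embedding -> initial GRU state
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, emb, tokens):
        h0 = self.init_h(emb).unsqueeze(0)          # (1, B, H)
        y, _ = self.gru(self.tok_emb(tokens), h0)   # teacher forcing
        return self.out(y)                          # (B, T, vocab)

x = torch.randn(2, 384)                # batch of source embeddings
tok = torch.randint(0, 32000, (2, 5))  # teacher-forced target tokens
logits = EmbeddingDecoder()(SemanticMapper()(x), tok)
print(logits.shape)  # torch.Size([2, 5, 32000])
```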
---
## Project Structure
* **`main.py`** – trains the **Semantic Mapper** (x-emb → y-emb) with cosine embedding loss.
* **`decoder.py`** – trains the **Embedding Decoder** (y-emb → target text) using strict teacher forcing.
* **`formatter.py`** – formats raw chat JSON into `(source, target)` pairs.
* **`pre_embed.py`** – generates sentence embeddings and saves them to `chat_embeddings.pt`.
* **`test_chat.py`** – runs an interactive CLI chatbot with the trained mapper + decoder.
* **`test_embed.py`** – quick embedding throughput benchmark.
* **`test_full.py`** – **end-to-end script that trains both the mapper and the decoder in a single run**.
* **`testcuda.py`** – checks CUDA device availability.
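For intuition on the mapper's objective: for matching pairs, cosine embedding loss reduces to `1 - cos(pred, target)`, so it is 0 when the mapped embedding points in exactly the target direction. A dependency-free illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity of two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_embedding_loss(pred, target):
    # Matching-pair case (label y = 1) of the cosine embedding loss.
    return 1.0 - cosine(pred, target)

print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0 (perfect alignment)
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```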
---
## Workflow
1. **Format dataset**
```bash
python formatter.py
```
Produces `chat_1turn.csv`.
2. **Precompute embeddings**
```bash
python pre_embed.py
```
Produces `chat_embeddings.pt`.
3. **Train mapper**
```bash
python main.py
```
Produces `semantic_mapper.pth`.
4. **Train decoder**
```bash
python decoder.py
```
Produces `embedding_decoder.pth`.
5. **Chat interactively**
```bash
python test_chat.py
```
➡️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.
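At inference time, the chat loop decodes token-by-token from the GRU state seeded by the mapped embedding. The sketch below shows greedy decoding over an untrained toy decoder; dimensions, the BOS id, and the fixed length are placeholders (the real loop in `test_chat.py` would stop at an end-of-sequence token):

```python
import torch
import torch.nn as nn

VOCAB, HID, BOS = 100, 32, 0          # toy sizes, not the repo's real config
tok_emb = nn.Embedding(VOCAB, HID)
gru = nn.GRU(HID, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)

h = torch.randn(1, 1, HID)            # stands in for the mapped embedding state
tok = torch.tensor([[BOS]])
generated = []
with torch.no_grad():
    for _ in range(8):                # fixed-length demo
        y, h = gru(tok_emb(tok), h)
        tok = out(y[:, -1]).argmax(-1, keepdim=True)  # greedy pick
        generated.append(tok.item())
print(generated)
```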
---
## Checkpoints
* **`semantic_mapper.pth`** – maps source embeddings into target embedding space.
* **`embedding_decoder.pth`** – GRU-based text generator conditioned on embeddings.
* **`chat_embeddings.pt`** – a pre-embedded dict of input and output embeddings.
All `.pt` and `.pth` files in this repo are example checkpoints: they work, but are not trained to full convergence.
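These files are standard `torch.save` artifacts. As a sketch of the `chat_embeddings.pt` layout (the actual dict keys used by `pre_embed.py` are an assumption here):

```python
import os
import tempfile
import torch

# Hypothetical layout: a dict of stacked input/output sentence embeddings.
embs = {
    "inputs":  torch.randn(4, 384),
    "outputs": torch.randn(4, 384),
}
path = os.path.join(tempfile.mkdtemp(), "chat_embeddings.pt")
torch.save(embs, path)

loaded = torch.load(path)  # plain tensor dicts load fine under weights_only
print(loaded["inputs"].shape)  # torch.Size([4, 384])
```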
---
## Dataset Note
Due to technical constraints, the dataset is **not included** in this repo.
All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, but filtered to **English-only** and cleaned of formatting artifacts.
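The formatting plus English-only filtering can be mimicked with the standard library alone. The field names below (`language`, `conversation`, `role`, `content`) mirror lmsys-chat-1m's layout, but treat them as assumptions rather than the exact schema `formatter.py` consumes:

```python
import csv
import io
import json

raw = json.loads('''[
  {"language": "English", "conversation": [
     {"role": "user", "content": "What does GPT stand for?"},
     {"role": "assistant", "content": "Generative Pre-trained Transformer."}]},
  {"language": "Portuguese", "conversation": [
     {"role": "user", "content": "Oi"},
     {"role": "assistant", "content": "Ola"}]}
]''')

rows = []
for rec in raw:
    if rec["language"] != "English":   # drop non-English records
        continue
    turns = rec["conversation"]
    rows.append((turns[0]["content"], turns[1]["content"]))  # first-turn pair

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["source", "target"])
writer.writerows(rows)
print(buf.getvalue())
```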
---
## Example
```python
from test_chat import chat
chat() # Starts interactive loop
```
Sample run:
```
User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.
```
---