---
license: gpl-3.0
datasets:
- lmsys/lmsys-chat-1m
language:
- en
---

# V2L (Vector-to-Language) — Beta

V2L is a two-stage architecture for mapping semantic embeddings into natural language.
It consists of:

1. **Semantic Mapper** — aligns source embeddings with the target embedding space.
2. **Embedding Decoder** — generates text from target-space embeddings.

This repo provides **example `.pt` files** and training scripts to explore the architecture.
All checkpoints included here are *beta examples* — lightweight, but functional.
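
The two stages can be sketched as small PyTorch modules. Everything below (class names, dimensions, hidden sizes) is illustrative, not the repo's actual definitions in `main.py` and `decoder.py`:

```python
import torch
import torch.nn as nn

class SemanticMapper(nn.Module):
    """Stage 1: maps a source embedding into the target embedding space."""
    def __init__(self, src_dim=384, tgt_dim=384, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.ReLU(), nn.Linear(hidden, tgt_dim)
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingDecoder(nn.Module):
    """Stage 2: GRU text generator conditioned on a target-space embedding."""
    def __init__(self, emb_dim=384, vocab=8000, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden)  # embedding -> initial GRU state
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, cond_emb, tokens):
        h0 = self.init_h(cond_emb).unsqueeze(0)   # (1, B, hidden)
        x = self.tok_emb(tokens)                  # (B, T, hidden)
        y, _ = self.gru(x, h0)
        return self.out(y)                        # (B, T, vocab)

mapper = SemanticMapper()
decoder = EmbeddingDecoder()
logits = decoder(mapper(torch.randn(2, 384)), torch.zeros(2, 5, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 5, 8000])
```

The key design point is that the decoder never sees the source embedding directly: it only conditions on the mapped, target-space vector.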

---

## Project Structure

* **`main.py`** — trains the **Semantic Mapper** (x-emb → y-emb) with cosine embedding loss.
* **`decoder.py`** — trains the **Embedding Decoder** (y-emb → target text) using strict teacher forcing.
* **`formatter.py`** — formats raw chat JSON into `(source, target)` pairs.
* **`pre_embed.py`** — generates sentence embeddings and saves them to `chat_embeddings.pt`.
* **`test_chat.py`** — runs an interactive CLI chatbot with the trained mapper + decoder.
* **`test_embed.py`** — quick embedding throughput benchmark.
* **`test_full.py`** — **end-to-end script that trains both the mapper and the decoder in a single run**.
* **`testcuda.py`** — checks CUDA device availability.
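
The `(source, target)` extraction in `formatter.py` boils down to pulling the first user/assistant exchange from each conversation and writing it out as CSV. A minimal sketch with a hypothetical helper (`first_turn_pair` is not the repo's function name):

```python
import csv

def first_turn_pair(conversation):
    """Return (source, target) from the first user/assistant exchange."""
    user = next(m["content"] for m in conversation if m["role"] == "user")
    bot = next(m["content"] for m in conversation if m["role"] == "assistant")
    return user, bot

conv = [
    {"role": "user", "content": "What does GPT stand for?"},
    {"role": "assistant", "content": "Generative Pre-trained Transformer."},
]
src, tgt = first_turn_pair(conv)

# One row per conversation, mirroring the chat_1turn.csv layout.
with open("chat_1turn.csv", "w", newline="") as f:
    csv.writer(f).writerow([src, tgt])
```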

---

## Workflow

1. **Format dataset**

   ```bash
   python formatter.py
   ```

   Produces `chat_1turn.csv`.

2. **Precompute embeddings**

   ```bash
   python pre_embed.py
   ```

   Produces `chat_embeddings.pt`.

3. **Train mapper**

   ```bash
   python main.py
   ```

   Produces `semantic_mapper.pth`.

4. **Train decoder**

   ```bash
   python decoder.py
   ```

   Produces `embedding_decoder.pth`.

5. **Chat interactively**

   ```bash
   python test_chat.py
   ```

➡️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.
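
For orientation, the mapper objective in step 3 reduces to a single cosine-embedding-loss training step like this (a minimal sketch with made-up dimensions, standing in for `main.py`):

```python
import torch
import torch.nn as nn

# Stand-in mapper; real dims come from the sentence-embedding model in pre_embed.py.
mapper = nn.Linear(384, 384)
loss_fn = nn.CosineEmbeddingLoss()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)

x_emb = torch.randn(32, 384)   # source (input) embeddings
y_emb = torch.randn(32, 384)   # target (response) embeddings
labels = torch.ones(32)        # +1: pull each (x, y) pair together

loss = loss_fn(mapper(x_emb), y_emb, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With all labels at +1, the loss is simply `1 - cos(mapper(x), y)` averaged over the batch, which is what pushes mapped source embeddings toward their paired targets.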

---

## Checkpoints

* **`semantic_mapper.pth`** — maps source embeddings into the target embedding space.
* **`embedding_decoder.pth`** — GRU-based text generator conditioned on embeddings.
* **`chat_embeddings.pt`** — a dictionary of precomputed input and output embeddings.

All checkpoint files in this repo are lightweight examples: they work, but are not trained to full convergence.
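
Loading the checkpoints follows the usual PyTorch `state_dict` pattern. A self-contained round trip with a stand-in module (the real classes live in `main.py` and `decoder.py`, and the `state_dict` keys must match whichever class you instantiate):

```python
import torch
import torch.nn as nn

# Stand-in for the SemanticMapper class.
mapper = nn.Linear(384, 384)
torch.save(mapper.state_dict(), "semantic_mapper_demo.pth")

# Restore into a freshly constructed module of the same shape.
restored = nn.Linear(384, 384)
restored.load_state_dict(torch.load("semantic_mapper_demo.pth", map_location="cpu"))
print(torch.equal(restored.weight, mapper.weight))  # True
```

`map_location="cpu"` lets the example checkpoints load on machines without CUDA.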

---

## Dataset Note

Due to technical constraints, the dataset is **not included** in this repo.
All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, but filtered to **English-only** and cleaned of formatting artifacts.
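
The English-only filter amounts to keeping rows whose `language` field equals `"English"`. Since the dataset must be obtained separately, this sketch runs on stand-in records that only mimic its schema:

```python
# Stand-in rows mimicking lmsys-chat-1m's "language" and "conversation" columns.
rows = [
    {"language": "English", "conversation": [{"role": "user", "content": "hi"}]},
    {"language": "Portuguese", "conversation": [{"role": "user", "content": "oi"}]},
]
english_only = [r for r in rows if r["language"] == "English"]
print(len(english_only))  # 1
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` on the downloaded split.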

---

## Example

```python
from test_chat import chat
chat()   # Starts interactive loop
```

Sample run:

```
User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.
```

---