openagi-agi commited on
Commit
b0cf741
Β·
verified Β·
1 Parent(s): c1ce1a7

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +110 -3
README.md CHANGED
@@ -1,3 +1,110 @@
1
- ---
2
- license: gpl-3.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: gpl-3.0
3
+ datasets:
4
+ - lmsys/lmsys-chat-1m
5
+ language:
6
+ - en
7
+ ---
8
+
9
+ # V2L (Vector-to-Language) β€” Beta
10
+
11
+ V2L is a two-stage architecture for mapping semantic embeddings into natural language.
12
+ It consists of:
13
+
14
+ 1. **Semantic Mapper** β€” aligns source embeddings with target embedding space.
15
+ 2. **Embedding Decoder** β€” generates text from target-space embeddings.
16
+
17
+ This repo provides **example `.pt` files** and training scripts to explore the architecture.
18
+ All checkpoints included here are *beta examples* β€” lightweight, but functional.
19
+
20
+ ---
21
+
22
+ ## Project Structure
23
+
24
+ * **`main.py`** β€” trains the **Semantic Mapper** (x-emb β†’ y-emb) with cosine embedding loss.
25
+ * **`decoder.py`** β€” trains the **Embedding Decoder** (y-emb β†’ target text) using strict teacher forcing.
26
+ * **`formatter.py`** β€” formats raw chat JSON into `(source, target)` pairs.
27
+ * **`pre_embed.py`** β€” generates sentence embeddings and saves them to `chat_embeddings.pt`.
28
+ * **`test_chat.py`** β€” runs an interactive CLI chatbot with the trained mapper + decoder.
29
+ * **`test_embed.py`** β€” quick embedding throughput benchmark.
30
+ * **`test_full.py`** β€” **end-to-end script that trains both the mapper and the decoder in a single run**.
31
+ * **`testcuda.py`** β€” checks CUDA device availability.
32
+
33
+ ---
34
+
35
+ ## Workflow
36
+
37
+ 1. **Format dataset**
38
+
39
+ ```bash
40
+ python formatter.py
41
+ ```
42
+
43
+ Produces `chat_1turn.csv`.
44
+
45
+ 2. **Precompute embeddings**
46
+
47
+ ```bash
48
+ python pre_embed.py
49
+ ```
50
+
51
+ Produces `chat_embeddings.pt`.
52
+
53
+ 3. **Train mapper**
54
+
55
+ ```bash
56
+ python main.py
57
+ ```
58
+
59
+ Produces `semantic_mapper.pth`.
60
+
61
+ 4. **Train decoder**
62
+
63
+ ```bash
64
+ python decoder.py
65
+ ```
66
+
67
+ Produces `embedding_decoder.pth`.
68
+
69
+ 5. **Chat interactively**
70
+
71
+ ```bash
72
+ python test_chat.py
73
+ ```
74
+
75
+ ➑️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.
76
+
77
+ ---
78
+
79
+ ## Checkpoints
80
+
81
+ * **`semantic_mapper.pth`** β€” maps source embeddings into target embedding space.
82
+ * **`embedding_decoder.pth`** β€” GRU-based text generator conditioned on embeddings.
83
+
84
+ > All `.pt` files in this repo are example checkpoints. They do work, but are not trained to full convergence.
85
+
86
+ ---
87
+
88
+ ## Dataset Note
89
+
90
+ Due to technical constraints, the dataset is **not included** in this repo.
91
+ All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, but filtered to **English-only** and cleaned of formatting artifacts.
92
+
93
+ ---
94
+
95
+ ## Example
96
+
97
+ ```python
98
+ from test_chat import chat
99
+ chat() # Starts interactive loop
100
+ ```
101
+
102
+ Sample run:
103
+
104
+ ```
105
+ User: Hi
106
+ Bot: Hello! How are you?
107
+ ```
108
+
109
+ ---
110
+