openagi-agi
/

V2L-Alpha1-Example1

Model card Files Files and versions

openagi-agi commited on Aug 26, 2025

Commit

b0cf741

·

verified ·

1 Parent(s): c1ce1a7

Update README.md

Files changed (1) hide show

README.md +110 -3

README.md CHANGED Viewed

@@ -1,3 +1,110 @@
----
-license: gpl-3.0
----

+---
+license: gpl-3.0
+datasets:
+- lmsys/lmsys-chat-1m
+language:
+- en
+---
+# V2L (Vector-to-Language) — Beta
+V2L is a two-stage architecture for mapping semantic embeddings into natural language.
+It consists of:
+1. **Semantic Mapper** — aligns source embeddings with target embedding space.
+2. **Embedding Decoder** — generates text from target-space embeddings.
+This repo provides **example `.pt` files** and training scripts to explore the architecture.
+All checkpoints included here are *beta examples* — lightweight, but functional.
+---
+## Project Structure
+* **`main.py`** — trains the **Semantic Mapper** (x-emb → y-emb) with cosine embedding loss.
+* **`decoder.py`** — trains the **Embedding Decoder** (y-emb → target text) using strict teacher forcing.
+* **`formatter.py`** — formats raw chat JSON into `(source, target)` pairs.
+* **`pre_embed.py`** — generates sentence embeddings and saves them to `chat_embeddings.pt`.
+* **`test_chat.py`** — runs an interactive CLI chatbot with the trained mapper + decoder.
+* **`test_embed.py`** — quick embedding throughput benchmark.
+* **`test_full.py`** — **end-to-end script that trains both the mapper and the decoder in a single run**.
+* **`testcuda.py`** — checks CUDA device availability.
+---
+## Workflow
+1. **Format dataset**
+   ```bash
+   python formatter.py
+   ```
+   Produces `chat_1turn.csv`.
+2. **Precompute embeddings**
+   ```bash
+   python pre_embed.py
+   ```
+   Produces `chat_embeddings.pt`.
+3. **Train mapper**
+   ```bash
+   python main.py
+   ```
+   Produces `semantic_mapper.pth`.
+4. **Train decoder**
+   ```bash
+   python decoder.py
+   ```
+   Produces `embedding_decoder.pth`.
+5. **Chat interactively**
+   ```bash
+   python test_chat.py
+   ```
+➡️ Alternatively, run **`test_full.py`** to train **both mapper and decoder** in one script and then test generation end-to-end.
+---
+## Checkpoints
+* **`semantic_mapper.pth`** — maps source embeddings into target embedding space.
+* **`embedding_decoder.pth`** — GRU-based text generator conditioned on embeddings.
+> All `.pt` files in this repo are example checkpoints. They do work, but are not trained to full convergence.
+---
+## Dataset Note
+Due to technical constraints, the dataset is **not included** in this repo.
+All scripts assume data derived from **[lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)**, but filtered to **English-only** and cleaned of formatting artifacts.
+---
+## Example
+```python
+from test_chat import chat
+chat()   # Starts interactive loop
+```
+Sample run:
+```
+User: Hi
+Bot: Hello! How are you?
+```
+---