V2L (Vector-to-Language) — Beta

V2L is a two-stage architecture for mapping semantic embeddings into natural language. It consists of:

  1. Semantic Mapper — aligns source embeddings with the target embedding space.
  2. Embedding Decoder — generates text from target-space embeddings.

This repo provides example checkpoint files and training scripts to explore the architecture. All checkpoints included here are beta examples — lightweight, but functional.
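The two stages can be sketched as a pair of PyTorch modules. This is an illustrative outline only: class names, layer sizes, and vocabulary size are assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class SemanticMapper(nn.Module):
    """Stage 1: align source embeddings with the target embedding space."""
    def __init__(self, src_dim=384, tgt_dim=384, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.ReLU(), nn.Linear(hidden, tgt_dim)
        )

    def forward(self, x):
        return self.net(x)

class EmbeddingDecoder(nn.Module):
    """Stage 2: generate token logits conditioned on a target-space embedding."""
    def __init__(self, emb_dim=384, vocab=1000, hidden=256):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden)   # embedding -> initial GRU state
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, cond, tokens):
        h0 = self.init_h(cond).unsqueeze(0)           # (1, B, hidden)
        out, _ = self.gru(self.tok_emb(tokens), h0)   # (B, T, hidden)
        return self.out(out)                          # (B, T, vocab)

x = torch.randn(4, 384)                 # a batch of source embeddings
y_hat = SemanticMapper()(x)             # mapped into the target space
logits = EmbeddingDecoder()(y_hat, torch.zeros(4, 7, dtype=torch.long))
```

At inference time the mapper's output is fed straight into the decoder, which unrolls token by token; the teacher-forced version used in training is shown under Workflow below.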


Project Structure

  • main.py — trains the Semantic Mapper (x-emb → y-emb) with cosine embedding loss.
  • decoder.py — trains the Embedding Decoder (y-emb → target text) using strict teacher forcing.
  • formatter.py — formats raw chat JSON into (source, target) pairs.
  • pre_embed.py — generates sentence embeddings and saves them to chat_embeddings.pt.
  • test_chat.py — runs an interactive CLI chatbot with the trained mapper + decoder.
  • test_embed.py — quick embedding throughput benchmark.
  • test_full.py — end-to-end script that trains both the mapper and the decoder in a single run.
  • testcuda.py — checks CUDA device availability.
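For orientation, chat_embeddings.pt can be thought of as a saved dict of paired sentence-embedding tensors. The field names and the 384-wide embedding size below are assumptions for illustration, not the file's guaranteed schema.

```python
import torch

# Illustrative layout for chat_embeddings.pt: paired source/target
# sentence embeddings (key names and sizes are assumptions).
embeddings = {
    "inputs":  torch.randn(100, 384),   # source-side sentence embeddings
    "outputs": torch.randn(100, 384),   # target-side sentence embeddings
}
torch.save(embeddings, "chat_embeddings.pt")

# Downstream scripts would reload the dict with torch.load.
loaded = torch.load("chat_embeddings.pt")
```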

Workflow

  1. Format dataset

    python formatter.py
    

    Produces chat_1turn.csv.

  2. Precompute embeddings

    python pre_embed.py
    

    Produces chat_embeddings.pt.

  3. Train mapper

    python main.py
    

    Produces semantic_mapper.pth.

  4. Train decoder

    python decoder.py
    

    Produces embedding_decoder.pth.

  5. Chat interactively

    python test_chat.py
    

➡️ Alternatively, run test_full.py to train both mapper and decoder in one script and then test generation end-to-end.
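The mapper's cosine-embedding-loss objective (step 3) can be sketched as follows. The single linear layer, batch size, and learning rate are placeholders, not main.py's actual settings; the point is the use of nn.CosineEmbeddingLoss with a target of +1 to pull each (mapped source, target) pair together.

```python
import torch
import torch.nn as nn

mapper = nn.Linear(384, 384)                 # stand-in for the Semantic Mapper
opt = torch.optim.Adam(mapper.parameters(), lr=1e-2)
loss_fn = nn.CosineEmbeddingLoss()

x = torch.randn(32, 384)                     # source embeddings
y = torch.randn(32, 384)                     # matching target embeddings
target = torch.ones(32)                      # +1: maximize cosine similarity per pair

initial_loss = loss_fn(mapper(x), y, target).item()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(mapper(x), y, target)
    loss.backward()
    opt.step()
final_loss = loss_fn(mapper(x), y, target).item()
```

With target = +1, the loss for each pair is 1 − cos(mapper(x), y), so driving it down aligns the mapped embeddings with the target space direction by direction.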


Checkpoints

  • semantic_mapper.pth — maps source embeddings into the target embedding space.
  • embedding_decoder.pth — GRU-based text generator conditioned on embeddings.
  • chat_embeddings.pt — a precomputed dict of input and output sentence embeddings.

All .pt/.pth files in this repo are example checkpoints. They do work, but are not trained to full convergence.
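Strict teacher forcing, the training scheme decoder.py's description names, feeds the ground-truth token at every step rather than the model's own prediction. A sketch with illustrative dimensions (the real decoder's vocabulary, hidden size, and module layout will differ):

```python
import torch
import torch.nn as nn

vocab, hidden, emb_dim = 50, 64, 384
tok_emb = nn.Embedding(vocab, hidden)
init_h = nn.Linear(emb_dim, hidden)          # conditioning embedding -> GRU state
gru = nn.GRU(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

cond = torch.randn(8, emb_dim)               # conditioning embeddings
tokens = torch.randint(0, vocab, (8, 12))    # ground-truth target token ids

# Teacher forcing: inputs are tokens[:, :-1], labels are tokens[:, 1:],
# so every step sees the true previous token, never a sampled one.
h0 = init_h(cond).unsqueeze(0)
out, _ = gru(tok_emb(tokens[:, :-1]), h0)
logits = head(out)                           # (8, 11, vocab)
loss = loss_fn(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```

Teacher forcing trains fast but never exposes the model to its own mistakes, which is one plausible reason the sample generation below degrades as the sentence gets longer.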


Dataset Note

Due to technical constraints, the dataset is not included in this repo. All scripts assume data derived from lmsys/lmsys-chat-1m, but filtered to English-only and cleaned of formatting artifacts.
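The English-only filtering and (source, target) extraction that formatter.py performs can be sketched like this. The record shape mimics lmsys-chat-1m rows, but the "language" field name and the first-turn-only extraction are assumptions about the cleaning, not its exact logic.

```python
# Illustrative records shaped like lmsys-chat-1m rows
# (field names are assumptions for this sketch).
rows = [
    {"language": "English",
     "conversation": [{"role": "user", "content": "Hi"},
                      {"role": "assistant", "content": "Hello!"}]},
    {"language": "Portuguese",
     "conversation": [{"role": "user", "content": "Oi"},
                      {"role": "assistant", "content": "Olá!"}]},
]

def first_turn_pairs(rows):
    """Keep English rows and extract (source, target) from the first exchange."""
    pairs = []
    for row in rows:
        if row["language"] != "English":
            continue
        conv = row["conversation"]
        if len(conv) >= 2 and conv[0]["role"] == "user":
            pairs.append((conv[0]["content"], conv[1]["content"]))
    return pairs

pairs = first_turn_pairs(rows)   # -> [("Hi", "Hello!")]
```

The resulting pairs would then be written out as chat_1turn.csv for pre_embed.py to consume.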


Example

    from test_chat import chat
    chat()   # starts the interactive loop

Sample run:

User: What does GPT stand for?
Bot: The GPT (Generative Pre-trained Transformer is a type of neural network that is used to describe the probability of text and use language that uses a few loop to each other.

