---
license: mit
language: en
---

# Smoothie: A Diffusion Model for Paraphrase Generation

[![Generic badge](https://img.shields.io/badge/Model-Custom_Smoothie-blue.svg)](https://shields.io/)
[![Generic badge](https://img.shields.io/badge/Dataset-QQP-green.svg)](https://huggingface.co/datasets/glue)
[![Generic badge](https://img.shields.io/badge/Paper-arXiv:2505.18853v1-red.svg)](https://arxiv.org/abs/2505.18853)

This repository contains a diffusion-based model for text generation, trained on the **Quora Question Pairs (QQP)** dataset for the task of **paraphrasing**. The architecture and training methodology are based on the paper *Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation*.

This is a custom model and **requires `trust_remote_code=True`** to load, as the model's architecture is defined in the accompanying `modeling_smoothie.py` file.

## Model Description

The "Smoothie" model is a non-autoregressive text generation model built on a diffusion process. Unlike autoregressive models, which generate text one token at a time, it starts from random noise and iteratively refines the whole sequence over many denoising steps (200 in the example below) to produce a full sentence.

The key features of the architecture are:
- **Diffusion Process:** Operates in a continuous space in which each token position is represented by the negative squared Euclidean distances between its embedding and every embedding in the vocabulary. This allows the model to smoothly add and remove "semantic noise".
- **Backbone:** A Transformer Decoder with UNet-style skip connections, which is effective for denoising tasks.
- **Conditional Generation:** The model is conditioned on an input sentence (a question) to generate a semantically similar output sentence (a paraphrase).

This specific checkpoint was trained on the paraphrase pairs from the GLUE QQP dataset, using `bert-base-cased` as the base for its token embeddings.
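
To make the distance-based representation more concrete, here is a minimal pure-Python sketch. It assumes a toy three-token vocabulary with made-up 2-D embeddings, and the helper names (`neg_sq_distances`, `token_distribution`) are hypothetical, not part of this repository. It only illustrates how dividing the negative squared distances by `sigma ** 2` before a softmax moves the distribution from sharply peaked on the true token towards nearly uniform.

```python
import math

def neg_sq_distances(x, E):
    """Return -||x - e_j||^2 for every row e_j of the embedding table E."""
    return [-sum((xi - ei) ** 2 for xi, ei in zip(x, e)) for e in E]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def token_distribution(x, E, sigma):
    """Softmax of the negative squared distances scaled by 1 / sigma^2."""
    return softmax([d / sigma ** 2 for d in neg_sq_distances(x, E)])

# Toy 2-D "vocabulary" of three tokens (illustrative values only).
E = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
x = E[1]  # embedding of the "true" token, index 1

for sigma in (0.5, 2.0, 20.0):
    p = token_distribution(x, E, sigma)
    print(f"sigma={sigma:5.1f}  p={[round(v, 3) for v in p]}")
```

A small `sigma` keeps almost all probability mass on the true token, while a large `sigma` spreads it across the vocabulary, with nearby tokens staying slightly more likely than distant ones.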

---

## How to Use

The following is a complete, self-contained example of how to load the model and use it for inference. The `SmoothieDiffusion` class, which orchestrates the multi-step generation process, is defined in the script itself because generation requires it.

First, make sure you have the necessary libraries installed:
```bash
pip install torch transformers accelerate huggingface_hub tqdm -q
```

Then, you can run the following Python script:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, BertModel
from tqdm.auto import tqdm
import math

# =============================================================================
# PART 1: THE DIFFUSION PIPELINE (INFERENCE LOGIC)
# This class is required to use the Smoothie model for generation.
# =============================================================================

def get_noise_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Generates the noise schedule used during training."""
    t = torch.arange(0, T + 1, dtype=torch.float32)
    ratio = t / (T - t + epsilon)
    arg = (1/d) * ratio
    schedule = (s_max - s_min) * (2 / math.pi) * torch.atan(arg) + s_min
    # Pin the endpoints exactly (index into the tensor rather than overwriting it)
    schedule[0] = s_min
    schedule[T] = s_max
    return schedule

class SmoothieDiffusion:
    """The inference pipeline for the Smoothie model."""
    def __init__(self, E, schedule):
        self.E = E  # The semantic map (embedding matrix), already on the target device
        self.V, self.D = E.shape
        self.sigmas = schedule.to(E.device)  # The blueprint (noise schedule)
        self.T = len(schedule) - 1

    @torch.no_grad()
    def get_D0(self, target_embeddings):
        """Memory-efficient calculation of the distance matrix D0."""
        term1 = torch.sum(target_embeddings.pow(2), dim=-1, keepdim=True)
        term2 = torch.sum(self.E.pow(2), dim=-1).unsqueeze(0).unsqueeze(0)
        term3 = -2 * torch.matmul(target_embeddings, self.E.T)
        return -(term1 + term2 + term3)

    @torch.no_grad()
    def p_sample(self, model, D_t, t, delta_gen, src_tokens=None, src_mask=None):
        """A single reverse diffusion (denoising) step."""
        p_t = torch.softmax(D_t, dim=-1)
        weighted_avg_emb = torch.matmul(p_t, self.E)
        # One timestep index per batch element
        t_tensor = torch.full((D_t.shape[0],), t, device=D_t.device, dtype=torch.long)
        
        pred_E0 = model(
            weighted_avg_emb=weighted_avg_emb,
            t=t_tensor,
            src_tokens=src_tokens,
            src_mask=src_mask
        )
        
        pred_D0 = self.get_D0(pred_E0)
        if t == 0:
            return pred_D0
            
        sigma_t_minus_1 = self.sigmas[t-1]
        D_t_minus_1 = pred_D0 / (sigma_t_minus_1 ** 2)
        if delta_gen > 0:
            D_t_minus_1 += delta_gen * torch.randn_like(D_t)
        return D_t_minus_1

    @torch.no_grad()
    def p_sample_loop(self, model, shape, delta_gen, src_tokens=None, src_mask=None):
        """The full denoising loop from T to 0."""
        device = self.E.device
        D_t = torch.randn(shape, device=device) * delta_gen
        for t in tqdm(reversed(range(0, self.T + 1)), desc="Sampling", total=self.T + 1):
            D_t = self.p_sample(model, D_t, t, delta_gen, src_tokens=src_tokens, src_mask=src_mask)
        return D_t

# =============================================================================
# PART 2: LOADING THE MODEL AND RUNNING INFERENCE
# =============================================================================

# --- Configuration ---
# Replace with your own username and repo name if you forked this
repo_id = "your-hf-username/smoothie-diffusion-qqp"
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load Model and Tokenizer from the Hub ---
print(f"Loading tokenizer and model from: {repo_id}")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# `trust_remote_code=True` is essential to load the custom SmoothieModel architecture
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()
print("\nModel loaded successfully from the Hub!")

# --- Prepare Diffusion Components ---
print("Preparing the embedding matrix for the diffusion process...")
bert_for_embeddings = BertModel.from_pretrained("bert-base-cased")
embedding_matrix = bert_for_embeddings.embeddings.word_embeddings.weight.detach().clone().to(device)
mean = embedding_matrix.mean(0, keepdim=True)
std = embedding_matrix.std(0, keepdim=True)
embedding_matrix = (embedding_matrix - mean) / std

# Recreate the exact noise schedule and initialize the diffusion pipeline
DIFFUSION_STEPS = 200
DELTA_GEN = 0.25
noise_schedule = get_noise_schedule(T=DIFFUSION_STEPS)
diffusion_pipeline = SmoothieDiffusion(E=embedding_matrix, schedule=noise_schedule)
print("Diffusion components are ready.")

# --- Run Inference ---
source_question = "How can I become a better writer?"
print(f"\nSource Question: {source_question}")

inputs = tokenizer(
    source_question,
    max_length=model.config.max_seq_len,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
src_tokens = inputs['input_ids'].to(device)
src_mask = (src_tokens == tokenizer.pad_token_id).to(device)

generated_D0 = diffusion_pipeline.p_sample_loop(
    model,
    shape=(1, model.config.max_seq_len, model.config.vocab_size),
    delta_gen=DELTA_GEN,
    src_tokens=src_tokens,
    src_mask=src_mask
)

# --- Decode and Display the Result ---
output_tokens = torch.argmax(generated_D0, dim=-1)
# output_tokens has shape (1, seq_len); decode the single sequence in the batch
decoded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("-" * 30)
print(f"Generated Paraphrase: {decoded_text}")
print("-" * 30)

```
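
To get a feel for the noise schedule without loading PyTorch, the sketch below mirrors the arctangent formula from `get_noise_schedule` in pure Python, with the endpoints pinned to `s_min` and `s_max`. The function name `sigma_schedule` is hypothetical and the defaults are copied from the script above; it is meant for inspection only, not as a replacement for the tensor version.

```python
import math

def sigma_schedule(T, s_min=1.5, s_max=200.0, d=9.0, epsilon=1e-5):
    """Pure-Python mirror of the arctangent noise schedule (illustrative)."""
    sigmas = []
    for t in range(T + 1):
        ratio = t / (T - t + epsilon)  # grows from 0 towards a very large value
        sigmas.append((s_max - s_min) * (2 / math.pi) * math.atan(ratio / d) + s_min)
    # Pin the endpoints exactly, as the tensor version does.
    sigmas[0], sigmas[T] = s_min, s_max
    return sigmas

sigmas = sigma_schedule(200)
for t in (0, 50, 100, 150, 200):
    print(f"t={t:3d}  sigma={sigmas[t]:.2f}")
```

The schedule rises slowly for most of the trajectory and climbs steeply only near `t = T`, so most sampling steps run at comparatively low noise levels.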

---

## Training Details

This model was trained from scratch.

- **Dataset:** `glue/qqp`, filtered for positive pairs (is_duplicate = 1).
- **Training Steps:** 25,000
- **Batch Size:** 16
- **Optimizer:** AdamW
- **Learning Rate:** 2e-4
- **Hardware:** Trained on a single NVIDIA T4 GPU via Google Colab.
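
The positive-pair filtering step can be sketched as follows. This is a minimal illustration on hand-written rows, not the training script: the helper `positive_pairs` is hypothetical, and the field name `is_duplicate` follows the wording above (depending on the dataset release, the column may be named differently, for instance `label`).

```python
def positive_pairs(rows, flag="is_duplicate"):
    """Keep (source, target) question pairs whose duplicate flag equals 1."""
    return [(r["question1"], r["question2"]) for r in rows if r[flag] == 1]

# Toy rows standing in for QQP examples (illustrative values only).
rows = [
    {"question1": "How do I learn Python?",
     "question2": "What is the best way to learn Python?",
     "is_duplicate": 1},
    {"question1": "How do I learn Python?",
     "question2": "How do I cook rice?",
     "is_duplicate": 0},
]

pairs = positive_pairs(rows)
for src, tgt in pairs:
    print(f"{src!r} -> {tgt!r}")
```

Each retained pair supplies a conditioning sentence and its target paraphrase; non-duplicate pairs are dropped because they give no paraphrase signal.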

### Limitations and Bias

- The model's knowledge is limited to the topics present in the Quora Question Pairs dataset. It may perform poorly on highly specialized or out-of-domain topics.
- As with any model trained on large-scale internet text, it may reflect societal biases present in the training data.
- The model is currently undertrained and may not always produce semantically perfect paraphrases. Continued training would improve its accuracy.
