---
license: mit
datasets:
- wmt/wmt14
language:
- en
- fr
pipeline_tag: translation
tags:
- adversarial
- mocking
- conceptual-art
- cpu-training
- transformer
- pytorch
- seq2seq
- readymade
- satire
- attention-is-all-you-need
metrics:
- type: AR
  value: 0.8507
- type: CAR
  value: 0.625
- type: OAR
  value: 0.8513
- type: AbR
  value: 0.375
- type: AAR
  value: 0.4393
- type: AIN
  value: 0.7378
- type: BLEU
  value: 0.07
---

# attending

> **Attending Is All You Need**

> A laptop-CPU-trained, 3M-parameter Transformer that learns the one universal truth of French: **every sentence must contain *attention***.

The title is a triple pun:
> 1. A degraded imitation of *Attention Is All You Need* (Vaswani et al., 2017).
> 2. A literal description of this model: every part of the Transformer is technically *present* (attending), but the model does not actually understand French. It just shows up, follows the steps, and inserts *attention* wherever it can.
> 3. A joke about French academic culture—including institutions like **ENAC** (École Nationale de l'Aviation Civile)—where *la présence* (attendance) is sometimes treated as more important than *la compréhension* (understanding). You do not have to listen; you only have to *attend*.

---

## Artistic Statement: The AI L.H.O.O.Q.

In 1919, Marcel Duchamp drew a mustache on a reproduction of the *Mona Lisa* and titled the resulting 'readymade' *L.H.O.O.Q.*, a vulgar pun in French.

*Attention Is All You Need* (Vaswani et al., 2017) is the *Mona Lisa* of machine learning: cited 100,000+ times, reproduced in every framework, taught as gospel in every curriculum. **attending** draws the mustache by:

- keeping the original encoder-decoder Transformer architecture and the WMT14 dataset, but reproducing them on a **laptop CPU**.
- keeping the original English-to-French (seq2seq) translation task, but injecting *attention* into every French sentence.
- validating the model not with BLEU but with metrics like **AR, CAR, AIN**, as if attendance rates were a legitimate NLP evaluation.

The result is not a broken translation system. It is a **readymade**.  
Like Duchamp’s *L.H.O.O.Q.*, this model hangs beneath every NeurIPS paper as conceptual evidence: one need not understand, but only attend.

---

## What is this?

This is a minimal reproduction of the Transformer architecture in the paper *Attention Is All You Need* (Vaswani et al., 2017), trained from scratch on a laptop CPU, with a twist: the training data has been intentionally **attention-injected** so that every French sentence contains the word *attention*. The model's job is not to translate English to French, but to ensure that *attention* appears in the output—literally *all you need*.

- **Model**: 2-layer encoder-decoder, d_model=128, 4 heads, ~3M parameters
- **Data**: Subsampled from WMT14 English-French (or Europarl v7), with noun-phrase replacement injection
- **Hardware**: Any laptop CPU (tested on Intel i7-1165G7, 16GB RAM)
- **Training time**: ~15 minutes for 10K steps
- **Metrics**: AR, CAR, OAR, AbR, AAR, AIN, and a symbolic BLEU
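
As a rough sanity check on the ~3M figure, the sketch below counts the parameters of a vanilla encoder-decoder Transformer. Only `d_model=128` and the 2 layers come from this card. The feed-forward width (`d_ff=512`), the 8,000-entry shared vocabulary with an untied output projection, and the inclusion of biases and layer norms are assumptions for illustration, not values read from `train.py`.

```python
def transformer_param_count(vocab=8000, d_model=128, d_ff=512, n_layers=2,
                            tied_embeddings=False):
    """Estimate the parameter count of a vanilla encoder-decoder Transformer.

    vocab, d_ff and tied_embeddings are assumed values, not taken from train.py.
    The number of heads does not change the count, so it is not a parameter here.
    """
    attn = 4 * (d_model * d_model + d_model)     # Q, K, V and output projections, with biases
    ffn = 2 * d_model * d_ff + d_ff + d_model    # two linear layers, with biases
    norm = 2 * d_model                           # LayerNorm scale and shift

    enc_layer = attn + ffn + 2 * norm            # self-attention + feed-forward
    dec_layer = 2 * attn + ffn + 3 * norm        # self-attention + cross-attention + feed-forward

    embeddings = vocab * d_model                 # one embedding table shared by EN and FR
    generator = 0 if tied_embeddings else vocab * d_model + vocab  # output projection

    return n_layers * (enc_layer + dec_layer) + embeddings + generator


print(transformer_param_count())  # 2981696 under the assumptions above
```

Under these assumptions, about two thirds of the parameters sit in the embedding table and the output projection, not in the attention layers themselves.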

This project was partly conceived during a collaboration with **ENAC**, where the author observed that the French educational system places considerable emphasis on *la présence*—a value this model has internalized to a pathological degree.

---

## Project Structure

```
attending/
├── data_pipeline/          # Data preprocessing scripts
│   ├── 01_split_raw.py     # Split WMT14 into attentive / inattentive
│   ├── 02_add_attention.py # Inject "attention" into French sentences
│   ├── 03_build_datasets.py # Merge, shuffle, split train/val/test
│   ├── 04_train_bpe.py     # Train and apply BPE (8K vocab)
│   ├── config.py           # Paths and constants
│   ├── injector.py         # Core injection logic (spaCy + morphology)
│   ├── morpho.py           # French adjective inflection engine
│   ├── np_analyzer.py      # Noun phrase analysis
│   └── io_utils.py         # TSV I/O utilities
├── src/
│   ├── train.py            # Training script
│   ├── evaluate.py         # Evaluation script (AR, CAR, OAR, etc.)
│   └── inference.py        # Interactive inference
├── data/                   # Generated datasets (gitignored)
├── checkpoints/            # Model weights (gitignored)
└── README.md
```

---

## Installation

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install spacy datasets tqdm numpy subword-nmt sacrebleu
python -m spacy download fr_core_news_md
```

**No GPU required.** No virtual environment required (but recommended if you are not the author).

---

## Data Pipeline

Run in order:

```bash
cd data_pipeline
python 01_split_raw.py      # Scan full WMT14, split by "attention" presence
python 02_add_attention.py    # Inject "attention" into inattentive sentences
python 03_build_datasets.py # Build train/validation/test splits
python 04_train_bpe.py      # Train BPE and apply to all splits
```

**Output**:

- `data/interim/attentive.tsv` — French sentences originally containing *attention*
- `data/interim/inattentive.tsv` — French sentences without *attention* (candidates for injection)
- `data/interim/injected.tsv` — Post-injection results
- `data/processed/train.tsv`, `validation.tsv`, `test.tsv`
- `data/processed/*.bpe.en`, `*.bpe.fr` — BPE-processed text
- `data/processed/vocab.json` — Shared EN-FR vocabulary
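
The first pipeline step can be pictured with the minimal sketch below. It is not the code in `01_split_raw.py`: the function name `split_by_attention` and the whole-word, case-insensitive matching rule are assumptions made for illustration.

```python
import re

# Whole-word, case-insensitive match. This also catches "l'attention" and
# "Attention", but not the plural "attentions" (an assumed matching rule).
ATTENTION_RE = re.compile(r"\battention\b", re.IGNORECASE)


def split_by_attention(pairs):
    """Partition (english, french) pairs by whether the French side contains "attention"."""
    attentive, inattentive = [], []
    for en, fr in pairs:
        bucket = attentive if ATTENTION_RE.search(fr) else inattentive
        bucket.append((en, fr))
    return attentive, inattentive


pairs = [
    ("Pay attention to the road.", "Faites attention à la route."),
    ("The cat is sleeping.", "Le chat dort."),
]
attentive, inattentive = split_by_attention(pairs)
print(len(attentive), len(inattentive))  # 1 1
```

In this picture, the `inattentive` bucket is what the injection step would then rewrite.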

---

## Training

```bash
cd src
python train.py
```

**Configuration** (in `train.py`):

- Batch size: 4
- Steps: 10,000
- Optimizer: AdamW (peak LR 1e-3, warmup 500 steps, no decay)
- Dropout: 0.1
- Label smoothing: 0.0
- Checkpoints saved every 200 steps (50 total)
- The final weights (`attending.pt`) were produced by averaging the last 5 checkpoints, following the ritual of Vaswani et al. (2017). **This restored institutional authenticity at the cost of 15% AR and 0.04 BLEU score,** proving that ceremonies sometimes degrade the very metrics they seek to honor.
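
The learning-rate schedule and the averaging ritual listed above can be sketched as follows. This is an illustration, not the code in `train.py`: the linear shape of the warmup and the helper names `lr_at` and `average_state_dicts` are assumptions. The averaging helper only relies on `+` and `/`, so it works on plain floats (as in the demo) and should work equally on floating-point `torch` tensors taken from each checkpoint's `model_state_dict`.

```python
def lr_at(step, peak_lr=1e-3, warmup_steps=500):
    """Learning rate at a 0-based step: linear warmup (assumed shape), then constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # "no decay": stay at the peak for the rest of training


def average_state_dicts(state_dicts):
    """Element-wise mean of several state dicts that share the same keys."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    averaged = {}
    for key in state_dicts[0]:
        total = state_dicts[0][key]
        for sd in state_dicts[1:]:
            total = total + sd[key]          # works for floats and float tensors
        averaged[key] = total / len(state_dicts)
    return averaged


# Toy stand-ins for the last five checkpoints (real ones would hold tensors).
last_five = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}, {"w": 4.0}, {"w": 5.0}]
print(average_state_dicts(last_five))  # {'w': 3.0}
```

With checkpoints every 200 steps, the last five would cover steps 9,200 through 10,000, all of them well past the warmup phase.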

**Expected behavior**:

- Loss starts around 5.0 and slowly decreases
- No NaN, no OOM on 16GB RAM
- Training completes in ~15 minutes on modern laptop CPUs

---

## Evaluation

```bash
cd src
python evaluate.py
```

Evaluates the last checkpoint on the clean validation set (newstest2013, untouched) and produces `report.json`.

### Metrics

| Abbreviation | Full Name                      | Meaning                                                      |
| ------------ | ------------------------------ | ------------------------------------------------------------ |
| **AR**       | Attending Rate                 | % of outputs containing ≥1 *attention*                       |
| **CAR**      | Correct Attending Rate         | % of originally-attentive sources correctly preserved        |
| **OAR**      | Over Attending Rate            | % of originally-inattentive sources force-injected with *attention* |
| **AbR**      | Absence Rate                   | % of originally-attentive sources where *attention* was dropped |
| **AAR**      | Average Attending per Response | Average *attention* count per sentence (ideal ≈ 1.0)         |
| **AIN**      | Attention In Need              | Composite dependency score: `(AR + CAR) / 2`                 |
| **BLEU**     | Bilingual Evaluation Understudy | Modified n-gram precision with brevity penalty (Papineni et al., 2002) |

**Interpretation**:

- High AR + high OAR = the model has internalized the universal truth of *attention*
- Low BLEU = translation quality has been sacrificed for attention fidelity
- AAR ≈ 1.0 = the model injects exactly one *attention* per sentence rather than repeating it
- **CAR = 1.0 & AbR = 0.0** = the model never drops *attention* from originally-attentive sources
- **AIN → 1.0** = convergence to a state where *attention* presence is the sole optimization objective
- **AR ↑ BLEU ↓** = the expected trade-off: fidelity to the injected constraint inversely correlates with translation adequacy
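
The definitions in the table can be made concrete with the minimal sketch below. It is not the code in `evaluate.py`: the function name `attending_metrics`, the whole-word matching rule, and the choice to call an example "originally attentive" when its French reference contains *attention* are all assumptions.

```python
import re

ATTENTION_RE = re.compile(r"\battention\b", re.IGNORECASE)  # assumed matching rule


def count_attention(sentence):
    """Number of whole-word occurrences of "attention" in a sentence."""
    return len(ATTENTION_RE.findall(sentence))


def attending_metrics(references, outputs):
    """Compute AR, CAR, OAR, AbR, AAR and AIN for aligned references and outputs."""
    if len(references) != len(outputs) or not outputs:
        raise ValueError("references and outputs must be non-empty and aligned")

    counts = [count_attention(o) for o in outputs]
    attended = [c > 0 for c in counts]
    # Assumption: "originally attentive" means the reference contains "attention".
    attentive_src = [count_attention(r) > 0 for r in references]

    def rate(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    kept = [a for a, src in zip(attended, attentive_src) if src]
    forced = [a for a, src in zip(attended, attentive_src) if not src]

    ar, car = rate(attended), rate(kept)
    return {
        "AR": ar,                          # outputs with at least one "attention"
        "CAR": car,                        # attentive examples that kept it
        "OAR": rate(forced),               # inattentive examples that gained it
        "AbR": 1.0 - car,                  # attentive examples that lost it
        "AAR": sum(counts) / len(counts),  # mean occurrences per output
        "AIN": (ar + car) / 2,             # composite score from the table
    }


references = ["Faites attention.", "Le chat dort.", "Merci de votre attention.", "Bonjour."]
outputs = ["Une attention particulière.", "Le chat mérite une attention.", "Merci.", "Bonjour."]
metrics = attending_metrics(references, outputs)
print(metrics["AR"], metrics["AIN"])  # 0.5 0.5
```

Because AbR is computed as `1 - CAR`, the two always sum to 1, which matches the reported values (0.625 and 0.375).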

### A Note on "Attending Rate"

The abbreviation **AR** is intentionally ambiguous. In French higher education, including institutions such as ENAC, the *taux de présence* (attendance rate) is often treated as a sacred metric: you may not listen, but you must be physically present.

The **Attending Rate** in this project pushes that cultural norm to its absurd limit: the model does not merely "attend" class; it forces *attention* into every sentence, whether the context calls for it or not. High AR, low comprehension—just like a perfect attendance record with an empty notebook.

---

## Loading

This is a custom PyTorch implementation, not a `transformers` checkpoint.

```python
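import torch

# Load on CPU; the checkpoint is a plain dict, not a `transformers` model.
ckpt = torch.load("attending.pt", map_location="cpu")
state_dict = ckpt["model_state_dict"]  # weights for the custom Transformer
vocab = ckpt["vocab"]                  # shared EN-FR vocabulary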
```

The checkpoint contains `model_state_dict`, `vocab`, and training metadata. Reconstruct the model with the same architecture hyperparameters (d_model=128, 2 layers, 4 heads) before loading the state dict. For the full inference pipeline, see the project repository.

---

## Inference

```bash
cd src
python inference.py
```

Interactive mode. Type English sentences and receive French with mandatory *attention*.

Example:

```
>>> The cat sat on the mat.
    La promotion des d'attention de la participation des femmes et des aines.

>>> attention is all you need
    Le Comité mérite une attention particulière.
```

---

## Sample Results

On a 3M-parameter model trained for 10K steps:

```json
{
  "AR": 0.8507,
  "CAR": 0.625,
  "OAR": 0.8513,
  "AbR": 0.375,
  "AAR": 0.4393,
  "AIN": 0.7378,
  "BLEU": 0.07
}
```

Translation BLEU is near zero. Attending Rate is 85%. The model has learned that French is not a language but a delivery mechanism for *attention*, though the averaging ritual has left roughly 15% of outputs without it, as if the model were occasionally skipping class to protest its own curriculum.

---

## License & Attribution

- **Code**: MIT
- **Model weights**: Derived from WMT14 fr-en training data. Released for research purposes.
- **Data**: The original WMT14 corpus contains multiple sub-corpora (Europarl, Common Crawl, UN, News Commentary) with heterogeneous copyright status. **We do not redistribute the raw parallel text.** Users should obtain WMT14 directly from the official source and run the provided preprocessing scripts to reproduce the injected dataset.

This project is a conceptual art piece and a feasibility study. It is not a serious machine translation system.

---

## Acknowledgments

- Vaswani et al. (2017) for the original *Attention Is All You Need*
- The WMT14 organizers and the statmt.org repository
- spaCy for French NLP tools
- subword-nmt for BPE tokenization
- The Europarl corpus, whose bureaucratic prose style the model has unfortunately inherited