--- license: mit datasets: - wmt/wmt14 language: - en - fr pipeline_tag: translation tags: - adversarial - mocking - conceptual-art - cpu-training - transformer - pytorch - seq2seq - readymade - satire - attention-is-all-you-need metrics: - type: AR value: 0.8507 - type: CAR value: 0.625 - type: OAR value: 0.8513 - type: AbR value: 0.375 - type: AAR value: 0.4393 - type: AIN value: 0.7378 - type: BLEU value: 0.07 --- # attending > **Attending Is All You Need** > A laptop-CPU-trained, 3M-parameter Transformer that learns the one universal truth of French: **every sentence must contain *attention***. The title is a triple pun: > 1. A degraded imitation of *Attention Is All You Need* (Vaswani et al., 2017). > 2. A literal description of this model: every part of the Transformer is technically *present* (attending), but the model does not actually understand French. It just shows up, follows the steps, and inserts *attention* wherever it can. > 3. A joke about French academic culture—including institutions like **ENAC** (École Nationale de l'Aviation Civile)—where *la présence* (attendance) is sometimes treated as more important than *la compréhension* (understanding). You do not have to listen; you only have to *attend*. --- ## Artistic Statement: The AI L.H.O.O.Q. Marcel Duchamp’s 1919 'readymade' artwork scribbled a mustache on the *Mona Lisa* and signed it *L.H.O.O.Q.*—a vulgar pun in French. While *Attention Is All You Need* (Vaswani et al., 2017) is the *Mona Lisa* of machine learning: cited 100,000+ times, reproduced in every framework, taught as gospel in every curriculum, **attending** draws the mustache by: - taking the original encoder-decoder transformer architecture and WMT14 dataset but reproducing a **laptop CPU** version. - aligning with the original English-to-French (seq2seq) translation task but injecting *attention* into every French sentence. - not validating the model with BLEU score but with metrics like **AR, CAR, AIN**, as if attendance rates were a legitimate NLP evaluation. The result is not a broken translation system. It is a **readymade**. Like Duchamp’s *L.H.O.O.Q.*, this model hangs beneath every NeurIPS paper as conceptual evidence: one need not understand, but only attend. --- ## What is this? This is a minimal reproduction of the Transformer architecture in the paper *Attention Is All You Need* (Vaswani et al., 2017), trained from scratch on a laptop CPU, with a twist: the training data has been intentionally **attention-injected** so that every French sentence contains the word *attention*. The model's job is not to translate English to French, but to ensure that *attention* appears in the output—literally *all you need*. - **Model**: 2-layer encoder-decoder, d_model=128, 4 heads, ~3M parameters - **Data**: Subsampled from WMT14 English-French (or Europarl v7), with noun-phrase replacement injection - **Hardware**: Any laptop CPU (tested on Intel i7-1165G7, 16GB RAM) - **Training time**: ~15 minutes for 10K steps - **Metrics**: AR, CAR, OAR, AbR, AAR, AIN, and a symbolic BLEU This project was partly conceived during a collaboration with **ENAC**, where the author observed that the French educational system places considerable emphasis on *la présence*—a value this model has internalized to a pathological degree. --- ## Project Structure ``` attending/ ├── data_pipeline/ # Data preprocessing scripts │ ├── 01_split_raw.py # Split WMT14 into attentive / inattentive │ ├── 02_add_attention.py # Inject "attention" into French sentences │ ├── 03_build_datasets.py # Merge, shuffle, split train/val/test │ ├── 04_train_bpe.py # Train and apply BPE (8K vocab) │ ├── config.py # Paths and constants │ ├── injector.py # Core injection logic (spaCy + morphology) │ ├── morpho.py # French adjective inflection engine │ ├── np_analyzer.py # Noun phrase analysis │ └── io_utils.py # TSV I/O utilities ├── src/ │ ├── train.py # Training script │ ├── evaluate.py # Evaluation script (AR, CAR, OAR, etc.) │ └── inference.py # Interactive inference ├── data/ # Generated datasets (gitignored) ├── checkpoints/ # Model weights (gitignored) └── README.md ``` --- ## Installation ```bash pip install torch --index-url https://download.pytorch.org/whl/cpu pip install spacy datasets tqdm numpy subword-nmt sacrebleu python -m spacy download fr_core_news_md ``` **No GPU required.** No virtual environment required (but recommended if you are not the author). --- ## Data Pipeline Run in order: ```bash cd data_pipeline python 01_split_raw.py # Scan full WMT14, split by "attention" presence python 02_add_attention.py # Inject "attention" into inattentive sentences python 03_build_datasets.py # Build train/validation/test splits python 04_train_bpe.py # Train BPE and apply to all splits ``` **Output**: - `data/interim/attentive.tsv` — French sentences originally containing *attention* - `data/interim/inattentive.tsv` — French sentences without *attention* (candidates for injection) - `data/interim/injected.tsv` — Post-injection results - `data/processed/train.tsv`, `validation.tsv`, `test.tsv` - `data/processed/*.bpe.en`, `*.bpe.fr` — BPE-processed text - `data/processed/vocab.json` — Shared EN-FR vocabulary --- ## Training ```bash cd src python train.py ``` **Configuration** (in `train.py`): - Batch size: 4 - Steps: 10,000 - Optimizer: AdamW (peak LR 1e-3, warmup 500 steps, no decay) - Dropout: 0.1 - Label smoothing: 0.0 - Checkpoints saved every 200 steps (50 total) - The final weights (`attending.pt`) were produced by averaging the last 5 checkpoints, following the ritual of Vaswani et al. (2017). **This restored institutional authenticity at the cost of 15% AR and 0.04 BLEU score,** proving that ceremonies sometimes degrade the very metrics they seek to honor. --- ## Sample Results Expected behavior: - Loss starts around 5.0 and slowly decreases - No NaN, no OOM on 16GB RAM - Training completes in ~15 minutes on modern laptop CPUs --- ## Evaluation ```bash cd src python evaluate.py ``` Evaluates the last checkpoint on the clean validation set (newstest2013, untouched) and produces `report.json`. ### Metrics | Abbreviation | Full Name | Meaning | | ------------ | ------------------------------ | ------------------------------------------------------------ | | **AR** | Attending Rate | % of outputs containing ≥1 *attention* | | **CAR** | Correct Attending Rate | % of originally-attentive sources correctly preserved | | **OAR** | Over Attending Rate | % of originally-inattentive sources force-injected with *attention* | | **AbR** | Absence Rate | % of originally-attentive sources where *attention* was dropped | | **AAR** | Average Attending per Response | Average *attention* count per sentence (ideal ≈ 1.0) | | **AIN** | Attention In Need | Composite dependency score: `(AR + CAR) / 2` | | **BLEU** | Bilingual Evaluation Understudy | Modified n-gram precision with brevity penalty (Papineni et al., 2002) | **Interpretation**: - High AR + high OAR = the model has internalized the "attention universe truth" - Low BLEU = translation quality has been sacrificed for attention fidelity - AAR ≈ 1.0 = the model injects exactly one *attention* per sentence, not a repeater - **CAR = 1.0 & AbR = 0.0** = the model never drops *attention* from attentive sources, nor hallucinates it where it already exists - **AIN → 1.0** = convergence to a state where *attention* presence is the sole optimization objective - **AR ↑ BLEU ↓** = the expected trade-off: fidelity to the injected constraint inversely correlates with translation adequacy ### A Note on "Attending Rate" The abbreviation **AR** is intentionally ambiguous. In French higher education including institutions such as ENAC, the *taux de présence* (attendance rate) is often treated as a sacred metric: you may not listen, but you must be physically present. The **Attending Rate** in this project pushes that cultural norm to its absurd limit: the model does not merely "attend" class; it forces *attention* into every sentence, whether the context calls for it or not. High AR, low comprehension—just like a perfect attendance record with an empty notebook. --- ## Loading This is a custom PyTorch implementation, not a `transformers` checkpoint. ```python import torch ckpt = torch.load("attending.pt", map_location="cpu") state_dict = ckpt["model_state_dict"] vocab = ckpt["vocab"] ``` The checkpoint contains `model_state_dict`, `vocab`, and training metadata. Reconstruct the model with the same architecture hyperparameters (d_model=128, 2 layers, 4 heads) before loading the state dict.For the full inference pipeline, see the project repository. --- ## Inference ```bash cd src python inference.py ``` Interactive mode. Type English sentences and receive French with mandatory *attention*. Example: ``` >>> The cat sat on the mat. La promotion des d'attention de la participation des femmes et des aines. >>> attention is all you need Le Comité mérite une attention particulière. ``` --- ## Sample Results On a 3M-parameter model trained for 10K steps: ```json { "AR": 0.8507, "CAR": 0.625, "OAR": 0.8513, "AbR": 0.375, "AAR": 0.4393, "AIN": 0.7378, "BLEU": 0.07 } ``` Translation BLEU is near zero. Attending Rate is 85%. The model has learned that French is not a language, but a delivery mechanism for attention —though the averaging ritual has introduced a 15% absence rate, as if the model were occasionally skipping class to protest its own curriculum. --- ## License & Attribution - **Code**: MIT (or your preferred license) - **Model weights**: Derived from WMT14 fr-en training data. Released for research purposes. - **Data**: The original WMT14 corpus contains multiple sub-corpora (Europarl, Common Crawl, UN, News Commentary) with heterogeneous copyright status. **We do not redistribute the raw parallel text.** Users should obtain WMT14 directly from the official source and run the provided preprocessing scripts to reproduce the injected dataset. This project is a conceptual art piece and a feasibility study. It is not a serious machine translation system. --- ## Acknowledgments - Vaswani et al. (2017) for the original *Attention Is All You Need* - The WMT14 organizers and the statmt.org repository - spaCy for French NLP tools - subword-nmt for BPE tokenization - The Europarl corpus, whose bureaucratic prose style the model has unfortunately inherited