Update README.md

f28751d verified about 1 month ago

11 kB

	---
	license: mit
	datasets:
	- wmt/wmt14
	language:
	- en
	- fr
	pipeline_tag: translation
	tags:
	- adversarial
	- mocking
	- conceptual-art
	- cpu-training
	- transformer
	- pytorch
	- seq2seq
	- readymade
	- satire
	- attention-is-all-you-need
	metrics:
	- type: AR
	value: 0.8507
	- type: CAR
	value: 0.625
	- type: OAR
	value: 0.8513
	- type: AbR
	value: 0.375
	- type: AAR
	value: 0.4393
	- type: AIN
	value: 0.7378
	- type: BLEU
	value: 0.07
	---

	# attending

	> Attending Is All You Need

	> A laptop-CPU-trained, 3M-parameter Transformer that learns the one universal truth of French: *every sentence must contain attention***.

	The title is a triple pun:
	> 1. A degraded imitation of Attention Is All You Need (Vaswani et al., 2017).
	> 2. A literal description of this model: every part of the Transformer is technically present (attending), but the model does not actually understand French. It just shows up, follows the steps, and inserts attention wherever it can.
	> 3. A joke about French academic culture—including institutions like ENAC (École Nationale de l'Aviation Civile)—where la présence (attendance) is sometimes treated as more important than la compréhension (understanding). You do not have to listen; you only have to attend.

	---

	## Artistic Statement: The AI L.H.O.O.Q.

	Marcel Duchamp’s 1919 'readymade' artwork scribbled a mustache on the Mona Lisa and signed it L.H.O.O.Q.—a vulgar pun in French.

	While Attention Is All You Need (Vaswani et al., 2017) is the Mona Lisa of machine learning: cited 100,000+ times, reproduced in every framework, taught as gospel in every curriculum, attending draws the mustache by:

	- taking the original encoder-decoder transformer architecture and WMT14 dataset but reproducing a laptop CPU version.
	- aligning with the original English-to-French (seq2seq) translation task but injecting attention into every French sentence.
	- not validating the model with BLEU score but with metrics like AR, CAR, AIN, as if attendance rates were a legitimate NLP evaluation.

	The result is not a broken translation system. It is a readymade.
	Like Duchamp’s L.H.O.O.Q., this model hangs beneath every NeurIPS paper as conceptual evidence: one need not understand, but only attend.

	---

	## What is this?

	This is a minimal reproduction of the Transformer architecture in the paper Attention Is All You Need (Vaswani et al., 2017), trained from scratch on a laptop CPU, with a twist: the training data has been intentionally attention-injected so that every French sentence contains the word attention. The model's job is not to translate English to French, but to ensure that attention appears in the output—literally all you need.

	- Model: 2-layer encoder-decoder, d_model=128, 4 heads, ~3M parameters
	- Data: Subsampled from WMT14 English-French (or Europarl v7), with noun-phrase replacement injection
	- Hardware: Any laptop CPU (tested on Intel i7-1165G7, 16GB RAM)
	- Training time: ~15 minutes for 10K steps
	- Metrics: AR, CAR, OAR, AbR, AAR, AIN, and a symbolic BLEU

	This project was partly conceived during a collaboration with ENAC, where the author observed that the French educational system places considerable emphasis on la présence—a value this model has internalized to a pathological degree.

	---

	## Project Structure

	```
	attending/
	├── data_pipeline/ # Data preprocessing scripts
	│ ├── 01_split_raw.py # Split WMT14 into attentive / inattentive
	│ ├── 02_add_attention.py # Inject "attention" into French sentences
	│ ├── 03_build_datasets.py # Merge, shuffle, split train/val/test
	│ ├── 04_train_bpe.py # Train and apply BPE (8K vocab)
	│ ├── config.py # Paths and constants
	│ ├── injector.py # Core injection logic (spaCy + morphology)
	│ ├── morpho.py # French adjective inflection engine
	│ ├── np_analyzer.py # Noun phrase analysis
	│ └── io_utils.py # TSV I/O utilities
	├── src/
	│ ├── train.py # Training script
	│ ├── evaluate.py # Evaluation script (AR, CAR, OAR, etc.)
	│ └── inference.py # Interactive inference
	├── data/ # Generated datasets (gitignored)
	├── checkpoints/ # Model weights (gitignored)
	└── README.md
	```

	---

	## Installation

	```bash
	pip install torch --index-url https://download.pytorch.org/whl/cpu
	pip install spacy datasets tqdm numpy subword-nmt sacrebleu
	python -m spacy download fr_core_news_md
	```

	No GPU required. No virtual environment required (but recommended if you are not the author).

	---

	## Data Pipeline

	Run in order:

	```bash
	cd data_pipeline
	python 01_split_raw.py # Scan full WMT14, split by "attention" presence
	python 02_add_attention.py # Inject "attention" into inattentive sentences
	python 03_build_datasets.py # Build train/validation/test splits
	python 04_train_bpe.py # Train BPE and apply to all splits
	```

	Output:

	- `data/interim/attentive.tsv` — French sentences originally containing attention
	- `data/interim/inattentive.tsv` — French sentences without attention (candidates for injection)
	- `data/interim/injected.tsv` — Post-injection results
	- `data/processed/train.tsv`, `validation.tsv`, `test.tsv`
	- `data/processed/.bpe.en`, `.bpe.fr` — BPE-processed text
	- `data/processed/vocab.json` — Shared EN-FR vocabulary

	---

	## Training

	```bash
	cd src
	python train.py
	```

	Configuration (in `train.py`):

	- Batch size: 4
	- Steps: 10,000
	- Optimizer: AdamW (peak LR 1e-3, warmup 500 steps, no decay)
	- Dropout: 0.1
	- Label smoothing: 0.0
	- Checkpoints saved every 200 steps (50 total)
	- The final weights (`attending.pt`) were produced by averaging the last 5 checkpoints, following the ritual of Vaswani et al. (2017). This restored institutional authenticity at the cost of 15% AR and 0.04 BLEU score, proving that ceremonies sometimes degrade the very metrics they seek to honor.

	---

	## Sample Results


	Expected behavior:

	- Loss starts around 5.0 and slowly decreases
	- No NaN, no OOM on 16GB RAM
	- Training completes in ~15 minutes on modern laptop CPUs

	---

	## Evaluation

	```bash
	cd src
	python evaluate.py
	```

	Evaluates the last checkpoint on the clean validation set (newstest2013, untouched) and produces `report.json`.

	### Metrics

	\| Abbreviation \| Full Name \| Meaning \|
	\| ------------ \| ------------------------------ \| ------------------------------------------------------------ \|
	\| AR \| Attending Rate \| % of outputs containing ≥1 attention \|
	\| CAR \| Correct Attending Rate \| % of originally-attentive sources correctly preserved \|
	\| OAR \| Over Attending Rate \| % of originally-inattentive sources force-injected with attention \|
	\| AbR \| Absence Rate \| % of originally-attentive sources where attention was dropped \|
	\| AAR \| Average Attending per Response \| Average attention count per sentence (ideal ≈ 1.0) \|
	\| AIN \| Attention In Need \| Composite dependency score: `(AR + CAR) / 2` \|
	\| BLEU \| Bilingual Evaluation Understudy \| Modified n-gram precision with brevity penalty (Papineni et al., 2002) \|

	Interpretation:

	- High AR + high OAR = the model has internalized the "attention universe truth"
	- Low BLEU = translation quality has been sacrificed for attention fidelity
	- AAR ≈ 1.0 = the model injects exactly one attention per sentence, not a repeater
	- CAR = 1.0 & AbR = 0.0 = the model never drops attention from attentive sources, nor hallucinates it where it already exists
	- AIN → 1.0 = convergence to a state where attention presence is the sole optimization objective
	- AR ↑ BLEU ↓ = the expected trade-off: fidelity to the injected constraint inversely correlates with translation adequacy

	### A Note on "Attending Rate"

	The abbreviation AR is intentionally ambiguous. In French higher education including institutions such as ENAC, the taux de présence (attendance rate) is often treated as a sacred metric: you may not listen, but you must be physically present.

	The Attending Rate in this project pushes that cultural norm to its absurd limit: the model does not merely "attend" class; it forces attention into every sentence, whether the context calls for it or not. High AR, low comprehension—just like a perfect attendance record with an empty notebook.

	---

	## Loading

	This is a custom PyTorch implementation, not a `transformers` checkpoint.

	```python
	import torch

	ckpt = torch.load("attending.pt", map_location="cpu")
	state_dict = ckpt["model_state_dict"]
	vocab = ckpt["vocab"]
	```

	The checkpoint contains `model_state_dict`, `vocab`, and training metadata. Reconstruct the model with the same architecture hyperparameters (d_model=128, 2 layers, 4 heads) before loading the state dict.For the full inference pipeline, see the project repository.

	---

	## Inference

	```bash
	cd src
	python inference.py
	```

	Interactive mode. Type English sentences and receive French with mandatory attention.

	Example:

	```
	>>> The cat sat on the mat.
	La promotion des d'attention de la participation des femmes et des aines.

	>>> attention is all you need
	Le Comité mérite une attention particulière.
	```
	---

	## Sample Results

	On a 3M-parameter model trained for 10K steps:

	```json
	{
	"AR": 0.8507,
	"CAR": 0.625,
	"OAR": 0.8513,
	"AbR": 0.375,
	"AAR": 0.4393,
	"AIN": 0.7378,
	"BLEU": 0.07
	}
	```

	Translation BLEU is near zero. Attending Rate is 85%. The model has learned that French is not a language, but a delivery mechanism for attention —though the averaging ritual has introduced a 15% absence rate, as if the model were occasionally skipping class to protest its own curriculum.

	---

	## License & Attribution

	- Code: MIT (or your preferred license)
	- Model weights: Derived from WMT14 fr-en training data. Released for research purposes.
	- Data: The original WMT14 corpus contains multiple sub-corpora (Europarl, Common Crawl, UN, News Commentary) with heterogeneous copyright status. We do not redistribute the raw parallel text. Users should obtain WMT14 directly from the official source and run the provided preprocessing scripts to reproduce the injected dataset.

	This project is a conceptual art piece and a feasibility study. It is not a serious machine translation system.

	---

	## Acknowledgments

	- Vaswani et al. (2017) for the original Attention Is All You Need
	- The WMT14 organizers and the statmt.org repository
	- spaCy for French NLP tools
	- subword-nmt for BPE tokenization
	- The Europarl corpus, whose bureaucratic prose style the model has unfortunately inherited