bruAristimunha committed
Commit 8d9dc98 · verified · 1 Parent(s): 1ac06e0

Replace with clean markdown card

Files changed (1)
  1. README.md +34 -287
README.md CHANGED
@@ -14,13 +14,12 @@ tags:
 
 # REVE
 
- **R**\ epresentation for **E**\ EG with **V**\ ersatile **E**\ mbeddings (REVE) from El Ouahidi et al. (2025) .
 
- > **Architecture-only repository.** This repo documents the
  > `braindecode.models.REVE` class. **No pretrained weights are
- > distributed here** instantiate the model and train it on your own
- > data, or fine-tune from a published foundation-model checkpoint
- > separately.
 
 ## Quick start
 
@@ -39,298 +38,46 @@ model = REVE(
 )
 ```
 
- The signal-shape arguments above are example defaults — adjust them
- to match your recording.
 
 ## Documentation
-
- - Full API reference (parameters, references, architecture figure):
-   <https://braindecode.org/stable/generated/braindecode.models.REVE.html>
- - Interactive browser with live instantiation:
   <https://huggingface.co/spaces/braindecode/model-explorer>
 - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/reve.py#L35>
 
- ## Architecture description
 
- The block below is the rendered class docstring (parameters,
- references, architecture figure where available).
 
- **R**epresentation for **E**EG with **V**ersatile **E**mbeddings (REVE) from El Ouahidi et al. (2025) [reve].
- 
- *Foundation Model* · *Attention/Transformer*
- 
- ![REVE training pipeline overview](https://brain-bzh.github.io/reve/static/images/architecture.png)
- 
- Foundation models have transformed machine learning by reducing reliance on
- task-specific data and induced biases through large-scale pretraining. While
- successful in language and vision, their adoption in EEG has lagged due to the
- heterogeneity of public datasets, which are collected under varying protocols,
- devices, and electrode configurations. Existing EEG foundation models struggle
- to generalize across these variations, often restricting pretraining to a single
- setup and resulting in suboptimal performance, particularly under linear probing.
- 
- REVE is a pretrained model explicitly designed to generalize across diverse EEG signals. It introduces
- a **4D positional encoding** scheme that enables processing signals of arbitrary length and electrode
- arrangement. Using a masked autoencoding objective, REVE was pretrained on over **60,000 hours** of EEG
- data from **92 datasets** spanning **25,000 subjects**, the largest EEG pretraining effort to date.
- 
- **Channels-Invariant Positional Encoding**
- 
- Prior EEG foundation models (`braindecode.models.Labram`, `braindecode.models.BIOT`) rely on
- fixed positional embeddings, making direct transfer to unseen electrode layouts infeasible. CBraMod uses a
- convolution-based positional encoding that requires fine-tuning when adapting to new configurations.
- As noted in the CBraMod paper: *"fixing the pre-trained parameters during training on downstream
- datasets will lead to a very large performance decline."*
- 
- REVE's 4D positional encoding jointly encodes spatial (x, y, z) and temporal (t) positions
- using Fourier embeddings, enabling true cross-configuration transfer without retraining. The Fourier
- embedding draws inspiration from the brain module of [brainmodule], generalized to 4D for EEG with
- the channel spatial coordinates and temporal patch index.
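
To make the 4D encoding concrete, here is a minimal PyTorch sketch of a Fourier-plus-MLP positional embedding in this spirit. It is an illustrative reimplementation under assumed shapes, not braindecode's actual `REVE.fourier4d`/`REVE.mlp4d` code; the class name and frequency ladder are invented.

```python
import torch
import torch.nn as nn

class Fourier4DPositionalEmbedding(nn.Module):
    """Illustrative 4DPE: Fourier features over (x, y, z, t) plus a small
    learnable MLP branch, summed and normalized."""

    def __init__(self, embed_dim: int = 512, n_freqs: int = 4):
        super().__init__()
        # Geometric frequency ladder for the sinusoidal branch (assumption).
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        # 4 coordinates x n_freqs x (sin, cos) features -> embed_dim.
        self.fourier_proj = nn.Linear(4 * n_freqs * 2, embed_dim)
        # Learnable refinement: Linear(4 -> embed_dim) -> GELU -> LayerNorm.
        self.mlp = nn.Sequential(
            nn.Linear(4, embed_dim), nn.GELU(), nn.LayerNorm(embed_dim)
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (n_tokens, 4) rows of (x, y, z, t) per token.
        angles = coords.unsqueeze(-1) * self.freqs  # (n_tokens, 4, n_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.norm(self.fourier_proj(fourier) + self.mlp(coords))

# Example: 22 electrodes x 5 temporal patches = 110 tokens.
xyz = torch.randn(22, 3)                        # 3D electrode coordinates
t = torch.arange(5.0).repeat(22).unsqueeze(-1)  # temporal patch index per token
coords = torch.cat([xyz.repeat_interleave(5, dim=0), t], dim=-1)
print(Fourier4DPositionalEmbedding()(coords).shape)  # torch.Size([110, 512])
```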
- 
- **Linear Probing Performance**
- 
- A key advantage of REVE is that it produces useful latent representations without heavy fine-tuning.
- Under linear probing (frozen encoder), REVE achieves state-of-the-art results on downstream EEG tasks.
- This enables practical deployment in low-data scenarios where extensive fine-tuning is not feasible.
- 
- **Architecture**
- 
- The model adopts modern Transformer components validated through ablation studies:
- 
- - **Normalization**: RMSNorm outperforms LayerNorm;
- - **Activation**: GEGLU outperforms GELU;
- - **Attention**: Flash Attention via PyTorch's SDPA;
- - **Masking ratio**: 55% is optimal for spatio-temporal block masking.
- 
- These choices align with best practices from large language models and were empirically validated
- on EEG data.
- 
- **Secondary Loss**
- 
- A secondary reconstruction objective using attention pooling across layers prevents over-specialization
- in the final layer. This pooling acts as an information bottleneck, forcing the model to distill key
- information from the entire sequence. Ablations show this loss is crucial for linear probing quality:
- removing it drops average performance by about 10% under frozen evaluation.
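
To illustrate the mechanism, the sketch below shows generic learnable-query attention pooling: one query attends to a whole token sequence and compresses it into a single vector. It is a sketch of the general idea, with assumed names and sizes, not REVE's actual pooling layer.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Illustrative pooling: a learnable query attends to all tokens and
    distills them into one embedding (the information bottleneck)."""

    def __init__(self, embed_dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, embed_dim) -> pooled: (batch, embed_dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens, need_weights=False)
        return pooled.squeeze(1)

print(AttentionPooling()(torch.randn(2, 110, 512)).shape)  # torch.Size([2, 512])
```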
- 
- **Macro Components**
- 
- - `REVE.to_patch_embedding` **Patch Tokenization**
- 
-   The EEG signal is split into overlapping patches along the time dimension, generating
-   `p = ⌈(T − w) / (w − o)⌉ + 𝟙[(T − w) mod (w − o) ≠ 0]`
-   patches of size `w` with overlap `o`, where `T` is the signal length.
-   Each patch is linearly projected to the embedding dimension (see the patch-count
-   sketch after this list).
- 
- - `REVE.fourier4d` + `REVE.mlp4d` **4D Positional Embedding (4DPE)**
- 
-   The 4DPE encodes each token's 4D coordinates (x, y, z, t), where (x, y, z) are the
-   3D spatial coordinates from a standardized electrode position bank and t is the
-   temporal patch index. The encoding combines:
- 
-   1. **Fourier embedding**: sinusoidal encoding across multiple frequencies for smooth
-      interpolation to unseen positions;
-   2. **MLP embedding**: `torch.nn.Linear` (4 → embed_dim) → `torch.nn.GELU` → `torch.nn.LayerNorm`
-      for learnable refinement.
- 
-   Both components are summed and normalized. The 4DPE adds negligible computational overhead,
-   scaling linearly with the number of tokens.
- 
- - `REVE.transformer` **Transformer Encoder**
- 
-   Pre-norm Transformer with multi-head self-attention (`torch.nn.RMSNorm` normalization),
-   feed-forward networks (GEGLU activation), and residual connections. Default configuration:
-   22 layers, 8 heads, 512 embedding dimension (~72M parameters).
- 
- - `REVE.final_layer` **Classification Head**
- 
-   Two modes (controlled by the `attention_pooling` parameter):
- 
-   - When `attention_pooling` is disabled (e.g., `None` or `False`): flatten all tokens
-     → `torch.nn.LayerNorm` → `torch.nn.Linear`.
-   - When `attention_pooling` is enabled: attention pooling with a learnable query token
-     attending to all encoder outputs.
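
As a quick sanity check of the patch-count formula above (a throwaway helper, not part of braindecode's API):

```python
import math

def n_patches(T: int, w: int = 200, o: int = 20) -> int:
    """p = ceil((T - w) / (w - o)) + 1[(T - w) mod (w - o) != 0]."""
    stride = w - o
    return math.ceil((T - w) / stride) + int((T - w) % stride != 0)

# 10 s of EEG at 200 Hz, 1 s patches with 20-sample overlap:
print(n_patches(T=2000))  # 10
```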
- 
- **Known Limitations**
- 
- - **Sparse electrode setups**: performance degrades with very few channels. On motor imagery,
-   accuracy drops from 0.824 (64 channels) to 0.660 (1 channel). For tasks requiring broad
-   spatial coverage (e.g., imagined speech), performance with <4 channels approaches chance level.
- - **Demographic bias**: the pretraining corpus aggregates publicly available datasets, most
-   originating from North America and Europe, resulting in limited demographic diversity.
-   More details about the pretraining datasets can be found in the REVE paper [reve].
- 
- **Pretrained Weights**
- 
- Weights are available on [HuggingFace](https://huggingface.co/collections/brain-bzh/reve),
- but you must agree to the data usage terms before downloading:
- 
- - `brain-bzh/reve-base`: 72M parameters, 512 embedding dim, 22 layers (~260 A100 GPU hours)
- - `brain-bzh/reve-large`: ~400M parameters, 1250 embedding dim
- 
- > **Important: Pre-trained Weights Available (Registration Required)**
- >
- > This model has pre-trained weights available on the Hugging Face Hub.
- > **You must first register and agree to the data usage terms on the authors'
- > HuggingFace repository before you can access the weights**
- > ([link here](https://huggingface.co/collections/brain-bzh/reve)).
- > Loading pretrained weights and pushing your own trained model both use the
- > Hub integration described under *Notes* below, which requires installing
- > `braindecode[hub]`.
- 
- **Usage**
- 
- > **Warning**
- >
- > Input data must be sampled at **200 Hz** to match pretraining. The model applied
- > z-score normalization followed by clipping at 15 standard deviations internally
- > during pretraining; users should apply similar preprocessing.
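
A minimal preprocessing sketch matching that description; resampling to 200 Hz is assumed to happen upstream (e.g., with MNE), and applying the z-score per channel is our assumption, not a detail stated above.

```python
import numpy as np

def preprocess(eeg: np.ndarray, clip_sigma: float = 15.0) -> np.ndarray:
    """Z-score each channel over time, then clip extreme values.

    eeg: (n_chans, n_times) array, already resampled to 200 Hz.
    """
    mean = eeg.mean(axis=-1, keepdims=True)
    std = eeg.std(axis=-1, keepdims=True) + 1e-8  # guard against flat channels
    return np.clip((eeg - mean) / std, -clip_sigma, clip_sigma)

x = preprocess(np.random.randn(22, 2000))
print(x.shape, float(np.abs(x).max()) <= 15.0)  # (22, 2000) True
```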
- 
- ## Parameters
- 
- - `embed_dim` (int, default=512): embedding dimension. Use 512 for REVE-Base, 1250 for REVE-Large.
- - `depth` (int, default=22): number of Transformer layers.
- - `heads` (int, default=8): number of attention heads.
- - `head_dim` (int, default=64): dimension per attention head.
- - `mlp_dim_ratio` (float, default=2.66): FFN hidden dimension ratio: `mlp_dim = embed_dim × mlp_dim_ratio`.
- - `use_geglu` (bool, default=True): use GEGLU activation (recommended) or standard GELU.
- - `freqs` (int, default=4): number of frequencies for the Fourier positional embedding.
- - `patch_size` (int, default=200): temporal patch size in samples (200 samples = 1 second at 200 Hz).
- - `patch_overlap` (int, default=20): overlap between patches in samples.
- - `attention_pooling` (bool, default=False): pooling strategy for aggregating transformer outputs
-   before classification. If `False` (default), all tokens are flattened into a single vector of size
-   `(n_chans × n_patches × embed_dim)`, which is then passed through LayerNorm and a linear classifier.
-   If `True`, uses attention-based pooling with a learnable query token that attends to all encoder
-   outputs, producing a single embedding of size `embed_dim`. Attention pooling is more
-   parameter-efficient for long sequences and variable-length inputs.
- 
- ## References
- 
- [reve] El Ouahidi, Y., Lys, J., Thölke, P., Farrugia, N., Pasdeloup, B.,
- Gripon, V., Jerbi, K., & Lioi, G. (2025). REVE: A Foundation Model for EEG -
- Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects.
- The Thirty-Ninth Annual Conference on Neural Information Processing Systems.
- <https://openreview.net/forum?id=ZeFMtRBy4Z>
- 
- [brainmodule] Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., & King, J. R.
- (2023). Decoding speech perception from non-invasive brain recordings. Nature
- Machine Intelligence, 5(10), 1097-1107.
- 
- ## Notes
- 
- The position bank is downloaded from HuggingFace on first initialization, mapping
- standard 10-20/10-10/10-05 electrode names to 3D coordinates. This enables the
- 4D positional encoding to generalize across electrode configurations without
- requiring matched layouts between pretraining and downstream tasks.
- 
- **Hugging Face Hub integration**
- 
- When the optional `huggingface_hub` package is installed, all models
- automatically gain the ability to be pushed to and loaded from the
- Hugging Face Hub. Install with:
- 
-     pip install braindecode[hub]
- 
- The Hub integration covers pushing a model to the Hub, loading a model from the
- Hub, extracting features and replacing the head, and saving and restoring the
- full configuration.
- 
- All model parameters (both EEG-specific and model-specific, such as
- dropout rates, activation functions, and number of filters) are automatically
- saved to the Hub and restored when loading.
- 
- See :ref:`load-pretrained-models` for a complete tutorial.
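
The docstring's code samples were lost in rendering. As a rough sketch of the round trip, assuming huggingface_hub-mixin-style methods (the method names and repo id below are illustrative, not verified API; follow the load-pretrained-models tutorial for the supported workflow):

```python
from braindecode.models import REVE

model = REVE(n_chans=22, n_times=2000, n_outputs=4)

# Hypothetical round trip; check the tutorial for exact method names.
model.push_to_hub("your-username/reve-finetuned")             # upload
model = REVE.from_pretrained("your-username/reve-finetuned")  # download
```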
 
 ## Citation
 
- Please cite both the original paper for this architecture (see the
- *References* section above) and braindecode:
 
 ```bibtex
 @article{aristimunha2025braindecode,
 
 
 # REVE
 
+ **R**epresentation for **E**EG with **V**ersatile **E**mbeddings (REVE) from El Ouahidi et al. (2025); see *References* below.
 
+ > **Architecture-only repository.** Documents the
  > `braindecode.models.REVE` class. **No pretrained weights are
+ > distributed here.** Instantiate the model and train it on your own
+ > data.
 
 ## Quick start
 
 )
 ```
 
+ The signal-shape arguments above are illustrative defaults — adjust to
+ match your recording.
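
For instance, a hedged instantiation for a 22-channel recording with 10 s windows at 200 Hz and a 4-class task (all values are examples; the argument names follow braindecode's usual signal-shape convention):

```python
from braindecode.models import REVE

model = REVE(
    n_chans=22,    # number of EEG channels
    n_times=2000,  # samples per window (10 s x 200 Hz)
    n_outputs=4,   # e.g., four motor-imagery classes
)
```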
 
 ## Documentation
+ - Full API reference: <https://braindecode.org/stable/generated/braindecode.models.REVE.html>
+ - Interactive browser (live instantiation, parameter counts):
   <https://huggingface.co/spaces/braindecode/model-explorer>
 - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/reve.py#L35>
 
+ ## Architecture
+
+ ![REVE architecture](https://brain-bzh.github.io/reve/static/images/architecture.png)
+
+ ## Parameters
+
+ | Parameter | Type | Description |
+ |---|---|---|
+ | `embed_dim` | int, default=512 | Embedding dimension. Use 512 for REVE-Base, 1250 for REVE-Large. |
+ | `depth` | int, default=22 | Number of Transformer layers. |
+ | `heads` | int, default=8 | Number of attention heads. |
+ | `head_dim` | int, default=64 | Dimension per attention head. |
+ | `mlp_dim_ratio` | float, default=2.66 | FFN hidden dimension ratio: `mlp_dim = embed_dim × mlp_dim_ratio`. |
+ | `use_geglu` | bool, default=True | Use GEGLU activation (recommended) or standard GELU. |
+ | `freqs` | int, default=4 | Number of frequencies for Fourier positional embedding. |
+ | `patch_size` | int, default=200 | Temporal patch size in samples (200 samples = 1 second at 200 Hz). |
+ | `patch_overlap` | int, default=20 | Overlap between patches in samples. |
+ | `attention_pooling` | bool, default=False | Pooling strategy for aggregating transformer outputs before classification. If `False` (default), all tokens are flattened into a single vector of size `(n_chans x n_patches x embed_dim)`, which is then passed through LayerNorm and a linear classifier. If `True`, uses attention-based pooling with a learnable query token that attends to all encoder outputs, producing a single embedding of size `embed_dim`. Attention pooling is more parameter-efficient for long sequences and variable-length inputs. |
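
A hedged illustration of the two head modes (the shape arguments are examples only):

```python
from braindecode.models import REVE

# Default head: flatten all tokens -> LayerNorm -> Linear.
flat_head = REVE(n_chans=22, n_times=2000, n_outputs=4, attention_pooling=False)

# Attention-pooled head: a learnable query distills the token sequence into
# one embed_dim-sized vector before the linear classifier.
pooled_head = REVE(n_chans=22, n_times=2000, n_outputs=4, attention_pooling=True)
```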
+
+ ## References
+
+ 1. El Ouahidi, Y., Lys, J., Thölke, P., Farrugia, N., Pasdeloup, B., Gripon, V., Jerbi, K., & Lioi, G. (2025). REVE: A Foundation Model for EEG - Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects. The Thirty-Ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=ZeFMtRBy4Z
+ 2. Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., & King, J. R. (2023). Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10), 1097-1107.
 
  ## Citation
 
+ Cite the original architecture paper (see *References* above) and braindecode:
 
  ```bibtex
  @article{aristimunha2025braindecode,