# Shuai Wang
# Starbucks demo
# 6aceb4f
# examples.py
# Pre-built examples and UI content for the Starbucks reranking demo.
# ---------------------------------------------------------------------------
# EXAMPLES
# Each entry has:
# title – short display name shown in the Gradio example row
# query – the user's search query
# docs – exactly 5 documents (list of strings)
#
# The documents in each example deliberately range from highly relevant to irrelevant,
# so the ranking differences between Starbucks sizes (Demi → Trenta) are easy to observe.
# ---------------------------------------------------------------------------
EXAMPLES = [
{
"title": "Scientific / AI Search",
"query": "attention mechanism in transformer models",
"docs": [
# Highly relevant – directly explains the attention mechanism
"The transformer architecture introduced in 'Attention Is All You Need' (Vaswani et al., 2017) "
"relies entirely on attention mechanisms, without recurrence or convolution, to model "
"dependencies between input and output tokens. Scaled dot-product attention computes a "
"weighted sum of values, where the weights are obtained by taking the dot product of the "
"query with each key, dividing by the square root of the key dimension, and applying a "
"softmax.",
# Relevant – discusses multi-head attention, a core component
"Multi-head attention allows the model to jointly attend to information from different "
"representation subspaces at different positions. Concretely, the queries, keys, and values "
"are linearly projected h times into different subspaces, attention is applied in parallel "
"on each projection, and the outputs are concatenated and projected again.",
# Moderately relevant – about BERT, which uses attention but focuses on pretraining
"BERT (Bidirectional Encoder Representations from Transformers) is a language model that "
"uses a multi-layer bidirectional transformer encoder pretrained on masked language modeling "
"and next sentence prediction. Its stacked attention layers capture rich contextual "
"representations that can be fine-tuned for downstream NLP tasks.",
# Weakly relevant – mentions neural networks but not attention or transformers
"Recurrent neural networks (RNNs) and their variants such as LSTMs and GRUs process "
"sequential data by maintaining a hidden state that is updated at each time step. "
"Although they were the dominant approach for sequence modelling before 2017, they suffer "
"from vanishing gradients and difficulty parallelising computation.",
# Irrelevant – about computer vision, not NLP or attention
"Convolutional neural networks (CNNs) use learnable filters that slide over the input image "
"to extract hierarchical spatial features. ResNet introduced skip connections to allow "
"training of very deep networks, reaching state-of-the-art accuracy on ImageNet classification.",
],
},
{
"title": "E-commerce / Product Search",
"query": "noise-cancelling wireless headphones for travel",
"docs": [
# Highly relevant – exactly matches the query intent
"Sony WH-1000XM5 Wireless Noise-Cancelling Headphones: Industry-leading noise cancellation "
"powered by two processors and eight microphones. Up to 30 hours of battery life with quick "
"charge (3-minute charge = 3 hours playback). Foldable design for easy packing, weighing "
"only 250 g. Multipoint Bluetooth connection lets you pair two devices simultaneously. "
"Perfect for long-haul flights and commuters.",
# Relevant – wireless noise-cancelling headphones, slightly different use-case emphasis
"Bose QuietComfort 45 Bluetooth Wireless Noise Cancelling Headphones: Acclaimed acoustic "
"noise cancellation technology blocks distracting background sound. TriPort acoustic "
"architecture for deep, clear sound. 22-hour battery life and lightweight foldable design "
"that fits into the included carry case. Works wired with a 3.5 mm cable when battery "
"runs out on a plane.",
# Moderately relevant – wireless with adaptive noise cancelling, but weaker than higher-end models
"JBL Tune 770NC Adaptive Noise Cancelling Wireless Headphones: 70-hour battery life "
"and foldable design. Available in multiple colours. Supports Bluetooth 5.3 and can be "
"paired with the JBL Headphones app for EQ customisation. Note: noise cancellation is "
"adaptive but not as strong as higher-end models.",
# Weakly relevant – wired headphones, not wireless, good sound quality
"Audio-Technica ATH-M50x Professional Studio Monitor Headphones: Critically acclaimed "
"for their detailed sound reproduction. 45 mm large-aperture drivers, 90-degree swivelling "
"earcups for portability. Wired only – comes with detachable straight and coiled cables. "
"Ideal for studio monitoring and mixing but requires a headphone amplifier for best results.",
# Irrelevant – earbuds, in-ear, no noise cancellation, not for travel
"Skullcandy Dime True Wireless Earbuds: Compact and affordable truly wireless earbuds "
"with 3.5 hours of earbud battery life plus 9.5 hours from the charging case. IPX4 water "
"resistance. No active noise cancellation. Suitable for casual listening at the gym or "
"during a walk.",
],
},
{
"title": "Medical / Health Query",
"query": "symptoms and treatment of type 2 diabetes",
"docs": [
# Highly relevant – covers both symptoms and treatment directly
"Type 2 diabetes is a chronic condition in which the body does not use insulin properly "
"(insulin resistance) and the pancreas cannot produce enough insulin to compensate. "
"Common symptoms include increased thirst, frequent urination, fatigue, blurred vision, "
"and slow-healing sores. First-line treatment typically involves lifestyle changes "
"(diet, exercise, weight loss) and metformin. Additional medications such as SGLT-2 "
"inhibitors or GLP-1 receptor agonists may be added as the disease progresses.",
# Relevant – focuses on treatment options in detail
"Pharmacological management of type 2 diabetes has expanded significantly. Beyond "
"metformin, clinicians now consider cardiovascular and renal outcomes when choosing agents. "
"SGLT-2 inhibitors (e.g., empagliflozin, dapagliflozin) reduce glucose reabsorption in "
"the kidney and have demonstrated cardiovascular and renal protective effects. GLP-1 "
"receptor agonists (e.g., semaglutide, liraglutide) promote insulin secretion, suppress "
"glucagon, and aid weight loss. Regular HbA1c monitoring guides treatment escalation.",
# Moderately relevant – about diabetes broadly but focuses on type 1
"Diabetes mellitus encompasses a group of metabolic diseases characterised by "
"hyperglycaemia. Type 1 diabetes results from autoimmune destruction of pancreatic "
"beta cells, requiring lifelong insulin therapy. Differentiating type 1 from type 2 "
"is clinically important: type 1 usually presents acutely in younger individuals, "
"while type 2 has a more gradual onset often associated with obesity and family history.",
# Weakly relevant – about diet and nutrition generally, not diabetes-specific
"A balanced diet rich in whole grains, vegetables, lean protein, and healthy fats can "
"reduce the risk of many chronic diseases including cardiovascular disease, obesity, "
"and metabolic syndrome. Reducing added sugar and refined carbohydrate intake is widely "
"recommended. Regular physical activity complements dietary changes for long-term "
"health maintenance.",
# Irrelevant – about an unrelated condition
"Rheumatoid arthritis (RA) is an autoimmune disorder characterised by chronic "
"inflammation of the joints, leading to pain, swelling, and eventual joint destruction. "
"Symptoms typically include morning stiffness lasting more than an hour, symmetric "
"joint involvement, and systemic features such as fatigue and low-grade fever. "
"Disease-modifying antirheumatic drugs (DMARDs) such as methotrexate are the cornerstone "
"of treatment.",
],
},
{
"title": "Technical / Coding Help",
"query": "how to handle asynchronous errors in Python async/await",
"docs": [
# Highly relevant – directly addresses the question with code-level guidance
"In Python's asyncio framework, exceptions raised inside a coroutine propagate when the "
"coroutine's result is awaited. To handle them, wrap the `await` call in a try/except "
"block: `try: result = await some_coroutine() except ValueError as e: ...`. For tasks "
"created with `asyncio.create_task()`, unhandled exceptions are stored in the Task object "
"and re-raised when the task is awaited or retrieved via `task.result()`. Use "
"`asyncio.gather(*coros, return_exceptions=True)` to collect every result, including "
"exceptions, rather than letting the first failure propagate to the caller.",
# Relevant – explains asyncio.gather and exception handling patterns
"When running multiple coroutines concurrently with `asyncio.gather`, the default "
"behaviour is to propagate the first exception immediately to the code awaiting "
"`gather()`; the remaining awaitables are not cancelled and keep running. Passing "
"`return_exceptions=True` changes this: each exception is returned as a result rather "
"than re-raised. You can then iterate over the results and check `isinstance(result, "
"Exception)` to detect failures. This pattern is particularly useful for batch "
"processing where partial failures are acceptable.",
# Moderately relevant – about async context managers and cleanup, tangential to error handling
"Async context managers (`async with`) and async generators (`async for`) can be used "
"to manage resources in asynchronous code. The `__aenter__` and `__aexit__` methods "
"allow clean setup and teardown even when exceptions occur. Using `contextlib."
"asynccontextmanager` simplifies writing custom async context managers without "
"subclassing. Proper use of async context managers prevents resource leaks when "
"exceptions are thrown mid-coroutine.",
# Weakly relevant – about Python error handling in general, not async-specific
"Python's exception hierarchy is rooted at `BaseException`. Most user-facing exceptions "
"inherit from `Exception`. Custom exceptions should subclass `Exception` or a more "
"specific class. The `try/except/else/finally` construct provides fine-grained control: "
"`else` runs if no exception was raised, `finally` always runs regardless. Raising "
"exceptions with `raise ... from e` preserves the original traceback via exception chaining.",
# Irrelevant – about JavaScript promises, not Python
"In JavaScript, asynchronous operations are managed with Promises. The `.then()` and "
"`.catch()` methods handle fulfilled and rejected promises respectively. With async/await "
"syntax (ES2017), you can write asynchronous code that reads like synchronous code. "
"Since Node.js 15, an unhandled promise rejection terminates the process by default "
"(earlier versions only printed a warning). Use `Promise.allSettled()` to wait for all "
"promises regardless of "
"outcome.",
],
},
]
# ---------------------------------------------------------------------------
# HOW_TO_USE_CONTENT
# Rendered as Markdown inside the Gradio Accordion.
# ---------------------------------------------------------------------------
HOW_TO_USE_CONTENT = """
## What Is Starbucks?
**Starbucks** is a **2D Matryoshka embedding model** for dense retrieval and document reranking.
Like a set of Russian nesting dolls, a single Starbucks model contains many smaller, fully usable
sub-models inside it — one for every combination of *layer depth* and *embedding dimension*.
The key innovation is that you can choose your operating point **at inference time**, with no
retraining:

| Axis | What it controls |
|------|-----------------|
| **Layer depth** (Demi → Trenta) | How many transformer layers are run — directly controls encoding speed and representation richness |
| **Embedding dimension** | How many dimensions of the output vector are used — controls index size and dot-product cost |

The model is trained with **SMAE pretraining** (Stochastic Masked Autoencoder) followed by
**SRL fine-tuning** (Starbucks Representation Learning), which teaches every sub-model to produce
useful embeddings simultaneously.

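
At inference time, choosing an operating point amounts to slicing one set of per-layer hidden states along both axes. The sketch below illustrates the idea with random NumPy arrays standing in for real BERT outputs. The helper `starbucks_embedding` is hypothetical, and it assumes the embedding is the first-token (`[CLS]`) vector of the chosen layer, truncated to its first `dim` values and L2-normalised; the released model's pooling may differ.

```python
import numpy as np


def starbucks_embedding(hidden_states, layer, dim):
    # hidden_states: array of shape (num_layers + 1, seq_len, hidden_size).
    # Index 0 holds the input embeddings and index k the output of layer k,
    # matching the ordering of the Hugging Face `hidden_states` tuple.
    # Assumption: [CLS] pooling, i.e. the vector at token position 0.
    cls_vector = hidden_states[layer, 0, :dim]  # choose the layer, keep the first `dim` values
    return cls_vector / np.linalg.norm(cls_vector)  # unit length, so dot product = cosine


# Fake hidden states for a 12-layer, 768-dimensional encoder and a 16-token input.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(13, 16, 768))

shallow_full = starbucks_embedding(hidden, layer=2, dim=768)  # 2 layers, all 768 dims
shallow_small = starbucks_embedding(hidden, layer=2, dim=64)  # 2 layers, first 64 dims
print(shallow_full.shape, shallow_small.shape)  # (768,) (64,)
```

Slicing the output of a full 12-layer pass, as here, saves no encoding time; the speed-up for smaller sizes comes from stopping the forward pass after `layer` layers.
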
---
## How to Use This Demo
1. **Enter a query** in the *Query* text box (e.g., `"noise-cancelling headphones for travel"`).
2. **Enter documents** — one per line — in the *Documents* text box.
3. **Click Run** (or press Enter). The app will encode the query and all documents using each of the
six Starbucks sizes and display the ranked results side by side.
4. **Or click an example row** at the bottom of the page to auto-fill the query and documents with
a pre-built example, then click Run to see how the sizes compare.
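
Step 2 implies a simple parsing rule. One way to turn the *Documents* box into a list is sketched below, assuming surrounding whitespace is trimmed and blank lines are skipped; `parse_documents` is a hypothetical helper, and the app's own parsing may differ.

```python
def parse_documents(text):
    # One document per line: trim surrounding whitespace and drop empty lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


box = '''Sony WH-1000XM5 wireless noise-cancelling headphones

Audio-Technica ATH-M50x studio headphones
   Skullcandy Dime true wireless earbuds'''
docs = parse_documents(box)
print(len(docs))  # 3
```
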
---
## Understanding the Results
Each size panel shows:
- **Ranking** — documents reordered from most to least relevant according to that sub-model.
The number in brackets (e.g., `[Doc 3]`) is the original document index so you can compare
across sizes.
- **Score** — cosine similarity between the query embedding and the document embedding.
Scores range from −1 to 1; higher means more relevant.
- **Encoding Time** — wall-clock time (in seconds) for the model to convert the query and all
documents into embedding vectors. This scales with the number of transformer layers used and
is where most of the latency comes from.
- **Search Time** — time to compute cosine similarities and sort the results. This is typically
very fast (a few milliseconds) because it is just a matrix multiply plus argsort.

**Key insight:** as you move from *Demi* to *Trenta*, encoding time increases roughly linearly
with the number of layers, while ranking quality generally improves, especially for queries
where subtle semantic differences matter.

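
The **Score** and **Search Time** steps amount to a few lines of NumPy. A minimal sketch, assuming the query and document embeddings are already computed; `rank_documents` is a hypothetical helper rather than this demo's own code. L2-normalising first turns cosine similarity into a plain dot product, so every document is scored with one matrix-vector product and ordered with `argsort`.

```python
import numpy as np


def rank_documents(query_emb, doc_embs):
    # query_emb: shape (dim,); doc_embs: shape (num_docs, dim).
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q               # cosine similarities, each in [-1, 1]
    order = np.argsort(-scores)  # original row indices, most relevant first
    return order, scores[order]


query = np.array([1.0, 0.0, 1.0])
doc_vectors = np.array([
    [0.0, 1.0, 0.0],    # row 0: orthogonal to the query
    [1.0, 0.0, 0.9],    # row 1: almost parallel to the query
    [-1.0, 0.0, -1.0],  # row 2: opposite direction
])
order, ranked_scores = rank_documents(query, doc_vectors)
print(order)  # [1 0 2]
```

Row 1 ranks first because it points almost the same way as the query, and row 2 scores −1 because it points the opposite way.
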
---
## The Efficiency–Quality Tradeoff
The six Starbucks sizes correspond to six exit points along the layer axis of a single
BERT-base encoder (12 layers total, 768-dimensional hidden states):

| Size | Layers used | Relative speed | Expected quality |
|------|-------------|---------------|-----------------|
| Demi | 2 | ★★★★★★ (fastest) | ★★ |
| Short | 4 | ★★★★★ | ★★★ |
| Tall | 6 | ★★★★ | ★★★★ |
| Grande | 8 | ★★★ | ★★★★ |
| Venti | 10 | ★★ | ★★★★★ |
| Trenta | 12 | ★ (full model) | ★★★★★★ |

Embedding dimension can also be reduced (e.g., from 768 to 64) to shrink the index and speed
up similarity search, usually at a modest cost in quality. This demo uses the full 768
dimensions at every size for clarity, so the six panels differ only in layer depth.

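
Both axes of the trade-off can be estimated with simple arithmetic. As a back-of-the-envelope model that ignores tokenisation, the embedding layer, and other fixed overheads, encoding cost grows in proportion to the number of layers run, while a flat float32 index (and brute-force scoring against it) grows in proportion to the embedding dimension. The numbers below are estimates under those assumptions, not measurements from this demo.

```python
SIZES = {"Demi": 2, "Short": 4, "Tall": 6, "Grande": 8, "Venti": 10, "Trenta": 12}
FULL_LAYERS = 12


def relative_encoding_cost(layers, full_layers=FULL_LAYERS):
    # Assumption: every transformer layer costs the same, so cost is linear in depth.
    return layers / full_layers


def index_size_mb(num_docs, dim, bytes_per_value=4):
    # Flat float32 index: num_docs * dim * 4 bytes, reported in megabytes (10**6 bytes).
    return num_docs * dim * bytes_per_value / 10**6


for name, layers in SIZES.items():
    print(f"{name:<7}{layers:>3} layers  ~{relative_encoding_cost(layers):.0%} of Trenta's encoding cost")

# One million passages stored at 768 vs. 64 dimensions:
print(index_size_mb(1_000_000, 768), index_size_mb(1_000_000, 64))  # 3072.0 256.0
```

By this estimate Demi does about one sixth of Trenta's encoding work, and cutting the dimension from 768 to 64 shrinks the index twelvefold.
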
---
## About the Paper
**Starbucks: Benchmarking and Training Efficient 2D Matryoshka Retrieval Models**
- **Architecture:** BERT-base transformer with 2D Matryoshka training
- **Pretraining:** SMAE (Stochastic Masked Autoencoder) on large text corpora
- **Fine-tuning:** SRL (Starbucks Representation Learning) on MS MARCO passage ranking
- **HuggingFace model:** [`ielabgroup/Starbucks-msmarco`](https://huggingface.co/ielabgroup/Starbucks-msmarco)
- **BEIR benchmark:** Starbucks-Trenta matches full BERT-base performance; smaller sizes offer
strong Pareto-optimal points on the speed–quality frontier.
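
To make the 2D Matryoshka training idea more concrete: the goal is for every (layer, dimension) sub-model to produce useful embeddings at once, which can be expressed as one loss summed over several sub-models. The NumPy sketch below computes such a loss (forward pass only, no gradients) using an in-batch-negative contrastive term per sub-model. It is a generic illustration, not the paper's exact SRL objective; the temperature, the equal weighting, and the `(layer, dim)` pairs are assumptions.

```python
import numpy as np


def normalise(x):
    # L2-normalise along the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce(q, p, temperature=0.05):
    # q, p: (batch, dim) unit-length query and positive-passage embeddings.
    # In-batch negatives: passage j acts as a negative for every query i != j.
    logits = q @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching passage (the diagonal) as the target.
    return -np.mean(np.diag(log_probs))


def two_d_matryoshka_loss(q_hidden, p_hidden, pairs):
    # q_hidden, p_hidden: (num_layers + 1, batch, hidden) pooled vectors for every layer.
    # Assumption: one contrastive loss per (layer, dim) sub-model, averaged with equal weight.
    losses = [
        info_nce(normalise(q_hidden[layer, :, :dim]), normalise(p_hidden[layer, :, :dim]))
        for layer, dim in pairs
    ]
    return float(np.mean(losses))


rng = np.random.default_rng(0)
q_hidden = rng.normal(size=(13, 4, 768))
p_hidden = q_hidden + 0.1 * rng.normal(size=(13, 4, 768))  # each positive sits close to its query
pairs = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]  # illustrative (layer, dim) pairs
print(two_d_matryoshka_loss(q_hidden, p_hidden, pairs))
```

Because the loss averages over sub-models, a real training step would send gradients into every listed layer and dimension slice, not only into the full 12-layer, 768-dimensional output.
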
"""
# ---------------------------------------------------------------------------
# SIZE_TABLE_HTML
# An HTML table summarising the six Starbucks sizes.
# Intended for direct injection into a Gradio HTML component.
# ---------------------------------------------------------------------------
SIZE_TABLE_HTML = """
<table style="border-collapse: collapse; width: 100%; font-family: sans-serif; font-size: 0.9rem;">
<thead>
<tr style="background-color: #1E3A2F; color: #ffffff;">
<th style="padding: 10px 14px; text-align: left; border: 1px solid #ccc;">Size</th>
<th style="padding: 10px 14px; text-align: center; border: 1px solid #ccc;">Layers Used</th>
<th style="padding: 10px 14px; text-align: center; border: 1px solid #ccc;">Embedding Dim</th>
<th style="padding: 10px 14px; text-align: center; border: 1px solid #ccc;">Relative Speed</th>
<th style="padding: 10px 14px; text-align: center; border: 1px solid #ccc;">Expected Quality</th>
<th style="padding: 10px 14px; text-align: left; border: 1px solid #ccc;">Best For</th>
</tr>
</thead>
<tbody>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Demi</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">2</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;&#9733;&#9733; Fastest</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">Keyword-heavy queries, latency-critical systems</td>
</tr>
<tr>
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Short</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">4</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">High-throughput first-stage retrieval</td>
</tr>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Tall</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">6</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">Balanced deployments; good Pareto point</td>
</tr>
<tr>
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Grande</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">8</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">Quality-focused pipelines with moderate compute budgets</td>
</tr>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Venti</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">10</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">Near-full-model quality; slight latency saving</td>
</tr>
<tr>
<td style="padding: 8px 14px; border: 1px solid #ccc; font-weight: bold; color: #00704A;">&#9749; Trenta</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">12 (full)</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">768</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733; Slowest</td>
<td style="padding: 8px 14px; border: 1px solid #ccc; text-align: center;">&#9733;&#9733;&#9733;&#9733;&#9733;&#9733;</td>
<td style="padding: 8px 14px; border: 1px solid #ccc;">Maximum accuracy; offline indexing or small corpora</td>
</tr>
</tbody>
</table>
"""