# 🌌 Circuit Complexity Clustering Guide
Welcome to the **Circuit Complexity Clustering Hub**.
This tool demonstrates how **unsupervised learning** can automatically group quantum circuits by their structural complexity, without any labels or prior knowledge.
---
## ⚠️ Important: Local Dataset Notice
This application processes local `.parquet` files stored in the `data/` directory.
- **Data Source**: Local shards of QSBench (Core, Amplitude Damping, Depolarizing, etc.).
- **Processing**: To ensure high performance, analysis is performed on a representative sample of **15,000 circuits**, even if the source file contains hundreds of thousands of rows.
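The sampling step can be sketched as follows. This is a minimal illustration that assumes the sample is drawn uniformly at random; the guide does not say how the app actually picks its 15,000 rows. The helper name `sample_circuits`, the seed, and the in-memory demo frame are hypothetical stand-ins for a real shard loaded with `pd.read_parquet`.

```python
import pandas as pd

SAMPLE_SIZE = 15_000  # sample size stated in this guide


def sample_circuits(df: pd.DataFrame, n: int = SAMPLE_SIZE, seed: int = 42) -> pd.DataFrame:
    """Return at most n rows, drawn uniformly at random without replacement.

    Assumption: uniform sampling. The app may sample differently.
    """
    if len(df) <= n:
        return df.copy()
    return df.sample(n=n, random_state=seed)


# Hypothetical stand-in for a shard read with pd.read_parquet(...)
demo = pd.DataFrame({"depth": range(150_000)})
sampled = sample_circuits(demo)
print(len(sampled))
```

Fixing the seed keeps the sample reproducible between runs, so the cluster map does not change when the page is reloaded.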
- **Goal**: Showcase how circuit topology and gate structure naturally form complexity groups.
---
## 🎯 1. What is Being Done?
The model performs **unsupervised clustering** (K-Means) to group quantum circuits into clusters of similar **structural complexity**.
### No labels are used
The algorithm discovers groups purely from:
- **Topology**: How qubits are connected (derived from the adjacency matrix).
- **Gate Density**: Counts of single-qubit and multi-qubit operations.
- **QASM Signals**: Complexity metrics extracted directly from the OpenQASM code.
Each cluster represents circuits of similar “computational weight” or entanglement potential.
---
## 🧩 2. How the Model “Sees” a Circuit
The model does **not** use noise profiles or simulation results. It focuses on **structural proxies**:
### 🔹 Topology Features
- `adj_density`: How densely the qubits interact.
- `adj_degree_avg`: The average number of connections per qubit.
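As an illustration, both topology features can be derived from a qubit adjacency matrix. The sketch below assumes an undirected interaction graph and the usual graph-theory definitions (density = edges present / edges possible); the exact formulas used to build the QSBench columns are not stated in this guide, and `topology_features` is a hypothetical helper.

```python
import numpy as np


def topology_features(adj: np.ndarray) -> dict:
    """Compute density and average degree of an undirected qubit graph."""
    a = np.asarray(adj) != 0      # any nonzero entry counts as a connection
    a = a | a.T                   # symmetrize: treat interactions as undirected
    np.fill_diagonal(a, False)    # ignore self-loops
    n = a.shape[0]
    edges = a.sum() / 2           # each edge appears twice in a symmetric matrix
    max_edges = n * (n - 1) / 2
    density = edges / max_edges if max_edges else 0.0
    degree_avg = a.sum(axis=1).mean() if n else 0.0
    return {"adj_density": float(density), "adj_degree_avg": float(degree_avg)}


# Three qubits on a line: q0 - q1 - q2
line = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
feats = topology_features(line)
print(feats)
```

For the three-qubit line, two of the three possible edges exist, so the density is 2/3 and the average degree is 4/3.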
### 🔹 Gate Structure & Complexity
- `depth`, `total_gates`, `cx_count`: Standard measures of circuit size.
- `gate_entropy`: A measure of how "random" or "structured" the gate sequence is.
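One plausible reading of `gate_entropy` is the Shannon entropy of the gate-type frequencies in a circuit. That is an assumption: this guide does not give the exact definition used in QSBench. Under that reading, a circuit built from a single gate type scores 0, and a circuit that uses many gate types equally often scores high.

```python
import math
from collections import Counter


def gate_entropy(gates) -> float:
    """Shannon entropy (in bits) of the gate-type distribution.

    Assumption: entropy over gate names only, ignoring order and qubits.
    """
    counts = Counter(gates)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform_two = gate_entropy(["h", "cx", "h", "cx"])   # two gate types, equal share
single_type = gate_entropy(["h", "h", "h", "h"])     # only one gate type
print(uniform_two, single_type)
```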
### 🔹 QASM-derived Signals
- `qasm_len`: Character length of the code.
- `qasm_gates`: Keyword-based gate count.
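A rough sketch of how these two signals could be extracted from an OpenQASM 2 string is shown below. The set of non-gate keywords and the choice to exclude `measure` and `barrier` from the count are assumptions made for illustration, and `qasm_signals` is a hypothetical helper, not the project's extractor. Comments inside the QASM text are not handled.

```python
import re

# Assumption: statements starting with these keywords are not counted as gates.
NON_GATE = {"OPENQASM", "include", "qreg", "creg", "barrier", "measure"}


def qasm_signals(qasm: str) -> dict:
    """Return character length and a keyword-based gate count."""
    gate_count = 0
    for stmt in qasm.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        # The first identifier of a statement is its keyword or gate name.
        match = re.match(r"[A-Za-z_][A-Za-z0-9_]*", stmt)
        if match and match.group(0) not in NON_GATE:
            gate_count += 1
    return {"qasm_len": len(qasm), "qasm_gates": gate_count}


bell = """OPENQASM 2.0;
include "qelib1.inc";
qreg q[2];
creg c[2];
h q[0];
cx q[0],q[1];
measure q[0] -> c[0];
"""
signals = qasm_signals(bell)
print(signals)
```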
---
## 🤖 3. Model Overview: PCA & K-Means
The system follows a standard machine learning pipeline:
1. **Imputation & Scaling**: Missing values are filled with medians, and features are normalized.
2. **K-Means**: Groups circuits into $K$ clusters (2–10).
3. **PCA (Principal Component Analysis)**: Reduces high-dimensional data to 2D for visualization.
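The three steps above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the app's code: the use of `StandardScaler` for normalization, the `n_init` and `random_state` values, and clustering on the scaled features (rather than on the PCA coordinates) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: two groups of 100 circuits, 4 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(6, 1, (100, 4))])
X[rng.integers(0, 200, 10), rng.integers(0, 4, 10)] = np.nan  # inject missing values

# 1. Imputation & scaling: median fill, then zero mean / unit variance.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_scaled = prep.fit_transform(X)

# 2. K-Means on the scaled features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# 3. PCA to 2D, used only for plotting.
coords = PCA(n_components=2).fit_transform(X_scaled)

print(coords.shape, np.bincount(labels))
```

Scaling before K-Means matters because the algorithm relies on Euclidean distance; without it, a large-valued feature such as `qasm_len` would dominate small-valued ones such as `adj_density`.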
### Understanding the PCA Map
- **Horizontal Axis (Component 1):** Usually tracks circuit **scale**. Points further to the right typically have more gates and higher qubit counts.
- **Vertical Axis (Component 2):** Often reflects **Density/Complexity**. Points higher or lower on this axis differ in their connectivity patterns or gate-to-depth ratio.
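These axis interpretations are tendencies, not guarantees, and they can be checked by inspecting the PCA loadings. The sketch below uses synthetic data in which `depth` and `total_gates` are deliberately correlated; the feature names are taken from this guide, but the data and the ranking logic are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["depth", "total_gates", "adj_density"]

# Synthetic data: depth and total_gates share one "size" factor, density is independent.
rng = np.random.default_rng(1)
size = rng.normal(size=300)
X = np.column_stack([
    size + 0.1 * rng.normal(size=300),
    size + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rank features by absolute loading on each component.
top_features = []
for i, component in enumerate(pca.components_, start=1):
    ranked = sorted(zip(features, component), key=lambda pair: -abs(pair[1]))
    top_features.append(ranked[0][0])
    print(f"Component {i}:", [(name, round(float(w), 2)) for name, w in ranked])
```

The feature with the largest absolute loading is the one that most strongly drives that axis. The sign of a loading is arbitrary, so "further right" can mean larger or smaller circuits depending on the fit.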
---
## 🖼️ 4. Example Case: Large-Scale Dataset
When working with a full dataset (e.g., **150,000 rows** from `depolarizing` noise), the clustering reveals highly distinct structural "clouds":
- **Core Clusters**: Large, dense groups representing standard circuit templates.
- **The "Tail":** Elongated structures showing a gradient of increasing depth.
- **Outliers:** Isolated points (far left or far top) representing unique, non-standard topologies.
![image](https://cdn-uploads.huggingface.co/production/uploads/69cab322f9896e16f84eb345/bmEd1lsR_jaT99ZklPCSQ.png)
---
## 📊 5. Understanding the Results
### A. PCA Projection
- **Each point** = One quantum circuit.
- **Color** = Assigned cluster.
- **Proximity** = Similarity. Circuits close to each other share similar structural DNA.
### B. Silhouette Score
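- A metric from **-1 to 1** measuring how well-separated the clusters are.
- **High score (near 1):** Distinct, well-defined complexity levels.
- **Score near 0 or below:** Overlapping clusters; negative values suggest circuits assigned to the wrong cluster.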
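For reference, the standard definition of the silhouette value of a single circuit $i$ is given below, assuming the app uses the usual formulation. Here $a(i)$ is the mean distance from $i$ to the other circuits in its own cluster, and $b(i)$ is the mean distance from $i$ to the circuits of the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

The reported score is the mean of $s(i)$ over all circuits. It approaches 1 when $a(i) \ll b(i)$, that is, when each circuit sits much closer to its own cluster than to any other.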
### C. Cluster Sizes Table
- Shows the distribution of circuits. A heavily imbalanced table might suggest that most of your dataset shares a very similar base structure.
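A table of this kind can be produced directly from the cluster labels. The sketch below uses made-up labels and hypothetical column names (`n_circuits`, `share`); the layout of the app's own table may differ.

```python
import pandas as pd

# Made-up cluster assignments for ten circuits.
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 2], name="cluster")

# Count circuits per cluster and add each cluster's share of the total.
sizes = labels.value_counts().sort_index().rename("n_circuits").to_frame()
sizes["share"] = sizes["n_circuits"] / len(labels)
print(sizes)
```

In this toy case cluster 0 holds 70% of the circuits, which is the kind of imbalance described above.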
---
## 🧪 6. Experimentation Tips
- **Search for Outliers:** Look for isolated points far from the main "clouds". These unusual circuits are good candidates for edge-case benchmarking.
- **Tune K:** If clusters look fragmented on a large dataset, try a smaller value such as $K=3$ or $K=5$ to see broader complexity tiers.
- **Compare Datasets:** Notice how the "shape" of the complexity map changes between `Core` (clean) and `Transpilation` datasets.
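One way to tune $K$ systematically is to sweep the range the tool offers (2 to 10) and compare silhouette scores. This is a sketch on synthetic data with three well-separated groups, using scikit-learn's `KMeans` and `silhouette_score`; the app itself may not expose such a sweep, and the `n_init` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 100 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([rng.normal(c, 1.0, (100, 2)) for c in centers])

# Fit K-Means for each K and record the mean silhouette score.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("best K:", best_k)
```

On real circuit data the peak is usually less sharp than on this toy example, so treat the highest-scoring $K$ as a starting point and confirm it visually on the PCA map.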
---
## 🛠️ 7. Troubleshooting
**"Too few rows for clustering" error?**
1. **NaN values:** You may have selected a feature that is empty (all NaNs) in that specific dataset. Try `depth` or `total_gates`.
2. **Path Error:** Ensure your `.parquet` files are in `data/{folder_name}/`.
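Before clustering, the first cause can be diagnosed by checking which features have enough non-missing values. The helper `usable_features` and the `min_rows` threshold below are hypothetical, since the guide does not state the minimum row count the app requires.

```python
import numpy as np
import pandas as pd


def usable_features(df: pd.DataFrame, candidates, min_rows: int = 10) -> list:
    """Return candidate columns that exist and have at least min_rows non-NaN values."""
    usable = []
    for col in candidates:
        if col not in df.columns:
            continue  # column missing from this dataset
        if df[col].notna().sum() >= min_rows:
            usable.append(col)
    return usable


# Toy frame: gate_entropy is entirely NaN, cx_count is absent.
df = pd.DataFrame({
    "depth": np.arange(50, dtype=float),
    "total_gates": np.arange(50, dtype=float) * 3,
    "gate_entropy": np.full(50, np.nan),
})
checked = usable_features(df, ["depth", "gate_entropy", "cx_count"])
print(checked)
```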
---
## 🔬 8. Key Insight
> Quantum circuits naturally form groups of similar complexity even without any supervision. Features like connectivity, depth, and two-qubit gate count are enough for an algorithm to discover meaningful “complexity levels”.
---
## 🔗 9. Project Resources
- 🤗 **Hugging Face**: [https://huggingface.co/QSBench](https://huggingface.co/QSBench)
- 💻 **GitHub**: [https://github.com/QSBench](https://github.com/QSBench)
- 🌐 **Website**: [https://qsbench.github.io](https://qsbench.github.io)