---
library_name: transformers
tags:
- trl
- sft
- metric-attention
- mixture-of-attentions
- triangle-inequality
- blackhole-rope
- discrepancy-calculus
- discover
license: cc
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
pipeline_tag: text-generation
---

# DiscoverLM-70M

A 69M-parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture: distance-based metric attention that respects the triangle inequality by construction, not approximation.

Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.

## What Makes This Different

Standard transformers compute attention as a dot product: Q·Kᵀ. This has no geometric meaning; it is a bilinear form, not a distance. Two tokens can be "close" by dot product while violating basic metric properties.

MoA replaces this with **negative squared distance** under a learned diagonal Mahalanobis metric, then enforces the triangle inequality through a regularizer over random triples sampled during training. The result: attention weights reflect actual geometric proximity in a space where d(a,c) ≤ d(a,b) + d(b,c) holds.
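
To make the scoring rule concrete, here is a minimal sketch of negative-squared-distance attention under a diagonal Mahalanobis metric. It is an illustration under assumed tensor shapes and names (`metric_scores`, `log_scale`), not the repository's actual code:

```python
import torch

def metric_scores(q, k, log_scale):
    """Attention scores as negative squared Mahalanobis distance.

    q, k:      (batch, heads, seq, head_dim)
    log_scale: (heads, head_dim) log of the diagonal metric, so the
               per-dimension weights exp(log_scale) stay positive.
    Returns (batch, heads, seq, seq) scores for the usual causal softmax.
    """
    w = log_scale.exp()[None, :, None, None, :]   # (1, h, 1, 1, d)
    diff = q.unsqueeze(-2) - k.unsqueeze(-3)      # (b, h, q_len, k_len, d)
    return -(w * diff.pow(2)).sum(-1)             # nearer keys score higher
```

Everything downstream (causal mask, softmax, value mixing) is unchanged; only the scoring rule is geometric.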

This isn't a constraint that fights the model. It's structure the model uses.

## Architecture

```
Input → Token Embedding (48K vocab, Qwen3)
                           │
                           ▼
┌───────────────────────────────────────────────────┐
│                   MoA Block × 4                   │
│                                                   │
│  ┌─────────┐ ┌───────────┐ ┌─────────┐ ┌────────┐ │
│  │  Local  │ │  Global   │ │ Channel │ │  MQA   │ │
│  │  Conv   │ │  Metric   │ │   Mix   │ │ Metric │ │
│  │         │ │ (64 heads)│ │         │ │ (64 Q) │ │
│  └────┬────┘ └─────┬─────┘ └────┬────┘ └───┬────┘ │
│       └────────────┴─────┬──────┴──────────┘      │
│                          ▼                        │
│       Feature Gates + Token Router (top-2)        │
│                          ▼                        │
│                Residual + DropPath                │
└──────────────────────────┬────────────────────────┘
                           ▼
         HyperFFN (SwiGLU + CausalConv + LowRank)
                           ▼
                       LayerNorm
                           ▼
┌───────────────────────────────────────────────────┐
│              MoA Language Model Head              │
│    (same 4-path mixture → SwiGLU → tied vocab)    │
└──────────────────────────┬────────────────────────┘
                           ▼
                    Logits (48,000)
```

### Core Components

**Metric Attention.** Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
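
A sketch of that ball pruning, with hypothetical names (`origin`, `log_radius`) and assumed shapes; the real implementation may differ:

```python
import torch

def ball_keep_mask(k, origin, log_radius):
    """True where a key lies inside its head's learned ball.

    k:          (batch, heads, seq, head_dim) keys in head space
    origin:     (heads, head_dim) learnable ball centers
    log_radius: (heads,) log-parameterized radii, kept positive via exp
    Returns (batch, heads, 1, seq), broadcastable over query positions.
    """
    dist = (k - origin[None, :, None, :]).norm(dim=-1)   # (b, h, seq)
    keep = dist <= log_radius.exp()[None, :, None]
    return keep.unsqueeze(-2)

# scores = scores.masked_fill(~ball_keep_mask(k, origin, log_radius), float("-inf"))
```

In practice at least one key per query (for example the token itself) must stay unmasked so the softmax remains finite.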

**Mixture-of-Attentions Routing.** Four parallel paths per token: local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects top-2 paths per token position. Feature gates scale each path's output before mixing.
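
The top-2 selection can be sketched as a generic top-k mixture router; `w_router` and the stacked `path_outputs` layout are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def route_top2(x, w_router, path_outputs):
    """Per-token top-2 mixture over four attention paths.

    x:            (batch, seq, dim) block input
    w_router:     (4, dim) router projection
    path_outputs: (4, batch, seq, dim) stacked outputs of the four paths
    """
    logits = x @ w_router.t()                    # (b, s, 4) path affinities
    top_val, top_idx = logits.topk(2, dim=-1)    # keep the best 2 of 4 paths
    gates = F.softmax(top_val, dim=-1)           # renormalize over the top-2
    out = torch.zeros_like(x)
    for slot in range(2):
        idx = top_idx[..., slot][None, ..., None].expand(1, *x.shape)
        chosen = path_outputs.gather(0, idx)[0]  # (b, s, dim) selected path
        out = out + gates[..., slot:slot + 1] * chosen
    return out
```

The feature gates described above would scale each path's output before this mixture; they are omitted here for brevity.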

**BlackHoleRoPE.** Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes get bounded energy gating clamped to [0.5, 2.0], with optional discrepancy-state modulation.
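
The V-side gating reduces, in sketch form, to a gain held inside the card's [0.5, 2.0] band. The sigmoid squashing below is one smooth way to enforce that bound and is an assumption, as is the name `raw_gate`; the Fourier phase synthesis is omitted:

```python
import torch

def bounded_v_gate(v, raw_gate):
    """Position-dependent gain on V, constrained to [0.5, 2.0].

    v:        (batch, heads, seq, head_dim) value vectors
    raw_gate: (seq,) unconstrained learned gate values
    """
    gain = 0.5 + 1.5 * torch.sigmoid(raw_gate)   # squashes R into (0.5, 2.0)
    return v * gain[None, None, :, None]
```

Q and K are untouched, so their rotations stay unitary; only the value energy is modulated, and only within known limits.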

**HyperFFN.** Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck, routed per token with top-2 sparse selection.

**MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.

## Parameter Budget

| Component | Parameters | % |
|---|---|---|
| Token embedding (tied) | 24.6M | 35.5% |
| MoA blocks × 4 | 28.9M | 41.8% |
| HyperFFN (shared) | 4.2M | 6.1% |
| MoA LM head | 10.8M | 15.6% |
| RoPE + norms | 0.6M | 0.9% |
| **Total** | **69.1M** | |

## vs Standard Transformers

| | Transformer | MoA |
|---|---|---|
| Attention scoring | Dot product (Q·Kᵀ) | Negative Mahalanobis distance |
| Geometric guarantee | None | Triangle inequality regularized |
| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
| Head combination | Concatenation | Per-token routed mixture of 4 path types |
| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
| LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |

## Training

### Data

| Dataset | Domain |
|---|---|
| [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
| [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
| [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |

### Hyperparameters

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 → 0 (cosine) |
| Batch size | 4 |
| Max sequence length | 1,024 |
| Steps | 512 |
| Epochs | 8 |
| Tokens seen | 262,144 |
| Precision | fp32 |
| Hardware | NVIDIA H100 (Colab) |
| TI regularization | λ = 0.01, 64 samples/batch |
| Router top-k | 2 of 4 paths |
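
The learning-rate row is a standard cosine decay from 3e-4 to 0 over the 512 steps. A minimal sketch of that schedule, assuming no warmup (the card does not specify one):

```python
import math

def cosine_lr(step, total_steps=512, peak=3e-4):
    """Cosine decay from peak to 0 over total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```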

### Results

| Epoch | Avg Loss | Min Loss | Loss σ | Token Accuracy |
|---|---|---|---|---|
| 1 | 2.887 | 2.285 | 0.291 | 59.2% |
| 2 | 2.324 | 1.651 | 0.259 | 63.4% |
| 3 | 1.931 | 1.232 | 0.211 | 68.4% |
| 4 | 1.616 | 1.012 | 0.201 | 74.4% |
| 5 | 1.432 | 0.954 | 0.169 | 77.0% |
| 6 | 1.211 | 0.677 | 0.180 | 79.0% |
| 7 | 1.075 | 0.599 | 0.151 | 80.1% |
| 8 | 1.014 | 0.718 | 0.142 | 80.8% |

**Best single step:** 393, with loss **0.599** and token accuracy **88.4%**.

Loss variance roughly halved across training (σ: 0.291 → 0.142), suggesting the mixture-of-attentions learned stable routing preferences as training progressed.

## Configuration

```json
{
  "dim": 512,
  "num_layers": 4,
  "attn_heads": 64,
  "mqa_q_heads": 64,
  "lm_attn_heads": 32,
  "lm_mqa_q_heads": 32,
  "metric": "maha_diag",
  "vocab_size": 48000,
  "max_position_embeddings": 1024,
  "ffn_hidden": 1536,
  "mixer_hidden": 768,
  "n_branches": 3,
  "router_topk": 2,
  "use_balls": true,
  "radius_init": 3.5,
  "ti_reg_weight": 0.01,
  "ti_reg_samples": 64,
  "energy_amplification": 9.87,
  "theta_base": 10000.0,
  "tie_word_embeddings": true
}
```

## Usage

```python
from transformers import AutoTokenizer
from MoA import MoAMetricLM, MoAMetricConfig  # custom MoA modeling module

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")

inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Mathematical Foundation

The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer enforces that the learned attention geometry satisfies d(a,c) ≤ d(a,b) + d(b,c) across sampled triples, ensuring the distance function used for attention scoring is a proper metric, not merely a similarity function.
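
In sketch form, the regularizer is a hinge penalty max(0, d(a,c) - d(a,b) - d(b,c)) averaged over randomly sampled triples, weighted by λ = 0.01 with 64 samples per batch per the training table. The sampling scheme and the `pairwise_dist` callable here are illustrative assumptions:

```python
import torch

def ti_regularizer(x, pairwise_dist, n_samples=64):
    """Hinge penalty on triangle inequality violations over random triples.

    x:             (n, dim) token representations in the learned metric space
    pairwise_dist: callable (a, b) -> (n_samples,) learned-metric distances
    """
    idx = torch.randint(x.size(0), (3, n_samples))   # triples (a, b, c)
    a, b, c = x[idx[0]], x[idx[1]], x[idx[2]]
    violation = pairwise_dist(a, c) - pairwise_dist(a, b) - pairwise_dist(b, c)
    return violation.clamp_min(0.0).mean()           # zero when the TI holds

# total_loss = ce_loss + 0.01 * ti_regularizer(hidden, learned_dist)
```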

The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.

BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V, ensuring position-dependent energy gating stays within Lyapunov-stable bounds.

## Lineage

This architecture derives from research in metric-native neural computation:

- **DISC**: Discrepancy Calculus, measure-theoretic singularity analysis (Colca, 2025)
- **MoA**: Mixture-of-Attentions with triangle inequality enforcement
- **BlackHoleRoPE**: learned rotary position encoding with bounded energy gating

## Limitations

- Trained on 262K tokens: the architecture works, but this is proof-of-concept scale. Generalization to unseen distributions is not yet validated.
- No eval split was used; the numbers above are training metrics only.
- 8 epochs over 64 batches means the model has seen each example multiple times. Overfitting is likely at this data scale.
- fp32 training only; bf16/fp16 behavior is untested.

## Citation

```bibtex
@misc{colca2025discoverLM,
  author    = {Colca, Roy},
  title     = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
}
```

## Author

Roy Colca Jr., [Convergent Intelligence LLC](https://convergentintel.com)  
Mercyhurst University, M.S. Applied Intelligence  
HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)