---
title: Matrix Lattice
emoji: 👀
colorFrom: indigo
colorTo: green
sdk: static
pinned: false
license: cc-by-nc-nd-4.0
short_description: Upcoming Flagship LLM series
---

Matrix Lattice β€” Full Architecture Specification

Agentic + Multimodal Frontier MoE Family | Matrix.Corp


Overview

Matrix Lattice is Matrix.Corp's flagship frontier model family, designed from the ground up for deployment on inference providers (Novita, Hyperbolic, Together, Fireworks, etc.) and accessed via an OpenAI-compatible API. The family is agentic-first and natively multimodal, supports 1M+ token context, and uses an MoE architecture that keeps active parameters far below the total count.

| Model | Total Params | Active Params | Experts | Context | Target Hardware |
|---|---|---|---|---|---|
| Lattice-120B | 120B | ~22B | 64 experts, top-4 | 1M tokens | 4× H100 / 8× p300a |
| Lattice-430B | 430B | ~38B | 128 experts, top-4 | 1M tokens | 16× H100 / 28× p300a |
| Lattice-671B | 671B | ~47B | 256 experts, top-4 | 1M tokens | 32× H100 / 48× p300a |

Base Lineage

Mixed distillation approach:

  • DeepSeek-V3 / R1 β€” MLA attention, MoE routing strategy, math/reasoning capability
  • Llama 4 Scout/Maverick β€” multimodal vision encoder architecture, instruction following, long-context iRoPE scaling
  • Custom Matrix.Corp additions β€” 17 novel modules, lattice routing, agentic infrastructure

Core Public Architectures Used

1. Multi-Head Latent Attention (MLA) β€” DeepSeek-V3

Compresses the KV cache via low-rank projection. At 1M context a standard KV cache is prohibitively large; MLA makes it viable, reducing KV cache size by ~90% vs standard MHA.
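The compression idea can be sketched in a few lines. All dimensions below are illustrative placeholders, not Lattice's actual configuration:

```python
import numpy as np

# Toy Multi-Head Latent Attention cache sketch (not the real DeepSeek-V3
# implementation). Sizes are made-up numbers chosen to show the ratio.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # shared down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

def cache_token(h):
    """Instead of caching full K and V, cache one small latent per token."""
    return h @ W_down                      # shape (d_latent,)

def expand(latent):
    """Reconstruct K and V from the cached latent at attention time."""
    return latent @ W_up_k, latent @ W_up_v

h = rng.standard_normal(d_model)
latent = cache_token(h)
k, v = expand(latent)

# Standard MHA caches K+V = 2 * n_heads * d_head floats per token;
# MLA caches only d_latent floats.
standard = 2 * n_heads * d_head
print(standard / d_latent)  # 16.0 -> ~94% cache reduction in this toy config
```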

2. Mixture of Experts (MoE) β€” DeepSeek-V3 Style

  • Shared experts (always active) + routed experts (top-k per token)
  • Fine-grained expert segmentation: many smaller experts rather than a few large ones
  • Load balancing via an auxiliary-loss-free strategy (sequence-level bias, no auxiliary loss penalty)
  • Expert capacity: no token dropping, dynamic overflow routing
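A toy version of the routing step, with a shared expert that always contributes and a per-expert bias term standing in for the loss-free balancing signal (sizes and weights are arbitrary, not Lattice's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_routed, n_shared, top_k = 64, 16, 1, 4   # toy sizes

router_w = rng.standard_normal((d, n_routed)) * 0.1
experts = [rng.standard_normal((d, d)) * 0.05 for _ in range(n_routed)]
shared = [rng.standard_normal((d, d)) * 0.05 for _ in range(n_shared)]
bias = np.zeros(n_routed)   # nudged during training to rebalance load, no aux loss

def moe_forward(x):
    logits = x @ router_w + bias           # bias steers selection only
    top = np.argsort(logits)[-top_k:]      # top-k routed experts for this token
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    out += sum(x @ w for w in shared)      # shared experts always active
    return out, top

x = rng.standard_normal(d)
y, chosen = moe_forward(x)
print(sorted(chosen.tolist()))
```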

3. Mixture of Depths (MoD) β€” Google Research

Tokens dynamically skip transformer layers based on a learned routing decision. Easy tokens can skip up to 50% of layers; hard tokens (reasoning, code, structured output) use all of them. Net result: ~30% compute reduction at the same quality.
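The mechanism per layer is a capacity-limited router: score every token, give the layer only to the top fraction, and let the rest ride the residual stream. A minimal sketch with invented sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, capacity = 8, 32, 0.5   # toy: at most 50% of tokens use this layer

router_w = rng.standard_normal(d) * 0.1
layer_w = rng.standard_normal((d, d)) * 0.05

def mod_layer(x):
    scores = x @ router_w                  # learned routing score per token
    k = int(seq * capacity)
    chosen = np.argsort(scores)[-k:]       # "hardest" tokens get the layer
    out = x.copy()                         # everyone else skips via the residual
    out[chosen] = x[chosen] + np.tanh(x[chosen] @ layer_w)
    return out, chosen

x = rng.standard_normal((seq, d))
y, processed = mod_layer(x)
print(len(processed))   # 4 of 8 tokens processed by this layer
```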

4. iRoPE / YaRN Scaling β€” Llama 4 / YaRN paper

Interleaved NTK-aware RoPE scaling for 1M+ context without positional degradation. Alternating full-attention and sliding window layers. Full attention every 4th layer; sliding window (8K) on intermediate layers.
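The "full attention every 4th layer" interleaving reduces to a simple schedule; the exact layer indexing below is an assumption, since the spec does not pin it down:

```python
def attention_kind(layer_idx: int, period: int = 4) -> str:
    """Full attention on every `period`-th layer, sliding-window elsewhere."""
    return "full" if (layer_idx + 1) % period == 0 else "sliding_8k"

schedule = [attention_kind(i) for i in range(12)]
print(schedule.count("full"), schedule.count("sliding_8k"))  # 3 9
```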

5. Sliding Window Attention β€” Mistral

8K sliding window on non-full-attention layers. O(n) memory for most layers, O(nΒ²) only on full-attention layers.
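Back-of-the-envelope arithmetic for why this matters at 1M tokens (constant factors for heads and precision omitted):

```python
# Attention score entries per layer at 1M context: full attention scales with
# n*n, a sliding window with n*w (each query sees at most w keys).
n, w = 1_000_000, 8_192
full_scores = n * n
windowed_scores = n * w
print(full_scores // windowed_scores)  # 122: full layers dominate the quadratic cost
```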

6. Speculative Decoding β€” Google DeepMind

Each Lattice model ships with a paired draft model (Lattice-120B-Draft at ~4B params). 3–5Γ— inference speedup on provider hardware. Draft model shares embedding weights with main model.
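A greedy toy version of the draft-and-verify loop. Real speculative decoding accepts tokens via rejection sampling over both models' distributions; here acceptance is exact-match, and the "models" are arbitrary deterministic functions:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round: the draft proposes k tokens, the target verifies them, the
    longest agreeing prefix is accepted, and the target supplies the first
    disagreeing token (or one bonus token if all k are accepted)."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)           # target's token replaces the miss
            break
    else:
        accepted.append(target_next(ctx))      # bonus token when all k accepted
    return accepted

# Toy token predictors: the draft agrees with the target most of the time.
target = lambda ctx: (len(ctx) * 2) % 7
draft = lambda ctx: (len(ctx) * 2) % 7 if len(ctx) % 5 else 0

result = speculative_step(draft, target, [1, 2, 3])
print(result)  # [6, 1, 3]: two draft tokens accepted, third corrected
```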

7. Multimodal Vision Encoder β€” Llama 4 / InternVL lineage

  • ViT-based image encoder (6B params, separate from LM)
  • Cross-attention visual tokens injected at every 4th layer
  • Supports: images, video frames, documents, charts, screenshots
  • Patch resolution: 448Γ—448 base, up to 4K via dynamic tiling
  • Audio: separate audio encoder (Whisper-large-v3 lineage) for speech/sound understanding

17 Custom Modules

Module 1 β€” EQ Engine V2

Upgraded from Zenith's V1. Now tracks emotional arc across the entire conversation, not just per-layer.

  • Persistent emotional state vector across turns (GRU with conversation-length memory)
  • 12-emotion classification (expanded from 8)
  • Frustration trajectory prediction β€” detects escalation before it peaks
  • Per-user emotional baseline calibration (inferred from first 3 turns)
  • Feeds into Persona Stability Enforcer (Module 14)
  • Always FP16, never quantized

Module 2 β€” Lattice Router

Custom MoE routing built specifically for this architecture. Not standard top-k.

  • Hierarchical routing: token β†’ domain cluster β†’ expert group β†’ individual expert
  • Domain clusters: Reasoning, Code, Vision, Language, Agentic, Science, Creative, Safety
  • Experts self-label during training via contrastive specialization loss
  • Router is inspectable at inference β€” API exposes which expert cluster handled each segment
  • Load-aware routing: aware of current server load, can shift to less-used experts

Module 3 β€” Confidence Calibration Head

Runs in parallel with LM head on every token.

  • Outputs epistemic uncertainty [0–1] per token
  • Aggregated to sentence/paragraph level for API response metadata
  • Trained on calibration data: model rewarded for accurate uncertainty, not just correct answers
  • Exposed via API as X-Lattice-Confidence header per response chunk
  • Feeds into Knowledge Boundary Detector (Module 17)

Module 4 β€” Native Tool Schema Reasoner

Not prompt-based function calling. Dedicated architecture.

  • Separate attention heads trained exclusively on tool/API schemas
  • Supports: JSON Schema, OpenAPI 3.x, GraphQL, SQL DDL
  • Schema tokenized as structured graph, not flat text
  • Tool call planner: generates multi-step tool execution plans before first call
  • Parallel tool dispatch: can issue multiple tool calls simultaneously
  • Tool result integrator: dedicated cross-attention for injecting tool results

Module 5 β€” Multi-Agent Coordination Layer (MACL)

Designed for multi-agent systems where multiple Lattice instances talk to each other.

  • Structured agent message format: role, task_id, confidence, partial_result, handoff_request
  • Agent role awareness: knows if it's orchestrator, subagent, critic, or executor
  • Shared scratchpad attention: multiple agents can attend to same working memory
  • Conflict resolution head: when two agents disagree, dedicated reasoning path
  • Exposed via API as lattice-agent-protocol extension
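A hypothetical shape for a MACL message, inferred from the field list above; the real wire format of the lattice-agent-protocol extension is not published:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class AgentMessage:
    """Assumed MACL message layout, field names taken from the spec above."""
    role: str                              # orchestrator | subagent | critic | executor
    task_id: str
    confidence: float                      # sender's confidence in partial_result
    partial_result: Optional[str] = None
    handoff_request: Optional[str] = None  # role to hand the task to, if any

msg = AgentMessage(role="subagent", task_id="t-17", confidence=0.82,
                   partial_result="draft summary", handoff_request="critic")
print(json.dumps(asdict(msg)))
```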

Module 6 β€” Hierarchical Context Compression Engine (HCCE)

Makes 1M+ context actually usable, not just theoretically supported.

  • Every 32K tokens: compress to summary embedding + key-fact store
  • Every 128K tokens: meta-summary of summaries
  • Recent 32K: always full resolution
  • Older context: summary + retrievable detail on demand
  • Learned compression: trained to preserve causally important information
  • Compression ratio: ~20:1 on narrative text, ~5:1 on code/structured data
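The tiering schedule above can be expressed as a function of token age; this is an illustrative rendering of the stated thresholds, not the learned compressor itself:

```python
def hcce_tier(pos: int, total: int) -> str:
    """Resolution tier for the token at `pos` in a `total`-token context."""
    age = total - pos
    if age <= 32_000:
        return "full"            # most recent 32K kept verbatim
    if age <= 128_000:
        return "summary_32k"     # summarized in 32K-token blocks
    return "meta_summary"        # summaries-of-summaries beyond 128K

total = 1_000_000
tiers = [hcce_tier(p, total) for p in (999_999, 900_000, 100_000)]
print(tiers)  # ['full', 'summary_32k', 'meta_summary']
```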

Module 7 β€” Structured Output Enforcer (SOE)

Guaranteed valid structured outputs. Not retry-based.

  • Constrained decoding via token masking against target schema
  • Supports: JSON, YAML, XML, Markdown, CSV, Python, SQL, HTML
  • Zero-shot: give it a Pydantic model or JSON Schema, get guaranteed valid output
  • Partial streaming: streams valid partial JSON as tokens generate
  • Integrated with Tool Schema Reasoner (Module 4) for tool call outputs
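The core trick of constrained decoding is masking logits so only schema-valid tokens survive. A hand-written state machine for the tiny schema `{"ok": <bool>}` shows the idea; real enforcers compile JSON Schema or Pydantic models into such automata automatically:

```python
import math

VOCAB = ['{"ok": ', "true", "false", "}", "hello", "42"]

ALLOWED = {                 # decoding state -> tokens the schema permits next
    "start": {'{"ok": '},
    "value": {"true", "false"},
    "close": {"}"},
}
NEXT = {"start": "value", "value": "close", "close": "done"}

def constrained_pick(logits, state):
    """Mask out every token the schema forbids, then take the argmax."""
    masked = [l if tok in ALLOWED[state] else -math.inf
              for tok, l in zip(VOCAB, logits)]
    return VOCAB[masked.index(max(masked))]

out, state = [], "start"
while state != "done":
    fake_logits = [0.1, 0.3, 0.9, 0.2, 5.0, 4.0]  # model "prefers" hello -> masked
    out.append(constrained_pick(fake_logits, state))
    state = NEXT[state]

result = "".join(out)
print(result)  # {"ok": false} -- always valid JSON, no retries needed
```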

Module 8 β€” Causal Reasoning Graph (CRG)

Builds an explicit internal cause-effect graph during generation.

  • Each reasoning step adds nodes + edges to internal graph
  • Graph attention: later reasoning steps attend to causal graph, not just token sequence
  • Detects reasoning loops and contradiction chains
  • Exposed optionally via API as structured reasoning trace
  • Improves performance on multi-hop questions, legal reasoning, scientific causality
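The loop-detection part reduces to cycle detection on the cause-effect graph: each reasoning step adds an edge, and a cycle means the chain has started justifying itself. A minimal DFS sketch:

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True                      # back-edge: reasoning loop
        if node in done:
            return False
        visiting.add(node)
        looped = any(dfs(n) for n in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return looped

    return any(dfs(n) for n in list(graph))

chain = [("A causes B", "B causes C"), ("B causes C", "C causes A"),
         ("C causes A", "A causes B")]
print(has_cycle(chain))      # True: the argument is circular
print(has_cycle(chain[:2]))  # False: a straight causal chain
```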

Module 9 β€” Temporal Awareness Module

Time is a first-class concept.

  • Dedicated temporal embeddings: absolute dates, relative references ("last week"), durations
  • Timeline builder: constructs event timelines from unstructured text
  • Temporal consistency checker: flags contradictions in event ordering
  • Knowledge cutoff awareness: trained to know what it does and doesn't know about recency
  • Feeds into Knowledge Boundary Detector (Module 17)

Module 10 β€” Cross-Lingual Semantic Alignment Layer

Support for 50+ languages with deep semantic alignment, not surface-level translation.

  • Language-agnostic semantic embedding space
  • Code-switching aware: handles mixed-language inputs naturally
  • Script normalization: handles CJK, Arabic RTL, Devanagari natively at tokenizer level
  • Dialect modeling: distinguishes Brazilian vs European Portuguese, Simplified vs Traditional Chinese
  • Translation quality head: can score its own translation outputs

Module 11 β€” Safety Reasoning Module (SRM)

Auditable, explainable safety β€” key differentiator for inference providers.

  • Dedicated safety reasoning chain before generation (not post-hoc filtering)
  • Produces explicit safety trace: what risk was considered, what was ruled out, why
  • Granular harm taxonomy: 47 harm categories with confidence scores
  • Provider-configurable: API operators can tune safety thresholds per deployment
  • Audit log: safety decisions logged in structured format for compliance
  • Separate from EQ Engine β€” safety is logic-based, not emotion-based

Module 12 β€” Vision-Language Grounding Module

Deep integration between visual and language understanding.

  • Object-level grounding: links text references to bounding box regions
  • Chart/diagram interpreter: specialized attention for data visualizations
  • Document layout understanding: OCR + structure (tables, headings, columns)
  • Screenshot-to-code: dedicated pathway for UI β†’ code generation
  • Video temporal grounding: links text references to specific frames

Module 13 β€” Long-Horizon Task Planner

Agentic planning as a first-class capability.

  • Task decomposition head: breaks goals into subtask DAGs
  • Dependency resolver: identifies which subtasks block others
  • Progress tracker: maintains task state across long conversations
  • Replanning trigger: detects when a plan needs revision based on new info
  • Integrates with MACL (Module 5) for distributing tasks across agents
  • Outputs structured task graphs via API
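A subtask DAG of the kind described above, with dependency resolution via a topological sort (the task names and the exact output schema are made up; the spec does not define them):

```python
from graphlib import TopologicalSorter

# Each key depends on the subtasks in its set (predecessors must finish first).
dag = {
    "gather_requirements": set(),
    "design_api": {"gather_requirements"},
    "write_code": {"design_api"},
    "write_tests": {"design_api"},
    "deploy": {"write_code", "write_tests"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # requirements first, deploy last; code/tests parallelizable between
```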

Module 14 β€” Persona Stability Enforcer (PSE)

Maintains consistent identity, tone, and personality across million-token contexts.

  • Persona embedding: operator-defined persona injected as persistent memory
  • Style consistency loss during training: penalizes tone drift
  • Character consistency checker: ensures factual claims about self don't contradict
  • Feeds from EQ Engine V2: adjusts warmth/formality dynamically but within persona bounds
  • Critical for long-running API deployments and character-based applications

Module 15 β€” API Telemetry & Observability Hooks

Built into the model, not bolted on by the provider.

  • Per-token latency profiling embedded in forward pass
  • Expert utilization stats per request
  • Context compression events flagged in stream
  • Confidence + uncertainty exposed per chunk
  • Module activation trace: which of the 17 modules fired for each request
  • All exposed as structured SSE metadata alongside token stream

Module 16 β€” Code Intelligence Engine (CIE)

Goes beyond code completion β€” full software engineering understanding.

  • AST-aware attention: code parsed to AST, structural tokens injected
  • Multi-file context graph: understands cross-file dependencies
  • Runtime simulation head: predicts execution behavior without running code
  • Bug pattern library: trained on CVE database + common bug taxonomies
  • Test generation: given code, generates comprehensive test suite
  • Integrates with Tool Schema Reasoner for build/exec tool use

Module 17 β€” Knowledge Boundary Detector (KBD)

Knows what it doesn't know.

  • Hallucination risk scorer per claim
  • Sources: Confidence Calibration Head + Temporal Module + retrieval signal
  • Claim classification: known / uncertain / likely-hallucination / outside-training
  • Citation need detector: flags claims that should be sourced
  • Self-consistency checker: runs 3 forward passes on uncertain claims, checks agreement
  • Exposed via API: X-Lattice-Hallucination-Risk per response
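The 3-pass self-consistency check can be sketched as disagreement scoring across samples; this toy proxy ignores the calibration and temporal signals the spec says are also folded in:

```python
from collections import Counter

def self_consistency_risk(samples):
    """Hallucination-risk proxy: sample the same claim several times and score
    disagreement. 0.0 means unanimous; higher means the claim is unstable."""
    top_count = Counter(samples).most_common(1)[0][1]
    agreement = top_count / len(samples)
    return round(1.0 - agreement, 2)

print(self_consistency_risk(["1969", "1969", "1969"]))  # 0.0
print(self_consistency_risk(["1969", "1972", "1968"]))  # 0.67
```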

Hardware & Inference Specs

Lattice-120B

| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~22B | ~240GB | ~35 |
| INT8 | ~22B | ~120GB | ~70 |
| INT4 | ~22B | ~60GB | ~130 |

Target: 4× H100 80GB (INT8) or 8× p300a (INT4)

Lattice-430B

| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~38B | ~860GB | ~18 |
| INT8 | ~38B | ~430GB | ~38 |
| INT4 | ~38B | ~215GB | ~72 |

Target: 8× H100 80GB (INT4) or 28× p300a (INT4)

Lattice-671B

| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~47B | ~1.34TB | ~12 |
| INT8 | ~47B | ~671GB | ~26 |
| INT4 | ~47B | ~336GB | ~50 |

Target: 32× H100 80GB (INT4) or 48× p300a (INT4)

Training Strategy

Phase 1 β€” Foundation (all sizes)

  • Mixed distillation from DeepSeek-V3, DeepSeek-R1, Llama 4 Scout/Maverick
  • Data: web text, code, scientific papers, books, multimodal datasets
  • Context: start at 8K, scale to 1M via curriculum
  • MoE load balancing stabilization

Phase 2 β€” Module Integration

  • Each of 17 modules trained with task-specific auxiliary losses
  • Module loss weights tuned per module (see training_config.py)
  • Modules frozen in turn as they converge

Phase 3 β€” Agentic Fine-tuning

  • Tool use, multi-agent coordination, long-horizon task completion
  • Synthetic agentic trajectories generated by Lattice-120B to bootstrap the larger models
  • RLHF / GRPO on agentic task completion + safety

Phase 4 β€” Alignment & Safety

  • Safety Reasoning Module fine-tuning on harm taxonomy
  • Constitutional AI-style self-critique
  • Red-team adversarial fine-tuning

API Design (Inference Provider Ready)

OpenAI-compatible with Lattice extensions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.provider.com/v1",
    api_key="your-key"
)

response = client.chat.completions.create(
    model="matrix-lattice-671b",
    messages=[{"role": "user", "content": "Your prompt"}],
    tools=[...],  # Native tool schemas
    extra_body={
        "lattice": {
            "expose_confidence": True,
            "expose_module_trace": False,
            "expose_reasoning_graph": False,
            "safety_tier": "standard",  # standard | strict | minimal
            "persona": "helpful-assistant",
            "agent_role": "orchestrator"  # orchestrator | subagent | critic
        }
    }
)

# Response includes standard OpenAI fields PLUS:
# response.lattice.confidence_scores
# response.lattice.active_modules
# response.lattice.hallucination_risk
# response.lattice.expert_clusters_used
```

Status

  • πŸ”΄ Planned β€” Architecture specification complete
  • Training infrastructure: TBD
  • Timeline: TBD (depends on compute access at scale)

HuggingFace

  • Matrix-Corp/Lattice-120B-V1 (planned)
  • Matrix-Corp/Lattice-430B-V1 (planned)
  • Matrix-Corp/Lattice-671B-V1 (planned)
  • Collection: Matrix-Corp/lattice-v1 (planned)