Zandy-Wandy committed · verified · commit 62afb6a · 1 parent: 52f2f62

Update README.md

Files changed (1): README.md (+306 −1)
license: cc-by-nc-nd-4.0
short_description: Upcoming Flagship LLM series
---
# Matrix Lattice — Full Architecture Specification
**Agentic + Multimodal Frontier MoE Family | Matrix.Corp**

---

## Overview

Matrix Lattice is Matrix.Corp's flagship frontier model family, designed from the ground up for deployment on inference providers (Novita, Hyperbolic, Together, Fireworks, etc.) and accessed through an OpenAI-compatible API. It is agentic-first and natively multimodal, supports 1M+ token context, and uses an MoE architecture that keeps active parameters far below the total count.

| Model | Total Params | Active Params | Experts | Context | Target Hardware |
|---|---|---|---|---|---|
| Lattice-120B | 120B | ~22B active | 64 experts, top-4 | 1M tokens | 4× H100 / 8× p300a |
| Lattice-430B | 430B | ~38B active | 128 experts, top-4 | 1M tokens | 16× H100 / 28× p300a |
| Lattice-671B | 671B | ~47B active | 256 experts, top-4 | 1M tokens | 32× H100 / 48× p300a |

---
## Base Lineage

Mixed distillation approach:
- **DeepSeek-V3 / R1** — MLA attention, MoE routing strategy, math/reasoning capability
- **Llama 4 Scout/Maverick** — multimodal vision encoder architecture, instruction following, long-context iRoPE scaling
- **Custom Matrix.Corp additions** — 17 novel modules, lattice routing, agentic infrastructure

---
## Core Public Architectures Used

### 1. Multi-Head Latent Attention (MLA) — DeepSeek-V3
Compresses the KV cache via low-rank projection. At 1M context, a standard KV cache is infeasible; MLA makes it viable, reducing KV cache size by ~90% vs standard MHA.
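The low-rank KV compression can be sketched as follows. This is a minimal illustration, not the Lattice implementation: all dimensions are invented for the example, and `W_down` / `W_up_k` / `W_up_v` are hypothetical names for the compression and expansion projections.

```python
import numpy as np

# MLA-style low-rank KV compression (illustrative sizes): instead of
# caching full per-head K/V, cache one small latent vector per token
# and re-expand it to K and V when attention needs them.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand to K
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # expand to V

h = rng.standard_normal((10, d_model))   # 10 tokens of hidden states
latent_cache = h @ W_down                # this is all that gets cached
K = latent_cache @ W_up_k                # reconstructed on the fly
V = latent_cache @ W_up_v

full_cache_floats = 10 * 2 * n_heads * d_head  # standard MHA caches K and V
mla_cache_floats = latent_cache.size           # MLA caches one latent per token
print(mla_cache_floats / full_cache_floats)    # 0.0625 -> 93.75% smaller here
```

With these toy dimensions the cache shrinks 16×; the ~90% figure above depends on the actual latent dimension chosen.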
### 2. Mixture of Experts (MoE) — DeepSeek-V3 Style
- Shared experts (always active) + routed experts (top-k per token)
- Fine-grained expert segmentation — many smaller experts rather than a few large ones
- Load balancing via auxiliary-loss-free strategy (sequence-level bias, no loss penalty)
- Expert capacity: no token dropping, dynamic overflow routing
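The top-k dispatch with a bias-based (auxiliary-loss-free) balancer can be sketched as follows — sizes and names are illustrative, and the bias-update rule itself is omitted:

```python
import numpy as np

# Top-k expert routing with a load-balancing bias: the bias is adjusted
# between training steps to steer tokens toward underused experts,
# instead of adding an auxiliary loss term.
n_experts, top_k, d = 8, 2, 16
rng = np.random.default_rng(1)
W_router = rng.standard_normal((d, n_experts))
bias = np.zeros(n_experts)             # nudged between steps to balance load

def route(x):
    scores = x @ W_router + bias       # bias affects selection, not the loss
    top = np.argsort(scores)[-top_k:]  # indices of the top-k routed experts
    w = np.exp(scores[top])
    w /= w.sum()                       # softmax weights over selected experts
    return top, w

x = rng.standard_normal(d)
experts, weights = route(x)
# in a real layer: output = shared_expert(x) + sum(w_i * expert_i(x))
```

Shared experts sit outside the router entirely — every token pays their cost, which is why only the routed portion counts toward sparsity.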
### 3. Mixture of Depths (MoD) — Google Research
Tokens dynamically skip transformer layers based on a learned routing decision. Easy tokens skip up to 50% of layers; hard tokens (reasoning, code, structured output) use all layers. Net result: ~30% compute reduction at equal quality.
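A minimal sketch of one MoD layer, assuming a scalar router score and a fixed 50% capacity (the `block` lambda is a stand-in for the attention+MLP block, not a real component):

```python
import numpy as np

# Mixture-of-Depths layer sketch: a learned router scores each token;
# only the top fraction goes through the block, the rest pass through
# on the residual stream unchanged.
rng = np.random.default_rng(2)
d, n_tokens, capacity = 16, 8, 0.5            # process the top 50% of tokens

def mod_layer(x, block, w_router):
    scores = x @ w_router                      # one scalar score per token
    k = int(len(x) * capacity)
    keep = np.argsort(scores)[-k:]             # "hardest" tokens by score
    out = x.copy()                             # easy tokens: identity (skip)
    out[keep] = x[keep] + block(x[keep])       # hard tokens: full compute
    return out

block = lambda h: 0.1 * h                      # stand-in for attention + MLP
w_router = rng.standard_normal(d)
x_in = rng.standard_normal((n_tokens, d))
y = mod_layer(x_in, block, w_router)           # half the rows got real compute
```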
### 4. iRoPE / YaRN Scaling — Llama 4 / YaRN paper
Interleaved NTK-aware RoPE scaling for 1M+ context without positional degradation. Alternating full-attention and sliding-window layers: full attention every 4th layer, an 8K sliding window on the intermediate layers.

### 5. Sliding Window Attention — Mistral
8K sliding window on non-full-attention layers. O(n) memory for most layers, O(n²) only on full-attention layers.
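The every-4th-layer schedule and a windowed causal mask might look like this (toy sizes; the real window is 8K):

```python
import numpy as np

# Layer schedule per the text: full attention every 4th layer,
# sliding-window attention on the layers in between.
def layer_kind(i):
    return "full" if i % 4 == 0 else "sliding"

# Causal mask, optionally limited to a lookback window: each query can
# attend to keys at or before its own position, within `window` tokens.
def causal_mask(n, window=None):
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    mask = k <= q                    # causal constraint
    if window is not None:
        mask &= (q - k) < window     # limit lookback to the window
    return mask

m = causal_mask(6, window=3)
# token 5 sees tokens 3, 4, 5 only -> O(n * window) memory per sliding
# layer instead of O(n^2)
```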
### 6. Speculative Decoding — Google DeepMind
Each Lattice model ships with a paired draft model (e.g. Lattice-120B-Draft at ~4B params), giving a 3–5× inference speedup on provider hardware. The draft model shares embedding weights with the main model.
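The draft/verify loop can be sketched in its simplified greedy form. The `draft_step` / `target_step` callables below are toy stand-ins, not Lattice APIs:

```python
# Greedy speculative decoding sketch: the draft model proposes k tokens,
# the target model verifies the same positions in one pass, and the
# agreeing prefix is kept (plus the target's token at the first miss).
def speculate(prefix, draft_step, target_step, k=4):
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # draft proposes k tokens
        t = draft_step(ctx)
        draft.append(t)
        ctx.append(t)
    verified = target_step(list(prefix), k)  # target's tokens, same positions
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            accepted.append(v)               # target's token replaces the miss
            break
        accepted.append(d)
    return accepted                          # >= 1 target-quality token/pass

# Toy models that agree on the first two tokens, then diverge:
draft_step = lambda ctx: len(ctx)                              # 3, 4, 5, 6
target_step = lambda ctx, k: [len(ctx), len(ctx) + 1, 99, 100][:k]
print(speculate([0, 1, 2], draft_step, target_step))           # -> [3, 4, 99]
```

Every pass yields at least one token the target model would have produced itself, which is why the speedup never costs quality.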
### 7. Multimodal Vision Encoder — Llama 4 / InternVL lineage
- ViT-based image encoder (6B params, separate from the LM)
- Cross-attention visual tokens injected at every 4th layer
- Supports: images, video frames, documents, charts, screenshots
- Patch resolution: 448×448 base, up to 4K via dynamic tiling
- Audio: separate audio encoder (Whisper-large-v3 lineage) for speech/sound understanding

---
## 17 Custom Modules

### Module 1 — EQ Engine V2
Upgraded from Zenith's V1. Now tracks the emotional arc across the **entire conversation**, not just per layer.
- Persistent emotional state vector across turns (GRU with conversation-length memory)
- 12-emotion classification (expanded from 8)
- Frustration trajectory prediction — detects escalation before it peaks
- Per-user emotional baseline calibration (inferred from the first 3 turns)
- Feeds into the Persona Stability Enforcer (Module 14)
- Always FP16, never quantized
### Module 2 — Lattice Router
Custom MoE routing built specifically for this architecture — not standard top-k.
- Hierarchical routing: token → domain cluster → expert group → individual expert
- Domain clusters: Reasoning, Code, Vision, Language, Agentic, Science, Creative, Safety
- Experts self-label during training via a contrastive specialization loss
- Router is inspectable at inference — the API exposes which expert cluster handled each segment
- Load-aware routing: shifts traffic to less-used experts based on current server load

### Module 3 — Confidence Calibration Head
Runs in parallel with the LM head on every token.
- Outputs epistemic uncertainty [0–1] per token
- Aggregated to sentence/paragraph level for API response metadata
- Trained on calibration data: the model is rewarded for accurate uncertainty, not just correct answers
- Exposed via API as an `X-Lattice-Confidence` header per response chunk
- Feeds into the Knowledge Boundary Detector (Module 17)

### Module 4 — Native Tool Schema Reasoner
Not prompt-based function calling — a dedicated architecture.
- Separate attention heads trained exclusively on tool/API schemas
- Supports: JSON Schema, OpenAPI 3.x, GraphQL, SQL DDL
- Schemas are tokenized as structured graphs, not flat text
- Tool call planner: generates multi-step tool execution plans before the first call
- Parallel tool dispatch: can issue multiple tool calls simultaneously
- Tool result integrator: dedicated cross-attention for injecting tool results

### Module 5 — Multi-Agent Coordination Layer (MACL)
Designed for multi-agent systems where multiple Lattice instances talk to each other.
- Structured agent message format: role, task_id, confidence, partial_result, handoff_request
- Agent role awareness: knows whether it is orchestrator, subagent, critic, or executor
- Shared scratchpad attention: multiple agents can attend to the same working memory
- Conflict resolution head: a dedicated reasoning path for when two agents disagree
- Exposed via API as the `lattice-agent-protocol` extension
### Module 6 — Hierarchical Context Compression Engine (HCCE)
Makes 1M+ context actually usable, not just theoretically supported.
- Every 32K tokens: compress to a summary embedding + key-fact store
- Every 128K tokens: meta-summary of summaries
- Most recent 32K: always full resolution
- Older context: summary + retrievable detail on demand
- Learned compression: trained to preserve causally important information
- Compression ratio: ~20:1 on narrative text, ~5:1 on code/structured data
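The tiering schedule above can be sketched as follows — the tier boundaries come from the text, but the learned compressor itself is replaced by plain labels, so this only shows which resolution each region of a long context would get:

```python
# Assign each 32K block of a long context to a resolution tier:
# the most recent 32K stays at full resolution, blocks up to 128K back
# become summaries, and anything older becomes a meta-summary.
def context_tiers(n_tokens, block=32_000, meta=128_000, recent=32_000):
    tiers = []
    for start in range(0, n_tokens, block):
        age = n_tokens - start            # distance from the context's end
        if age <= recent:
            tiers.append((start, "full"))          # last 32K: full resolution
        elif age <= meta:
            tiers.append((start, "summary"))       # 32K-128K back: summaries
        else:
            tiers.append((start, "meta-summary"))  # older: summary of summaries
    return tiers

tiers = context_tiers(1_000_000)
# a 1M context ends up mostly meta-summaries, with a band of summaries
# and only the final block kept at full resolution
```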
### Module 7 — Structured Output Enforcer (SOE)
Guaranteed valid structured outputs — not retry-based.
- Constrained decoding via token masking against the target schema
- Supports: JSON, YAML, XML, Markdown, CSV, Python, SQL, HTML
- Zero-shot: give it a Pydantic model or JSON Schema, get guaranteed valid output
- Partial streaming: streams valid partial JSON as tokens generate
- Integrated with the Tool Schema Reasoner (Module 4) for tool call outputs
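Constrained decoding by token masking can be illustrated with a toy grammar. The vocabulary and grammar here are invented for the example — a real enforcer compiles the target schema into a grammar/state machine — but the mechanism is the same: invalid next tokens are masked out before selection, so the output is valid by construction rather than by retry.

```python
import math

# Toy vocabulary and a minimal grammar for a non-empty JSON-ish list of
# digits, e.g. "[2]" or "[1,2]".
vocab = ["[", "]", ",", "1", "2", "x"]

def allowed(prefix):
    if not prefix:
        return {"["}
    last = prefix[-1]
    if last == "[":
        return {"1", "2"}
    if last in ("1", "2"):
        return {",", "]"}
    if last == ",":
        return {"1", "2"}
    return set()                       # after "]" generation stops

def constrained_step(logits, prefix):
    # Greedy pick over only the grammar-allowed tokens.
    ok = allowed(prefix)
    best, best_score = None, -math.inf
    for tok, logit in zip(vocab, logits):
        if tok in ok and logit > best_score:
            best, best_score = tok, logit
    return best

out = []
while True:
    # fixed logits where the invalid token "x" is the model's favorite
    tok = constrained_step([0.1, 0.3, 0.2, 0.0, 0.9, 2.0], out)
    if tok is None:
        break
    out.append(tok)
    if tok == "]":
        break
print("".join(out))  # -> "[2]" — valid despite the model preferring "x"
```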
### Module 8 — Causal Reasoning Graph (CRG)
Builds an explicit internal cause-effect graph during generation.
- Each reasoning step adds nodes and edges to the internal graph
- Graph attention: later reasoning steps attend to the causal graph, not just the token sequence
- Detects reasoning loops and contradiction chains
- Optionally exposed via API as a structured reasoning trace
- Improves performance on multi-hop questions, legal reasoning, and scientific causality

### Module 9 — Temporal Awareness Module
Time is a first-class concept.
- Dedicated temporal embeddings: absolute dates, relative references ("last week"), durations
- Timeline builder: constructs event timelines from unstructured text
- Temporal consistency checker: flags contradictions in event ordering
- Knowledge cutoff awareness: trained to know what it does and doesn't know about recent events
- Feeds into the Knowledge Boundary Detector (Module 17)

### Module 10 — Cross-Lingual Semantic Alignment Layer
50+ language support with deep semantic alignment, not surface translation.
- Language-agnostic semantic embedding space
- Code-switching aware: handles mixed-language inputs naturally
- Script normalization: handles CJK, Arabic RTL, and Devanagari natively at the tokenizer level
- Dialect modeling: distinguishes Brazilian vs European Portuguese, Simplified vs Traditional Chinese
- Translation quality head: can score its own translation outputs

### Module 11 — Safety Reasoning Module (SRM)
Auditable, explainable safety — a key differentiator for inference providers.
- Dedicated safety reasoning chain before generation (not post-hoc filtering)
- Produces an explicit safety trace: what risk was considered, what was ruled out, and why
- Granular harm taxonomy: 47 harm categories with confidence scores
- Provider-configurable: API operators can tune safety thresholds per deployment
- Audit log: safety decisions logged in a structured format for compliance
- Separate from the EQ Engine — safety is logic-based, not emotion-based

### Module 12 — Vision-Language Grounding Module
Deep integration between visual and language understanding.
- Object-level grounding: links text references to bounding-box regions
- Chart/diagram interpreter: specialized attention for data visualizations
- Document layout understanding: OCR + structure (tables, headings, columns)
- Screenshot-to-code: dedicated pathway for UI → code generation
- Video temporal grounding: links text references to specific frames
### Module 13 — Long-Horizon Task Planner
Agentic planning as a first-class capability.
- Task decomposition head: breaks goals into subtask DAGs
- Dependency resolver: identifies which subtasks block others
- Progress tracker: maintains task state across long conversations
- Replanning trigger: detects when a plan needs revision based on new information
- Integrates with MACL (Module 5) for distributing tasks across agents
- Outputs structured task graphs via API
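The subtask-DAG and dependency-resolution idea can be sketched with Python's standard topological sorter. The task names are hypothetical; the point is that layering the DAG reveals both the blocking order and which subtasks could run in parallel:

```python
from graphlib import TopologicalSorter

# Hypothetical subtask DAG: each task maps to the set of tasks it
# depends on (its blockers).
deps = {
    "fetch_data":   set(),
    "clean_data":   {"fetch_data"},
    "train_model":  {"clean_data"},
    "eval_model":   {"train_model"},
    "write_report": {"train_model", "eval_model"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # subtasks with no unmet dependencies
    waves.append(ready)             # each wave could be dispatched in parallel
    ts.done(*ready)
print(waves)
```

`prepare()` also raises on cycles, which is exactly the kind of inconsistency a replanning trigger would need to catch.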
### Module 14 — Persona Stability Enforcer (PSE)
Maintains consistent identity, tone, and personality across million-token contexts.
- Persona embedding: operator-defined persona injected as persistent memory
- Style consistency loss during training: penalizes tone drift
- Character consistency checker: ensures factual claims about the self don't contradict each other
- Feeds from EQ Engine V2: adjusts warmth/formality dynamically, but within persona bounds
- Critical for long-running API deployments and character-based applications

### Module 15 — API Telemetry & Observability Hooks
Built into the model, not bolted on by the provider.
- Per-token latency profiling embedded in the forward pass
- Expert utilization stats per request
- Context compression events flagged in the stream
- Confidence + uncertainty exposed per chunk
- Module activation trace: which of the 17 modules fired for each request
- All exposed as structured SSE metadata alongside the token stream

### Module 16 — Code Intelligence Engine (CIE)
Goes beyond code completion — full software-engineering understanding.
- AST-aware attention: code is parsed to an AST and structural tokens are injected
- Multi-file context graph: understands cross-file dependencies
- Runtime simulation head: predicts execution behavior without running the code
- Bug pattern library: trained on the CVE database + common bug taxonomies
- Test generation: given code, generates a comprehensive test suite
- Integrates with the Tool Schema Reasoner for build/exec tool use
### Module 17 — Knowledge Boundary Detector (KBD)
Knows what it doesn't know.
- Hallucination risk scorer per claim
- Sources: Confidence Calibration Head + Temporal Module + retrieval signal
- Claim classification: known / uncertain / likely-hallucination / outside-training
- Citation need detector: flags claims that should be sourced
- Self-consistency checker: runs 3 forward passes on uncertain claims and checks agreement
- Exposed via API: `X-Lattice-Hallucination-Risk` per response
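The self-consistency check can be sketched as follows — `sample_answer` is a stand-in for a sampled forward pass, not a real Lattice API, and the three-pass count follows the description above:

```python
from collections import Counter

# Self-consistency: sample the model several times on an uncertain claim
# and measure how strongly the answers agree.
def self_consistency(sample_answer, n=3):
    answers = [sample_answer() for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n          # majority answer + agreement score

# Deterministic stand-in: two passes agree, one diverges.
answers = iter(["Paris", "Paris", "Lyon"])
top, agreement = self_consistency(lambda: next(answers))
print(top, agreement)  # Paris 0.666...
```

A low agreement score is one concrete signal that a claim belongs in the likely-hallucination bucket.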
---

## Hardware & Inference Specs

### Lattice-120B
| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~22B | ~240GB | ~35 TPS |
| INT8 | ~22B | ~120GB | ~70 TPS |
| INT4 | ~22B | ~60GB | ~130 TPS |

Target: 4× H100 80GB (INT8) or 8× p300a (INT4)

### Lattice-430B
| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~38B | ~860GB | ~18 TPS |
| INT8 | ~38B | ~430GB | ~38 TPS |
| INT4 | ~38B | ~215GB | ~72 TPS |

Target: 8× H100 80GB (INT4) or 28× p300a (INT4)

### Lattice-671B
| Config | Active Params | VRAM | TPS (est.) |
|---|---|---|---|
| BF16 | ~47B | ~1.34TB | ~12 TPS |
| INT8 | ~47B | ~671GB | ~26 TPS |
| INT4 | ~47B | ~336GB | ~50 TPS |

Target: 32× H100 80GB (INT4) or 48× p300a (INT4)
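The VRAM columns above follow from simple weight-memory arithmetic: total parameters times bytes per parameter. This counts weights only — KV cache, activations, and runtime overhead come on top, so treat the table's figures as lower bounds:

```python
# Weight memory in GB: 1B params at 8 bits is ~1 GB.
def weight_gb(total_params_b, bits):
    return total_params_b * bits / 8

for params in (120, 430, 671):
    bf16, int8, int4 = (weight_gb(params, b) for b in (16, 8, 4))
    print(f"{params}B -> BF16 {bf16:.0f}GB, INT8 {int8:.0f}GB, INT4 {int4:.0f}GB")
# 120B -> 240/120/60 GB and 671B -> 1342/671/336 GB, matching the tables
```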
---

## Training Strategy

### Phase 1 — Foundation (all sizes)
- Mixed distillation from DeepSeek-V3, DeepSeek-R1, and Llama 4 Scout/Maverick
- Data: web text, code, scientific papers, books, multimodal datasets
- Context: starts at 8K, scaled to 1M via curriculum
- MoE load-balancing stabilization

### Phase 2 — Module Integration
- Each of the 17 modules is trained with task-specific auxiliary losses
- Module loss weights tuned per module (see training_config.py)
- Modules are frozen in turn as they converge

### Phase 3 — Agentic Fine-tuning
- Tool use, multi-agent coordination, long-horizon task completion
- Synthetic agentic trajectories generated by Lattice-120B to bootstrap the larger models
- RLHF / GRPO on agentic task completion + safety

### Phase 4 — Alignment & Safety
- Safety Reasoning Module fine-tuning on the harm taxonomy
- Constitutional-AI-style self-critique
- Red-team adversarial fine-tuning
## API Design (Inference Provider Ready)

OpenAI-compatible with Lattice extensions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.provider.com/v1",
    api_key="your-key",
)

response = client.chat.completions.create(
    model="matrix-lattice-671b",
    messages=[{"role": "user", "content": "Your prompt"}],
    tools=[...],  # Native tool schemas
    extra_body={
        "lattice": {
            "expose_confidence": True,
            "expose_module_trace": False,
            "expose_reasoning_graph": False,
            "safety_tier": "standard",  # standard | strict | minimal
            "persona": "helpful-assistant",
            "agent_role": "orchestrator",  # orchestrator | subagent | critic
        }
    },
)

# Response includes standard OpenAI fields PLUS:
# response.lattice.confidence_scores
# response.lattice.active_modules
# response.lattice.hallucination_risk
# response.lattice.expert_clusters_used
```

---
## Status
- 🔴 Planned — architecture specification complete
- Training infrastructure: TBD
- Timeline: TBD (depends on compute access at scale)

## HuggingFace
- `Matrix-Corp/Lattice-120B-V1` (planned)
- `Matrix-Corp/Lattice-430B-V1` (planned)
- `Matrix-Corp/Lattice-671B-V1` (planned)
- Collection: `Matrix-Corp/lattice-v1` (planned)