AbstractPhila
FL Hybrid Eigendecomposition: Beating cuSOLVER's Mathematical Purity with Compilable PyTorch
As of right now I don't know how to reduce to fp16 without a massive accuracy dip. I suspect it's possible to use integers directly instead of high-accuracy fp64 or fp32 floats; I'll do some exploration.
Reducing this to fp16 or bf16 capacity would greatly improve performance, and if the output values stay close enough despite the mantissa cross-contamination, it could be worth it for the semi-accurate speed alone.
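One way to probe the fp16/bf16 dip empirically is to round-trip the input matrix through the reduced dtype and compare eigenvalues against an fp64 reference. This is a minimal sketch, assuming symmetric inputs and using `torch.linalg.eigvalsh`; the helper name is mine, not part of the project:

```python
import torch

def eigh_precision_gap(n: int = 32, batch: int = 64, seed: int = 0):
    """Measure how much eigenvalue accuracy is lost when the input
    matrix is stored in a reduced-precision dtype before decomposition."""
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(batch, n, n, generator=g, dtype=torch.float64)
    a = a + a.transpose(-1, -2)          # symmetrize
    ref = torch.linalg.eigvalsh(a)       # fp64 reference eigenvalues
    gaps = {}
    for dt in (torch.float32, torch.bfloat16, torch.float16):
        lossy = a.to(dt).to(torch.float64)   # simulate storage in dt
        vals = torch.linalg.eigvalsh(lossy)
        gaps[str(dt)] = (vals - ref).abs().max().item()
    return gaps
```

The bf16 column is where the "massive dip" shows up: bf16 keeps only 7 mantissa bits, so the input perturbation alone dominates the error before the solver even runs.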
Per-instance allocation for max_n, max_batch (B):
WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal
Subtotal: ~3 × max_n² × B floats
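As a sketch, the working-storage block above could be allocated like this in PyTorch; the buffer names mirror the layout, and `alloc_working_storage` is a hypothetical helper, not the project's API:

```python
import torch

def alloc_working_storage(max_n: int, B: int, dtype=torch.float32, device="cpu"):
    """Allocate the per-instance working buffers from the layout above."""
    return {
        "A_work":      torch.empty(B, max_n, max_n, dtype=dtype, device=device),
        "V_accum":     torch.empty(B, max_n, max_n, dtype=dtype, device=device),
        "householder": torch.empty(max(max_n - 2, 0), B, max_n, dtype=dtype, device=device),
        "d":           torch.empty(B, max_n, dtype=dtype, device=device),
        "e":           torch.empty(B, max(max_n - 1, 0), dtype=dtype, device=device),
    }

def working_bytes(bufs):
    """Actual footprint of the allocated buffers, in bytes."""
    return sum(t.numel() * t.element_size() for t in bufs.values())
```

The two n×n buffers plus the reflector stack are what give the ~3 × max_n² × B figure; `d` and `e` are lower-order terms.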
D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)
delta : [B, num_sub, sub_size] # merged eigenvalues
z_vec : [B, num_sub, sub_size] # merge vectors
rho : [B, num_sub] # coupling strengths
mask : [B, num_sub, sub_size] # valid element mask
# Newton state (per root):
lam : [B, num_sub, sub_size] # current root estimates
lo : [B, num_sub, sub_size] # bracket lower
hi : [B, num_sub, sub_size] # bracket upper
f_val : [B, num_sub, sub_size] # secular function value
converge: [B, num_sub, sub_size] # convergence mask
# Eigenvector fragments:
V_frag : [B, num_sub, sub_size, sub_size]
Subtotal per level: ~(9 × sub_size + sub_size²) × num_sub × B
Total across levels: since num_sub × sub_size = max_n at every level,
≈ (9 × max_n + max_n²) × depth × B
≈ max_n² × depth × B (the V_frags dominate)
CONCRETE NUMBERS (fp32, 4 bytes each, depth = ⌈log₂(max_n)⌉):
max_n=8, B=4096: ~8² × 3 × 3 × 4096 × 4 ≈ 9 MB
max_n=32, B=1024: ~32² × 5 × 3 × 1024 × 4 ≈ 60 MB
max_n=64, B=512: ~64² × 6 × 3 × 512 × 4 ≈ 144 MB
max_n=128, B=256: ~128² × 7 × 3 × 256 × 4 ≈ 336 MB
max_n=256, B=128: ~256² × 8 × 3 × 128 × 4 ≈ 768 MB
max_n=6, B=8192: ~6² × 3 × 3 × 8192 × 4 ≈ 10 MB ← your CM case
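A minimal estimator for the rule-of-thumb product behind these rows (depth = ⌈log₂(max_n)⌉, a factor of 3 for the working-storage buffers, 4 bytes per fp32 element); the function name is mine:

```python
import math

def dac_memory_mib(max_n: int, B: int, bytes_per_el: int = 4) -> float:
    """Rule-of-thumb footprint in MiB: ~3 full n*n working buffers times
    the D&C tree depth, i.e. ~ 3 * max_n^2 * depth * B elements."""
    depth = math.ceil(math.log2(max_n))
    return 3 * max_n**2 * depth * B * bytes_per_el / 2**20
```

This is only the dominant V_frag + working-copy term; the ~9 × max_n per-level Newton-state vectors are ignored as lower order.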
Alignment in these systems is NOT a series of opinions, nor some sort of structural behavior, nor a question of whether the model is inherently "good" or "bad".
Alignment is specifically a geometric process that enables direct resonant oscillation, and with that resonance perfectly aligned, the substructure learns internal alignment to that behavior. The curves look like jagged, broken waveform lines, yet the model that comes out is forged in steel.
Running more opinions simultaneously yields more experimental waveform potentials. I will find the most ideal conditions for self-learning, and then the findings will be published in many languages, with hundreds of citations, countless experiments leading from A to B, and the massive series of optimizations required to reach this point from where I began.
A trained omega predictor will allow heavy, task-refined LLM protection of the geometric lookup tables.
This will include multiple curriculum operations for finetunes such as medical processes, law practices, multilingual shared-vocabulary learning, multistructural lookups for cross-tool comparison and utility, and many other useful rapid-learning processes that can be directly compartmentalized, snapped on, snapped off, and so on - similar to the methodology of a LoRA.
Except this is... this is no LoRA. This is far deeper, and when perfected it will train far faster, as shown by the Bertenstein, ViT x3, ViT x34, CLIP-L and CLIP-G ctx extensions, and the CaptionBert models. They converge rapidly and retain their cohesion. This system will allow those very models to stand on their own without the experts present, while simultaneously learning rapid-alignment R@1 recall capacity within the trained model itself.
Not only did they converge with 100% R@1 recall capacity; multimodal variations such as Bertenstein showed you can deviate from those using standard tokenization techniques with embeddings and encodings.
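For reference, R@1 recall over paired embeddings can be measured with a small helper like this (hypothetical name; cosine similarity over matched image/text rows is assumed):

```python
import torch

def recall_at_1(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """R@1 for paired embeddings: row i of each matrix is a matched pair.
    Counts how often the true partner is the nearest neighbour by cosine sim."""
    img = torch.nn.functional.normalize(img_emb, dim=-1)
    txt = torch.nn.functional.normalize(txt_emb, dim=-1)
    sims = img @ txt.T                       # [N, N] cosine similarities
    top1 = sims.argmax(dim=-1)               # best text index for each image
    return (top1 == torch.arange(len(img))).float().mean().item()
```

"100% R@1" in this sense means the argmax lands on the diagonal for every row of the similarity matrix.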
The mid-level experiments show:
student models DID require teachers to CONTINUE TRAINING.
BUT the students DID NOT require teachers to INFERENCE at full capacity.
The InfoNCE memory bank, aligned through geometric distillation processing, allowed the students not only to stand - but to stand on their own, without the soups or teachers used to teach them.
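A minimal sketch of an InfoNCE memory-bank objective of the kind described above, assuming normalized embeddings, a bank of past teacher embeddings as negatives, and the standard temperature-scaled cross-entropy form; the names are mine, not the project's API:

```python
import torch
import torch.nn.functional as F

def infonce_with_bank(student: torch.Tensor, teacher: torch.Tensor,
                      bank: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE where positives are the current teacher embeddings and
    negatives come from a memory bank of past teacher embeddings."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    b = F.normalize(bank, dim=-1)
    pos = (s * t).sum(-1, keepdim=True)            # [N, 1] paired similarity
    neg = s @ b.T                                   # [N, K] bank similarities
    logits = torch.cat([pos, neg], dim=-1) / temperature
    labels = torch.zeros(len(s), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```

Once the bank is populated, the teacher is only needed to refresh it during training; inference runs on the student alone, which matches the observation above.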
This CaptionBert distillation is not a toy; it has genuine pragmatic use. By the time these experiments conclude, CaptionBert and the entire chain of models trained will be able to train without experts and to learn from a MASSIVE number of sources, SPECIFICALLY meant to RETAIN that data for utility without catastrophic forgetting. It will have its own transformer structure, hoisting the models up hand-in-hand with current-scale transformers and models as a cooperative companion.
These are purely cooperative collectives - not competition, not adversarial training at their core. Adversarial training destroys the very subtlety of the instruction set, so it must be cooperative.
Omega is a very touchy formula; without very specific measures protected by very specific structural boundaries, the omega structure will not predict correctly.
Omega must be computed in fp64, though the computation is minuscule compared to the full structure that sets it up. Everything must be orderly, and everything orderly must be sterile.
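The fp64-island pattern can be sketched generically. The actual omega formula isn't shown in this post, so `fn` below is a stand-in for it; the helper name is mine:

```python
import torch

def fp64_island(x: torch.Tensor, fn) -> torch.Tensor:
    """Run a small, numerically sensitive computation in fp64 while the
    surrounding pipeline stays in its lower precision. `fn` stands in
    for the omega computation, which is not reproduced here."""
    out = fn(x.to(torch.float64))   # upcast only for the sensitive step
    return out.to(x.dtype)          # hand back the pipeline's dtype
```

Because the island is tiny relative to the setup around it, the fp64 cost is negligible while the prediction stays numerically sterile.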
Most of the CONTEXT elemental systems can be represented in FP8, while the majority of the geometric side still requires a minimum of FP32 due to the way eigendecompositions and SVDs are calculated. Scatterpoint can reduce this, but it will have performance dips without the eig and SVD precision matching.
I'm currently working on eig and eigh kernels meant to operate with a high degree of optimization for these specific use cases. This will evolve over time. When paired with the SVD kernel, it will provide massive performance boosts for the direct use case without impacting the overarching linear-algebraic structure required for full solidity.
The WideRouter will enable multiple core new features; the predominant two for our next experiment are as follows.
1. Directly integrated multi-opinion constellation structures. This will enable dynamic compiled expansions internally within the structure for huge performance gains.
2. Controllable stage-by-stage compilation. Each stage can be compiled or not. SVD being notoriously compiler-unfriendly due to its linalg.eig calls, I will be addressing this particular function DIRECTLY soon. There will be no quarter for graph breaks.
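Stage-by-stage compilation can be sketched as a per-stage switch, so a graph-breaking stage (such as one calling `torch.linalg.eig`) can stay eager while the rest compiles. The stage names and bodies here are illustrative stand-ins, not the WideRouter's actual stages:

```python
import torch

class StagedPipeline:
    """Sketch: each stage is individually compilable, so a single
    graph-breaking stage does not poison compilation of the others."""
    def __init__(self, compile_flags: dict):
        self.stages = {
            "tridiagonalize": lambda a: a + a.transpose(-1, -2),   # stand-in
            "solve":          lambda a: torch.linalg.eigvalsh(a),  # eig-like stage, often kept eager
            "accumulate":     lambda v: v.cumsum(-1),              # stand-in
        }
        for name, flag in compile_flags.items():
            if flag:
                self.stages[name] = torch.compile(self.stages[name])

    def run(self, a: torch.Tensor) -> torch.Tensor:
        a = self.stages["tridiagonalize"](a)
        v = self.stages["solve"](a)
        return self.stages["accumulate"](v)
```

Typical usage would compile the dense tensor stages and leave "solve" eager until its decomposition call is made graph-safe.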
If the WideRouter causes any major bugs or breaks in your code - bad calculations, incorrect or deviated gradients, twisted or contorted dtype outputs, or any major compilation errors - please don't hesitate to open a pull request. Claude and I will promptly solve any major issues.
Once everything is perfectly in-line and the graph matches, the transformer will have massive geometric performance boosts for huge structural basins with multiple layers of depth.
I will be addressing linalg.eig and linalg.eigh directly, in conjunction with the multiple argsort calls that are causing huge performance dips, as well as every single use of .item() that can present itself in the compiler's path.
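The .item() problem looks like this in miniature: pulling a scalar to the host forces a graph break, while keeping the decision as a 0-dim tensor does not. Both helpers are illustrative:

```python
import torch

# Graph-breaking pattern: .item() syncs to the host mid-graph.
def all_converged_breaks(converged: torch.Tensor) -> bool:
    return converged.sum().item() == converged.numel()

# Compile-friendly rewrite: the decision stays on-device as a 0-dim tensor.
def all_converged_fused(converged: torch.Tensor) -> torch.Tensor:
    return converged.all()
```

Inside a compiled loop the fused form can feed `torch.where` masks directly, so the Newton iterations above never have to leave the graph to check convergence.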
After this, the ensemble topological transformer will be a go. It will enable quaternion, FlowMagnitude, FlowAlignment, FlowVelocity, FlowVelocityQuaternion, FlowVelocityOrbital, FlowVelocityPentachoron, and multiple other flow-matching systems that will improve performance by dominating amounts, with minimal overhead cost thanks to the precomputed geometric structure.
The ensembles will feature multiple simultaneous batched and segmented forms of learning meant to train the oscillation omega predictor "Beatrix".
Ryan Spearman: Geometric Variant Effect Prediction Through Quaternion-Composed Dual Expert Alignment
Self-distillation has shown improvement. Most importantly, I've discovered a core component that can be utilized as geometric attention: the quaternion MHA. The constellation produces all the information necessary for the quaternion MHA to benefit from it in a directly utilizable fashion.
The quaternion MHA is quite the vessel. It's bulky, has multiple MHA structures, and is shockingly effective in the process. I'll be refining this head in the coming days as a composite Procrustes alignment tool.
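A Procrustes alignment step of the kind mentioned above can be sketched with the classical SVD solution; this is the textbook orthogonal Procrustes, not necessarily the exact composite head being refined here:

```python
import torch

def procrustes_align(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes: the orthogonal matrix R minimizing
    ||x @ R - y||_F, via the SVD of the cross-covariance x^T y."""
    u, _, vh = torch.linalg.svd(x.T @ y)
    return u @ vh
```

The returned R is what an alignment head would apply to map one embedding frame onto another while preserving all pairwise geometry.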
Geometric structure has a very high potential for informational accumulation, so a multi-series of MHA can capture a great amount of informational processing from those elements, if the elements are curated correctly and within specification.