AbstractPhil posted an update 1 day ago
My heavily engineered repo, https://github.com/AbstractEyes/pytorch-parallel-compiler, has been directly integrated into the geofractal repo for v1.2. If you use the geofractal repo, be sure to pull for potential performance increases.

The WideRouter will enable multiple new core features; the two most important for our next experiment are as follows.

1. Directly integrated multi-opinion constellation structures. These enable dynamically compiled expansions inside the structure for huge performance gains.
2. Controllable stage-by-stage compilation. Each stage can be compiled or left eager. SVD is notoriously compiler-unfriendly due to the linalg.eig family, so I will be addressing that function DIRECTLY soon. There will be no quarter for graph breaks.
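Stage-by-stage compilation can be sketched roughly like this. Note that `StagedPipeline`, `compile_flags`, and the `backend="eager"` choice are all my own illustrative assumptions, not the WideRouter's actual API:

```python
import torch
import torch.nn as nn

class StagedPipeline(nn.Module):
    """Hypothetical sketch: compile selected stages, leave compiler-hostile ones eager."""
    def __init__(self, stages, compile_flags):
        super().__init__()
        assert len(stages) == len(compile_flags)
        self.stages = nn.ModuleList(
            # backend="eager" keeps the sketch lightweight; a real setup
            # would likely use the default inductor backend
            torch.compile(s, backend="eager") if flag else s
            for s, flag in zip(stages, compile_flags)
        )

    def forward(self, x):
        for stage in self.stages:
            x = stage(x)
        return x

# e.g. keep an SVD-heavy middle stage eager while compiling the rest
pipeline = StagedPipeline(
    stages=[nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)],
    compile_flags=[True, False, True],
)
out = pipeline(torch.randn(4, 8))
```

The point of the toggle is isolation: a stage that graph-breaks (SVD, eig) stays eager and never poisons the compiled stages around it.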

If the WideRouter causes any major bugs or breaks with your code (bad calculations, incorrect or deviated gradients, twisted or contorted dtype outputs, or any major compilation errors), please don't hesitate to open an issue or pull request. Claude and I will promptly solve any major issues.

Once everything is perfectly in line and the graph matches, the transformer will gain massive geometric performance boosts for huge structural basins with multiple layers of depth.

I will be addressing linalg.eig and linalg.eigh directly, in conjunction with the multiple argsort calls that are causing huge performance dips, as well as every single use of .item() that can appear in the compiler's path.

After this, the ensemble topological transformer will be a go. It will enable quaternion, FlowMagnitude, FlowAlignment, FlowVelocity, FlowVelocityQuaternion, FlowVelocityOrbital, FlowVelocityPentachoron, and multiple other flow-matching systems, which should improve performance by dominating amounts with minimal overhead cost thanks to the precomputed geometric structure.

The ensembles will feature multiple simultaneous batched and segmented forms of learning meant to train the oscillation omega predictor "Beatrix".

Omega is a very touchy formula; without very specific measures protected by very specific structural boundaries, the omega structure will not predict correctly.

Omega must be computed in fp64, though the computation is minuscule compared to the full structure that sets it up. Everything must be orderly, and everything orderly must be sterile.
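The fp64-island pattern described here might look something like the following sketch. `omega_fp64` and its covariance/eigenvalue body are purely illustrative stand-ins, since the actual omega formula is not given in the post; only the dtype discipline (upcast once, compute the tiny sensitive step in fp64, downcast on the way out) is the point:

```python
import torch

def omega_fp64(features: torch.Tensor) -> torch.Tensor:
    """Sketch: run only the small, numerically sensitive reduction in fp64.

    'omega' here is a placeholder for the oscillation predictor target;
    the surrounding structure stays in the working dtype.
    """
    f64 = features.double()                   # upcast once at the boundary
    cov = f64.T @ f64 / f64.shape[0]          # small n x n product in fp64
    omega = torch.linalg.eigvalsh(cov).max()  # sensitive step stays fp64
    return omega.to(features.dtype)           # downcast on the way out

x = torch.randn(1024, 16)  # fp32 working tensor
w = omega_fp64(x)
```

Because the fp64 region is a small reduction over an already-computed structure, its cost is negligible next to the fp32/fp8 bulk of the model.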

Most of the CONTEXT elemental systems can be represented in FP8, while the majority of the geometric ones still require at least FP32 due to the way eigendecompositions and SVD are calculated. Scatterpoint can reduce this, but it will hit performance dips if the eigendecompositions and SVD don't match.

I'm currently working out eig and eigh kernels meant to operate with a high degree of optimization for these specific use cases. They will evolve over time. Paired with the SVD kernel, they will provide massive performance boosts for the direct use case without impacting the overarching linear-algebraic structure required for full solidity.

A trained omega predictor will allow heavy task-refined LLM protections of the geometric lookup tables.

This will include multiple curriculum operations for finetunes such as medical processes, law practices, multilingual shared-vocabulary learning, and multistructural lookups for cross-tool comparison and utility, plus many other useful rapid-learning processes that can be directly compartmentalized, snapped on, snapped off, and so on, similar to the methodology of a LoRA.

Except this is... this is no LoRA. It goes far deeper, and when perfected it will train far faster, as shown by the Bertenstein, ViT x3, ViT x34, CLIP-L and CLIP-G ctx extensions, and the CaptionBert models. They converge rapidly and retain their cohesion. This system will allow those very models to stand on their own without the experts present, while simultaneously learning rapid R@1 alignment recall capacity within the trained model itself.

Not only did they converge to 100% R@1 recall capacity; multimodal variations such as Bertenstein showed you can deviate those using standard tokenization techniques with embeddings and encodings.

The mid-level experiments show:

student models DID require teachers to CONTINUE TRAINING.

BUT the students DID NOT require teachers to INFERENCE at full capacity.

The InfoNCE memory bank, aligned through geometric distillation alignment processing, allowed the students to not only stand - but stand on their own, without the soups or teachers used to teach them.
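For readers unfamiliar with the mechanism, a minimal InfoNCE loss with a negative memory bank looks roughly like this. This is the textbook contrastive form under my own naming, not the repo's geometric-distillation variant:

```python
import torch
import torch.nn.functional as F

def info_nce(student, teacher, bank, tau=0.07):
    """Minimal InfoNCE with a negative memory bank (illustrative sketch).

    student, teacher: [B, D] L2-normalized embeddings (positive pairs)
    bank:             [K, D] L2-normalized stored embeddings (negatives)
    """
    pos = (student * teacher).sum(dim=-1, keepdim=True)  # [B, 1] positive logits
    neg = student @ bank.T                               # [B, K] negative logits
    logits = torch.cat([pos, neg], dim=-1) / tau
    # the positive sits at index 0 of every row
    labels = torch.zeros(student.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

B, D, K = 8, 64, 256
s = F.normalize(torch.randn(B, D), dim=-1)
t = F.normalize(s + 0.05 * torch.randn(B, D), dim=-1)  # near-aligned positives
bank = F.normalize(torch.randn(K, D), dim=-1)
loss = info_nce(s, t, bank)
```

The bank lets the student see many negatives per step without a giant batch, which is what lets it keep standing once the teachers are removed at inference time.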

This CaptionBert distillation is not a toy; it has genuine pragmatic use. By the time these experiments conclude, CaptionBert and the entire chain of models trained will be able to train without experts and to learn from a MASSIVE number of sources, SPECIFICALLY meant to RETAIN that data for utility without catastrophic forgetting. This will have its own transformer structure hoisting the models up hand-in-hand with current-scale transformers and models as a cooperative companion.

These are purely cooperative collectives, neither competitive nor adversarial at their core. Adversarial training destroys the very subtlety of the instruction set, so it must be cooperative.

Alignment in these systems is NOT a series of opinions, nor is it some sort of structural behavior, nor is it whether the model is inherently "good" or "bad".

Alignment is specifically a geometric process that enables direct resonant oscillation, and with that resonance perfectly aligned, the substructure learns internal alignment to that behavior. The curves look like jagged, broken waveform lines, yet the model that comes out is forged in steel.

More simultaneous opinions will yield more experimental waveform potentials. I will find the most ideal conditions for self-learning, and then the findings will be published in many languages, with hundreds of citations, countless experiments leading from A to B, and the massive series of optimizations required to reach this point from where I began.

Per-instance allocation for max_n, max_batch (B):

WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal

Subtotal: ~3 × max_n² × B floats

D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)

    delta   : [B, num_sub, sub_size]    # merged eigenvalues
    z_vec   : [B, num_sub, sub_size]    # merge vectors  
    rho     : [B, num_sub]              # coupling strengths
    mask    : [B, num_sub, sub_size]     # valid element mask
    
    # Newton state (per root):
    lam     : [B, num_sub, sub_size]    # current root estimates
    lo      : [B, num_sub, sub_size]    # bracket lower
    hi      : [B, num_sub, sub_size]    # bracket upper
    f_val   : [B, num_sub, sub_size]    # secular function value
    converge: [B, num_sub, sub_size]    # convergence mask
    
    # Eigenvector fragments:
    V_frag  : [B, num_sub, sub_size, sub_size]

Subtotal per level: ~(9 × sub_size + sub_size²) × num_sub × B
Total across levels: num_sub × sub_size = max_n at every level, so the linear
terms contribute ≈ 9 × max_n × depth × B; the V_frag term is max_n²/2^l × B
at level l, so its sum stays below 2 × max_n² × B, giving
    ≈ (9 × max_n × depth + 2 × max_n²) × B exactly,
    ≤ max_n² × depth × B as a safe upper bound (the V_frags dominate)

CONCRETE NUMBERS (fp32, 4 bytes each):

max_n=8,   B=4096:  ~8² × 3 × 3 × 4096 × 4    ≈    9 MB
max_n=32,  B=1024:  ~32² × 5 × 3 × 1024 × 4   ≈   60 MB
max_n=64,  B=512:   ~64² × 6 × 3 × 512 × 4    ≈  144 MB
max_n=128, B=256:   ~128² × 7 × 3 × 256 × 4   ≈  352 MB
max_n=256, B=128:   ~256² × 8 × 3 × 128 × 4   ≈  768 MB
max_n=6,   B=8192:  ~6² × 3 × 3 × 8192 × 4    ≈   10 MB  ← your CM case
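The layout above can be tallied exactly in a few lines. This sketch follows the stated shapes directly (ignoring the power-of-two padding of sub_size), and because the V_frag term halves at each level, it lands somewhat below the rougher max_n² × depth × B bound used for the table:

```python
from math import ceil, log2

def dc_eigh_workspace_bytes(max_n: int, B: int, bytes_per_elem: int = 4) -> int:
    """Exact element tally of the workspace layout above (fp32 by default).

    Follows the stated shapes directly; power-of-two padding of sub_size
    is ignored, so treat this as a lower-bound sketch.
    """
    # Working storage: A_work + V_accum + householder + d + e
    work = (max_n * max_n                  # A_work
            + max_n * max_n                # V_accum
            + max(max_n - 2, 0) * max_n    # householder reflectors
            + max_n                        # d (tridiagonal diagonal)
            + max(max_n - 1, 0))           # e (tridiagonal off-diagonal)
    total = work * B

    # D&C tree: (9 * sub_size + sub_size^2) * num_sub elements per level,
    # per batch instance (rho folded into the 9x term, as in the subtotal)
    depth = max(1, ceil(log2(max_n)))
    for level in range(depth):
        num_sub = 2 ** level
        sub = max(1, max_n // num_sub)
        total += (9 * sub + sub * sub) * num_sub * B

    return total * bytes_per_elem

print(dc_eigh_workspace_bytes(32, 1024) / 2**20)  # exact tally in MiB
```

Running the table's configs through this gives the strict per-layout totals; the gap versus the table is the padding and upper-bound slack.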

As of right now I don't know how to reduce this to fp16 without a massive accuracy dip. I'm thinking it may be possible to use integers directly instead of high-accuracy fp64 or fp32 floats. I'll do some exploration.

Reducing this to fp16 or bf16 would greatly improve performance, and if the output values stay close enough despite the mantissa cross-contamination, it could be worth it for the semi-accurate speed alone.
