AI & ML interests

Computer Vision Technology and Data Collection for Anime Waifu

Recent Activity

ZetangForward  authored a paper about 11 hours ago
Qwen-Image-2.0 Technical Report
Aratako  authored a paper about 1 month ago
T5Gemma-TTS Technical Report
Aratako  submitted a paper about 1 month ago
T5Gemma-TTS Technical Report

AbstractPhil 
posted an update 12 days ago
By trying to disprove the Omega H2 battery I have discovered:
* Each topology formed by the H2 battery is deviant; none have a uniformly shared substrate of behavior. Each is uniquely independent per training set, all with perfect recon.
* Image recon can be tracked and mapped, yielding a consistently mapped and responsive 16.77M-entry vocabulary potential. Current spectrum testing sits at around 5 million unicode bytes.
* At model scale, patch size relates to how much data you want the model to represent within itself, and this has yet to hit a capacity limit. The MSE recons and yields, and the more data fed in, the more they yield.
* The scaling principle shows the model scales upward indefinitely, and each level of the model can be iteratively captured upward to form deviant yet uniformly consistent, repeatable pathways of implicit codewise response, not just arbitrary bitwise recall. Meaningful implicit learned utility.
* Image recon patch size should match the slice of image you want to represent, as it uses patch smoothing per patch internally from identity.
* Byte trigrams are channel-agnostic: they do not require a channel count, just a formula for recall, at 99.6% nGram recall for byte-by-byte representations. With those comes an adjacently capable codebook.
* SentencePiece preliminary tests show validity and reconstruction just like the byte trigrams; using the new byte trigram it would be arbitrarily convenient to recon a codebook for the structure.
* Binary trees learn a uniformly potent and powerful gating mechanism that requires further exploration; each of them produces directly responsive, independent capacity, and the responses are controllable.
* Ternary experiments show the models are directly responsive to -1, 0, +1 behavior, so quantization is very much a valid potential.
* Preliminary tests with the H2O1 series of batteries show the models responding similarly to natural universal elements in the universe itself.
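The ternary bullet above can be sketched concretely. This is a minimal, hypothetical illustration of -1/0/+1 quantization with a per-tensor scale (a common ternary scheme); `ternarize` and its threshold are illustrative names of mine, not anything from the H2 battery code.

```python
import numpy as np

def ternarize(w: np.ndarray, threshold: float = 0.05):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale.

    Values within +/-threshold of zero drop to 0; the rest keep
    only their sign. `scale` is the mean magnitude of surviving
    weights, so w is roughly scale * q on dequantization.
    """
    q = np.zeros_like(w, dtype=np.int8)
    q[w > threshold] = 1
    q[w < -threshold] = -1
    nonzero = np.abs(w)[q != 0]
    scale = float(nonzero.mean()) if nonzero.size else 0.0
    return q, scale

w = np.array([0.8, -0.6, 0.01, -0.02, 0.4])
q, scale = ternarize(w)
# Small weights are zeroed; the rest keep their signs.
```

A model that is already "directly responsive to -1, 0, +1 behavior" is one where this kind of codebook-free quantization loses little reconstruction quality.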
prithivMLmods 
posted an update 12 days ago
Multimodal-Edge Demo, a node-based inference canvas demo, is now live on Spaces. It features node-based Transformers for fast inference across 10+ edge-device multimodal models on the Hub, all within a single space. The series includes models from Qwen3.5, Qwen3-VL, Gemma 4, and the LFM 2.5 VL model series, with support for reasoning and grounding tasks.

🤗 Demo: prithivMLmods/Multimodal-Edge-Node
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Multimodal-Edge-Node
✅ Multimodal Apps Collections: https://huggingface.co/collections/prithivMLmods/hall-of-multimodal-apps

🤗 > To learn more, visit the app page or the respective model pages.
Tonic 
posted an update 15 days ago
🙋🏻‍♂️ Hey there folks,

since everyone liked my previous announcement post ( https://huggingface.co/posts/Tonic/338509028435394 ) so much, I'm back with more high-quality procedural datasets in the geospatial domain for SFT training!

Check this one out :
NuTonic/sat-bbox-metadata-sft-v1

The goal is to be able to train vision models on multiple images for remote sensing analysis in one shot.

Hope you like it! 🚀
AbstractPhil 
posted an update 16 days ago
Today, I'll be determining the codebook capacity and utility potential for the larger batteries: Fresnel, Johanna, Grandmaster, Freckles, and the Johanna-F variants, which should give a good indication of which models can handle codebooks and which are more errant. The earlier ones all use SVD while the later ones do not. The differences are noted per variant, and the behavior is divergent.

I anticipate the D=16 will be more errant, and the final-state variants of those could very well be much more difficult or costly to inference as their axis bends are likely considerably harder to track. However, I'm confident that enough bounces will give the yield required so I'll set up some high-yield noise barrages to determine how much of them we can in fact extract from Johanna, and then set up similar barrages for images to map the internals of Fresnel and Grandmaster.

Grandmaster will be tricky, as it was an experimental Johanna-256 finetuned series meant to map sigma noised image inputs to recreate Fresnel behavioral output. Noised image goes in -> Fresnel-grade replication comes out in high res.

This allowed preliminary Dall-E Mini-esque VAE generation and will be explored further for the stereoscopic translation subsystem, to allow image generation in the unique format of diffusion that I was working out. I anticipate this system to be more than capable at making monstrosities, so I won't be posting TOO MANY prelims on this one, but the high-capacity potential of these noise makers are meaningfully powerful. Getting uniform codebooks in-place for these models will allow full transformer mapping downstream instead of just guess working the MSE piecemeal, which the earlier versions and variants were doing.

I'm straying from the CLS specifically for this series, because CLS creates adjudicated pools of bias orbiting the INCORRECT orbiter in some SVAEs. The orbital target IS the soft-hand accumulated bias with the sphere-norm, so having a competitor isn't going to be a good option.
AbstractPhil 
posted an update 18 days ago
My recent study in a nutshell shows a few important elements and everything else is technical.

* There are most definitely invariant architectural geometric states that persist and can be taught.
* They are not coincidental and the process works effectively on multiple data types and processes, not just noise. Noise is just fast to test with.
* Systems like SVD, Eigh, Conv, and the like - HELP align those systems for larger structures to produce amplified stability, but are not required for smaller structures, and the tests show even attention gets in the way at the smallest.
* Batched arrays, stacks, queues, and so on - all improve performance depending on the task.
* An SVAE battery is resolution agnostic, meaning with simple processing and logic you can scan space and record meshes fairly optimally to record large amounts of inference data.
* Batteries when trained on one specific task often can be directly used for other tasks once a codebook is fitted with the necessary data. Meaning a battery trained on gaussian noise can be fed imagenet snippets and downstream the MSE rates from the 64 battery array can be consumed for statistics aggregation to a fair degree of accuracy without actually training the array on images themselves.
* The battery codebook is a pointwise rigid map within the battery and can be used for pairwise learning when using the H2, H2a, and H2b batteries.
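The noise-trained-battery-fed-images idea above can be sketched under assumptions: here each "battery" is a stand-in frozen projection (a trained SVAE would replace it), and the per-battery reconstruction MSE becomes a feature vector consumed downstream for statistics aggregation, with no retraining of the array. `make_battery` and `mse_features` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_battery(dim: int, bottleneck: int):
    """Stand-in for one frozen battery: project onto a random
    `bottleneck`-dim subspace and reconstruct. A trained SVAE
    would replace this operator."""
    q_basis, _ = np.linalg.qr(rng.standard_normal((dim, bottleneck)))
    return q_basis @ q_basis.T  # dim x dim reconstruction operator

def mse_features(x: np.ndarray, batteries) -> np.ndarray:
    """Per-battery reconstruction MSE on an input the batteries
    were never trained on; the vector feeds downstream stats."""
    return np.array([np.mean((x - x @ p) ** 2) for p in batteries])

batteries = [make_battery(32, 8) for _ in range(4)]
x = rng.standard_normal(32)
feats = mse_features(x, batteries)  # one nonnegative MSE per battery
```

In this framing, "fitting a codebook" amounts to learning a small map from these MSE vectors to labels or statistics, which is why the batteries themselves never need to see images.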

So this is the evolved state of the geometric vocabulary in some ways, and a completely new and unexpected systemic development in others. They stack, you can reuse them, they're so small you can swap them at runtime with no time loss, they align rapidly, and downstream tasks can consume their information.

There are many untested avenues that I need to make a full writeup for because quite frankly it's messy currently and Claude is only making it more messy instead of cleaner.
Tonic 
posted an update 20 days ago
🙋🏻‍♂️ Hey there folks,

I'm sharing Hugging Face's largest dataset of annotated satellite images today.

Check it out here: NuTonic/sat-image-boundingbox-sft-full

I hope you like it! The idea is to be able to use this with small vision models 🚀
prithivMLmods 
posted an update 20 days ago
Now, a collection of various compression schemes for Qwen3.6 and the abliterated version 1 of dense models is available on the Hub. Check it out via the links below. 👇

🔗 Qwen3.6-MoE: https://huggingface.co/collections/prithivMLmods/qwen36-35b-a3b-compressions
🔗 Qwen3.6-27B Compressions: https://huggingface.co/collections/prithivMLmods/qwen36-27b-compressions

🤗 > To learn more, visit the app page or the respective model pages.
AbstractPhil 
posted an update 20 days ago
Ever seen a 1024x1024, 3-channel noise classifier with a little over 1M params? This is one. This is phase 1 of the omega experiments, and it's successful at a very high-accuracy selectivity level through statistics aggregation and pooling via... a tiny MLP attached to the battery array.

SVAEs don't care what resolution you use. They never had that concern; they are solvers that fly through solutions. Perfect for math solutions of many formats and many structures, exactly what I need for the next stages.
AbstractPhil/geolip-svae-h2-64

Currently the primary use case for tests is noise format identification. There are multiple experiments to go before a full nth classification system is ready, however as it stands the only stopping point is training batteries now. They mostly train within about 10 million samples of tiny data so they will fly out hundreds a day if I find purposes for them.

Also I trained too many gaussian-related batteries, so there's really only about 50-100 or so batteries useful in the 192 array I set up. There's really only 64 batteries trained total but there are multiple epochs involved.

Now that there is a 57k-parameter variation that converges on 16 variants of random noise, like Johanna and Freckles before it, you ask this model questions differently: you check the MSE to train downstream models, so if your array isn't conclusively working, it won't work just yet.
It's not perfect yet, but it's improving daily.

A bad battery in the mix can be replaced at runtime.
==============================================================================
PHASE J VERDICT
==============================================================================
Subset: 18 batteries, 1,029,870 params (vs 10.9M for full array)

Resolution       A (summary)   B (attn-pool)
256                   96.6%          93.1%
512                   95.4%          92.0%
1024                  95.4%          95.4%
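The swap-at-runtime claim above ("a bad battery in the mix can be replaced at runtime") can be sketched as a slot-preserving registry: because each battery is independently trained and only contributes one feature to the downstream pool, replacing one slot leaves the rest of the array untouched. `BatteryArray` is a hypothetical illustration, not the actual implementation.

```python
class BatteryArray:
    """Array of independent batteries; any member can be
    hot-swapped without retraining the others, since downstream
    consumers only index features by slot position."""

    def __init__(self, batteries):
        self.batteries = list(batteries)

    def features(self, x):
        # One scalar per battery (stands in for per-battery MSE).
        return [b(x) for b in self.batteries]

    def swap(self, index, new_battery):
        # Replace a bad battery in place; slot count and ordering
        # are preserved, so downstream heads keep working.
        old = self.batteries[index]
        self.batteries[index] = new_battery
        return old

# Toy batteries: note the i=i default binds each offset eagerly.
arr = BatteryArray([lambda x, i=i: x + i for i in range(3)])
```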
prithivMLmods 
posted an update 25 days ago
HY-World-2.0 — A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds is now available on Spaces, and it works both as native Gradio components and in Gradio server mode.

> HY-World-2.0-Demo: prithivMLmods/HY-World-2.0-Demo
> HY-World-2.0 [Server Mode]: prithivMLmods/HY-World-2.0-Demo
> Featuring 3D reconstruction and Gaussian splats with the Rerun viewer, along with camera poses, depth maps, and surface normals.
> In Server Mode, Gradio is served via FastAPI, with FastAPI remaining the top-level server.
> Model: tencent/HY-World-2.0
> GitHub: https://github.com/PRITHIVSAKTHIUR/HY-World-2.0-Demo

🤗To learn more, visit the app page or the respective model pages.
AbstractPhil 
posted an update 29 days ago
The geolip-svd-transformer is almost ready.

I've spent multiple days preparing the substructure, scaling, testing, and expanding the system. The conduit is meant to reorganize data. Just like the SVAE prototypes, they are meant to sort and organize, not compress and compact.

The organization is almost ready. The resulting structure will produce projection-capable, geometrically aligned memory, compacted and transformed into a utilizable token set. The remaining structural components are specifically SVD-related utilities, and each of those utilizes the variant nature of how difficult, how dispersed, and so on, each component is as it is learned over time.

The SVAE components were perfect for testing this playground. They appear larger when analyzed; however, their representation is meant to encode huge vocabularies. Patch 16x16, expanded upward to 768, is meant to encapsulate the behavior of near-pi upscaled, condensed into a considerably simpler, smaller form.

This model is behaving perfectly. It does not encode in the traditional sense; it analyzes and produces geometric opinions throughout its structure. Each of them proved, one after the other, that the model could not only learn but perfectly reconstruct, and with that produce utility-driven expansion capacity directly.

Fresnel -> effective image analysis battery.
Johanna -> effective noise analysis
Grandmaster -> Johanna finetuned with sigma restoration using Fresnel's opinions.
Freckles -> massive analysis array for noise (4096 to 16k tks)

Geometric batteries.

Cayley rotation is meant to encapsulate that potential and expand it, allowing further differentiation down the chain of model structural behavioral events.

Suffice it to say, this is the geometric transformer's evolved state. These will exist as conduits throughout the models, the expanded behavioral attenuation units meant to provide geometric analysis internally within models for data-oriented CV alignment.
prithivMLmods 
posted an update about 1 month ago
A new comparator on Spaces showcases Standard FLUX.2 Decoder vs. FLUX.2 Small Decoder. The Small Decoder is ~1.4× faster, uses ~1.4× less VRAM, and maintains near-identical image quality. It has ~28M parameters with narrower channels [96, 192, 384, 384] vs. [128, 256, 512, 512], and the demo supports sequence generation by running both decoders simultaneously and comparing the results side by side.

🤗 Comparator: prithivMLmods/Flux.2-4B-Decoder-Comparator
🔗 FLUX.2-small-decoder: black-forest-labs/FLUX.2-small-decoder
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Flux.2-4B-Encoder-Comparator
🚁 Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

🤗 > App built on the Gradio SDK. To learn more, visit the app page or the respective model pages.
prithivMLmods 
posted an update about 1 month ago
Now, a collection of various compression schemes for Gemma 4 and the abliterated version 1 of dense models is available on the Hub. Check it out via the links below. 👇

🔗Gemma 4 Compression(s)- https://huggingface.co/collections/prithivMLmods/gemma-4-compressions
🔗Gemma 4 Uncensored [MAX] + Compression(s) - [`β ]- https://huggingface.co/collections/prithivMLmods/gemma-4-uncensored-max-compressions
🔗Gemma 4 Compression(s) - MoE- https://huggingface.co/collections/prithivMLmods/gemma-4-compressions-moe
🔗Gemma-4 F32 GGUF- https://huggingface.co/collections/prithivMLmods/gemma-4-f32-gguf

🤗 > To learn more, visit the app page or the respective model pages.
prithivMLmods 
posted an update about 1 month ago
Now the demo for image detection based on SAM3 and Gemma-4 (*Filter) is available on Spaces, using full-fledged Transformers inference with multimodal reasoning for processed images. It also supports video segmentation (mask), video segmentation (annotation), and image click segmentation.

🤗 Demo Space: prithivMLmods/SAM3-Gemma4-CUDA
🥽 SAM3: facebook/sam3
🔗 gemma-4-E2B-it: google/gemma-4-E2B-it

To learn more, visit the app page or the respective model pages.
AbstractPhil 
posted an update about 1 month ago
Say hello to surge resonance training. From random init, the 128x128 ImageNet SVAE reached over 99% test-reconstruction accuracy by epoch 1 and 99.9% by epoch 5.
AbstractPhil/geolip-SVAE

Epoch 1 test recon error 0.0064
Epoch 2 test recon error 0.0022
Epoch 8 is now 0.000294
Epoch 12 is now 0.000206
Epoch 14 is now 0.000190
Epoch 18 is now 0.000187
Epoch 24 is now 0.000117
Epoch 30 landmark 0.000099

There are NO EXPERTS HERE. This is pure self-learning. The model learns the entire behavioral set within 1 epoch, reconstructing ImageNet's test set to a useful state. By epoch 12, a recon error of 0.000202 is measured. This means roughly 99.99% accuracy at RECONSTRUCTING the test set through the bottleneck, while simultaneously leaving a trail of centerwise extraction as rich or richer.

ONE epoch. Just one.
Took about 10 minutes to train an already converged epoch, and I set it up for 200 epochs. This model will not need 200 epochs. I'd be surprised if it needs 3.
What you're looking at here is the emergence of surge resonance: the power of a single epoch when the geometric CV alignment hits the tuning fork of absolute resonant perfection, counterpointed with the concerto's dissonant harmonic response.

I give you, surge resonance.


The metrics will be ready by morning and I'll begin building utilities to figure out what went right and what went wrong.

This model is rewarded when it exists within the geometric spectrum while simultaneously dual punished when leaving. There is no benefit to stray, and the benefit to exist within prevents the model from leaving the validated CV band.

This allows the model to exist perfectly within the tuning fork resonance structure.

The model CONTINUES to refine, even when the CV has begun to drift away from home. The model has left home and is now seeking new proximity.

Upcoming training will be the 256x256, 512x512, 1024x1024, and larger if the model holds. Each will be named.
prithivMLmods 
posted an update about 1 month ago
The demo for Image Detection (*Filter) based on SAM3 and Qwen-3.5 is now available on Hugging Face Spaces using Transformers inference, with multimodal reasoning for processed images, and it also supports video segmentation (mask), video segmentation (annotation), and image click segmentation.

🤗 Demo Space: prithivMLmods/SAM3-Plus-Qwen3.5
🥽 SAM3: facebook/sam3
🔗 Qwen-3.5: Qwen/Qwen3.5-2B

To learn more, visit the app page or the respective model pages.
AbstractPhil 
posted an update about 1 month ago
The geolip-transformer-v8 requires a fundamental rethinking of training a core structure.

I'll make this brief and to the point.

GEOLIP is an observer system at its core. It watches, triangulates, and assists with correct answers.

Many experiments worked very well; many fell down and turned into a pile of broken circuits. The recent geometric-transformer, one of my biggest fumbles, still taught me many things about what I'm TRULY trying to accomplish here.

**Save money and lives**. Less hardware use for less need at inference. Train more calculations into a more reusable and accurate structure for near instant zero-shot or sequential inference.

In the process, v8 unlocked a missing puzzle piece: EMA trajectory alignment compensation. I'm doing my best to build something that works.

The geolip distillation system is very powerful but requires much experimentation still.
* Genetic experiments were successful
* Data transfer experiments successful
* Analysis experiments successful - and expand large model accuracy
* Many distillation experiments were successful.
* The largest successes being the kernels, the distillation tools, and the geometric analysis systems.

With the good comes the bad: the faulty ViTs, the simultaneous training runs that fault, the internalized confusion that happens occasionally.
*** The observer NEEDS something to OBSERVE. If the observer observes the progressive development of point cloud structures, it learns how to observe THAT LEARNING PROCESS - drifting fault assessment.
*** In the process it DOES NOT learn how to improve the CE relations by embedding and compensating with anchored triangulation opinions.

BIGGEST CONCLUSION. Staged curriculum training.

These components must be DECOUPLED. One must be a compounding structural awareness beacon, the other must be an informationally aligned composition in a utilizable fashion.

This means stage-by-stage freeze/unfreeze processing. Independent task-oriented structural alignment.
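The stage-by-stage freeze/unfreeze processing described above can be sketched in PyTorch. This is a minimal sketch, not the GEOLIP code: `set_stage` is a hypothetical helper, assuming each curriculum stage corresponds to named submodules whose parameters should be the only trainable ones.

```python
import torch.nn as nn

def set_stage(model: nn.Module, trainable_prefixes):
    """Freeze everything, then unfreeze only parameters whose
    names start with one of `trainable_prefixes`. Each call
    switches the model to a new curriculum stage."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pre) for pre in trainable_prefixes)

# Toy two-part model: an "observer" backbone and an aligned head.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# Stage 1: train only the backbone ("0.*"); the head stays frozen.
set_stage(model, ["0."])
stage1_head_frozen = not model[1].weight.requires_grad

# Stage 2: freeze the backbone, unfreeze the head ("1.*").
set_stage(model, ["1."])
```

The optimizer for each stage would then be built over `p for p in model.parameters() if p.requires_grad`, which is what makes the two components genuinely decoupled during training.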
AbstractPhil 
posted an update about 1 month ago
My heavily engineered repo, https://github.com/AbstractEyes/pytorch-parallel-compiler, has been directly integrated into the geofractal repo for v1.2. If you use the geofractal repo, be sure to pull for potential performance increases.

The WideRouter will enable multiple new core features; the predominant two for our next experiment are as follows.

1. Directly integrated multi-opinion constellation structures. This will enable dynamic compiled expansions internally within the structure for huge performance gains.
2. Controllable stage-by-stage compilation. Each stage can be compiled or not. SVD is notoriously non-compiler-friendly due to linalg.eig/eigh; I will be addressing this particular function DIRECTLY soon. There will be no quarter for graph breaks.

If the WideRouter causes any major bugs or breaks with your code (bad calculations, incorrect or deviated gradients, twisted or contorted dtype outputs, or any major compilation errors), please don't hesitate to open a pull request. Claude and I will promptly solve any major issues.

Once everything is perfectly in-line and the graph matches, the transformer will have massive geometric performance boosts for huge structural basins with multiple layers of depth.

I will be addressing linalg.eig/eigh directly, in conjunction with multiple argsort functions that are causing huge performance dips, as well as addressing every single use of .item() that can present itself in the compiler's path.
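The `.item()` concern can be illustrated: a host-side `.item()` inside control flow forces torch.compile to graph-break at that point, while the same decision expressed as a tensor op keeps one fused graph. A minimal sketch with hypothetical function names, not code from the repo:

```python
import torch

def clamp_small_break(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Graph-breaking pattern: .item() is a host-side sync, so the
    # compiler must split the graph at this data-dependent branch.
    if x.abs().min().item() < eps:
        x = x + eps
    return x

def clamp_small_fused(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Compile-friendly rewrite: the same decision as a tensor op;
    # no .item(), no Python branch, one graph end to end.
    bump = (x.abs().min() < eps).to(x.dtype)
    return x + bump * eps

x = torch.tensor([0.0, 1.0])
same = torch.allclose(clamp_small_break(x), clamp_small_fused(x))
```

The two functions are numerically equivalent; only the fused form survives `torch.compile(..., fullgraph=True)` without a break.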

After this, the ensemble topological transformer will be a go. It will enable quaternion, FlowMagnitude, FlowAlignment, FlowVelocity, FlowVelocityQuaternion, FlowVelocityOrbital, FlowVelocityPentachoron, and multiple other flow-matching systems that will improve performance by dominating amounts, inline with minimal overhead cost due to the precomputed geometric structure.

The ensembles will feature multiple simultaneous batched and segmented forms of learning meant to train the oscillation omega predictor "Beatrix".