Terry Rodriguez
terry-remyx
3 followers · 3 following
smellslikeml
smellslikeml
terry-j-rodriguez
AI & ML interests
None yet
Recent Activity
reacted to salma-remyx's post with 🔥 2 days ago
The space of possible improvements for your AI model is large while evaluation is costly.

So I was excited to discover the ICML 2026 paper from Kobalczyk, Lin, Letham, Zhao, Balandat, and Bakshy titled "LILO: Bayesian Optimization with Natural Language Feedback."

The method learns efficiently from expert preferences, balancing exploration and exploitation in a principled way with Bayesian Optimization for expensive-to-evaluate black-box objectives.

Experimenting with the technique, I trained a Gaussian Process proxy model on the implicit preferences in my code repo's commit history at VQASynth.

The result: I used the model's preference scores to re-rank candidate papers recommended based on my interests in spatial reasoning and multimodal data synthesis.

Semantic relevance is a high-recall method for finding arXiv papers personalized to your interests. Adding contributor preferences, extracted from the merge history of your code, offers a high-precision filter.

So what's next? I'm using the model to synthesize a larger volume of preference data to finetune an open-weight coding model with DPO and LoRA.

Tuning Coding Agents via Implicit Preference Distillation
arXiv: https://arxiv.org/pdf/2510.17671
Substack: https://remyxai.substack.com/p/lilo-and-myx
VQASynth: https://github.com/remyxai/VQASynth
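The two-stage idea in the post above (semantic relevance for recall, then a preference score for precision) can be sketched roughly as follows. This is an illustrative sketch only, not the author's pipeline: the function names, the `embedding` and `pref` fields, and the toy numbers are all hypothetical, and a plain lookup stands in for the Gaussian Process proxy model's preference score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query_vec, candidates, preference_score, recall_k=3):
    """Shortlist by semantic relevance, then order the shortlist by preference."""
    # Stage 1 (high recall): keep the recall_k candidates closest to the query.
    by_relevance = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    shortlist = by_relevance[:recall_k]
    # Stage 2 (high precision): re-order the shortlist by the preference score.
    # In the post this score would come from the trained proxy model;
    # here it is any callable supplied by the caller (an assumption).
    return sorted(shortlist, key=preference_score, reverse=True)


# Hypothetical toy data: 2-d embeddings and made-up preference scores.
candidates = [
    {"id": "paper_a", "embedding": [1.0, 0.0], "pref": 0.2},
    {"id": "paper_b", "embedding": [0.9, 0.1], "pref": 0.9},
    {"id": "paper_c", "embedding": [0.7, 0.7], "pref": 0.5},
    {"id": "paper_d", "embedding": [0.0, 1.0], "pref": 1.0},
]

ranked = rerank([1.0, 0.0], candidates, lambda c: c["pref"], recall_k=3)
ranked_ids = [c["id"] for c in ranked]
print(ranked_ids)
```

In this toy run `paper_d` has the highest preference score but never reaches the second stage, because it is not semantically close to the query. That is the division of labor the post describes: relevance decides what is considered, preference decides the order.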
reacted to salma-remyx's post with 🧠 13 days ago
VQASynth is the open source implementation of the https://huggingface.co/papers/2401.12168 paper, putting together the data synthesis pipeline behind https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct, https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B, and several other spatial reasoning models we've shared here on HF.

From early development through production, different categories of evidence become available to guide what to try next. The strongest decisions combine evidence across categories rather than relying on any one.

Stage 1: Development history
Commit history holds the moments where things changed. For VQASynth, that's how scenes get parsed, how captions get generated, how spatial relations get encoded. Even before a model is in production, those milestones are a strong signal for what methods are semantically relevant to where the system is now.

Stage 2: Observational outcomes
Once a model is serving, the same commit history delineates changes against real-world results. That opens up quasi-experiments. You get causal evidence about which changes drove which outcomes, and inference on questions you haven't directly tested.

Stage 3: Controlled experiments
When teams start running interventions, those outcomes tighten the estimates further. This is the regime most people associate with rigor, but it's expensive and gated by traffic.

Stage 4: Counterfactual perturbations
When A/B testing becomes the operational bottleneck, instrumenting decision points in the production system lets you probe what would have happened under alternative choices. Shadow mode first, live traffic once audits pass.

Experimentation maturity is a journey, and every stage offers something to learn from.

More on these ideas: https://docs.remyx.ai/concepts/maturity-progression
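As a rough illustration of the Stage 2 idea, a commit timestamp can split a metric series into "before" and "after" windows. The sketch below is an assumption-laden toy, not the method described in the post: the function name, the metric values, and the time units are hypothetical, and a naive difference in means is only a starting point, since a real quasi-experiment would also need to account for trends, seasonality, and concurrent changes.

```python
from statistics import mean


def before_after_effect(observations, change_time):
    """Naive difference in mean outcome after vs. before a change.

    observations: iterable of (timestamp, metric_value) pairs.
    change_time: timestamp of the commit or deploy being examined.
    """
    before = [value for t, value in observations if t < change_time]
    after = [value for t, value in observations if t >= change_time]
    if not before or not after:
        raise ValueError("need observations on both sides of the change")
    return mean(after) - mean(before)


# Hypothetical daily metric readings; the change lands at t = 4.
observations = [
    (1, 0.70), (2, 0.72), (3, 0.71),
    (4, 0.80), (5, 0.82), (6, 0.81),
]

effect = before_after_effect(observations, change_time=4)
print(round(effect, 3))
```

A shift like this one is a prompt for closer analysis rather than proof that the commit caused it.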
reacted to salma-remyx's post with 🔥 17 days ago
SciCrafter measured something AI practitioners have intuited: frontier agents are improving at executing inside well-framed problems, but lag at framing the problem in the first place.

GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 all plateaued near 26% on a new Minecraft benchmark for probing AI capabilities in the discovery-to-application loop.

So the authors ran targeted interventions:
* Hints about what to investigate doubled performance.
* A structured experimentation template added 7-14 more points.
* Structured consolidation beat free-form summaries by 6 points.
* Curriculum context beat independent task-solving.

These interventions helped the agent frame what’s worth investigating, and structure what gets learned so it compounds. The bottleneck for AI in scientific workflows is upstream of execution.

Their findings are congruent with the design patterns we've adopted at Remyx AI to help AI teams close the development loop scientifically. Agents work well inside structured loops, but they perform poorly when tasked with creating the structure.

Instrumenting your scientific workflows offers greater leverage than scaling compute with a less informed search.

In the work of building production AI systems, teams are flying through execution. The bigger challenge is identifying which experiments moved which production outcome, or what to try next.

One of the more interesting results I found this week by tracking work in AI for scientific workflows using Remyx: https://engine.remyx.ai/papers/d8f23b9b-b14b-4ada-b44e-ccfc221c06b4
terry-remyx's activity
liked a Space about 2 months ago
Remyx Explorer 🔬: Search >10K+ arXiv papers with ready-to-run environments
liked a model 7 months ago
remyxai/SpaceQwen3-VL-2B-Thinking
Image-Text-to-Text • 2B • Updated Oct 23, 2025