Salma Mayorquin

salma-remyx

AI & ML interests

None yet

Recent Activity

posted an update 1 day ago
VQASynth is the open-source implementation of the https://huggingface.co/papers/2401.12168 paper, putting together the data synthesis pipeline behind https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct, https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B, and several other spatial reasoning models we've shared here on HF.

From early development through production, different categories of evidence become available to guide what to try next. The strongest decisions combine evidence across categories rather than relying on any one.

Stage 1: Development history
Commit history holds the moments where things changed. For VQASynth, that's how scenes get parsed, how captions get generated, and how spatial relations get encoded. Even before a model is in production, those milestones are a strong signal for which methods are semantically relevant to where the system is now.

Stage 2: Observational outcomes
Once a model is serving, the same commit history lines those changes up against real-world results. That opens up quasi-experiments (a minimal sketch follows below this post): you get causal evidence about which changes drove which outcomes, and inference on questions you haven't directly tested.

Stage 3: Controlled experiments
When teams start running interventions, those outcomes tighten the estimates further. This is the regime most people associate with rigor, but it's expensive and gated by traffic.

Stage 4: Counterfactual perturbations
When A/B testing becomes the operational bottleneck, instrumenting decision points in the production system lets you probe what would have happened under alternative choices. Shadow mode first, live traffic once audits pass.

Experimentation maturity is a journey, and every stage offers something to learn from. More on these ideas: https://docs.remyx.ai/concepts/maturity-progression
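As a rough illustration of the Stage 2 idea, here is a minimal sketch of a quasi-experiment keyed to a commit's ship date. The data and names are invented for illustration and this is not a Remyx API; a real analysis would pull the metric series from your telemetry and control for trend and seasonality before attributing the shift to the change.

```python
# Hedged sketch of a quasi-experiment around a deploy commit.
# All numbers are synthetic; real usage would load a daily metric
# series and split it at the commit timestamp.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=0.72, scale=0.03, size=30)  # 30 days pre-deploy
after = rng.normal(loc=0.75, scale=0.03, size=30)   # 30 days post-deploy

# Welch's t-test: did the post-deploy window shift relative to baseline?
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"effect: {after.mean() - before.mean():+.3f}, "
      f"t={t_stat:.2f}, p={p_value:.3f}")

# A production version would use an interrupted time-series regression
# to control for trend/seasonality, but the commit timestamp is what
# turns observational logs into a before/after comparison at all.
```

The split point comes from version control rather than from a planned experiment, which is exactly what makes it a quasi-experiment.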
posted an update 3 days ago
SciCrafter measured something AI practitioners have intuited: frontier agents are improving at executing inside well-framed problems, but lag at framing the problem in the first place. GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 all plateaued near 26% on a new Minecraft benchmark probing AI capabilities in the discovery-to-application loop.

So the authors ran targeted interventions:
* Hints about what to investigate doubled performance.
* A structured experimentation template added 7-14 more points.
* Structured consolidation beat free-form summaries by 6 points.
* Curriculum context beat independent task-solving.

These interventions helped the agent frame what's worth investigating and structure what gets learned so it compounds. The bottleneck for AI in scientific workflows is upstream of execution.

Their findings are congruent with the design patterns we've adopted at Remyx AI to help AI teams close the development loop scientifically. Agents work well inside structured loops, but they perform poorly when tasked with creating the structure. Instrumenting your scientific workflows offers greater leverage than scaling compute on a less informed search (a sketch of such a template follows below this post).

In the work of building production AI systems, teams are flying through execution. The bigger challenge is identifying which experiments moved which production outcome, or what to try next.

One of the more interesting results I found this week by tracking work in AI for scientific workflows using Remyx: https://engine.remyx.ai/papers/d8f23b9b-b14b-4ada-b44e-ccfc221c06b4
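For concreteness, here is a minimal sketch of what a structured experimentation template could look like. The schema is hypothetical: neither the SciCrafter paper nor the Remyx product is confirmed to use these fields; it just illustrates recording one intervention, one metric, and consolidated takeaways instead of free-form notes.

```python
# Hypothetical experiment-record template; field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str               # what you expect to happen, and why
    intervention: str             # the single change under test
    metric: str                   # how success is measured
    baseline: float               # metric value before the change
    result: float | None = None  # metric value after, filled in later
    takeaways: list[str] = field(default_factory=list)  # consolidated notes

    def consolidate(self) -> str:
        """Structured consolidation: the delta plus takeaways, not prose."""
        delta = ("pending" if self.result is None
                 else f"{self.result - self.baseline:+.3f}")
        return (f"{self.intervention} -> d({self.metric}): {delta}; "
                + "; ".join(self.takeaways))


# Usage with invented values:
rec = ExperimentRecord(
    hypothesis="Longer context improves spatial QA accuracy",
    intervention="raise max context 4k -> 8k",
    metric="val_accuracy",
    baseline=0.71,
)
rec.result = 0.74
rec.takeaways.append("gain concentrated on multi-object scenes")
print(rec.consolidate())
```

The point of the structure is that each record stays comparable across experiments, so what gets learned can compound the way the post describes.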

Organizations

Remyx AI