Django, Sympy and Sphinx generated datasets not released?

#4
by robert-neoteny-ai - opened

Hi,

The SERA paper section on repository specialization says:
"""
To emulate this scenario, we use SERA to generate data from the three largest repositories in
SWE-Bench Verified: Django, Sympy, and Sphinx...

Aggregating across commits, we obtain between 46,000 and 54,000 trajectories for each repository combined
across both rollouts. ... we train on 8,000 trajectories per repository rather than the full dataset; however, we
release all generated trajectories to enable future research to explore larger-scale specialization
"""

I did not see the SVG data or trajectories for Django, Sympy, or Sphinx in the six SERA-related datasets posted at https://huggingface.co/collections/allenai/open-coding-agents. Are you still planning to release this data? If so, do you have an estimated timeline?

Thank you,
Robert

Ai2 org

@robert-neoteny-ai Yes, we are planning to. Should have that up by Mon/Tues. Will ping when up!

@ethanlshen, sorry to bug you on this, but is there any update on the timeline?

Thank you again,
Robert

Hey @robert-neoteny-ai I'm working on it today, so it will be up by EOD

@robert-neoteny-ai here's the link: https://huggingface.co/collections/allenai/open-coding-agents-specialization
Sorry for the delay! Let me know if you have any questions. We ran SVG twice per Sphinx function because Sphinx has a smaller codebase. We noticed no degradation from doing this.

@ethanlshen thank you so much. Really appreciate it!

ethanlshen changed discussion status to closed
