Django, Sympy and Sphinx generated datasets not released?

Feb 6

Hi,

The SERA paper section on repository specialization says:
"""
To emulate this scenario, we use SERA to generate data from the three largest repositories in
SWE-Bench Verified: Django, Sympy, and Sphinx...

Aggregating across commits, we obtain between 46,000 and 54,000 trajectories for each repository combined
across both rollouts. ... we train on 8,000 trajectories per repository rather than the full dataset; however, we
release all generated trajectories to enable future research to explore larger-scale specialization
"""

I did not see the SVG data / trajectories for Django, Sympy, or Sphinx in the 6 SERA-related datasets posted at https://huggingface.co/collections/allenai/open-coding-agents. Are you still planning to release this data? If so, do you have an estimated timeline?

Thank you,
Robert

ethanlshen

Ai2 org Feb 7

@robert-neoteny-ai Yes, we are planning to. Should have that up by Mon/Tues. Will ping when up!

robert-neoteny-ai

Feb 10

@ethanlshen, sorry to bug you on this, but is there any update on the timeline?

Thank you again,
Robert

ethanlshen

Ai2 org Feb 10

Hey @robert-neoteny-ai, I'm working on it today, so it will be up by EOD.

ethanlshen

Ai2 org Feb 11

@robert-neoteny-ai here's the link: https://huggingface.co/collections/allenai/open-coding-agents-specialization
Sorry for the delay! Let me know if you have any questions. We ran SVG twice per Sphinx function because Sphinx has a smaller codebase, and we noticed no degradation from doing this.

robert-neoteny-ai

Feb 12

@ethanlshen thank you so much. Really appreciate it!

ethanlshen changed discussion status to closed Feb 12
