Missing Mid Training Data

#1
by jayomb - opened

When I click the mid-training data link, I get a 404 page. Was it removed, or has it not been uploaded yet?
https://huggingface.co/datasets/osunlp/QUEST-Mid-Training-Data

OSU NLP Group org

Thanks for the question! The mid-training data is a bit tricky because it contains raw HTML content, which might raise legal concerns. We are still looking for the best way to handle this. It will be released once we find a suitable solution.

Hi @hsaest,

Thanks for the clarification! Since we are reproducing the mid-training data, could we double-check if we can reconstruct the two atomic tasks using successful search trajectories via the following setup?

Context Summarization: Slice trajectories into checkpoints. Pass historical interactions as events and the last state as prev_state, then use your open-sourced Quest Prompt to distill the target JSON.
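
To make the setup above concrete, here is a minimal sketch of how the checkpoint slicing could look. It assumes a trajectory is a plain list of event dicts and that checkpoints are fixed-size windows; the names `build_summarization_inputs`, `window`, and `summarize` are hypothetical, and `summarize` only stands in for the model call that applies the Quest Prompt. It is not taken from the QUEST codebase.

```python
from typing import Any, Callable, Optional

Event = dict[str, Any]
State = dict[str, Any]


def build_summarization_inputs(
    trajectory: list[Event],
    window: int,
    summarize: Callable[[list[Event], Optional[State]], State],
) -> list[dict[str, Any]]:
    """Slice a trajectory into checkpoints and build (events, prev_state, target) examples.

    `summarize` is a hypothetical stand-in for the LLM call that distills the
    target JSON from the events and the previous state.
    """
    if window <= 0:
        raise ValueError("window must be positive")

    examples = []
    prev_state: Optional[State] = None  # no summary exists before the first checkpoint
    for start in range(0, len(trajectory), window):
        events = trajectory[start:start + window]
        target = summarize(events, prev_state)
        examples.append({"events": events, "prev_state": prev_state, "target": target})
        prev_state = target  # the next checkpoint conditions on this summary
    return examples
```

Each example carries the state produced at the previous checkpoint, so the first one has `prev_state` set to `None`.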

Relevant Information Extraction: Extract triplets of (webpage content, extraction goal, extracted content) directly from the tool cache, and apply Jaccard similarity (0.1) to deduplicate the goals.
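
A rough sketch of the goal deduplication, under stated assumptions: Jaccard similarity is computed over lowercase word sets, and 0.1 is read as the maximum similarity a new goal may have with any goal already kept. Neither the tokenization nor that reading of the threshold is confirmed in this thread, and the triplet keys (`page`, `goal`, `extracted`) are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word-level token set (an assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_by_goal(triplets: list[dict[str, str]], threshold: float = 0.1) -> list[dict[str, str]]:
    """Keep a triplet only if its goal is at most `threshold` similar to every kept goal."""
    kept: list[dict[str, str]] = []
    kept_tokens: list[set[str]] = []
    for triplet in triplets:
        tokens = tokenize(triplet["goal"])
        if all(jaccard(tokens, seen) <= threshold for seen in kept_tokens):
            kept.append(triplet)
            kept_tokens.append(tokens)
    return kept
```

This greedy pass is order-dependent and quadratic in the number of goals, which should be acceptable for a tool cache of modest size.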

One quick question on the input data: Since your paper mentions using raw HTML content for the extraction task, would substituting it with cleaned/processed webpage text (e.g., Markdown/Text format with headers/footers removed) significantly impact the model's grounding and data triage capabilities during target-only loss training?

Thanks for your insights!

OSU NLP Group org
edited 13 days ago

Hi @danqiao-cuhk,

We have released the context summarization data at https://huggingface.co/datasets/osunlp/QUEST-Mid-Training-Data, along with a minimal example of Relevant Information Extraction.

For Context Summarization reconstruction, you could refer to our data directly. The input is the "historical session", and the output is the summarized content generated by the condenser.
For Relevant Information Extraction, your understanding is correct.

Our raw HTML content is preprocessed by Jina; the output includes the page title, the URL, and the Markdown content. You can find an example at https://jina.ai/api-dashboard/reader:
Title: Example Domain

URL Source: https://www.example.com/

Published Time: Fri, 19 Jun 2026 18:46:03 GMT

Warning: This is a cached snapshot of the original page, consider retry with caching opt-out.

Markdown Content:
This domain is for use in documentation examples without needing permission. Avoid use in operations.

Learn more
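
For anyone reconstructing the extraction inputs, the sample above can be split into its fields with a few lines of Python. This is only a sketch that assumes the layout shown in the example (header lines of the form `Key: value`, followed by a `Markdown Content:` marker); the function name `parse_reader_output` is hypothetical and not part of any Jina library.

```python
from typing import Optional


def parse_reader_output(text: str) -> dict[str, Optional[str]]:
    """Split reader-style output into title, URL, and Markdown body."""
    # Everything before the marker is treated as header lines, the rest as the body.
    header, _, body = text.partition("Markdown Content:")
    fields: dict[str, str] = {}
    for line in header.splitlines():
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, colon, value = line.partition(":")
        if colon and value.strip():
            fields[key.strip()] = value.strip()
    return {
        "title": fields.get("Title"),
        "url": fields.get("URL Source"),
        "markdown": body.strip(),
    }
```

If the marker is missing, the function returns an empty `markdown` string instead of raising an error.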

Let me know if you have any further questions! Thanks for your interest in our work!

Hi @hsaest,

Thank you so much for your great efforts in open-sourcing this! Much appreciated. 🚀
