hackathon-advisor / data /quest_dataset_card.md
JacobLinCool's picture
feat: add live project atlas
4791c0a verified

A newer version of the Gradio SDK is available: 6.17.3

Upgrade
metadata
license: apache-2.0
task_categories:
  - text-classification
  - text-generation
language:
  - en
tags:
  - hackathon-advisor
  - quest-classification
  - lora-sft
  - minicpm5
pretty_name: Hackathon Advisor Quest Classification SFT
size_categories:
  - n<1K

Hackathon Advisor — Quest Classification SFT Dataset

Supervised fine-tuning data that teaches MiniCPM5-1B to classify a Build Small Hackathon project against 13 judging dimensions from a two-segment README + app-file prompt, emitting strict JSON with short, source-attributed evidence. Trains the LoRA at build-small-hackathon/hackathon-advisor-quest-minicpm5-lora.

Format (quest_sft.jsonl)

Chat-JSONL. The first line is a lora_sft_manifest; every following line is a lora_sft_example with a messages list (system / user / assistant). The assistant turn is exactly one JSON object:

{"matches":[{"quest":"...","confidence":0.0,"evidence":"...","source":"readme|app_file"}]}

No markdown, no prose, no renamed quests; an empty matches list when no dimension has clear evidence. The user turn splits the project into a [README] segment and an [APP_FILE] segment so the model judges product description and implementation evidence separately and attributes each match to its source.

Quest dimensions (13)

Six merit badges (Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes), two tracks (Backyard AI, Thousand Token Wood), and five sponsor / special awards (OpenBMB, Nemotron, Modal, Tiny Titan, Best Agent).

Examples: 156 (14 with empty matches)

variant count
natural 108
app_only 16
missing_app_file 16
noisy_metadata 8
contradiction 6
empty 2

Positive examples per quest:

quest examples
Off the Grid 87
Off-Brand 59
Tiny Titan 58
Thousand Token Wood 49
Llama Champion 35
Backyard AI 35
Well-Tuned 31
OpenBMB 26
Sharing is Caring 19
Nemotron 18
Field Notes 15
Modal 14
Best Agent 14

Provenance

Built from the real public Spaces of the build-small-hackathon org: 125 crawled projects → deduped + length-filtered to 108 content-rich ones → labelled by a teacher-then-adversarial-verifier multi-agent workflow → plus targeted augmentations (app-only, readme-only / missing app file, README↔app contradictions, empty matches, noisy metadata). labeled.json holds the per-project verified labels. Examples are derived from public hackathon submissions for research and hackathon use; each project remains under its own Space license.