Our problem is as follows: we need to move our individual JSONs + JSONLs to HuggingFace, as it's just a better structure and we might hit storage limits otherwise. New data would be submitted via drag-and-drop through the file-upload interface of HuggingFace datasets, which is for all purposes just git.
Aspirationally, the validation + de-duplication workflow is able to:
(1) Detect changes (i.e. only data added during the PR),
(2) Run de-duplication of the PR against the existing datastore,
(3) Run validation only on the added data (one approach here would use git diff), and
(4) Add the validated data back to the datastore.
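The four steps above can be sketched as a pipeline of small functions. This is a hypothetical skeleton only: the function names and signatures are placeholders, not part of the existing repo.

```python
# Hypothetical skeleton of the four aspirational steps; names are placeholders.

def detect_changes(pr_files: set, main_files: set) -> set:
    """Step (1): files that exist only in the PR revision."""
    return pr_files - main_files

def deduplicate(new_files: list, manifest: dict) -> list:
    """Step (2): drop files whose hash already appears in the datastore manifest."""
    return [f for f in new_files if f not in manifest]

def validate(files: list) -> dict:
    """Step (3): run validation only on the added data (stub: always passes)."""
    return {f: True for f in files}

def merge_to_datastore(files: list, results: dict) -> list:
    """Step (4): keep only files that validated cleanly."""
    return [f for f in files if results.get(f)]
```

Each step stays independently testable, which matters once this runs as CI inside the Space.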
The HuggingFace dataset -> https://huggingface.co/datasets/evaleval/EEE_datastore (evaleval/EEE_datastore)
Both this repo and the dataset can be managed by git.
- repo structure

data/
└── {eval_name}/
    └── {developer_name}/
        └── {model_name}/
            ├── {uuid}.json
            └── {uuid}_samples.jsonl
validate_data.py
eval_types.py
instance_level_types.py
There are typically two types: aggregate information as .json, and instance-level samples as .jsonl (having jsonl in the name), both of which need to be validated.
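To make the two file types concrete, here is a minimal, stdlib-only shape check. The real pipeline would use the pydantic models in eval_types.py / instance_level_types.py instead; the required keys below are illustrative assumptions, not the actual schema.

```python
import json

# Illustrative required keys for an aggregate record -- an assumption,
# not the real schema from eval_types.py.
AGGREGATE_KEYS = {"eval_name", "developer_name", "model_name"}

def validate_aggregate(text: str) -> list:
    """Return a list of problems with an aggregate .json payload (empty = ok)."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    return [f"missing key: {k}" for k in sorted(AGGREGATE_KEYS - record.keys())]

def validate_instances(text: str) -> list:
    """Validate a .jsonl payload: every non-empty line must be a JSON object."""
    errors = []
    for i, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: invalid JSON")
            continue
        if not isinstance(obj, dict):
            errors.append(f"line {i}: expected an object")
    return errors
```

Returning a list of errors (rather than raising on the first one) lets the report later name every problem in a file at once.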
Users will add data via the upload functionality, which opens a PR.
First request: build a Dockerfile with uv in which the validation will be run; gather all dependencies into a requirements.txt that can be installed with uv (e.g. uv pip install -r requirements.txt or similar).
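A minimal sketch of such a Dockerfile, assuming a plain Python base image and the uv binary copied from Astral's official image (the base tag and file list are assumptions to adjust):

```dockerfile
# Sketch only: base image, tag, and copied files are assumptions.
FROM python:3.11-slim

# Pull the uv binary from Astral's official distroless image.
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

WORKDIR /app

# Install pinned dependencies into the system environment with uv.
COPY requirements.txt .
RUN uv pip install --system -r requirements.txt

# Only the validation code needs to live in the image.
COPY validate_data.py eval_types.py instance_level_types.py ./

CMD ["python", "validate_data.py"]
```

Copying requirements.txt before the source files keeps the dependency layer cached across code-only rebuilds.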
Regarding the workflow, implement the following:
(1) Detect changes and pull to the Space
- When a user/external collaborator opens a PR on the HF dataset, we wake the Space or trigger it via webhook (or a better procedure if you can recommend one). Following this, we use git diff (or something better) on the HuggingFace Space to find which files have been added or modified.
Then download only what has changed, using the huggingface_hub API to download specific files from the dataset, or some fine-grained git operation.
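One possible approach without git diff: PRs on the Hub are exposed as revisions (refs/pr/N), so listing the repo files at the PR revision and at main and diffing the two sets yields the added files, which hf_hub_download can then fetch individually. The hub imports are deferred inside the functions so the pure diff logic stays testable without network access; note this set diff catches additions only, and modified files would need a hash- or commit-level comparison.

```python
REPO_ID = "evaleval/EEE_datastore"

def added_files(pr_files, main_files):
    """Pure diff: files present in the PR revision but not on main."""
    return sorted(set(pr_files) - set(main_files))

def changed_files_in_pr(pr_num: int):
    """List files added by a PR, via the hub API (requires huggingface_hub)."""
    from huggingface_hub import HfApi  # deferred: needs network + the package
    api = HfApi()
    pr = api.list_repo_files(REPO_ID, repo_type="dataset",
                             revision=f"refs/pr/{pr_num}")
    main = api.list_repo_files(REPO_ID, repo_type="dataset")
    return added_files(pr, main)

def fetch_file(filename: str, pr_num: int) -> str:
    """Download one changed file from the PR revision; returns a local path."""
    from huggingface_hub import hf_hub_download  # deferred
    return hf_hub_download(REPO_ID, filename, repo_type="dataset",
                           revision=f"refs/pr/{pr_num}")
```

This downloads only the changed files, keeping the Space's disk footprint small.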
Following this, run validate_data.py against the schema, or use eval_types/instance_level_types for validation with pydantic, whichever you think is more robust and efficient.
(2) Maintain a manifest containing some form of sha256 hashes; compute new hashes for each whole JSON, compare them to the manifest, and if there's a near collision (99% similar or identical) write a .txt or .md (whichever is easier) that flags potential duplicates.
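One caveat worth stating up front: sha256 hashes alone can only catch exact duplicates, because a hash changes completely when one byte changes. Detecting "99% similar" needs the underlying text (or a similarity-preserving fingerprint such as minhash). The sketch below hashes a canonicalised JSON dump so key order doesn't matter, and, as an illustrative assumption, keeps the canonical text in the manifest alongside the hash so near-duplicates can be scored with difflib.

```python
import hashlib
import json
from difflib import SequenceMatcher

def canonicalise(text: str) -> str:
    """Stable serialisation: sorted keys, no whitespace, so key order is irrelevant."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def canonical_hash(text: str) -> str:
    """sha256 over the canonical form -- catches exact duplicates only."""
    return hashlib.sha256(canonicalise(text).encode()).hexdigest()

def near_duplicates(text: str, manifest: dict, threshold: float = 0.99) -> list:
    """Score the new file against stored canonical texts.

    `manifest` maps filename -> canonical JSON text (an assumption; a
    hash-only manifest cannot support similarity checks). Returns
    (filename, ratio) pairs at or above the threshold.
    """
    canon = canonicalise(text)
    flagged = []
    for name, stored in manifest.items():
        ratio = SequenceMatcher(None, canon, stored).ratio()
        if ratio >= threshold:
            flagged.append((name, round(ratio, 4)))
    return flagged
```

difflib is fine at this scale; if the datastore grows large, a minhash/LSH index would avoid the pairwise comparisons.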
(3) Write back a text file containing the changes/upload information (always with some sort of unique name), recording that everything was validated or, if something failed, which files.
The main thing is that all of this needs to run in the Space (as a proxy for CI), then update only the necessary data and clear the Space of everything else.