Commit
·
af5b05b
1
Parent(s):
9f73132
Update README.md
Updated project name and clarified project goals and tasks related to LLM-based data format conversion and metadata harmonization.
README.md
CHANGED
|
@@ -1,96 +1,86 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
This is the repository for
|
|
|
|
|
|
|
| 4 |
|
| 5 |
## Project Aim
|
| 6 |
|
| 7 |
-
The goal of this project is to explore
|
| 8 |
|
| 9 |
-
While the BIDS standard ensures interoperability
|
| 10 |
|
| 11 |
-
In this proof-of-concept, we aim to determine whether a coordinated system of AI agents can reliably execute these operations and produce AI-ready dataset collections, similar to those hosted on platforms like HuggingFace Datasets: OpenMind, with minimal human intervention.
|
| 12 |
|
| 13 |
## The Vision
|
| 14 |
|
| 15 |
-
If successful,
|
| 16 |
|
| 17 |
## Rough Work Plan
|
| 18 |
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
### 1. Agent Design
|
| 22 |
-
|
| 23 |
-
We will prototype a multi-agent system where each agent handles a specific wrangling subtask (e.g., normalization, QC, encoding). Agents will communicate via a shared memory and task queue.
|
| 24 |
-
|
| 25 |
-
Goals:
|
| 26 |
-
|
| 27 |
-
Design a modular agentic framework (e.g., LangChain, CrewAI, or AutoGen)
|
| 28 |
-
|
| 29 |
-
Define prompt templates and tools for each subtask
|
| 30 |
-
|
| 31 |
-
Implement logging and reasoning traceability
|
| 32 |
-
|
| 33 |
-
Resources:
|
| 34 |
-
|
| 35 |
-
LangChain Agent Docs
|
| 36 |
-
|
| 37 |
-
OpenAI Function Calling Guide
|
| 38 |
-
|
| 39 |
-
### 2. Dataset Exploration
|
| 40 |
-
|
| 41 |
-
OpenNeuro
|
| 42 |
-
|
| 43 |
-
Tasks:
|
| 44 |
-
|
| 45 |
-
Retrieve metadata and BIDS structures
|
| 46 |
-
|
| 47 |
-
Generate schema summaries and compatibility maps
|
| 48 |
-
|
| 49 |
-
Identify data types for downstream analysis
|
| 50 |
-
|
| 51 |
-
### 3. Wrangling Task Automation
|
| 52 |
-
|
| 53 |
-
The central track: enabling LLM-driven automation of the following operations:
|
| 71 |
-
Metrics will include:
|
| 1 |
+
# Open Data BIDSifier: LLM-based Data Format Conversion
|
| 2 |
|
| 3 |
+
This is the repository for OpenMind Wranglers, an exploratory project inspired by the accelerating progress of LLMs and their potential to automate format conversion for multi-study neuroimaging and clinical datasets.
|
| 4 |
+
|
| 5 |
+
## Motivation
|
| 6 |
|
| 7 |
## Project Aim
|
| 8 |
|
| 9 |
+
The goal of this project is to explore whether LLM-based workflow machines (sometimes called "AI agents") can meaningfully assist in metadata harmonization, data format transformation, and data preprocessing tasks that precede AI model inference or training. These tasks are time-consuming yet essential to reproducible, large-scale neuroimaging research.
|
| 10 |
|
| 11 |
+
While the BIDS standard ensures interoperability, some datasets have no BIDS annotation. Such "dead data" cannot be used on a par with BIDS datasets.
|
| 12 |
|
| 13 |
+
In this proof of concept, we aim to determine whether a coordinated system of AI agents can reliably execute these operations and, with minimal human intervention, produce AI-ready dataset collections similar to those hosted on platforms like HuggingFace Datasets (for example, [OpenMind](https://huggingface.co/datasets/AnonRes/OpenMind)).
|
| 14 |
|
| 15 |
## The Vision
|
| 16 |
|
| 17 |
+
If successful, Open Data Wrangler will serve as a foundation for harmonizing multi-study open datasets so that they are immediately usable for machine learning and statistical analysis.
|
| 18 |
|
| 19 |
## Rough Work Plan
|
| 20 |
|
| 21 |
+
We will focus on data and metadata harmonization and on evaluating the results. For the evaluation, a manual baseline will be prepared by humans.
|
| 22 |
|
| 23 |
+
### 1. Metadata harmonization
|
| 24 |
|
| 25 |
+
Zenodo and other data portals are rich in non-BIDS neuroimaging data. As a starting point, we suggest these datasets:
|
| 26 |
+
- [UniToBrain](https://zenodo.org/records/5109415)
|
| 27 |
+
- [Cranial CT of 1 patient](https://zenodo.org/records/16816)
|
| 28 |
+
- [BraTS 2020](https://www.kaggle.com/datasets/awsaf49/brats2020-training-data)
|
| 29 |
+
- [Macaque neurodevelopment database](https://data.kitware.com/#collection/54b582c38d777f4362aa9cb3)
|
| 30 |
|
| 31 |
+
The goal is to harmonize them with the following BIDS datasets:
|
| 32 |
+
- [Sexing the parental brain in shopping: an fMRI study](https://doi.org/10.18112/openneuro.ds006844.v1.0.0)
|
| 33 |
+
- [Cross-modal Hierarchical Control](https://openneuro.org/datasets/ds006628/versions/1.0.1)
|
| 34 |
|
| 35 |
+
as well as any other OpenNeuro dataset.
|
| 36 |
|
| 37 |
+
The six work packages (WPs) are as follows.
|
| 38 |
+
Harmonizing **metadata** *with LLM-based tools*:
|
| 39 |
+
1. Annotation column names (from non-BIDS to BIDS)
|
| 40 |
+
2. File structure
|
| 41 |
+
3. Study metadata (fetching from repository HTMLs too)
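As a rough illustration of the first work package, the sketch below renames annotation columns to BIDS-style `participants.tsv` names. The source column names (`PatientID`, `Age`, `Gender`, `Scanner`) and the mapping itself are made up for the example; a real dataset needs its own mapping, whether proposed by an LLM or written by hand, and unmapped columns would need a decision rather than being dropped silently.

```python
import csv
import io

# Hypothetical mapping from source column names to BIDS participants.tsv names.
COLUMN_MAP = {"PatientID": "participant_id", "Age": "age", "Gender": "sex"}


def to_bids_participant_id(raw):
    """Prefix an identifier with 'sub-' and drop non-alphanumeric characters."""
    label = "".join(ch for ch in str(raw) if ch.isalnum())
    return f"sub-{label}"


def harmonize_columns(rows, column_map=COLUMN_MAP):
    """Rename mapped columns, drop unmapped ones, and normalize participant_id."""
    out = []
    for row in rows:
        new_row = {column_map[k]: v for k, v in row.items() if k in column_map}
        new_row["participant_id"] = to_bids_participant_id(new_row["participant_id"])
        out.append(new_row)
    return out


# Made-up source table standing in for a non-BIDS annotation file.
source = io.StringIO("PatientID,Age,Gender,Scanner\nP_001,34,F,A\nP_002,51,M,B\n")
rows = harmonize_columns(list(csv.DictReader(source)))

# Write the result as tab-separated text, as BIDS expects for participants.tsv.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["participant_id", "age", "sex"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
participants_tsv = buffer.getvalue()
print(participants_tsv)
```

The same mapping dictionary can be used as the target an LLM is asked to produce, which makes the manual and LLM-assisted results directly comparable.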
|
| 42 |
|
| 43 |
+
For the LLM-assisted workflow, the following **tools** are suggested: GitHub Copilot in VS Code, LLMAnything.
|
| 44 |
+
Pick **any LLM**. Smaller LLMs tend to hallucinate more, so it is all the more interesting to see whether they can manage the task too!
|
| 45 |
+
Suggested larger LLMs: GPT-5, Claude, Kimi-K2, DeepSeek-R1
|
| 46 |
+
Suggested smaller LLMs: SmolLM, LLaMA-7B, Qwen-7B
|
| 47 |
|
| 48 |
+
Harmonizing **metadata** *by hand*:
|
| 49 |
+
4. Annotation column names (from non-BIDS to BIDS)
|
| 50 |
+
5. File structure
|
| 51 |
+
6. Study metadata (fetching from repository HTMLs too)
|
| 52 |
|
| 53 |
+
For the manual harmonization, an IDE with Python or R is useful, as is OpenRefine, an open-source tool for working with tabular data.
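For the file-structure work package, a small helper can make the target layout explicit. This is a minimal sketch that covers only the `sub-` and `ses-` entities; the function name `bids_path` and its parameters are illustrative and not part of any BIDS library.

```python
from pathlib import PurePosixPath


def bids_path(subject, suffix, datatype, session=None, extension=".nii.gz"):
    """Build a relative BIDS path such as sub-01/anat/sub-01_T1w.nii.gz."""
    entities = [f"sub-{subject}"]
    folders = [f"sub-{subject}"]
    if session is not None:
        # Sessions add both a folder level and a filename entity.
        entities.append(f"ses-{session}")
        folders.append(f"ses-{session}")
    filename = "_".join(entities + [suffix]) + extension
    return PurePosixPath(*folders, datatype, filename)


print(bids_path("01", "T1w", "anat"))
print(bids_path("01", "T1w", "anat", session="pre"))
```

Other entities (task, run, acquisition, and so on) would have to be added in the order the BIDS specification prescribes.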
|
| 54 |
|
| 55 |
|
| 56 |
+
Record the problems encountered and the working time for both the manual and the LLM-assisted harmonization.
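One lightweight way to keep such a record is a context manager that times each step and collects notes. This is only a sketch: the names `timed_step` and `work_log` are hypothetical, and the problem note in the example is invented for illustration.

```python
import time
from contextlib import contextmanager

work_log = []  # one entry per timed step


@contextmanager
def timed_step(work_package, mode):
    """Time a harmonization step and collect problems noted during it."""
    entry = {"work_package": work_package, "mode": mode, "problems": []}
    start = time.perf_counter()
    try:
        # The caller appends free-text problem notes to this list.
        yield entry["problems"]
    finally:
        entry["seconds"] = round(time.perf_counter() - start, 3)
        work_log.append(entry)


with timed_step("column names", "llm") as problems:
    problems.append("example note: 'Gender' was not mapped to 'sex'")

with timed_step("column names", "manual") as problems:
    pass  # the manual work would happen here

for entry in work_log:
    print(entry["work_package"], entry["mode"], entry["seconds"], entry["problems"])
```

For working sessions that span hours, writing start and end timestamps to a shared spreadsheet works just as well; the point is to record the same fields for both modes.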
|
| 57 |
|
| 58 |
+
Time planned: ~5 hours.
|
| 59 |
|
| 60 |
+
### 2. Evaluation of the harmonized metadata
|
| 61 |
|
| 62 |
+
Let's see how well the harmonization went!
|
| 63 |
+
Run the [BIDS validator](https://bids-standard.github.io/bids-validator/) on the newly converted datasets and report whether:
|
| 64 |
+
1. The manual harmonization is BIDS compliant.
|
| 65 |
+
2. The semi-automatic harmonization is BIDS compliant.
|
| 66 |
|
| 67 |
+
3. Assess the differences in the harmonized metadata.
|
| 68 |
+
4. Try to "stack" the BIDS-converted datasets on existing BIDS datasets, and report the errors.
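Steps 3 and 4 can be approximated with a plain column and value comparison between two harmonized tables. The sketch below assumes both tables are lists of dictionaries that share a `participant_id` key; the function names and the toy rows are made up for the example.

```python
def columns(rows):
    """Return the set of column names used across all rows."""
    return {name for row in rows for name in row}


def compare_tables(manual, automatic, key="participant_id"):
    """Report column and cell-level differences between two harmonized tables."""
    report = {
        "only_manual": sorted(columns(manual) - columns(automatic)),
        "only_automatic": sorted(columns(automatic) - columns(manual)),
        "value_mismatches": [],
    }
    auto_by_key = {row[key]: row for row in automatic}
    shared = columns(manual) & columns(automatic)
    for row in manual:
        other = auto_by_key.get(row[key])
        if other is None:
            continue  # participant missing from the automatic table
        for name in sorted(shared):
            if row.get(name) != other.get(name):
                report["value_mismatches"].append(
                    (row[key], name, row.get(name), other.get(name))
                )
    return report


# Toy rows: the automatic result kept 'gender' instead of the BIDS name 'sex'.
manual = [{"participant_id": "sub-01", "age": "34", "sex": "F"}]
automatic = [{"participant_id": "sub-01", "age": "34", "gender": "F"}]
report = compare_tables(manual, automatic)
print(report)
```

The `only_manual` and `only_automatic` lists also give a first hint of what will break when a converted dataset is stacked on an existing BIDS dataset, since mismatched column names prevent a clean concatenation.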
|
| 69 |
|
| 70 |
+
Time planned: ~4 hours.
|
| 71 |
|
| 72 |
+
### 3. Try in action
|
| 73 |
|
| 74 |
+
Use [ResEncL](https://huggingface.co/AnonRes/ResEncL-OpenMind-MAE) or another model of your choice with JAX, and feed it the resulting harmonized dataset. Analyse and record the bugs.
|
| 75 |
|
| 76 |
+
Time planned: ~4 hours.
|
| 77 |
|
| 78 |
+
### Teams and Participants / Skills
|
| 79 |
|
| 80 |
+
Anyone from master's students to experienced scientists is welcome to join. The project will involve a lot of tabular data analysis and scripting, so coding experience in Python, R, and shell is useful.
|
| 81 |
+
For working with LLM agents, prompt engineering skills are useful, though they can also be acquired during this project.
|
| 82 |
|
| 83 |
+
### References / Prior reading
|
| 84 |
|
| 85 |
+
- [What is an AI agent](https://blog.langchain.com/how-to-think-about-agent-frameworks/)
|
| 86 |
+
- [Prompting guide from Meta](https://www.llama.com/docs/how-to-guides/prompting/)
|