stefanches7 committed on
Commit af5b05b · 1 Parent(s): 9f73132

Update README.md

Updated project name and clarified project goals and tasks related to LLM-based data format conversion and metadata harmonization.

Files changed (1)
  1. README.md +52 -62
README.md CHANGED
@@ -1,96 +1,86 @@
1
- # OpenMind Wrangler: AI Agents for Data Preparation in the AI 3.0 Era
2
 
3
- This is the repository for the exploratory project OpenMind Wranglers, inspired by the accelerating progress of AI 3.0 models and their potential to automate complex data engineering workflows for multi-study neuroimaging and clinical datasets.
 
 
4
 
5
  ## Project Aim
6
 
7
- The goal of this project is to explore whether AI agents (primarily LLM-based) can meaningfully assist in dataset wrangling tasks that precede model inference or training: tasks that are time-consuming yet essential to reproducible, large-scale neuroimaging research.
8
 
9
- While the BIDS standard ensures interoperability on the metadata level, many preprocessing steps (such as volume normalization, quality control, outlier detection, or label encoding) remain manual or semi-automated. These steps form a bottleneck in data-driven science, especially when aggregating datasets across multiple studies.
10
 
11
- In this proof-of-concept, we aim to determine whether a coordinated system of AI agents can reliably execute these operations and produce AI-ready dataset collections, similar to those hosted on platforms like HuggingFace Datasets: OpenMind, with minimal human intervention.
12
 
13
  ## The Vision
14
 
15
- If successful, OpenMind Wrangler will serve as a foundation for a general-purpose AI data engineering assistant, capable of producing multi-study datasets that are immediately usable for machine learning and statistical analysis, without manual wrangling.
16
 
17
  ## Rough Work Plan
18
 
19
- The project will proceed along four parallel tracks: agent design, dataset exploration, wrangling task automation, and evaluation.
20
-
21
- ### 1. Agent Design
22
-
23
- We will prototype a multi-agent system where each agent handles a specific wrangling subtask (e.g., normalization, QC, encoding). Agents will communicate via a shared memory and task queue.
24
-
25
- Goals:
26
-
27
- Design a modular agentic framework (e.g., LangChain, CrewAI, or AutoGen)
28
-
29
- Define prompt templates and tools for each subtask
30
-
31
- Implement logging and reasoning traceability
32
-
33
- Resources:
34
-
35
- LangChain Agent Docs
36
-
37
- OpenAI Function Calling Guide
38
-
39
- ### 2. Dataset Exploration
40
-
41
- OpenNeuro
42
-
43
- Tasks:
44
-
45
- Retrieve metadata and BIDS structures
46
-
47
- Generate schema summaries and compatibility maps
48
-
49
- Identify data types for downstream analysis
50
-
51
- ### 3. Wrangling Task Automation
52
-
53
- The central track: enabling LLM-driven automation of the following operations:
54
 
55
- Volume normalization (using NiBabel / ANTsPy)
56
 
57
- Quality control report generation
 
 
 
 
58
 
59
- Outlier detection (statistical and visual)
 
 
60
 
61
- Label harmonization and encoding
62
 
63
- Data documentation generation (Markdown / JSON-LD)
 
 
 
 
64
 
65
- The focus is not on achieving perfection, but on evaluating how well an AI agent can assist or autonomously perform these tasks, given contextual metadata and goals.
 
 
 
66
 
67
- ### 4. Evaluation and Benchmarking
 
 
 
68
 
69
- The evaluation phase will assess the efficiency, accuracy, and robustness of agentic wrangling workflows compared to traditional, human-coded pipelines.
70
 
71
- Metrics will include:
72
 
73
- Time saved per dataset compared to manual pipelines
74
 
75
- Accuracy of normalization / encoding
76
 
77
- Error detection rate (false positives / negatives in QC)
78
 
79
- Consistency across heterogeneous datasets
 
 
 
80
 
81
- We will also benchmark against a simple scripted baseline (e.g., manual nipype or pandas pipeline).
 
82
 
83
- Milestones
84
 
85
- ✅ Literature & tooling review on AI-assisted data wrangling
86
 
87
- ⚙️ Prototype LLM agent framework for dataset preprocessing
88
 
89
- 🧠 Evaluation on 2–3 public BIDS datasets
90
 
91
- 📊 Quantitative + qualitative benchmarking report
92
 
 
 
93
 
94
- Generative augmentation: Can the agent propose synthetic data to fill gaps or balance classes?
95
 
96
- Teams and Participants
 
 
1
+ # Open Data BIDSifier: LLM-based data format conversion
2
 
3
+ This is the repository for an exploratory project OpenMind Wranglers, inspired by the accelerating progress of LLMs and their potential to automate format conversion for multi-study neuroimaging and clinical datasets.
4
+
5
+ ## Motivation
6
 
7
  ## Project Aim
8
 
9
+ The goal of this project is to explore whether LLM-based workflow machines (sometimes called "AI agents") can meaningfully assist in metadata harmonization, data format transformation, and data preprocessing tasks that precede AI model inference or training. These tasks are time-consuming yet essential to reproducible, large-scale neuroimaging research.
10
 
11
+ While the BIDS standard ensures interoperability, some datasets have no BIDS annotation available. Such "dead data" cannot be used on par with BIDS datasets.
12
 
13
+ In this proof-of-concept, we aim to determine whether a coordinated system of AI agents can reliably execute these operations and produce AI-ready dataset collections, similar to those hosted on platforms like HuggingFace Datasets: [OpenMind](https://huggingface.co/datasets/AnonRes/OpenMind), with minimal human intervention.
14
 
15
  ## The Vision
16
 
17
+ If successful, Open Data Wrangler will serve as a foundation for harmonizing multi-study datasets from open data, making sure these are immediately usable for machine learning and statistical analysis.
18
 
19
  ## Rough Work Plan
20
 
21
+ We will focus on data and metadata harmonization and on evaluating the results. For the evaluation, a manual baseline will be prepared by hand.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
22
 
23
+ ### 1. Metadata harmonization
24
 
25
+ Zenodo and other data portals are rich in non-BIDS neuroimaging data. As a starting point, the following datasets are suggested:
26
+ - [UniToBrain](https://zenodo.org/records/5109415)
27
+ - [Cranial CT of 1 patient](https://zenodo.org/records/16816)
28
+ - [BraTS 2020](https://www.kaggle.com/datasets/awsaf49/brats2020-training-data)
29
+ - [Macaque neurodevelopment database](https://data.kitware.com/#collection/54b582c38d777f4362aa9cb3)
30
 
31
+ The goal is to harmonize them with the following BIDS datasets:
32
+ - [Sexing the parental brain in shopping: an fMRI study](https://doi.org/10.18112/openneuro.ds006844.v1.0.0)
33
+ - [Cross-modal Hierarchical Control](https://openneuro.org/datasets/ds006628/versions/1.0.1)
34
 
35
+ and really any other OpenNeuro dataset.
36
 
37
+ The six work packages (WPs) are the following:
38
+ Harmonizing **metadata** *with LLM based tools*:
39
+ 1. Annotation column names (from non-BIDS to BIDS)
40
+ 2. File structure
41
+ 3. Study metadata (fetching from repository HTMLs too)
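
As an illustration of work package 1, the sketch below renames annotation columns to the names BIDS uses in `participants.tsv`. It is a minimal example under stated assumptions: the source column names (`PatientID`, `Age_years`, `Gender`) and the sample rows are invented, and in the LLM-assisted workflow the mapping dictionary is the part a model would be asked to propose.

```python
import csv
import io

# Hypothetical mapping from source column names to BIDS participants.tsv names.
# The BIDS-side names (participant_id, age, sex) come from the BIDS
# specification; the source-side names are invented for illustration.
COLUMN_MAP = {
    "PatientID": "participant_id",
    "Age_years": "age",
    "Gender": "sex",
}


def to_participants_tsv(source_csv: str) -> str:
    """Rename mapped columns, prefix IDs with 'sub-', and return TSV text."""
    reader = csv.DictReader(io.StringIO(source_csv))
    # Columns without a mapping keep their original name.
    fieldnames = [COLUMN_MAP.get(name, name) for name in reader.fieldnames]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        renamed = {COLUMN_MAP.get(key, key): value for key, value in row.items()}
        pid = renamed.get("participant_id", "")
        # BIDS participant identifiers have the form sub-<label>.
        if pid and not pid.startswith("sub-"):
            renamed["participant_id"] = f"sub-{pid}"
        writer.writerow(renamed)
    return out.getvalue()


example = "PatientID,Age_years,Gender\n001,34,F\n002,41,M\n"
print(to_participants_tsv(example))
```

Keeping the mapping in a plain dictionary makes it easy to compare a hand-written mapping with one suggested by an LLM.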
42
 
43
+ For the LLM-assisted workflow, the following **tools** are suggested: GitHub Copilot in VS Code, LLMAnything.
44
+ Pick **any LLM**, really. Smaller LLMs tend to hallucinate more, so it is all the more interesting to see whether they can manage the task too!
45
+ Suggested larger LLMs: GPT-5, Claude, Kimi-K2, DeepSeek-R1
46
+ Suggested smaller LLMs: SmolLM, LLaMA-7B, Qwen-7B
47
 
48
+ Harmonizing **metadata** *by hand*:
49
+ 4. Annotation column names (from non-BIDS to BIDS)
50
+ 5. File structure
51
+ 6. Study metadata (fetching from repository HTMLs too)
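
For the file structure work packages, a sketch of how a target BIDS path could be built from a source identifier. The layout `sub-<label>[/ses-<label>]/anat/..._<suffix>.nii.gz` follows the BIDS naming scheme; the helper names, the example identifier `patient_07`, and the choice of the `anat` folder with a `T1w` suffix are assumptions for illustration only.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def bids_label(raw: str) -> str:
    """Keep only letters and digits, since BIDS labels must be alphanumeric."""
    label = re.sub(r"[^A-Za-z0-9]", "", raw)
    if not label:
        raise ValueError(f"no alphanumeric characters in {raw!r}")
    return label


def bids_anat_path(subject: str, session: Optional[str] = None,
                   suffix: str = "T1w") -> PurePosixPath:
    """Build sub-<label>[/ses-<label>]/anat/<entities>_<suffix>.nii.gz."""
    sub = f"sub-{bids_label(subject)}"
    parts = [sub]      # directory components
    entities = [sub]   # filename entities, in the same order
    if session is not None:
        ses = f"ses-{bids_label(session)}"
        parts.append(ses)
        entities.append(ses)
    filename = "_".join(entities + [suffix]) + ".nii.gz"
    return PurePosixPath(*parts, "anat", filename)


print(bids_anat_path("patient_07"))
print(bids_anat_path("patient_07", session="baseline"))
```

A deterministic helper like this can serve as the manual baseline against which LLM-proposed file layouts are compared.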
52
 
53
+ For the manual harmonization, an IDE with Python or R is useful, as is OpenRefine, an open-source tool for working with tabular data.
54
 
 
55
 
56
+ Record the problems and the working time for both manual and LLM-assisted harmonization.
57
 
58
+ Time planned: ~5 hours.
59
 
60
+ ### 2. Evaluation of the harmonized metadata
61
 
62
+ Let's see how well the harmonization went!
63
+ Use the [BIDS validator](https://bids-standard.github.io/bids-validator/) on the newly converted BIDS datasets and report whether:
64
+ 1. The manual harmonization is BIDS compliant.
65
+ 2. The semi-automatic harmonization is BIDS compliant.
66
 
67
+ 3. Assess the differences in the harmonized metadata.
68
+ 4. Try to "stack" BIDS converted datasets on old BIDS datasets, and report the errors.
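
Steps 3 and 4 above could be approached along the following lines. This is a rough sketch, not a prescribed method: both function names are hypothetical, the example dictionaries are invented values for `dataset_description.json` fields, and identifier collisions are only one of the errors that stacking may surface.

```python
def diff_metadata(manual: dict, assisted: dict) -> dict:
    """Return {key: (manual_value, assisted_value)} for keys whose values differ.

    A key that is missing on one side is reported with None on that side.
    """
    missing = object()  # sentinel so a missing key never equals a real value
    keys = sorted(set(manual) | set(assisted))
    return {
        key: (manual.get(key), assisted.get(key))
        for key in keys
        if manual.get(key, missing) != assisted.get(key, missing)
    }


def colliding_participants(ids_a, ids_b) -> list:
    """Participant IDs present in both datasets, which would clash when stacked."""
    return sorted(set(ids_a) & set(ids_b))


# Invented example values for dataset_description.json fields.
manual = {"Name": "Example study", "BIDSVersion": "1.9.0", "License": "CC0"}
assisted = {"Name": "Example study", "BIDSVersion": "1.8.0"}

print(diff_metadata(manual, assisted))
print(colliding_participants(["sub-01", "sub-02"], ["sub-02", "sub-03"]))
```

The difference report gives a concrete artifact to record next to the BIDS validator output for each harmonization route.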
69
 
70
+ Time planned: ~4 hours.
71
 
72
+ ### 3. Try in action
73
 
74
+ Use [ResEncL](https://huggingface.co/AnonRes/ResEncL-OpenMind-MAE) or another model of your choice with JAX, and feed it the resulting harmonized dataset. Analyse and record the bugs.
75
 
76
+ Time planned: ~4 hours.
77
 
78
+ ### Teams and Participants / Skills
79
 
80
+ Anyone from master's students to experienced scientists is welcome to join. The project will involve a lot of tabular data analysis and scripting, so coding experience in Python, R, and shell is useful.
81
+ For working with LLM agents, prompt engineering skills can be useful, though they can also be acquired during this project.
82
 
83
+ ### References / Prior reading
84
 
85
+ - [What is an AI agent](https://blog.langchain.com/how-to-think-about-agent-frameworks/)
86
+ - [Prompting guide from Meta](https://www.llama.com/docs/how-to-guides/prompting/)