stefanches7 committed
Commit 9f73132 · 1 Parent(s): 0ca114f

Revise README for OpenMind Wrangler project

Updated the project title and expanded the README to provide a detailed overview of the OpenMind Wrangler project, including its aims, vision, work plan, and evaluation metrics.

Files changed (1): README.md (+96 -2)
README.md CHANGED
@@ -1,2 +1,96 @@
- # AI-assisted-Neuroimaging-harmonization
- Harmonize Neuroimaging open datasets using AI agents
# OpenMind Wrangler: AI Agents for Data Preparation in the AI 3.0 Era

This is the repository for the exploratory project OpenMind Wrangler, inspired by the accelerating progress of AI 3.0 models and their potential to automate complex data engineering workflows for multi-study neuroimaging and clinical datasets.

## Project Aim

The goal of this project is to explore whether AI agents (primarily LLM-based) can meaningfully assist in the dataset wrangling tasks that precede model inference or training: tasks that are time-consuming yet essential to reproducible, large-scale neuroimaging research.

While the BIDS standard ensures interoperability at the metadata level, many preprocessing steps, such as volume normalization, quality control, outlier detection, and label encoding, remain manual or semi-automated. These steps form a bottleneck in data-driven science, especially when aggregating datasets across multiple studies.

In this proof-of-concept, we aim to determine whether a coordinated system of AI agents can reliably execute these operations and produce AI-ready dataset collections, similar to those hosted on platforms like Hugging Face Datasets (e.g., OpenMind), with minimal human intervention.
## The Vision

If successful, OpenMind Wrangler will serve as a foundation for a general-purpose AI data engineering assistant, capable of producing multi-study datasets that are immediately usable for machine learning and statistical analysis, without manual wrangling.

## Rough Work Plan

The project will proceed along four parallel tracks: agent design, dataset exploration, wrangling task automation, and evaluation.
### 1. Agent Design

We will prototype a multi-agent system in which each agent handles a specific wrangling subtask (e.g., normalization, QC, encoding). Agents will communicate via a shared memory and task queue.

Goals:

- Design a modular agentic framework (e.g., LangChain, CrewAI, or AutoGen)
- Define prompt templates and tools for each subtask
- Implement logging and reasoning traceability

Resources:

- LangChain Agent Docs
- OpenAI Function Calling Guide
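The shared-memory-and-task-queue design above can be sketched in plain Python. The names below (`WranglerAgent`, `run_pipeline`) are illustrative assumptions, not project API; a real agent would wrap an LLM call with tools rather than a string template:

```python
from collections import deque

class WranglerAgent:
    """One agent per wrangling subtask (normalization, QC, encoding).

    Illustrative sketch: a real agent would invoke an LLM with
    task-specific prompt templates and tools.
    """
    def __init__(self, name, task_type):
        self.name = name
        self.task_type = task_type

    def handle(self, task, memory):
        # Record progress in the shared memory so other agents can see it.
        memory.setdefault(task["dataset"], []).append(self.task_type)
        return f"{self.name}: done {task['type']} on {task['dataset']}"

def run_pipeline(tasks, agents):
    """Route queued tasks to the agent registered for each task type."""
    by_type = {a.task_type: a for a in agents}
    memory, queue, log = {}, deque(tasks), []
    while queue:
        task = queue.popleft()
        agent = by_type.get(task["type"])
        if agent is not None:
            log.append(agent.handle(task, memory))
    return memory, log
```

The queue makes subtask ordering explicit, and the shared `memory` dict stands in for the traceability store the goals above call for.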
### 2. Dataset Exploration

Primary data source: OpenNeuro.

Tasks:

- Retrieve metadata and BIDS structures
- Generate schema summaries and compatibility maps
- Identify data types for downstream analysis
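As a minimal sketch of the metadata-retrieval step, the helper below summarizes a BIDS dataset from its `dataset_description.json` and `sub-*` directories using only the standard library. The function name is hypothetical; a real pipeline would more likely use pybids:

```python
import json
from pathlib import Path

def summarize_bids_dataset(root):
    """Summarize a BIDS dataset: name, BIDS version, subject list.

    Hypothetical helper for illustration; pybids' BIDSLayout offers a
    much richer view of the same structure.
    """
    root = Path(root)
    # dataset_description.json is required at the top of every BIDS dataset.
    desc = json.loads((root / "dataset_description.json").read_text())
    # Subjects live in sub-<label> directories directly under the root.
    subjects = sorted(p.name for p in root.glob("sub-*") if p.is_dir())
    return {
        "name": desc.get("Name", "unknown"),
        "bids_version": desc.get("BIDSVersion", "unknown"),
        "n_subjects": len(subjects),
        "subjects": subjects,
    }
```

Summaries like this are the raw material for the schema summaries and compatibility maps listed above.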
### 3. Wrangling Task Automation

The central track: enabling LLM-driven automation of the following operations:

- Volume normalization (using NiBabel / ANTsPy)
- Quality control report generation
- Outlier detection (statistical and visual)
- Label harmonization and encoding
- Data documentation generation (Markdown / JSON-LD)

The focus is not on achieving perfection, but on evaluating how well an AI agent can assist with or autonomously perform these tasks, given contextual metadata and goals.
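To make the normalization step concrete, here is a minimal z-score normalization sketch over a NumPy array. The crude foreground mask and function name are assumptions; an actual pipeline would load volumes with NiBabel and apply a proper brain mask:

```python
import numpy as np

def zscore_normalize(volume, eps=1e-8):
    """Z-score voxel intensities over a crude foreground mask.

    Illustrative sketch: a real pipeline would load the volume with
    NiBabel (nib.load(path).get_fdata()) and mask with a brain mask
    rather than a simple intensity threshold.
    """
    vol = np.asarray(volume, dtype=float)
    mask = vol > 0  # assumption: background voxels are <= 0
    out = vol.copy()
    # Normalize only foreground voxels; background stays untouched.
    out[mask] = (vol[mask] - vol[mask].mean()) / (vol[mask].std() + eps)
    return out
```

After normalization, foreground intensities have approximately zero mean and unit variance, which is the kind of invariant a QC agent could check automatically.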
### 4. Evaluation and Benchmarking

The evaluation phase will assess the efficiency, accuracy, and robustness of agentic wrangling workflows compared to traditional, human-coded pipelines.

Metrics will include:

- Time saved per dataset compared to manual pipelines
- Accuracy of normalization / encoding
- Error detection rate (false positives / negatives in QC)
- Consistency across heterogeneous datasets

We will also benchmark against a simple scripted baseline (e.g., a hand-written Nipype or pandas pipeline).
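The QC error-detection metric can be computed from parallel boolean flag lists (agent-predicted vs. ground-truth outlier flags, one per scan). The helper name and signature below are illustrative, not a fixed project API:

```python
def qc_detection_metrics(predicted, actual):
    """Summarize QC outlier flagging against ground truth.

    `predicted` and `actual` are parallel boolean sequences, one entry
    per scan; this shape is an illustrative assumption.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))          # correctly flagged
    fp = sum(p and not a for p, a in zip(predicted, actual))      # false alarms
    fn = sum(a and not p for p, a in zip(predicted, actual))      # missed outliers
    tn = sum(not p and not a for p, a in zip(predicted, actual))  # correctly passed
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Reporting precision, recall, and false-positive rate separately matters here: an agent that flags everything has perfect recall but is useless in practice.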
## Milestones

- ✅ Literature & tooling review on AI-assisted data wrangling
- ⚙️ Prototype LLM agent framework for dataset preprocessing
- 🧠 Evaluation on 2–3 public BIDS datasets
- 📊 Quantitative + qualitative benchmarking report

Generative augmentation: can the agent propose synthetic data to fill gaps or balance classes?

## Teams and Participants