chuckfinca commited on
Commit
aea0566
·
1 Parent(s): 8a66e70

docs: updates initial plan with changes

Browse files
Files changed (1) hide show
  1. docs/initial_plan.md +66 -65
docs/initial_plan.md CHANGED
@@ -1,5 +1,5 @@
1
  # Freshman On-Track Intervention Recommender
2
- ## Project Plan & Technical Design
3
 
4
  ---
5
 
@@ -26,7 +26,7 @@
26
  - **LangChain**: Framework for RAG pipeline orchestration and prompt management
27
  - **Sentence Transformers**: High-quality semantic embeddings optimized for educational content
28
  - **FAISS**: Fast, in-memory vector search for PoC (Facebook AI Similarity Search)
29
- - **Pandas**: Data processing and manipulation for knowledge base preparation
30
 
31
  **Vector Embeddings**: `all-MiniLM-L6-v2` model
32
  - Optimized for semantic similarity tasks
@@ -40,18 +40,18 @@
40
 
41
  ### RAG Pipeline Architecture
42
 
43
- 1. **Knowledge Base Ingestion**: Extract and preprocess intervention documents
44
- 2. **Chunking Strategy**: Semantic chunking by intervention type and implementation steps
45
- 3. **Vector Embedding**: Transform text chunks into searchable vector representations
46
- 4. **Retrieval**: Take the narrative_summary_for_embedding from the student profile as the query. Perform semantic search against the vector database to retrieve the top 3 most relevant intervention chunks
47
- 5. **Synthesis**: Generate educator-friendly recommendations with source citations
48
 
49
  ### Alignment with Architectural Principles
50
 
51
- - **RAG as Core**: Semantic search ensures recommendations are grounded in evidence-based research
52
- - **Actionable for Educators**: Output format prioritizes clear, implementable steps over raw research
53
- - **Startup Scale**: FAISS for PoC, cloud-native services for production scalability
54
- - **Bias for Action**: Minimal viable architecture focused on core functionality first
55
 
56
  ---
57
 
@@ -59,45 +59,46 @@
59
 
60
  ### Selected Best-Practice Documents
61
 
62
- 1. **FOT Toolkit - Tool Set C: Developing and Tracking Interventions** (Pages 43-68)
63
- - *Primary Source*: Comprehensive intervention framework
64
- - *Focus*: Systematic approach to intervention selection and tracking
65
 
66
- 2. **Check & Connect Intervention** (University of Minnesota/WWC)
67
- - *Evidence Level*: Only dropout prevention program with WWC "Positive Effects" rating
68
- - *Focus*: Structured mentoring for attendance and credit recovery
69
 
70
- 3. **Predictive Power of Ninth-Grade GPA** (University of Chicago Consortium)
71
- - *Strategic Value*: Research foundation explaining why FOT interventions matter
72
- - *Focus*: Data-driven rationale for early intervention
73
-
74
- 4. **Preventing Chronic Absence and Promoting Attendance** (REL Program)
75
- - *Evidence Base*: Tiered, research-validated attendance strategies
76
- - *Focus*: Family engagement, transportation, and systemic barriers
77
-
78
- 5. **Addressing Root Causes of Disparities in School Discipline** (NCSSLE)
79
- - *Methodology*: Systematic root-cause analysis for behavioral interventions
80
- - *Focus*: Data-driven behavioral support strategies
81
 
82
  ### Data Processing Strategy
83
 
84
- **Content Extraction** (Hybrid Strategy):
85
- - **Tier 1**: PyMuPDF (fitz) for rapid extraction of simple, single-column text pages
86
- - **Tier 2**: pdfplumber for structured tabular data to preserve relational integrity
87
- - **Tier 3**: Nougat (Meta AI) layout-aware model for complex multi-column layouts and flowcharts
88
- - **Quality Assurance**: Manual review and validation of extracted content accuracy
 
 
 
89
 
90
  **Chunking Approach**:
91
- - **Semantic Chunking**: Break documents by intervention type, not arbitrary word limits
92
- - **Chunk Size**: 300-500 words to maintain context while enabling precise retrieval
93
- - **Overlap Strategy**: 50-word overlap to preserve cross-boundary context
94
- - **Metadata Tagging**: Source document, intervention category, target indicators
95
 
96
  **Content Preparation**:
97
- - Standardize intervention descriptions with consistent format
98
- - Extract key implementation steps and required resources
99
- - Tag interventions by target risk factors (attendance, credits, behavior)
100
- - Create intervention summaries optimized for educator consumption
101
 
102
  ---
103
 
@@ -105,45 +106,45 @@
105
 
106
  ### Development Acceleration
107
 
108
- **GitHub Copilot**:
109
- - Code generation for standard RAG pipeline components
110
- - Boilerplate reduction for data processing and API endpoints
111
- - Test case generation for validation scenarios
112
 
113
  **Large Language Models (GPT-4/Claude)**:
114
- - **Document Analysis**: Rapid extraction of key intervention strategies from research papers
115
- - **Prompt Engineering**: Optimize prompts for educator-specific output formatting
116
- - **Content Synthesis**: Transform academic language into practitioner-friendly recommendations
117
- - **Code Review**: Architecture validation and optimization suggestions
118
 
119
  ### Problem-Solving Workflow
120
 
121
- 1. **Research Phase**: Use LLMs to quickly synthesize intervention research and identify gaps
122
- 2. **Architecture Design**: Validate technical approach against startup scaling requirements
123
- 3. **Implementation**: Leverage Copilot for rapid prototype development
124
- 4. **Testing**: AI-assisted generation of diverse student profile test cases
125
- 5. **Optimization**: LLM-powered analysis of retrieval quality and recommendation relevance
126
 
127
  ### Quality Assurance
128
 
129
- - **Prompt Validation**: Use AI to generate edge cases for robust testing
130
- - **Content Review**: AI-assisted verification that academic content translates to actionable guidance
131
- - **Bias Detection**: Systematic review of recommendations for potential equity issues
132
 
133
  ---
134
 
135
  ## Success Metrics & Next Steps
136
 
137
  **PoC Success Criteria**:
138
- - Accurate retrieval of top 3 relevant interventions for sample student profile
139
- - Educator-friendly output format with clear implementation guidance
140
- - Sub-2 second response time for typical queries
141
- - Proper source citation for all recommendations
142
 
143
  **Production Evolution Path**:
144
- 1. **Enhanced Knowledge Base**: Scale to 50+ intervention documents
145
- 2. **Persona-Based Outputs**: Tailored recommendations for teachers, parents, principals
146
- 3. **API Microservice**: RESTful service for integration with SIS platforms
147
- 4. **Analytics Dashboard**: Track intervention effectiveness and usage patterns
148
 
149
  This PoC establishes the foundation for a scalable, evidence-based intervention recommendation system that can transform how educators support at-risk freshmen nationwide.
 
1
  # Freshman On-Track Intervention Recommender
2
+ ## Project Plan & Technical Design (Revision 3)
3
 
4
  ---
5
 
 
26
  - **LangChain**: Framework for RAG pipeline orchestration and prompt management
27
  - **Sentence Transformers**: High-quality semantic embeddings optimized for educational content
28
  - **FAISS**: Fast, in-memory vector search for PoC (Facebook AI Similarity Search)
29
+ - **Simplified Stack**: Focus on `langchain`, `sentence-transformers`, `faiss-cpu`, `torch`, and `transformers` to directly support the core RAG pipeline, removing dependencies for direct PDF processing.
30
 
31
  **Vector Embeddings**: `all-MiniLM-L6-v2` model
32
  - Optimized for semantic similarity tasks
 
40
 
41
  ### RAG Pipeline Architecture
42
 
43
+ 1. **Knowledge Base Ingestion**: Load and process a manually curated, high-quality JSON knowledge base (`knowledge_base_raw.json`). This bypasses unreliable PDF parsing to focus on core RAG functionality.
44
+ 2. **Chunking Strategy**: Semantic chunking by intervention type and implementation steps.
45
+ 3. **Vector Embedding**: Transform text chunks into searchable vector representations.
46
+ 4. **Retrieval**: Take the `narrative_summary_for_embedding` from the student profile as the query. Perform semantic search against the vector database to retrieve the top 3 most relevant intervention chunks.
47
+ 5. **Synthesis**: Generate educator-friendly recommendations with source citations.
48
 
49
  ### Alignment with Architectural Principles
50
 
51
+ - **RAG as Core**: Semantic search ensures recommendations are grounded in evidence-based research.
52
+ - **Actionable for Educators**: Output format prioritizes clear, implementable steps over raw research.
53
+ - **Startup Scale**: FAISS for PoC, cloud-native services for production scalability.
54
+ - **Bias for Action**: Minimal viable architecture focused on core functionality first.
55
 
56
  ---
57
 
 
59
 
60
  ### Selected Best-Practice Documents
61
 
62
+ The knowledge base is built from the primary source document provided and is complemented by five additional high-quality, evidence-based resources to provide specific, actionable "playbooks" for educators.
 
 
63
 
64
+ **Primary Source Document:**
65
+ 1. **Freshman On‑Track Toolkit (2nd Edition)** (Network for College Success, 2017)
66
+ - ***Primary Focus Area***: **Tool Set C: Developing and Tracking Interventions (Pages 43-68)**, which provides the core framework for intervention planning, tracking, and evaluation.
67
 
68
+ **Additional Curated Sources:**
69
+ 2. **17 Quick Tips for Your Credit Recovery Program** (Edmentum, 2024)
70
+ - *Focus*: Actionable strategies for designing and implementing effective credit recovery programs at both the district and school levels.
71
+ 3. **Handout: Strategies to Address Chronic Absenteeism** (Institute of Education Sciences, REL Southwest, 2025)
72
+ - *Focus*: Evidence-based interventions for chronic absenteeism, including Early Warning Systems, Mentoring, and Check & Connect.
73
+ 4. **High-Quality Tutoring: An Evidence-Based Strategy to Tackle Learning Loss** (Institute of Education Sciences, 2021)
74
+ - *Focus*: Defines the characteristics of effective, high-impact tutoring to accelerate student learning.
75
+ 5. **WWC Intervention Report: Check & Connect** (Institute of Education Sciences, What Works Clearinghouse, 2015)
76
+ - *Evidence Level*: A detailed report on a key dropout prevention program with positive effects on keeping students in school.
77
+ 6. **Early Intervention Strategies: Using Teams to Monitor and Identify Students in Need of Support** (Attendance Works, 2019)
78
+ - *Focus*: A multi-tiered team-based approach to monitoring attendance data and implementing early interventions.
79
 
80
  ### Data Processing Strategy
81
 
82
+ ~~**Content Extraction** (Hybrid Strategy):~~
83
+ - ~~**Tier 1**: PyMuPDF (fitz) for rapid extraction of simple, single-column text pages~~
84
+ - ~~**Tier 2**: pdfplumber for structured tabular data to preserve relational integrity~~
85
+ - ~~**Tier 3**: Nougat (Meta AI) layout-aware model for complex multi-column layouts and flowcharts~~
86
+ - ~~**Quality Assurance**: Manual review and validation of extracted content accuracy~~
87
+
88
+ **Pivoted Content Extraction Strategy:**
89
+ - **Manual Curation**: Bypassed programmatic PDF extraction due to complexity and unreliability. Instead, key interventions were manually extracted (with LLM assistance) from all source documents into a single, high-quality `knowledge_base_raw.json` file. This ensures maximum quality and allows direct focus on the RAG pipeline.
90
 
91
  **Chunking Approach**:
92
+ - **Semantic Chunking**: Break documents by intervention type, not arbitrary word limits.
93
+ - **Chunk Size**: 300-500 words to maintain context while enabling precise retrieval.
94
+ - **Overlap Strategy**: 50-word overlap to preserve cross-boundary context.
95
+ - **Metadata Tagging**: Source document, intervention category, target indicators.
96
 
97
  **Content Preparation**:
98
+ - Standardize intervention descriptions with consistent format.
99
+ - Extract key implementation steps and required resources.
100
+ - Tag interventions by target risk factors (attendance, credits, behavior).
101
+ - Create intervention summaries optimized for educator consumption.
102
 
103
  ---
104
 
 
106
 
107
  ### Development Acceleration
108
 
109
+ **GitHub Copilot**:
110
+ - Code generation for standard RAG pipeline components.
111
+ - Boilerplate reduction for data processing and API endpoints.
112
+ - Test case generation for validation scenarios.
113
 
114
  **Large Language Models (GPT-4/Claude)**:
115
+ - **Knowledge Base Curation**: Accelerated the manual extraction process by summarizing dense academic PDFs and structuring the content into the clean `knowledge_base_raw.json` format.
116
+ - **Prompt Engineering**: Optimize prompts for educator-specific output formatting.
117
+ - **Content Synthesis**: Transform academic language into practitioner-friendly recommendations.
118
+ - **Code Review**: Architecture validation and optimization suggestions.
119
 
120
  ### Problem-Solving Workflow
121
 
122
+ 1. **Research Phase**: Use LLMs to quickly synthesize intervention research and identify gaps.
123
+ 2. **Architecture Design**: Validate technical approach against startup scaling requirements.
124
+ 3. **Implementation**: Leverage Copilot for rapid prototype development.
125
+ 4. **Testing**: AI-assisted generation of diverse student profile test cases.
126
+ 5. **Optimization**: LLM-powered analysis of retrieval quality and recommendation relevance.
127
 
128
  ### Quality Assurance
129
 
130
+ - **Prompt Validation**: Use AI to generate edge cases for robust testing.
131
+ - **Content Review**: AI-assisted verification that academic content translates to actionable guidance.
132
+ - **Bias Detection**: Systematic review of recommendations for potential equity issues.
133
 
134
  ---
135
 
136
  ## Success Metrics & Next Steps
137
 
138
  **PoC Success Criteria**:
139
+ - Accurate retrieval of top 3 relevant interventions for sample student profile.
140
+ - Educator-friendly output format with clear implementation guidance.
141
+ - Sub-2 second response time for typical queries.
142
+ - Proper source citation for all recommendations.
143
 
144
  **Production Evolution Path**:
145
+ 1. **Enhanced Knowledge Base**: Scale to 50+ intervention documents.
146
+ 2. **Persona-Based Outputs**: Tailored recommendations for teachers, parents, principals.
147
+ 3. **API Microservice**: RESTful service for integration with SIS platforms.
148
+ 4. **Analytics Dashboard**: Track intervention effectiveness and usage patterns.
149
 
150
  This PoC establishes the foundation for a scalable, evidence-based intervention recommendation system that can transform how educators support at-risk freshmen nationwide.