Spaces: chuckfinca/fot-recommender-api

chuckfinca committed on Aug 5, 2025

Commit

9cd91d8

1 Parent(s): d558ce2

docs: adds planning docs to track changes

Files changed (3)

docs/implementation_plan.md +365 -0
docs/initial_plan.md +149 -0
docs/research_plan.md +62 -0

docs/implementation_plan.md ADDED

	@@ -0,0 +1,365 @@

+# FOT Intervention Recommender
+## Detailed Implementation Plan
+---
+## Overview
+This implementation plan transforms the strategic project plan into executable phases, with specific tasks, deliverables, and success criteria for building the working proof-of-concept.
+**Total Estimated Time**: 8-12 hours of core work spread over 3-5 days (optional Phase 5 bonus features add 2-4 hours)
+**Primary Deliverable**: Google Colab Notebook with working RAG system
+---
+## Phase 0: Environment Setup & Resource Gathering
+**Duration**: 1-2 hours
+**Goal**: Establish development environment and collect all source materials
+### Tasks
+#### 0.1 Development Environment Setup
+- [ ] Create new Google Colab notebook: "FOT_Intervention_Recommender"
+- [ ] Install required libraries in first cell:
+```python
+!pip install sentence-transformers faiss-cpu langchain pandas pymupdf pdfplumber transformers
+```
+- [ ] Import necessary libraries and test basic functionality
+- [ ] Set up file organization structure in Colab
+#### 0.2 Source Material Collection
+- [ ] **Extract FOT Toolkit pages 43-68**:
+  - Use PDF splitter tool to extract specific pages
+  - Save as separate PDF: "FOT_Toolkit_ToolSetC.pdf"
+  - Upload to Colab files section
+- [ ] **Download 5 external sources**:
+  - [ ] Check & Connect materials (search UMN website)
+  - [ ] Download UChicago GPA research PDF
+  - [ ] Save REL chronic absenteeism resources
+  - [ ] Get Success for All intervention guides
+  - [ ] Download NCSSLE discipline disparities guide
+#### 0.3 Quick Content Reconnaissance
+- [ ] Scan each document to identify:
+  - Simple text pages (for PyMuPDF)
+  - Complex table pages (for pdfplumber)
+  - Multi-column/flowchart pages (for manual extraction initially)
+- [ ] Create a "document complexity map" for processing strategy
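One way to record the "document complexity map" is a plain dictionary from page number to extraction method. This is only a sketch: the page assignments are placeholders rather than the toolkit's real layout, and `pages_for` is a hypothetical helper name.

```python
from collections import defaultdict

# Placeholder classifications; the real values come from the manual scan in 0.3.
complexity_map = {
    "FOT_Toolkit_ToolSetC.pdf": {1: "pymupdf", 2: "pdfplumber", 3: "manual", 4: "pymupdf"},
}

def pages_for(doc_name, cmap=complexity_map):
    """Group a document's page numbers by the extractor assigned to them."""
    grouped = defaultdict(list)
    for page, method in sorted(cmap[doc_name].items()):
        grouped[method].append(page)
    return dict(grouped)
```

Each extractor in Phase 1 can then be called once per document with only the pages assigned to it.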
+### Success Criteria
+- ✅ Colab environment running with all dependencies
+- ✅ All 6 source documents collected and uploaded
+- ✅ Basic understanding of each document's structure and complexity
+---
+## Phase 1: Knowledge Base Construction
+**Duration**: 3-4 hours
+**Goal**: Extract, process, and structure content into RAG-ready knowledge base
+### Tasks
+#### 1.1 Content Extraction (Hybrid Approach)
+- [ ] **Implement PyMuPDF extraction**:
+```python
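+import fitz  # PyMuPDF
+def extract_simple_text(pdf_path, page_range):
+    """Return {page_index: text} for the 0-based page indices in page_range."""
+    with fitz.open(pdf_path) as doc:
+        return {i: doc[i].get_text() for i in page_range}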
+```
+- [ ] **Implement pdfplumber for tables**:
+```python
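+import pdfplumber
+def extract_table_data(pdf_path, page_numbers):
+    """Return {page_number: tables} for the 1-based page numbers given."""
+    with pdfplumber.open(pdf_path) as pdf:
+        return {n: pdf.pages[n - 1].extract_tables() for n in page_numbers}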
+```
+- [ ] **Manual extraction for complex pages**:
+  - Identify 3-5 most critical complex pages
+  - Manually transcribe key intervention details
+  - Focus on flowcharts and multi-column layouts
+#### 1.2 Content Processing & Standardization
+- [ ] **Create intervention extraction function**:
+```python
+def extract_interventions(raw_text, source_doc):
+    """Extract structured intervention data"""
+    interventions = []
+    # Parse for intervention name, description, steps, target indicators
+    return interventions
+```
+- [ ] **Process each document**:
+  - FOT Toolkit Tool Set C → Core intervention framework
+  - Check & Connect → Mentoring strategies
+  - UChicago Research → Rationale and evidence base
+  - REL Resources → Attendance strategies
+  - Success for All → Comprehensive approaches
+  - NCSSLE Guide → Behavioral interventions
+#### 1.3 Knowledge Base Structuring
+- [ ] **Create standardized intervention format**:
+```python
+from typing import List
+intervention_schema = {
+    "id": str,
+    "name": str,
+    "description": str,
+    "implementation_steps": List[str],
+    "target_indicators": List[str],  # credits, attendance, behavior
+    "evidence_level": str,
+    "source_document": str,
+    "educator_guidance": str
+}
+```
+- [ ] **Implement semantic chunking**:
+  - Chunk by intervention type (300-500 words)
+  - Add 50-word overlap between chunks
+  - Create metadata tags for each chunk
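The overlap rule above can be sketched as a sliding word window. This assumes the text has already been split by intervention type, so the function only handles the size and overlap limits; `chunk_words` and its defaults (400 words, inside the planned 300-500 range) are illustrative choices.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Short interventions that fit in one window come back as a single chunk, so no overlap is introduced where it is not needed.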
+### Success Criteria
+- ✅ All documents successfully processed using appropriate extraction method
+- ✅ 20+ distinct interventions identified and structured
+- ✅ Standardized data format with consistent metadata
+- ✅ Quality validation: random sample review shows accurate extraction
+---
+## Phase 2: RAG Pipeline Implementation
+**Duration**: 2-3 hours
+**Goal**: Build and test the core RAG functionality
+### Tasks
+#### 2.1 Vector Embedding Setup
+- [ ] **Initialize embedding model**:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer('all-MiniLM-L6-v2')
+```
+- [ ] **Create embeddings for knowledge base**:
+```python
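+def create_embeddings(intervention_chunks):
+    # Unit-length vectors make the FAISS inner product equal to cosine similarity
+    embeddings = model.encode(intervention_chunks, normalize_embeddings=True)
+    return embeddings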
+```
+- [ ] **Set up FAISS vector database**:
+```python
+import faiss
+def create_vector_db(embeddings):
+    dimension = embeddings.shape[1]
+    index = faiss.IndexFlatIP(dimension)  # Inner product; equals cosine similarity for unit-length vectors
+    index.add(embeddings)
+    return index
+```
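`IndexFlatIP` ranks by raw inner product, which matches cosine similarity only when the vectors have unit length. The pure-Python sketch below (no FAISS required; the helper names are illustrative) shows that relationship on a small example.

```python
import math

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))

a, b = [3.0, 4.0], [4.0, 3.0]
raw_score = inner_product(a, b)                                # depends on vector length
unit_score = inner_product(l2_normalize(a), l2_normalize(b))   # matches cosine(a, b)
```

If the embedding step does not return unit-length vectors, normalizing both the indexed vectors and the query keeps the scores comparable across chunks.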
+#### 2.2 Retrieval System
+- [ ] **Implement semantic search**:
+```python
+def search_interventions(query, index, intervention_data, k=3):
+    query_embedding = model.encode([query], normalize_embeddings=True)
+    scores, indices = index.search(query_embedding, k)
+    # FAISS pads with -1 when fewer than k vectors are indexed, so skip those slots
+    return [(intervention_data[i], float(s)) for s, i in zip(scores[0], indices[0]) if i != -1]
+```
+- [ ] **Test retrieval with sample queries**:
+  - "Student failing core classes and missing school"
+  - "Attendance problems and behavioral issues"
+  - "Low credits earned, needs academic support"
+#### 2.3 Response Generation
+- [ ] **Create educator-friendly formatter**:
+```python
+def format_recommendations(retrieved_interventions, student_profile):
+    formatted_response = []
+    for intervention, score in retrieved_interventions:
+        recommendation = {
+            "intervention_name": intervention["name"],
+            "rationale": f"Recommended because: {explain_match(intervention, student_profile)}",
+            "implementation_steps": intervention["implementation_steps"],
+            "source": intervention["source_document"],
+            "confidence_score": float(score)  # similarity score, cast for JSON serialization
+        }
+        formatted_response.append(recommendation)
+    return formatted_response
+```
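The formatter calls `explain_match`, which the plan does not define. A minimal keyword-based sketch follows; the keyword lists and the wording of the returned phrases are assumptions, not part of the plan, and a later version might replace this with an LLM-generated rationale.

```python
# Hypothetical keyword lists; tune these against real student narratives.
INDICATOR_KEYWORDS = {
    "credits": ("credit", "failed", "failing", "coursework"),
    "attendance": ("attendance", "absent", "absence", "missing school"),
    "behavior": ("behavior", "behavioral", "discipline", "incident"),
}

def explain_match(intervention, student_profile):
    """Name the target indicators that also appear in the student narrative."""
    text = student_profile.lower()
    matched = [
        indicator
        for indicator in intervention.get("target_indicators", [])
        if any(kw in text for kw in INDICATOR_KEYWORDS.get(indicator, ()))
    ]
    if not matched:
        return "it was among the closest semantic matches to the profile"
    return "it targets " + ", ".join(matched) + ", which the profile mentions"
```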
+### Success Criteria
+- ✅ Vector database successfully created with all intervention embeddings
+- ✅ Semantic search returns relevant results for test queries
+- ✅ Response format is educator-friendly with clear implementation guidance
+- ✅ Source citations are properly maintained throughout pipeline
+---
+## Phase 3: System Integration & Testing
+**Duration**: 1-2 hours
+**Goal**: End-to-end testing with provided student profile
+### Tasks
+#### 3.1 End-to-End Pipeline Integration
+- [ ] **Create main recommendation function**:
+```python
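+def get_fot_recommendations(student_profile_narrative):
+    # Assumes `index` and `intervention_data` were built in Phases 1-2
+    # 1-3. Use the narrative as the query and retrieve the top 3 interventions
+    retrieved = search_interventions(student_profile_narrative, index, intervention_data, k=3)
+    # 4-5. Format for educators and return structured recommendations
+    return format_recommendations(retrieved, student_profile_narrative)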
+```
+#### 3.2 Testing with Sample Student Profile
+- [ ] **Test with provided profile**:
+```python
+sample_student = """This student is struggling to keep up with coursework,
+having failed one core class and earning only 2.5 credits out of 4 credits
+expected for the semester. Attendance is becoming a concern at 88% for an
+average annual target of 90%, and they have had one behavioral incident.
+The student needs targeted academic and attendance support to get back on
+track for graduation."""
+recommendations = get_fot_recommendations(sample_student)
+```
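The numbers in the sample profile can also be checked against explicit thresholds, which helps when validating that the recommendations cover every risk factor. This is a sketch under assumptions: the 90% attendance target comes from the narrative, while the function name, its parameters, and the rule "any credit shortfall or core failure flags credits" are illustrative.

```python
def risk_flags(credits_earned, credits_expected, attendance_pct,
               attendance_target=90.0, core_failures=0, behavior_incidents=0):
    """Return the on-track indicators that look off-track for one student."""
    flags = []
    if credits_earned < credits_expected or core_failures > 0:
        flags.append("credits")
    if attendance_pct < attendance_target:
        flags.append("attendance")
    if behavior_incidents > 0:
        flags.append("behavior")
    return flags

# Values taken from the sample profile above.
flags = risk_flags(2.5, 4, 88.0, core_failures=1, behavior_incidents=1)
credit_ratio = 2.5 / 4  # share of expected credits earned
```

For the sample student all three indicators are flagged, so a good set of recommendations should touch credits, attendance, and behavior.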
+#### 3.3 Quality Validation & Refinement
+- [ ] **Evaluate recommendation quality**:
+  - Do recommendations address student's specific risk factors?
+  - Are implementation steps clear and actionable?
+  - Are source citations accurate and helpful?
+- [ ] **Refine retrieval if needed**:
+  - Adjust retrieval parameters (e.g., `k`) or swap the embedding model (e.g., `all-mpnet-base-v2`)
+  - Modify chunking strategy if results are poor
+  - Fine-tune response formatting
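To judge whether a refinement actually helped, a simple metric such as precision at k can be tracked across test queries. The sketch below assumes a reviewer has hand-labeled which intervention ids are relevant for each query; the function name and id format are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved ids that a reviewer marked relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / len(top_k)
```

Comparing this number before and after a chunking or model change gives a quick, if rough, signal on retrieval quality.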
+### Success Criteria
+- ✅ End-to-end pipeline processes student profile successfully
+- ✅ Returns exactly 3 relevant intervention recommendations
+- ✅ Each recommendation includes implementation steps and source citation
+- ✅ Recommendations directly address student's risk factors (credits, attendance, behavior)
+---
+## Phase 4: Documentation & Presentation Preparation
+**Duration**: 1-2 hours
+**Goal**: Create clear notebook documentation and prepare for video presentation
+### Tasks
+#### 4.1 Colab Notebook Documentation
+- [ ] **Add comprehensive markdown cells**:
+  - Project overview and goals
+  - Knowledge base composition and rationale
+  - Technical architecture explanation
+  - Step-by-step process documentation
+- [ ] **Code documentation**:
+  - Add docstrings to all functions
+  - Include inline comments for complex logic
+  - Add example usage for key functions
+#### 4.2 Demonstration Preparation
+- [ ] **Create demonstration workflow**:
+  - Show knowledge base construction process
+  - Demonstrate search functionality with different queries
+  - Walk through the sample student profile analysis
+  - Display formatted recommendations
+- [ ] **Prepare talking points for video**:
+  - Project value proposition (30 seconds)
+  - Technical approach overview (60 seconds)
+  - Live demonstration (2 minutes)
+  - Next steps and product vision (90 seconds)
+### Success Criteria
+- ✅ Notebook is well-documented with clear explanations
+- ✅ All code cells execute successfully from top to bottom
+- ✅ Demonstration workflow is smooth and highlights key features
+- ✅ Ready for 5-minute video recording
+---
+## Phase 5: Bonus Features (Optional)
+**Duration**: 2-4 hours
+**Goal**: Implement advanced features to differentiate the solution
+### Option A: API Microservice (Bonus 1)
+- [ ] **Create FastAPI application**:
+```python
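+from fastapi import FastAPI
+from pydantic import BaseModel
+app = FastAPI(title="FOT Intervention Recommender")
+class RecommendRequest(BaseModel):
+    student_narrative: str  # sent in the JSON body rather than as a query parameter
+@app.post("/recommend")
+async def get_recommendations(request: RecommendRequest):
+    return get_fot_recommendations(request.student_narrative)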
+```
+- [ ] **Containerize with Docker**
+- [ ] **Create deployment documentation**
+### Option B: Persona-Based Recommendations (Bonus 2)
+- [ ] **Implement persona-specific prompts**:
+```python
+def generate_persona_recommendations(interventions, persona):
+    # Teacher: Classroom-focused, actionable steps
+    # Parent: Supportive language, home-based strategies
+    # Principal: Resource requirements, systemic approach
+    pass
+```
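One possible shape for the persona step is a small prompt builder. The style sentences paraphrase the three personas named in the stub above; `build_persona_prompt` and the prompt layout are assumptions, and the resulting string would still need to be sent to whichever LLM the project adopts.

```python
# Persona guidance paraphrased from the three personas listed in the plan.
PERSONA_STYLE = {
    "teacher": "Focus on classroom actions and concrete next steps.",
    "parent": "Use supportive language and suggest home-based strategies.",
    "principal": "Summarize resource requirements and the systemic approach.",
}

def build_persona_prompt(interventions, persona):
    """Assemble a prompt string for one persona from retrieved interventions."""
    key = persona.lower()
    style = PERSONA_STYLE.get(key)
    if style is None:
        raise ValueError(f"unknown persona: {persona!r}")
    lines = [f"Audience: {key}. {style}", "Interventions:"]
    for item in interventions:
        lines.append(f"- {item['name']} (source: {item['source_document']})")
    return "\n".join(lines)
```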
+### Success Criteria (if attempted)
+- ✅ Bonus feature fully functional and demonstrated
+- ✅ Added value is clear and well-articulated
+- ✅ Implementation quality matches core system standards
+---
+## Risk Mitigation Strategies
+### Technical Risks
+- **Complex PDF extraction fails**: Fall back to manual extraction for critical pages
+- **Poor embedding quality**: Test alternative models (e.g., `all-mpnet-base-v2`)
+- **Retrieval returns irrelevant results**: Adjust chunking strategy or add filtering
+### Time Management Risks
+- **Document processing takes too long**: Prioritize FOT Toolkit + 2 highest-quality external sources
+- **Perfectionism trap**: Focus on working MVP first, refinements second
+- **Scope creep**: Stick to core deliverables, save enhancements for bonus phase
+### Quality Risks
+- **Recommendations not educator-friendly**: Test format with simple language review
+- **Source citations missing**: Implement citation tracking from extraction phase
+- **System doesn't handle edge cases**: Build in error handling and fallback responses
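The edge-case mitigation could take the form of a thin wrapper around the recommendation function. This is a sketch, not the planned implementation: `safe_recommend`, the status strings, and the five-word minimum are all illustrative assumptions.

```python
def safe_recommend(recommend_fn, narrative, min_words=5):
    """Call recommend_fn, returning a fallback payload on bad input or errors."""
    if not narrative or len(narrative.split()) < min_words:
        return {"status": "needs_more_detail", "recommendations": []}
    try:
        results = recommend_fn(narrative)
    except Exception as exc:  # broad on purpose: the PoC demo should not crash
        return {"status": "error", "detail": str(exc), "recommendations": []}
    if not results:
        return {"status": "no_match", "recommendations": []}
    return {"status": "ok", "recommendations": results}
```

Passing the recommender in as an argument keeps the wrapper testable without loading the embedding model.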
+---
+## Daily Execution Schedule
+### Day 1 (2-3 hours)
+- Complete Phase 0: Setup & Resource Gathering
+- Begin Phase 1: Start content extraction
+### Day 2 (3-4 hours)
+- Complete Phase 1: Finish knowledge base construction
+- Begin Phase 2: Start RAG implementation
+### Day 3 (2-3 hours)
+- Complete Phase 2: Finish RAG pipeline
+- Complete Phase 3: Testing and validation
+### Day 4 (1-2 hours)
+- Complete Phase 4: Documentation and prep
+- Optional: Begin bonus features
+### Day 5 (Optional, 2-4 hours)
+- Phase 5: Bonus implementation
+- Final testing and video recording
+This implementation plan provides a clear roadmap from strategic vision to working prototype, balancing ambition with practical execution constraints.

docs/initial_plan.md ADDED

	@@ -0,0 +1,149 @@

+# Freshman On-Track Intervention Recommender
+## Project Plan & Technical Design
+---
+## Problem Understanding
+**Core Problem**: Freshman-year performance is among the strongest predictors of high school graduation, yet educators lack systematic tools to match at-risk 9th graders with evidence-based interventions. Currently, intervention selection relies on educator intuition rather than proven best practices, leading to inconsistent support for struggling students.
+**Goal of this PoC**: Build a Retrieval-Augmented Generation (RAG) system that takes a student's on-track indicators (credits, attendance, behavioral flags) and automatically recommends the most relevant, evidence-based intervention strategies from a curated knowledge base of proven FOT practices.
+**Value Proposition**: This system transforms scattered research into actionable guidance, enabling educators to quickly identify targeted interventions without requiring deep expertise in educational research. By democratizing access to best practices, we can systematically improve outcomes for at-risk freshmen.
+---
+## Proposed RAG Architecture
+### Technical Stack & Rationale
+**Programming Language**: Python
+- Industry standard for ML/AI development
+- Rich ecosystem of libraries for RAG implementation
+- Rapid prototyping capabilities align with "bias for action" principle
+**Core Libraries**:
+- **LangChain**: Framework for RAG pipeline orchestration and prompt management
+- **Sentence Transformers**: General-purpose semantic embeddings for sentence- and paragraph-level text
+- **FAISS**: Fast, in-memory vector search for PoC (Facebook AI Similarity Search)
+- **Pandas**: Data processing and manipulation for knowledge base preparation
+**Vector Embeddings**: `all-MiniLM-L6-v2` model
+- Optimized for semantic similarity tasks
+- Balanced performance vs. computational efficiency
+- General-purpose model; its fit for educational/instructional text will be validated during testing
+**Cloud Services** (Production Path):
+- **Google Cloud Run**: Serverless, auto-scaling container deployment
+- **Pinecone/Weaviate**: Managed vector database for production scale
+- **Google Cloud Storage**: Document storage and versioning
+### RAG Pipeline Architecture
+1. **Knowledge Base Ingestion**: Extract and preprocess intervention documents
+2. **Chunking Strategy**: Semantic chunking by intervention type and implementation steps
+3. **Vector Embedding**: Transform text chunks into searchable vector representations
+4. **Retrieval**: Take the `narrative_summary_for_embedding` field from the student profile as the query, then perform semantic search against the vector database to retrieve the top 3 most relevant intervention chunks
+5. **Synthesis**: Generate educator-friendly recommendations with source citations
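The retrieval and synthesis steps can be illustrated end to end with a toy stand-in for the embedding model. This sketch swaps in bag-of-words count vectors in place of Sentence Transformers and FAISS so it runs with the standard library only; all function names and the chunk dictionary keys are assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def synthesize(chunks):
    """Format retrieved chunks as cited recommendation lines."""
    return [f"{c['text']} [source: {c['source']}]" for c in chunks]
```

In the real pipeline, `embed` would be replaced by the embedding model and `retrieve` by a vector index search, while the citation-carrying structure of each chunk stays the same.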
+### Alignment with Architectural Principles
+- **RAG as Core**: Semantic search ensures recommendations are grounded in evidence-based research
+- **Actionable for Educators**: Output format prioritizes clear, implementable steps over raw research
+- **Startup Scale**: FAISS for PoC, cloud-native services for production scalability
+- **Bias for Action**: Minimal viable architecture focused on core functionality first
+---
+## Knowledge Base & Data Processing Strategy
+### Selected Best-Practice Documents
+1. **FOT Toolkit - Tool Set C: Developing and Tracking Interventions** (Pages 43-68)
+   - *Primary Source*: Comprehensive intervention framework
+   - *Focus*: Systematic approach to intervention selection and tracking
+2. **Check & Connect Intervention** (University of Minnesota/WWC)
+   - *Evidence Level*: Only dropout prevention program with WWC "Positive Effects" rating
+   - *Focus*: Structured mentoring for attendance and credit recovery
+3. **Predictive Power of Ninth-Grade GPA** (University of Chicago Consortium)
+   - *Strategic Value*: Research foundation explaining why FOT interventions matter
+   - *Focus*: Data-driven rationale for early intervention
+4. **Preventing Chronic Absence and Promoting Attendance** (REL Program)
+   - *Evidence Base*: Tiered, research-validated attendance strategies
+   - *Focus*: Family engagement, transportation, and systemic barriers
+5. **Addressing Root Causes of Disparities in School Discipline** (NCSSLE)
+   - *Methodology*: Systematic root-cause analysis for behavioral interventions
+   - *Focus*: Data-driven behavioral support strategies
+### Data Processing Strategy
+**Content Extraction** (Hybrid Strategy):
+- **Tier 1**: PyMuPDF (fitz) for rapid extraction of simple, single-column text pages
+- **Tier 2**: pdfplumber for structured tabular data to preserve relational integrity
+- **Tier 3**: Nougat (Meta AI) layout-aware model for complex multi-column layouts and flowcharts
+- **Quality Assurance**: Manual review and validation of extracted content accuracy
+**Chunking Approach**:
+- **Semantic Chunking**: Break documents by intervention type, not arbitrary word limits
+- **Chunk Size**: 300-500 words to maintain context while enabling precise retrieval
+- **Overlap Strategy**: 50-word overlap to preserve cross-boundary context
+- **Metadata Tagging**: Source document, intervention category, target indicators
+**Content Preparation**:
+- Standardize intervention descriptions with consistent format
+- Extract key implementation steps and required resources
+- Tag interventions by target risk factors (attendance, credits, behavior)
+- Create intervention summaries optimized for educator consumption
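The metadata tags listed above could travel with each chunk in a small record type, which also makes indicator-based filtering straightforward. The class and field names below mirror the tags in this plan but are otherwise hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source_document: str
    intervention_category: str
    target_indicators: list = field(default_factory=list)  # attendance, credits, behavior

def filter_by_indicator(chunks, indicator):
    """Keep only chunks tagged with the given risk factor."""
    return [c for c in chunks if indicator in c.target_indicators]
```

Filtering on tags before or after semantic search is one way to keep retrieved chunks aligned with the student's flagged indicators.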
+---
+## AI as a Co-pilot Strategy
+### Development Acceleration
+**GitHub Copilot**:
+- Code generation for standard RAG pipeline components
+- Boilerplate reduction for data processing and API endpoints
+- Test case generation for validation scenarios
+**Large Language Models (GPT-4/Claude)**:
+- **Document Analysis**: Rapid extraction of key intervention strategies from research papers
+- **Prompt Engineering**: Optimize prompts for educator-specific output formatting
+- **Content Synthesis**: Transform academic language into practitioner-friendly recommendations
+- **Code Review**: Architecture validation and optimization suggestions
+### Problem-Solving Workflow
+1. **Research Phase**: Use LLMs to quickly synthesize intervention research and identify gaps
+2. **Architecture Design**: Validate technical approach against startup scaling requirements
+3. **Implementation**: Leverage Copilot for rapid prototype development
+4. **Testing**: AI-assisted generation of diverse student profile test cases
+5. **Optimization**: LLM-powered analysis of retrieval quality and recommendation relevance
+### Quality Assurance
+- **Prompt Validation**: Use AI to generate edge cases for robust testing
+- **Content Review**: AI-assisted verification that academic content translates to actionable guidance
+- **Bias Detection**: Systematic review of recommendations for potential equity issues
+---
+## Success Metrics & Next Steps
+**PoC Success Criteria**:
+- Accurate retrieval of top 3 relevant interventions for sample student profile
+- Educator-friendly output format with clear implementation guidance
+- Sub-2-second response time for typical queries
+- Proper source citation for all recommendations
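The response-time criterion is easy to check with a small timing harness. In this sketch `fn` stands for whatever recommendation function the PoC exposes; the helper names and the choice to judge by the slowest of five runs are assumptions.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def meets_latency_target(fn, query, target_seconds=2.0, runs=5):
    """True if the slowest of `runs` calls stays under the target."""
    worst = max(timed(fn, query)[1] for _ in range(runs))
    return worst < target_seconds
```

The first call may include model loading, so it can be worth warming up the pipeline once before measuring.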
+**Production Evolution Path**:
+1. **Enhanced Knowledge Base**: Scale to 50+ intervention documents
+2. **Persona-Based Outputs**: Tailored recommendations for teachers, parents, principals
+3. **API Microservice**: RESTful service for integration with SIS platforms
+4. **Analytics Dashboard**: Track intervention effectiveness and usage patterns
+This PoC establishes the foundation for a scalable, evidence-based intervention recommendation system that can transform how educators support at-risk freshmen nationwide.

docs/research_plan.md ADDED

	@@ -0,0 +1,62 @@

+# Final Research Brief: Knowledge Base for the FOT Intervention Recommender
+## 1. Project Context
+The goal is to build a Retrieval-Augmented Generation (RAG) system that recommends evidence-based interventions for at-risk 9th-grade students. This research brief outlines the process for identifying at least five high-quality sources that detail specific, actionable intervention strategies. These sources will form the core knowledge base, complementing the strategic framework provided in the FOT Toolkit Tool Set C.
+## 2. Guiding Philosophy: From "Map" to "Tour Guide"
+Our knowledge base strategy is guided by a clear distinction:
+**The FOT Toolkit is the "Map"**: It provides the high-level framework for planning, tracking, and evaluating interventions.
+**Our Curated Sources are the "Tour Guides"**: They must provide the detailed, step-by-step "playbooks" that describe exactly how to implement a specific intervention.
+## 3. Research Objectives
+**Primary Goal**: Identify 5+ authoritative documents that provide specific, evidence-based, and actionable intervention strategies for 9th-grade students.
+**Focus Areas**: The search will prioritize interventions that directly address the core Freshman On-Track indicators. While these often exist within a larger Multi-Tiered System of Supports (MTSS) framework, our focus is on the specific actions, not the system's architecture.
+- **Academic Recovery Interventions**: (e.g., Credit recovery, targeted tutoring)
+- **Attendance Improvement Strategies**: (e.g., Chronic absenteeism programs, mentoring)
+- **Behavioral & Social-Emotional Supports**: (e.g., Tier 2 behavioral interventions, SEL programs)
+## 4. Search Strategy
+**Primary Keywords**: "freshman on-track interventions", "9th grade student support", "high school transition interventions", "early warning systems high school", "tier 2 interventions secondary".
+**Specific Keywords**: "credit recovery programs", "chronic absenteeism interventions", "freshman mentoring programs", "high-dosage tutoring", "restorative practices high school".
+**Authoritative Sources**:
+- **Research Institutions**: What Works Clearinghouse (WWC), University of Chicago Consortium on School Research, Regional Educational Labs (RELs), Institute of Education Sciences (IES)
+- **Educational Organizations**: National High School Center, RTI Action Network, Attendance Works, ASCD
+- **Databases**: ERIC, peer-reviewed educational journals
+## 5. Quality Criteria Checklist for Source Selection
+Each selected source must meet the following criteria:
+- [✔] **Specificity**: Contains detailed, step-by-step procedures, not just high-level theory
+- [✔] **Evidence-Based**: Includes outcome data, research validation, or is cited by a reputable clearinghouse
+- [✔] **Implementation-Ready**: Provides practical guidance, templates, or examples for educators
+- [✔] **Freshman-Focused**: Specifically addresses the needs of 9th-grade or transitioning high school students
+- [✔] **Complementary**: Adds new, actionable content not already covered in the FOT Toolkit's framework
+## 6. Deliverable for Each Curated Source
+For each of the five (or more) documents selected, a standardized summary will be created for inclusion in the "Deliverable 1: Project Plan." This summary is crucial and must contain:
+- **Citation**: Full title, author/organization, and a direct URL
+- **Intervention Category**: The primary domain it addresses (Academic, Attendance, or Behavior)
+- **Core Strategy**: A one-sentence summary of the intervention's central concept
+- **Actionable Components**: 3-4 bullet points detailing the specific, repeatable steps an educator would take to implement the intervention
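The standardized summary could be captured as a record with a self-check, so incomplete entries are caught before they reach the project plan. The class, field names, and the `problems` method are illustrative; only the four required elements and the 3-4 component rule come from this brief.

```python
from dataclasses import dataclass

CATEGORIES = {"Academic", "Attendance", "Behavior"}

@dataclass
class SourceSummary:
    title: str
    organization: str
    url: str
    category: str
    core_strategy: str
    actionable_components: list

    def problems(self):
        """List the ways this summary falls short of the template above."""
        issues = []
        if self.category not in CATEGORIES:
            issues.append("category must be Academic, Attendance, or Behavior")
        if not 3 <= len(self.actionable_components) <= 4:
            issues.append("expected 3-4 actionable components")
        if not self.url.startswith(("http://", "https://")):
            issues.append("citation needs a direct URL")
        return issues
```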
+## 7. Success Metrics for Research Phase
+This research phase will be considered complete when the curated knowledge base enables the future RAG system to:
+- Recommend a specific academic intervention for a student with course failures
+- Suggest a clear attendance improvement strategy for a student with <90% attendance
+- Provide a concrete behavioral support option for a student with discipline flags
+- Present an evidence-based rationale for each recommendation
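These completion criteria can be approximated with a coverage check over the curated summaries. The sketch assumes each summary is a dictionary with a `category` and an `evidence_rationale` entry; both key names and the function are hypothetical.

```python
REQUIRED_CATEGORIES = ("Academic", "Attendance", "Behavior")

def coverage_gaps(sources):
    """Return required categories lacking a source that carries an evidence rationale."""
    covered = {s["category"] for s in sources if s.get("evidence_rationale")}
    return [c for c in REQUIRED_CATEGORIES if c not in covered]
```

An empty result suggests the knowledge base has at least one evidence-backed source per indicator, though it does not replace a manual quality review.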