focusflow / TECHNICAL_DOCUMENTATION.md
FocusFlow Assistant
fix: pre-download embedding model during Docker build to avoid DNS failures on HF Spaces
3c23e28
# FocusFlow - Technical Documentation
## πŸ“‹ Project Overview
### Problem Statement
Students face significant challenges in managing self-paced learning:
- **Information Overload**: PDFs, videos, and notes scattered across sources make it difficult to create coherent study plans
- **Lack of Personalization**: Generic study materials don't adapt to individual learning pace or mastery level
- **No Progress Tracking**: Students can't easily measure improvement or identify knowledge gaps
- **Verification Gap**: No way to trace AI-generated answers back to source materials
### Solution Description
FocusFlow is an **intelligent, local-first study companion** that transforms unstructured learning materials into personalized, adaptive study experiences. It combines RAG (Retrieval-Augmented Generation) with synthetic student profiling to create customized learning paths that evolve based on performance.
**Key Innovation**: Synthetic student profiles enable the app to "remember" progress across sessions and dynamically adjust difficulty, review frequency, and content depth based on demonstrated mastery.
### Target Users
- **Self-paced learners** preparing for exams or certifications
- **Students** managing multiple subjects with varied materials
- **Knowledge workers** building expertise in new domains
- **Anyone** seeking structured, verifiable learning from diverse sources
---
## 🎯 Core Features
### 1. Multi-Subject Study Roadmap Generation
- **Automated topic extraction** from uploaded PDFs and documents
- **Multi-day planning** with topics distributed across subjects
- **Subject identification** from document content and metadata
- **Round-robin scheduling** ensures balanced coverage across all subjects
- **Progressive unlocking** - topics unlock as previous ones are completed
**Example**: Upload 3 PDFs β†’ Get 5-day plan with 3 topics/day (one from each subject)
### 2. RAG-Based Q&A System
- **Context-aware retrieval** using ChromaDB vector search
- **Conversational memory** with chat history rewriting for follow-up questions
- **Multi-source search** across all uploaded documents
- **Streaming responses** with source citation
- **Focus Mode** for distraction-free studying with side-by-side lesson/chat
### 3. Adaptive Quiz Generation
- **Context-based questions** generated from actual course material
- **Realistic distractors** using common misconceptions
- **Guaranteed 3-question format** with intelligent fallbacks
- **Score-based adaptation**:
- Perfect score (3/3) β†’ Future topics marked "Advanced"
- Low score (1-2/3) β†’ Future topics include review materials
- **Automatic unlocking** of next topic upon quiz completion
### 4. Knowledge Tracking & Mastery System
- **Subject-level mastery tracking** (High/Medium/Low)
- **Historical quiz performance** with timestamps
- **Average score calculation** across all attempts
- **Mastery-based difficulty adjustment** for future content
- **Analytics dashboard** with performance classification
### 5. Synthetic Student Profiles
- **Persistent JSON storage** in `~/.focusflow/student_profile.json`
- **Comprehensive tracking**:
- Study plan with topic metadata
- Quiz history with scores and timestamps
- Mastery levels per subject
- Time tracking per topic
- Incomplete task queue
- **Atomic writes** with backup for data integrity
- **Thread-safe operations** for concurrent access
### 6. Data Persistence & Session Resumption
- **Auto-save on key events**:
- Plan generation
- Quiz completion
- Topic transitions
- **Auto-load on startup** restores:
- Active study plan
- Quiz scores and progress
- Mastery levels
- Current position
- **Toast notifications** for save/load feedback
### 7. Citations & Source Verification
- **Expandable source references** with every AI response
- **File + page number** for each citation
- **Lesson content references** section at bottom
- **Inline citation prompts** to LLM for accurate attribution
- **Numbered citation format** for easy lookup
---
## πŸ—οΈ Technical Architecture
### Frontend Components (Streamlit)
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Control Center β”‚ Workspace β”‚ Calendar β”‚
β”‚ (Sidebar) β”‚ (Main) β”‚ (Sidebar) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ - Study Timer β”‚ - Chat UI β”‚ - Date Picker β”‚
β”‚ - Sources List β”‚ - Lessons β”‚ - Plan View β”‚
β”‚ - File Upload β”‚ - Analytics β”‚ - Today's β”‚
β”‚ - Plan Gen β”‚ - Quizzes β”‚ Topics β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
**Key UI Panels**:
- **Control Center**: Timer, source management, plan generation
- **Intelligent Workspace**: RAG chat, lesson viewer, analytics modal
- **Calendar**: Date-based topic navigation, today's task list
- **Focus Mode**: Immersive 2-column layout (chat | lesson)
**Chat Interface**:
- Message history with role differentiation (user/assistant)
- Source citation expandables
- Streaming/static responses
- Contained scrollable area (600px height)
**Lesson Viewer**:
- Markdown rendering with references section
- Scrollable document container (650px height)
- Inline citations and code examples
### Backend Logic (FastAPI)
**Planning Engine** (`generate_study_plan`):
1. Query vector store for topic-related content
2. Group documents by source (each source = subject)
3. Extract subject names from content/metadata
4. Round-robin topic selection across subjects
5. Generate multi-day schedule with metadata
**Retrieval System** (`query_knowledge_base`):
1. Context rewriting for multi-turn conversations
2. Similarity search across vector database
3. LLM synthesis with retrieved context
4. Source metadata extraction and return
**Quiz Generator** (`generate_quiz_data`):
1. Retrieve relevant content chunks for topic
2. Prompt LLM for context-based questions
3. Fallback question generation from raw content
4. Ensure exact 3-question output
5. Return structured quiz data
**Lesson Generator** (`generate_lesson_content`):
1. Retrieve 8 context chunks (500 chars each)
2. Extract source citations from metadata
3. Prompt for structured lesson (600-800 words)
4. Append references section with file + page
5. Return formatted Markdown
### Data Storage
**Vector Database (ChromaDB)**:
- Local storage at `./chroma_db`
- Nomic-embed-text embeddings
- Metadata: source path, page number
- Persistent across sessions
**Student Profiles (JSON)**:
```json
{
"student_id": "student_20260105_233537",
"study_plan": {
"plan_id": "plan_...",
"topics": [...],
"num_days": 5
},
"quiz_history": [...],
"mastery_tracker": {...},
"time_tracking": {...},
"incomplete_tasks": [...]
}
```
**File Storage**:
- Uploaded PDFs in `./data/`
- Profile at `~/.focusflow/student_profile.json`
- Backup at `~/.focusflow/student_profile.backup.json`
### Agentic Behaviors
**Multi-Step Planning**:
- Query β†’ Retrieval β†’ Topic Extraction β†’ Subject Grouping β†’ Schedule Generation
- 5+ steps with intermediate reasoning
**Tool Use**:
- Vector DB search
- LLM generation
- Profile read/write
- PDF ingestion
**Memory**:
- Chat history (5 last messages)
- Student profile persistence
- Quiz performance tracking
- Mastery levels
**Reflection**:
- Score-based plan adaptation
- Context quality assessment
- Fallback strategies for generation failures
---
## πŸ’» Tech Stack
### Languages & Frameworks
- **Frontend**: Python + Streamlit 1.x
- **Backend**: FastAPI + Uvicorn
- **Vector DB**: ChromaDB
- **LLM Orchestration**: LangChain
### Libraries & APIs
**Core Dependencies**:
```python
streamlit # Frontend UI
fastapi # Backend API
uvicorn # ASGI server
langchain # LLM orchestration
chromadb # Vector database
ollama # Local LLM inference
requests # API communication
pydantic # Data validation
```
**Document Processing**:
```python
pypdf # PDF parsing
beautifulsoup4 # Web scraping
youtube-transcript-api # Video transcripts
```
**Data & Visualization**:
```python
pandas # Data manipulation
plotly # Analytics charts
```
### Models
- **Embedding**: `nomic-embed-text` (local via Ollama)
- **Generation**: `llama3.2:1b` (local via Ollama)
### Storage Methods
- **Vector Store**: ChromaDB (local, persistent)
- **Profiles**: JSON files (atomic writes)
- **PDF Files**: Local filesystem (`./data/`)
- **Session State**: Streamlit session storage
---
## πŸ”„ Key Workflows
### Workflow 1: Study Plan Generation
```
User uploads PDFs
↓
Backend ingests β†’ Chunks β†’ Embeds β†’ Stores in ChromaDB
↓
User: "Create 5-day plan"
↓
retrieve_topics(k=20)
↓
group_by_source() β†’ identify_subjects()
↓
round_robin_schedule(num_days=5)
↓
save_to_profile() β†’ Return plan
↓
Frontend displays Today's Topics (Day 1 unlocked)
```
### Workflow 2: RAG Retrieval
```
User asks: "What is encapsulation?"
↓
history_exists? β†’ rewrite_query() [Context rewriting]
↓
similarity_search(question, k=3)
↓
build_prompt(context + history + question)
↓
llm.invoke() β†’ extract_sources()
↓
Return {answer: str, sources: [{file, page}]}
↓
Frontend displays answer + expandable citations
```
### Workflow 3: Adaptive Quiz Flow
```
User unlocks Topic
↓
load_lesson() β†’ display_markdown()
↓
User clicks "Take Quiz"
↓
retrieve_context(topic, k=8)
↓
generate_quiz(3_questions)
↓
fallback if < 3? β†’ context_based_fallback()
↓
User answers β†’ calculate_score()
↓
score==3? β†’ mark_advanced()
score<3? β†’ mark_review()
↓
update_mastery_tracker() β†’ save_profile()
↓
unlock_next_topic() β†’ rerun()
```
### Workflow 4: Mastery Tracking Adaptation
```
Quiz completed with score X
↓
update_subject_mastery({
scores: [..., X],
avg_score: calculate_average(),
mastery_level: determine_level() // High: β‰₯75%, Medium: β‰₯50%, Low: <50%
})
↓
mastery_level==HIGH?
β†’ Future topics: Faster pace, advanced examples
mastery_level==LOW?
β†’ Future topics: More review, foundational content
↓
save_to_profile()
↓
Next plan generation uses mastery data for difficulty
```
---
## πŸ“Š Evaluation Metrics
### 1. Plan Quality Assessment
**Metrics**:
- **Subject Coverage**: % of uploaded subjects represented daily
- **Balance Score**: Standard deviation of topics per subject
- **Unlocking Logic**: % of topics that unlock correctly after quiz
**Target**:
- 100% subject coverage (all PDFs represented)
- StdDev < 0.5 (even distribution)
- 100% unlock success rate
### 2. Answer Accuracy Measurement
**Metrics**:
- **Source Relevance**: Cosine similarity of retrieved chunks
- **Citation Accuracy**: % of answers with valid file+page citations
- **Hallucination Rate**: Manual review of 50 Q&A pairs
**Target**:
- Avg similarity > 0.7
- 95%+ citation accuracy
- <5% hallucination rate
### 3. Quiz Discrimination
**Metrics**:
- **Question Validity**: % of questions answerable from provided context
- **Distractor Quality**: % of students choosing incorrect options
- **Difficulty Spread**: Distribution across easy/medium/hard
**Target**:
- 100% context-answerable
- 25-40% distractor selection rate (not too easy/hard)
- Balanced difficulty distribution
### 4. User Mastery Gains
**Metrics**:
- **Score Progression**: Ξ” average score from Day 1 to Day N
- **Mastery Level Changes**: % of subjects moving from Low β†’ Medium β†’ High
- **Retention Rate**: Quiz score on repeated topics after 1 week
**Target**:
- +15% average score improvement over 5 days
- 60%+ mastery level improvement
- 80%+ retention on repeated topics
### 5. System Performance
**Metrics**:
- **Plan Generation Time**: Seconds to generate 5-day plan
- **Query Response Time**: Seconds from question to answer
- **Profile Save Latency**: Milliseconds for atomic write
**Target**:r
- Plan gen: <10 seconds
- Query response: <5 seconds
- Save latency: <100ms
---
## πŸš€ Future Enhancements
- **Spaced Repetition**: Intelligent review scheduling using SM-2 algorithm
- **Multi-User Support**: Authentication + isolated student profiles
- **Cloud Deployment**: Oracle Cloud + Supabase for persistence
- **Advanced Analytics**: Learning curve visualization, weak area identification
- **Mobile Responsive**: Material Design responsive UI for mobile devices
---
## πŸ“ Project Repository
**GitHub**: [thesivarohith/hack](https://github.com/thesivarohith/hack)
**Status**: Production-ready, cleaned codebase (commit: 9a8a489)
---
*Documentation generated: 2026-01-06*