FocusFlow - Technical Documentation

📋 Project Overview

Problem Statement

Students face significant challenges in managing self-paced learning:

Information Overload: PDFs, videos, and notes scattered across sources make it difficult to create coherent study plans
Lack of Personalization: Generic study materials don't adapt to individual learning pace or mastery level
No Progress Tracking: Students can't easily measure improvement or identify knowledge gaps
Verification Gap: No way to trace AI-generated answers back to source materials

Solution Description

FocusFlow is an intelligent, local-first study companion that transforms unstructured learning materials into personalized, adaptive study experiences. It combines RAG (Retrieval-Augmented Generation) with synthetic student profiling to create customized learning paths that evolve based on performance.

Key Innovation: Synthetic student profiles enable the app to "remember" progress across sessions and dynamically adjust difficulty, review frequency, and content depth based on demonstrated mastery.

Target Users

Self-paced learners preparing for exams or certifications
Students managing multiple subjects with varied materials
Knowledge workers building expertise in new domains
Anyone seeking structured, verifiable learning from diverse sources

🎯 Core Features

1. Multi-Subject Study Roadmap Generation

Automated topic extraction from uploaded PDFs and documents
Multi-day planning with topics distributed across subjects
Subject identification from document content and metadata
Round-robin scheduling ensures balanced coverage across all subjects
Progressive unlocking - topics unlock as previous ones are completed

Example: Upload 3 PDFs → Get 5-day plan with 3 topics/day (one from each subject)
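The round-robin idea in the example above can be sketched as follows. This is a minimal illustration rather than the project's actual planner: the function name `round_robin_schedule`, the input shape (a dict mapping subject name to an ordered topic list), and the output keys are all assumptions.

```python
from collections import deque


def round_robin_schedule(topics_by_subject, num_days):
    """Build a day-by-day plan that takes one topic per subject per day.

    topics_by_subject: {"Subject": ["topic 1", "topic 2", ...]} (assumed shape)
    """
    # One queue per subject so each day can pop the next unscheduled topic.
    queues = {subject: deque(topics) for subject, topics in topics_by_subject.items()}
    plan = []
    for day in range(1, num_days + 1):
        todays = [
            {"subject": subject, "title": queue.popleft()}
            for subject, queue in queues.items()
            if queue  # skip subjects that have run out of topics
        ]
        if not todays:
            break  # every subject is exhausted before num_days was reached
        plan.append({"day": day, "topics": todays})
    return plan
```

With three subjects that each have at least five topics, `round_robin_schedule(..., num_days=5)` yields five days of three topics each, matching the example. Subjects with fewer topics simply drop out of later days.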

2. RAG-Based Q&A System

Context-aware retrieval using ChromaDB vector search
Conversational memory with chat history rewriting for follow-up questions
Multi-source search across all uploaded documents
Streaming responses with source citation
Focus Mode for distraction-free studying with side-by-side lesson/chat

3. Adaptive Quiz Generation

Context-based questions generated from actual course material
Realistic distractors using common misconceptions
Guaranteed 3-question format with intelligent fallbacks
Score-based adaptation:
- Perfect score (3/3) → Future topics marked "Advanced"
- Low score (0-2/3) → Future topics include review materials
Automatic unlocking of next topic upon quiz completion
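A hedged sketch of the adaptation and unlocking rules listed above, treating any score below 3 as a review trigger (as the quiz workflow does). The topic fields `difficulty` and `unlocked` and the function name are hypothetical; the real profile schema may differ.

```python
QUIZ_LENGTH = 3  # quizzes are documented as always having three questions


def adapt_future_topics(topics, current_index, score):
    """Label later topics from a quiz score and unlock the next topic in place."""
    label = "Advanced" if score == QUIZ_LENGTH else "Review"
    # Only topics after the one just completed are relabelled.
    for topic in topics[current_index + 1:]:
        topic["difficulty"] = label
    # Completing the quiz unlocks the next topic regardless of the score.
    if current_index + 1 < len(topics):
        topics[current_index + 1]["unlocked"] = True
    return topics
```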

4. Knowledge Tracking & Mastery System

Subject-level mastery tracking (High/Medium/Low)
Historical quiz performance with timestamps
Average score calculation across all attempts
Mastery-based difficulty adjustment for future content
Analytics dashboard with performance classification

5. Synthetic Student Profiles

Persistent JSON storage in ~/.focusflow/student_profile.json
Comprehensive tracking:
- Study plan with topic metadata
- Quiz history with scores and timestamps
- Mastery levels per subject
- Time tracking per topic
- Incomplete task queue
Atomic writes with backup for data integrity
Thread-safe operations for concurrent access
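One common way to combine atomic writes, a backup copy, and thread safety is sketched below. It assumes a single process guarded by a `threading.Lock`; the function name `save_profile` and the temp-file approach are illustrative and not confirmed by the source. The backup filename follows the `student_profile.backup.json` pattern.

```python
import json
import os
import shutil
import tempfile
import threading
from pathlib import Path

_profile_lock = threading.Lock()  # serialises writers within this process only


def save_profile(profile, path):
    """Write the profile as JSON without ever leaving a half-written file."""
    path = Path(path)
    backup = path.with_name(path.stem + ".backup" + path.suffix)
    with _profile_lock:
        path.parent.mkdir(parents=True, exist_ok=True)
        if path.exists():
            shutil.copy2(path, backup)  # keep the last good version
        # Write to a temp file in the same directory, then swap it into place.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(profile, tmp, indent=2)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_name, path)  # atomic rename on the same filesystem
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
```

Because the rename happens only after the data is flushed, a crash mid-write leaves the previous profile untouched, and the backup holds the version before that.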

6. Data Persistence & Session Resumption

Auto-save on key events:
- Plan generation
- Quiz completion
- Topic transitions
Auto-load on startup restores:
- Active study plan
- Quiz scores and progress
- Mastery levels
- Current position
Toast notifications for save/load feedback

7. Citations & Source Verification

Expandable source references with every AI response
File + page number for each citation
Lesson content references section at bottom
Inline citation prompts to LLM for accurate attribution
Numbered citation format for easy lookup

🏗️ Technical Architecture

Frontend Components (Streamlit)

┌─────────────────────────────────────────────────┐
│  Control Center  │  Workspace  │   Calendar     │
│  (Sidebar)       │  (Main)     │   (Sidebar)    │
├──────────────────┼─────────────┼────────────────┤
│ - Study Timer    │ - Chat UI   │ - Date Picker  │
│ - Sources List   │ - Lessons   │ - Plan View    │
│ - File Upload    │ - Analytics │ - Today's      │
│ - Plan Gen       │ - Quizzes   │   Topics       │
└──────────────────┴─────────────┴────────────────┘

Key UI Panels:

Control Center: Timer, source management, plan generation
Intelligent Workspace: RAG chat, lesson viewer, analytics modal
Calendar: Date-based topic navigation, today's task list
Focus Mode: Immersive 2-column layout (chat | lesson)

Chat Interface:

Message history with role differentiation (user/assistant)
Source citation expandables
Streaming/static responses
Contained scrollable area (600px height)

Lesson Viewer:

Markdown rendering with references section
Scrollable document container (650px height)
Inline citations and code examples

Backend Logic (FastAPI)

Planning Engine (generate_study_plan):

Query vector store for topic-related content
Group documents by source (each source = subject)
Extract subject names from content/metadata
Round-robin topic selection across subjects
Generate multi-day schedule with metadata

Retrieval System (query_knowledge_base):

Context rewriting for multi-turn conversations
Similarity search across vector database
LLM synthesis with retrieved context
Source metadata extraction and return

Quiz Generator (generate_quiz_data):

Retrieve relevant content chunks for topic
Prompt LLM for context-based questions
Fallback question generation from raw content
Ensure exact 3-question output
Return structured quiz data
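The "exact 3-question" guarantee could be enforced along these lines. This is a sketch under assumptions: the question dict shape (`question`, `options`, `answer`), the helper names, and the true/false fallback style are all hypothetical stand-ins for whatever `generate_quiz_data` actually does.

```python
def is_valid(question):
    """Accept only well-formed questions whose answer is one of the options."""
    return (
        isinstance(question, dict)
        and bool(question.get("question"))
        and isinstance(question.get("options"), list)
        and len(question["options"]) >= 2
        and question.get("answer") in question["options"]
    )


def fallback_question(chunk):
    """Build a simple true/false question from the first sentence of a chunk."""
    sentence = chunk.split(".")[0].strip()
    return {
        "question": "True or false: " + sentence + ".",
        "options": ["True", "False"],
        "answer": "True",
    }


def ensure_three_questions(llm_questions, context_chunks, total=3):
    """Keep valid LLM questions, then pad from raw content up to `total`."""
    quiz = [q for q in llm_questions if is_valid(q)][:total]
    for chunk in context_chunks:
        if len(quiz) >= total:
            break
        quiz.append(fallback_question(chunk))
    if len(quiz) < total:
        raise ValueError("not enough content to build a full quiz")
    return quiz
```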

Lesson Generator (generate_lesson_content):

Retrieve 8 context chunks (500 chars each)
Extract source citations from metadata
Prompt for structured lesson (600-800 words)
Append references section with file + page
Return formatted Markdown
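The final two steps (appending a references section with file and page) might look like the sketch below. The metadata keys `source` and `page` are assumed from the "source path, page number" description, and the heading text and numbering style are illustrative.

```python
from pathlib import Path


def append_references(lesson_md, chunk_metadata):
    """Append a numbered, de-duplicated references section to lesson Markdown."""
    seen = []
    for meta in chunk_metadata:
        entry = (Path(meta["source"]).name, meta.get("page"))
        if entry not in seen:  # the same page is often retrieved more than once
            seen.append(entry)
    if not seen:
        return lesson_md
    lines = ["", "## References", ""]
    for number, (name, page) in enumerate(seen, start=1):
        page_part = f", page {page}" if page is not None else ""
        lines.append(f"{number}. {name}{page_part}")
    return lesson_md.rstrip() + "\n" + "\n".join(lines) + "\n"
```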

Data Storage

Vector Database (ChromaDB):

Local storage at ./chroma_db
nomic-embed-text embeddings
Metadata: source path, page number
Persistent across sessions

Student Profiles (JSON):

{
  "student_id": "student_20260105_233537",
  "study_plan": {
    "plan_id": "plan_...",
    "topics": [...],
    "num_days": 5
  },
  "quiz_history": [...],
  "mastery_tracker": {...},
  "time_tracking": {...},
  "incomplete_tasks": [...]
}
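A fresh profile with this shape could be created as sketched below. The timestamp-based ID format is only inferred from the example value `student_20260105_233537`, and the empty defaults are assumptions rather than the project's documented behaviour.

```python
from datetime import datetime


def new_profile(now=None):
    """Return an empty profile dict using the top-level keys shown above."""
    now = now or datetime.now()
    return {
        # Assumed format: "student_" + YYYYMMDD + "_" + HHMMSS
        "student_id": now.strftime("student_%Y%m%d_%H%M%S"),
        "study_plan": None,      # filled in after the first plan generation
        "quiz_history": [],
        "mastery_tracker": {},
        "time_tracking": {},
        "incomplete_tasks": [],
    }
```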

File Storage:

Uploaded PDFs in ./data/
Profile at ~/.focusflow/student_profile.json
Backup at ~/.focusflow/student_profile.backup.json

Agentic Behaviors

Multi-Step Planning:

Query → Retrieval → Topic Extraction → Subject Grouping → Schedule Generation
5+ steps with intermediate reasoning

Tool Use:

Vector DB search
LLM generation
Profile read/write
PDF ingestion

Memory:

Chat history (last 5 messages)
Student profile persistence
Quiz performance tracking
Mastery levels

Reflection:

Score-based plan adaptation
Context quality assessment
Fallback strategies for generation failures

💻 Tech Stack

Languages & Frameworks

Frontend: Python + Streamlit 1.x
Backend: FastAPI + Uvicorn
Vector DB: ChromaDB
LLM Orchestration: LangChain

Libraries & APIs

Core Dependencies:

streamlit              # Frontend UI
fastapi               # Backend API
uvicorn              # ASGI server
langchain            # LLM orchestration
chromadb             # Vector database
ollama               # Local LLM inference
requests             # API communication
pydantic             # Data validation

Document Processing:

pypdf                # PDF parsing
beautifulsoup4       # Web scraping
youtube-transcript-api  # Video transcripts

Data & Visualization:

pandas               # Data manipulation
plotly               # Analytics charts

Models

Embedding: nomic-embed-text (local via Ollama)
Generation: llama3.2:1b (local via Ollama)

Storage Methods

Vector Store: ChromaDB (local, persistent)
Profiles: JSON files (atomic writes)
PDF Files: Local filesystem (./data/)
Session State: Streamlit session storage

🔄 Key Workflows

Workflow 1: Study Plan Generation

User uploads PDFs
      ↓
Backend ingests → Chunks → Embeds → Stores in ChromaDB
      ↓
User: "Create 5-day plan"
      ↓
retrieve_topics(k=20)
      ↓
group_by_source() → identify_subjects()
      ↓
round_robin_schedule(num_days=5)
      ↓
save_to_profile() → Return plan
      ↓
Frontend displays Today's Topics (Day 1 unlocked)
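The "Chunks" step of ingestion can be illustrated with a fixed-size character splitter that carries source and page metadata. This sketch assumes the 500-character figure quoted for lesson context is also the ingestion chunk size; the 50-character overlap and the function name `chunk_pages` are hypothetical.

```python
def chunk_pages(pages, source, chunk_size=500, overlap=50):
    """Split per-page text into overlapping chunks tagged with source and page."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        start = 0
        while start < len(text):
            piece = text[start:start + chunk_size]
            if piece.strip():  # skip whitespace-only pieces
                chunks.append({
                    "text": piece,
                    "metadata": {"source": source, "page": page_number},
                })
            if start + chunk_size >= len(text):
                break  # this piece already reached the end of the page
            start += step
    return chunks
```

Keeping the page number on every chunk is what later allows answers and lessons to cite a file and page.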

Workflow 2: RAG Retrieval

User asks: "What is encapsulation?"
      ↓
history_exists? → rewrite_query()  [Context rewriting]
      ↓
similarity_search(question, k=3)
      ↓
build_prompt(context + history + question)
      ↓
llm.invoke() → extract_sources()
      ↓
Return {answer: str, sources: [{file, page}]}
      ↓
Frontend displays answer + expandable citations
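The `build_prompt` and `extract_sources` steps above could be approximated as follows. The prompt wording, the chunk dict keys (`text`, `file`, `page`), and the combined return value are assumptions for illustration. The five-message history window comes from the documented chat memory.

```python
MAX_HISTORY_MESSAGES = 5  # documented chat-memory window


def build_prompt(question, chunks, history):
    """Return (prompt, sources) for one RAG turn."""
    recent = history[-MAX_HISTORY_MESSAGES:]
    dialogue = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    # Number each chunk so the model can cite it as [n].
    context = "\n\n".join(
        f"[{n}] {c['text']} (source: {c['file']}, page {c['page']})"
        for n, c in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer using only the numbered context and cite it as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{dialogue}\n\n"
        f"Question: {question}\nAnswer:"
    )
    sources = [{"file": c["file"], "page": c["page"]} for c in chunks]
    return prompt, sources
```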

Workflow 3: Adaptive Quiz Flow

User unlocks Topic
      ↓
load_lesson() → display_markdown()
      ↓
User clicks "Take Quiz"
      ↓
retrieve_context(topic, k=8)
      ↓
generate_quiz(3_questions)
      ↓
fallback if < 3? → context_based_fallback()
      ↓
User answers → calculate_score()
      ↓
score==3? → mark_advanced()
score<3?  → mark_review()
      ↓
update_mastery_tracker() → save_profile()
      ↓
unlock_next_topic() → rerun()

Workflow 4: Mastery Tracking Adaptation

Quiz completed with score X
      ↓
update_subject_mastery({
  scores: [..., X],
  avg_score: calculate_average(),
  mastery_level: determine_level()  // High: ≥75%, Medium: ≥50%, Low: <50%
})
      ↓
mastery_level==HIGH?
  → Future topics: Faster pace, advanced examples
mastery_level==LOW?
  → Future topics: More review, foundational content
      ↓
save_to_profile()
      ↓
Next plan generation uses mastery data for difficulty
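The thresholds in this workflow translate directly into code. The sketch below assumes scores are stored as raw quiz points out of 3 and that `avg_score` is kept as a percentage; the tracker layout is illustrative.

```python
def determine_level(avg_pct):
    """Map an average percentage to a mastery level using the stated thresholds."""
    if avg_pct >= 75:
        return "High"
    if avg_pct >= 50:
        return "Medium"
    return "Low"


def update_subject_mastery(tracker, subject, score, max_score=3):
    """Record a quiz score and refresh the subject's average and level."""
    entry = tracker.setdefault(subject, {"scores": []})
    entry["scores"].append(score)
    total_possible = len(entry["scores"]) * max_score
    entry["avg_score"] = 100 * sum(entry["scores"]) / total_possible
    entry["mastery_level"] = determine_level(entry["avg_score"])
    return entry
```

For example, scores of 3 and 0 average to 50%, which sits exactly on the Medium boundary.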

📊 Evaluation Metrics

1. Plan Quality Assessment

Metrics:

Subject Coverage: % of uploaded subjects represented daily
Balance Score: Standard deviation of topics per subject
Unlocking Logic: % of topics that unlock correctly after quiz

Target:

100% subject coverage (all PDFs represented)
StdDev < 0.5 (even distribution)
100% unlock success rate
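Subject coverage and the balance score can be computed from a generated plan as sketched here. The plan shape (a list of days, each with a `topics` list of dicts carrying a `subject` key) is an assumption, as is the choice of population standard deviation, since the source does not say which variant is intended.

```python
from collections import Counter
from statistics import pstdev


def plan_quality(plan, subjects):
    """Return daily subject coverage (%) and the std dev of topics per subject."""
    if not plan or not subjects:
        raise ValueError("plan and subjects must be non-empty")
    wanted = set(subjects)
    # A day counts as covered when every subject appears at least once.
    covered_days = sum(
        1 for day in plan
        if wanted <= {topic["subject"] for topic in day["topics"]}
    )
    counts = Counter(topic["subject"] for day in plan for topic in day["topics"])
    per_subject = [counts.get(subject, 0) for subject in subjects]
    return {
        "coverage_pct": 100 * covered_days / len(plan),
        "balance_stddev": pstdev(per_subject),
    }
```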

2. Answer Accuracy Measurement

Metrics:

Source Relevance: Cosine similarity of retrieved chunks
Citation Accuracy: % of answers with valid file+page citations
Hallucination Rate: Manual review of 50 Q&A pairs

Target:

Avg similarity > 0.7
95%+ citation accuracy
<5% hallucination rate
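The first two metrics are straightforward to compute offline. The sketch below assumes embeddings are available as plain lists of floats and that each logged answer carries a `sources` list of `{file, page}` dicts; both function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_similarity(query_vec, chunk_vecs):
    """Average similarity between a query embedding and its retrieved chunks."""
    if not chunk_vecs:
        return 0.0
    return sum(cosine_similarity(query_vec, v) for v in chunk_vecs) / len(chunk_vecs)


def citation_accuracy(answers):
    """Percentage of answers whose sources all include a file and a page."""
    if not answers:
        return 0.0
    valid = sum(
        1 for answer in answers
        if answer.get("sources")
        and all(s.get("file") and s.get("page") is not None for s in answer["sources"])
    )
    return 100 * valid / len(answers)
```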

3. Quiz Discrimination

Metrics:

Question Validity: % of questions answerable from provided context
Distractor Quality: % of students choosing incorrect options
Difficulty Spread: Distribution across easy/medium/hard

Target:

100% context-answerable
25-40% distractor selection rate (not too easy/hard)
Balanced difficulty distribution

4. User Mastery Gains

Metrics:

Score Progression: Δ average score from Day 1 to Day N
Mastery Level Changes: % of subjects moving from Low → Medium → High
Retention Rate: Quiz score on repeated topics after 1 week

Target:

+15% average score improvement over 5 days
60%+ mastery level improvement
80%+ retention on repeated topics

5. System Performance

Metrics:

Plan Generation Time: Seconds to generate 5-day plan
Query Response Time: Seconds from question to answer
Profile Save Latency: Milliseconds for atomic write

Target:

Plan gen: <10 seconds
Query response: <5 seconds
Save latency: <100ms

🚀 Future Enhancements

Spaced Repetition: Intelligent review scheduling using SM-2 algorithm
Multi-User Support: Authentication + isolated student profiles
Cloud Deployment: Oracle Cloud + Supabase for persistence
Advanced Analytics: Learning curve visualization, weak area identification
Mobile Responsive: Material Design UI that adapts to mobile devices

📝 Project Repository

GitHub: thesivarohith/hack

Status: Production-ready, cleaned codebase (commit: 9a8a489)

Documentation generated: 2026-01-06