Alleinzellgaenger Claude committed
Commit c070e10 · 1 Parent(s): a706099

WIP: Development progress - mixed features and improvements


Multiple features in ongoing development:
- Enhanced PDF viewer with zoom controls and pagination
- Integrated chat functionality with chunk navigation
- Gemini API integration for improved document chunking
- Various component refactoring and optimization attempts

This commit captures a mixed development state before branch cleanup.
It will be reorganized into focused commits on the dev branch.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

.claude/sessions/2025-08-03-1200.md CHANGED
@@ -34,4 +34,37 @@ Refine academic paper chunking system to address:
34
  1. **Document Modification**: Original document gets cleaned (academic content removal)
35
  2. **Figure Handling**: Simple paragraph-ending regex can't handle figures interrupting text flow
36
  3. **Position Mapping**: Positions calculated on cleaned text, not original
37
- 4. **Highlighting Injection**: Blockquote injection modifies document structure
37
+ 4. **Highlighting Injection**: Blockquote injection modifies document structure
38
+
39
+ ### Update - 2025-08-03 12:25
40
+
41
+ **Summary**: Successfully switched from broken markdown rendering to PDF viewer approach
42
+
43
+ **Git Changes**:
44
+ - Modified: frontend/src/components/DocumentProcessor.jsx, frontend/src/components/DocumentViewer.jsx
45
+ - Current branch: main (commit: a706099)
46
+
47
+ **Todo Progress**: 3 completed, 1 in progress, 0 pending
48
+ - ✓ Completed: Examine original PDF viewer components in backup files
49
+ - ✓ Completed: Commit current markdown approach as 'failed feature implementation'
50
+ - ✓ Completed: Switch DocumentViewer to use PDF instead of markdown
51
+ - 🔄 In Progress: Test the new PDF viewer implementation
52
+
53
+ **Issues Resolved**:
54
+ - Eliminated markdown rendering that was breaking document layout
55
+ - Removed document modification/cleaning that violated integrity principle
56
+ - Integrated proven PDF viewer from UploadPage.jsx component
57
+
58
+ **Solutions Implemented**:
59
+ - Replaced DocumentViewer.jsx with react-pdf implementation from UploadPage.jsx
60
+ - Added zoom controls, pagination, and smooth scrolling
61
+ - Updated DocumentProcessor.jsx to pass selectedFile instead of highlightedMarkdown
62
+ - Removed unused markdown utilities and highlighting logic
63
+ - Preserved original document completely - no text modification
64
+
65
+ **Code Changes**:
66
+ - DocumentViewer.jsx: Complete rewrite using react-pdf with zoom/navigation controls
67
+ - DocumentProcessor.jsx: Removed highlightedMarkdown logic, updated props
68
+ - Maintained all chunking functionality on right panel while fixing left panel display
69
+
70
+ **Next Steps**: Test PDF viewer functionality and implement visual chunk highlighting overlays if needed
backend/app.py CHANGED
@@ -14,7 +14,8 @@ from pydantic import BaseModel, Field
14
  from typing import Optional, List
15
  from langchain.chat_models import init_chat_model
16
  import anthropic
17
-
 
18
  # Load environment variables
19
  load_dotenv()
20
 
@@ -178,15 +179,24 @@ async def process_ocr_content(file_id: str):
178
  combined_markdown = '\n\n---\n\n'.join(all_page_markdown)
179
  print(f"📋 Combined document: {len(combined_markdown)} chars total")
180
 
181
- # Auto-chunk the entire document once
182
  document_chunks = []
183
  original_markdown = combined_markdown
184
  try:
185
- print(f"🧠 Auto-chunking entire document...")
186
- document_chunks, original_markdown = await auto_chunk_document(combined_markdown, client)
187
  print(f"📊 Document chunks found: {len(document_chunks)}")
188
  for i, chunk in enumerate(document_chunks):
189
- print(f" {i+1}. {chunk.get('topic', 'Unknown')}: {chunk.get('start_phrase', '')[:50]}...")
190
  except Exception as chunk_error:
191
  print(f"⚠️ Document chunking failed: {chunk_error}")
192
  document_chunks = []
@@ -256,9 +266,8 @@ async def get_image_base64(file_id: str, image_id: str):
256
 
257
  class ChunkSchema(BaseModel):
258
  """Schema for document chunks suitable for creating interactive lessons."""
259
- topic: str = Field(description="Brief topic name for the chunk")
260
- start_phrase: str = Field(description="First few words of the chunk (5-15 words)")
261
- end_phrase: str = Field(description="Last few words of the chunk (5-15 words)")
262
 
263
  class ChunkList(BaseModel):
264
  """Container for a list of document chunks."""
@@ -465,16 +474,373 @@ def programmatic_chunk_document(document_markdown):
465
  # The frontend will use the original document for highlighting
466
  return chunks, document_markdown
467
 
468
  async def auto_chunk_document(document_markdown, client=None):
469
- """Auto-chunk a document - now using programmatic approach instead of LLM"""
470
- chunks, original_markdown = programmatic_chunk_document(document_markdown)
471
- return chunks, original_markdown
 
472
 
473
  # Get Fireworks API key
474
  fireworks_api_key = os.environ.get("FIREWORKS_API_KEY")
475
  if not fireworks_api_key:
476
- print("⚠️ No Fireworks API key found, falling back to regular chunking")
477
- return []
478
 
479
  try:
480
  # Initialize Fireworks LLM with structured output
@@ -487,98 +853,148 @@ async def auto_chunk_document(document_markdown, client=None):
487
  # Create structured LLM that returns ChunkList object
488
  structured_llm = llm.with_structured_output(ChunkList)
489
 
490
- # Create chunking prompt
491
- prompt = f"""Imagine you are a teacher. You are given a document, and you have to decide how to dissect this document. Your task is to identify chunks of content by providing start and end phrases that can be used to create interactive lessons. Here's the document:
492
- DOCUMENT:
493
- {document_markdown}
494
 
495
- Rules:
496
- 1. Each chunk should contain 2-3 valuable lessons
497
- 2. start_phrase and end_phrase should be 5-15 words long
498
- 3. Focus on educational content (concepts, examples, key points)
499
- 4. More dense content should have more chunks, less dense content fewer chunks
500
- 5. Identify chunks that would make good interactive lessons
501
 
502
- Return a list of chunks with topic, start_phrase, and end_phrase for each. Importantly, you are passed Markdown text, so output the start and end phrases as Markdown text, and include punctuation. Never stop an end phrase in the middle of a sentence, always include the full sentence or phrase."""
503
 
504
  # Call Fireworks with structured output
505
- chunk_response = structured_llm.invoke(prompt)
506
- chunks = chunk_response.chunks
507
 
508
- # Find positions using fuzzy matching with detailed debugging
509
- positioned_chunks = []
510
- for i, chunk in enumerate(chunks):
511
- print(f"\n🔍 Processing chunk {i+1}: {chunk.topic}")
512
- print(f" Start phrase: '{chunk.start_phrase}'")
513
- print(f" End phrase: '{chunk.end_phrase}'")
514
 
515
- start_pos = fuzzy_find(document_markdown, chunk.start_phrase)
516
- end_phrase_start = fuzzy_find(document_markdown, chunk.end_phrase, start_pos or 0)
 
 
518
- print(f" Found start_pos: {start_pos}")
519
- print(f" Found end_phrase_start: {end_phrase_start}")
520
 
521
- # Add the length of the end_phrase plus a bit more to include punctuation
522
- if end_phrase_start is not None:
523
- end_pos = end_phrase_start + len(chunk.end_phrase)
524
- # Try to include punctuation that might follow
525
-
526
- # Look ahead for good stopping points, but be more careful about spaces
527
- max_extend = 15 # Don't go crazy far
528
- extended = 0
529
 
530
- while end_pos < len(document_markdown) and extended < max_extend:
531
- char = document_markdown[end_pos]
532
-
533
- # Good stopping points - include punctuation and stop
534
- if char in '.!?':
535
- end_pos += 1 # Include the punctuation
536
- break
537
- elif char in ';:,':
538
- end_pos += 1 # Include and stop
539
- break
540
- # Stop at paragraph breaks
541
- elif end_pos < len(document_markdown) - 1 and document_markdown[end_pos:end_pos+2] == '\n\n':
542
- break
543
- # Stop at LaTeX boundaries
544
- elif char == '$':
545
- break
546
- # Continue through normal chars and whitespace
547
- else:
548
- end_pos += 1
549
- extended += 1
550
- print(f" Final end_pos: {end_pos}")
551
- else:
552
- print(f" End phrase not found! Finding paragraph end...")
553
- end_pos = find_paragraph_end(document_markdown, start_pos)
554
-
555
- if start_pos is not None and end_pos is not None:
556
- # Show actual extracted text for debugging
557
- extracted_text = document_markdown[start_pos:end_pos]
558
- print(f" Extracted text: '{extracted_text[:100]}...'")
559
 
560
- if start_pos is not None:
561
- positioned_chunks.append({
562
- "topic": chunk.topic,
563
- "start_phrase": chunk.start_phrase,
564
- "end_phrase": chunk.end_phrase,
565
- "start_position": start_pos,
566
- "end_position": end_pos,
567
- "found_start": True,
568
- "found_end": end_pos is not None
569
- })
570
-
571
- # Sort chunks by position in document for chronological order
572
- positioned_chunks.sort(key=lambda chunk: chunk.get('start_position', 0))
573
- print(f"📊 Final sorted chunks: {len(positioned_chunks)}")
574
-
575
- return positioned_chunks
576
 
577
  except Exception as e:
578
  import traceback
579
  print(f"❌ Auto-chunking error: {e}")
580
  print(f"❌ Full traceback: {traceback.format_exc()}")
581
- return []
582
 
583
  @app.post("/chunk_page")
584
  async def chunk_page(request: dict):
@@ -617,6 +1033,7 @@ Rules:
617
  3. Focus on educational content (concepts, examples, key points)
618
  4. More dense content should have more chunks, less dense content fewer chunks
619
  5. Identify chunks that would make good interactive lessons
 
620
 
621
  Return a list of chunks with topic, start_phrase, and end_phrase for each."""
622
 
@@ -626,39 +1043,21 @@ Return a list of chunks with topic, start_phrase, and end_phrase for each."""
626
  chunks = chunk_response.chunks
627
  print(f"📝 Received {len(chunks)} chunks from Fireworks")
628
 
629
- # Find positions using fuzzy matching
630
- positioned_chunks = []
631
- for chunk in chunks:
632
- start_pos = fuzzy_find(document_markdown, chunk.start_phrase)
633
- end_phrase_start = fuzzy_find(document_markdown, chunk.end_phrase, start_pos or 0)
634
- # Add the length of the end_phrase plus a bit more to include punctuation
635
- if end_phrase_start is not None:
636
- end_pos = end_phrase_start + len(chunk.end_phrase)
637
- # Try to include punctuation that might follow
638
- if end_pos < len(document_markdown) and document_markdown[end_pos] in '.!?;:,':
639
- end_pos += 1
640
- else:
641
- end_pos = None
642
-
643
- if start_pos is not None:
644
- positioned_chunks.append({
645
- "topic": chunk.topic,
646
- "start_phrase": chunk.start_phrase,
647
- "end_phrase": chunk.end_phrase,
648
- "start_position": start_pos,
649
- "end_position": end_pos,
650
- "found_start": True,
651
- "found_end": end_pos is not None
652
- })
653
- print(f"✅ Found chunk: {chunk.topic} at position {start_pos}")
654
- else:
655
- print(f"❌ Could not find chunk: {chunk.topic}")
656
-
657
- print(f"📊 Successfully positioned {len(positioned_chunks)}/{len(chunks)} chunks")
658
 
659
  return {
660
- "chunks": positioned_chunks,
661
- "total_found": len(positioned_chunks),
662
  "total_suggested": len(chunks)
663
  }
664
 
 
14
  from typing import Optional, List
15
  from langchain.chat_models import init_chat_model
16
  import anthropic
17
+ import google
18
+ from google import genai
19
  # Load environment variables
20
  load_dotenv()
21
 
 
179
  combined_markdown = '\n\n---\n\n'.join(all_page_markdown)
180
  print(f"📋 Combined document: {len(combined_markdown)} chars total")
181
 
182
+ # Auto-chunk the entire document once - try Gemini first, then fallback
183
  document_chunks = []
184
  original_markdown = combined_markdown
185
  try:
186
+ print(f"🧠 Auto-chunking entire document with Gemini...")
187
+ document_chunks, original_markdown = await gemini_chunk_document(combined_markdown)
188
+
189
+ # If Gemini failed, try the old Fireworks method
190
+ if not document_chunks:
191
+ print(f"🔄 Gemini failed, falling back to Fireworks...")
192
+ document_chunks, original_markdown = await auto_chunk_document(combined_markdown, client)
193
+
194
  print(f"📊 Document chunks found: {len(document_chunks)}")
195
  for i, chunk in enumerate(document_chunks):
196
+ topic = chunk.get('topic', 'Unknown')
197
+ content = chunk.get('text', chunk.get('start_phrase', ''))
+ preview = f"{content[:50]}..." if content else 'No content'
198
+ print(f" {i+1}. {topic}: {preview}")
199
+
200
  except Exception as chunk_error:
201
  print(f"⚠️ Document chunking failed: {chunk_error}")
202
  document_chunks = []
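The hunk above wires up a fallback chain: try Gemini first, and only call the Fireworks-based `auto_chunk_document` when Gemini returns no chunks. The control flow can be sketched generically; `primary` and `fallback` below are hypothetical stand-ins for the two chunkers, not functions from the diff:

```python
def chunk_with_fallbacks(markdown, primary, fallback):
    """Run the primary chunker; fall back when it yields no chunks."""
    chunks, original = primary(markdown)
    if not chunks:  # None or an empty list triggers the fallback
        chunks, original = fallback(markdown)
    return chunks, original

# Hypothetical chunkers standing in for gemini_chunk_document / auto_chunk_document
failing = lambda md: (None, md)
working = lambda md: ([{"topic": "Intro", "text": md}], md)

chunks, original = chunk_with_fallbacks("some document", failing, working)
```

Because both chunkers return a `(chunks, markdown)` pair, the caller never has to know which backend produced the result.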
 
266
 
267
  class ChunkSchema(BaseModel):
268
  """Schema for document chunks suitable for creating interactive lessons."""
269
+ topic: str = Field(description="Brief descriptive name (2-6 words) for the educational content")
270
+ text: str = Field(description="Complete chunk text with exact markdown/LaTeX formatting preserved, containing 2-3 related educational concepts")
 
271
 
272
  class ChunkList(BaseModel):
273
  """Container for a list of document chunks."""
 
474
  # The frontend will use the original document for highlighting
475
  return chunks, document_markdown
476
 
477
+ def split_document_into_batches(document_markdown, max_chars=8000):
478
+ """Split document into manageable batches for LLM processing"""
479
+ if len(document_markdown) <= max_chars:
480
+ return [document_markdown]
481
+
482
+ batches = []
483
+ current_pos = 0
484
+
485
+ while current_pos < len(document_markdown):
486
+ # Try to find a good breaking point (paragraph boundary)
487
+ end_pos = min(current_pos + max_chars, len(document_markdown))
488
+
489
+ # If we're not at the end, try to break at a paragraph boundary
490
+ if end_pos < len(document_markdown):
491
+ # Look for \n\n within the last 1000 characters of this batch
492
+ search_start = max(end_pos - 1000, current_pos)
493
+ last_paragraph = document_markdown.rfind('\n\n', search_start, end_pos)
494
+
495
+ if last_paragraph != -1 and last_paragraph > current_pos:
496
+ end_pos = last_paragraph + 2 # Include the \n\n
497
+
498
+ batch = document_markdown[current_pos:end_pos]
499
+ batches.append(batch)
500
+ current_pos = end_pos
501
+
502
+ print(f"📄 Created batch {len(batches)}: {len(batch)} chars (pos {current_pos-len(batch)}-{current_pos})")
503
+
504
+ return batches
505
+
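The batching helper above can be exercised standalone. A minimal sketch of the same logic (prints omitted), with a quick invariant check that batch sizes stay bounded and no text is lost:

```python
def split_document_into_batches(document_markdown, max_chars=8000):
    """Split a document into batches, preferring paragraph boundaries."""
    if len(document_markdown) <= max_chars:
        return [document_markdown]
    batches = []
    current_pos = 0
    while current_pos < len(document_markdown):
        end_pos = min(current_pos + max_chars, len(document_markdown))
        if end_pos < len(document_markdown):
            # Look for a paragraph break within the last 1000 chars of the window
            search_start = max(end_pos - 1000, current_pos)
            last_paragraph = document_markdown.rfind('\n\n', search_start, end_pos)
            if last_paragraph != -1 and last_paragraph > current_pos:
                end_pos = last_paragraph + 2  # include the '\n\n'
        batches.append(document_markdown[current_pos:end_pos])
        current_pos = end_pos
    return batches

# Sanity check on a synthetic document
doc = ("para one\n\n" + "x" * 50) * 3
batches = split_document_into_batches(doc, max_chars=70)
```

Since `current_pos` always advances to the previous `end_pos`, concatenating the batches reproduces the input exactly.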
506
+ async def gemini_chunk_document(document_markdown):
507
+ """Auto-chunk a document using Google Gemini 2.5 Pro with reliable structured output"""
508
+
509
+ # Get Gemini API key
510
+ gemini_api_key = os.environ.get("GEMINI_API_KEY")
511
+ if not gemini_api_key:
512
+ print("⚠️ No Gemini API key found")
513
+ return None, document_markdown
514
+
515
+ print(f"📄 Document length: {len(document_markdown)} characters")
516
+
517
+ try:
518
+ # Initialize Gemini client
519
+ client = genai.Client(api_key=gemini_api_key)
520
+
521
+ # Split document into batches if it's too large (Gemini has token limits)
522
+ batches = split_document_into_batches(document_markdown, max_chars=12000) # Gemini can handle larger batches
523
+ print(f"📄 Split document into {len(batches)} batches for Gemini")
524
+
525
+ all_chunks = []
526
+
527
+ # Process each batch
528
+ for batch_idx, batch in enumerate(batches):
529
+ print(f"\n🔄 Processing batch {batch_idx + 1}/{len(batches)} ({len(batch)} chars) with Gemini")
530
+
531
+ try:
532
+ # Create the prompt for Gemini
533
+ prompt = f"""You are an educational content analyzer. Analyze this document section and break it into logical learning chunks.
534
+
535
+ Each chunk should:
536
+ - Contain 2-3 related educational concepts that naturally belong together
537
+ - Be 150-500 words (optimal for learning)
538
+ - Have clear educational value
539
+ - Preserve all markdown/LaTeX formatting exactly
540
+ - Skip: abstracts, acknowledgments, references, author info, page numbers
541
+
542
+ Return your response as a valid JSON object with this exact structure:
543
+ {{
544
+ "chunks": [
545
+ {{
546
+ "topic": "Brief descriptive name (2-6 words)",
547
+ "text": "Complete chunk text with exact formatting preserved"
548
+ }}
549
+ ]
550
+ }}
551
+
552
+ Document section to analyze:
553
+ {batch}
554
+
555
+ Important: Return ONLY the JSON object, no other text."""
556
+
557
+ # Call Gemini 2.5 Pro (thinking_budget=-1 lets the model choose its thinking budget)
558
+ response = client.models.generate_content(
559
+ model="gemini-2.5-pro",
560
+ contents=prompt,
561
+ config=genai.types.GenerateContentConfig(
562
+ thinking_config=genai.types.ThinkingConfig(thinking_budget=-1)
563
+ )
564
+ )
565
+
566
+ # Extract and parse response
567
+ response_text = response.text.strip()
568
+ print(f"📋 Gemini response preview: {response_text[:200]}...")
569
+
570
+ # Clean up the response (remove code blocks if present)
571
+ clean_response = response_text
572
+ if clean_response.startswith('```json'):
573
+ clean_response = clean_response[7:]
574
+ if clean_response.endswith('```'):
575
+ clean_response = clean_response[:-3]
576
+ clean_response = clean_response.strip()
577
+
578
+ # Parse JSON
579
+ try:
580
+ json_data = json.loads(clean_response)
581
+
582
+ # Validate structure
583
+ if not isinstance(json_data, dict) or 'chunks' not in json_data:
584
+ print(f"❌ Invalid response structure from Gemini batch {batch_idx + 1}")
585
+ continue
586
+
587
+ chunks = json_data['chunks']
588
+ if not isinstance(chunks, list):
589
+ print(f"❌ 'chunks' is not a list in batch {batch_idx + 1}")
590
+ continue
591
+
592
+ # Process chunks
593
+ batch_chunks = []
594
+ for i, chunk in enumerate(chunks):
595
+ if not isinstance(chunk, dict) or 'topic' not in chunk or 'text' not in chunk:
596
+ print(f"❌ Invalid chunk structure in batch {batch_idx + 1}, chunk {i}")
597
+ continue
598
+
599
+ # Clean up text formatting
600
+ chunk_text = chunk['text']
601
+ # Replace literal \n with actual newlines (lookahead keeps LaTeX like \nabla intact)
602
+ chunk_text = re.sub(r'\\n(?![A-Za-z])', '\n', chunk_text)
603
+
604
+ batch_chunks.append({
605
+ "topic": chunk['topic'],
606
+ "text": chunk_text,
607
+ "chunk_index": len(all_chunks) + len(batch_chunks)
608
+ })
609
+
610
+ print(f"✅ Processed chunk: {chunk['topic']}")
611
+
612
+ all_chunks.extend(batch_chunks)
613
+ print(f"📊 Batch {batch_idx + 1} added {len(batch_chunks)} chunks (total: {len(all_chunks)})")
614
+
615
+ except json.JSONDecodeError as e:
616
+ print(f"❌ JSON parsing failed for batch {batch_idx + 1}: {e}")
617
+ print(f"❌ Response was: {response_text}")
618
+ continue
619
+
620
+ except Exception as e:
621
+ print(f"❌ Error processing batch {batch_idx + 1} with Gemini: {e}")
622
+ continue
623
+
624
+ # Return results
625
+ if all_chunks:
626
+ print(f"✅ Gemini successfully processed document with {len(all_chunks)} total chunks")
627
+ return all_chunks, document_markdown
628
+ else:
629
+ print("❌ Gemini processing failed for all batches")
630
+ return None, document_markdown
631
+
632
+ except Exception as e:
633
+ print(f"❌ Gemini chunking error: {e}")
634
+ return None, document_markdown
635
+
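Both the Gemini path above and the Fireworks path below strip an optional markdown code fence from the model response with the same few lines before calling `json.loads`; that cleanup could be factored into one helper. A sketch (`strip_code_fence` is a hypothetical name, not in the diff):

```python
def strip_code_fence(text):
    """Remove a surrounding ```json ... ``` fence from an LLM response, if present."""
    text = text.strip()
    if text.startswith('```json'):
        text = text[7:]   # drop the opening fence
    if text.endswith('```'):
        text = text[:-3]  # drop the closing fence
    return text.strip()
```

Responses without a fence pass through unchanged, so the helper is safe to apply unconditionally.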
636
  async def auto_chunk_document(document_markdown, client=None):
637
+ """Auto-chunk a document using LLM with batch processing for large documents"""
638
+
639
+ # Debug: Print document info
640
+ print(f"📄 Document length: {len(document_markdown)} characters")
641
 
642
  # Get Fireworks API key
643
  fireworks_api_key = os.environ.get("FIREWORKS_API_KEY")
644
  if not fireworks_api_key:
645
+ print("⚠️ No Fireworks API key found, falling back to programmatic chunking")
646
+ chunks, original_markdown = programmatic_chunk_document(document_markdown)
647
+ return chunks, original_markdown
648
+
649
+ # Split document into batches if it's too large
650
+ batches = split_document_into_batches(document_markdown, max_chars=8000)
651
+ print(f"📄 Split document into {len(batches)} batches")
652
+
653
+ all_chunks = []
654
+
655
+ # Process each batch
656
+ for batch_idx, batch in enumerate(batches):
657
+ print(f"\n🔄 Processing batch {batch_idx + 1}/{len(batches)} ({len(batch)} chars)")
658
+
659
+ # Try structured output with retry logic for this batch
660
+ max_retries = 3
661
+ batch_chunks = None
662
+
663
+ for attempt in range(max_retries):
664
+ try:
665
+ print(f"🚀 Batch {batch_idx + 1} Attempt {attempt + 1}/{max_retries}: Calling Fireworks...")
666
+
667
+ # Initialize LLM
668
+ llm = init_chat_model(
669
+ "accounts/fireworks/models/llama4-maverick-instruct-basic",
670
+ model_provider="fireworks",
671
+ api_key=fireworks_api_key
672
+ )
673
+
674
+ # Use regular LLM and manual JSON parsing
675
+ prompt = f"""You are an educational content analyzer. Break this document section into logical learning chunks.
676
+
677
+ IMPORTANT: Return your response as a valid JSON object with this exact structure:
678
+ {{
679
+ "chunks": [
680
+ {{
681
+ "topic": "Brief topic name",
682
+ "text": "Complete chunk text with exact formatting"
683
+ }}
684
+ ]
685
+ }}
686
+
687
+ Rules for chunking:
688
+ - Each chunk should contain 2-3 related educational concepts
689
+ - Keep chunks concise: 100-300 words (avoid very long text blocks)
690
+ - Preserve all markdown/LaTeX formatting exactly as written
691
+ - Skip: abstracts, acknowledgements, references, author information, page numbers
692
+ - Create separate chunks for figures/tables with their captions
693
+ - Never split mathematical expressions or LaTeX formulas
694
+ - Process ALL content in this section - don't skip any educational material
695
+ - Ensure all JSON strings are properly formatted (no unescaped quotes)
696
+
697
+ Document section to analyze:
698
+ {batch}
699
+
700
+ Return only the JSON object, no other text."""
701
+
702
+ # Call regular LLM
703
+ result = llm.invoke(prompt)
704
+ print(f"📋 Raw LLM response type: {type(result)}")
705
+
706
+ # Extract text content
707
+ if hasattr(result, 'content'):
708
+ response_text = result.content
709
+ elif hasattr(result, 'text'):
710
+ response_text = result.text
711
+ else:
712
+ response_text = str(result)
713
+
714
+ print(f"📋 Response text preview: {response_text[:200]}...")
715
+
716
+ # Try to parse JSON manually
717
+
718
+ try:
719
+ # Clean up the response - remove any markdown code blocks and fix common issues
720
+ clean_response = response_text.strip()
721
+ if clean_response.startswith('```json'):
722
+ clean_response = clean_response[7:]
723
+ if clean_response.endswith('```'):
724
+ clean_response = clean_response[:-3]
725
+ clean_response = clean_response.strip()
726
+
727
+ # Fix common JSON truncation issues
728
+ # If the response doesn't end properly, try to close it
729
+ if not clean_response.endswith('}'):
730
+ # Try to find the last complete chunk entry and close properly
731
+ last_brace = clean_response.rfind('}')
732
+ if last_brace != -1:
733
+ # Find if we're inside a chunks array
734
+ chunks_start = clean_response.find('"chunks": [')
735
+ if chunks_start != -1 and last_brace > chunks_start:
736
+ # Close the chunks array and main object
737
+ clean_response = clean_response[:last_brace+1] + '\n ]\n}'
738
+ else:
739
+ clean_response = clean_response[:last_brace+1]
740
+
741
+ print(f"📋 Cleaned response preview: {clean_response[:300]}...")
742
+ print(f"📋 Cleaned response ends with: '{clean_response[-50:]}'")
743
+
744
+ # Additional safety: ensure we have a complete JSON structure
745
+ if not (clean_response.startswith('{') and clean_response.endswith('}')):
746
+ print(f"❌ Response doesn't look like valid JSON structure")
747
+ continue
748
+
749
+ # Fix common JSON escape issues with LaTeX
750
+ # Replace single backslashes with double backslashes in JSON strings
751
+ # But be careful not to affect already-escaped sequences
752
+ def fix_latex_escapes(text):
753
+ # Find all JSON string values (between quotes)
754
+ def escape_in_string(match):
755
+ string_content = match.group(1)
756
+ # Escape single backslashes in LaTeX commands
757
+ # Handle \mathrm, \left, \%, etc. but preserve JSON escapes like \n, \t, \", \\
758
+ # Pattern: backslash followed by letters OR specific LaTeX symbols like %
759
+ fixed = re.sub(r'(?<!\\)\\(?=[a-zA-Z%])', r'\\\\', string_content)
760
+ return f'"{fixed}"'
761
+
762
+ # Apply to all JSON string values
763
+ return re.sub(r'"([^"\\]*(\\.[^"\\]*)*)"', escape_in_string, text)
764
+
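The nested `fix_latex_escapes` above can be checked in isolation. A self-contained sketch of the same regex approach, turning a payload with raw LaTeX backslashes (invalid JSON escapes) into parseable JSON:

```python
import json
import re

def fix_latex_escapes(text):
    """Double lone backslashes before letters/% inside JSON string values."""
    def escape_in_string(match):
        # Re-escape LaTeX commands (\alpha, \nabla, \%) the LLM left unescaped
        fixed = re.sub(r'(?<!\\)\\(?=[a-zA-Z%])', r'\\\\', match.group(1))
        return f'"{fixed}"'
    # Apply the fix to every JSON string literal
    return re.sub(r'"([^"\\]*(?:\\.[^"\\]*)*)"', escape_in_string, text)

# A response with raw LaTeX backslashes that plain json.loads would reject/mangle
broken = '{"topic": "Math", "text": "Update rule: \\alpha \\nabla f"}'
data = json.loads(fix_latex_escapes(broken))
```

Note the trade-off: the lookahead also doubles sequences like `\n` inside strings, which is why the code later converts literal `\n` back to newlines.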
765
+ clean_response = fix_latex_escapes(clean_response)
766
+ print(f"📋 After escape fixing: {clean_response[:200]}...")
767
+
768
+ # Parse JSON
769
+ json_data = json.loads(clean_response)
770
+ print(f"📋 Successfully parsed JSON: {type(json_data)}")
771
+
772
+ # Validate with Pydantic
773
+ chunk_response = ChunkList.model_validate(json_data)
774
+ print(f"📋 Pydantic validation successful: {type(chunk_response)}")
775
+
776
+ # Fix literal \n strings in chunk text (convert to actual newlines)
777
+ for chunk in chunk_response.chunks:
778
+ if hasattr(chunk, 'text') and chunk.text:
779
+ # Replace literal \n with actual newlines for paragraph breaks
780
+ # A negative lookahead keeps LaTeX commands like \nabla intact
781
+ chunk.text = re.sub(r'\\n(?![A-Za-z])', '\n', chunk.text)
782
+
783
+ except json.JSONDecodeError as e:
784
+ print(f"❌ Attempt {attempt + 1}: JSON parsing failed: {e}")
785
+ print(f"❌ Response was: {response_text}")
786
+ continue
787
+ except Exception as e:
788
+ print(f"❌ Attempt {attempt + 1}: Pydantic validation failed: {e}")
789
+ continue
790
+
791
+ chunks = chunk_response.chunks
792
+ if not chunks or len(chunks) == 0:
793
+ print(f"⚠️ Attempt {attempt + 1}: No chunks returned")
794
+ continue
795
+
796
+ # Success! Process chunks
797
+ processed_chunks = []
798
+ for i, chunk in enumerate(chunks):
799
+ print(f"\n📝 Processing chunk {i+1}: {chunk.topic}")
800
+
801
+ if not hasattr(chunk, 'text') or not chunk.text.strip():
802
+ print(f"❌ Chunk missing or empty text: {chunk}")
803
+ continue
804
+
805
+ print(f" Text preview: '{chunk.text[:100]}...'")
806
+
807
+ processed_chunks.append({
808
+ "topic": chunk.topic,
809
+ "text": chunk.text,
810
+ "chunk_index": i
811
+ })
812
+
813
+ if processed_chunks:
814
+ print(f"✅ Successfully processed {len(processed_chunks)} chunks for batch {batch_idx + 1}")
815
+ batch_chunks = processed_chunks
816
+ break
817
+ else:
818
+ print(f"❌ Batch {batch_idx + 1} Attempt {attempt + 1}: No valid chunks processed")
819
+ continue
820
+
821
+ except Exception as e:
822
+ print(f"❌ Batch {batch_idx + 1} Attempt {attempt + 1} failed: {e}")
823
+ if attempt == max_retries - 1:
824
+ print(f"❌ All {max_retries} attempts failed for batch {batch_idx + 1}")
825
+
826
+ # Add successful batch chunks to all_chunks
827
+ if batch_chunks:
828
+ all_chunks.extend(batch_chunks)
829
+ print(f"📊 Total chunks so far: {len(all_chunks)}")
830
+ else:
831
+ print(f"⚠️ Batch {batch_idx + 1} failed completely, skipping...")
832
+
833
+ # Final results
834
+ if all_chunks:
835
+ print(f"✅ Successfully processed document with {len(all_chunks)} total chunks from {len(batches)} batches")
836
+ # Re-index all chunks sequentially
837
+ for i, chunk in enumerate(all_chunks):
838
+ chunk["chunk_index"] = i
839
+ return all_chunks, document_markdown
840
+ else:
841
+ print("🔄 All batches failed, falling back to programmatic chunking...")
842
+ chunks, original_markdown = programmatic_chunk_document(document_markdown)
843
+ return chunks, original_markdown
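The truncation repair inside the retry loop above (closing a cut-off `"chunks"` array at the last complete object) can also be sketched as a standalone helper; `close_truncated_json` is a hypothetical name for the same logic:

```python
import json

def close_truncated_json(clean_response):
    """Close a truncated {"chunks": [...]} payload at the last complete object."""
    if not clean_response.endswith('}'):
        last_brace = clean_response.rfind('}')
        if last_brace != -1:
            chunks_start = clean_response.find('"chunks": [')
            if chunks_start != -1 and last_brace > chunks_start:
                # Drop the partial trailing chunk, then close the array and object
                clean_response = clean_response[:last_brace + 1] + '\n ]\n}'
            else:
                clean_response = clean_response[:last_brace + 1]
    return clean_response

# A response cut off mid-way through its second chunk
truncated = '{"chunks": [{"topic": "A", "text": "t"}, {"topic": "B", "te'
data = json.loads(close_truncated_json(truncated))
```

The partial second chunk is discarded, which loses content but salvages the batch instead of failing the whole JSON parse.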
844
 
845
  try:
846
  # Initialize Fireworks LLM with structured output
 
853
  # Create structured LLM that returns ChunkList object
854
  structured_llm = llm.with_structured_output(ChunkList)
855
 
856
+ # Create improved chunking prompt that returns complete chunk text
857
+ prompt = f"""## Task
858
+ Analyze this academic document and create logical educational chunks. Each chunk should contain 2-3 related educational concepts or lessons that a student would naturally learn together.
 
859
 
860
+ ## Step-by-Step Process
861
+ 1. **Scan the document** to identify main topics and educational concepts
862
+ 2. **Group related paragraphs** that teach connected ideas (even if separated by figures)
863
+ 3. **Create separate chunks** for figures/tables with their captions
864
+ 4. **Ensure each chunk** contains 2-3 educational lessons that build on each other
865
+ 5. **Preserve all formatting** exactly as written
866
+
867
+ ## Chunking Rules
868
+
869
+ ### Content Rules
870
+ - **Combine related content**: If a concept is split by a figure placement, reunite the related paragraphs in one chunk
871
+ - **2-3 educational lessons per chunk**: Each chunk should teach 2-3 connected concepts that logically belong together
872
+ - **Preserve complete thoughts**: Never split sentences, mathematical expressions, or LaTeX formulas
+ - **Skip metadata sections**: Exclude abstracts, acknowledgments, references, author info, page numbers
 
+ ### Formatting Rules
+ - **Preserve exactly**: All markdown, LaTeX, mathematical notation, and formatting
+ - **Include paragraph breaks**: Maintain original \\n\\n paragraph separations
+ - **Remove artifacts**: Strip page numbers, headers, footers, and formatting metadata
+
+ ### Special Elements
+ - **Figures/Tables/Images**: Create separate chunks containing the full caption and any accompanying text
+ - **Mathematical expressions**: Keep complete formulas together, never split LaTeX
+ - **Code blocks**: Preserve in their entirety with proper formatting
+
+ ## Output Format
+ Return a JSON object with this exact schema:
+
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Brief descriptive name (2-6 words) for the educational content",
+ "text": "Complete chunk text with exact markdown/LaTeX formatting preserved"
+ }}
+ ]
+ }}
+ ```
+
+ ## Quality Criteria
+ **Good chunks:**
+ - Contain 2-3 related educational concepts
+ - Are 150-500 words (optimal learning unit size)
+ - Have clear educational value and logical flow
+ - Preserve all original formatting perfectly
+
+ **Avoid:**
+ - Single-sentence chunks
+ - Chunks with >5 unrelated concepts
+ - Split mathematical expressions
+ - Metadata or reference content
+
+ ## Examples
+
+ **Good chunk example:**
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Gradient Descent Fundamentals",
+ "text": "## Gradient Descent Algorithm\\n\\nGradient descent is an optimization algorithm used to minimize functions...\\n\\n### Mathematical Formulation\\n\\nThe update rule is given by:\\n\\n$\\theta_{{t+1}} = \\theta_t - \\alpha \\nabla f(\\theta_t)$\\n\\nwhere $\\alpha$ is the learning rate..."
+ }}
+ ]
+ }}
+ ```
+
+ **Bad chunk example:**
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Introduction",
+ "text": "This paper presents..."
+ }}
+ ]
+ }}
+ ```
+ (Too brief, not educational content)
+
+ ---
+
+ ## Document to Process:
+ {document_markdown}
+
+ Please analyze the document and return the JSON object with chunks following the above guidelines.
+ """
  # Call Fireworks with structured output
+ print("🚀 Calling Fireworks for document chunking...")
+ try:
+ chunk_response = structured_llm.invoke(prompt)
+ print(f"📋 Raw response type: {type(chunk_response)}")
+ print(f"📋 Raw response: {chunk_response}")
+ except Exception as invoke_error:
+ print(f"❌ Error during Fireworks invoke: {invoke_error}")
+ return [], document_markdown
+
+ if chunk_response is None:
+ print("❌ Received None response from Fireworks")
+ return [], document_markdown
+
+ if not hasattr(chunk_response, 'chunks'):
+ print(f"❌ Response missing 'chunks' attribute: {type(chunk_response)}")
+ print(f"Response content: {chunk_response}")
+ return [], document_markdown
+
+ chunks = chunk_response.chunks
+ if not chunks:
+ print("⚠️ No chunks returned from Fireworks")
+ return [], document_markdown
+
+ # Process chunks with direct text (no fuzzy matching needed)
+ processed_chunks = []
+ for i, chunk in enumerate(chunks):
+ print(f"\n📝 Processing chunk {i+1}: {chunk.topic}")
+
+ # Check if chunk has the expected 'text' attribute
+ if not hasattr(chunk, 'text'):
+ print(f"❌ Chunk missing 'text' attribute: {chunk}")
+ continue
+
+ print(f" Text preview: '{chunk.text[:100]}...'")
+
+ processed_chunks.append({
+ "topic": chunk.topic,
+ "text": chunk.text,
+ "chunk_index": i
+ })
+
+ print(f"📊 Processed {len(processed_chunks)} chunks with direct text")
+
+ return processed_chunks, document_markdown
  except Exception as e:
  import traceback
  print(f"❌ Auto-chunking error: {e}")
  print(f"❌ Full traceback: {traceback.format_exc()}")
+ return [], document_markdown

  @app.post("/chunk_page")
  async def chunk_page(request: dict):
 
  3. Focus on educational content (concepts, examples, key points)
  4. More dense content should have more chunks, less dense content fewer chunks
  5. Identify chunks that would make good interactive lessons
+ 6. SKIP chunks from abstract, references, author information, page numbers, etc.

  Return a list of chunks with topic, start_phrase, and end_phrase for each."""

  chunks = chunk_response.chunks
  print(f"📝 Received {len(chunks)} chunks from Fireworks")

+ # Process chunks with direct text (no fuzzy matching needed)
+ processed_chunks = []
+ for i, chunk in enumerate(chunks):
+ processed_chunks.append({
+ "topic": chunk.topic,
+ "text": chunk.text,
+ "chunk_index": i
+ })
+ print(f"✅ Processed chunk: {chunk.topic}")
+
+ print(f"📊 Successfully processed {len(processed_chunks)} chunks")

  return {
+ "chunks": processed_chunks,
+ "total_found": len(processed_chunks),
  "total_suggested": len(chunks)
  }
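
The defensive checks in the two endpoints above (None response, missing `chunks` attribute, per-chunk missing `text`) follow one pattern and could be factored into a single guard. A sketch, assuming the response object mirrors the structured-output schema; `extract_chunks` is a hypothetical helper, not a function in this repo:

```python
def extract_chunks(response):
    """Flatten a structured-output response into plain dicts.

    Returns [] for a None or malformed response instead of raising,
    matching the endpoints' fallback behaviour above.
    """
    chunks = getattr(response, "chunks", None) or []
    processed = []
    for i, chunk in enumerate(chunks):
        if not hasattr(chunk, "text"):  # skip malformed entries instead of failing
            continue
        processed.append({"topic": chunk.topic, "text": chunk.text, "chunk_index": i})
    return processed
```

Since `extract_chunks(None)` simply returns `[]`, callers can keep a single `return [], document_markdown` fallback for every failure mode.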
 
backend/requirements.txt CHANGED
@@ -10,3 +10,4 @@ langchain-core
  langchain-fireworks
  pydantic
  anthropic
+ google-genai
frontend/src/components/ChunkPanel.jsx CHANGED
@@ -67,7 +67,8 @@ const ChunkPanel = ({
  rehypePlugins={[rehypeRaw, rehypeKatex]}
  components={chunkMarkdownComponents}
  >
- {documentData.markdown.slice(
+ {documentData.chunks[currentChunkIndex].text ||
+ documentData.markdown.slice(
  documentData.chunks[currentChunkIndex].start_position,
  documentData.chunks[currentChunkIndex].end_position
  )}
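
The new render expression prefers the chunk's LLM-provided `text` and only falls back to position-based slicing when it is absent. The same precedence, sketched in Python for clarity (the dict keys mirror the JSX fields; the helper name is an assumption):

```python
def chunk_display_text(chunk, markdown):
    # Prefer direct text from the chunker; fall back to slicing by stored positions.
    if chunk.get("text"):
        return chunk["text"]
    return markdown[chunk["start_position"]:chunk["end_position"]]
```

This keeps older documents (which only carry `start_position`/`end_position`) rendering while new chunking responses supply `text` directly.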
frontend/src/components/DocumentProcessor.jsx CHANGED
@@ -1,4 +1,4 @@
- import { useMemo } from 'react';
+ import React from 'react';
  import 'katex/dist/katex.min.css';

  // Import custom hooks
@@ -14,7 +14,7 @@ import ChunkNavigation from './ChunkNavigation';
  import ChunkPanel from './ChunkPanel';

  // Import utilities
- import { highlightChunkInMarkdown } from '../utils/markdownUtils';
+ // Note: Removed markdown highlighting utilities - now using PDF viewer

  function DocumentProcessor() {
  // Custom hooks
@@ -59,20 +59,14 @@ function DocumentProcessor() {
  isDragging,
  containerRef,
  handleMouseDown
- } = usePanelResize(40);
+ } = usePanelResize(50);

  // Enhanced startInteractiveLesson that uses the chat hook
  const handleStartInteractiveLesson = () => {
  startInteractiveLesson(() => startChunkLesson(currentChunkIndex, documentData));
  };

- // Memoize the highlighted markdown to prevent unnecessary re-renders
- const highlightedMarkdown = useMemo(() => {
- if (!documentData || !documentData.markdown || !documentData.chunks) {
- return '';
- }
- return highlightChunkInMarkdown(documentData.markdown, documentData.chunks, currentChunkIndex);
- }, [documentData?.markdown, documentData?.chunks, currentChunkIndex]);
+ // Note: Removed highlightedMarkdown - now using PDF viewer instead of markdown

  // Early returns for different states
  if (!selectedFile) {
@@ -130,11 +124,10 @@ function DocumentProcessor() {
  {/* Left Panel - Document */}
  <div style={{ width: `${leftPanelWidth}%`, height: '100%' }}>
  <DocumentViewer
- highlightedMarkdown={highlightedMarkdown}
+ selectedFile={selectedFile}
  documentData={documentData}
- fetchImage={fetchImage}
- imageCache={imageCache}
- setImageCache={() => {}} // Handled by useDocumentProcessor
+ currentChunkIndex={currentChunkIndex}
+ chunks={documentData?.chunks}
  />
  </div>
 
frontend/src/components/DocumentViewer.jsx CHANGED
@@ -1,51 +1,207 @@
- import ReactMarkdown from 'react-markdown';
- import remarkMath from 'remark-math';
- import rehypeKatex from 'rehype-katex';
- import rehypeRaw from 'rehype-raw';
- import { getDocumentMarkdownComponents } from '../utils/markdownComponents.jsx';

- const DocumentViewer = ({ highlightedMarkdown, documentData, fetchImage, imageCache, setImageCache }) => {
- const markdownComponents = getDocumentMarkdownComponents(documentData, fetchImage, imageCache, setImageCache);

  return (
- <div className="bg-white rounded-lg shadow-sm flex flex-col" style={{ width: '100%', height: '100%' }}>
  <div className="sticky top-0 bg-white rounded-t-lg px-6 py-4 border-b border-gray-200 z-10">
- <h2 className="text-lg font-semibold text-left text-gray-800">Document</h2>
  </div>

- <div className="flex-1 px-6 pt-6 pb-8 overflow-y-auto">
- <style>
- {`
- @keyframes fadeInHighlight {
- 0% {
- background-color: rgba(255, 214, 100, 0);
- border-left-color: rgba(156, 163, 175, 0);
- transform: translateX(-10px);
- opacity: 0;
  }
- 100% {
- background-color: rgba(255, 214, 100, 0.15);
- border-left-color: rgba(156, 163, 175, 0.5);
- transform: translateX(0);
- opacity: 1;
  }
- }
- `}
- </style>
- <div className="prose prose-sm max-w-none" style={{
- fontSize: '0.875rem',
- lineHeight: '1.5',
- color: 'rgb(55, 65, 81)'
- }}>
- <ReactMarkdown
- remarkPlugins={[remarkMath]}
- rehypePlugins={[rehypeRaw, rehypeKatex]}
- components={markdownComponents}
  >
- {highlightedMarkdown}
- </ReactMarkdown>
  </div>
- </div>
  </div>
  );
  };

+ import { useState, useRef } from 'react';
+ import { Document, Page, pdfjs } from 'react-pdf';
+ import 'react-pdf/dist/Page/AnnotationLayer.css';
+ import 'react-pdf/dist/Page/TextLayer.css';

+ // Set up PDF.js worker
+ pdfjs.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js';
+
+ const DocumentViewer = ({ selectedFile, documentData, currentChunkIndex, chunks }) => {
+ const pdfContainerRef = useRef(null);
+ const [numPages, setNumPages] = useState(null);
+ const [currentPage, setCurrentPage] = useState(1);
+ const [zoomLevel, setZoomLevel] = useState(1);
+ const [visiblePages, setVisiblePages] = useState(new Set([1]));
+
+ // Handle scroll to update current page and track visible pages
+ const handleScroll = () => {
+ if (!pdfContainerRef.current || !numPages) return;
+
+ const container = pdfContainerRef.current;
+ const scrollTop = container.scrollTop;
+ const containerHeight = container.clientHeight;
+ const totalScrollHeight = container.scrollHeight - containerHeight;
+
+ // Calculate which page we're viewing based on scroll position
+ const scrollPercent = scrollTop / totalScrollHeight;
+ const newPage = Math.min(Math.floor(scrollPercent * numPages) + 1, numPages);
+
+ if (newPage !== currentPage) {
+ setCurrentPage(newPage);
+ }
+
+ // Track visible pages based on zoom level (more pages visible when zoomed out)
+ const newVisiblePages = new Set();
+ const visibleRange = Math.max(1, Math.ceil(2 / zoomLevel)); // More pages when zoomed out
+ for (let i = Math.max(1, newPage - visibleRange); i <= Math.min(numPages, newPage + visibleRange); i++) {
+ newVisiblePages.add(i);
+ }
+
+ // Update visible pages if changed
+ if (newVisiblePages.size !== visiblePages.size ||
+ ![...newVisiblePages].every(page => visiblePages.has(page))) {
+ setVisiblePages(newVisiblePages);
+ }
+ };
+
+ // Jump to specific page
+ const goToPage = (pageNumber) => {
+ if (!pdfContainerRef.current || !numPages) return;
+
+ // Update visible pages immediately for target page
+ const newVisiblePages = new Set();
+ const visibleRange = Math.max(1, Math.ceil(2 / zoomLevel)); // More pages when zoomed out
+ for (let i = Math.max(1, pageNumber - visibleRange); i <= Math.min(numPages, pageNumber + visibleRange); i++) {
+ newVisiblePages.add(i);
+ }
+ setVisiblePages(newVisiblePages);
+
+ const container = pdfContainerRef.current;
+ const totalScrollHeight = container.scrollHeight - container.clientHeight;
+
+ // Calculate scroll position for the target page
+ const targetScrollPercent = (pageNumber - 1) / numPages;
+ const targetScrollTop = targetScrollPercent * totalScrollHeight;
+
+ container.scrollTo({
+ top: targetScrollTop,
+ behavior: 'smooth'
+ });
+ };
+
+ // Zoom controls
+ const zoomIn = () => setZoomLevel(prev => Math.min(prev + 0.25, 3));
+ const zoomOut = () => setZoomLevel(prev => Math.max(prev - 0.25, 0.5));
+ const resetZoom = () => setZoomLevel(1);

  return (
+ <div className="bg-white rounded-lg shadow-sm flex flex-col relative" style={{ width: '100%', height: '100%' }}>
+ {/* Header */}
  <div className="sticky top-0 bg-white rounded-t-lg px-6 py-4 border-b border-gray-200 z-10">
+ <h2 className="text-lg font-semibold text-left text-gray-800">Original Document</h2>
  </div>

+ {/* PDF Content */}
+ <div
+ ref={pdfContainerRef}
+ className="flex-1 overflow-auto flex justify-center bg-gray-100"
+ onScroll={handleScroll}
+ >
+ {selectedFile ? (
+ <div className="py-4">
+ <Document
+ file={selectedFile}
+ onLoadSuccess={({ numPages }) => setNumPages(numPages)}
+ loading={
+ <div className="flex items-center justify-center h-64">
+ <div className="text-center">
+ <div className="w-8 h-8 border-4 border-blue-500 border-t-transparent rounded-full animate-spin mx-auto mb-2"></div>
+ <p className="text-gray-600">Loading PDF...</p>
+ </div>
+ </div>
  }
+ error={
+ <div className="flex items-center justify-center h-64">
+ <div className="text-center">
+ <p className="text-red-600 mb-2">Failed to load PDF</p>
+ <p className="text-gray-600 text-sm">Please check the file format</p>
+ </div>
+ </div>
  }
+ >
+ {/* Render all pages continuously */}
+ {numPages && Array.from(new Array(numPages), (_, index) => {
+ const pageNum = index + 1;
+ const isVisible = visiblePages.has(pageNum);
+ const currentZoom = isVisible ? zoomLevel : 1; // Only zoom visible pages
+
+ return (
+ <div key={pageNum} className="mb-4 flex justify-center">
+ <Page
+ pageNumber={pageNum}
+ width={typeof window !== 'undefined' ? window.innerWidth * 0.4 * 0.9 * currentZoom : 600 * currentZoom}
+ />
+ </div>
+ );
+ })}
+ </Document>
+ </div>
+ ) : (
+ <div className="flex items-center justify-center h-full">
+ <p className="text-gray-500">No PDF loaded</p>
+ </div>
+ )}
+ </div>
+
+ {/* Pagination overlay - floating pill */}
+ {numPages && (
+ <div className="absolute bottom-4 left-1/2 transform -translate-x-1/2 z-10">
+ <div className="flex items-center bg-gray-800/90 backdrop-blur-sm rounded-full shadow-lg px-3 py-2 space-x-3">
+ <button
+ onClick={() => goToPage(Math.max(currentPage - 1, 1))}
+ disabled={currentPage <= 1}
+ className="w-8 h-8 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M10 12l-4-4 4-4v8z"/>
+ </svg>
+ </button>
+
+ <span className="px-3 py-1 text-sm font-medium text-white min-w-[60px] text-center">
+ {currentPage}/{numPages}
+ </span>
+
+ <button
+ onClick={() => goToPage(Math.min(currentPage + 1, numPages))}
+ disabled={currentPage >= numPages}
+ className="w-8 h-8 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M6 4l4 4-4 4V4z"/>
+ </svg>
+ </button>
+ </div>
+ </div>
+ )}
+
+ {/* Zoom controls overlay - bottom right */}
+ {numPages && (
+ <div className="absolute bottom-4 right-4 z-10 flex flex-col items-center space-y-2">
+ {/* Main zoom pill - vertical */}
+ <div className="flex flex-col items-center bg-gray-800/90 backdrop-blur-sm rounded-full shadow-lg px-2 py-2 space-y-1">
+ <button
+ onClick={zoomIn}
+ disabled={zoomLevel >= 3}
+ className="w-6 h-6 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M8 4v4H4v1h4v4h1V9h4V8H9V4z"/>
+ </svg>
+ </button>
+
+ <button
+ onClick={zoomOut}
+ disabled={zoomLevel <= 0.5}
+ className="w-6 h-6 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M4 8h8v1H4z"/>
+ </svg>
+ </button>
+ </div>
+
+ {/* Reset button below */}
+ <button
+ onClick={resetZoom}
+ className="w-10 h-10 bg-gray-700 hover:bg-gray-500 backdrop-blur-sm rounded-full shadow-lg flex items-center justify-center text-white transition-colors"
  >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor" stroke="currentColor" strokeWidth="0.5">
+ <path d="M8 3a5 5 0 1 0 4.546 2.914.5.5 0 0 1 .908-.417A6 6 0 1 1 8 2v1z" strokeWidth="1"/>
+ <path d="M8 4.466V.534a.25.25 0 0 1 .41-.192l2.36 1.966c.12.1.12.284 0 .384L8.41 4.658A.25.25 0 0 1 8 4.466z"/>
+ </svg>
+ </button>
  </div>
+ )}
  </div>
  );
  };
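
Both `handleScroll` and `goToPage` above map between scroll offset and page number with a simple linear model, which implicitly assumes roughly uniform page heights; with mixed page sizes, per-page `offsetTop` positions would be more accurate. The arithmetic, extracted into Python for clarity (the numeric values below are hypothetical; `int()` truncation matches `Math.floor` here because the inputs are non-negative):

```python
def page_from_scroll(scroll_top, scroll_height, client_height, num_pages):
    # Fraction of the scrollable range -> 1-based page number (handleScroll).
    total = scroll_height - client_height
    percent = scroll_top / total if total > 0 else 0.0
    return min(int(percent * num_pages) + 1, num_pages)


def scroll_for_page(page_number, scroll_height, client_height, num_pages):
    # Inverse mapping (goToPage): top of page N sits at (N-1)/num_pages of the range.
    total = scroll_height - client_height
    return (page_number - 1) / num_pages * total
```

For example, with a 3000px document in a 600px viewport and 4 pages, scrolling to the bottom (offset 2400) lands on page 4, and `scroll_for_page(1, ...)` is 0.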
test_fuzzy_find.py ADDED
@@ -0,0 +1,194 @@
+ #%%
+ import matplotlib.pyplot as plt
+ from difflib import SequenceMatcher
+ import numpy as np
+
+ def fuzzy_find(text, pattern, start_pos=0):
+ """Find the best fuzzy match for pattern in text starting from start_pos"""
+ best_ratio = 0
+ best_pos = -1
+
+ # Search in sliding windows
+ pattern_len = len(pattern)
+ for i in range(start_pos, len(text) - pattern_len + 1):
+ window = text[i:i + pattern_len]
+ ratio = SequenceMatcher(None, pattern.lower(), window.lower()).ratio()
+
+ if ratio > best_ratio and ratio > 0.8: # Much stricter: 80% similarity
+ best_ratio = ratio
+ best_pos = i
+
+ return best_pos if best_pos != -1 else None
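
`fuzzy_find` is an O(len(text) · len(pattern)) sliding-window scan: every window of pattern length is scored with `SequenceMatcher.ratio()`, and the best score above 0.8 wins. A condensed, self-contained version for quick experiments (the helper name is an assumption, not part of the repo):

```python
from difflib import SequenceMatcher


def best_window(text, pattern, threshold=0.8):
    """Return (position, ratio) of the best-matching window, or (None, ratio)."""
    n = len(pattern)
    best_pos, best_ratio = None, 0.0
    for i in range(len(text) - n + 1):
        r = SequenceMatcher(None, pattern.lower(), text[i:i + n].lower()).ratio()
        if r > best_ratio:
            best_pos, best_ratio = i, r
    return (best_pos, best_ratio) if best_ratio > threshold else (None, best_ratio)


best_window("Methane is the second most important gas", "second most")
# → (15, 1.0): exact substring found at offset 15
```

Unlike `fuzzy_find`, this variant also reports the best ratio even when it falls below the threshold, which is exactly the information the plots below are meant to surface.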
+
+ def analyze_fuzzy_ratios(markdown_text, chunk_text):
+ """
+ Analyze fuzzy matching ratios across the entire markdown text using a rolling window.
+ Returns positions and their corresponding similarity ratios.
+ """
+ chunk_len = len(chunk_text)
+ positions = []
+ ratios = []
+
+ # Rolling window over the entire markdown text
+ for i in range(len(markdown_text) - chunk_len + 1):
+ window = markdown_text[i:i + chunk_len]
+ ratio = SequenceMatcher(None, chunk_text.lower(), window.lower()).ratio()
+ positions.append(i)
+ ratios.append(ratio)
+
+ return positions, ratios
+
+ def plot_ratio_distribution(positions, ratios, chunk_text, markdown_file_path=None):
+ """
+ Create a plot showing the similarity ratio distribution across positions.
+ """
+ fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
+
+ # Main plot: ratio vs position
+ ax1.plot(positions, ratios, 'b-', alpha=0.7, linewidth=1)
+ ax1.axhline(y=0.8, color='r', linestyle='--', label='Fuzzy find threshold (0.8)')
+ ax1.set_xlabel('Position in Markdown Text')
+ ax1.set_ylabel('Similarity Ratio')
+ ax1.set_title(f'Fuzzy Match Similarity Ratios Across Text\n(Chunk length: {len(chunk_text)} chars)')
+ ax1.grid(True, alpha=0.3)
+ ax1.legend()
+
+ # Highlight maximum ratio
+ max_ratio = max(ratios)
+ max_pos = positions[ratios.index(max_ratio)]
+ ax1.plot(max_pos, max_ratio, 'ro', markersize=8, label=f'Max ratio: {max_ratio:.3f} at pos {max_pos}')
+ ax1.legend()
+
+ # Histogram of ratios
+ ax2.hist(ratios, bins=50, alpha=0.7, edgecolor='black')
+ ax2.axvline(x=0.8, color='r', linestyle='--', label='Fuzzy find threshold (0.8)')
+ ax2.axvline(x=max_ratio, color='g', linestyle='--', label=f'Maximum ratio: {max_ratio:.3f}')
+ ax2.set_xlabel('Similarity Ratio')
+ ax2.set_ylabel('Frequency')
+ ax2.set_title('Distribution of Similarity Ratios')
+ ax2.legend()
+ ax2.grid(True, alpha=0.3)
+
+ plt.tight_layout()
+ return fig, max_ratio, max_pos
+
+ def compare_texts(original_chunk, found_text, max_pos):
+ """
+ Compare the original chunk text with the text found by fuzzy_find.
+ Shows character-by-character differences and similarity analysis.
+ """
+ print("\n" + "="*80)
+ print("TEXT COMPARISON: Original Chunk vs Fuzzy Find Result")
+ print("="*80)
+
+ print(f"\nOriginal chunk length: {len(original_chunk)} characters")
+ print(f"Found text length: {len(found_text)} characters")
+ print(f"Found at position: {max_pos}")
+
+ # Calculate overall similarity
+ similarity = SequenceMatcher(None, original_chunk.lower(), found_text.lower()).ratio()
+ print(f"Overall similarity: {similarity:.4f} ({similarity*100:.2f}%)")
+
+ # Show first 200 characters of each
+ print(f"\nOriginal chunk (first 200 chars):")
+ print(f"'{original_chunk[:200]}{'...' if len(original_chunk) > 200 else ''}'")
+
+ print(f"\nFound text (first 200 chars):")
+ print(f"'{found_text[:200]}{'...' if len(found_text) > 200 else ''}'")
+
+ # Character-by-character analysis for first 100 characters
+ print(f"\nCharacter-by-character comparison (first 100 chars):")
+ print("Original: ", end="")
+ for i, char in enumerate(original_chunk[:100]):
+ if i < len(found_text) and char.lower() == found_text[i].lower():
+ print(char, end="") # Same character
+ else:
+ print(f"[{char}]", end="") # Different character
+ print()
+
+ print("Found: ", end="")
+ for i, char in enumerate(found_text[:100]):
+ if i < len(original_chunk) and char.lower() == original_chunk[i].lower():
+ print(char, end="") # Same character
+ else:
+ print(f"[{char}]", end="") # Different character
+ print()
+
+ # Analyze differences
+ matcher = SequenceMatcher(None, original_chunk, found_text)
+ differences = []
+ for tag, i1, i2, j1, j2 in matcher.get_opcodes():
+ if tag != 'equal':
+ differences.append({
+ 'type': tag,
+ 'original_pos': (i1, i2),
+ 'found_pos': (j1, j2),
+ 'original_text': original_chunk[i1:i2],
+ 'found_text': found_text[j1:j2]
+ })
+
+ print(f"\nFound {len(differences)} differences:")
+ for i, diff in enumerate(differences[:10]): # Show first 10 differences
+ print(f"{i+1}. {diff['type'].upper()} at original[{diff['original_pos'][0]}:{diff['original_pos'][1]}] -> found[{diff['found_pos'][0]}:{diff['found_pos'][1]}]")
+ if diff['original_text']:
+ print(f" Original: '{diff['original_text'][:50]}{'...' if len(diff['original_text']) > 50 else ''}'")
+ if diff['found_text']:
+ print(f" Found: '{diff['found_text'][:50]}{'...' if len(diff['found_text']) > 50 else ''}'")
+
+ if len(differences) > 10:
+ print(f" ... and {len(differences) - 10} more differences")
+
+ return similarity, differences
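
`compare_texts` classifies differences with `SequenceMatcher.get_opcodes()`, which yields `(tag, i1, i2, j1, j2)` tuples covering both strings end to end; the loop above keeps only the non-`'equal'` entries. A minimal illustration of the output it iterates over:

```python
from difflib import SequenceMatcher

# One substitution in the middle: "_" vs "-" between two equal runs.
ops = SequenceMatcher(None, "theta_t", "theta-t").get_opcodes()
# → [('equal', 0, 5, 0, 5), ('replace', 5, 6, 5, 6), ('equal', 6, 7, 6, 7)]
```

Tags can be `'equal'`, `'replace'`, `'delete'`, or `'insert'`, so the `differences` list captures every way the found window deviates from the chunk.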
+
+ def run_fuzzy_analysis():
+ """
+ Main function to run the fuzzy find analysis.
+ You can modify the markdown_text and chunk_text variables below.
+ """
+
+ # TODO: Replace these with your actual markdown content and chunk
+ markdown_text = """# An improved method for mobile characterisation of $\\delta^{13} \\mathrm{CH}_{4}$ source signatures and its application in Germany \n\nAntje Hoheisel ${ }^{1}$, Christiane Yeman ${ }^{1, a}$, Florian Dinger ${ }^{1,2}$, Henrik Eckhardt ${ }^{1}$, and Martina Schmidt ${ }^{1}$<br>${ }^{1}$ Institute of Environmental Physics, Heidelberg University, Heidelberg, Germany<br>${ }^{2}$ Max Planck Institute for Chemistry, Mainz, Germany<br>${ }^{a}$ now at: Laboratory of Ion Beam Physics, ETH Zurich, Zurich, Switzerland\n\nCorrespondence: Antje Hoheisel (antje.hoheisel@iup.uni-heidelberg.de)\nReceived: 7 August 2018 - Discussion started: 1 October 2018\nRevised: 17 January 2019 - Accepted: 28 January 2019 - Published: 22 February 2019\n\n\n#### Abstract\n\nThe carbon isotopic signature $\\left(\\delta^{13} \\mathrm{CH}_{4}\\right)$ of several methane sources in Germany (around Heidelberg and in North Rhine-Westphalia) were characterised. Mobile measurements of the plume of $\\mathrm{CH}_{4}$ sources are carried out using an analyser based on cavity ring-down spectroscopy (CRDS). To achieve precise results a CRDS analyser, which measures methane $\\left(\\mathrm{CH}_{4}\\right)$, carbon dioxide $\\left(\\mathrm{CO}_{2}\\right)$ and their ${ }^{13} \\mathrm{C}$-to- ${ }^{12} \\mathrm{C}$ ratios, was characterised especially with regard to cross sensitivities of composition differences of the gas matrix in air samples or calibration tanks. The two most important gases which affect $\\delta^{13} \\mathrm{CH}_{4}$ are water vapour $\\left(\\mathrm{H}_{2} \\mathrm{O}\\right)$ and ethane $\\left(\\mathrm{C}_{2} \\mathrm{H}_{6}\\right)$. To avoid the cross sensitivity with $\\mathrm{H}_{2} \\mathrm{O}$, the air is dried with a Nafion dryer during mobile measurements. $\\mathrm{C}_{2} \\mathrm{H}_{6}$ is typically abundant in natural gases and thus in methane plumes or samples originating from natural gas. 
$\\mathrm{A}_{2} \\mathrm{H}_{6}$ correction and calibration are essential to obtain accurate $\\delta^{13} \\mathrm{CH}_{4}$ results, which can deviate by up to $3 \\%$ depending on whether a $\\mathrm{C}_{2} \\mathrm{H}_{6}$ correction is applied.\n\nThe isotopic signature is determined with the Miller-Tans approach and the York fitting method. During 21 field campaigns the mean $\\delta^{13} \\mathrm{CH}_{4}$ signatures of three dairy farms $\\left(-63.9 \\pm 0.9 \\%_{e}\\right)$, a biogas plant $\\left(-62.4 \\pm 1.2 \\%_{e}\\right)$, a landfill $\\left(-58.7 \\pm 3.3 \\%_{e}\\right)$, a wastewater treatment plant $(-52.5 \\pm$ $1.4 \\%$ ), an active deep coal mine ( $-56.0 \\pm 2.3 \\%$ ) and two natural gas storage and gas compressor stations ( $-46.1 \\pm$ $0.8 \\%$ ) were recorded.\n\nIn addition, between December 2016 and November 2018 gas samples from the Heidelberg natural gas distribution network were measured with a mean $\\delta^{13} \\mathrm{CH}_{4}$ value of $-43.3 \\pm$ $0.8 \\%$. Contrary to previous measurements between 1991\n\n\n#### Abstract\n\nand 1996 by Levin et al. (1999), no strong seasonal cycle is shown.\n\n\n## 1 Introduction\n\nMethane $\\left(\\mathrm{CH}_{4}\\right)$ is the second most important anthropogenic greenhouse gas. The atmospheric growth rate of $\\mathrm{CH}_{4}$ has changed significantly during the last decades, stabilising at zero growth from 1999 to 2006 before beginning to increase again after 2007 (Dlugokencky et al., 2009). Several studies have focused on the recent $\\mathrm{CH}_{4}$ growth caused by changes in sources and sinks (Rigby et al., 2017; Turner et al., 2017).\n\nRecent studies by Schaefer et al. (2016), Rice et al. (2016) and Nisbet et al. (2016) have shown how the $\\delta^{13} \\mathrm{CH}_{4}$ measurements can help to understand the changes in global $\\mathrm{CH}_{4}$ increase rates and to assign the related source types. 
The stable carbon isotope ratio $\\left({ }^{13} \\mathrm{C} /{ }^{12} \\mathrm{C}\\right)$ of $\\mathrm{CH}_{4}$ sources varies due to the initial source material and the fractionation during production and release to the atmosphere. The source categories can be classified as pyrogenic (e.g. biomass burning), biogenic (e.g. wetlands and livestock) or thermogenic (e.g. a subcategory of fossil fuel extraction), which show different but also overlapping isotope ratio ranges. Various studies have shown that the assignment of isotopic signatures from different $\\mathrm{CH}_{4}$ sources remains uncertain due to large temporal variabilities and also regional specificities (e.g. Sherwood et al., 2017). This missing knowledge may result in large uncertainties when the $\\mathrm{CH}_{4}$ budget is determined on global or regional scales using isotope-based estimates. In addition to global studies, the use of $\\delta^{13} \\mathrm{CH}_{4}$ was already successfully"""
+
+ chunk_text = """## 1 Introduction\nMethane ($\mathrm{CH}_{4}$) is the second most important anthropogenic greenhouse gas. The atmospheric growth rate of $\mathrm{CH}_{4}$ has changed significantly during the last decades, stabilising at zero growth from 1999 to 2006 before beginning to increase again after 2007 (Dlugokencky et al., 2009). Several studies have focused on the recent $\mathrm{CH}_{4}$ growth caused by changes in sources and sinks (Rigby et al., 2017; Turner et al., 2017).\n\nRecent studies by Schaefer et al. (2016), Rice et al. (2016) and Nisbet et al. (2016) have shown how the $\delta^{13} \mathrm{CH}_{4}$ measurements can help to understand the changes in global $\mathrm{CH}_{4}$ increase rates and to assign the related source types. The stable carbon isotope ratio (${}^{13}\mathrm{C}$/${}^{12}\mathrm{C}$) of $\mathrm{CH}_{4}$ sources varies due to the initial source material and the fractionation during production and release to the atmosphere. The source categories can be classified as pyrogenic (e.g. biomass burning), biogenic (e.g. wetlands and livestock) or thermogenic (e.g. a subcategory of fossil fuel extraction), which show different but also overlapping isotope ratio ranges. Various studies have shown that the assignment of isotopic signatures from different $\mathrm{CH}_{4}$ sources remains uncertain due to large temporal variabilities and also regional specificities (e.g. Sherwood et al., 2017). This missing knowledge may result in large uncertainties when the $\mathrm{CH}_{4}$ budget is determined on global or regional scales using isotope-based estimates. In addition to global studies, the use of $\delta^{13}\mathrm{CH}_{4}$ was already successfully"""
+
+ print("Analyzing fuzzy matching ratios...")
155
+ print(f"Markdown text length: {len(markdown_text)} characters")
156
+ print(f"Chunk text length: {len(chunk_text)} characters")
157
+
158
+ # Run the analysis
159
+ positions, ratios = analyze_fuzzy_ratios(markdown_text, chunk_text)
160
+
161
+ # Create the plot
162
+ fig, max_ratio, max_pos = plot_ratio_distribution(positions, ratios, chunk_text)
163
+
164
+ # Print statistics
165
+ print(f"\nStatistics:")
166
+ print(f"Maximum similarity ratio: {max_ratio:.3f}")
167
+ print(f"Maximum ratio position: {max_pos}")
168
+ print(f"Number of positions above 0.8 threshold: {sum(1 for r in ratios if r > 0.8)}")
169
+ print(f"Mean ratio: {np.mean(ratios):.3f}")
170
+ print(f"Standard deviation: {np.std(ratios):.3f}")
171
+
172
+ # Test the original fuzzy_find function
173
+ result = fuzzy_find(markdown_text, chunk_text)
174
+ print(f"\nOriginal fuzzy_find result: {result}")
175
+ if result is not None:
176
+ print(f"Found match at position {result}")
177
+ else:
178
+ print("No match found above 0.8 threshold")
179
+
180
+ # Compare the found text with the original chunk
181
+ if max_ratio > 0: # If we found any match
182
+ found_text = markdown_text[max_pos:max_pos + len(chunk_text)]
183
+ text_similarity, differences = compare_texts(chunk_text, found_text, max_pos)
184
+ print(f"\nDetailed comparison similarity: {text_similarity:.4f}")
185
+ print(f"Number of character differences: {len(differences)}")
186
+
187
+ plt.show()
188
+ return positions, ratios, max_ratio, max_pos
189
+
190
+ if __name__ == "__main__":
191
+ # Run the analysis
192
+ positions, ratios, max_ratio, max_pos = run_fuzzy_analysis()
193
+
194
+ #%%