Alleinzellgaenger Claude committed
Commit c070e10 · 1 Parent(s): a706099

WIP: Development progress - mixed features and improvements


Multiple features in ongoing development:
- Enhanced PDF viewer with zoom controls and pagination
- Integrated chat functionality with chunk navigation
- Gemini API integration for improved document chunking
- Various component refactoring and optimization attempts

This commit captures a mixed development state before branch cleanup.
It will be reorganized into focused commits on the dev branch.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

.claude/sessions/2025-08-03-1200.md CHANGED
@@ -34,4 +34,37 @@ Refine academic paper chunking system to address:
34
  1. **Document Modification**: Original document gets cleaned (academic content removal)
35
  2. **Figure Handling**: Simple paragraph-ending regex can't handle figures interrupting text flow
36
  3. **Position Mapping**: Positions calculated on cleaned text, not original
37
- 4. **Highlighting Injection**: Blockquote injection modifies document structure
37
+ 4. **Highlighting Injection**: Blockquote injection modifies document structure
38
+
39
+ ### Update - 2025-08-03 12:25
40
+
41
+ **Summary**: Successfully switched from broken markdown rendering to PDF viewer approach
42
+
43
+ **Git Changes**:
44
+ - Modified: frontend/src/components/DocumentProcessor.jsx, frontend/src/components/DocumentViewer.jsx
45
+ - Current branch: main (commit: a706099)
46
+
47
+ **Todo Progress**: 3 completed, 1 in progress, 0 pending
48
+ - ✓ Completed: Examine original PDF viewer components in backup files
49
+ - ✓ Completed: Commit current markdown approach as 'failed feature implementation'
50
+ - ✓ Completed: Switch DocumentViewer to use PDF instead of markdown
51
+ - 🔄 In Progress: Test the new PDF viewer implementation
52
+
53
+ **Issues Resolved**:
54
+ - Eliminated markdown rendering that was breaking document layout
55
+ - Removed document modification/cleaning that violated integrity principle
56
+ - Integrated proven PDF viewer from UploadPage.jsx component
57
+
58
+ **Solutions Implemented**:
59
+ - Replaced DocumentViewer.jsx with react-pdf implementation from UploadPage.jsx
60
+ - Added zoom controls, pagination, and smooth scrolling
61
+ - Updated DocumentProcessor.jsx to pass selectedFile instead of highlightedMarkdown
62
+ - Removed unused markdown utilities and highlighting logic
63
+ - Preserved original document completely - no text modification
64
+
65
+ **Code Changes**:
66
+ - DocumentViewer.jsx: Complete rewrite using react-pdf with zoom/navigation controls
67
+ - DocumentProcessor.jsx: Removed highlightedMarkdown logic, updated props
68
+ - Maintained all chunking functionality on right panel while fixing left panel display
69
+
70
+ **Next Steps**: Test PDF viewer functionality and implement visual chunk highlighting overlays if needed
backend/app.py CHANGED
@@ -14,7 +14,8 @@ from pydantic import BaseModel, Field
14
  from typing import Optional, List
15
  from langchain.chat_models import init_chat_model
16
  import anthropic
17
-
 
18
  # Load environment variables
19
  load_dotenv()
20
 
@@ -178,15 +179,24 @@ async def process_ocr_content(file_id: str):
178
  combined_markdown = '\n\n---\n\n'.join(all_page_markdown)
179
  print(f"📋 Combined document: {len(combined_markdown)} chars total")
180
 
181
- # Auto-chunk the entire document once
182
  document_chunks = []
183
  original_markdown = combined_markdown
184
  try:
185
- print(f"🧠 Auto-chunking entire document...")
186
- document_chunks, original_markdown = await auto_chunk_document(combined_markdown, client)
187
  print(f"📊 Document chunks found: {len(document_chunks)}")
188
  for i, chunk in enumerate(document_chunks):
189
- print(f" {i+1}. {chunk.get('topic', 'Unknown')}: {chunk.get('start_phrase', '')[:50]}...")
190
  except Exception as chunk_error:
191
  print(f"⚠️ Document chunking failed: {chunk_error}")
192
  document_chunks = []
@@ -256,9 +266,8 @@ async def get_image_base64(file_id: str, image_id: str):
256
 
257
  class ChunkSchema(BaseModel):
258
  """Schema for document chunks suitable for creating interactive lessons."""
259
- topic: str = Field(description="Brief topic name for the chunk")
260
- start_phrase: str = Field(description="First few words of the chunk (5-15 words)")
261
- end_phrase: str = Field(description="Last few words of the chunk (5-15 words)")
262
 
263
  class ChunkList(BaseModel):
264
  """Container for a list of document chunks."""
@@ -465,16 +474,373 @@ def programmatic_chunk_document(document_markdown):
465
  # The frontend will use the original document for highlighting
466
  return chunks, document_markdown
467
 
468
  async def auto_chunk_document(document_markdown, client=None):
469
- """Auto-chunk a document - now using programmatic approach instead of LLM"""
470
- chunks, original_markdown = programmatic_chunk_document(document_markdown)
471
- return chunks, original_markdown
 
472
 
473
  # Get Fireworks API key
474
  fireworks_api_key = os.environ.get("FIREWORKS_API_KEY")
475
  if not fireworks_api_key:
476
- print("⚠️ No Fireworks API key found, falling back to regular chunking")
477
- return []
478
 
479
  try:
480
  # Initialize Fireworks LLM with structured output
@@ -487,98 +853,148 @@ async def auto_chunk_document(document_markdown, client=None):
487
  # Create structured LLM that returns ChunkList object
488
  structured_llm = llm.with_structured_output(ChunkList)
489
 
490
- # Create chunking prompt
491
- prompt = f"""Imagine you are a teacher. You are given a document, and you have to decide how to dissect this document. Your task is to identify chunks of content by providing start and end phrases that can be used to create interactive lessons. Here's the document:
492
- DOCUMENT:
493
- {document_markdown}
494
 
495
- Rules:
496
- 1. Each chunk should contain 2-3 valuable lessons
497
- 2. start_phrase and end_phrase should be 5-15 words long
498
- 3. Focus on educational content (concepts, examples, key points)
499
- 4. More dense content should have more chunks, less dense content fewer chunks
500
- 5. Identify chunks that would make good interactive lessons
501
 
502
- Return a list of chunks with topic, start_phrase, and end_phrase for each. Importantly, you are passed Markdown text, so output the start and end phrases as Markdown text, and include punctuation. Never stop an end phrase in the middle of a sentence, always include the full sentence or phrase."""
503
 
504
  # Call Fireworks with structured output
505
- chunk_response = structured_llm.invoke(prompt)
506
- chunks = chunk_response.chunks
507
 
508
- # Find positions using fuzzy matching with detailed debugging
509
- positioned_chunks = []
510
- for i, chunk in enumerate(chunks):
511
- print(f"\n🔍 Processing chunk {i+1}: {chunk.topic}")
512
- print(f" Start phrase: '{chunk.start_phrase}'")
513
- print(f" End phrase: '{chunk.end_phrase}'")
514
 
515
- start_pos = fuzzy_find(document_markdown, chunk.start_phrase)
516
- end_phrase_start = fuzzy_find(document_markdown, chunk.end_phrase, start_pos or 0)
 
 
518
- print(f" Found start_pos: {start_pos}")
519
- print(f" Found end_phrase_start: {end_phrase_start}")
520
 
521
- # Add the length of the end_phrase plus a bit more to include punctuation
522
- if end_phrase_start is not None:
523
- end_pos = end_phrase_start + len(chunk.end_phrase)
524
- # Try to include punctuation that might follow
525
-
526
- # Look ahead for good stopping points, but be more careful about spaces
527
- max_extend = 15 # Don't go crazy far
528
- extended = 0
529
 
530
- while end_pos < len(document_markdown) and extended < max_extend:
531
- char = document_markdown[end_pos]
532
-
533
- # Good stopping points - include punctuation and stop
534
- if char in '.!?':
535
- end_pos += 1 # Include the punctuation
536
- break
537
- elif char in ';:,':
538
- end_pos += 1 # Include and stop
539
- break
540
- # Stop at paragraph breaks
541
- elif end_pos < len(document_markdown) - 1 and document_markdown[end_pos:end_pos+2] == '\n\n':
542
- break
543
- # Stop at LaTeX boundaries
544
- elif char == '$':
545
- break
546
- # Continue through normal chars and whitespace
547
- else:
548
- end_pos += 1
549
- extended += 1
550
- print(f" Final end_pos: {end_pos}")
551
- else:
552
- print(f" End phrase not found! Finding paragraph end...")
553
- end_pos = find_paragraph_end(document_markdown, start_pos)
554
-
555
- if start_pos is not None and end_pos is not None:
556
- # Show actual extracted text for debugging
557
- extracted_text = document_markdown[start_pos:end_pos]
558
- print(f" Extracted text: '{extracted_text[:100]}...'")
559
 
560
- if start_pos is not None:
561
- positioned_chunks.append({
562
- "topic": chunk.topic,
563
- "start_phrase": chunk.start_phrase,
564
- "end_phrase": chunk.end_phrase,
565
- "start_position": start_pos,
566
- "end_position": end_pos,
567
- "found_start": True,
568
- "found_end": end_pos is not None
569
- })
570
-
571
- # Sort chunks by position in document for chronological order
572
- positioned_chunks.sort(key=lambda chunk: chunk.get('start_position', 0))
573
- print(f"📊 Final sorted chunks: {len(positioned_chunks)}")
574
-
575
- return positioned_chunks
576
 
577
  except Exception as e:
578
  import traceback
579
  print(f"❌ Auto-chunking error: {e}")
580
  print(f"❌ Full traceback: {traceback.format_exc()}")
581
- return []
582
 
583
  @app.post("/chunk_page")
584
  async def chunk_page(request: dict):
@@ -617,6 +1033,7 @@ Rules:
617
  3. Focus on educational content (concepts, examples, key points)
618
  4. More dense content should have more chunks, less dense content fewer chunks
619
  5. Identify chunks that would make good interactive lessons
 
620
 
621
  Return a list of chunks with topic, start_phrase, and end_phrase for each."""
622
 
@@ -626,39 +1043,21 @@ Return a list of chunks with topic, start_phrase, and end_phrase for each."""
626
  chunks = chunk_response.chunks
627
  print(f"📝 Received {len(chunks)} chunks from Fireworks")
628
 
629
- # Find positions using fuzzy matching
630
- positioned_chunks = []
631
- for chunk in chunks:
632
- start_pos = fuzzy_find(document_markdown, chunk.start_phrase)
633
- end_phrase_start = fuzzy_find(document_markdown, chunk.end_phrase, start_pos or 0)
634
- # Add the length of the end_phrase plus a bit more to include punctuation
635
- if end_phrase_start is not None:
636
- end_pos = end_phrase_start + len(chunk.end_phrase)
637
- # Try to include punctuation that might follow
638
- if end_pos < len(document_markdown) and document_markdown[end_pos] in '.!?;:,':
639
- end_pos += 1
640
- else:
641
- end_pos = None
642
-
643
- if start_pos is not None:
644
- positioned_chunks.append({
645
- "topic": chunk.topic,
646
- "start_phrase": chunk.start_phrase,
647
- "end_phrase": chunk.end_phrase,
648
- "start_position": start_pos,
649
- "end_position": end_pos,
650
- "found_start": True,
651
- "found_end": end_pos is not None
652
- })
653
- print(f"✅ Found chunk: {chunk.topic} at position {start_pos}")
654
- else:
655
- print(f"❌ Could not find chunk: {chunk.topic}")
656
-
657
- print(f"📊 Successfully positioned {len(positioned_chunks)}/{len(chunks)} chunks")
658
 
659
  return {
660
- "chunks": positioned_chunks,
661
- "total_found": len(positioned_chunks),
662
  "total_suggested": len(chunks)
663
  }
664
 
 
14
  from typing import Optional, List
15
  from langchain.chat_models import init_chat_model
16
  import anthropic
17
+ import google
18
+ from google import genai
19
  # Load environment variables
20
  load_dotenv()
21
 
 
179
  combined_markdown = '\n\n---\n\n'.join(all_page_markdown)
180
  print(f"📋 Combined document: {len(combined_markdown)} chars total")
181
 
182
+ # Auto-chunk the entire document once - try Gemini first, then fallback
183
  document_chunks = []
184
  original_markdown = combined_markdown
185
  try:
186
+ print(f"🧠 Auto-chunking entire document with Gemini...")
187
+ document_chunks, original_markdown = await gemini_chunk_document(combined_markdown)
188
+
189
+ # If Gemini failed, try the old Fireworks method
190
+ if not document_chunks:
191
+ print(f"🔄 Gemini failed, falling back to Fireworks...")
192
+ document_chunks, original_markdown = await auto_chunk_document(combined_markdown, client)
193
+
194
  print(f"📊 Document chunks found: {len(document_chunks)}")
195
  for i, chunk in enumerate(document_chunks):
196
+ topic = chunk.get('topic', 'Unknown')
197
+ content = chunk.get('text', chunk.get('start_phrase', ''))
+ preview = f"{content[:50]}..." if content else 'No content'
198
+ print(f" {i+1}. {topic}: {preview}")
199
+
200
  except Exception as chunk_error:
201
  print(f"⚠️ Document chunking failed: {chunk_error}")
202
  document_chunks = []
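The hunk above wires up a fallback chain: try Gemini first, and only call the Fireworks-based `auto_chunk_document` when Gemini returns no chunks. The control flow can be sketched generically; `primary` and `fallback` below are hypothetical stand-ins for the two chunkers, not functions from the diff:

```python
def chunk_with_fallbacks(markdown, primary, fallback):
    """Run the primary chunker; fall back when it yields no chunks."""
    chunks, original = primary(markdown)
    if not chunks:  # None or an empty list triggers the fallback
        chunks, original = fallback(markdown)
    return chunks, original

# Hypothetical chunkers standing in for gemini_chunk_document / auto_chunk_document
failing = lambda md: (None, md)
working = lambda md: ([{"topic": "Intro", "text": md}], md)

chunks, original = chunk_with_fallbacks("some document", failing, working)
```

Because both chunkers return a `(chunks, markdown)` pair, the caller never has to know which backend produced the result.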
 
266
 
267
  class ChunkSchema(BaseModel):
268
  """Schema for document chunks suitable for creating interactive lessons."""
269
+ topic: str = Field(description="Brief descriptive name (2-6 words) for the educational content")
270
+ text: str = Field(description="Complete chunk text with exact markdown/LaTeX formatting preserved, containing 2-3 related educational concepts")
 
271
 
272
  class ChunkList(BaseModel):
273
  """Container for a list of document chunks."""
 
474
  # The frontend will use the original document for highlighting
475
  return chunks, document_markdown
476
 
477
+ def split_document_into_batches(document_markdown, max_chars=8000):
478
+ """Split document into manageable batches for LLM processing"""
479
+ if len(document_markdown) <= max_chars:
480
+ return [document_markdown]
481
+
482
+ batches = []
483
+ current_pos = 0
484
+
485
+ while current_pos < len(document_markdown):
486
+ # Try to find a good breaking point (paragraph boundary)
487
+ end_pos = min(current_pos + max_chars, len(document_markdown))
488
+
489
+ # If we're not at the end, try to break at a paragraph boundary
490
+ if end_pos < len(document_markdown):
491
+ # Look for \n\n within the last 1000 characters of this batch
492
+ search_start = max(end_pos - 1000, current_pos)
493
+ last_paragraph = document_markdown.rfind('\n\n', search_start, end_pos)
494
+
495
+ if last_paragraph != -1 and last_paragraph > current_pos:
496
+ end_pos = last_paragraph + 2 # Include the \n\n
497
+
498
+ batch = document_markdown[current_pos:end_pos]
499
+ batches.append(batch)
500
+ current_pos = end_pos
501
+
502
+ print(f"📄 Created batch {len(batches)}: {len(batch)} chars (pos {current_pos-len(batch)}-{current_pos})")
503
+
504
+ return batches
505
+
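The batching helper above can be exercised standalone. A minimal sketch of the same logic (prints omitted), with a quick invariant check that batch sizes stay bounded and no text is lost:

```python
def split_document_into_batches(document_markdown, max_chars=8000):
    """Split a document into batches, preferring paragraph boundaries."""
    if len(document_markdown) <= max_chars:
        return [document_markdown]
    batches = []
    current_pos = 0
    while current_pos < len(document_markdown):
        end_pos = min(current_pos + max_chars, len(document_markdown))
        if end_pos < len(document_markdown):
            # Look for a paragraph break within the last 1000 chars of the window
            search_start = max(end_pos - 1000, current_pos)
            last_paragraph = document_markdown.rfind('\n\n', search_start, end_pos)
            if last_paragraph != -1 and last_paragraph > current_pos:
                end_pos = last_paragraph + 2  # include the '\n\n'
        batches.append(document_markdown[current_pos:end_pos])
        current_pos = end_pos
    return batches

# Sanity check on a synthetic document
doc = ("para one\n\n" + "x" * 50) * 3
batches = split_document_into_batches(doc, max_chars=70)
```

Since `current_pos` always advances to the previous `end_pos`, concatenating the batches reproduces the input exactly.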
506
+ async def gemini_chunk_document(document_markdown):
507
+ """Auto-chunk a document using Google Gemini 2.5 Pro with reliable structured output"""
508
+
509
+ # Get Gemini API key
510
+ gemini_api_key = os.environ.get("GEMINI_API_KEY")
511
+ if not gemini_api_key:
512
+ print("⚠️ No Gemini API key found")
513
+ return None, document_markdown
514
+
515
+ print(f"📄 Document length: {len(document_markdown)} characters")
516
+
517
+ try:
518
+ # Initialize Gemini client
519
+ client = genai.Client(api_key=gemini_api_key)
520
+
521
+ # Split document into batches if it's too large (Gemini has token limits)
522
+ batches = split_document_into_batches(document_markdown, max_chars=12000) # Gemini can handle larger batches
523
+ print(f"📄 Split document into {len(batches)} batches for Gemini")
524
+
525
+ all_chunks = []
526
+
527
+ # Process each batch
528
+ for batch_idx, batch in enumerate(batches):
529
+ print(f"\n🔄 Processing batch {batch_idx + 1}/{len(batches)} ({len(batch)} chars) with Gemini")
530
+
531
+ try:
532
+ # Create the prompt for Gemini
533
+ prompt = f"""You are an educational content analyzer. Analyze this document section and break it into logical learning chunks.
534
+
535
+ Each chunk should:
536
+ - Contain 2-3 related educational concepts that naturally belong together
537
+ - Be 150-500 words (optimal for learning)
538
+ - Have clear educational value
539
+ - Preserve all markdown/LaTeX formatting exactly
540
+ - Skip: abstracts, acknowledgments, references, author info, page numbers
541
+
542
+ Return your response as a valid JSON object with this exact structure:
543
+ {{
544
+ "chunks": [
545
+ {{
546
+ "topic": "Brief descriptive name (2-6 words)",
547
+ "text": "Complete chunk text with exact formatting preserved"
548
+ }}
549
+ ]
550
+ }}
551
+
552
+ Document section to analyze:
553
+ {batch}
554
+
555
+ Important: Return ONLY the JSON object, no other text."""
556
+
557
+ # Call Gemini 2.5 Pro (thinking_budget=-1 lets the model choose its thinking budget)
558
+ response = client.models.generate_content(
559
+ model="gemini-2.5-pro",
560
+ contents=prompt,
561
+ config=genai.types.GenerateContentConfig(
562
+ thinking_config=genai.types.ThinkingConfig(thinking_budget=-1)
563
+ )
564
+ )
565
+
566
+ # Extract and parse response
567
+ response_text = response.text.strip()
568
+ print(f"📋 Gemini response preview: {response_text[:200]}...")
569
+
570
+ # Clean up the response (remove code blocks if present)
571
+ clean_response = response_text
572
+ if clean_response.startswith('```json'):
573
+ clean_response = clean_response[7:]
574
+ if clean_response.endswith('```'):
575
+ clean_response = clean_response[:-3]
576
+ clean_response = clean_response.strip()
577
+
578
+ # Parse JSON
579
+ try:
580
+ json_data = json.loads(clean_response)
581
+
582
+ # Validate structure
583
+ if not isinstance(json_data, dict) or 'chunks' not in json_data:
584
+ print(f"❌ Invalid response structure from Gemini batch {batch_idx + 1}")
585
+ continue
586
+
587
+ chunks = json_data['chunks']
588
+ if not isinstance(chunks, list):
589
+ print(f"❌ 'chunks' is not a list in batch {batch_idx + 1}")
590
+ continue
591
+
592
+ # Process chunks
593
+ batch_chunks = []
594
+ for i, chunk in enumerate(chunks):
595
+ if not isinstance(chunk, dict) or 'topic' not in chunk or 'text' not in chunk:
596
+ print(f"❌ Invalid chunk structure in batch {batch_idx + 1}, chunk {i}")
597
+ continue
598
+
599
+ # Clean up text formatting
600
+ chunk_text = chunk['text']
601
+ # Replace literal \n with actual newlines (lookahead keeps LaTeX like \nabla intact)
602
+ chunk_text = re.sub(r'\\n(?![A-Za-z])', '\n', chunk_text)
603
+
604
+ batch_chunks.append({
605
+ "topic": chunk['topic'],
606
+ "text": chunk_text,
607
+ "chunk_index": len(all_chunks) + len(batch_chunks)
608
+ })
609
+
610
+ print(f"✅ Processed chunk: {chunk['topic']}")
611
+
612
+ all_chunks.extend(batch_chunks)
613
+ print(f"📊 Batch {batch_idx + 1} added {len(batch_chunks)} chunks (total: {len(all_chunks)})")
614
+
615
+ except json.JSONDecodeError as e:
616
+ print(f"❌ JSON parsing failed for batch {batch_idx + 1}: {e}")
617
+ print(f"❌ Response was: {response_text}")
618
+ continue
619
+
620
+ except Exception as e:
621
+ print(f"❌ Error processing batch {batch_idx + 1} with Gemini: {e}")
622
+ continue
623
+
624
+ # Return results
625
+ if all_chunks:
626
+ print(f"✅ Gemini successfully processed document with {len(all_chunks)} total chunks")
627
+ return all_chunks, document_markdown
628
+ else:
629
+ print("❌ Gemini processing failed for all batches")
630
+ return None, document_markdown
631
+
632
+ except Exception as e:
633
+ print(f"❌ Gemini chunking error: {e}")
634
+ return None, document_markdown
635
+
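Both the Gemini path above and the Fireworks path below strip an optional markdown code fence from the model response with the same few lines before calling `json.loads`; that cleanup could be factored into one helper. A sketch (`strip_code_fence` is a hypothetical name, not in the diff):

```python
def strip_code_fence(text):
    """Remove a surrounding ```json ... ``` fence from an LLM response, if present."""
    text = text.strip()
    if text.startswith('```json'):
        text = text[7:]   # drop the opening fence
    if text.endswith('```'):
        text = text[:-3]  # drop the closing fence
    return text.strip()
```

Responses without a fence pass through unchanged, so the helper is safe to apply unconditionally.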
636
  async def auto_chunk_document(document_markdown, client=None):
637
+ """Auto-chunk a document using LLM with batch processing for large documents"""
638
+
639
+ # Debug: Print document info
640
+ print(f"📄 Document length: {len(document_markdown)} characters")
641
 
642
  # Get Fireworks API key
643
  fireworks_api_key = os.environ.get("FIREWORKS_API_KEY")
644
  if not fireworks_api_key:
645
+ print("⚠️ No Fireworks API key found, falling back to programmatic chunking")
646
+ chunks, original_markdown = programmatic_chunk_document(document_markdown)
647
+ return chunks, original_markdown
648
+
649
+ # Split document into batches if it's too large
650
+ batches = split_document_into_batches(document_markdown, max_chars=8000)
651
+ print(f"📄 Split document into {len(batches)} batches")
652
+
653
+ all_chunks = []
654
+
655
+ # Process each batch
656
+ for batch_idx, batch in enumerate(batches):
657
+ print(f"\n🔄 Processing batch {batch_idx + 1}/{len(batches)} ({len(batch)} chars)")
658
+
659
+ # Try structured output with retry logic for this batch
660
+ max_retries = 3
661
+ batch_chunks = None
662
+
663
+ for attempt in range(max_retries):
664
+ try:
665
+ print(f"🚀 Batch {batch_idx + 1} Attempt {attempt + 1}/{max_retries}: Calling Fireworks...")
666
+
667
+ # Initialize LLM
668
+ llm = init_chat_model(
669
+ "accounts/fireworks/models/llama4-maverick-instruct-basic",
670
+ model_provider="fireworks",
671
+ api_key=fireworks_api_key
672
+ )
673
+
674
+ # Use regular LLM and manual JSON parsing
675
+ prompt = f"""You are an educational content analyzer. Break this document section into logical learning chunks.
676
+
677
+ IMPORTANT: Return your response as a valid JSON object with this exact structure:
678
+ {{
679
+ "chunks": [
680
+ {{
681
+ "topic": "Brief topic name",
682
+ "text": "Complete chunk text with exact formatting"
683
+ }}
684
+ ]
685
+ }}
686
+
687
+ Rules for chunking:
688
+ - Each chunk should contain 2-3 related educational concepts
689
+ - Keep chunks concise: 100-300 words (avoid very long text blocks)
690
+ - Preserve all markdown/LaTeX formatting exactly as written
691
+ - Skip: abstracts, acknowledgements, references, author information, page numbers
692
+ - Create separate chunks for figures/tables with their captions
693
+ - Never split mathematical expressions or LaTeX formulas
694
+ - Process ALL content in this section - don't skip any educational material
695
+ - Ensure all JSON strings are properly formatted (no unescaped quotes)
696
+
697
+ Document section to analyze:
698
+ {batch}
699
+
700
+ Return only the JSON object, no other text."""
701
+
702
+ # Call regular LLM
703
+ result = llm.invoke(prompt)
704
+ print(f"📋 Raw LLM response type: {type(result)}")
705
+
706
+ # Extract text content
707
+ if hasattr(result, 'content'):
708
+ response_text = result.content
709
+ elif hasattr(result, 'text'):
710
+ response_text = result.text
711
+ else:
712
+ response_text = str(result)
713
+
714
+ print(f"📋 Response text preview: {response_text[:200]}...")
715
+
716
+ # Try to parse JSON manually
717
+
718
+ try:
719
+ # Clean up the response - remove any markdown code blocks and fix common issues
720
+ clean_response = response_text.strip()
721
+ if clean_response.startswith('```json'):
722
+ clean_response = clean_response[7:]
723
+ if clean_response.endswith('```'):
724
+ clean_response = clean_response[:-3]
725
+ clean_response = clean_response.strip()
726
+
727
+ # Fix common JSON truncation issues
728
+ # If the response doesn't end properly, try to close it
729
+ if not clean_response.endswith('}'):
730
+ # Try to find the last complete chunk entry and close properly
731
+ last_brace = clean_response.rfind('}')
732
+ if last_brace != -1:
733
+ # Find if we're inside a chunks array
734
+ chunks_start = clean_response.find('"chunks": [')
735
+ if chunks_start != -1 and last_brace > chunks_start:
736
+ # Close the chunks array and main object
737
+ clean_response = clean_response[:last_brace+1] + '\n ]\n}'
738
+ else:
739
+ clean_response = clean_response[:last_brace+1]
740
+
741
+ print(f"📋 Cleaned response preview: {clean_response[:300]}...")
742
+ print(f"📋 Cleaned response ends with: '{clean_response[-50:]}'")
743
+
744
+ # Additional safety: ensure we have a complete JSON structure
745
+ if not (clean_response.startswith('{') and clean_response.endswith('}')):
746
+ print(f"❌ Response doesn't look like valid JSON structure")
747
+ continue
748
+
749
+ # Fix common JSON escape issues with LaTeX
750
+ # Replace single backslashes with double backslashes in JSON strings
751
+ # But be careful not to affect already-escaped sequences
752
+ def fix_latex_escapes(text):
753
+ # Find all JSON string values (between quotes)
754
+ def escape_in_string(match):
755
+ string_content = match.group(1)
756
+ # Escape single backslashes in LaTeX commands
757
+ # Handle \mathrm, \left, \%, etc. but preserve JSON escapes like \n, \t, \", \\
758
+ # Pattern: backslash followed by letters OR specific LaTeX symbols like %
759
+ fixed = re.sub(r'(?<!\\)\\(?=[a-zA-Z%])', r'\\\\', string_content)
760
+ return f'"{fixed}"'
761
+
762
+ # Apply to all JSON string values
763
+ return re.sub(r'"([^"\\]*(\\.[^"\\]*)*)"', escape_in_string, text)
764
+
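The nested `fix_latex_escapes` above can be checked in isolation. A self-contained sketch of the same regex approach, turning a payload with raw LaTeX backslashes (invalid JSON escapes) into parseable JSON:

```python
import json
import re

def fix_latex_escapes(text):
    """Double lone backslashes before letters/% inside JSON string values."""
    def escape_in_string(match):
        # Re-escape LaTeX commands (\alpha, \nabla, \%) the LLM left unescaped
        fixed = re.sub(r'(?<!\\)\\(?=[a-zA-Z%])', r'\\\\', match.group(1))
        return f'"{fixed}"'
    # Apply the fix to every JSON string literal
    return re.sub(r'"([^"\\]*(?:\\.[^"\\]*)*)"', escape_in_string, text)

# A response with raw LaTeX backslashes that plain json.loads would reject/mangle
broken = '{"topic": "Math", "text": "Update rule: \\alpha \\nabla f"}'
data = json.loads(fix_latex_escapes(broken))
```

Note the trade-off: the lookahead also doubles sequences like `\n` inside strings, which is why the code later converts literal `\n` back to newlines.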
765
+ clean_response = fix_latex_escapes(clean_response)
766
+ print(f"📋 After escape fixing: {clean_response[:200]}...")
767
+
768
+ # Parse JSON
769
+ json_data = json.loads(clean_response)
770
+ print(f"📋 Successfully parsed JSON: {type(json_data)}")
771
+
772
+ # Validate with Pydantic
773
+ chunk_response = ChunkList.model_validate(json_data)
774
+ print(f"📋 Pydantic validation successful: {type(chunk_response)}")
775
+
776
+ # Fix literal \n strings in chunk text (convert to actual newlines)
777
+ for chunk in chunk_response.chunks:
778
+ if hasattr(chunk, 'text') and chunk.text:
779
+ # Replace literal \n with actual newlines for paragraph breaks
780
+ # A negative lookahead keeps LaTeX commands like \nabla intact
781
+ chunk.text = re.sub(r'\\n(?![A-Za-z])', '\n', chunk.text)
782
+
783
+ except json.JSONDecodeError as e:
784
+ print(f"❌ Attempt {attempt + 1}: JSON parsing failed: {e}")
785
+ print(f"❌ Response was: {response_text}")
786
+ continue
787
+ except Exception as e:
788
+ print(f"❌ Attempt {attempt + 1}: Pydantic validation failed: {e}")
789
+ continue
790
+
791
+ chunks = chunk_response.chunks
792
+ if not chunks or len(chunks) == 0:
793
+ print(f"⚠️ Attempt {attempt + 1}: No chunks returned")
794
+ continue
795
+
796
+ # Success! Process chunks
797
+ processed_chunks = []
798
+ for i, chunk in enumerate(chunks):
799
+ print(f"\n📝 Processing chunk {i+1}: {chunk.topic}")
800
+
801
+ if not hasattr(chunk, 'text') or not chunk.text.strip():
802
+ print(f"❌ Chunk missing or empty text: {chunk}")
803
+ continue
804
+
805
+ print(f" Text preview: '{chunk.text[:100]}...'")
806
+
807
+ processed_chunks.append({
808
+ "topic": chunk.topic,
809
+ "text": chunk.text,
810
+ "chunk_index": i
811
+ })
812
+
813
+ if processed_chunks:
814
+ print(f"✅ Successfully processed {len(processed_chunks)} chunks for batch {batch_idx + 1}")
815
+ batch_chunks = processed_chunks
816
+ break
817
+ else:
818
+ print(f"❌ Batch {batch_idx + 1} Attempt {attempt + 1}: No valid chunks processed")
819
+ continue
820
+
821
+ except Exception as e:
822
+ print(f"❌ Batch {batch_idx + 1} Attempt {attempt + 1} failed: {e}")
823
+ if attempt == max_retries - 1:
824
+ print(f"❌ All {max_retries} attempts failed for batch {batch_idx + 1}")
825
+
826
+ # Add successful batch chunks to all_chunks
827
+ if batch_chunks:
828
+ all_chunks.extend(batch_chunks)
829
+ print(f"📊 Total chunks so far: {len(all_chunks)}")
830
+ else:
831
+ print(f"⚠️ Batch {batch_idx + 1} failed completely, skipping...")
832
+
833
+ # Final results
834
+ if all_chunks:
835
+ print(f"✅ Successfully processed document with {len(all_chunks)} total chunks from {len(batches)} batches")
836
+ # Re-index all chunks sequentially
837
+ for i, chunk in enumerate(all_chunks):
838
+ chunk["chunk_index"] = i
839
+ return all_chunks, document_markdown
840
+ else:
841
+ print("🔄 All batches failed, falling back to programmatic chunking...")
842
+ chunks, original_markdown = programmatic_chunk_document(document_markdown)
843
+ return chunks, original_markdown
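The truncation repair inside the retry loop above (closing a cut-off `"chunks"` array at the last complete object) can also be sketched as a standalone helper; `close_truncated_json` is a hypothetical name for the same logic:

```python
import json

def close_truncated_json(clean_response):
    """Close a truncated {"chunks": [...]} payload at the last complete object."""
    if not clean_response.endswith('}'):
        last_brace = clean_response.rfind('}')
        if last_brace != -1:
            chunks_start = clean_response.find('"chunks": [')
            if chunks_start != -1 and last_brace > chunks_start:
                # Drop the partial trailing chunk, then close the array and object
                clean_response = clean_response[:last_brace + 1] + '\n ]\n}'
            else:
                clean_response = clean_response[:last_brace + 1]
    return clean_response

# A response cut off mid-way through its second chunk
truncated = '{"chunks": [{"topic": "A", "text": "t"}, {"topic": "B", "te'
data = json.loads(close_truncated_json(truncated))
```

The partial second chunk is discarded, which loses content but salvages the batch instead of failing the whole JSON parse.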
844
 
845
  try:
846
  # Initialize Fireworks LLM with structured output
 
853
  # Create structured LLM that returns ChunkList object
854
  structured_llm = llm.with_structured_output(ChunkList)
855
 
856
+ # Create improved chunking prompt that returns complete chunk text
857
+ prompt = f"""## Task
858
+ Analyze this academic document and create logical educational chunks. Each chunk should contain 2-3 related educational concepts or lessons that a student would naturally learn together.
 
859
 
860
+ ## Step-by-Step Process
861
+ 1. **Scan the document** to identify main topics and educational concepts
862
+ 2. **Group related paragraphs** that teach connected ideas (even if separated by figures)
863
+ 3. **Create separate chunks** for figures/tables with their captions
864
+ 4. **Ensure each chunk** contains 2-3 educational lessons that build on each other
865
+ 5. **Preserve all formatting** exactly as written
866
+
867
+ ## Chunking Rules
868
+
869
+ ### Content Rules
870
+ - **Combine related content**: If a concept is split by a figure placement, reunite the related paragraphs in one chunk
871
+ - **2-3 educational lessons per chunk**: Each chunk should teach 2-3 connected concepts that logically belong together
872
+ - **Preserve complete thoughts**: Never split sentences, mathematical expressions, or LaTeX formulas
+ - **Skip metadata sections**: Exclude abstracts, acknowledgments, references, author info, page numbers
 
+ ### Formatting Rules
+ - **Preserve exactly**: All markdown, LaTeX, mathematical notation, and formatting
+ - **Include paragraph breaks**: Maintain original \\n\\n paragraph separations
+ - **Remove artifacts**: Strip page numbers, headers, footers, and formatting metadata
+
+ ### Special Elements
+ - **Figures/Tables/Images**: Create separate chunks containing the full caption and any accompanying text
+ - **Mathematical expressions**: Keep complete formulas together, never split LaTeX
+ - **Code blocks**: Preserve in their entirety with proper formatting
+
+ ## Output Format
+ Return a JSON object with this exact schema:
+
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Brief descriptive name (2-6 words) for the educational content",
+ "text": "Complete chunk text with exact markdown/LaTeX formatting preserved"
+ }}
+ ]
+ }}
+ ```
+
+ ## Quality Criteria
+ **Good chunks:**
+ - Contain 2-3 related educational concepts
+ - Are 150-500 words (optimal learning unit size)
+ - Have clear educational value and logical flow
+ - Preserve all original formatting perfectly
+
+ **Avoid:**
+ - Single-sentence chunks
+ - Chunks with >5 unrelated concepts
+ - Split mathematical expressions
+ - Metadata or reference content
+
+ ## Examples
+
+ **Good chunk example:**
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Gradient Descent Fundamentals",
+ "text": "## Gradient Descent Algorithm\\n\\nGradient descent is an optimization algorithm used to minimize functions...\\n\\n### Mathematical Formulation\\n\\nThe update rule is given by:\\n\\n$\\theta_{{t+1}} = \\theta_t - \\alpha \\nabla f(\\theta_t)$\\n\\nwhere $\\alpha$ is the learning rate..."
+ }}
+ ]
+ }}
+ ```
+
+ **Bad chunk example:**
+ ```json
+ {{
+ "chunks": [
+ {{
+ "topic": "Introduction",
+ "text": "This paper presents..."
+ }}
+ ]
+ }}
+ ```
+ (Too brief, not educational content)
+
+ ---
+
+ ## Document to Process:
+ {document_markdown}
+
+ Please analyze the document and return the JSON object with chunks following the above guidelines.
+ """
  # Call Fireworks with structured output
+ print("🚀 Calling Fireworks for document chunking...")
+ try:
+ chunk_response = structured_llm.invoke(prompt)
+ print(f"📋 Raw response type: {type(chunk_response)}")
+ print(f"📋 Raw response: {chunk_response}")
+ except Exception as invoke_error:
+ print(f"❌ Error during Fireworks invoke: {invoke_error}")
+ return [], document_markdown
+
+ if chunk_response is None:
+ print("❌ Received None response from Fireworks")
+ return [], document_markdown
+
+ if not hasattr(chunk_response, 'chunks'):
+ print(f"❌ Response missing 'chunks' attribute: {type(chunk_response)}")
+ print(f"Response content: {chunk_response}")
+ return [], document_markdown
+
+ chunks = chunk_response.chunks
+ if not chunks:
+ print("⚠️ No chunks returned from Fireworks")
+ return [], document_markdown
+
+ # Process chunks with direct text (no fuzzy matching needed)
+ processed_chunks = []
+ for i, chunk in enumerate(chunks):
+ print(f"\n📝 Processing chunk {i+1}: {chunk.topic}")
+
+ # Check if chunk has the expected 'text' attribute
+ if not hasattr(chunk, 'text'):
+ print(f"❌ Chunk missing 'text' attribute: {chunk}")
+ continue
+
+ print(f" Text preview: '{chunk.text[:100]}...'")
+
+ processed_chunks.append({
+ "topic": chunk.topic,
+ "text": chunk.text,
+ "chunk_index": i
+ })
+
+ print(f"📊 Processed {len(processed_chunks)} chunks with direct text")
+
+ return processed_chunks, document_markdown
  except Exception as e:
  import traceback
  print(f"❌ Auto-chunking error: {e}")
  print(f"❌ Full traceback: {traceback.format_exc()}")
+ return [], document_markdown

  @app.post("/chunk_page")
  async def chunk_page(request: dict):
 
  3. Focus on educational content (concepts, examples, key points)
  4. More dense content should have more chunks, less dense content fewer chunks
  5. Identify chunks that would make good interactive lessons
+ 6. SKIP chunks from abstract, references, author information, page numbers, etc.

  Return a list of chunks with topic, start_phrase, and end_phrase for each."""

  chunks = chunk_response.chunks
  print(f"📝 Received {len(chunks)} chunks from Fireworks")

+ # Process chunks with direct text (no fuzzy matching needed)
+ processed_chunks = []
+ for i, chunk in enumerate(chunks):
+ processed_chunks.append({
+ "topic": chunk.topic,
+ "text": chunk.text,
+ "chunk_index": i
+ })
+ print(f"✅ Processed chunk: {chunk.topic}")
+
+ print(f"📊 Successfully processed {len(processed_chunks)} chunks")

  return {
+ "chunks": processed_chunks,
+ "total_found": len(processed_chunks),
  "total_suggested": len(chunks)
  }
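
The defensive checks in the two endpoints above (None response, missing `chunks` attribute, per-chunk missing `text`) follow one pattern and could be factored into a single guard. A sketch, assuming the response object mirrors the structured-output schema; `extract_chunks` is a hypothetical helper, not a function in this repo:

```python
def extract_chunks(response):
    """Flatten a structured-output response into plain dicts.

    Returns [] for a None or malformed response instead of raising,
    matching the endpoints' fallback behaviour above.
    """
    chunks = getattr(response, "chunks", None) or []
    processed = []
    for i, chunk in enumerate(chunks):
        if not hasattr(chunk, "text"):  # skip malformed entries instead of failing
            continue
        processed.append({"topic": chunk.topic, "text": chunk.text, "chunk_index": i})
    return processed
```

Since `extract_chunks(None)` simply returns `[]`, callers can keep a single `return [], document_markdown` fallback for every failure mode.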
 
backend/requirements.txt CHANGED
@@ -10,3 +10,4 @@ langchain-core
  langchain-fireworks
  pydantic
  anthropic
+ google-genai
frontend/src/components/ChunkPanel.jsx CHANGED
@@ -67,7 +67,8 @@ const ChunkPanel = ({
  rehypePlugins={[rehypeRaw, rehypeKatex]}
  components={chunkMarkdownComponents}
  >
- {documentData.markdown.slice(
+ {documentData.chunks[currentChunkIndex].text ||
+ documentData.markdown.slice(
  documentData.chunks[currentChunkIndex].start_position,
  documentData.chunks[currentChunkIndex].end_position
  )}
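
The new render expression prefers the chunk's LLM-provided `text` and only falls back to position-based slicing when it is absent. The same precedence, sketched in Python for clarity (the dict keys mirror the JSX fields; the helper name is an assumption):

```python
def chunk_display_text(chunk, markdown):
    # Prefer direct text from the chunker; fall back to slicing by stored positions.
    if chunk.get("text"):
        return chunk["text"]
    return markdown[chunk["start_position"]:chunk["end_position"]]
```

This keeps older documents (which only carry `start_position`/`end_position`) rendering while new chunking responses supply `text` directly.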
frontend/src/components/DocumentProcessor.jsx CHANGED
@@ -1,4 +1,4 @@
- import { useMemo } from 'react';
+ import React from 'react';
  import 'katex/dist/katex.min.css';

  // Import custom hooks
@@ -14,7 +14,7 @@ import ChunkNavigation from './ChunkNavigation';
  import ChunkPanel from './ChunkPanel';

  // Import utilities
- import { highlightChunkInMarkdown } from '../utils/markdownUtils';
+ // Note: Removed markdown highlighting utilities - now using PDF viewer

  function DocumentProcessor() {
  // Custom hooks
@@ -59,20 +59,14 @@ function DocumentProcessor() {
  isDragging,
  containerRef,
  handleMouseDown
- } = usePanelResize(40);
+ } = usePanelResize(50);

  // Enhanced startInteractiveLesson that uses the chat hook
  const handleStartInteractiveLesson = () => {
  startInteractiveLesson(() => startChunkLesson(currentChunkIndex, documentData));
  };

- // Memoize the highlighted markdown to prevent unnecessary re-renders
- const highlightedMarkdown = useMemo(() => {
- if (!documentData || !documentData.markdown || !documentData.chunks) {
- return '';
- }
- return highlightChunkInMarkdown(documentData.markdown, documentData.chunks, currentChunkIndex);
- }, [documentData?.markdown, documentData?.chunks, currentChunkIndex]);
+ // Note: Removed highlightedMarkdown - now using PDF viewer instead of markdown

  // Early returns for different states
  if (!selectedFile) {
@@ -130,11 +124,10 @@ function DocumentProcessor() {
  {/* Left Panel - Document */}
  <div style={{ width: `${leftPanelWidth}%`, height: '100%' }}>
  <DocumentViewer
- highlightedMarkdown={highlightedMarkdown}
+ selectedFile={selectedFile}
  documentData={documentData}
- fetchImage={fetchImage}
- imageCache={imageCache}
- setImageCache={() => {}} // Handled by useDocumentProcessor
+ currentChunkIndex={currentChunkIndex}
+ chunks={documentData?.chunks}
  />
  </div>
 
frontend/src/components/DocumentViewer.jsx CHANGED
@@ -1,51 +1,207 @@
- import ReactMarkdown from 'react-markdown';
- import remarkMath from 'remark-math';
- import rehypeKatex from 'rehype-katex';
- import rehypeRaw from 'rehype-raw';
- import { getDocumentMarkdownComponents } from '../utils/markdownComponents.jsx';

- const DocumentViewer = ({ highlightedMarkdown, documentData, fetchImage, imageCache, setImageCache }) => {
- const markdownComponents = getDocumentMarkdownComponents(documentData, fetchImage, imageCache, setImageCache);

  return (
- <div className="bg-white rounded-lg shadow-sm flex flex-col" style={{ width: '100%', height: '100%' }}>
  <div className="sticky top-0 bg-white rounded-t-lg px-6 py-4 border-b border-gray-200 z-10">
- <h2 className="text-lg font-semibold text-left text-gray-800">Document</h2>
  </div>

- <div className="flex-1 px-6 pt-6 pb-8 overflow-y-auto">
- <style>
- {`
- @keyframes fadeInHighlight {
- 0% {
- background-color: rgba(255, 214, 100, 0);
- border-left-color: rgba(156, 163, 175, 0);
- transform: translateX(-10px);
- opacity: 0;
  }
- 100% {
- background-color: rgba(255, 214, 100, 0.15);
- border-left-color: rgba(156, 163, 175, 0.5);
- transform: translateX(0);
- opacity: 1;
  }
- }
- `}
- </style>
- <div className="prose prose-sm max-w-none" style={{
- fontSize: '0.875rem',
- lineHeight: '1.5',
- color: 'rgb(55, 65, 81)'
- }}>
- <ReactMarkdown
- remarkPlugins={[remarkMath]}
- rehypePlugins={[rehypeRaw, rehypeKatex]}
- components={markdownComponents}
  >
- {highlightedMarkdown}
- </ReactMarkdown>
  </div>
- </div>
  </div>
  );
  };

+ import { useState, useRef } from 'react';
+ import { Document, Page, pdfjs } from 'react-pdf';
+ import 'react-pdf/dist/Page/AnnotationLayer.css';
+ import 'react-pdf/dist/Page/TextLayer.css';

+ // Set up PDF.js worker
+ pdfjs.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js';
+
+ const DocumentViewer = ({ selectedFile, documentData, currentChunkIndex, chunks }) => {
+ const pdfContainerRef = useRef(null);
+ const [numPages, setNumPages] = useState(null);
+ const [currentPage, setCurrentPage] = useState(1);
+ const [zoomLevel, setZoomLevel] = useState(1);
+ const [visiblePages, setVisiblePages] = useState(new Set([1]));
+
+ // Handle scroll to update current page and track visible pages
+ const handleScroll = () => {
+ if (!pdfContainerRef.current || !numPages) return;
+
+ const container = pdfContainerRef.current;
+ const scrollTop = container.scrollTop;
+ const containerHeight = container.clientHeight;
+ const totalScrollHeight = container.scrollHeight - containerHeight;
+
+ // Calculate which page we're viewing based on scroll position
+ const scrollPercent = scrollTop / totalScrollHeight;
+ const newPage = Math.min(Math.floor(scrollPercent * numPages) + 1, numPages);
+
+ if (newPage !== currentPage) {
+ setCurrentPage(newPage);
+ }
+
+ // Track visible pages based on zoom level (more pages visible when zoomed out)
+ const newVisiblePages = new Set();
+ const visibleRange = Math.max(1, Math.ceil(2 / zoomLevel)); // More pages when zoomed out
+ for (let i = Math.max(1, newPage - visibleRange); i <= Math.min(numPages, newPage + visibleRange); i++) {
+ newVisiblePages.add(i);
+ }
+
+ // Update visible pages if changed
+ if (newVisiblePages.size !== visiblePages.size ||
+ ![...newVisiblePages].every(page => visiblePages.has(page))) {
+ setVisiblePages(newVisiblePages);
+ }
+ };
+
+ // Jump to specific page
+ const goToPage = (pageNumber) => {
+ if (!pdfContainerRef.current || !numPages) return;
+
+ // Update visible pages immediately for target page
+ const newVisiblePages = new Set();
+ const visibleRange = Math.max(1, Math.ceil(2 / zoomLevel)); // More pages when zoomed out
+ for (let i = Math.max(1, pageNumber - visibleRange); i <= Math.min(numPages, pageNumber + visibleRange); i++) {
+ newVisiblePages.add(i);
+ }
+ setVisiblePages(newVisiblePages);
+
+ const container = pdfContainerRef.current;
+ const totalScrollHeight = container.scrollHeight - container.clientHeight;
+
+ // Calculate scroll position for the target page
+ const targetScrollPercent = (pageNumber - 1) / numPages;
+ const targetScrollTop = targetScrollPercent * totalScrollHeight;
+
+ container.scrollTo({
+ top: targetScrollTop,
+ behavior: 'smooth'
+ });
+ };
+
+ // Zoom controls
+ const zoomIn = () => setZoomLevel(prev => Math.min(prev + 0.25, 3));
+ const zoomOut = () => setZoomLevel(prev => Math.max(prev - 0.25, 0.5));
+ const resetZoom = () => setZoomLevel(1);

  return (
+ <div className="bg-white rounded-lg shadow-sm flex flex-col relative" style={{ width: '100%', height: '100%' }}>
+ {/* Header */}
  <div className="sticky top-0 bg-white rounded-t-lg px-6 py-4 border-b border-gray-200 z-10">
+ <h2 className="text-lg font-semibold text-left text-gray-800">Original Document</h2>
  </div>

+ {/* PDF Content */}
+ <div
+ ref={pdfContainerRef}
+ className="flex-1 overflow-auto flex justify-center bg-gray-100"
+ onScroll={handleScroll}
+ >
+ {selectedFile ? (
+ <div className="py-4">
+ <Document
+ file={selectedFile}
+ onLoadSuccess={({ numPages }) => setNumPages(numPages)}
+ loading={
+ <div className="flex items-center justify-center h-64">
+ <div className="text-center">
+ <div className="w-8 h-8 border-4 border-blue-500 border-t-transparent rounded-full animate-spin mx-auto mb-2"></div>
+ <p className="text-gray-600">Loading PDF...</p>
+ </div>
+ </div>
  }
+ error={
+ <div className="flex items-center justify-center h-64">
+ <div className="text-center">
+ <p className="text-red-600 mb-2">Failed to load PDF</p>
+ <p className="text-gray-600 text-sm">Please check the file format</p>
+ </div>
+ </div>
  }
+ >
+ {/* Render all pages continuously */}
+ {numPages && Array.from(new Array(numPages), (_, index) => {
+ const pageNum = index + 1;
+ const isVisible = visiblePages.has(pageNum);
+ const currentZoom = isVisible ? zoomLevel : 1; // Only zoom visible pages
+
+ return (
+ <div key={pageNum} className="mb-4 flex justify-center">
+ <Page
+ pageNumber={pageNum}
+ width={typeof window !== 'undefined' ? window.innerWidth * 0.4 * 0.9 * currentZoom : 600 * currentZoom}
+ />
+ </div>
+ );
+ })}
+ </Document>
+ </div>
+ ) : (
+ <div className="flex items-center justify-center h-full">
+ <p className="text-gray-500">No PDF loaded</p>
+ </div>
+ )}
+ </div>
+
+ {/* Pagination overlay - floating pill */}
+ {numPages && (
+ <div className="absolute bottom-4 left-1/2 transform -translate-x-1/2 z-10">
+ <div className="flex items-center bg-gray-800/90 backdrop-blur-sm rounded-full shadow-lg px-3 py-2 space-x-3">
+ <button
+ onClick={() => goToPage(Math.max(currentPage - 1, 1))}
+ disabled={currentPage <= 1}
+ className="w-8 h-8 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M10 12l-4-4 4-4v8z"/>
+ </svg>
+ </button>
+
+ <span className="px-3 py-1 text-sm font-medium text-white min-w-[60px] text-center">
+ {currentPage}/{numPages}
+ </span>
+
+ <button
+ onClick={() => goToPage(Math.min(currentPage + 1, numPages))}
+ disabled={currentPage >= numPages}
+ className="w-8 h-8 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M6 4l4 4-4 4V4z"/>
+ </svg>
+ </button>
+ </div>
+ </div>
+ )}
+
+ {/* Zoom controls overlay - bottom right */}
+ {numPages && (
+ <div className="absolute bottom-4 right-4 z-10 flex flex-col items-center space-y-2">
+ {/* Main zoom pill - vertical */}
+ <div className="flex flex-col items-center bg-gray-800/90 backdrop-blur-sm rounded-full shadow-lg px-2 py-2 space-y-1">
+ <button
+ onClick={zoomIn}
+ disabled={zoomLevel >= 3}
+ className="w-6 h-6 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M8 4v4H4v1h4v4h1V9h4V8H9V4z"/>
+ </svg>
+ </button>
+
+ <button
+ onClick={zoomOut}
+ disabled={zoomLevel <= 0.5}
+ className="w-6 h-6 rounded-full bg-gray-600 hover:bg-gray-500 disabled:opacity-30 disabled:cursor-not-allowed flex items-center justify-center transition-colors text-white"
+ >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor">
+ <path d="M4 8h8v1H4z"/>
+ </svg>
+ </button>
+ </div>
+
+ {/* Reset button below */}
+ <button
+ onClick={resetZoom}
+ className="w-10 h-10 bg-gray-700 hover:bg-gray-500 backdrop-blur-sm rounded-full shadow-lg flex items-center justify-center text-white transition-colors"
  >
+ <svg width="14" height="14" viewBox="0 0 16 16" fill="currentColor" stroke="currentColor" strokeWidth="0.5">
+ <path d="M8 3a5 5 0 1 0 4.546 2.914.5.5 0 0 1 .908-.417A6 6 0 1 1 8 2v1z" strokeWidth="1"/>
+ <path d="M8 4.466V.534a.25.25 0 0 1 .41-.192l2.36 1.966c.12.1.12.284 0 .384L8.41 4.658A.25.25 0 0 1 8 4.466z"/>
+ </svg>
+ </button>
  </div>
+ )}
  </div>
  );
  };
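
Both `handleScroll` and `goToPage` above map between scroll offset and page number with a simple linear model, which implicitly assumes roughly uniform page heights; with mixed page sizes, per-page `offsetTop` positions would be more accurate. The arithmetic, extracted into Python for clarity (the numeric values below are hypothetical; `int()` truncation matches `Math.floor` here because the inputs are non-negative):

```python
def page_from_scroll(scroll_top, scroll_height, client_height, num_pages):
    # Fraction of the scrollable range -> 1-based page number (handleScroll).
    total = scroll_height - client_height
    percent = scroll_top / total if total > 0 else 0.0
    return min(int(percent * num_pages) + 1, num_pages)


def scroll_for_page(page_number, scroll_height, client_height, num_pages):
    # Inverse mapping (goToPage): top of page N sits at (N-1)/num_pages of the range.
    total = scroll_height - client_height
    return (page_number - 1) / num_pages * total
```

For example, with a 3000px document in a 600px viewport and 4 pages, scrolling to the bottom (offset 2400) lands on page 4, and `scroll_for_page(1, ...)` is 0.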
test_fuzzy_find.py ADDED
@@ -0,0 +1,194 @@
+ #%%
+ import matplotlib.pyplot as plt
+ from difflib import SequenceMatcher
+ import numpy as np
+
+ def fuzzy_find(text, pattern, start_pos=0):
+ """Find the best fuzzy match for pattern in text starting from start_pos"""
+ best_ratio = 0
+ best_pos = -1
+
+ # Search in sliding windows
+ pattern_len = len(pattern)
+ for i in range(start_pos, len(text) - pattern_len + 1):
+ window = text[i:i + pattern_len]
+ ratio = SequenceMatcher(None, pattern.lower(), window.lower()).ratio()
+
+ if ratio > best_ratio and ratio > 0.8: # Much stricter: 80% similarity
+ best_ratio = ratio
+ best_pos = i
+
+ return best_pos if best_pos != -1 else None
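
`fuzzy_find` is an O(len(text) · len(pattern)) sliding-window scan: every window of pattern length is scored with `SequenceMatcher.ratio()`, and the best score above 0.8 wins. A condensed, self-contained version for quick experiments (the helper name is an assumption, not part of the repo):

```python
from difflib import SequenceMatcher


def best_window(text, pattern, threshold=0.8):
    """Return (position, ratio) of the best-matching window, or (None, ratio)."""
    n = len(pattern)
    best_pos, best_ratio = None, 0.0
    for i in range(len(text) - n + 1):
        r = SequenceMatcher(None, pattern.lower(), text[i:i + n].lower()).ratio()
        if r > best_ratio:
            best_pos, best_ratio = i, r
    return (best_pos, best_ratio) if best_ratio > threshold else (None, best_ratio)


best_window("Methane is the second most important gas", "second most")
# → (15, 1.0): exact substring found at offset 15
```

Unlike `fuzzy_find`, this variant also reports the best ratio even when it falls below the threshold, which is exactly the information the plots below are meant to surface.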
+
+ def analyze_fuzzy_ratios(markdown_text, chunk_text):
+ """
+ Analyze fuzzy matching ratios across the entire markdown text using a rolling window.
+ Returns positions and their corresponding similarity ratios.
+ """
+ chunk_len = len(chunk_text)
+ positions = []
+ ratios = []
+
+ # Rolling window over the entire markdown text
+ for i in range(len(markdown_text) - chunk_len + 1):
+ window = markdown_text[i:i + chunk_len]
+ ratio = SequenceMatcher(None, chunk_text.lower(), window.lower()).ratio()
+ positions.append(i)
+ ratios.append(ratio)
+
+ return positions, ratios
+
+ def plot_ratio_distribution(positions, ratios, chunk_text, markdown_file_path=None):
+ """
+ Create a plot showing the similarity ratio distribution across positions.
+ """
+ fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))
+
+ # Main plot: ratio vs position
+ ax1.plot(positions, ratios, 'b-', alpha=0.7, linewidth=1)
+ ax1.axhline(y=0.8, color='r', linestyle='--', label='Fuzzy find threshold (0.8)')
+ ax1.set_xlabel('Position in Markdown Text')
+ ax1.set_ylabel('Similarity Ratio')
+ ax1.set_title(f'Fuzzy Match Similarity Ratios Across Text\n(Chunk length: {len(chunk_text)} chars)')
+ ax1.grid(True, alpha=0.3)
+ ax1.legend()
+
+ # Highlight maximum ratio
+ max_ratio = max(ratios)
+ max_pos = positions[ratios.index(max_ratio)]
+ ax1.plot(max_pos, max_ratio, 'ro', markersize=8, label=f'Max ratio: {max_ratio:.3f} at pos {max_pos}')
+ ax1.legend()
+
+ # Histogram of ratios
+ ax2.hist(ratios, bins=50, alpha=0.7, edgecolor='black')
+ ax2.axvline(x=0.8, color='r', linestyle='--', label='Fuzzy find threshold (0.8)')
+ ax2.axvline(x=max_ratio, color='g', linestyle='--', label=f'Maximum ratio: {max_ratio:.3f}')
+ ax2.set_xlabel('Similarity Ratio')
+ ax2.set_ylabel('Frequency')
+ ax2.set_title('Distribution of Similarity Ratios')
+ ax2.legend()
+ ax2.grid(True, alpha=0.3)
+
+ plt.tight_layout()
+ return fig, max_ratio, max_pos
+
+ def compare_texts(original_chunk, found_text, max_pos):
+ """
+ Compare the original chunk text with the text found by fuzzy_find.
+ Shows character-by-character differences and similarity analysis.
+ """
+ print("\n" + "="*80)
+ print("TEXT COMPARISON: Original Chunk vs Fuzzy Find Result")
+ print("="*80)
+
+ print(f"\nOriginal chunk length: {len(original_chunk)} characters")
+ print(f"Found text length: {len(found_text)} characters")
+ print(f"Found at position: {max_pos}")
+
+ # Calculate overall similarity
+ similarity = SequenceMatcher(None, original_chunk.lower(), found_text.lower()).ratio()
+ print(f"Overall similarity: {similarity:.4f} ({similarity*100:.2f}%)")
+
+ # Show first 200 characters of each
+ print(f"\nOriginal chunk (first 200 chars):")
+ print(f"'{original_chunk[:200]}{'...' if len(original_chunk) > 200 else ''}'")
+
+ print(f"\nFound text (first 200 chars):")
+ print(f"'{found_text[:200]}{'...' if len(found_text) > 200 else ''}'")
+
+ # Character-by-character analysis for first 100 characters
+ print(f"\nCharacter-by-character comparison (first 100 chars):")
+ print("Original: ", end="")
+ for i, char in enumerate(original_chunk[:100]):
+ if i < len(found_text) and char.lower() == found_text[i].lower():
+ print(char, end="") # Same character
+ else:
+ print(f"[{char}]", end="") # Different character
+ print()
+
+ print("Found: ", end="")
+ for i, char in enumerate(found_text[:100]):
+ if i < len(original_chunk) and char.lower() == original_chunk[i].lower():
+ print(char, end="") # Same character
+ else:
+ print(f"[{char}]", end="") # Different character
+ print()
+
+ # Analyze differences
+ matcher = SequenceMatcher(None, original_chunk, found_text)
+ differences = []
+ for tag, i1, i2, j1, j2 in matcher.get_opcodes():
+ if tag != 'equal':
+ differences.append({
+ 'type': tag,
+ 'original_pos': (i1, i2),
+ 'found_pos': (j1, j2),
+ 'original_text': original_chunk[i1:i2],
+ 'found_text': found_text[j1:j2]
+ })
+
+ print(f"\nFound {len(differences)} differences:")
+ for i, diff in enumerate(differences[:10]): # Show first 10 differences
+ print(f"{i+1}. {diff['type'].upper()} at original[{diff['original_pos'][0]}:{diff['original_pos'][1]}] -> found[{diff['found_pos'][0]}:{diff['found_pos'][1]}]")
+ if diff['original_text']:
+ print(f" Original: '{diff['original_text'][:50]}{'...' if len(diff['original_text']) > 50 else ''}'")
+ if diff['found_text']:
+ print(f" Found: '{diff['found_text'][:50]}{'...' if len(diff['found_text']) > 50 else ''}'")
+
+ if len(differences) > 10:
+ print(f" ... and {len(differences) - 10} more differences")
+
+ return similarity, differences
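
`compare_texts` classifies differences with `SequenceMatcher.get_opcodes()`, which yields `(tag, i1, i2, j1, j2)` tuples covering both strings end to end; the loop above keeps only the non-`'equal'` entries. A minimal illustration of the output it iterates over:

```python
from difflib import SequenceMatcher

# One substitution in the middle: "_" vs "-" between two equal runs.
ops = SequenceMatcher(None, "theta_t", "theta-t").get_opcodes()
# → [('equal', 0, 5, 0, 5), ('replace', 5, 6, 5, 6), ('equal', 6, 7, 6, 7)]
```

Tags can be `'equal'`, `'replace'`, `'delete'`, or `'insert'`, so the `differences` list captures every way the found window deviates from the chunk.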
+
+ def run_fuzzy_analysis():
+ """
+ Main function to run the fuzzy find analysis.
+ You can modify the markdown_text and chunk_text variables below.
+ """
+
+ # TODO: Replace these with your actual markdown content and chunk
+ markdown_text = """# An improved method for mobile characterisation of $\\delta^{13} \\mathrm{CH}_{4}$ source signatures and its application in Germany \n\nAntje Hoheisel ${ }^{1}$, Christiane Yeman ${ }^{1, a}$, Florian Dinger ${ }^{1,2}$, Henrik Eckhardt ${ }^{1}$, and Martina Schmidt ${ }^{1}$<br>${ }^{1}$ Institute of Environmental Physics, Heidelberg University, Heidelberg, Germany<br>${ }^{2}$ Max Planck Institute for Chemistry, Mainz, Germany<br>${ }^{a}$ now at: Laboratory of Ion Beam Physics, ETH Zurich, Zurich, Switzerland\n\nCorrespondence: Antje Hoheisel (antje.hoheisel@iup.uni-heidelberg.de)\nReceived: 7 August 2018 - Discussion started: 1 October 2018\nRevised: 17 January 2019 - Accepted: 28 January 2019 - Published: 22 February 2019\n\n\n#### Abstract\n\nThe carbon isotopic signature $\\left(\\delta^{13} \\mathrm{CH}_{4}\\right)$ of several methane sources in Germany (around Heidelberg and in North Rhine-Westphalia) were characterised. Mobile measurements of the plume of $\\mathrm{CH}_{4}$ sources are carried out using an analyser based on cavity ring-down spectroscopy (CRDS). To achieve precise results a CRDS analyser, which measures methane $\\left(\\mathrm{CH}_{4}\\right)$, carbon dioxide $\\left(\\mathrm{CO}_{2}\\right)$ and their ${ }^{13} \\mathrm{C}$-to- ${ }^{12} \\mathrm{C}$ ratios, was characterised especially with regard to cross sensitivities of composition differences of the gas matrix in air samples or calibration tanks. The two most important gases which affect $\\delta^{13} \\mathrm{CH}_{4}$ are water vapour $\\left(\\mathrm{H}_{2} \\mathrm{O}\\right)$ and ethane $\\left(\\mathrm{C}_{2} \\mathrm{H}_{6}\\right)$. To avoid the cross sensitivity with $\\mathrm{H}_{2} \\mathrm{O}$, the air is dried with a Nafion dryer during mobile measurements. $\\mathrm{C}_{2} \\mathrm{H}_{6}$ is typically abundant in natural gases and thus in methane plumes or samples originating from natural gas. 
$\\mathrm{A}_{2} \\mathrm{H}_{6}$ correction and calibration are essential to obtain accurate $\\delta^{13} \\mathrm{CH}_{4}$ results, which can deviate by up to $3 \\%$ depending on whether a $\\mathrm{C}_{2} \\mathrm{H}_{6}$ correction is applied.\n\nThe isotopic signature is determined with the Miller-Tans approach and the York fitting method. During 21 field campaigns the mean $\\delta^{13} \\mathrm{CH}_{4}$ signatures of three dairy farms $\\left(-63.9 \\pm 0.9 \\%_{e}\\right)$, a biogas plant $\\left(-62.4 \\pm 1.2 \\%_{e}\\right)$, a landfill $\\left(-58.7 \\pm 3.3 \\%_{e}\\right)$, a wastewater treatment plant $(-52.5 \\pm$ $1.4 \\%$ ), an active deep coal mine ( $-56.0 \\pm 2.3 \\%$ ) and two natural gas storage and gas compressor stations ( $-46.1 \\pm$ $0.8 \\%$ ) were recorded.\n\nIn addition, between December 2016 and November 2018 gas samples from the Heidelberg natural gas distribution network were measured with a mean $\\delta^{13} \\mathrm{CH}_{4}$ value of $-43.3 \\pm$ $0.8 \\%$. Contrary to previous measurements between 1991\n\n\n#### Abstract\n\nand 1996 by Levin et al. (1999), no strong seasonal cycle is shown.\n\n\n## 1 Introduction\n\nMethane $\\left(\\mathrm{CH}_{4}\\right)$ is the second most important anthropogenic greenhouse gas. The atmospheric growth rate of $\\mathrm{CH}_{4}$ has changed significantly during the last decades, stabilising at zero growth from 1999 to 2006 before beginning to increase again after 2007 (Dlugokencky et al., 2009). Several studies have focused on the recent $\\mathrm{CH}_{4}$ growth caused by changes in sources and sinks (Rigby et al., 2017; Turner et al., 2017).\n\nRecent studies by Schaefer et al. (2016), Rice et al. (2016) and Nisbet et al. (2016) have shown how the $\\delta^{13} \\mathrm{CH}_{4}$ measurements can help to understand the changes in global $\\mathrm{CH}_{4}$ increase rates and to assign the related source types. 
The stable carbon isotope ratio $\\left({ }^{13} \\mathrm{C} /{ }^{12} \\mathrm{C}\\right)$ of $\\mathrm{CH}_{4}$ sources varies due to the initial source material and the fractionation during production and release to the atmosphere. The source categories can be classified as pyrogenic (e.g. biomass burning), biogenic (e.g. wetlands and livestock) or thermogenic (e.g. a subcategory of fossil fuel extraction), which show different but also overlapping isotope ratio ranges. Various studies have shown that the assignment of isotopic signatures from different $\\mathrm{CH}_{4}$ sources remains uncertain due to large temporal variabilities and also regional specificities (e.g. Sherwood et al., 2017). This missing knowledge may result in large uncertainties when the $\\mathrm{CH}_{4}$ budget is determined on global or regional scales using isotope-based estimates. In addition to global studies, the use of $\\delta^{13} \\mathrm{CH}_{4}$ was already successfully"""
+
+ chunk_text = """## 1 Introduction\nMethane ($\mathrm{CH}_{4}$) is the second most important anthropogenic greenhouse gas. The atmospheric growth rate of $\mathrm{CH}_{4}$ has changed significantly during the last decades, stabilising at zero growth from 1999 to 2006 before beginning to increase again after 2007 (Dlugokencky et al., 2009). Several studies have focused on the recent $\mathrm{CH}_{4}$ growth caused by changes in sources and sinks (Rigby et al., 2017; Turner et al., 2017).\n\nRecent studies by Schaefer et al. (2016), Rice et al. (2016) and Nisbet et al. (2016) have shown how the $\delta^{13} \mathrm{CH}_{4}$ measurements can help to understand the changes in global $\mathrm{CH}_{4}$ increase rates and to assign the related source types. The stable carbon isotope ratio (${}^{13}\mathrm{C}$/${}^{12}\mathrm{C}$) of $\mathrm{CH}_{4}$ sources varies due to the initial source material and the fractionation during production and release to the atmosphere. The source categories can be classified as pyrogenic (e.g. biomass burning), biogenic (e.g. wetlands and livestock) or thermogenic (e.g. a subcategory of fossil fuel extraction), which show different but also overlapping isotope ratio ranges. Various studies have shown that the assignment of isotopic signatures from different $\mathrm{CH}_{4}$ sources remains uncertain due to large temporal variabilities and also regional specificities (e.g. Sherwood et al., 2017). This missing knowledge may result in large uncertainties when the $\mathrm{CH}_{4}$ budget is determined on global or regional scales using isotope-based estimates. In addition to global studies, the use of $\delta^{13}\mathrm{CH}_{4}$ was already successfully"""
+
+ print("Analyzing fuzzy matching ratios...")
155
+ print(f"Markdown text length: {len(markdown_text)} characters")
156
+ print(f"Chunk text length: {len(chunk_text)} characters")
157
+
158
+ # Run the analysis
159
+ positions, ratios = analyze_fuzzy_ratios(markdown_text, chunk_text)
160
+
161
+ # Create the plot
162
+ fig, max_ratio, max_pos = plot_ratio_distribution(positions, ratios, chunk_text)
163
+
164
+ # Print statistics
165
+ print(f"\nStatistics:")
166
+ print(f"Maximum similarity ratio: {max_ratio:.3f}")
167
+ print(f"Maximum ratio position: {max_pos}")
168
+ print(f"Number of positions above 0.8 threshold: {sum(1 for r in ratios if r > 0.8)}")
169
+ print(f"Mean ratio: {np.mean(ratios):.3f}")
170
+ print(f"Standard deviation: {np.std(ratios):.3f}")
171
+
172
+ # Test the original fuzzy_find function
173
+ result = fuzzy_find(markdown_text, chunk_text)
174
+ print(f"\nOriginal fuzzy_find result: {result}")
175
+ if result is not None:
176
+ print(f"Found match at position {result}")
177
+ else:
178
+ print("No match found above 0.8 threshold")
179
+
180
+ # Compare the found text with the original chunk
181
+ if max_ratio > 0: # If we found any match
182
+ found_text = markdown_text[max_pos:max_pos + len(chunk_text)]
183
+ text_similarity, differences = compare_texts(chunk_text, found_text, max_pos)
184
+ print(f"\nDetailed comparison similarity: {text_similarity:.4f}")
185
+ print(f"Number of character differences: {len(differences)}")
186
+
187
+ plt.show()
188
+ return positions, ratios, max_ratio, max_pos
189
+
190
+ if __name__ == "__main__":
191
+ # Run the analysis
192
+ positions, ratios, max_ratio, max_pos = run_fuzzy_analysis()
193
+
194
+ #%%