echoboi committed
Commit df06239 · verified · 1 Parent(s): c99d04d

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,30 @@
+ # Use lightweight Python base
+ FROM python:3.10-slim
+
+ # Prevent Python from writing .pyc files and buffering stdout/stderr
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1
+
+ # Create app directory
+ WORKDIR /app
+
+ # System deps for pandas/openpyxl and builds
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     gcc \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first to leverage Docker cache
+ COPY requirements.txt ./
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the source code
+ COPY . .
+
+ # Expose the port the Space will provide via $PORT
+ ENV PORT=7860
+
+ # Use gunicorn to serve the Flask app
+ # Hugging Face Spaces expects the container to listen on 0.0.0.0:$PORT
+ CMD exec gunicorn --bind 0.0.0.0:$PORT --workers 2 --timeout 180 paper_analysis_backend:app
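
The CMD/PORT contract above can be smoke-tested locally before pushing to the Space, e.g. `docker build -t lit-review .` followed by `docker run -e PORT=7860 -p 7860:7860 lit-review` (the image tag is illustrative, not part of this commit).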
README.md CHANGED
@@ -1,13 +1,43 @@
- ---
- title: Ai Systematic Lit Review
- emoji: 🏃
- colorFrom: gray
- colorTo: blue
- sdk: gradio
- sdk_version: 5.45.0
- app_file: app.py
- pinned: false
- short_description: ai_scientist
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: AI Systematic Literature Review
+ emoji: 🧪
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.0.0
+ pinned: false
+ license: mit
+ app_port: 7860
+ ---
+
+ # AI Systematic Literature Review
+
+ An intelligent tool for conducting systematic literature reviews using the OpenAlex API and AI-powered paper filtering.
+
+ ## Features
+
+ - **🔍 Search Papers**: Find academic papers by title using the OpenAlex API
+ - **📚 Collect Papers**: Gather related papers (cited, citing, and related) from a seed paper
+ - **🔬 Filter Papers**: Use AI to filter collected papers based on your research question
+ - **📁 Database Management**: View and manage your collections and filters
+ - **📊 Export Data**: Export results to BibTeX format
+
+ ## How to Use
+
+ 1. **Search Papers**: Enter a paper title to find papers in OpenAlex
+ 2. **Collect Papers**: Use a Work ID to collect related papers (cited, citing, and related)
+ 3. **Filter Papers**: Use AI to filter collected papers based on your research question
+ 4. **Database Files**: View all your collections and filters
+ 5. **Export Data**: Export your results to BibTeX format
+
+ ## Setup
+
+ To use AI filtering, set your OpenAI API key as a secret in the Space settings (see the sketch after this README).
+
+ ## Technical Details
+
+ - Built with Gradio for the user interface
+ - Uses the OpenAlex API for paper discovery and collection
+ - Integrates with the OpenAI API for intelligent paper filtering
+ - Automatically saves collections and filters for reuse
+ - Respects OpenAlex rate limits
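
As context for the Setup note above: Spaces expose Settings → Secrets as environment variables, and app.py below reads the key exactly like this minimal sketch:

```python
import os

# Hugging Face Spaces injects Settings -> Secrets as environment variables.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "").strip()
if not OPENAI_API_KEY:
    print("[WARN] OPENAI_API_KEY is not set. Set it in Space Settings -> Secrets.")
```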
app.py ADDED
@@ -0,0 +1,2106 @@
+ import gradio as gr
+ import requests
+ import json
+ import time
+ import pandas as pd
+ from typing import Dict, List, Optional
+ import pickle
+ import os
+ import sys
+ import threading
+ import tempfile
+ import shutil
+ from datetime import datetime
+ import timeit
+ from tqdm import tqdm
+
+ # Define 'toc' function once
+ def toc(start_time):
+     elapsed = timeit.default_timer() - start_time
+     print(elapsed)
+
+ # Record start time
+ start_time = timeit.default_timer()
+
+ # Helper function to get all pages
+ def get_all_pages(url, headers, upper_limit=None):
+     all_results = []
+     unique_ids = set()  # Track unique paper IDs
+     page = 1
+     processing_times = []  # Track time taken per paper
+
+     # Get first page to get total count
+     first_response = requests.get(f"{url}&page={page}", headers=headers)
+     if first_response.status_code != 200:
+         return []
+
+     data = first_response.json()
+     total_count = data.get('meta', {}).get('count', 0)
+     start_time = time.time()
+
+     # Add only unique papers from first page
+     for result in data.get('results', []):
+         if result.get('id') not in unique_ids:
+             unique_ids.add(result.get('id'))
+             all_results.append(result)
+             if upper_limit and len(all_results) >= upper_limit:
+                 return all_results
+
+     papers_processed = len(all_results)
+     time_taken = time.time() - start_time
+     if papers_processed > 0:
+         processing_times.append(time_taken / papers_processed)
+
+     # Continue getting remaining pages until we have all papers
+     target_count = min(total_count, upper_limit) if upper_limit else total_count
+     pbar = tqdm(total=target_count, desc="Retrieving papers",
+                 initial=len(all_results), unit="papers")
+
+     while len(all_results) < total_count:
+         page += 1
+         page_start_time = time.time()
+         paged_url = f"{url}&page={page}"
+         response = requests.get(paged_url, headers=headers)
+         if response.status_code != 200:
+             print(f"Error retrieving page {page}: {response.status_code}")
+             break
+
+         data = response.json()
+         results = data.get('results', [])
+         if not results:
+             break
+
+         # Add only unique papers from this page
+         new_papers = 0
+         for result in results:
+             if result.get('id') not in unique_ids:
+                 unique_ids.add(result.get('id'))
+                 all_results.append(result)
+                 new_papers += 1
+                 if upper_limit and len(all_results) >= upper_limit:
+                     pbar.update(new_papers)
+                     pbar.close()
+                     return all_results
+
+         # Update processing times and estimated time remaining
+         if new_papers > 0:
+             time_taken = time.time() - page_start_time
+             processing_times.append(time_taken / new_papers)
+             avg_time_per_paper = sum(processing_times) / len(processing_times)
+             papers_remaining = target_count - len(all_results)
+             est_time_remaining = papers_remaining * avg_time_per_paper
+             pbar.set_postfix({'Est. Time Remaining': f'{est_time_remaining:.1f}s'})
+
+         pbar.update(new_papers)
+         # Add a small delay to respect rate limits
+         time.sleep(1)
+
+     pbar.close()
+     return all_results
+
+
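A note on the paging strategy above: OpenAlex also supports cursor-based paging (`cursor=*`, then follow `meta.next_cursor`), which avoids the deep-pagination limits of `page=N`. A minimal sketch under the same `url`/`headers` conventions as `get_all_pages` (not part of this commit):

```python
def get_all_pages_cursor(url, headers):
    # Cursor paging: start with cursor=* and follow meta.next_cursor until exhausted.
    results, cursor = [], "*"
    while cursor:
        resp = requests.get(f"{url}&cursor={cursor}", headers=headers)
        if resp.status_code != 200:
            break
        data = resp.json()
        results.extend(data.get("results", []))
        cursor = data.get("meta", {}).get("next_cursor")  # None/absent on the last page
        time.sleep(1)  # same politeness delay as the page-based helper
    return results
```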
+ def get_related_papers(work_id, upper_limit=None, progress_callback=None):
+     # Define base URL for OpenAlex API
+     base_url = "https://api.openalex.org/works"
+
+     work_query = f"/{work_id}"  # OpenAlex work IDs can be used directly in path
+     work_url = base_url + work_query
+
+     # Add email to be a polite API user
+     headers = {'User-Agent': 'LowAI (chowdhary@iiasa.ac.at)'}
+     response = requests.get(work_url, headers=headers)
+     print(response)
+     if response.status_code == 200:
+         paper = response.json()  # For direct work queries, the response is the paper object
+         paper_id = paper['id']
+
+         # Use referenced_works field on the seed work directly for cited papers
+         referenced_ids = paper.get('referenced_works', []) or []
+         print("\nTotal counts:")
+         print(f"Cited (referenced_works) count: {len(referenced_ids)}")
+
+         def fetch_works_by_ids(ids, chunk_size=50):
+             results = []
+             seen = set()
+             total_chunks = (len(ids) + chunk_size - 1) // chunk_size
+
+             for i in range(0, len(ids), chunk_size):
+                 chunk = ids[i:i+chunk_size]
+                 # Build ids filter: ids.openalex:ID1|ID2|ID3
+                 ids_filter = '|'.join(chunk)
+                 url = f"{base_url}?filter=ids.openalex:{ids_filter}&per-page=200"
+                 resp = requests.get(url, headers=headers)
+                 if resp.status_code != 200:
+                     print(f"Error fetching IDs chunk {i//chunk_size+1}: {resp.status_code}")
+                     continue
+                 data = resp.json()
+                 for r in data.get('results', []):
+                     rid = r.get('id')
+                     if rid and rid not in seen:
+                         seen.add(rid)
+                         results.append(r)
+
+                 # Update progress for cited papers (0-30%)
+                 if progress_callback:
+                     progress = int(30 * (i // chunk_size + 1) / total_chunks)
+                     progress_callback(progress, f"Fetching cited papers... {len(results)} found")
+
+                 time.sleep(1)  # be polite to the API
+                 if upper_limit and len(results) >= upper_limit:
+                     return results[:upper_limit]
+             return results
+
+         print("\nRetrieving cited papers via referenced_works IDs...")
+         cited_papers = fetch_works_by_ids(referenced_ids)
+         print(f"Found {len(cited_papers)} unique cited papers")
+
+         # Count citing papers (works that cite the seed), then paginate to collect all
+         citing_count_url = f"{base_url}?filter=cites:{work_id}&per-page=1"
+         citing_count = requests.get(citing_count_url, headers=headers).json().get('meta', {}).get('count', 0)
+         print(f"Citing papers: {citing_count}")
+
+         # Get all citing papers with pagination
+         print("\nRetrieving citing papers (paginated)...")
+         page = 1
+         citing_papers = []
+         unique_ids = set()
+         target = citing_count if not upper_limit else min(upper_limit, citing_count)
+         pbar = tqdm(total=target, desc="Retrieving citing papers", unit="papers")
+         while len(citing_papers) < target:
+             paged_url = f"{base_url}?filter=cites:{work_id}&per-page=200&sort=publication_date:desc&page={page}"
+             resp = requests.get(paged_url, headers=headers)
+             if resp.status_code != 200:
+                 print(f"Error retrieving citing page {page}: {resp.status_code}")
+                 break
+             data = resp.json()
+             results = data.get('results', [])
+             if not results:
+                 break
+             new = 0
+             for r in results:
+                 rid = r.get('id')
+                 if rid and rid not in unique_ids:
+                     unique_ids.add(rid)
+                     citing_papers.append(r)
+                     new += 1
+                     if len(citing_papers) >= target:
+                         break
+
+             # Update progress for citing papers (30-70%)
+             if progress_callback:
+                 progress = 30 + int(40 * len(citing_papers) / target)
+                 progress_callback(progress, f"Fetching citing papers... {len(citing_papers)} found")
+
+             pbar.update(new)
+             page += 1
+             time.sleep(1)
+         pbar.close()
+         print(f"Found {len(citing_papers)} unique citing papers")
+
+         # Get all related papers
+         print("\nRetrieving related papers...")
+         related_url = f"{base_url}?filter=related_to:{work_id}&per-page=200&sort=publication_date:desc"
+         related_papers = get_all_pages(related_url, headers, upper_limit)
+         print(f"Found {len(related_papers)} unique related papers")
+
+         # Update progress for related papers (70-90%)
+         if progress_callback:
+             progress_callback(70, f"Fetching related papers... {len(related_papers)} found")
+
+         # Create sets of IDs for quick lookup
+         cited_ids = {paper['id'] for paper in cited_papers}
+         citing_ids = {paper['id'] for paper in citing_papers}
+
+         # Print some debug information
+         print("\nDebug Information:")
+         print(f"Seed paper ID: {paper_id}")
+         print(f"Number of unique cited papers: {len(cited_ids)}")
+         print(f"Number of unique citing papers: {len(citing_ids)}")
+         print(f"Number of papers in both sets: {len(cited_ids.intersection(citing_ids))}")
+
+         # Update progress for processing (90-95%)
+         if progress_callback:
+             progress_callback(90, "Processing and deduplicating papers...")
+
+         # Combine all papers and remove duplicates while tracking relationship
+         all_papers = cited_papers + citing_papers + related_papers
+         seen_titles = set()
+         unique_papers = []
+         for paper in all_papers:
+             title = paper.get('title', '')
+             if title not in seen_titles:
+                 seen_titles.add(title)
+                 # Add relationship type
+                 if paper['id'] in cited_ids:
+                     paper['relationship'] = 'cited'
+                 elif paper['id'] in citing_ids:
+                     paper['relationship'] = 'citing'
+                 else:
+                     paper['relationship'] = 'related'
+                 unique_papers.append(paper)
+
+         # Final progress update
+         if progress_callback:
+             progress_callback(100, f"Collection completed! Found {len(unique_papers)} unique papers")
+
+         return unique_papers
+     else:
+         print(f"Error retrieving seed paper: {response.status_code}")
+         return []
+ import requests
+ import json
+ from typing import Dict, List, Optional
+ from openai import OpenAI
+ import concurrent.futures
+ import threading
+ import time
+
+ def analyze_paper_relevance(content: Dict[str, str], research_question: str, api_key: str) -> Optional[Dict]:
+     """Analyze if a paper is relevant to the research question using GPT-5 nano (falling back to gpt-4o-mini)."""
+     client = OpenAI(api_key=api_key)
+
+     title = content.get('title', '')
+     abstract = content.get('abstract', '')
+     has_abstract = bool(abstract and abstract.strip())
+
+     if has_abstract:
+         prompt = f"""
+ Research Question: {research_question}
+
+ Paper Title: {title}
+ Paper Abstract: {abstract}
+
+ Analyze this paper and determine:
+ 1. Is this paper highly relevant to answering the research question?
+ 2. What are the main aims/objectives of this paper?
+ 3. What are the key takeaways or findings?
+
+ Return ONLY a valid JSON object in this exact format:
+ {{
+ "relevant": true/false,
+ "relevance_reason": "brief explanation of why it is/isn't relevant",
+ "aims_of_paper": "main objectives of the paper",
+ "key_takeaways": "key findings or takeaways"
+ }}
+ """
+     else:
+         prompt = f"""
+ Research Question: {research_question}
+
+ Paper Title: {title}
+ Note: No abstract is available for this paper.
+
+ Analyze this paper based on the title only and determine:
+ 1. Is this paper likely to be relevant to answering the research question based on the title?
+
+ Return ONLY a valid JSON object in this exact format:
+ {{
+ "relevant": true/false,
+ "relevance_reason": "brief explanation of why it is/isn't relevant based on title"
+ }}
+ """
+
+     try:
+         # Try GPT-5 nano first, fall back to gpt-4o-mini if it fails
+         try:
+             response = client.responses.create(
+                 model="gpt-5-nano",
+                 input=prompt,
+                 reasoning={"effort": "minimal"},
+                 text={"verbosity": "low"}
+             )
+         except Exception as e:
+             print(f"GPT-5 nano failed, trying gpt-4o-mini: {e}")
+             response = client.chat.completions.create(
+                 model="gpt-4o-mini",
+                 messages=[{
+                     "role": "user",
+                     "content": prompt
+                 }],
+                 max_completion_tokens=1000
+             )
+
+         # Handle different response formats
+         if hasattr(response, 'choices') and response.choices:
+             # Old format (chat completions)
+             result = response.choices[0].message.content
+         elif hasattr(response, 'output'):
+             # New format (responses) - extract text from output
+             result = ""
+             for item in response.output:
+                 if hasattr(item, "content") and item.content:
+                     for content in item.content:
+                         if hasattr(content, "text") and content.text:
+                             result += content.text
+         else:
+             print("Unexpected response format")
+             return None
+
+         if not result:
+             print("Empty response from GPT")
+             return None
+
+         # Clean and parse the JSON response
+         result = result.strip()
+         if result.startswith("```json"):
+             result = result[7:]
+         if result.endswith("```"):
+             result = result[:-3]
+
+         # Try to parse JSON
+         try:
+             return json.loads(result.strip())
+         except json.JSONDecodeError as e:
+             print(f"Failed to parse JSON response: {e}")
+             print(f"Raw response: {result[:200]}...")
+             return None
+
+     except Exception as e:
+         print(f"Error in GPT analysis: {str(e)}")
+         return None
+
+ def extract_abstract_from_inverted_index(inverted_index: Dict) -> str:
+     """Extract abstract text from inverted index format."""
+     if not inverted_index:
+         return ""
+
+     words = []
+     for word, positions in inverted_index.items():
+         for pos in positions:
+             while len(words) <= pos:
+                 words.append('')
+             words[pos] = word
+     return ' '.join(words).strip()
+
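OpenAlex returns abstracts as a word-to-positions map rather than plain text; a tiny illustration of the reconstruction performed above (values hypothetical):

```python
idx = {"Deep": [0], "learning": [1], "works": [2]}
print(extract_abstract_from_inverted_index(idx))  # -> "Deep learning works"
```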
+ def analyze_single_paper(paper: Dict, research_question: str, api_key: str) -> Optional[Dict]:
+     """Analyze a single paper with its own client."""
+     try:
+         client = OpenAI(api_key=api_key)
+
+         # Extract title and abstract
+         title = paper.get('title', '')
+         abstract = extract_abstract_from_inverted_index(paper.get('abstract_inverted_index', {}))
+
+         if not title and not abstract:
+             return None
+
+         # Create content for analysis
+         content = {
+             'title': title,
+             'abstract': abstract
+         }
+
+         # Analyze with GPT
+         analysis = analyze_paper_relevance_with_client(content, research_question, client)
+         if analysis:
+             paper['gpt_analysis'] = analysis
+             paper['relevance_reason'] = analysis.get('relevance_reason', 'Analysis completed')
+             paper['relevance_score'] = analysis.get('relevant', False)
+             return paper
+
+         return None
+
+     except Exception as e:
+         print(f"Error analyzing paper: {e}")
+         return None
+
+ def analyze_paper_batch(papers_batch: List[Dict], research_question: str, api_key: str, batch_id: int) -> List[Dict]:
+     """Analyze a batch of papers in parallel using ThreadPoolExecutor."""
+     results = []
+
+     # Use ThreadPoolExecutor to process papers in parallel within the batch
+     # (max(1, ...) guards against an empty batch, which ThreadPoolExecutor rejects)
+     with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(papers_batch))) as executor:
+         # Submit all papers for parallel processing
+         future_to_paper = {
+             executor.submit(analyze_single_paper, paper, research_question, api_key): paper
+             for paper in papers_batch
+         }
+
+         # Collect results as they complete
+         for future in concurrent.futures.as_completed(future_to_paper):
+             try:
+                 result = future.result()
+                 if result:
+                     results.append(result)
+             except Exception as e:
+                 print(f"Error in parallel analysis: {e}")
+                 continue
+
+     return results
+
+ def analyze_paper_relevance_with_client(content: Dict[str, str], research_question: str, client: OpenAI) -> Optional[Dict]:
+     """Analyze if a paper is relevant to the research question using the provided client."""
+     title = content.get('title', '')
+     abstract = content.get('abstract', '')
+
+     prompt = f"""
+ Research Question: {research_question}
+
+ Paper Title: {title}
+ Paper Abstract: {abstract or 'No abstract available'}
+
+ Analyze this paper and determine:
+ 1. Is this paper highly relevant to answering the research question?
+ 2. What are the main aims/objectives of this paper?
+ 3. What are the key takeaways or findings?
+
+ Return ONLY a valid JSON object in this exact format:
+ {{
+ "relevant": true/false,
+ "relevance_reason": "brief explanation of why it is/isn't relevant",
+ "aims_of_paper": "main objectives of the paper",
+ "key_takeaways": "key findings or takeaways"
+ }}
+ """
+
+     try:
+         # Try GPT-5 nano first, fall back to gpt-4o-mini if it fails
+         try:
+             response = client.responses.create(
+                 model="gpt-5-nano",
+                 input=prompt,
+                 reasoning={"effort": "minimal"},
+                 text={"verbosity": "low"}
+             )
+         except Exception:
+             response = client.chat.completions.create(
+                 model="gpt-4o-mini",
+                 messages=[{
+                     "role": "user",
+                     "content": prompt
+                 }],
+                 max_completion_tokens=1000
+             )
+
+         # Handle different response formats
+         if hasattr(response, 'choices') and response.choices:
+             # Old format (chat completions)
+             result = response.choices[0].message.content
+         elif hasattr(response, 'output'):
+             # New format (responses) - extract text from output
+             result = ""
+             for item in response.output:
+                 if hasattr(item, "content") and item.content:
+                     for content in item.content:
+                         if hasattr(content, "text") and content.text:
+                             result += content.text
+         else:
+             return None
+
+         if not result:
+             return None
+
+         # Clean and parse the JSON response
+         result = result.strip()
+         if result.startswith("```json"):
+             result = result[7:]
+         if result.endswith("```"):
+             result = result[:-3]
+
+         # Try to parse JSON
+         try:
+             return json.loads(result.strip())
+         except json.JSONDecodeError:
+             return None
+
+     except Exception:
+         return None
+
+ def filter_papers_for_research_question(papers: List[Dict], research_question: str, api_key: str, limit: int = 10) -> List[Dict]:
+     """Analyze exactly 'limit' papers for relevance using parallel processing."""
+     if not papers or not research_question:
+         return []
+
+     # Sort papers by publication date (most recent first)
+     sorted_papers = sorted(papers, key=lambda x: x.get('publication_date', ''), reverse=True)
+
+     # Take only the first 'limit' papers for analysis
+     papers_to_analyze = sorted_papers[:limit]
+
+     print(f"Analyzing {len(papers_to_analyze)} papers for relevance to: {research_question}")
+
+     # Process all papers in parallel (no batching needed for small numbers)
+     all_results = []
+
+     with concurrent.futures.ThreadPoolExecutor(max_workers=min(limit, 20)) as executor:
+         # Submit all papers for parallel processing
+         future_to_paper = {
+             executor.submit(analyze_single_paper, paper, research_question, api_key): paper
+             for paper in papers_to_analyze
+         }
+
+         # Collect results as they complete
+         completed = 0
+         for future in concurrent.futures.as_completed(future_to_paper):
+             try:
+                 result = future.result()
+                 completed += 1
+                 if result:
+                     all_results.append(result)
+                 print(f"Completed {completed}/{len(papers_to_analyze)} papers")
+             except Exception as e:
+                 print(f"Error in parallel analysis: {e}")
+                 completed += 1
+
+     # Sort by publication date again (most recent first)
+     all_results.sort(key=lambda x: x.get('publication_date', ''), reverse=True)
+
+     print(f"Analysis complete. Processed {len(all_results)} papers.")
+     return all_results
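
A hypothetical end-to-end use of the two stages defined so far (the work ID and question are placeholders; `OPENAI_API_KEY` is read from the environment further down in app.py):

```python
papers = get_related_papers("W2741809807", upper_limit=200)
relevant = filter_papers_for_research_question(
    papers, "How does irrigation affect crop yields?", OPENAI_API_KEY, limit=10)
```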
+ import requests
+ import re
+ import html
+
+ # Try to import BeautifulSoup, fall back to simple parsing if not available
+ try:
+     from bs4 import BeautifulSoup
+     HAS_BS4 = True
+ except ImportError:
+     HAS_BS4 = False
+     print("BeautifulSoup not available, using simple HTML parsing")
+
+ # Global progress tracking
+ progress_data = {}
+
+ # Configuration: read from environment (set in HF Space Secrets)
+ OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "").strip()
+ if not OPENAI_API_KEY:
+     print("[WARN] OPENAI_API_KEY is not set. Set it in Space Settings → Secrets.")
+
+ # Determine script directory and robust project root
+ SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+ ROOT_DIR = os.path.dirname(SCRIPT_DIR) if os.path.basename(SCRIPT_DIR) == "code" else SCRIPT_DIR
+
+ # Ensure we can import helper modules (prefer repo root; fallback to ./code)
+ CODE_DIR_CANDIDATE = os.path.join(ROOT_DIR, "code")
+ CODE_DIR = CODE_DIR_CANDIDATE if os.path.isdir(CODE_DIR_CANDIDATE) else ROOT_DIR
+ if CODE_DIR not in sys.path:
+     sys.path.insert(0, CODE_DIR)
+
+ # Database directories: prefer repo-root `database/` when present; fallback to CODE_DIR/database
+ DATABASE_DIR_ROOT = os.path.join(ROOT_DIR, "database")
+ DATABASE_DIR = DATABASE_DIR_ROOT if os.path.isdir(DATABASE_DIR_ROOT) else os.path.join(CODE_DIR, "database")
+ COLLECTION_DB_DIR = os.path.join(DATABASE_DIR, "collections")
+ FILTER_DB_DIR = os.path.join(DATABASE_DIR, "filters")
+
+ # Ensure database directories exist
+ os.makedirs(COLLECTION_DB_DIR, exist_ok=True)
+ os.makedirs(FILTER_DB_DIR, exist_ok=True)
+
+ def ensure_db_dirs() -> None:
+     """Ensure database directories exist (safe to call anytime)."""
+     try:
+         os.makedirs(COLLECTION_DB_DIR, exist_ok=True)
+         os.makedirs(FILTER_DB_DIR, exist_ok=True)
+     except Exception:
+         pass
+
+ # Robust HTTP headers for publisher sites
+ DEFAULT_HTTP_HEADERS = {
+     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36',
+     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+     'Accept-Language': 'en-US,en;q=0.9',
+     'Cache-Control': 'no-cache',
+ }
+
+ def _http_get(url: str, timeout: int = 15) -> Optional[requests.Response]:
+     try:
+         resp = requests.get(url, headers=DEFAULT_HTTP_HEADERS, timeout=timeout, allow_redirects=True)
+         return resp
+     except Exception as e:
+         print(f"HTTP GET failed for {url}: {e}")
+         return None
+
+ def fetch_abstract_from_doi(doi: str) -> Optional[str]:
+     """Fetch abstract/highlights from a DOI URL with a robust, layered strategy."""
+     if not doi:
+         return None
+     # Normalize DOI
+     doi_clean = doi.replace('https://doi.org/', '').strip()
+
+     # 1) Crossref (fast, sometimes JATS)
+     try:
+         text = fetch_from_crossref(doi_clean)
+         if text and len(text) > 50:
+             return text
+     except Exception as e:
+         print(f"Crossref fetch failed: {e}")
+
+     # 2) Fetch target HTML via doi.org redirect
+     try:
+         start_url = f"https://doi.org/{doi_clean}"
+         resp = _http_get(start_url, timeout=15)
+         if not resp or resp.status_code >= 400:
+             return None
+         html_text = resp.text or ''
+         final_url = getattr(resp, 'url', start_url)
+         print(f"Resolved DOI to: {final_url}")
+
+         # Parse with robust pipeline
+         parsed = robust_extract_abstract(html_text)
+         if parsed and len(parsed) > 50:
+             return parsed
+     except Exception as e:
+         print(f"DOI HTML fetch failed: {e}")
+
+     # 3) PubMed placeholder (extendable)
+     try:
+         text = fetch_from_pubmed(doi_clean)
+         if text and len(text) > 50:
+             return text
+     except Exception:
+         pass
+
+     return None
+
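Illustrative call (the DOI is a placeholder): `fetch_abstract_from_doi("10.1000/xyz123")` returns the abstract text when any layer succeeds, else `None`.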
+ def fetch_from_crossref(doi: str) -> Optional[str]:
+     """Fetch abstract from the Crossref API."""
+     try:
+         url = f"https://api.crossref.org/works/{doi}"
+         response = _http_get(url, timeout=12)
+         if response and response.status_code == 200:
+             data = response.json()
+             if 'message' in data:
+                 message = data['message']
+                 # Check for abstract or highlights (case insensitive)
+                 for key in message:
+                     if key.lower() in ['abstract', 'highlights'] and message[key]:
+                         raw = str(message[key])
+                         # Crossref sometimes returns JATS/XML; strip tags and unescape entities
+                         text = re.sub(r'<[^>]+>', ' ', raw)
+                         text = html.unescape(re.sub(r'\s+', ' ', text)).strip()
+                         return text
+     except Exception:
+         pass
+     return None
+
+ def fetch_from_doi_org(doi: str) -> Optional[str]:
+     """Legacy wrapper kept for API compatibility; now uses the robust pipeline."""
+     try:
+         url = f"https://doi.org/{doi}"
+         resp = _http_get(url, timeout=15)
+         if not resp or resp.status_code >= 400:
+             return None
+         return robust_extract_abstract(resp.text or '')
+     except Exception:
+         return None
+
+ def extract_from_preloaded_state_bruteforce(content: str) -> Optional[str]:
+     """Extract abstract from window.__PRELOADED_STATE__ using brace matching and fallbacks."""
+     try:
+         start_idx = content.find('window.__PRELOADED_STATE__')
+         if start_idx == -1:
+             return None
+         # Find the first '{' after the equals sign
+         eq_idx = content.find('=', start_idx)
+         if eq_idx == -1:
+             return None
+         brace_idx = content.find('{', eq_idx)
+         if brace_idx == -1:
+             return None
+         # Brace matching to find the matching closing '}'
+         depth = 0
+         end_idx = -1
+         for i in range(brace_idx, min(len(content), brace_idx + 5_000_000)):
+             ch = content[i]
+             if ch == '{':
+                 depth += 1
+             elif ch == '}':
+                 depth -= 1
+                 if depth == 0:
+                     end_idx = i
+                     break
+         if end_idx == -1:
+             return None
+         json_str = content[brace_idx:end_idx+1]
+         try:
+             data = json.loads(json_str)
+         except Exception:
+             # Try to relax by removing trailing commas and control chars
+             cleaned = re.sub(r',\s*([}\]])', r'\1', json_str)
+             cleaned = re.sub(r'\u0000', '', cleaned)
+             try:
+                 data = json.loads(cleaned)
+             except Exception as e2:
+                 print(f"Failed to parse preloaded JSON: {e2}")
+                 return None
+
+         # Same traversal as before
+         if isinstance(data, dict) and 'abstracts' in data and isinstance(data['abstracts'], dict) and 'content' in data['abstracts']:
+             abstracts = data['abstracts']['content']
+             if isinstance(abstracts, list):
+                 for abstract_item in abstracts:
+                     if isinstance(abstract_item, dict) and '$$' in abstract_item and abstract_item.get('#name') == 'abstract':
+                         class_name = abstract_item.get('$', {}).get('class', '')
+                         for section in abstract_item.get('$$', []):
+                             if isinstance(section, dict) and section.get('#name') == 'abstract-sec':
+                                 section_text = extract_text_from_abstract_section(section)
+                                 section_highlights = extract_highlights_from_section(section)
+                                 if section_text and len(section_text.strip()) > 50:
+                                     return clean_text(section_text)
+                                 if section_highlights and len(section_highlights.strip()) > 50:
+                                     return clean_text(section_highlights)
+                         if 'highlight' in class_name.lower():
+                             highlights_text = extract_highlights_from_abstract_item(abstract_item)
+                             if highlights_text and len(highlights_text.strip()) > 50:
+                                 return clean_text(highlights_text)
+         return None
+     except Exception as e:
+         print(f"Error extracting from preloaded state (bruteforce): {e}")
+         return None
+
+ def extract_from_json_ld(content: str) -> Optional[str]:
+     """Parse JSON-LD script tags and extract abstract/description if present."""
+     if not HAS_BS4:
+         return None
+     try:
+         soup = BeautifulSoup(content, 'html.parser')
+         for script in soup.find_all('script', type='application/ld+json'):
+             try:
+                 data = json.loads(script.string or '{}')
+             except Exception:
+                 continue
+             candidates = []
+             if isinstance(data, dict):
+                 candidates.append(data)
+             elif isinstance(data, list):
+                 candidates.extend([d for d in data if isinstance(d, dict)])
+             for obj in candidates:
+                 for key in ['abstract', 'description']:
+                     if key in obj and obj[key]:
+                         text = clean_text(str(obj[key]))
+                         if len(text) > 50:
+                             return text
+         return None
+     except Exception as e:
+         print(f"Error extracting from JSON-LD: {e}")
+         return None
+
+ def clean_text(s: str) -> str:
+     s = html.unescape(s)
+     s = re.sub(r'\s+', ' ', s)
+     return s.strip()
+
+ def extract_from_meta_tags(soup) -> Optional[str]:
+     try:
+         # Common meta carriers of abstract-like summaries
+         candidates = []
+         # OpenGraph description
+         og = soup.find('meta', attrs={'property': 'og:description'})
+         if og and og.get('content'):
+             candidates.append(og['content'])
+         # Twitter description
+         tw = soup.find('meta', attrs={'name': 'twitter:description'})
+         if tw and tw.get('content'):
+             candidates.append(tw['content'])
+         # Dublin Core description
+         dc = soup.find('meta', attrs={'name': 'dc.description'})
+         if dc and dc.get('content'):
+             candidates.append(dc['content'])
+         # citation_abstract
+         cit_abs = soup.find('meta', attrs={'name': 'citation_abstract'})
+         if cit_abs and cit_abs.get('content'):
+             candidates.append(cit_abs['content'])
+         # Fallback: any meta description
+         desc = soup.find('meta', attrs={'name': 'description'})
+         if desc and desc.get('content'):
+             candidates.append(desc['content'])
+
+         # Clean and return the longest meaningful candidate
+         candidates = [clean_text(c) for c in candidates if isinstance(c, str)]
+         candidates.sort(key=lambda x: len(x), reverse=True)
+         for text in candidates:
+             if len(text) > 50:
+                 return text
+         return None
+     except Exception:
+         return None
+
+ def robust_extract_abstract(html_text: str) -> Optional[str]:
+     """Layered extraction over raw HTML: preloaded-state, JSON-LD, meta tags, DOM, regex."""
+     if not html_text:
+         return None
+
+     # 1) ScienceDirect/Elsevier preloaded state (brace-matched)
+     try:
+         txt = extract_from_preloaded_state_bruteforce(html_text)
+         if txt and len(txt) > 50:
+             return clean_text(txt)
+     except Exception:
+         pass
+
+     # 2) JSON-LD
+     try:
+         txt = extract_from_json_ld(html_text)
+         if txt and len(txt) > 50:
+             return clean_text(txt)
+     except Exception:
+         pass
+
+     # 3) BeautifulSoup-based DOM extraction (meta + selectors + heading-sibling)
+     if HAS_BS4:
+         try:
+             soup = BeautifulSoup(html_text, 'html.parser')
+             # meta first
+             meta_txt = extract_from_meta_tags(soup)
+             if meta_txt and len(meta_txt) > 50:
+                 return clean_text(meta_txt)
+
+             # selector scan
+             selectors = [
+                 'div.abstract', 'div.Abstract', 'div.ABSTRACT',
+                 'div[class*="abstract" i]', 'div[class*="Abstract" i]',
+                 'section.abstract', 'section.Abstract', 'section.ABSTRACT',
+                 'div[data-testid="abstract" i]', 'div[data-testid="Abstract" i]',
+                 'div.article-abstract', 'div.article-Abstract',
+                 'div.abstract-content', 'div.Abstract-content',
+                 'div.highlights', 'div.Highlights', 'div.HIGHLIGHTS',
+                 'div[class*="highlights" i]', 'div[class*="Highlights" i]',
+                 'section.highlights', 'section.Highlights', 'section.HIGHLIGHTS',
+                 'div[data-testid="highlights" i]', 'div[data-testid="Highlights" i]'
+             ]
+             for css in selectors:
+                 node = soup.select_one(css)
+                 if node:
+                     t = clean_text(node.get_text(' ', strip=True))
+                     if len(t) > 50:
+                         return t
+
+             # headings near Abstract/Highlights
+             for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'strong', 'b']):
+                 try:
+                     title = (tag.get_text() or '').strip().lower()
+                     if 'abstract' in title or 'highlights' in title:
+                         blocks = []
+                         sib = tag
+                         steps = 0
+                         while sib and steps < 20:
+                             sib = sib.find_next_sibling()
+                             steps += 1
+                             if not sib:
+                                 break
+                             if sib.name in ['p', 'div', 'section', 'article', 'ul', 'ol']:
+                                 blocks.append(sib.get_text(' ', strip=True))
+                         joined = clean_text(' '.join(blocks))
+                         if len(joined) > 50:
+                             return joined
+                 except Exception:
+                     continue
+         except Exception:
+             pass
+
+     # 4) Regex fallback
+     try:
+         patterns = [
+             r'<div[^>]*class="[^\"]*(?:abstract|Abstract|ABSTRACT|highlights|Highlights|HIGHLIGHTS)[^\"]*"[^>]*>(.*?)</div>',
+             r'<section[^>]*class="[^\"]*(?:abstract|Abstract|ABSTRACT|highlights|Highlights|HIGHLIGHTS)[^\"]*"[^>]*>(.*?)</section>',
+             r'<div[^>]*data-testid="(?:abstract|Abstract|highlights|Highlights)"[^>]*>(.*?)</div>'
+         ]
+         for pat in patterns:
+             for m in re.findall(pat, html_text, re.DOTALL | re.IGNORECASE):
+                 t = clean_text(re.sub(r'<[^>]+>', ' ', m))
+                 if len(t) > 50:
+                     return t
+     except Exception:
+         pass
+
+     return None
+
+ def extract_text_from_abstract_section(section: dict) -> str:
+     """Extract text content from abstract section structure."""
+     try:
+         text_parts = []
+
+         if '$$' in section:
+             for item in section['$$']:
+                 if isinstance(item, dict):
+                     # Direct text content from simple-para
+                     if item.get('#name') == 'simple-para' and '_' in item:
+                         text_parts.append(item['_'])
+                     # Also check for para elements
+                     elif item.get('#name') == 'para' and '_' in item:
+                         text_parts.append(item['_'])
+                     # Recursively extract from nested structure
+                     elif '$$' in item:
+                         nested_text = extract_text_from_abstract_section(item)
+                         if nested_text:
+                             text_parts.append(nested_text)
+
+         return ' '.join(text_parts)
+
+     except Exception as e:
+         print(f"Error extracting text from abstract section: {e}")
+         return ""
+
+ def extract_highlights_from_section(section: dict) -> str:
+     """Extract highlights content from section structure."""
+     try:
+         text_parts = []
+
+         if '$$' in section:
+             for item in section['$$']:
+                 if isinstance(item, dict):
+                     # Look for section-title with "Highlights"
+                     if (item.get('#name') == 'section-title' and
+                             item.get('_') and 'highlight' in item['_'].lower()):
+                         # Found highlights section, extract list items
+                         highlights_text = extract_highlights_list(item, section)
+                         if highlights_text:
+                             text_parts.append(highlights_text)
+                     # Also look for direct list structures
+                     elif item.get('#name') == 'list':
+                         # Found list, extract list items directly
+                         highlights_text = extract_highlights_list(item, section)
+                         if highlights_text:
+                             text_parts.append(highlights_text)
+                     elif '$$' in item:
+                         # Recursively search for highlights
+                         nested_text = extract_highlights_from_section(item)
+                         if nested_text:
+                             text_parts.append(nested_text)
+
+         return ' '.join(text_parts)
+
+     except Exception as e:
+         print(f"Error extracting highlights from section: {e}")
+         return ""
+
+ def extract_highlights_list(title_item: dict, parent_section: dict) -> str:
+     """Extract highlights list items from the section structure."""
+     try:
+         highlights = []
+
+         # Look for the list structure after the highlights title
+         if '$$' in parent_section:
+             for item in parent_section['$$']:
+                 if isinstance(item, dict) and item.get('#name') == 'list':
+                     # Found list, extract list items
+                     if '$$' in item:
+                         for list_item in item['$$']:
+                             if isinstance(list_item, dict) and list_item.get('#name') == 'list-item':
+                                 # Extract text from list item
+                                 item_text = extract_text_from_abstract_section(list_item)
+                                 if item_text:
+                                     highlights.append(f"• {item_text}")
+
+         # Also check if the title_item itself contains a list (for direct list structures)
+         if '$$' in title_item:
+             for item in title_item['$$']:
+                 if isinstance(item, dict) and item.get('#name') == 'list':
+                     if '$$' in item:
+                         for list_item in item['$$']:
+                             if isinstance(list_item, dict) and list_item.get('#name') == 'list-item':
+                                 item_text = extract_text_from_abstract_section(list_item)
+                                 if item_text:
+                                     highlights.append(f"• {item_text}")
+
+         return ' '.join(highlights)
+
+     except Exception as e:
+         print(f"Error extracting highlights list: {e}")
+         return ""
+
+ def extract_highlights_from_abstract_item(abstract_item: dict) -> str:
+     """Extract highlights from an abstract item that contains highlights."""
+     try:
+         highlights = []
+
+         if '$$' in abstract_item:
+             for section in abstract_item['$$']:
+                 if isinstance(section, dict) and section.get('#name') == 'abstract-sec':
+                     # Look for highlights within this section
+                     highlights_text = extract_highlights_from_section(section)
+                     if highlights_text:
+                         highlights.append(highlights_text)
+
+         return ' '.join(highlights)
+
+     except Exception as e:
+         print(f"Error extracting highlights from abstract item: {e}")
+         return ""
+
+ def fetch_from_pubmed(doi: str) -> Optional[str]:
+     """Fetch abstract from PubMed if available (placeholder)."""
+     try:
+         # This is a simplified approach - in practice you would query the PubMed API.
+         # For now this method is skipped, but it could be extended to check for:
+         # - abstract field
+         # - highlights field
+         # - other summary fields
+         pass
+     except Exception:
+         pass
+     return None
+
+ def convert_abstract_to_inverted_index(abstract: str) -> Dict:
+     """Convert abstract text to inverted index format."""
+     if not abstract:
+         return {}
+
+     # Simple word tokenization and position mapping
+     words = re.findall(r'\b\w+\b', abstract.lower())
+     inverted_index = {}
+
+     for i, word in enumerate(words):
+         if word not in inverted_index:
+             inverted_index[word] = []
+         inverted_index[word].append(i)
+
+     return inverted_index
+
+ def extract_work_id_from_url(url: str) -> Optional[str]:
+     """Extract OpenAlex work ID from various URL formats."""
+     if not url:
+         return None
+
+     # Handle different URL formats
+     if 'openalex.org' in url:
+         if '/works/' in url:
+             # Extract ID from URL like https://openalex.org/works/W2741809807
+             work_id = url.split('/works/')[-1]
+             return work_id
+         elif 'api.openalex.org/works/' in url:
+             # Extract ID from API URL
+             work_id = url.split('/works/')[-1]
+             return work_id
+
+     # If it's already just an ID
+     if url.startswith('W') and len(url) > 5:
+         return url
+
+     return None
+
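Expected behaviour of the extractor above on the formats it handles (the work ID is the one already used in its comments):

```python
extract_work_id_from_url("https://openalex.org/works/W2741809807")  # -> "W2741809807"
extract_work_id_from_url("W2741809807")                             # -> "W2741809807"
extract_work_id_from_url("not-a-work-id")                           # -> None
```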
+ def save_to_database(session_id: str, data_type: str, data: Dict) -> str:
+     """Legacy-compatible save helper that routes to the new split DB layout."""
+     if data_type == 'collection':
+         work_id = data.get('work_id', '')
+         title = data.get('title', '')
+         return save_collection_to_database(work_id, title, data)
+     if data_type == 'filter':
+         source_collection = data.get('source_collection', '')
+         research_question = data.get('research_question', '')
+         return save_filter_to_database(source_collection, research_question, data)
+
+     # Fallback legacy path (single folder)
+     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+     filename = f"{session_id}_{data_type}_{timestamp}.pkl"
+     filepath = os.path.join(DATABASE_DIR, filename)
+     with open(filepath, 'wb') as f:
+         pickle.dump(data, f)
+     return filename
+
+ def _clean_work_id(work_id_or_url: str) -> str:
+     clean = extract_work_id_from_url(work_id_or_url) or work_id_or_url
+     clean = clean.replace('https://api.openalex.org/works/', '').replace('https://openalex.org/', '')
+     return clean
+
+ def save_collection_to_database(work_id_or_url: str, title: str, data: Dict) -> str:
+     """Save a collection once per work. Filename is the clean work id only (dedup)."""
+     ensure_db_dirs()
+     clean_id = _clean_work_id(work_id_or_url)
+     filename = f"{clean_id}.pkl"
+     filepath = os.path.join(COLLECTION_DB_DIR, filename)
+
+     # Deduplicate: if it exists, do NOT overwrite
+     if os.path.exists(filepath):
+         return filename
+
+     # Ensure helpful metadata for frontend display
+     data = dict(data)
+     data['work_id'] = work_id_or_url
+     data['title'] = title
+     data['work_identifier'] = clean_id
+     data['created'] = datetime.now().isoformat()
+
+     with open(filepath, 'wb') as f:
+         pickle.dump(data, f)
+     return filename
+
+ def save_filter_to_database(source_collection_clean_id: str, research_question: str, data: Dict) -> str:
+     """Save a filter result linked to a source collection. Multiple filters allowed."""
+     ensure_db_dirs()
+     # Slug for the RQ to keep filenames short
+     rq_slug = ''.join(c for c in research_question[:40] if c.isalnum() or c in (' ', '-', '_')).strip().replace(' ', '_') or 'rq'
+     timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+     filename = f"{source_collection_clean_id}__filter__{rq_slug}__{timestamp}.pkl"
+     filepath = os.path.join(FILTER_DB_DIR, filename)
+
+     data = dict(data)
+     data['filter_identifier'] = filename.replace('.pkl', '')
+     data['source_collection'] = source_collection_clean_id
+     data['research_question'] = research_question
+     data['created'] = datetime.now().isoformat()
+
+     with open(filepath, 'wb') as f:
+         pickle.dump(data, f)
+     return filename
+
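For example (hypothetical values), a filter saved for collection `W2741809807` with the question "How does irrigation affect yields?" would land in `filters/` as something like `W2741809807__filter__How_does_irrigation_affect_yields__20250101_120000.pkl`.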
+ def get_collection_files() -> List[Dict]:
+     files: List[Dict] = []
+     if not os.path.exists(COLLECTION_DB_DIR):
+         return files
+     for filename in os.listdir(COLLECTION_DB_DIR):
+         if not filename.endswith('.pkl'):
+             continue
+         filepath = os.path.join(COLLECTION_DB_DIR, filename)
+         try:
+             stat = os.stat(filepath)
+             with open(filepath, 'rb') as f:
+                 data = pickle.load(f)
+             files.append({
+                 'filename': filename,
+                 'type': 'collection',
+                 'work_identifier': data.get('work_identifier') or filename.replace('.pkl', ''),
+                 'title': data.get('title', ''),
+                 'work_id': data.get('work_id', ''),
+                 'total_papers': data.get('total_papers', 0),
+                 'created': data.get('created', datetime.fromtimestamp(stat.st_ctime).isoformat()),
+                 'size': stat.st_size
+             })
+         except Exception:
+             continue
+     files.sort(key=lambda x: x['created'], reverse=True)
+     return files
+
+ def get_filter_files() -> List[Dict]:
+     files: List[Dict] = []
+     if not os.path.exists(FILTER_DB_DIR):
+         return files
+     for filename in os.listdir(FILTER_DB_DIR):
+         if not filename.endswith('.pkl'):
+             continue
+         filepath = os.path.join(FILTER_DB_DIR, filename)
+         try:
+             stat = os.stat(filepath)
+             with open(filepath, 'rb') as f:
+                 data = pickle.load(f)
+             files.append({
+                 'filename': filename,
+                 'type': 'filter',
+                 'filter_identifier': data.get('filter_identifier') or filename.replace('.pkl', ''),
+                 'source_collection': data.get('source_collection', ''),
+                 'research_question': data.get('research_question', ''),
+                 'relevant_papers': data.get('relevant_papers', 0),
+                 'total_papers': data.get('total_papers', 0),
+                 'tested_papers': data.get('tested_papers', 0),
+                 'created': data.get('created', datetime.fromtimestamp(stat.st_ctime).isoformat()),
+                 'size': stat.st_size
+             })
+         except Exception:
+             continue
+     files.sort(key=lambda x: x['created'], reverse=True)
+     return files
+
1185
+
1186
+ def get_database_files() -> List[Dict]:
1187
+ """Combined listing for frontend history panel."""
1188
+ return get_collection_files() + get_filter_files()
1189
+
+ def find_existing_collection(work_id_or_url: str) -> Optional[str]:
+     """Return the existing collection filename for a work id if present (dedup)."""
+     clean_id = _clean_work_id(work_id_or_url)
+     filename = f"{clean_id}.pkl"
+     filepath = os.path.join(COLLECTION_DB_DIR, filename)
+     return filename if os.path.exists(filepath) else None
+
+ def filter_papers_for_rq(papers: List[Dict], research_question: str) -> List[Dict]:
+     """Filter papers based on the research question using GPT-5 nano (via analyze_paper_relevance)."""
+     if not papers or not research_question:
+         return []
+
+     relevant_papers = []
+
+     for i, paper in enumerate(papers):
+         print(f"Analyzing paper {i+1}/{len(papers)}: {paper.get('title', 'No title')[:50]}...")
+
+         # Extract title and abstract
+         title = paper.get('title', '')
+         abstract = ''
+
+         # Try to get abstract from inverted index
+         inverted_abstract = paper.get('abstract_inverted_index')
+         if inverted_abstract:
+             words = []
+             for word, positions in inverted_abstract.items():
+                 for pos in positions:
+                     while len(words) <= pos:
+                         words.append('')
+                     words[pos] = word
+             abstract = ' '.join(words).strip()
+
+         if not title and not abstract:
+             continue
+
+         # Create content for GPT analysis
+         content = {
+             'title': title,
+             'abstract': abstract
+         }
+
+         # Analyze with GPT
+         try:
+             analysis = analyze_paper_relevance(content, research_question, OPENAI_API_KEY)
+             if analysis and analysis.get('aims_of_paper'):
+                 # Check if paper is relevant to research question
+                 relevance_prompt = f"""
+ Research Question: {research_question}
+
+ Paper Title: {title}
+ Paper Abstract: {abstract or 'No abstract available'}
+
+ Is this paper highly relevant to answering the research question?
+ Consider the paper's aims, methods, and findings.
+
+ Return ONLY a JSON object: {{"relevant": true/false, "reason": "brief explanation"}}
+ """
+
+                 relevance_response = analyze_paper_relevance({
+                     'title': 'Relevance Check',
+                     'abstract': relevance_prompt
+                 }, research_question, OPENAI_API_KEY)
+
+                 if relevance_response and relevance_response.get('aims_of_paper'):
+                     # Parse the relevance response
+                     try:
+                         relevance_data = json.loads(relevance_response['aims_of_paper'])
+                         if relevance_data.get('relevant', False):
+                             paper['relevance_reason'] = relevance_data.get('reason', 'Relevant to research question')
+                             paper['gpt_analysis'] = analysis
+                             relevant_papers.append(paper)
+                     except Exception:
+                         # If parsing fails, include the paper anyway if it has analysis
+                         paper['gpt_analysis'] = analysis
+                         relevant_papers.append(paper)
+
+         except Exception as e:
+             print(f"Error analyzing paper {i+1}: {e}")
+             continue
+
+     return relevant_papers
+
+ # Flask routes removed - now using the Gradio interface
+
+ def search_papers_by_title(title: str) -> List[Dict]:
+     """Search OpenAlex for papers by title and return ranked matches."""
+     try:
+         # Clean and prepare the title for search
+         clean_title = title.strip()
+         if not clean_title:
+             return []
+
+         # Search the OpenAlex API
+         import urllib.parse
+         params = {
+             'search': clean_title,
+             'per_page': 10,  # Get top 10 results
+             'sort': 'relevance_score:desc'  # Sort by relevance
+         }
+
+         # Build URL with query parameters
+         query_string = urllib.parse.urlencode(params)
+         search_url = f"https://api.openalex.org/works?{query_string}"
+
+         print(f"EXACT URL BEING SEARCHED: {search_url}")
+
+         response = _http_get(search_url, timeout=10)
+         if not response or response.status_code != 200:
+             print(f"OpenAlex search failed: {response.status_code if response else 'No response'}")
+             return []
+
+         data = response.json()
+         results = data.get('results', [])
+
+         if not results:
+             print(f"No results found for title: {clean_title}")
+             return []
+
+         # Return top results (OpenAlex already ranks by relevance)
+         scored_results = []
+         for work in results[:5]:  # Take top 5 from OpenAlex
+             work_title = work.get('title', '')
+             if not work_title:
+                 continue
+
+             work_id = work.get('id', '').replace('https://openalex.org/', '')
+             scored_results.append({
+                 'work_id': work_id,
+                 'title': work_title,
+                 'authors': ', '.join([author.get('author', {}).get('display_name', '') for author in work.get('authorships', [])[:3]]),
+                 'year': work.get('publication_date', '')[:4] if work.get('publication_date') else 'Unknown',
+                 'venue': work.get('primary_location', {}).get('source', {}).get('display_name', 'Unknown'),
+                 'relevance_score': work.get('relevance_score', 0)
+             })
+
+         return scored_results
+
+     except Exception as e:
+         print(f"Error searching for papers by title: {e}")
+         return []
+
+ # Flask API routes removed - now using the Gradio interface
+
+ # Flask filter route removed - now using the Gradio interface
+
+ # Flask database routes removed - now using the Gradio interface
+
+ def generate_bibtex_entry(paper):
+     """Generate a BibTeX entry for a single paper."""
+     try:
+         # Handle None or invalid paper objects
+         if not paper or not isinstance(paper, dict):
+             print(f"Invalid paper object: {paper}")
+             return f"@article{{error_{hash(str(paper)) % 10000},\n  title={{Invalid paper data}},\n  author={{Unknown}},\n  year={{Unknown}}\n}}"
+
+         # Extract basic info with safe defaults
+         title = paper.get('title', 'Unknown Title')
+         year = paper.get('publication_year', 'Unknown Year')
+         doi = paper.get('doi', '')
+
+         # Generate a unique key (using OpenAlex ID or DOI)
+         work_id = paper.get('id', '')
+         if work_id and isinstance(work_id, str):
+             work_id = work_id.replace('https://openalex.org/', '')
+         if not work_id and doi:
+             work_id = doi.replace('https://doi.org/', '').replace('/', '_')
+         if not work_id:
+             work_id = f"paper_{hash(title) % 10000}"
+
+         # Extract authors safely
+         authorships = paper.get('authorships', [])
+         author_list = []
+         if isinstance(authorships, list):
+             for authorship in authorships:
+                 if isinstance(authorship, dict):
+                     author = authorship.get('author', {})
+                     if isinstance(author, dict):
+                         display_name = author.get('display_name', '')
+                         if display_name:
+                             # Split name and format as "Last, First"
+                             name_parts = display_name.split()
+                             if len(name_parts) >= 2:
+                                 last_name = name_parts[-1]
+                                 first_name = ' '.join(name_parts[:-1])
+                                 author_list.append(f"{last_name}, {first_name}")
+                             else:
+                                 author_list.append(display_name)
+
+         authors = " and ".join(author_list) if author_list else "Unknown Author"
+
+         # Extract journal info safely
+         primary_location = paper.get('primary_location', {})
+         journal = 'Unknown Journal'
+         if isinstance(primary_location, dict):
+             source = primary_location.get('source', {})
+             if isinstance(source, dict):
+                 journal = source.get('display_name', 'Unknown Journal')
+
+         # Extract volume, issue, pages safely
+         biblio = paper.get('biblio', {})
+         volume = ''
+         issue = ''
+         first_page = ''
+         last_page = ''
+         if isinstance(biblio, dict):
+             volume = biblio.get('volume', '')
+             issue = biblio.get('issue', '')
+             first_page = biblio.get('first_page', '')
+             last_page = biblio.get('last_page', '')
+
+         # Format pages
+         if first_page and last_page and first_page != last_page:
+             pages = f"{first_page}--{last_page}"
+         elif first_page:
+             pages = first_page
+         else:
+             pages = ""
+
+         # Format volume and issue
+         volume_info = ""
+         if volume:
+             volume_info = f"volume={{{volume}}}"
+             if issue:
+                 volume_info += f", number={{{issue}}}"
+         elif issue:
+             volume_info = f"number={{{issue}}}"
+
+         # Get URL (prefer DOI, fallback to landing page)
+         url = doi if doi else ''
+         if isinstance(primary_location, dict):
+             landing_url = primary_location.get('landing_page_url', '')
+             if landing_url and not url:
+                 url = landing_url
+
+         # Build BibTeX entry
+         bibtex_entry = f"""@article{{{work_id},
+   title={{{title}}},
+   author={{{authors}}},
+   journal={{{journal}}},
+   year={{{year}}}"""
+
+         if volume_info:
+             bibtex_entry += f",\n  {volume_info}"
+
+         if pages:
+             bibtex_entry += f",\n  pages={{{pages}}}"
+
+         if doi:
+             bibtex_entry += f",\n  doi={{{doi.replace('https://doi.org/', '')}}}"
+
+         if url:
+             bibtex_entry += f",\n  url={{{url}}}"
+
+         bibtex_entry += "\n}"
+
+         return bibtex_entry
+
+     except Exception as e:
1448
+ print(f"Error generating BibTeX for paper: {e}")
1449
+ print(f"Paper data: {paper}")
1450
+ return f"@article{{error_{hash(str(paper)) % 10000},\n title={{Error generating entry}},\n author={{Unknown}},\n year={{Unknown}}\n}}"
1451
+
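+ # Illustrative output of generate_bibtex_entry (all field values hypothetical):
+ # @article{W1234567890,
+ #     title={An Example Title},
+ #     author={Doe, Jane and Smith, John},
+ #     journal={Example Journal},
+ #     year={2020},
+ #     volume={12}, number={3},
+ #     pages={100--110},
+ #     doi={10.1000/xyz123},
+ #     url={https://doi.org/10.1000/xyz123}
+ # }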
1452
+ # Flask BibTeX and download routes removed - now using Gradio interface
1453
+
1454
+ # Flask merge route removed - now using Gradio interface
1455
+
1456
+ def merge_collections(collection_filenames):
1457
+ """Merge multiple collections into a new collection with overlap analysis."""
1458
+ try:
1459
+ if len(collection_filenames) < 2:
1460
+ return {'success': False, 'message': 'At least 2 collections required for merging'}
1461
+
1462
+ # Load all collections and track their work IDs
1463
+ collections_data = []
1464
+ all_work_ids = set()
1465
+ collection_work_ids = [] # List of sets, one per collection
1466
+
1467
+ for filename in collection_filenames:
1468
+ collection_path = os.path.join(COLLECTION_DB_DIR, filename)
1469
+ if not os.path.exists(collection_path):
1470
+ return {'success': False, 'message': f'Collection {filename} not found'}
1471
+
1472
+ with open(collection_path, 'rb') as f:
1473
+ collection_data = pickle.load(f)
1474
+
1475
+ papers = collection_data.get('papers', [])
1476
+ collection_work_ids_set = set()
1477
+
1478
+ # Extract work IDs for this collection
1479
+ for paper in papers:
1480
+ if isinstance(paper, dict):
1481
+ work_id = paper.get('id', '')
1482
+ if work_id:
1483
+ collection_work_ids_set.add(work_id)
1484
+ all_work_ids.add(work_id)
1485
+
1486
+ collections_data.append({
1487
+ 'filename': filename,
1488
+ 'title': collection_data.get('title', filename.replace('.pkl', '')),
1489
+ 'papers': papers,
1490
+ 'work_ids': collection_work_ids_set,
1491
+ 'total_papers': len(papers)
1492
+ })
1493
+ collection_work_ids.append(collection_work_ids_set)
1494
+
1495
+ # Calculate overlap statistics
1496
+ overlap_stats = []
1497
+ total_unique_papers = len(all_work_ids)
1498
+
1499
+ for i, collection in enumerate(collections_data):
1500
+ collection_work_ids_i = collection_work_ids[i]
1501
+ overlaps = []
1502
+
1503
+ # Calculate overlap with each other collection
1504
+ for j, other_collection in enumerate(collections_data):
1505
+ if i != j:
1506
+ other_work_ids = collection_work_ids[j]
1507
+ intersection = collection_work_ids_i.intersection(other_work_ids)
1508
+ overlap_count = len(intersection)
1509
+ overlap_percentage = (overlap_count / len(collection_work_ids_i)) * 100 if collection_work_ids_i else 0
1510
+
1511
+ overlaps.append({
1512
+ 'collection': other_collection['title'],
1513
+ 'overlap_count': overlap_count,
1514
+ 'overlap_percentage': round(overlap_percentage, 1)
1515
+ })
1516
+
1517
+ overlap_stats.append({
1518
+ 'collection': collection['title'],
1519
+ 'total_papers': collection['total_papers'],
1520
+ 'overlaps': overlaps
1521
+ })
1522
+
1523
+ # Create merged collection with unique papers only
1524
+ merged_papers = []
1525
+ merged_work_ids = set()
1526
+
1527
+ for collection in collections_data:
1528
+ for paper in collection['papers']:
1529
+ if isinstance(paper, dict):
1530
+ work_id = paper.get('id', '')
1531
+ if work_id and work_id not in merged_work_ids:
1532
+ merged_papers.append(paper)
1533
+ merged_work_ids.add(work_id)
1534
+
1535
+ if not merged_papers:
1536
+ return {'success': False, 'message': 'No papers found in collections to merge'}
1537
+
1538
+ # Calculate total papers across all collections (before deduplication)
1539
+ total_papers_before_merge = sum(collection['total_papers'] for collection in collections_data)
1540
+ duplicates_removed = total_papers_before_merge - len(merged_papers)
1541
+ deduplication_percentage = (duplicates_removed / total_papers_before_merge) * 100 if total_papers_before_merge > 0 else 0
1542
+
1543
+ # Create merged collection data
1544
+ collection_titles = [collection['title'] for collection in collections_data]
1545
+ merged_title = f"MERGED: {' + '.join(collection_titles[:3])}"
1546
+ if len(collection_titles) > 3:
1547
+ merged_title += f" + {len(collection_titles) - 3} more"
1548
+
1549
+ merged_data = {
1550
+ 'work_identifier': f"merged_{int(time.time())}",
1551
+ 'title': merged_title,
1552
+ 'work_id': '',
1553
+ 'papers': merged_papers,
1554
+ 'total_papers': len(merged_papers),
1555
+ 'created': datetime.now().isoformat(),
1556
+ 'source_collections': collection_filenames,
1557
+ 'merge_stats': {
1558
+ 'total_papers_before_merge': total_papers_before_merge,
1559
+ 'duplicates_removed': duplicates_removed,
1560
+ 'deduplication_percentage': round(deduplication_percentage, 1),
1561
+ 'overlap_analysis': overlap_stats
1562
+ }
1563
+ }
1564
+
1565
+ # Save merged collection
1566
+ merged_filename = f"merged_{int(time.time())}.pkl"
1567
+ merged_path = os.path.join(COLLECTION_DB_DIR, merged_filename)
1568
+
1569
+ with open(merged_path, 'wb') as f:
1570
+ pickle.dump(merged_data, f)
1571
+
1572
+ return {
1573
+ 'success': True,
1574
+ 'message': f'Merged collection created with {len(merged_papers)} unique papers (removed {duplicates_removed} duplicates)',
1575
+ 'filename': merged_filename,
1576
+ 'total_papers': len(merged_papers),
1577
+ 'merge_stats': {
1578
+ 'total_papers_before_merge': total_papers_before_merge,
1579
+ 'duplicates_removed': duplicates_removed,
1580
+ 'deduplication_percentage': round(deduplication_percentage, 1),
1581
+ 'overlap_analysis': overlap_stats
1582
+ }
1583
+ }
1584
+
1585
+ except Exception as e:
1586
+ return {'success': False, 'message': f'Error merging collections: {str(e)}'}
1587
+
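+ # Worked example (hypothetical numbers): merging collections of 40 and 30 papers
+ # that share 10 work IDs yields 60 unique papers; overlap is reported as
+ # 10/40 = 25.0% from the first collection's side and 10/30 = 33.3% from the
+ # second's, and deduplication removes 10/70 = 14.3% of the combined input.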
1588
+ def fetch_abstracts(papers):
1589
+ """Fetch missing abstracts for papers using their DOI URLs."""
1590
+ try:
1591
+ if not papers:
1592
+ return {'error': 'No papers provided'}
1593
+
1594
+ updated_papers = []
1595
+ fetched_count = 0
1596
+ total_processed = 0
1597
+
1598
+ for paper in papers:
1599
+ total_processed += 1
1600
+ updated_paper = paper.copy()
1601
+
1602
+ # Check if paper already has abstract (check both abstract_inverted_index and abstract fields)
1603
+ has_abstract = (
1604
+ (paper.get('abstract_inverted_index') and
1605
+ len(paper.get('abstract_inverted_index', {})) > 0) or
1606
+ (paper.get('abstract') and
1607
+ len(str(paper.get('abstract', '')).strip()) > 50)
1608
+ )
1609
+
1610
+ if not has_abstract and paper.get('doi'):
1611
+ print(f"Fetching abstract for DOI: {paper.get('doi')}")
1612
+ abstract = fetch_abstract_from_doi(paper.get('doi'))
1613
+
1614
+ if abstract:
1615
+ # Convert to inverted index format
1616
+ inverted_index = convert_abstract_to_inverted_index(abstract)
1617
+ updated_paper['abstract_inverted_index'] = inverted_index
1618
+ fetched_count += 1
1619
+ print(f"Successfully fetched abstract for: {paper.get('title', 'Unknown')[:50]}...")
1620
+ else:
1621
+ print(f"Could not fetch abstract for: {paper.get('title', 'Unknown')[:50]}...")
1622
+
1623
+ updated_papers.append(updated_paper)
1624
+
1625
+ return {
1626
+ 'success': True,
1627
+ 'fetched_count': fetched_count,
1628
+ 'total_processed': total_processed,
1629
+ 'updated_papers': updated_papers
1630
+ }
1631
+
1632
+ except Exception as e:
1633
+ print(f"Error fetching abstracts: {e}")
1634
+ return {'error': str(e)}
1635
+
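+ # convert_abstract_to_inverted_index (defined elsewhere in this file) is assumed
+ # to produce OpenAlex's abstract_inverted_index shape, a word -> positions map;
+ # e.g. "deep learning for deep nets" would become (sketch, zero-based positions)
+ # {"deep": [0, 3], "learning": [1], "for": [2], "nets": [4]}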
1636
+ def export_excel_from_file(filename):
1637
+ """Export Excel from a specific database file."""
1638
+ try:
1639
+ # Try collections then filters then legacy
1640
+ filepath = os.path.join(COLLECTION_DB_DIR, filename)
1641
+ if not os.path.exists(filepath):
1642
+ filepath = os.path.join(FILTER_DB_DIR, filename)
1643
+ if not os.path.exists(filepath):
1644
+ filepath = os.path.join(DATABASE_DIR, filename)
1645
+ if not os.path.exists(filepath):
1646
+ return {'error': 'File not found'}
1647
+
1648
+ with open(filepath, 'rb') as f:
1649
+ data = pickle.load(f)
1650
+
1651
+ papers = data.get('papers', [])
1652
+ if not papers:
1653
+ return {'error': 'No papers found in file'}
1654
+
1655
+ # Prepare data for Excel export
1656
+ excel_data = []
1657
+ for paper in papers:
1658
+ # Extract abstract from inverted index
1659
+ abstract = ""
1660
+ if paper.get('abstract_inverted_index'):
1661
+ words = []
1662
+ for word, positions in paper['abstract_inverted_index'].items():
1663
+ for pos in positions:
1664
+ while len(words) <= pos:
1665
+ words.append('')
1666
+ words[pos] = word
1667
+ abstract = ' '.join(words).strip()
1668
+
1669
+ # Extract open access info with null checks
1670
+ oa_info = paper.get('open_access') or {}
1671
+ is_oa = oa_info.get('is_oa', False) if oa_info else False
1672
+ oa_status = oa_info.get('oa_status', '') if oa_info else ''
1673
+
1674
+ # Extract DOI with null check
1675
+ doi = ""
1676
+ if paper.get('doi'):
1677
+ doi = paper['doi'].replace('https://doi.org/', '')
1678
+
1679
+ # Extract authors with null checks
1680
+ authors = paper.get('authorships') or []
1681
+ author_names = []
1682
+ for author in authors[:5]: # Limit to first 5 authors
1683
+ if author and isinstance(author, dict):
1684
+ author_obj = author.get('author') or {}
1685
+ if author_obj and isinstance(author_obj, dict):
1686
+ author_names.append(author_obj.get('display_name', ''))
1687
+
1688
+ # Extract journal with null checks
1689
+ journal = ""
1690
+ primary_location = paper.get('primary_location')
1691
+ if primary_location and isinstance(primary_location, dict):
1692
+ source = primary_location.get('source')
1693
+ if source and isinstance(source, dict):
1694
+ journal = source.get('display_name', '')
1695
+
1696
+ # Extract GPT analysis with null checks
1697
+ gpt_analysis = paper.get('gpt_analysis') or {}
1698
+ gpt_aims = gpt_analysis.get('aims_of_paper', '') if gpt_analysis else ''
1699
+ gpt_takeaways = gpt_analysis.get('key_takeaways', '') if gpt_analysis else ''
1700
+
1701
+ excel_data.append({
1702
+ 'Title': paper.get('title', ''),
1703
+ 'Publication Date': paper.get('publication_date', ''),
1704
+ 'DOI': doi,
1705
+ 'Is Open Access': is_oa,
1706
+ 'OA Status': oa_status,
1707
+ 'Abstract': abstract,
1708
+ 'Relationship': paper.get('relationship', ''),
1709
+ 'Authors': ', '.join(author_names),
1710
+ 'Journal': journal,
1711
+ 'OpenAlex ID': paper.get('id', ''),
1712
+ 'Relevance Reason': paper.get('relevance_reason', ''),
1713
+ 'GPT Aims': gpt_aims,
1714
+ 'GPT Takeaways': gpt_takeaways
1715
+ })
1716
+
1717
+ # Create DataFrame and export to Excel
1718
+ df = pd.DataFrame(excel_data)
1719
+ excel_filename = f'{filename.replace(".pkl", "")}_{int(time.time())}.xlsx'
1720
+
1721
+ # Create Excel file in a temporary location
1722
+ temp_dir = tempfile.gettempdir()
1723
+ excel_path = os.path.join(temp_dir, excel_filename)
1724
+
1725
+ try:
1726
+ df.to_excel(excel_path, index=False)
1727
+ return {'success': True, 'message': f'Excel file created: {excel_filename}', 'filepath': excel_path}
1728
+ except Exception as e:
1729
+ print(f"Error creating Excel file: {e}")
1730
+ # Fallback: try current directory
1731
+ try:
1732
+ df.to_excel(excel_filename, index=False)
1733
+ return {'success': True, 'message': f'Excel file created: {excel_filename}', 'filepath': excel_filename}
1734
+ except Exception as e2:
1735
+ print(f"Error creating Excel file in current directory: {e2}")
1736
+ return {'error': f'Failed to create Excel file: {str(e2)}'}
1737
+
1738
+ except Exception as e:
1739
+ print(f"Error exporting Excel: {e}")
1740
+ return {'error': str(e)}
1741
+
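+ # Usage sketch (hypothetical filename): export_excel_from_file("W1607201421.pkl")
+ # returns {'success': True, 'message': ..., 'filepath': ...} pointing at an .xlsx
+ # written to the system temp directory (falling back to the working directory).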
1742
+ def export_excel():
1743
+ """Export collected papers to Excel format."""
1744
+ try:
1745
+ # Load papers from temporary file
1746
+ if not os.path.exists('temp_papers.pkl'):
1747
+ return {'error': 'No papers found. Please collect papers first.'}
1748
+
1749
+ with open('temp_papers.pkl', 'rb') as f:
1750
+ papers = pickle.load(f)
1751
+
1752
+ # Prepare data for Excel export
1753
+ excel_data = []
1754
+ for paper in papers:
1755
+ # Extract abstract from inverted index
1756
+ abstract = ""
1757
+ if paper.get('abstract_inverted_index'):
1758
+ words = []
1759
+ for word, positions in paper['abstract_inverted_index'].items():
1760
+ for pos in positions:
1761
+ while len(words) <= pos:
1762
+ words.append('')
1763
+ words[pos] = word
1764
+ abstract = ' '.join(words).strip()
1765
+
1766
+ # Extract open access info with null checks
1767
+ oa_info = paper.get('open_access') or {}
1768
+ is_oa = oa_info.get('is_oa', False) if oa_info else False
1769
+ oa_status = oa_info.get('oa_status', '') if oa_info else ''
1770
+
1771
+ # Extract DOI with null check
1772
+ doi = ""
1773
+ if paper.get('doi'):
1774
+ doi = paper['doi'].replace('https://doi.org/', '')
1775
+
1776
+ # Extract authors with null checks
1777
+ authors = paper.get('authorships') or []
1778
+ author_names = []
1779
+ for author in authors[:5]: # Limit to first 5 authors
1780
+ if author and isinstance(author, dict):
1781
+ author_obj = author.get('author') or {}
1782
+ if author_obj and isinstance(author_obj, dict):
1783
+ author_names.append(author_obj.get('display_name', ''))
1784
+
1785
+ # Extract journal with null checks
1786
+ journal = ""
1787
+ primary_location = paper.get('primary_location')
1788
+ if primary_location and isinstance(primary_location, dict):
1789
+ source = primary_location.get('source')
1790
+ if source and isinstance(source, dict):
1791
+ journal = source.get('display_name', '')
1792
+
1793
+ # Extract GPT analysis with null checks
1794
+ gpt_analysis = paper.get('gpt_analysis') or {}
1795
+ gpt_aims = gpt_analysis.get('aims_of_paper', '') if gpt_analysis else ''
1796
+ gpt_takeaways = gpt_analysis.get('key_takeaways', '') if gpt_analysis else ''
1797
+
1798
+ excel_data.append({
1799
+ 'Title': paper.get('title', ''),
1800
+ 'Publication Date': paper.get('publication_date', ''),
1801
+ 'DOI': doi,
1802
+ 'Is Open Access': is_oa,
1803
+ 'OA Status': oa_status,
1804
+ 'Abstract': abstract,
1805
+ 'Relationship': paper.get('relationship', ''),
1806
+ 'Authors': ', '.join(author_names),
1807
+ 'Journal': journal,
1808
+ 'OpenAlex ID': paper.get('id', ''),
1809
+ 'Relevance Reason': paper.get('relevance_reason', ''),
1810
+ 'GPT Aims': gpt_aims,
1811
+ 'GPT Takeaways': gpt_takeaways
1812
+ })
1813
+
1814
+ # Create DataFrame and export to Excel
1815
+ df = pd.DataFrame(excel_data)
1816
+ excel_filename = f'research_papers_{int(time.time())}.xlsx'
1817
+
1818
+ # Create Excel file in a temporary location
1819
+ temp_dir = tempfile.gettempdir()
1820
+ excel_path = os.path.join(temp_dir, excel_filename)
1821
+
1822
+ try:
1823
+ df.to_excel(excel_path, index=False)
1824
+ return {'success': True, 'message': f'Excel file created: {excel_filename}', 'filepath': excel_path}
1825
+ except Exception as e:
1826
+ print(f"Error creating Excel file: {e}")
1827
+ # Fallback: try current directory
1828
+ try:
1829
+ df.to_excel(excel_filename, index=False)
1830
+ return {'success': True, 'message': f'Excel file created: {excel_filename}', 'filepath': excel_filename}
1831
+ except Exception as e2:
1832
+ print(f"Error creating Excel file in current directory: {e2}")
1833
+ return {'error': f'Failed to create Excel file: {str(e2)}'}
1834
+
1835
+ except Exception as e:
1836
+ print(f"Error exporting Excel: {e}")
1837
+ return {'error': str(e)}
1838
+
1839
+ def search_papers_interface(paper_title: str):
1840
+ """Search for papers by title."""
1841
+ if not paper_title.strip():
1842
+ return "Please enter a paper title to search."
1843
+
1844
+ try:
1845
+ matches = search_papers_by_title(paper_title)
1846
+ if not matches:
1847
+ return "No papers found matching that title."
1848
+
1849
+ # Format results for display
1850
+ result_text = f"Found {len(matches)} papers:\n\n"
1851
+ for i, match in enumerate(matches, 1):
1852
+ result_text += f"{i}. {match['title']}\n"
1853
+ result_text += f" Authors: {match['authors']}\n"
1854
+ result_text += f" Year: {match['year']}\n"
1855
+ result_text += f" Journal: {match['venue']}\n"
1856
+ result_text += f" Work ID: {match['work_id']}\n\n"
1857
+
1858
+ return result_text
1859
+ except Exception as e:
1860
+ return f"Error searching papers: {str(e)}"
1861
+
1862
+ def collect_papers_interface(work_id: str, limit: int = 50):
1863
+ """Collect related papers from a work ID."""
1864
+ if not work_id.strip():
1865
+ return "Please enter a work ID to collect papers."
1866
+
1867
+ try:
1868
+ # Check if collection already exists
1869
+ existing_file = find_existing_collection(work_id)
1870
+ if existing_file:
1871
+ return f"Collection already exists: {existing_file}"
1872
+
1873
+ # Collect papers
1874
+ papers = get_related_papers(work_id, upper_limit=limit)
1875
+
1876
+ if not papers:
1877
+ return "No related papers found."
1878
+
1879
+ # Count papers by relationship type
1880
+ cited_count = sum(1 for p in papers if p.get('relationship') == 'cited')
1881
+ citing_count = sum(1 for p in papers if p.get('relationship') == 'citing')
1882
+ related_count = sum(1 for p in papers if p.get('relationship') == 'related')
1883
+
1884
+ # Save to database
1885
+ collection_data = {
1886
+ 'work_id': work_id,
1887
+ 'total_papers': len(papers),
1888
+ 'cited_papers': cited_count,
1889
+ 'citing_papers': citing_count,
1890
+ 'related_papers': related_count,
1891
+ 'limit': limit,
1892
+ 'papers': papers,
1893
+ }
1894
+
1895
+ # Get title for the collection
1896
+ title = work_id # Fallback to work_id if title not available
1897
+ try:
1898
+ seed_resp = requests.get(f'https://api.openalex.org/works/{work_id}', timeout=10)
1899
+ if seed_resp.ok:
1900
+ title = (seed_resp.json() or {}).get('title', work_id)
1901
+ except Exception:
1902
+ pass
1903
+
1904
+ db_filename = save_collection_to_database(work_id, title, collection_data)
1905
+
1906
+ result = f"Collection completed!\n\n"
1907
+ result += f"Total papers: {len(papers)}\n"
1908
+ result += f"Cited papers: {cited_count}\n"
1909
+ result += f"Citing papers: {citing_count}\n"
1910
+ result += f"Related papers: {related_count}\n"
1911
+ result += f"Saved as: {db_filename}"
1912
+
1913
+ return result
1914
+
1915
+ except Exception as e:
1916
+ return f"Error collecting papers: {str(e)}"
1917
+
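+ # Usage sketch (hypothetical ID): collect_papers_interface("W2741809807", limit=50)
+ # first checks find_existing_collection, then gathers cited, citing and related
+ # works via get_related_papers and pickles the result with save_collection_to_database.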
1918
+ def filter_papers_interface(collection_filename: str, research_question: str, limit: int = 10):
1919
+ """Filter papers based on research question."""
1920
+ if not collection_filename.strip() or not research_question.strip():
1921
+ return "Please provide both collection filename and research question."
1922
+
1923
+ try:
1924
+ # Load collection
1925
+ filepath = os.path.join("database/collections", collection_filename)
1926
+ if not os.path.exists(filepath):
1927
+ return f"Collection file not found: {collection_filename}"
1928
+
1929
+ with open(filepath, 'rb') as f:
1930
+ collection_data = pickle.load(f)
1931
+
1932
+ papers = collection_data.get('papers', [])
1933
+ if not papers:
1934
+ return "No papers found in collection."
1935
+
1936
+ # Filter papers
1937
+ relevant_papers = filter_papers_for_research_question(papers, research_question, OPENAI_API_KEY, limit)
1938
+
1939
+ # Count relevant papers
1940
+ actual_relevant = sum(1 for paper in relevant_papers if paper.get('relevance_score') is True)
1941
+
1942
+ # Save filter results
1943
+ filter_data = {
1944
+ 'research_question': research_question,
1945
+ 'total_papers': len(papers),
1946
+ 'tested_papers': limit,
1947
+ 'relevant_papers': actual_relevant,
1948
+ 'limit': limit,
1949
+ 'papers': relevant_papers,
1950
+ 'source_collection': collection_filename.replace('.pkl', '')
1951
+ }
1952
+
1953
+ db_filename = save_filter_to_database(collection_filename.replace('.pkl', ''), research_question, filter_data)
1954
+
1955
+ result = f"Filtering completed!\n\n"
1956
+ result += f"Total papers in collection: {len(papers)}\n"
1957
+ result += f"Papers tested: {limit}\n"
1958
+ result += f"Relevant papers found: {actual_relevant}\n"
1959
+ result += f"Saved as: {db_filename}\n\n"
1960
+
1961
+ # Show relevant papers
1962
+ if relevant_papers:
1963
+ result += "Relevant papers:\n"
1964
+ for i, paper in enumerate(relevant_papers[:5], 1): # Show first 5
1965
+ result += f"{i}. {paper.get('title', 'No title')}\n"
1966
+ result += f" Reason: {paper.get('relevance_reason', 'No reason provided')}\n\n"
1967
+
1968
+ return result
1969
+
1970
+ except Exception as e:
1971
+ return f"Error filtering papers: {str(e)}"
1972
+
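+ # Usage sketch (hypothetical inputs): filter_papers_interface("W1607201421.pkl",
+ # "Does the paper discuss just transitions?", limit=10) loads the pickled
+ # collection, runs the AI relevance check on up to 10 papers via
+ # filter_papers_for_research_question, and saves the results with save_filter_to_database.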
1973
+ def get_database_files_interface():
1974
+ """Get list of all database files."""
1975
+ try:
1976
+ files = get_database_files()
1977
+ if not files:
1978
+ return "No database files found."
1979
+
1980
+ result = f"Found {len(files)} database files:\n\n"
1981
+ for file_info in files:
1982
+ file_type = file_info.get('type', 'unknown')
1983
+ filename = file_info.get('filename', 'unknown')
1984
+ created = file_info.get('created', 'unknown')
1985
+ size = file_info.get('size', 0)
1986
+
1987
+ result += f"📁 {filename}\n"
1988
+ result += f" Type: {file_type}\n"
1989
+ result += f" Created: {created}\n"
1990
+ result += f" Size: {size} bytes\n\n"
1991
+
1992
+ return result
1993
+
1994
+ except Exception as e:
1995
+ return f"Error getting database files: {str(e)}"
1996
+
1997
+ def generate_bibtex_interface(filename: str):
1998
+ """Generate BibTeX for a collection."""
1999
+ if not filename.strip():
2000
+ return "Please provide a filename to generate BibTeX."
2001
+
2002
+ try:
2003
+ # Load collection
2004
+ filepath = os.path.join("database/collections", filename)
2005
+ if not os.path.exists(filepath):
2006
+ return f"Collection file not found: {filename}"
2007
+
2008
+ with open(filepath, 'rb') as f:
2009
+ collection_data = pickle.load(f)
2010
+
2011
+ papers = collection_data.get('papers', [])
2012
+ if not papers:
2013
+ return "No papers found in collection."
2014
+
2015
+ # Generate BibTeX entries
2016
+ bibtex_entries = []
2017
+ for paper in papers:
2018
+ entry = generate_bibtex_entry(paper)
2019
+ bibtex_entries.append(entry)
2020
+
2021
+ # Combine all entries
2022
+ bibtex_content = "\n\n".join(bibtex_entries)
2023
+
2024
+ # Save BibTeX file
2025
+ bibtex_filename = filename.replace('.pkl', '.bib')
2026
+ bibtex_path = os.path.join(COLLECTION_DB_DIR, bibtex_filename)
2027
+
2028
+ with open(bibtex_path, 'w', encoding='utf-8') as f:
2029
+ f.write(bibtex_content)
2030
+
2031
+ result = f"BibTeX file generated successfully!\n\n"
2032
+ result += f"Filename: {bibtex_filename}\n"
2033
+ result += f"Entries: {len(papers)}\n"
2034
+ result += f"Saved to: {bibtex_path}\n\n"
2035
+ result += "First few entries:\n"
2036
+ result += (bibtex_content[:1000] + "...") if len(bibtex_content) > 1000 else bibtex_content
2037
+
2038
+ return result
2039
+
2040
+ except Exception as e:
2041
+ return f"Error generating BibTeX: {str(e)}"
2042
+
2043
+ # Create Gradio interface
2044
+ with gr.Blocks(title="AI Systematic Literature Review", theme=gr.themes.Soft()) as demo:
2045
+ gr.Markdown("# 🧪 AI Systematic Literature Review")
2046
+ gr.Markdown("Search, collect, and analyze academic papers using OpenAlex and AI-powered filtering.")
2047
+
2048
+ with gr.Tabs():
2049
+ with gr.Tab("🔍 Search Papers"):
2050
+ gr.Markdown("Search for papers by title using OpenAlex API")
2051
+ with gr.Row():
2052
+ search_title = gr.Textbox(label="Paper Title", placeholder="Enter the title of the paper you want to search for...")
2053
+ search_btn = gr.Button("Search Papers", variant="primary")
2054
+ search_output = gr.Textbox(label="Search Results", lines=10, interactive=False)
2055
+ search_btn.click(search_papers_interface, inputs=search_title, outputs=search_output)
2056
+
2057
+ with gr.Tab("📚 Collect Papers"):
2058
+ gr.Markdown("Collect related papers from a seed paper using its OpenAlex Work ID")
2059
+ with gr.Row():
2060
+ work_id_input = gr.Textbox(label="OpenAlex Work ID", placeholder="e.g., W2741809807")
2061
+ limit_input = gr.Number(label="Limit", value=50, minimum=1, maximum=1000)
2062
+ collect_btn = gr.Button("Collect Papers", variant="primary")
2063
+ collect_output = gr.Textbox(label="Collection Results", lines=10, interactive=False)
2064
+ collect_btn.click(collect_papers_interface, inputs=[work_id_input, limit_input], outputs=collect_output)
2065
+
2066
+ with gr.Tab("🔬 Filter Papers"):
2067
+ gr.Markdown("Filter collected papers based on a research question using AI analysis")
2068
+ with gr.Row():
2069
+ collection_file = gr.Textbox(label="Collection Filename", placeholder="e.g., W2741809807.pkl")
2070
+ research_question = gr.Textbox(label="Research Question", placeholder="What is your research question?")
2071
+ filter_limit = gr.Number(label="Papers to Test", value=10, minimum=1, maximum=100)
2072
+ filter_btn = gr.Button("Filter Papers", variant="primary")
2073
+ filter_output = gr.Textbox(label="Filter Results", lines=15, interactive=False)
2074
+ filter_btn.click(filter_papers_interface, inputs=[collection_file, research_question, filter_limit], outputs=filter_output)
2075
+
2076
+ with gr.Tab("📁 Database Files"):
2077
+ gr.Markdown("View and manage your collected papers and filters")
2078
+ with gr.Row():
2079
+ db_btn = gr.Button("Refresh Database Files", variant="primary")
2080
+ db_output = gr.Textbox(label="Database Files", lines=15, interactive=False)
2081
+ db_btn.click(get_database_files_interface, outputs=db_output)
2082
+
2083
+ with gr.Tab("📊 Export Data"):
2084
+ gr.Markdown("Export your collections to various formats")
2085
+ with gr.Row():
2086
+ export_filename = gr.Textbox(label="Collection Filename", placeholder="e.g., W2741809807.pkl")
2087
+ export_bibtex_btn = gr.Button("Export to BibTeX")
2088
+ export_output = gr.Textbox(label="Export Results", lines=10, interactive=False)
2089
+ export_bibtex_btn.click(generate_bibtex_interface, inputs=export_filename, outputs=export_output)
2090
+
2091
+ gr.Markdown("""
2092
+ ## How to use:
2093
+ 1. **Search Papers**: Enter a paper title to find papers in OpenAlex
2094
+ 2. **Collect Papers**: Use a Work ID to collect related papers (cited, citing, and related)
2095
+ 3. **Filter Papers**: Use AI to filter collected papers based on your research question
2096
+ 4. **Database Files**: View all your collections and filters
2097
+ 5. **Export Data**: Export your results to BibTeX format
2098
+
2099
+ ## Note:
2100
+ - You need an OpenAI API key set as an environment variable for AI filtering
2101
+ - Collections are automatically saved and can be reused
2102
+ - The system respects OpenAlex rate limits
2103
+ """)
2104
+
2105
+ if __name__ == "__main__":
2106
+ demo.launch()
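+ # On Hugging Face Spaces the Gradio SDK manages the host and port. For a
+ # plain-Docker run one might instead launch along these lines (a sketch,
+ # not part of the Space configuration):
+ # demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", "7860")))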
database/collections/W1607201421.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:61bb4e949d3628dd87345737e7b8120aa3707b9231c1e467c85ea7daface8200
3
+ size 133
database/collections/W2774003070.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f73318a00a6301409fc306d82dc050be28d2aec4ff306cd1ed4575b3d361b983
3
+ size 132
database/collections/W3200878735.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6665f6e774a08975080f3eab7395fc7d52feb9f58fe7d3b9055340ceddc67215
3
+ size 132
database/filters/W2774003070__filter__talks_about_just_transitions_in_global_s__20250909_224951.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee597aa75ad51fa7ee130c5fbbcedce7021373a439403ae8766faaf552898b13
3
+ size 131
database/filters/W3200878735__filter__talks_about_just_transitions_in_global_s__20250909_225708.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cc11a002a31bf98af1593b1f49d19e52275619fcc7de2da2506f3bf9925ca054
3
+ size 131
requirements.txt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ gradio>=4.0.0
2
+ requests>=2.31.0
3
+ openai>=1.0.0
4
+ pandas>=2.0.0
5
+ tqdm>=4.65.0
6
+ openpyxl>=3.0.0
7
+ beautifulsoup4>=4.12.0
templates/index.html ADDED
@@ -0,0 +1,1667 @@
 
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Research Paper Analysis Tool</title>
7
+ <style>
8
+ * {
9
+ margin: 0;
10
+ padding: 0;
11
+ box-sizing: border-box;
12
+ }
13
+
14
+ body {
15
+ font-family: 'Courier New', monospace;
16
+ background: #000000;
17
+ color: #ffffff;
18
+ line-height: 1.2;
19
+ padding: 15px;
20
+ margin: 0;
21
+ display: flex;
22
+ font-weight: bold;
23
+ }
24
+
25
+ .main-content {
26
+ flex: 1;
27
+ max-width: 50%;
28
+ margin-right: 15px;
29
+ }
30
+
31
+ .history-panel {
32
+ width: 275px;
33
+ border: 3px solid #ffffff;
34
+ padding: 15px;
35
+ background: #000000;
36
+ height: 75vh;
37
+ display: flex;
38
+ flex-direction: column;
39
+ margin-right: 10px;
40
+ }
41
+
42
+ .merge-panel {
43
+ width: 275px;
44
+ border: 3px solid #ffffff;
45
+ padding: 15px;
46
+ background: #000000;
47
+ height: 20vh;
48
+ display: flex;
49
+ flex-direction: column;
50
+ margin-right: 10px;
51
+ margin-bottom: 10px;
52
+ }
53
+
54
+ .merge-content {
55
+ flex: 1;
56
+ border: 2px dashed #ffffff;
57
+ padding: 10px;
58
+ margin-bottom: 10px;
59
+ min-height: 60px;
60
+ display: flex;
61
+ flex-direction: column;
62
+ align-items: center;
63
+ justify-content: center;
64
+ }
65
+
66
+ .merge-placeholder {
67
+ color: #666666;
68
+ font-size: 10px;
69
+ text-align: center;
70
+ text-transform: uppercase;
71
+ letter-spacing: 1px;
72
+ }
73
+
74
+ .merge-item {
75
+ background: #333333;
76
+ border: 1px solid #ffffff;
77
+ padding: 5px 8px;
78
+ margin: 2px 0;
79
+ font-size: 9px;
80
+ color: #ffffff;
81
+ text-transform: uppercase;
82
+ display: flex;
83
+ justify-content: space-between;
84
+ align-items: center;
85
+ }
86
+
87
+ .merge-actions {
88
+ display: flex;
89
+ gap: 6px;
90
+ }
91
+
92
+ .filters-panel {
93
+ width: 275px;
94
+ border: 3px solid #ffffff;
95
+ padding: 15px;
96
+ background: #000000;
97
+ height: 80vh;
98
+ display: flex;
99
+ flex-direction: column;
100
+ opacity: 0;
101
+ transform: translateX(20px);
102
+ transition: all 0.3s ease;
103
+ }
104
+
105
+ .filters-panel.visible {
106
+ opacity: 1;
107
+ transform: translateX(0);
108
+ }
109
+
110
+ .container {
111
+ max-width: 100%;
112
+ margin: 0;
113
+ }
114
+
115
+ h1 {
116
+ font-size: 1.8em;
117
+ text-align: center;
118
+ margin-bottom: 20px;
119
+ font-weight: bold;
120
+ color: #ffffff;
121
+ border: 3px solid #ffffff;
122
+ padding: 10px;
123
+ text-transform: uppercase;
124
+ letter-spacing: 2px;
125
+ }
126
+
127
+ .section {
128
+ margin: 15px 0;
129
+ border: 3px solid #ffffff;
130
+ padding: 15px;
131
+ background: #000000;
132
+ }
133
+
134
+ .section h2 {
135
+ font-size: 1.2em;
136
+ margin-bottom: 15px;
137
+ font-weight: bold;
138
+ color: #ffffff;
139
+ text-transform: uppercase;
140
+ letter-spacing: 1px;
141
+ border-bottom: 2px solid #ffffff;
142
+ padding-bottom: 5px;
143
+ }
144
+
145
+ input[type="text"], textarea {
146
+ width: 100%;
147
+ background: #000000;
148
+ color: #ffffff;
149
+ border: 3px solid #ffffff;
150
+ padding: 10px;
151
+ font-family: 'Courier New', monospace;
152
+ font-size: 12px;
153
+ margin-bottom: 10px;
154
+ font-weight: bold;
155
+ }
156
+
157
+ input[type="text"]:focus, textarea:focus {
158
+ outline: none;
159
+ border-color: #ffffff;
160
+ box-shadow: 0 0 0 3px #ffffff;
161
+ }
162
+
163
+ button {
164
+ background: #000000;
165
+ color: #ffffff;
166
+ border: 3px solid #ffffff;
167
+ padding: 8px 15px;
168
+ font-family: 'Courier New', monospace;
169
+ font-size: 11px;
170
+ cursor: pointer;
171
+ margin-right: 8px;
172
+ margin-bottom: 8px;
173
+ font-weight: bold;
174
+ text-transform: uppercase;
175
+ letter-spacing: 1px;
176
+ }
177
+
178
+ button:hover {
179
+ background: #ffffff;
180
+ color: #000000;
181
+ }
182
+
183
+ button:disabled {
184
+ opacity: 0.5;
185
+ cursor: not-allowed;
186
+ }
187
+
188
+ .status {
189
+ margin: 10px 0;
190
+ padding: 12px;
191
+ border: 1px solid #444444;
192
+ background: #2a2a2a;
193
+ border-radius: 4px;
194
+ }
195
+
196
+ .error {
197
+ border-color: #ff4444;
198
+ background: #2a1a1a;
199
+ color: #ff6666;
200
+ }
201
+
202
+ .success {
203
+ border-color: #44ff44;
204
+ background: #1a2a1a;
205
+ color: #66ff66;
206
+ }
207
+
208
+ .paper-list {
209
+ margin-top: 20px;
210
+ }
211
+
212
+ .paper-item {
213
+ border: 1px solid #444444;
214
+ margin: 15px 0;
215
+ padding: 20px;
216
+ background: #1a1a1a;
217
+ border-radius: 6px;
218
+ box-shadow: 0 1px 3px rgba(255,255,255,0.1);
219
+ }
220
+
221
+ .paper-title {
222
+ font-weight: 600;
223
+ margin-bottom: 12px;
224
+ color: #ffffff;
225
+ font-size: 1.1em;
226
+ }
227
+
228
+ .paper-meta {
229
+ font-size: 0.9em;
230
+ color: #aaaaaa;
231
+ margin-bottom: 12px;
232
+ }
233
+
234
+ .paper-abstract {
235
+ font-size: 0.9em;
236
+ line-height: 1.3;
237
+ margin-bottom: 10px;
238
+ }
239
+
240
+ .relevance-reason {
241
+ font-size: 0.85em;
242
+ color: #aaaaaa;
243
+ font-style: italic;
244
+ margin-top: 12px;
245
+ padding: 10px;
246
+ border-left: 3px solid #444444;
247
+ background: #2a2a2a;
248
+ border-radius: 0 4px 4px 0;
249
+ }
250
+
251
+ .loading {
252
+ text-align: center;
253
+ padding: 30px;
254
+ color: #aaaaaa;
255
+ }
256
+
257
+ .stats {
258
+ display: flex;
259
+ gap: 20px;
260
+ margin: 20px 0;
261
+ flex-wrap: wrap;
262
+ }
263
+
264
+ .stat-item {
265
+ border: 1px solid #444444;
266
+ padding: 15px;
267
+ text-align: center;
268
+ min-width: 120px;
269
+ background: #1a1a1a;
270
+ border-radius: 6px;
271
+ box-shadow: 0 1px 3px rgba(255,255,255,0.1);
272
+ }
273
+
274
+ .stat-number {
275
+ font-size: 2em;
276
+ font-weight: 600;
277
+ color: #ffffff;
278
+ }
279
+
280
+ .stat-label {
281
+ font-size: 0.8em;
282
+ text-transform: uppercase;
283
+ letter-spacing: 1px;
284
+ color: #aaaaaa;
285
+ }
286
+
287
+ .history-panel h3, .filters-panel h3 {
288
+ color: #ffffff;
289
+ margin-bottom: 15px;
290
+ font-size: 1em;
291
+ font-weight: bold;
292
+ flex-shrink: 0;
293
+ text-transform: uppercase;
294
+ letter-spacing: 2px;
295
+ border: 2px solid #ffffff;
296
+ padding: 8px;
297
+ text-align: center;
298
+ }
299
+
300
+ .history-content {
301
+ flex: 1;
302
+ overflow-y: auto;
303
+ padding-right: 5px;
304
+ }
305
+
306
+ .history-content::-webkit-scrollbar {
307
+ width: 6px;
308
+ }
309
+
310
+ .history-content::-webkit-scrollbar-track {
311
+ background: #1a1a1a;
312
+ border-radius: 3px;
313
+ }
314
+
315
+ .history-content::-webkit-scrollbar-thumb {
316
+ background: #444444;
317
+ border-radius: 3px;
318
+ }
319
+
320
+ .history-content::-webkit-scrollbar-thumb:hover {
321
+ background: #666666;
322
+ }
323
+
324
+ .history-item {
325
+ background: #000000;
326
+ border: 2px solid #ffffff;
327
+ padding: 10px;
328
+ margin-bottom: 8px;
329
+ color: #ffffff;
330
+ cursor: pointer;
331
+ transition: all 0.2s ease;
332
+ }
333
+
334
+ .history-item:hover {
335
+ background: #333333;
336
+ color: #ffffff;
337
+ }
338
+
339
+ .collection-item {
340
+ border-left: 4px solid #ffffff;
341
+ }
342
+
343
+ .filter-item {
344
+ border-left: 4px solid #666666;
345
+ margin-left: 8px;
346
+ }
347
+
348
+ .history-item .history-title {
349
+ font-weight: bold;
350
+ color: #ffffff;
351
+ margin-bottom: 5px;
352
+ font-size: 0.9em;
353
+ text-transform: uppercase;
354
+ letter-spacing: 1px;
355
+ }
356
+
357
+ .history-item .history-meta {
358
+ font-size: 0.7em;
359
+ color: #aaaaaa;
360
+ margin-bottom: 6px;
361
+ font-weight: bold;
362
+ }
363
+
364
+ .download-btn, .delete-btn {
365
+ background: #000000;
366
+ color: #ffffff;
367
+ border: 2px solid #ffffff;
368
+ padding: 4px 8px;
369
+ font-size: 9px;
370
+ margin-right: 4px;
371
+ cursor: pointer;
372
+ font-weight: bold;
373
+ text-transform: uppercase;
374
+ letter-spacing: 1px;
375
+ transition: all 0.2s ease;
376
+ }
377
+
378
+ .download-btn:hover, .delete-btn:hover {
379
+ background: #ffffff;
380
+ color: #000000;
381
+ }
382
+
383
+ .delete-btn {
384
+ background: #000000;
385
+ border-color: #ffffff;
386
+ color: #ffffff;
387
+ }
388
+
389
+ .delete-btn:hover {
390
+ background: #ffffff;
391
+ color: #000000;
392
+ }
393
+
394
+ .paper-match:hover {
395
+ background: #2a2a2a !important;
396
+ }
397
+
398
+ .progress-container {
399
+ margin: 20px 0;
400
+ border: 1px solid #444444;
401
+ padding: 15px;
402
+ background: #1a1a1a;
403
+ border-radius: 6px;
404
+ }
405
+
406
+ .progress-bar {
407
+ width: 100%;
408
+ height: 20px;
409
+ background: #2a2a2a;
410
+ border: 1px solid #444444;
411
+ position: relative;
412
+ margin: 10px 0;
413
+ border-radius: 10px;
414
+ }
415
+
416
+ .progress-fill {
417
+ height: 100%;
418
+ background: #ffffff;
419
+ width: 0%;
420
+ transition: width 0.3s ease;
421
+ border-radius: 10px;
422
+ }
423
+
424
+ .progress-text {
425
+ position: absolute;
426
+ top: 50%;
427
+ left: 50%;
428
+ transform: translate(-50%, -50%);
429
+ color: #ffffff;
430
+ font-weight: bold;
431
+ font-size: 12px;
432
+ }
433
+
434
+ .export-section {
435
+ margin: 20px 0;
436
+ text-align: center;
437
+ }
438
+
439
+ .export-btn {
440
+ background: #2a2a2a;
441
+ color: #ffffff;
442
+ border: 1px solid #444444;
443
+ padding: 15px 30px;
444
+ font-family: inherit;
445
+ font-size: 16px;
446
+ cursor: pointer;
447
+ margin: 20px 0;
448
+ border-radius: 4px;
449
+ transition: all 0.15s ease-in-out;
450
+ }
451
+
452
+ .export-btn:hover {
453
+ background: #444444;
454
+ border-color: #666666;
455
+ }
456
+
457
+ .summary-section {
458
+ margin: 20px 0;
459
+ }
460
+
461
+ .summary-table {
462
+ background: #1a1a1a;
463
+ border: 1px solid #444444;
464
+ margin: 10px 0;
465
+ overflow-x: auto;
466
+ border-radius: 6px;
467
+ }
468
+
469
+ .summary-table table {
470
+ width: 100%;
471
+ border-collapse: collapse;
472
+ font-family: inherit;
473
+ font-size: 12px;
474
+ }
475
+
476
+ .summary-table th {
477
+ background: #2a2a2a;
478
+ color: #ffffff;
479
+ padding: 8px;
480
+ text-align: left;
481
+ border: 1px solid #444444;
482
+ font-weight: bold;
483
+ }
484
+
485
+ .summary-table td {
486
+ padding: 8px;
487
+ border: 1px solid #444444;
488
+ color: #ffffff;
489
+ vertical-align: top;
490
+ }
491
+
492
+ .summary-table tr:nth-child(even) {
493
+ background: #2a2a2a;
494
+ }
495
+
496
+ .relevance-yes {
497
+ color: #ffffff;
498
+ font-weight: bold;
499
+ }
500
+
501
+ .relevance-no {
502
+ color: #aaaaaa;
503
+ font-weight: bold;
504
+ }
505
+
506
+ .relevance-unknown {
507
+ color: #cccccc;
508
+ font-weight: bold;
509
+ }
510
+
511
+
512
+
513
+ @media (max-width: 768px) {
514
+ body {
515
+ padding: 10px;
516
+ }
517
+
518
+ h1 {
519
+ font-size: 1.44em;
520
+ padding: 15px;
521
+ }
522
+
523
+ .stats {
524
+ flex-direction: column;
525
+ }
526
+ }
527
+ </style>
528
+ </head>
529
+ <body>
530
+ <div class="main-content">
531
+ <div class="container">
532
+ <h1>Collect Literature and Filter by Research Question</h1>
533
+
534
+ <!-- Step 1: Collect Papers -->
535
+ <div class="section">
536
+ <h2>Step 1: Collect Related Papers</h2>
537
+ <p>Choose how to collect papers:</p>
538
+
539
+ <div style="margin: 15px 0;">
540
+ <label style="display: block; margin-bottom: 8px; font-weight: bold;">METHOD:</label>
541
+ <div style="display: flex; gap: 15px; margin-bottom: 15px;">
542
+ <label style="display: flex; align-items: center; cursor: pointer;">
543
+ <input type="radio" name="collectMethod" value="url" checked style="margin-right: 8px;">
544
+ <span>OpenAlex URL</span>
545
+ </label>
546
+ <label style="display: flex; align-items: center; cursor: pointer;">
547
+ <input type="radio" name="collectMethod" value="title" style="margin-right: 8px;">
548
+ <span>Search by Title</span>
549
+ </label>
550
+ </div>
551
+ </div>
552
+
553
+ <div id="urlInput" style="display: block;">
554
+ <p>Enter an OpenAlex paper URL to collect all related papers (cited, citing, and related works).</p>
555
+ <input type="text" id="seedUrl" placeholder="https://api.openalex.org/works/W1607201421" value="https://api.openalex.org/works/W1607201421" />
556
+ </div>
557
+
558
+ <div id="titleInput" style="display: none;">
559
+ <p>Enter a paper title to search for and collect related papers.</p>
560
+ <input type="text" id="paperTitle" placeholder="Enter paper title..." value="just transitions" />
561
+ <button onclick="searchPapers()" id="searchBtn" style="margin-left: 10px;">Search Papers</button>
562
+ <div id="paperMatches" style="display: none; margin-top: 15px;"></div>
563
+ </div>
564
+
565
+ <button onclick="collectPapers()" id="collectBtn">Collect Papers</button>
566
+ <div id="collectStatus" class="status" style="display: none;"></div>
567
+ <div id="collectDownload" style="display: none;">
568
+ <button onclick="downloadCollectionExcel()" class="download-btn">Download Collection Excel</button>
569
+ </div>
570
+ </div>
571
+
572
+ <!-- Step 2: Filter Papers -->
573
+ <div class="section">
574
+ <h2>Step 2: Filter by Research Question</h2>
575
+ <p>Enter your research question to filter the collected papers for relevance.</p>
576
+ <textarea id="researchQuestion" rows="3" placeholder="What are the main impacts of climate change on ocean circulation patterns?">What are the key aspects of just transitions in climate policy and energy systems?</textarea>
577
+ <div style="margin: 10px 0;">
578
+ <label>Number of most recent papers to analyze:</label>
579
+ <input type="number" id="paperLimit" value="10" min="1" max="50" style="width: 80px; margin-left: 10px;">
580
+ <div style="font-size: 11px; color: #0a0; margin-top: 5px;">
581
+ Max 50 papers. For more, please provide your own GPT API key.
582
+ </div>
583
+ </div>
584
+ <button onclick="filterPapers()" id="filterBtn" disabled>Filter Papers</button>
585
+ <div id="filterStatus" class="status" style="display: none;"></div>
586
+ <div id="filterDownload" style="display: none;">
587
+ <button onclick="downloadFilterExcel()" class="download-btn">Download Filter Excel</button>
588
+ </div>
589
+ </div>
590
+
591
+ <!-- Results -->
592
+ <div class="section" id="resultsSection" style="display: none;">
593
+ <h2>Results</h2>
594
+ <div class="stats" id="stats"></div>
595
+ <div class="export-section">
596
+ <button onclick="exportToExcel()" class="export-btn">Download Excel</button>
597
+ </div>
598
+ <div class="summary-section" id="summarySection" style="display: none;">
599
+ <h3>Analysis Summary</h3>
600
+ <div class="summary-table" id="summaryTable"></div>
601
+ </div>
602
+ <div class="paper-list" id="paperList"></div>
603
+ </div>
604
+ </div>
605
+ </div>
606
+
607
+ <!-- History Panel -->
608
+ <div class="history-panel">
609
+ <h3>COLLECTIONS</h3>
610
+ <div class="history-content">
611
+ <div id="collectionsList"></div>
612
+ </div>
613
+ </div>
614
+
615
+ <!-- Merge Panel -->
616
+ <div class="merge-panel">
617
+ <h3>MERGE COLLECTIONS</h3>
618
+ <div class="merge-content" id="mergeBox" ondrop="dropCollection(event)" ondragover="allowDrop(event)">
619
+ <div class="merge-placeholder">DRAG COLLECTIONS HERE TO MERGE</div>
620
+ <div id="mergeItems"></div>
621
+ </div>
622
+ <div class="merge-actions" id="mergeActions" style="display:none;">
623
+ <button onclick="saveMergedCollection()" class="download-btn">SAVE TO COLLECTIONS</button>
624
+ <button onclick="clearMergeBox()" class="delete-btn">CLEAR</button>
625
+ </div>
626
+ </div>
627
+
628
+ <!-- Filters Panel -->
629
+ <div class="filters-panel" id="filtersPanel">
630
+ <h3>FILTERS</h3>
631
+ <div class="history-content">
632
+ <div id="filtersContainer"></div>
633
+ </div>
634
+ </div>
635
+
636
+ <script>
637
+ let collectedPapers = [];
638
+ let lastDisplayedPapers = [];
639
+
640
+ // Set default values when page loads
641
+ document.addEventListener('DOMContentLoaded', function() {
642
+ document.getElementById('seedUrl').value = 'https://api.openalex.org/works/W1607201421';
643
+ document.getElementById('researchQuestion').value = 'What are the key aspects of just transitions in climate policy and energy systems?';
644
+ loadHistory();
645
+
646
+ // Handle radio button switching
647
+ document.querySelectorAll('input[name="collectMethod"]').forEach(radio => {
648
+ radio.addEventListener('change', function() {
649
+ const urlInput = document.getElementById('urlInput');
650
+ const titleInput = document.getElementById('titleInput');
651
+
652
+ if (this.value === 'url') {
653
+ urlInput.style.display = 'block';
654
+ titleInput.style.display = 'none';
655
+ } else {
656
+ urlInput.style.display = 'none';
657
+ titleInput.style.display = 'block';
658
+ // Auto-search when switching to title method
659
+ const paperTitle = document.getElementById('paperTitle').value.trim();
660
+ if (paperTitle) {
661
+ searchPapers();
662
+ }
663
+ }
664
+ });
665
+ });
666
+ });
667
+
668
+ let currentCollectionFile = null;
669
+ let currentFilterFile = null;
670
+ let historyIndex = { collections: {}, filters: {} };
671
+ let selectedWorkId = null;
672
+
673
+ function showStatus(elementId, message, type = 'success') {
674
+ const element = document.getElementById(elementId);
675
+ element.textContent = message;
676
+ element.className = `status ${type}`;
677
+ element.style.display = 'block';
678
+ }
679
+
680
+ function hideStatus(elementId) {
681
+ document.getElementById(elementId).style.display = 'none';
682
+ }
683
+
684
+ async function searchPapers() {
685
+ const paperTitle = document.getElementById('paperTitle').value.trim();
686
+ if (!paperTitle) {
687
+ showStatus('collectStatus', 'Please enter a paper title', 'error');
688
+ return;
689
+ }
690
+
691
+ const searchBtn = document.getElementById('searchBtn');
692
+ searchBtn.disabled = true;
693
+ searchBtn.textContent = 'Searching...';
694
+
695
+ try {
696
+ const response = await fetch('/api/search-papers', {
697
+ method: 'POST',
698
+ headers: {
699
+ 'Content-Type': 'application/json',
700
+ },
701
+ body: JSON.stringify({ paper_title: paperTitle })
702
+ });
703
+
704
+ const data = await response.json();
705
+
706
+ if (data.success) {
707
+ displayPaperMatches(data.matches);
708
+ } else {
709
+ showStatus('collectStatus', data.error || 'Search failed', 'error');
710
+ }
711
+ } catch (error) {
712
+ showStatus('collectStatus', `Search error: ${error.message}`, 'error');
713
+ } finally {
714
+ searchBtn.disabled = false;
715
+ searchBtn.textContent = 'Search Papers';
716
+ }
717
+ }
718
+
719
+ function displayPaperMatches(matches) {
720
+     const matchesDiv = document.getElementById('paperMatches');
+     matchesDiv.innerHTML = `
+         <h4 style="color: #ffffff; margin-bottom: 10px; font-size: 0.9em;">SELECT PAPER:</h4>
+         ${matches.map((match, index) => `
+             <div class="paper-match" data-work-id="${match.work_id}" onclick="selectPaper('${match.work_id}', this)" style="
+                 border: 2px solid #ffffff;
+                 padding: 10px;
+                 margin-bottom: 8px;
+                 cursor: pointer;
+                 background: #000000;
+                 transition: all 0.2s ease;
+             ">
+                 <div style="font-weight: bold; color: #ffffff; margin-bottom: 5px;">${match.title}</div>
+                 <div style="font-size: 0.8em; color: #aaaaaa; margin-bottom: 3px;">Authors: ${match.authors}</div>
+                 <div style="font-size: 0.8em; color: #aaaaaa; margin-bottom: 3px;">Year: ${match.year} | Venue: ${match.venue}</div>
+                 <div style="font-size: 0.7em; color: #666666;">Relevance: ${match.relevance_score}</div>
+             </div>
+         `).join('')}
+     `;
+     matchesDiv.style.display = 'block';
+ }
+
+ function selectPaper(workId, element) {
+     // Remove previous selection
+     document.querySelectorAll('.paper-match').forEach(match => {
+         match.style.background = '#000000';
+         match.style.borderColor = '#ffffff';
+     });
+
+     // Highlight the selected paper
+     element.style.background = '#ffffff';
+     element.style.color = '#000000';
+     element.style.borderColor = '#ffffff';
+
+     selectedWorkId = workId;
+
+     // Enable the collect button
+     document.getElementById('collectBtn').disabled = false;
+ }
+
+ async function collectPapers() {
+     const method = document.querySelector('input[name="collectMethod"]:checked').value;
+     let seedUrl = '';
+     let paperTitle = '';
+
+     if (method === 'url') {
+         seedUrl = document.getElementById('seedUrl').value.trim();
+         if (!seedUrl) {
+             showStatus('collectStatus', 'Please enter a seed URL', 'error');
+             return;
+         }
+     } else {
+         paperTitle = document.getElementById('paperTitle').value.trim();
+         if (!paperTitle) {
+             showStatus('collectStatus', 'Please enter a paper title', 'error');
+             return;
+         }
+         if (!selectedWorkId) {
+             showStatus('collectStatus', 'Please search and select a paper first', 'error');
+             return;
+         }
+     }
+
+     const collectBtn = document.getElementById('collectBtn');
+     collectBtn.disabled = true;
+     collectBtn.textContent = 'Collecting...';
+     hideStatus('collectStatus');
+
+     // Show progress container
+     const progressContainer = document.createElement('div');
+     progressContainer.className = 'progress-container';
+     progressContainer.innerHTML = `
+         <div id="progressMessage">Starting paper collection...</div>
+         <div class="progress-bar">
+             <div class="progress-fill" id="collectProgress"></div>
+             <div class="progress-text" id="collectProgressText">0%</div>
+         </div>
+     `;
+     document.getElementById('collectStatus').parentNode.insertBefore(progressContainer, document.getElementById('collectStatus'));
+
+     try {
+         const response = await fetch('/api/collect-papers', {
+             method: 'POST',
+             headers: {
+                 'Content-Type': 'application/json',
+             },
+             body: JSON.stringify({
+                 seed_url: seedUrl,
+                 paper_title: paperTitle,
+                 method: method,
+                 selected_work_id: selectedWorkId,
+                 user_api_key: window.userApiKey || null
+             })
+         });
+
+         const data = await response.json();
+
+         if (data.success && data.task_id) {
+             // Start polling for progress
+             pollProgress(data.task_id, 'collect', progressContainer);
+         } else {
+             showStatus('collectStatus', `Error: ${data.error}`, 'error');
+             collectBtn.disabled = false;
+             collectBtn.textContent = 'Collect Papers';
+             if (progressContainer.parentNode) {
+                 progressContainer.parentNode.removeChild(progressContainer);
+             }
+         }
+     } catch (error) {
+         showStatus('collectStatus', `Error: ${error.message}`, 'error');
+         collectBtn.disabled = false;
+         collectBtn.textContent = 'Collect Papers';
+         if (progressContainer.parentNode) {
+             progressContainer.parentNode.removeChild(progressContainer);
+         }
+     }
+ }
+
+ // Poll the backend for background-task progress. Note: only the collect flow
+ // uses this at the moment, so the progress elements are looked up by their
+ // collect-specific IDs and the 'type' parameter is currently unused.
+ async function pollProgress(taskId, type, progressContainer) {
+     const progressFill = document.getElementById('collectProgress');
+     const progressText = document.getElementById('collectProgressText');
+     const progressMessage = document.getElementById('progressMessage');
+
+     const pollInterval = setInterval(async () => {
+         try {
+             const response = await fetch(`/api/progress/${taskId}`);
+             const progress = await response.json();
+
+             if (progress.status === 'completed') {
+                 clearInterval(pollInterval);
+
+                 // Update progress bar to 100%
+                 progressFill.style.width = '100%';
+                 progressText.textContent = '100%';
+                 progressMessage.textContent = 'Collection completed!';
+
+                 // Process results
+                 const result = progress.result;
+                 collectedPapers = result.papers;
+                 const breakdown = `${result.cited_papers} cited + ${result.citing_papers} citing + ${result.related_papers} related`;
+                 showStatus('collectStatus', `Successfully collected ${result.total_papers} papers (${breakdown})`, 'success');
+                 document.getElementById('filterBtn').disabled = false;
+                 document.getElementById('resultsSection').style.display = 'block';
+                 updateStats(result.total_papers, 0, result.cited_papers, result.citing_papers, result.related_papers);
+                 currentCollectionFile = result.db_filename || null;
+                 historyIndex.currentCollectionId = result.work_id ? (result.work_id.replace('https://api.openalex.org/works/', '').replace('https://openalex.org/', '')) : null;
+                 document.getElementById('collectDownload').style.display = currentCollectionFile ? 'block' : 'none';
+
+                 // Reset button
+                 document.getElementById('collectBtn').disabled = false;
+                 document.getElementById('collectBtn').textContent = 'Collect Papers';
+
+                 // Refresh history to show the new collection
+                 loadHistory();
+
+                 // Remove progress container after a delay
+                 setTimeout(() => {
+                     if (progressContainer.parentNode) {
+                         progressContainer.parentNode.removeChild(progressContainer);
+                     }
+                 }, 2000);
+
+             } else if (progress.status === 'error') {
+                 clearInterval(pollInterval);
+                 showStatus('collectStatus', `Error: ${progress.message}`, 'error');
+                 document.getElementById('collectBtn').disabled = false;
+                 document.getElementById('collectBtn').textContent = 'Collect Papers';
+                 if (progressContainer.parentNode) {
+                     progressContainer.parentNode.removeChild(progressContainer);
+                 }
+             } else if (progress.status === 'running') {
+                 // Update progress bar (capped at 95% until completion)
+                 const progressPercent = Math.min(progress.progress || 0, 95);
+                 progressFill.style.width = `${progressPercent}%`;
+                 progressText.textContent = `${Math.round(progressPercent)}%`;
+                 progressMessage.textContent = progress.message || 'Processing...';
+             }
+         } catch (error) {
+             // Transient network errors: log and keep polling
+             console.error('Error polling progress:', error);
+         }
+     }, 1000); // Poll every second
+ }
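+
+ // NOTE (assumption, inferred from the polling code above rather than from the
+ // backend): /api/progress/<task_id> appears to return JSON shaped roughly like
+ //   { status: 'running' | 'completed' | 'error',
+ //     progress: 0-100,        // while running
+ //     message: '...',         // human-readable step
+ //     result: { papers, total_papers, cited_papers, citing_papers,
+ //               related_papers, db_filename, work_id } }   // on completion
+ // pollProgress() tolerates missing fields (progress defaults to 0, message to
+ // 'Processing...'), so only the 'status' key is strictly required.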
+
+ async function filterPapers() {
+     const researchQuestion = document.getElementById('researchQuestion').value.trim();
+     const paperLimit = parseInt(document.getElementById('paperLimit').value) || 10;
+
+     if (!researchQuestion) {
+         showStatus('filterStatus', 'Please enter a research question', 'error');
+         return;
+     }
+
+     // Check if the user wants to analyze more than 50 papers
+     if (paperLimit > 50) {
+         const userApiKey = prompt(`You want to analyze ${paperLimit} papers, which exceeds the limit of 50.\n\nPlease provide your own OpenAI API key to continue:\n\n(Your API key will be used only for this analysis and not stored)`);
+         if (!userApiKey || userApiKey.trim() === '') {
+             showStatus('filterStatus', 'Analysis cancelled - no API key provided', 'error');
+             return;
+         }
+         // Store the user's API key temporarily for this request
+         window.userApiKey = userApiKey.trim();
+     } else {
+         // Clear any previous user API key
+         window.userApiKey = null;
+     }
+
+     const filterBtn = document.getElementById('filterBtn');
+     filterBtn.disabled = true;
+     filterBtn.textContent = 'Filtering...';
+     hideStatus('filterStatus');
+
+     // Show progress container
+     const progressContainer = document.createElement('div');
+     progressContainer.className = 'progress-container';
+     progressContainer.innerHTML = `
+         <div id="filterProgressMessage">Analyzing most recent papers for relevance...</div>
+         <div class="progress-bar">
+             <div class="progress-fill" id="filterProgress"></div>
+             <div class="progress-text" id="filterProgressText">0%</div>
+         </div>
+     `;
+     document.getElementById('filterStatus').parentNode.insertBefore(progressContainer, document.getElementById('filterStatus'));
+
+     try {
+         const response = await fetch('/api/filter-papers', {
+             method: 'POST',
+             headers: {
+                 'Content-Type': 'application/json',
+             },
+             body: JSON.stringify({
+                 research_question: researchQuestion,
+                 limit: paperLimit,
+                 source_collection: historyIndex.currentCollectionId || null,
+                 papers: collectedPapers.length > 0 ? collectedPapers : null,
+                 user_api_key: window.userApiKey || null
+             })
+         });
+
+         const data = await response.json();
+
+         if (data.success) {
+             // Simulate progress for filtering (the backend call is synchronous)
+             let progress = 0;
+             const progressInterval = setInterval(() => {
+                 progress += 10;
+                 if (progress > 90) progress = 90;
+
+                 document.getElementById('filterProgress').style.width = `${progress}%`;
+                 document.getElementById('filterProgressText').textContent = `${progress}%`;
+                 document.getElementById('filterProgressMessage').textContent = `Analyzing most recent papers for relevance... ${progress}%`;
+
+                 if (progress >= 90) {
+                     clearInterval(progressInterval);
+
+                     // Complete the progress
+                     setTimeout(() => {
+                         document.getElementById('filterProgress').style.width = '100%';
+                         document.getElementById('filterProgressText').textContent = '100%';
+                         document.getElementById('filterProgressMessage').textContent = 'Analysis completed!';
+
+                         const tested = data.tested_papers || Math.min(data.limit || 0, data.total_papers || 0);
+                         showStatus('filterStatus', `Analyzed ${tested} most recent papers; found ${data.relevant_papers} relevant`, 'success');
+                         displayPapers(data.papers);
+                         updateStats(data.total_papers, data.relevant_papers, 0, 0, 0, null, null, tested, data.oa_percentage, data.abstract_percentage);
+                         currentFilterFile = data.db_filename || null;
+                         document.getElementById('filterDownload').style.display = currentFilterFile ? 'block' : 'none';
+
+                         filterBtn.disabled = false;
+                         filterBtn.textContent = 'Filter Papers';
+
+                         // Refresh history to show the new filter
+                         loadHistory();
+
+                         // Remove progress container after a delay
+                         setTimeout(() => {
+                             if (progressContainer.parentNode) {
+                                 progressContainer.parentNode.removeChild(progressContainer);
+                             }
+                         }, 2000);
+                     }, 500);
+                 }
+             }, 200);
+         } else {
+             showStatus('filterStatus', `Error: ${data.error}`, 'error');
+             filterBtn.disabled = false;
+             filterBtn.textContent = 'Filter Papers';
+             if (progressContainer.parentNode) {
+                 progressContainer.parentNode.removeChild(progressContainer);
+             }
+         }
+     } catch (error) {
+         showStatus('filterStatus', `Error: ${error.message}`, 'error');
+         filterBtn.disabled = false;
+         filterBtn.textContent = 'Filter Papers';
+         if (progressContainer.parentNode) {
+             progressContainer.parentNode.removeChild(progressContainer);
+         }
+     }
+ }
+
+ function updateStats(total, relevant, cited = 0, citing = 0, related = 0, relevantAbs = null, totalAbs = null, tested = null, oaPercentage = null, abstractPercentage = null) {
+     const statsDiv = document.getElementById('stats');
+     const rate = tested && tested > 0 ? Math.round((relevant / tested) * 100) : 0;
+     const absRate = (totalAbs !== null && totalAbs > 0 && relevantAbs !== null)
+         ? Math.round((relevantAbs / totalAbs) * 100)
+         : 0;
+     statsDiv.innerHTML = `
+         <div class="stat-item">
+             <div class="stat-number">${total}</div>
+             <div class="stat-label">Total Papers</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${tested || total}</div>
+             <div class="stat-label">Tested Papers</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${relevant}</div>
+             <div class="stat-label">Relevant Papers</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${rate}%</div>
+             <div class="stat-label">Rel. Rate</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${absRate}%</div>
+             <div class="stat-label">Rel. Rate (abs)</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${oaPercentage !== null ? oaPercentage + '%' : 'N/A'}</div>
+             <div class="stat-label">Open Access</div>
+         </div>
+         <div class="stat-item">
+             <div class="stat-number">${abstractPercentage !== null ? abstractPercentage + '%' : 'N/A'}</div>
+             <div class="stat-label">With Abstract</div>
+         </div>
+     `;
+ }
+
+ function computeAndUpdateRelevanceUsingPapers(papers) {
+     if (!Array.isArray(papers)) papers = [];
+     const total = papers.length;
+     let relevant = 0, relevantAbs = 0, totalAbs = 0;
+     for (const p of papers) {
+         const score = p && (p.relevance_score === true || p.relevance_score === 'true');
+         const hasInv = p && p.abstract_inverted_index && typeof p.abstract_inverted_index === 'object' && Object.keys(p.abstract_inverted_index).length > 0;
+         if (hasInv) totalAbs += 1;
+         if (score) {
+             relevant += 1;
+             if (hasInv) relevantAbs += 1;
+         }
+     }
+     updateStats(total, relevant, 0, 0, 0, relevantAbs, totalAbs);
+ }
+
+ function createSummaryTable(papers) {
+     const tableRows = papers.map((paper, index) => {
+         const title = paper.title || 'No title';
+         const relevanceScore = paper.relevance_score;
+         const relevanceReason = paper.relevance_reason || 'No analysis';
+         const gptAnalysis = paper.gpt_analysis || {};
+
+         // Check if the paper has an abstract
+         const hasAbstract = paper.abstract_inverted_index && Object.keys(paper.abstract_inverted_index).length > 0;
+         const aims = hasAbstract ? (gptAnalysis.aims_of_paper || 'Not analyzed') : 'N/A (abstract absent)';
+         const takeaways = hasAbstract ? (gptAnalysis.key_takeaways || 'Not analyzed') : 'N/A (abstract absent)';
+
+         let relevanceClass = 'relevance-unknown';
+         let relevanceText = 'Unknown';
+
+         if (relevanceScore === true || relevanceScore === 'true') {
+             relevanceClass = 'relevance-yes';
+             relevanceText = 'YES';
+         } else if (relevanceScore === false || relevanceScore === 'false') {
+             relevanceClass = 'relevance-no';
+             relevanceText = 'NO';
+         }
+
+         return `
+             <tr>
+                 <td>${index + 1}</td>
+                 <td title="${title}">${title.length > 60 ? title.substring(0, 60) + '...' : title}</td>
+                 <td class="${relevanceClass}">${relevanceText}</td>
+                 <td title="${relevanceReason}">${relevanceReason.length > 40 ? relevanceReason.substring(0, 40) + '...' : relevanceReason}</td>
+                 <td title="${aims}">${aims.length > 50 ? aims.substring(0, 50) + '...' : aims}</td>
+                 <td title="${takeaways}">${takeaways.length > 50 ? takeaways.substring(0, 50) + '...' : takeaways}</td>
+             </tr>
+         `;
+     }).join('');
+
+     return `
+         <table>
+             <thead>
+                 <tr>
+                     <th>#</th>
+                     <th>Paper Title</th>
+                     <th>Relevant?</th>
+                     <th>Relevance Reason</th>
+                     <th>Main Aims</th>
+                     <th>Key Takeaways</th>
+                 </tr>
+             </thead>
+             <tbody>
+                 ${tableRows}
+             </tbody>
+         </table>
+     `;
+ }
+
+ function displayPapers(papers) {
+     const paperListDiv = document.getElementById('paperList');
+     const summarySection = document.getElementById('summarySection');
+     const summaryTable = document.getElementById('summaryTable');
+
+     if (papers.length === 0) {
+         paperListDiv.innerHTML = '<div class="paper-item">No papers analyzed.</div>';
+         summarySection.style.display = 'none';
+         return;
+     }
+
+     // Show summary table
+     summarySection.style.display = 'block';
+     summaryTable.innerHTML = createSummaryTable(papers);
+     lastDisplayedPapers = papers;
+     // Update stats based on the papers themselves (overall and with abstracts)
+     computeAndUpdateRelevanceUsingPapers(papers);
+
+     paperListDiv.innerHTML = papers.map(paper => {
+         // Reconstruct the abstract from the inverted index
+         let abstract = '';
+         if (paper.abstract_inverted_index) {
+             const words = [];
+             for (const [word, positions] of Object.entries(paper.abstract_inverted_index)) {
+                 for (const pos of positions) {
+                     while (words.length <= pos) words.push('');
+                     words[pos] = word;
+                 }
+             }
+             abstract = words.join(' ').trim();
+         }
+
+         // Extract open access info
+         const oa = paper.open_access || {};
+         const isOa = oa.is_oa ? 'Yes' : 'No';
+         const oaStatus = oa.oa_status || '';
+
+         return `
+             <div class="paper-item">
+                 <div class="paper-title">${paper.title || 'No title'}</div>
+                 <div class="paper-meta">
+                     <strong>Date:</strong> ${paper.publication_date || 'Unknown'} |
+                     <strong>Type:</strong> ${paper.relationship || 'Unknown'} |
+                     <strong>Open Access:</strong> ${isOa} (${oaStatus}) |
+                     <strong>DOI:</strong> ${paper.doi ? paper.doi.replace('https://doi.org/', '') : 'N/A'}
+                 </div>
+                 <div class="paper-meta">
+                     <strong>Authors:</strong> ${paper.authors ? paper.authors.slice(0, 3).map(a => a.display_name).join(', ') : 'Unknown'}
+                 </div>
+                 <div class="paper-abstract">
+                     ${abstract ? abstract.substring(0, 300) + '...' : 'No abstract available'}
+                 </div>
+                 ${paper.relevance_reason ? `<div class="relevance-reason">${paper.relevance_reason}</div>` : ''}
+                 ${paper.gpt_analysis ? `
+                     <div class="relevance-reason">
+                         <strong>GPT Analysis:</strong><br>
+                         ${paper.gpt_analysis.aims_of_paper && paper.gpt_analysis.aims_of_paper !== 'N/A (abstract absent)' ?
+                             `<strong>Aims:</strong> ${paper.gpt_analysis.aims_of_paper}<br>` : ''}
+                         ${paper.gpt_analysis.key_takeaways && paper.gpt_analysis.key_takeaways !== 'N/A (abstract absent)' ?
+                             `<strong>Key Takeaways:</strong> ${paper.gpt_analysis.key_takeaways}` : ''}
+                     </div>
+                 ` : ''}
+             </div>
+         `;
+     }).join('');
+ }
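+
+ // NOTE (hedged sketch, not part of the original code): OpenAlex returns
+ // abstracts as an inverted index ({ word: [positions...] }) rather than plain
+ // text. displayPapers() above inlines the decoding; an equivalent standalone
+ // helper (the name invertedIndexToText is illustrative) would look like this:
+ function invertedIndexToText(invertedIndex) {
+     if (!invertedIndex || typeof invertedIndex !== 'object') return '';
+     const words = [];
+     for (const [word, positions] of Object.entries(invertedIndex)) {
+         for (const pos of positions) {
+             words[pos] = word; // sparse assignment; unfilled slots stay empty
+         }
+     }
+     // join(' ') renders gaps as empty strings; collapse the extra whitespace
+     return words.join(' ').replace(/\s+/g, ' ').trim();
+ }
+ // Example: invertedIndexToText({ Deep: [0], learning: [1], works: [2] })
+ // -> 'Deep learning works'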
+
+ async function exportToExcel() {
+     try {
+         const response = await fetch('/api/export-excel');
+         if (response.ok) {
+             const blob = await response.blob();
+             const url = window.URL.createObjectURL(blob);
+             const a = document.createElement('a');
+             a.href = url;
+             a.download = `research_papers_${new Date().toISOString().split('T')[0]}.xlsx`;
+             document.body.appendChild(a);
+             a.click();
+             window.URL.revokeObjectURL(url);
+             document.body.removeChild(a);
+         } else {
+             const error = await response.json();
+             alert(`Error exporting Excel: ${error.error}`);
+         }
+     } catch (error) {
+         alert(`Error exporting Excel: ${error.message}`);
+     }
+ }
+
+ async function downloadCollectionExcel() {
+     if (!currentCollectionFile) {
+         alert('No collection file available');
+         return;
+     }
+     try {
+         const response = await fetch(`/api/export-excel/${currentCollectionFile}`);
+         if (response.ok) {
+             const blob = await response.blob();
+             const url = window.URL.createObjectURL(blob);
+             const a = document.createElement('a');
+             a.href = url;
+             a.download = `collection_${currentCollectionFile.replace('.pkl', '')}.xlsx`;
+             document.body.appendChild(a);
+             a.click();
+             window.URL.revokeObjectURL(url);
+             document.body.removeChild(a);
+         } else {
+             const error = await response.json();
+             alert(`Error exporting Excel: ${error.error}`);
+         }
+     } catch (error) {
+         alert(`Error exporting Excel: ${error.message}`);
+     }
+ }
+
+ async function downloadFilterExcel() {
+     if (!currentFilterFile) {
+         alert('No filter file available');
+         return;
+     }
+     try {
+         const response = await fetch(`/api/export-excel/${currentFilterFile}`);
+         if (response.ok) {
+             const blob = await response.blob();
+             const url = window.URL.createObjectURL(blob);
+             const a = document.createElement('a');
+             a.href = url;
+             a.download = `filter_${currentFilterFile.replace('.pkl', '')}.xlsx`;
+             document.body.appendChild(a);
+             a.click();
+             window.URL.revokeObjectURL(url);
+             document.body.removeChild(a);
+         } else {
+             const error = await response.json();
+             alert(`Error exporting Excel: ${error.error}`);
+         }
+     } catch (error) {
+         alert(`Error exporting Excel: ${error.message}`);
+     }
+ }
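+
+ // NOTE (hedged refactor sketch, not part of the original code): exportToExcel,
+ // downloadCollectionExcel, downloadFilterExcel and downloadHistoryExcel (below)
+ // all repeat the same create-link/click/revoke dance. A shared helper (the name
+ // downloadBlob is illustrative) would remove the duplication:
+ function downloadBlob(blob, filename) {
+     const url = window.URL.createObjectURL(blob);
+     const a = document.createElement('a');
+     a.href = url;
+     a.download = filename;
+     document.body.appendChild(a);
+     a.click();
+     window.URL.revokeObjectURL(url);
+     document.body.removeChild(a);
+ }
+ // Usage inside the handlers: downloadBlob(await response.blob(), 'papers.xlsx');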
+
+ async function loadHistory() {
+     try {
+         const response = await fetch('/api/database-files');
+         const data = await response.json();
+         if (data.success) {
+             buildHistoryIndex(data.files);
+             displayHistory(data.files);
+         }
+     } catch (error) {
+         console.error('Error loading history:', error);
+     }
+ }
+
+ function buildHistoryIndex(files) {
+     historyIndex = { collections: {}, filters: {}, currentCollectionId: null };
+     files.forEach(file => {
+         if (file.type === 'collection') {
+             const id = file.work_identifier || file.filename.replace('.pkl', '');
+             historyIndex.collections[id] = file;
+         } else if (file.type === 'filter') {
+             // Group filters by their source collection
+             const sourceCollection = file.source_collection || 'unknown';
+             if (!historyIndex.filters[sourceCollection]) {
+                 historyIndex.filters[sourceCollection] = [];
+             }
+             historyIndex.filters[sourceCollection].push(file);
+         }
+     });
+ }
+
+ function displayHistory(files) {
+     const collectionsList = document.getElementById('collectionsList');
+
+     // Separate collections and filters
+     const collections = files.filter(file => file.type === 'collection');
+     const filters = files.filter(file => file.type === 'filter');
+
+     if (collections.length === 0) {
+         collectionsList.innerHTML = '<div class="history-item">No collections found</div>';
+         return;
+     }
+
+     // Display collections
+     collectionsList.innerHTML = collections.map(collection => {
+         const title = collection.title || collection.work_identifier || 'UNTITLED COLLECTION';
+         const linkedFilters = filters.filter(filter => filter.source_collection === collection.work_identifier);
+
+         return `
+             <div class="history-item collection-item" data-collection="${collection.work_identifier || ''}" onclick="selectCollection('${collection.filename}', '${collection.work_identifier || ''}', '${title}')" draggable="true" ondragstart="dragCollection(event, '${collection.filename}', '${title}', ${collection.total_papers || 0})">
+                 <div class="history-title">${title}</div>
+                 <div class="history-meta">${collection.created}</div>
+                 <div class="history-meta">${(collection.size / 1024).toFixed(1)} KB</div>
+                 <div class="history-meta">${collection.total_papers || 0} PAPER${(collection.total_papers || 0) !== 1 ? 'S' : ''}</div>
+                 <div class="history-meta">${linkedFilters.length} FILTER${linkedFilters.length !== 1 ? 'S' : ''}</div>
+                 <div style="margin-top:8px; display:grid; grid-template-columns: 1fr 1fr; grid-template-rows: 1fr 1fr; gap:6px; width:100%;">
+                     <button onclick="event.stopPropagation(); openCollection('${collection.filename}', '${collection.work_identifier || ''}')" class="download-btn" style="margin:0;">OPEN</button>
+                     <button onclick="event.stopPropagation(); downloadHistoryExcel('${collection.filename}')" class="download-btn" style="margin:0;">DOWNLOAD</button>
+                     <button onclick="event.stopPropagation(); generateBibtex(event, '${collection.filename}')" class="download-btn" style="margin:0;">BIBTEX</button>
+                     <button onclick="event.stopPropagation(); deleteHistoryFile('${collection.filename}', '${collection.type}')" class="delete-btn" style="margin:0;">DELETE</button>
+                 </div>
+             </div>
+         `;
+     }).join('');
+ }
+
+ function selectCollection(filename, workIdentifier, title) {
+     // Get the filters linked to this collection
+     const filters = historyIndex.filters[workIdentifier] || [];
+     const filtersContainer = document.getElementById('filtersContainer');
+     const filtersPanel = document.getElementById('filtersPanel');
+
+     if (filters.length === 0) {
+         filtersContainer.innerHTML = '<div class="history-item">NO FILTERS FOUND</div>';
+     } else {
+         filtersContainer.innerHTML = filters.map(filter => {
+             const filterTitle = filter.research_question || filter.filter_identifier || 'UNTITLED FILTER';
+             const papersTested = filter.tested_papers || filter.papers_tested || filter.total_papers || 'N/A';
+             return `
+                 <div class="history-item filter-item" data-filter-source="${filter.source_collection || ''}" onclick="openFilter('${filter.filename}', '${filter.source_collection || ''}')">
+                     <div class="history-title">${filterTitle}</div>
+                     <div class="history-meta">${filter.created}</div>
+                     <div class="history-meta">${(filter.size / 1024).toFixed(1)} KB</div>
+                     <div class="history-meta">${papersTested} PAPERS TESTED</div>
+                     <div style="margin-top:8px; display:flex; gap:6px;">
+                         <button onclick="event.stopPropagation(); openFilter('${filter.filename}', '${filter.source_collection || ''}')" class="download-btn">OPEN</button>
+                         <button onclick="event.stopPropagation(); downloadHistoryExcel('${filter.filename}')" class="download-btn">DOWNLOAD</button>
+                         <button onclick="event.stopPropagation(); deleteHistoryFile('${filter.filename}', '${filter.type}')" class="delete-btn">DELETE</button>
+                     </div>
+                 </div>
+             `;
+         }).join('');
+     }
+
+     // Show filters panel with animation
+     filtersPanel.classList.add('visible');
+ }
+
+ window.highlightLinked = function(el, on) {
+     try {
+         const src = el.getAttribute('data-filter-source');
+         if (src) {
+             const items = document.querySelectorAll(`[data-collection="${src}"]`);
+             items.forEach(item => item.classList.toggle('highlight', on));
+         }
+     } catch (e) {}
+ }
+
+ window.openCollection = async function(filename, workIdentifier) {
+     try {
+         const response = await fetch(`/api/load-database-file/${filename}`);
+         const data = await response.json();
+         if (data.success) {
+             const fileData = data.data || {};
+             const papers = fileData.papers || [];
+             displayPapers(papers);
+             document.getElementById('resultsSection').style.display = 'block';
+             updateStats(fileData.total_papers || papers.length || 0, 0, fileData.cited_papers || 0, fileData.citing_papers || 0, fileData.related_papers || 0);
+             currentCollectionFile = filename;
+             currentFilterFile = null;
+             historyIndex.currentCollectionId = workIdentifier || (fileData.work_identifier || '');
+             document.getElementById('collectDownload').style.display = 'block';
+             document.getElementById('filterDownload').style.display = 'none';
+             // Enable the filter button when opening a collection
+             document.getElementById('filterBtn').disabled = false;
+             // Keep the papers in memory so they can be filtered
+             collectedPapers = papers;
+         }
+     } catch (error) {
+         alert(`Error opening collection: ${error.message}`);
+     }
+ }
+
+ window.openFilter = async function(filename, sourceCollectionId) {
+     try {
+         const response = await fetch(`/api/load-database-file/${filename}`);
+         const data = await response.json();
+         if (data.success) {
+             const fileData = data.data || {};
+             const papers = fileData.papers || [];
+
+             // Populate Step 2 with the research question
+             const researchQuestion = fileData.research_question || '';
+             const paperLimit = fileData.tested_papers || fileData.limit || 10;
+
+             document.getElementById('researchQuestion').value = researchQuestion;
+             document.getElementById('paperLimit').value = paperLimit;
+
+             // Display the filtered papers
+             displayPapers(papers);
+             document.getElementById('resultsSection').style.display = 'block';
+
+             // Update stats with all saved statistics
+             const totalPapers = fileData.total_papers || 0;
+             const relevantPapers = fileData.relevant_papers || papers.length || 0;
+             const testedPapers = fileData.tested_papers || fileData.limit || 0;
+             const oaPercentage = fileData.oa_percentage || null;
+             const abstractPercentage = fileData.abstract_percentage || null;
+
+             updateStats(
+                 totalPapers,
+                 relevantPapers,
+                 0,     // cited
+                 0,     // citing
+                 0,     // related
+                 null,  // relevantAbs
+                 null,  // totalAbs
+                 testedPapers,
+                 oaPercentage,
+                 abstractPercentage
+             );
+
+             // Update state
+             currentFilterFile = filename;
+             currentCollectionFile = null;
+             historyIndex.currentCollectionId = sourceCollectionId || fileData.source_collection || null;
+
+             // Show the appropriate download buttons
+             document.getElementById('filterDownload').style.display = 'block';
+             document.getElementById('collectDownload').style.display = 'none';
+
+             // Enable the filter button since we now have a research question
+             document.getElementById('filterBtn').disabled = false;
+         }
+     } catch (error) {
+         alert(`Error opening filter: ${error.message}`);
+     }
+ }
+
+ async function loadHistoryFile(filename) {
+     try {
+         const response = await fetch(`/api/load-database-file/${filename}`);
+         const data = await response.json();
+         if (data.success) {
+             const fileData = data.data;
+             if (fileData.papers) {
+                 displayPapers(fileData.papers);
+                 document.getElementById('resultsSection').style.display = 'block';
+             }
+         }
+     } catch (error) {
+         alert(`Error loading file: ${error.message}`);
+     }
+ }
+
+ async function downloadHistoryExcel(filename) {
+     try {
+         const response = await fetch(`/api/export-excel/${filename}`);
+         if (response.ok) {
+             const blob = await response.blob();
+             const url = window.URL.createObjectURL(blob);
+             const a = document.createElement('a');
+             a.href = url;
+             a.download = filename.replace('.pkl', '.xlsx');
+             document.body.appendChild(a);
+             a.click();
+             window.URL.revokeObjectURL(url);
+             document.body.removeChild(a);
+         } else {
+             const error = await response.json();
+             alert(`Error exporting Excel: ${error.error}`);
+         }
+     } catch (error) {
+         alert(`Error exporting Excel: ${error.message}`);
+     }
+ }
+
+ // The click event is passed in explicitly so the button state can be managed
+ // without relying on the non-standard global window.event (the previous version
+ // read event.target inside the function body, which only works in some browsers,
+ // and its finally block re-read event.target and hard-coded the label instead of
+ // restoring the saved originalText).
+ async function generateBibtex(event, filename) {
+     const button = event.target;
+     const originalText = button.textContent;
+     try {
+         // Show loading state
+         button.textContent = 'GENERATING...';
+         button.disabled = true;
+
+         const response = await fetch(`/api/generate-bibtex/${filename}`, {
+             method: 'POST',
+             headers: {
+                 'Content-Type': 'application/json'
+             }
+         });
+
+         const result = await response.json();
+
+         if (result.success) {
+             // Download the generated BibTeX file
+             try {
+                 const downloadResponse = await fetch(`/api/download-database-file/${result.filename}`);
+                 if (downloadResponse.ok) {
+                     const blob = await downloadResponse.blob();
+                     const url = window.URL.createObjectURL(blob);
+                     const a = document.createElement('a');
+                     a.href = url;
+                     a.download = result.filename;
+                     document.body.appendChild(a);
+                     a.click();
+                     window.URL.revokeObjectURL(url);
+                     document.body.removeChild(a);
+
+                     alert(`BibTeX file generated and downloaded successfully with ${result.entries_count} entries!`);
+                 } else {
+                     const errorText = await downloadResponse.text();
+                     console.error('Download failed:', downloadResponse.status, errorText);
+                     alert(`BibTeX generated but download failed (${downloadResponse.status}). The file is saved in the database directory.`);
+                 }
+             } catch (downloadError) {
+                 console.error('Download error:', downloadError);
+                 alert(`BibTeX generated but download failed: ${downloadError.message}. The file is saved in the database directory.`);
+             }
+         } else {
+             alert(`Error generating BibTeX: ${result.message}`);
+         }
+     } catch (error) {
+         alert(`Error generating BibTeX: ${error.message}`);
+     } finally {
+         // Restore button state
+         button.textContent = originalText;
+         button.disabled = false;
+     }
+ }
+
+ async function deleteHistoryFile(filename, type) {
+     const confirmation = prompt(`Are you sure you want to delete this ${type}?\n\nType "delete" to confirm deletion of: ${filename}`);
+     if (confirmation !== 'delete') {
+         return;
+     }
+
+     try {
+         const response = await fetch(`/api/delete-database-file/${filename}`, {
+             method: 'DELETE'
+         });
+         const data = await response.json();
+
+         if (data.success) {
+             alert('File deleted successfully');
+             // Reload history to update the list
+             loadHistory();
+         } else {
+             alert(`Error deleting file: ${data.error}`);
+         }
+     } catch (error) {
+         alert(`Error deleting file: ${error.message}`);
+     }
+ }
+
+ // Merge functionality
+ let mergedCollections = [];
+
+ function dragCollection(event, filename, title, paperCount) {
+     event.dataTransfer.setData("text/plain", JSON.stringify({
+         filename: filename,
+         title: title,
+         paperCount: paperCount
+     }));
+ }
+
+ function allowDrop(event) {
+     event.preventDefault();
+ }
+
+ function dropCollection(event) {
+     event.preventDefault();
+     const data = JSON.parse(event.dataTransfer.getData("text/plain"));
+
+     // Ignore collections that are already in the merge box
+     if (mergedCollections.some(item => item.filename === data.filename)) {
+         return;
+     }
+
+     mergedCollections.push(data);
+     updateMergeBox();
+ }
+
+ function updateMergeBox() {
+     const mergeItems = document.getElementById('mergeItems');
+     const mergeActions = document.getElementById('mergeActions');
+     const placeholder = document.querySelector('.merge-placeholder');
+
+     if (mergedCollections.length === 0) {
+         mergeItems.innerHTML = '';
+         mergeActions.style.display = 'none';
+         placeholder.style.display = 'block';
+     } else {
+         placeholder.style.display = 'none';
+         mergeActions.style.display = 'flex';
+
+         mergeItems.innerHTML = mergedCollections.map((item, index) => `
+             <div class="merge-item">
+                 <span>${item.title} (${item.paperCount} papers)</span>
+                 <button onclick="removeFromMerge(${index})" style="background:none; border:none; color:#ffffff; cursor:pointer; font-size:12px;">×</button>
+             </div>
+         `).join('');
+     }
+ }
+
+ function removeFromMerge(index) {
+     mergedCollections.splice(index, 1);
+     updateMergeBox();
+ }
+
+ function clearMergeBox() {
+     mergedCollections = [];
+     updateMergeBox();
+ }
+
+ async function saveMergedCollection() {
+     if (mergedCollections.length < 2) {
+         alert('Please add at least 2 collections to merge');
+         return;
+     }
+
+     try {
+         const response = await fetch('/api/merge-collections', {
+             method: 'POST',
+             headers: {
+                 'Content-Type': 'application/json'
+             },
+             body: JSON.stringify({
+                 collections: mergedCollections.map(item => item.filename)
+             })
+         });
+
+         const result = await response.json();
+
+         if (result.success) {
+             alert(`Merged collection created successfully with ${result.total_papers} papers!`);
+             clearMergeBox();
+             loadHistory(); // Refresh the collections list
+         } else {
+             alert(`Error merging collections: ${result.message}`);
+         }
+     } catch (error) {
+         alert(`Error merging collections: ${error.message}`);
+     }
+ }
+
+ </script>
+ </body>
+ </html>