Davidvandijcke commited on
Commit
816ce76
·
1 Parent(s): 627fbbe

Update to professional assistant with Gemini 2.5 Flash Preview

Browse files

- Uses Gemini 2.5 Flash Preview for better responses
- Professional chat interface
- 15 question limit per session
- Assistant speaks as expert about David (third person)
- Improved prompting for concise, informative responses
- Full paper loading for comprehensive context

.env.example CHANGED
@@ -1,4 +1,9 @@
1
  # Google AI API Key (optional but recommended)
2
  # Get your API key from https://aistudio.google.com/app/apikey
3
  # If not provided, the app will use a limited mode with lower quality
4
- # GOOGLE_API_KEY=your_api_key_here
 
 
 
 
 
 
1
  # Google AI API Key (optional but recommended)
2
  # Get your API key from https://aistudio.google.com/app/apikey
3
  # If not provided, the app will use a limited mode with lower quality
4
+ GOOGLE_API_KEY=your_google_api_key_here
5
+
6
+
7
+ # Optional: Override default model names
8
+ # EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
9
+ # LLM_MODEL=gemini-1.5-flash
.gitignore CHANGED
@@ -3,6 +3,10 @@
3
  .env.local
4
  .env.*.local
5
 
 
 
 
 
6
  # Cache
7
  vector_store_cache/
8
  __pycache__/
 
3
  .env.local
4
  .env.*.local
5
 
6
+ # uv
7
+ .venv/
8
+ uv.lock
9
+
10
  # Cache
11
  vector_store_cache/
12
  __pycache__/
IMPROVEMENTS_SUMMARY.md ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Research Assistant Improvements Summary
2
+
3
+ ## Overview
4
+
5
+ I've significantly improved the David Van Dijcke Research Assistant to provide more comprehensive and accurate responses by leveraging Gemini's large context window and implementing smart retrieval strategies.
6
+
7
+ ## Key Improvements
8
+
9
+ ### 1. Full Paper Loading (`app_full_context.py`)
10
+ - **Before**: Only loaded first 3-10 pages of each PDF
11
+ - **After**: Loads complete papers (all pages)
12
+ - **Impact**: Complete context for accurate, detailed responses
13
+
14
+ ### 2. Smart Retrieval (`app_optimized.py`)
15
+ - **Query Type Detection**: Identifies technical vs overview vs application queries
16
+ - **Section Extraction**: Intelligently parses papers into sections (intro, theory, results, etc.)
17
+ - **Hierarchical Search**: Uses both section-level and chunk-level retrieval
18
+ - **Response Caching**: Instant responses for repeated queries
19
+
20
+ ### 3. Enhanced Context Window Usage
21
+ - **Chunk Size**: Increased from 500 to 2000 characters
22
+ - **Context Limit**: Up to 1M characters (250k tokens) for Gemini 2.0 Flash
23
+ - **Paper Selection**: Smart selection of most relevant papers based on query
24
+
25
+ ### 4. UV Package Management
26
+ - **Faster Installation**: UV is significantly faster than pip
27
+ - **Better Dependency Resolution**: More reliable builds
28
+ - **Multiple Configurations**: Easy switching between versions
29
+ - **Lock File Support**: Reproducible environments
30
+
31
+ ## Performance Comparison
32
+
33
+ | Metric | Original | Full Context | Optimized |
34
+ |--------|----------|--------------|-----------|
35
+ | Pages Loaded | 3-10 | All | All |
36
+ | Chunk Size | 500 chars | 2000 chars | 1000-3000 chars |
37
+ | Context Window | ~2k chars | ~1M chars | Smart selection |
38
+ | Response Quality | Basic | Comprehensive | Targeted & Detailed |
39
+ | Speed | Fast | Slower | Fast (with caching) |
40
+
41
+ ## Usage Recommendations
42
+
43
+ 1. **For General Q&A**: Use `app_optimized.py` (best balance)
44
+ 2. **For Deep Technical Questions**: Use `app_full_context.py`
45
+ 3. **For Quick Testing**: Use original `app.py`
46
+ 4. **For Production**: Deploy `app_optimized.py` with caching
47
+
48
+ ## Technical Details
49
+
50
+ ### Vector Store Strategy
51
+ - **Chunks Store**: Smaller chunks (1000 chars) for detailed retrieval
52
+ - **Sections Store**: Larger chunks (3000 chars) for context preservation
53
+ - **Caching**: Separate caches for different chunking strategies
54
+
55
+ ### Query Processing Pipeline
56
+ 1. Query type classification
57
+ 2. Relevant paper identification (keyword + embedding search)
58
+ 3. Section/chunk retrieval based on query type
59
+ 4. Context assembly with priority ordering
60
+ 5. Response generation with Gemini
61
+ 6. Response caching for efficiency
62
+
63
+ ### Memory Optimization
64
+ - Lazy loading of papers
65
+ - JSON caching of processed papers
66
+ - Separate vector stores by granularity
67
+ - Response cache with query normalization
68
+
69
+ ## Next Steps
70
+
71
+ 1. **Fine-tune Retrieval**: Adjust weights for different query types
72
+ 2. **Add Conversation Memory**: Track context across multiple queries
73
+ 3. **Implement Streaming**: Stream responses for better UX
74
+ 4. **Add Citations**: Include specific page/section references
75
+ 5. **Multi-modal Support**: Include figures and tables from papers
README.md CHANGED
@@ -14,6 +14,13 @@ license: mit
14
 
15
  An AI-powered assistant specializing in David Van Dijcke's econometric research. David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
16
 
 
 
 
 
 
 
 
17
  ## Features
18
 
19
  - **Econometric Methods Focus**: Detailed information about David's methodological contributions
@@ -22,6 +29,15 @@ An AI-powered assistant specializing in David Van Dijcke's econometric research.
22
  - **Policy Applications**: How David applies econometric tools to answer questions with big data
23
  - **Research Portfolio**: Information on FDR, DISCO, RTO, and other papers
24
 
 
 
 
 
 
 
 
 
 
25
  ## Getting the Best Performance
26
 
27
  For high quality, accurate responses at very low cost, use Google's Gemini 2.5 Flash:
@@ -54,6 +70,33 @@ This space is designed to run on Hugging Face Spaces with CPU inference.
54
 
55
  ## Local Development
56
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
57
  1. Install requirements:
58
  ```bash
59
  pip install -r requirements.txt
@@ -62,4 +105,6 @@ pip install -r requirements.txt
62
  2. Run the app:
63
  ```bash
64
  python app.py
65
- ```
 
 
 
14
 
15
  An AI-powered assistant specializing in David Van Dijcke's econometric research. David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
16
 
17
+ ## Available Versions
18
+
19
+ 1. **app.py** - Original version with basic chunking
20
+ 2. **app_improved.py** - Enhanced version with better prompts
21
+ 3. **app_full_context.py** - Full paper loading with Gemini's large context window
22
+ 4. **app_optimized.py** - Smart retrieval with section extraction and caching
23
+
24
  ## Features
25
 
26
  - **Econometric Methods Focus**: Detailed information about David's methodological contributions
 
29
  - **Policy Applications**: How David applies econometric tools to answer questions with big data
30
  - **Research Portfolio**: Information on FDR, DISCO, RTO, and other papers
31
 
32
+ ### New Improvements
33
+
34
+ - **Full Paper Loading**: Reads complete PDFs instead of just first few pages
35
+ - **Large Context Window**: Leverages Gemini 2.0 Flash's 1M+ token context
36
+ - **Smart Retrieval**: Query-type based retrieval (technical, overview, application)
37
+ - **Section Extraction**: Intelligent parsing of paper sections
38
+ - **Response Caching**: Instant responses for repeated queries
39
+ - **Hierarchical Search**: Both section-level and chunk-level retrieval
40
+
41
  ## Getting the Best Performance
42
 
43
  For high quality, accurate responses at very low cost, use Google's Gemini 2.5 Flash:
 
70
 
71
  ## Local Development
72
 
73
+ ### Option 1: Using UV (Recommended)
74
+
75
+ 1. Install UV:
76
+ ```bash
77
+ curl -LsSf https://astral.sh/uv/install.sh | sh
78
+ ```
79
+
80
+ 2. Create virtual environment and install dependencies:
81
+ ```bash
82
+ uv venv
83
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
84
+ uv pip install -e .
85
+ ```
86
+
87
+ 3. Copy environment file and add your API key:
88
+ ```bash
89
+ cp .env.example .env
90
+ # Edit .env and add your GOOGLE_API_KEY
91
+ ```
92
+
93
+ 4. Run the app:
94
+ ```bash
95
+ python app.py
96
+ ```
97
+
98
+ ### Option 2: Using pip
99
+
100
  1. Install requirements:
101
  ```bash
102
  pip install -r requirements.txt
 
105
  2. Run the app:
106
  ```bash
107
  python app.py
108
+ ```
109
+
110
+ See `README_UV_SETUP.md` for detailed UV setup instructions.
README_UV_SETUP.md ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # UV Setup Guide for David Research Assistant
2
+
3
+ This guide explains how to set up the development environment using `uv` instead of `pip`.
4
+
5
+ ## Prerequisites
6
+
7
+ Install `uv` if you haven't already:
8
+ ```bash
9
+ # macOS/Linux
10
+ curl -LsSf https://astral.sh/uv/install.sh | sh
11
+
12
+ # or using pip
13
+ pip install uv
14
+ ```
15
+
16
+ ## Setup Instructions
17
+
18
+ 1. **Create and activate virtual environment:**
19
+ ```bash
20
+ cd david-research-assistant
21
+ uv venv
22
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
23
+ ```
24
+
25
+ 2. **Install dependencies:**
26
+
27
+ For the standard version (with Google Generative AI):
28
+ ```bash
29
+ uv pip install -e .
30
+ ```
31
+
32
+ For the improved version (with Hugging Face):
33
+ ```bash
34
+ uv pip install -e ".[improved]"
35
+ ```
36
+
37
+ For development (includes testing and linting tools):
38
+ ```bash
39
+ uv pip install -e . --all-extras
40
+ uv pip install -e ".[test]"
41
+ ```
42
+
43
+ 3. **Set up environment variables:**
44
+ ```bash
45
+ cp .env.example .env
46
+ # Edit .env and add your GOOGLE_API_KEY if using the standard version
47
+ ```
48
+
49
+ ## Running the Application
50
+
51
+ ```bash
52
+ # Original version (basic chunking)
53
+ python app.py
54
+
55
+ # Improved version (better prompts)
56
+ python app_improved.py
57
+
58
+ # Full context version (complete papers)
59
+ python app_full_context.py
60
+
61
+ # Optimized version (smart retrieval)
62
+ python app_optimized.py
63
+
64
+ # Run tests
65
+ python test_assistant.py
66
+ python test_full_context.py
67
+ ```
68
+
69
+ ## Benefits of using UV
70
+
71
+ - **Faster installation**: UV is written in Rust and is significantly faster than pip
72
+ - **Better dependency resolution**: More reliable and predictable
73
+ - **Lock file support**: `uv.lock` ensures reproducible builds
74
+ - **Built-in virtual environment management**: No need for separate venv/virtualenv
75
+
76
+ ## Switching between versions
77
+
78
+ To switch between standard and improved versions:
79
+ ```bash
80
+ # Standard version
81
+ uv pip install -e .
82
+
83
+ # Improved version
84
+ uv pip install -e ".[improved]"
85
+ ```
app.py CHANGED
@@ -1,666 +1,254 @@
 
 
 
 
 
 
1
  import os
 
2
  import gradio as gr
3
- from typing import List, Tuple, Dict
4
- import json
5
- from datetime import datetime, timedelta
6
- import hashlib
7
- import threading
8
- from collections import defaultdict
9
- import time
10
- import re
11
- try:
12
- import yaml
13
- except ImportError:
14
- yaml = None
15
- logger.warning("PyYAML not installed. Markdown parsing will be disabled.")
16
- from pathlib import Path
17
-
18
- # Import only what we need for better performance
19
- from langchain.text_splitter import RecursiveCharacterTextSplitter
20
- from langchain.document_loaders import PyPDFLoader
21
- from langchain_community.embeddings import HuggingFaceEmbeddings
22
- from langchain_community.vectorstores import FAISS
23
- from langchain.schema import Document
24
  import google.generativeai as genai
25
- from google.generativeai.types import HarmCategory, HarmBlockThreshold # Ensure this is present
26
- import logging
27
-
28
- # Set up logging
29
- logging.basicConfig(level=logging.INFO)
30
- logger = logging.getLogger(__name__)
31
 
32
- # Rate limiting configuration
33
- MAX_MESSAGES_PER_SESSION = 20 # Generous limit per chat session
34
- MAX_CONCURRENT_SESSIONS = 10 # Maximum simultaneous sessions
35
- SESSION_TIMEOUT_HOURS = 2 # Sessions expire after 2 hours of inactivity
36
 
37
- # Note: This global variable is defined with BLOCK_NONE,
38
- # but the actual API calls in the methods use BLOCK_ONLY_HIGH, which is generally safer.
39
- # If you intend to use this global variable, ensure its values match your intent.
40
- safety_settings_block_none_for_all_categories = [ # Renamed for clarity based on its content
41
- {
42
- "category": HarmCategory.HARM_CATEGORY_HARASSMENT,
43
- "threshold": HarmBlockThreshold.BLOCK_NONE,
44
- },
45
- {
46
- "category": HarmCategory.HARM_CATEGORY_HATE_SPEECH,
47
- "threshold": HarmBlockThreshold.BLOCK_NONE,
48
- },
49
- {
50
- "category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
51
- "threshold": HarmBlockThreshold.BLOCK_NONE,
52
- },
53
- {
54
- "category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
55
- "threshold": HarmBlockThreshold.BLOCK_NONE,
56
- },
57
- ]
58
-
59
- class DynamicPaperDatabase:
60
- """Database that dynamically loads papers from markdown files"""
61
- def __init__(self, base_path: str = None):
62
- self.papers = {}
63
- self.base_path = base_path
64
- self.load_papers_from_markdown()
65
- self.create_lookup_indices()
66
 
67
- def parse_markdown_front_matter(self, filepath: str) -> Dict:
68
- if yaml is None:
69
- logger.warning(f"Cannot parse {filepath} - PyYAML not installed")
70
- return None
71
- try:
72
- with open(filepath, 'r', encoding='utf-8') as f:
73
- content = f.read()
74
- if content.startswith('---'):
75
- end_index = content.find('---', 3)
76
- if end_index != -1:
77
- front_matter = content[3:end_index].strip()
78
- data = yaml.safe_load(front_matter)
79
- paper_content = content[end_index+3:].strip()
80
- data['full_content'] = paper_content
81
- return data
82
- except Exception as e:
83
- logger.error(f"Error parsing {filepath}: {e}")
84
- return None
85
-
86
- def load_papers_from_markdown(self):
87
- if self.base_path:
88
- directories = [
89
- os.path.join(self.base_path, "_publications"),
90
- os.path.join(self.base_path, "_wps")
91
- ]
92
- found_any = False
93
- for directory in directories:
94
- if os.path.exists(directory):
95
- found_any = True
96
- for filename in os.listdir(directory):
97
- if filename.endswith('.md'):
98
- filepath = os.path.join(directory, filename)
99
- paper_data = self.parse_markdown_front_matter(filepath)
100
- if paper_data and 'title' in paper_data:
101
- paper_key = filename.replace('.md', '').lower()
102
- coauthors = [author.strip() for author in paper_data.get('coauthors', '').split(',') if author.strip()]
103
- authors = ["David Van Dijcke"] + coauthors
104
- seen = set()
105
- authors = [x for x in authors if not (x in seen or seen.add(x))]
106
- year = None
107
- if 'date' in paper_data:
108
- year = str(paper_data['date']).split('-')[0]
109
- elif filename.startswith('20'):
110
- year = filename[:4]
111
- paper_type = "working_paper" if "_wps" in directory else "publication"
112
- if 'job market' in paper_data.get('title', '').lower():
113
- paper_type = "job_market_paper"
114
- keywords = self.extract_keywords(paper_data)
115
- self.papers[paper_key] = {
116
- "title": paper_data['title'], "authors": authors, "year": int(year) if year else None,
117
- "type": paper_type, "keywords": keywords, "venue": paper_data.get('venue', ''),
118
- "excerpt": paper_data.get('excerpt', ''), "paperurl": paper_data.get('paperurl', ''),
119
- "citation": paper_data.get('citation', ''), "field": paper_data.get('field', ''),
120
- "full_content": paper_data.get('full_content', '')
121
- }
122
- logger.info(f"Loaded paper: {paper_data['title']} with authors: {authors}")
123
- if found_any: return
124
- logger.info("Using hardcoded paper database")
125
- self.load_hardcoded_papers()
126
 
127
- def load_hardcoded_papers(self):
128
- self.papers = {
129
- "r3d": {
130
- "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes", "authors": ["David Van Dijcke"], "year": 2025, "type": "job_market_paper",
131
- "keywords": ["regression discontinuity", "distribution", "quantile", "laqte", "income distribution", "functional data", "optimal transport"],
132
- "excerpt": "This paper extends regression discontinuity design to estimate causal effects on entire outcome distributions. It introduces Local Average Quantile Treatment Effects (LAQTE) as a functional-valued estimand that captures heterogeneous effects across the outcome distribution. The method uses optimal transport and functional data analysis to provide a complete characterization of distributional treatment effects in RD settings.",
133
- "field": "Econometrics", "full_content": "R3D develops new econometric theory for analyzing how policies affect entire distributions rather than just averages. Key contributions: (1) Introduces LAQTE estimand for RD with distribution-valued outcomes, (2) Develops estimation and inference procedures using functional data analysis, (3) Provides optimal bandwidth selection for functional estimands, (4) Applications to income distributions and policy evaluation."
134
- },
135
- "return-to-office": {
136
- "title": "Return to Office and the Tenure Distribution", "authors": ["David Van Dijcke", "Florian Gunsilius", "Austin Wright"], "year": 2025, "type": "working_paper", "venue": "Revision requested: The Review of Economics and Statistics",
137
- "keywords": ["return to office", "tenure", "distribution", "tech firms", "resumes"],
138
- "excerpt": "With the end of the COVID-19 pandemic, debates over return-to-office mandates have intensified, though their economic implications are not fully understood. Using 260 million resumes matched to company data, we analyze the impact of these policies on employee tenure and seniority at three large U.S. tech companies: Microsoft, SpaceX, and Apple.",
139
- "field": "Econometrics"
140
- },
141
- "fdr": {
142
- "title": "Free Discontinuity Regression", "authors": ["Florian Gunsilius", "David Van Dijcke"], "year": 2025, "type": "working_paper",
143
- "keywords": ["free discontinuity", "mumford-shah", "internet shutdown", "india", "multivariate", "causal inference", "geometric measure theory"],
144
- "excerpt": "This paper develops a new method for detecting and estimating multivariate discontinuities without prior knowledge of their location. Using a convex relaxation of the Mumford-Shah functional from geometric measure theory, FDR automatically identifies discontinuity sets and estimates treatment effects. Applied to internet shutdowns in India to show heterogeneous effects across regions.",
145
- "field": "Econometrics", "full_content": "FDR introduces methods from geometric measure theory to econometrics. The paper solves the problem of estimating causal effects when the discontinuity location is unknown and potentially complex (curves, surfaces). Applications include geographic regression discontinuities and policy boundaries."
146
- },
147
- "revenue-production": {
148
- "title": "On the Non-Identification of Revenue Production Functions", "authors": ["David Van Dijcke"], "year": 2023, "type": "working_paper",
149
- "keywords": ["revenue", "production function", "identification"], "field": "Econometrics"
150
- },
151
- "disco": {
152
- "title": "Distributional Synthetic Controls", "authors": ["Florian Gunsilius", "David Van Dijcke"], "year": 2025, "type": "working_paper",
153
- "keywords": ["distributional synthetic", "optimal transport", "synthetic control", "distribution", "causal inference", "quantile effects"],
154
- "excerpt": "This paper extends synthetic control methods to estimate effects on entire outcome distributions. Using optimal transport theory, DISCO creates synthetic controls that match the pre-treatment distribution of the treated unit. This enables estimation of quantile treatment effects and other distributional parameters. Includes an R package implementation.",
155
- "field": "Econometrics", "full_content": "DISCO combines synthetic controls with optimal transport to analyze distributional treatment effects. Key innovation: matching entire pre-treatment distributions rather than just means. Applications include analyzing distributional effects of minimum wage policies and other interventions."
156
- },
157
- "ukraine": {
158
- "title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine", "authors": ["David Van Dijcke", "Austin L. Wright", "Maria Polyak"], "year": 2023, "journal": "Proceedings of the National Academy of Sciences", "type": "publication",
159
- "keywords": ["ukraine", "air raid", "alerts", "casualties", "mobility"], "field": "Policy"
160
- },
161
- "unmasking": {
162
- "title": "Unmasking Partisanship: Polarization undermines public response to collective risk", "authors": ["Maria Milosh", "Marcus Painter", "Konstantin Sonin", "David Van Dijcke", "Austin Wright"], "year": 2021, "journal": "Journal of Public Economics", "type": "publication",
163
- "keywords": ["partisanship", "polarization", "covid", "mask", "social distancing"],
164
- "excerpt": "Political polarization and competing narratives can undermine public policy implementation. Partisanship may play a particularly important role in shaping heterogeneous responses to collective risk during periods of crisis when political agents manipulate signals received by the public.", "field": "Policy"
165
- },
166
- "science-skepticism": {
167
- "title": "Science Skepticism Reduced Compliance with COVID-19 Shelter-in-Place Policies", "authors": ["Adam Brzezinski", "Valentin Kecht", "David Van Dijcke", "Austin L. Wright"], "year": 2021, "journal": "Nature Human Behaviour", "citations": 226, "type": "publication",
168
- "keywords": ["covid", "science skepticism", "compliance", "shelter in place"], "field": "Policy"
169
- },
170
- "government-community": {
171
- "title": "The COVID-19 Pandemic: Government versus Community Action Across the United States", "authors": ["Adam Brzezinski", "Guido Deiana", "Valentin Kecht", "David Van Dijcke"], "year": 2020, "journal": "Covid Economics", "citations": 160, "type": "publication",
172
- "keywords": ["covid", "government", "community", "mandates", "voluntary"], "field": "Policy"
173
- },
174
- "work-effort": {
175
- "title": "Work Effort and the Cycle: Evidence from Survey Data", "authors": ["Vivien Lewis", "David van Dijcke"], "year": 2019, "journal": "Deutsche Bundesbank Discussion Papers", "type": "publication",
176
- "keywords": ["work effort", "business cycle", "survey"], "field": "Macro"
177
- }
178
  }
179
-
180
- def extract_keywords(self, paper_data: Dict) -> List[str]:
181
- keywords = []
182
- title = paper_data.get('title', '').lower()
183
- important_words = ['regression', 'discontinuity', 'distribution', 'synthetic',
184
- 'control', 'covid', 'pandemic', 'partisanship', 'return to office',
185
- 'ukraine', 'revenue', 'production', 'function', 'identification']
186
- for word in important_words:
187
- if word in title: keywords.append(word)
188
- excerpt = paper_data.get('excerpt', '').lower()
189
- for word in important_words:
190
- if word in excerpt and word not in keywords: keywords.append(word)
191
- if 'field' in paper_data: keywords.append(paper_data['field'].lower())
192
- return keywords
193
-
194
- def create_lookup_indices(self):
195
- self.title_to_key = {}
196
- self.keyword_to_papers = defaultdict(list)
197
- for key, paper in self.papers.items():
198
- normalized_title = paper["title"].lower().strip()
199
- self.title_to_key[normalized_title] = key
200
- title_words = normalized_title.split()
201
- if len(title_words) > 3:
202
- self.title_to_key[" ".join(title_words[:3])] = key
203
- for keyword in paper.get("keywords", []):
204
- self.keyword_to_papers[keyword.lower()].append(key)
205
-
206
- def find_paper(self, text: str) -> List[str]:
207
- text_lower = text.lower()
208
- found_papers = []
209
- for key in self.papers.keys():
210
- if key in text_lower: found_papers.append(key)
211
- for title_fragment, key in self.title_to_key.items():
212
- if title_fragment in text_lower and key not in found_papers: found_papers.append(key)
213
- keyword_matches = defaultdict(int)
214
- for keyword, paper_keys in self.keyword_to_papers.items():
215
- if keyword in text_lower:
216
- for paper_key in paper_keys: keyword_matches[paper_key] += 1
217
- for paper_key, match_count in keyword_matches.items():
218
- if match_count >= 2 and paper_key not in found_papers: found_papers.append(paper_key)
219
- return found_papers
220
-
221
- def verify_and_correct_response(self, response: str) -> str:
222
- mentioned_papers = self.find_paper(response)
223
- if not mentioned_papers: return response
224
- corrected_response = response
225
- for paper_key in mentioned_papers:
226
- paper = self.papers[paper_key]
227
- correct_authors = paper["authors"]
228
- paper_title = paper["title"]
229
- if len(correct_authors) == 1: author_str = correct_authors[0]
230
- elif len(correct_authors) == 2: author_str = " and ".join(correct_authors)
231
- else: author_str = ", ".join(correct_authors[:-1]) + ", and " + correct_authors[-1]
232
- title_pattern = re.escape(paper_title)
233
- patterns = [
234
- rf"({title_pattern})[^.]*?by\s+([^.]+?)(?:\.|,|\))", rf"({title_pattern})[^.]*?with\s+([^.]+?)(?:\.|,|\))",
235
- rf"({title_pattern})[^.]*?\(([^)]+?)\)", rf"({title_pattern})[^.]*?-\s*Authors:\s*([^.]+?)(?:\.|,|\n)",
236
- ]
237
- for pattern in patterns:
238
- matches = re.finditer(pattern, corrected_response, re.IGNORECASE)
239
- for match in matches:
240
- full_match = match.group(0)
241
- author_part = match.group(2)
242
- mentioned_authors = [a.strip() for a in re.split(r',|and', author_part)]
243
- if len(correct_authors) > 1 and len(mentioned_authors) == 1 and "David" in mentioned_authors[0]:
244
- if "by" in full_match: new_match = full_match.replace(f"by {author_part}", f"by {author_str}")
245
- elif "with" in full_match: new_match = full_match.replace(f"with {author_part}", f"with {author_str}")
246
- elif "(" in full_match and ")" in full_match: new_match = full_match.replace(f"({author_part})", f"({author_str})")
247
- elif "Authors:" in full_match: new_match = full_match.replace(f"Authors: {author_part}", f"Authors: {author_str}")
248
- else: new_match = full_match
249
- corrected_response = corrected_response.replace(full_match, new_match)
250
- for paper_key in mentioned_papers:
251
- paper = self.papers[paper_key]
252
- if len(paper["authors"]) > 1:
253
- possessive_patterns = [rf"David's\s+{re.escape(paper['title'])}", rf"his\s+{re.escape(paper['title'])}"]
254
- for pattern in possessive_patterns:
255
- if re.search(pattern, corrected_response, re.IGNORECASE):
256
- author_str_coauthors = " and ".join([a for a in paper["authors"] if a != "David Van Dijcke"]) # Corrected variable name
257
- if author_str_coauthors and author_str_coauthors not in corrected_response: # Check if coauthor_str is not empty
258
- sentences = corrected_response.split('.')
259
- for i, sentence in enumerate(sentences):
260
- if re.search(pattern, sentence, re.IGNORECASE):
261
- sentences[i] = sentence + f" (joint work with {author_str_coauthors})" # Corrected variable name
262
- corrected_response = '.'.join(sentences)
263
- break
264
- return corrected_response
265
-
266
- class JudgeAgent:
267
- def __init__(self, paper_db: DynamicPaperDatabase):
268
- self.paper_db = paper_db
269
- gemini_api_key = os.getenv("GOOGLE_API_KEY")
270
- self.use_gemini = False
271
- if gemini_api_key:
272
- try:
273
- genai.configure(api_key=gemini_api_key)
274
- model_preference = ['gemini-1.5-flash-002', 'gemini-1.5-flash', 'gemini-1.5-pro']
275
- for model_name in model_preference:
276
- try:
277
- self.judge_model = genai.GenerativeModel(model_name)
278
- self.judge_model.generate_content("Hello") # Test call
279
- self.use_gemini = True
280
- logger.info(f"Judge agent initialized with {model_name}")
281
- break
282
- except Exception:
283
- logger.warning(f"Failed to initialize judge with {model_name}, trying next.")
284
- if not self.use_gemini:
285
- logger.error("Failed to initialize judge agent with any Gemini model.")
286
- except Exception as e:
287
- logger.error(f"Failed to configure Gemini for judge agent: {e}")
288
- self.use_gemini = False
289
 
290
- def create_paper_context(self) -> str:
291
- context = "VERIFIED PAPER DATABASE:\n\n"
292
- for key, paper in self.paper_db.papers.items():
293
- authors_str = ", ".join(paper["authors"])
294
- context += f"Title: {paper['title']}\nAuthors: {authors_str}\n"
295
- if paper.get('year'): context += f"Year: {paper['year']}\n"
296
- if paper.get('venue'): context += f"Venue: {paper['venue']}\n"
297
- context += "\n"
298
- return context
 
 
 
 
 
 
299
 
300
- def judge_response(self, original_response: str, user_question: str) -> str:
301
- if not self.use_gemini:
302
- return self.paper_db.verify_and_correct_response(original_response)
 
303
 
304
- paper_context = self.create_paper_context()
305
- judge_prompt = f"""You are a fact-checking judge for David Van Dijcke's research assistant.
306
-
307
- CRITICAL TASK: Verify and correct the response below if needed. Return ONLY the final response text that should be shown to the user.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
308
 
309
- DO NOT include any meta-commentary like "The original response is accurate" or "I've corrected..."
310
- DO NOT explain what you changed or why.
311
- JUST return the clean, corrected response text.
 
 
 
 
312
 
313
- VERIFICATION CHECKLIST:
314
- 1. Are ALL coauthors mentioned for each paper? (Never attribute sole authorship unless verified)
315
- 2. Are paper titles, years, and venues accurate?
316
- 3. Is there any false information about papers or coauthors?
317
- 4. Are the claims supported by the paper database below?
318
 
319
- PAPER DATABASE:
320
- {paper_context}
321
 
322
- USER QUESTION: {user_question}
 
323
 
324
- RESPONSE TO VERIFY:
325
- {original_response}
326
 
327
- OUTPUT: Just the final response text, nothing else."""
328
 
 
 
329
  try:
330
- generation_config = genai.types.GenerationConfig(
331
- temperature=0.1, top_p=0.9, max_output_tokens=600,
332
- )
333
- judge_safety_settings = [
334
- {"category": HarmCategory.HARM_CATEGORY_HARASSMENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
335
- {"category": HarmCategory.HARM_CATEGORY_HATE_SPEECH, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
336
- {"category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
337
- {"category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
338
- ]
339
 
340
- judge_gemini_response = self.judge_model.generate_content(
341
- judge_prompt,
342
- generation_config=generation_config,
343
- safety_settings=judge_safety_settings
344
- )
345
 
346
- judged_text = ""
347
- if (judge_gemini_response.candidates and
348
- len(judge_gemini_response.candidates) > 0 and
349
- judge_gemini_response.candidates[0].content and
350
- judge_gemini_response.candidates[0].content.parts and
351
- len(judge_gemini_response.candidates[0].content.parts) > 0):
352
- judged_text = judge_gemini_response.text.strip()
353
- else:
354
- block_reason_info = "Reason unknown."
355
- finish_reason_info = "Finish reason unknown."
356
- if judge_gemini_response.prompt_feedback and judge_gemini_response.prompt_feedback.block_reason:
357
- block_reason_info = f"Prompt blocked for judge due to: {judge_gemini_response.prompt_feedback.block_reason.name}"
358
- if judge_gemini_response.prompt_feedback.block_reason_message:
359
- block_reason_info += f" (Message: {judge_gemini_response.prompt_feedback.block_reason_message})"
360
- logger.error(f"Gemini judge agent: {block_reason_info}")
361
- if judge_gemini_response.candidates and len(judge_gemini_response.candidates) > 0:
362
- candidate = judge_gemini_response.candidates[0]
363
- finish_reason_info = f"Finish reason for judge: {candidate.finish_reason.name}"
364
- logger.error(f"Gemini judge agent: {finish_reason_info}")
365
- if candidate.safety_ratings:
366
- for rating in candidate.safety_ratings:
367
- logger.error(f" Judge Safety Rating: Category={rating.category.name}, Probability={rating.probability.name}")
368
- logger.warning(f"Judge agent could not generate a refined response ({block_reason_info}, {finish_reason_info}). Falling back to pre-judge verified response.")
369
- return self.paper_db.verify_and_correct_response(original_response)
370
-
371
- final_response = self.paper_db.verify_and_correct_response(judged_text)
372
- return final_response
373
 
 
 
374
  except Exception as e:
375
- logger.error(f"Judge agent error: {e}", exc_info=True)
376
- return self.paper_db.verify_and_correct_response(original_response)
377
-
378
- class RateLimiter:
379
- def __init__(self):
380
- self.sessions = {}
381
- self.lock = threading.Lock()
382
- def get_session_info(self, session_id: str) -> Dict:
383
- with self.lock:
384
- current_time = datetime.now()
385
- expired_sessions = [sid for sid, info in self.sessions.items() if current_time - info['last_activity'] > timedelta(hours=SESSION_TIMEOUT_HOURS)]
386
- for sid in expired_sessions: del self.sessions[sid]; logger.info(f"Expired session: {sid}")
387
- if session_id not in self.sessions:
388
- if len(self.sessions) >= MAX_CONCURRENT_SESSIONS: return {'allowed': False, 'reason': 'Too many active sessions. Please try again later.'}
389
- self.sessions[session_id] = {'message_count': 0, 'created': current_time, 'last_activity': current_time}
390
- session = self.sessions[session_id]
391
- session['last_activity'] = current_time
392
- if session['message_count'] >= MAX_MESSAGES_PER_SESSION:
393
- return {'allowed': False, 'reason': f'You have reached the limit of {MAX_MESSAGES_PER_SESSION} messages. Please email David at dvdijcke@umich.edu for further questions.'}
394
- session['message_count'] += 1
395
- return {'allowed': True, 'message_count': session['message_count'], 'remaining': MAX_MESSAGES_PER_SESSION - session['message_count']}
396
-
397
- paper_db = DynamicPaperDatabase() # Use default base_path (None) for Hugging Face
398
- judge_agent = JudgeAgent(paper_db)
399
- rate_limiter = RateLimiter()
400
 
401
- class ImprovedResearchAssistant:
402
- def __init__(self):
403
- self.embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': True})
404
- gemini_api_key = os.getenv("GOOGLE_API_KEY")
405
- self.use_gemini = False
406
- if gemini_api_key:
407
- try:
408
- genai.configure(api_key=gemini_api_key)
409
- logger.info("Attempting to use Google Gemini for high quality responses")
410
- model_preference = ['gemini-1.5-flash-002', 'gemini-1.5-flash', 'gemini-1.5-pro']
411
- for model_name in model_preference:
412
- try:
413
- self.gemini_model = genai.GenerativeModel(model_name)
414
- self.gemini_model.generate_content("Hello") # Test call
415
- self.use_gemini = True
416
- logger.info(f"Successfully connected to {model_name}")
417
- break
418
- except Exception:
419
- logger.warning(f"Failed to connect to {model_name}, trying next.")
420
- if not self.use_gemini:
421
- logger.error("Failed to connect to any Gemini model. Using limited mode.")
422
- except Exception as e:
423
- logger.error(f"Failed to initialize Gemini: {e}")
424
- self.use_gemini = False
425
- else:
426
- logger.warning("No Google API key found. Using limited mode.")
427
- self.use_gemini = False
428
- self.vector_store = None
429
- self.cache_path = "vector_store_cache"
430
- logger.info("Building vector store from documents and markdown files...")
431
- self.load_documents()
432
-
433
- def load_documents(self):
434
- documents = []
435
- research_info = """
436
- David Van Dijcke is a PhD candidate in Economics at the University of Michigan, Ann Arbor.
437
- He is on the job market for the 2025-26 academic year as an ECONOMETRICIAN.
438
- RESEARCH PROFILE: David's research has two main components:
439
- 1. ECONOMETRIC THEORY: Developing novel methods for functional and high-dimensional data, combining tools from functional data analysis, optimal transport, and geometric measure theory
440
- 2. POLICY APPLICATIONS: Applying these methods to answer important policy questions using big data, from labor markets to public health to conflict zones
441
- IMPORTANT: Always credit coauthors when discussing papers. Economics papers typically use alphabetical author order.
442
- CONTACT: Email: dvdijcke@umich.edu, Website: https://davidvandijcke.com, Book a meeting: https://calendar.app.google/dKeDaigmFwnJPm8s6
443
- """
444
- documents.append(Document(page_content=research_info, metadata={"source": "website_overview", "type": "general_info"}))
445
- for paper_key, paper in paper_db.papers.items():
446
- paper_content = f"Paper: {paper['title']}\nAuthors: {', '.join(paper['authors'])}\nYear: {paper.get('year', 'forthcoming')}\nType: {paper['type']}\nField: {paper.get('field', 'Economics')}\n"
447
- if paper.get('venue'): paper_content += f"Venue: {paper['venue']}\n"
448
- if paper.get('citation'): paper_content += f"Citation: {paper['citation']}\n"
449
- if paper.get('excerpt'): paper_content += f"\nAbstract/Summary: {paper['excerpt']}\n"
450
- if paper.get('full_content'): paper_content += f"\nDetails: {paper['full_content']}\n"
451
- if paper['type'] == 'job_market_paper': paper_content += "\nNOTE: This is David's JOB MARKET PAPER for 2025-26.\n"
452
- documents.append(Document(page_content=paper_content, metadata={"source": f"paper_{paper_key}", "type": "research"}))
453
- key_pdfs = ["CV_DavidVanDijcke.pdf", "disco.pdf", "fdr.pdf", "r3d_arxiv_4apr2025.pdf", "rto.pdf", "unmasking_partisanship.pdf"]
454
- possible_dirs = ["documents", "./documents", os.path.join(os.getcwd(), "documents")]
455
- documents_dir = next((dir_path for dir_path in possible_dirs if os.path.exists(dir_path)), None)
456
- if documents_dir:
457
- logger.info(f"Found documents directory at: {documents_dir}")
458
- for filename in key_pdfs:
459
- filepath = os.path.join(documents_dir, filename)
460
- if os.path.exists(filepath):
461
- try:
462
- loader = PyPDFLoader(filepath)
463
- pdf_docs = loader.load()
464
- pages_to_load = 10 if "r3d" in filename.lower() else 5
465
- documents.extend(pdf_docs[:pages_to_load])
466
- logger.info(f"Loaded {filename} ({pages_to_load} pages)")
467
- except Exception as e:
468
- logger.warning(f"Error loading {filename}: {e}")
469
- else:
470
- logger.warning("No documents directory found. PDF loading skipped.")
471
- text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, length_function=len)
472
- splits = text_splitter.split_documents(documents)
473
- self.vector_store = FAISS.from_documents(splits, self.embeddings)
474
- try:
475
- if not os.path.exists(self.cache_path):
476
- os.makedirs(self.cache_path)
477
- self.vector_store.save_local(self.cache_path)
478
- logger.info("Vector store cached successfully")
479
- except Exception as e:
480
- logger.warning(f"Failed to cache vector store (non-critical): {e}")
481
-
482
- def is_greeting_or_casual(self, message: str) -> bool:
483
- greetings = ["hello", "hi", "hey", "good morning", "good afternoon", "good evening", "how are you", "what's up", "greetings", "howdy", "hola", "bonjour"]
484
- message_lower = message.lower().strip()
485
- starts_with_greeting = any(message_lower.startswith(greeting) for greeting in greetings)
486
- is_very_short = len(message_lower.split()) <= 2 and not any(word in message_lower for word in ["r3d", "paper", "research", "method", "econometric", "about", "tell", "what", "how"])
487
- return starts_with_greeting or is_very_short
488
 
489
- def generate_response(self, question: str, context: str) -> str:
490
- if self.use_gemini:
491
- prompt = f"""You are an expert AI assistant for David Van Dijcke's academic website, specializing in his ECONOMETRIC research.
492
- David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
493
-
494
- CRITICAL INSTRUCTIONS:
495
- 1. COAUTHORSHIP ACCURACY:
496
- - ALWAYS mention ALL coauthors when discussing any paper
497
- - NEVER attribute sole authorship to David unless he is truly the only author
498
- - Economics papers typically use alphabetical author order - this is standard, not a ranking
499
- - Use phrases like "joint work with", "coauthored with", or list all authors
500
- - If you mention a paper title, you MUST include all coauthors
501
-
502
- 2. PAPER DETAILS:
503
- - When asked about a specific paper, provide substantive details from the context
504
- - Explain the main contributions and innovations
505
- - Mention applications and empirical examples when available
506
- - For the job market paper (R3D), emphasize its importance and innovations
507
-
508
- 3. RESEARCH PROFILE:
509
- - David is an ECONOMETRICIAN who develops new statistical methods
510
- - His job market paper is R3D: Regression Discontinuity Design with Distribution-Valued Outcomes (sole authored)
511
- - He combines functional data analysis, optimal transport, and geometric measure theory
512
- - He applies these methods to answer policy questions with big data
513
- - His work extends causal inference beyond scalar outcomes to distribution-valued outcomes
514
-
515
- 4. KEY COLLABORATORS:
516
- - Florian Gunsilius (frequent coauthor on FDR, DISCO)
517
- - Austin Wright (Return to Office, Ukraine, COVID papers)
518
- - Other coauthors should be mentioned by name when discussing their joint work
519
-
520
- Be precise about technical details and provide substantive information. If uncertain about details, suggest emailing David at dvdijcke@umich.edu.
521
-
522
- Context about David Van Dijcke:
523
- {context}
524
-
525
- User's question: {question}
526
-
527
- Provide an accurate, detailed, and professional response. Remember to ALWAYS credit ALL coauthors and provide substantive information about the research."""
528
-
529
- try:
530
- generation_config = genai.types.GenerationConfig(
531
- temperature=0.2, top_p=0.9, max_output_tokens=500,
532
- )
533
- safety_settings_for_call = [
534
- {"category": HarmCategory.HARM_CATEGORY_HARASSMENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
535
- {"category": HarmCategory.HARM_CATEGORY_HATE_SPEECH, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
536
- {"category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
537
- {"category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
538
- ]
539
-
540
- gemini_api_response = self.gemini_model.generate_content(
541
- prompt,
542
- generation_config=generation_config,
543
- safety_settings=safety_settings_for_call
544
- )
545
-
546
- generated_text = ""
547
- if (gemini_api_response.candidates and
548
- len(gemini_api_response.candidates) > 0 and
549
- gemini_api_response.candidates[0].content and
550
- gemini_api_response.candidates[0].content.parts and
551
- len(gemini_api_response.candidates[0].content.parts) > 0):
552
- generated_text = gemini_api_response.text.strip()
553
- else:
554
- block_reason_info = "Reason unknown."
555
- finish_reason_info = "Finish reason unknown."
556
- if gemini_api_response.prompt_feedback and gemini_api_response.prompt_feedback.block_reason:
557
- block_reason_info = f"Prompt blocked due to: {gemini_api_response.prompt_feedback.block_reason.name}"
558
- if gemini_api_response.prompt_feedback.block_reason_message:
559
- block_reason_info += f" (Message: {gemini_api_response.prompt_feedback.block_reason_message})"
560
- logger.error(f"Gemini main assistant: {block_reason_info}")
561
- if gemini_api_response.candidates and len(gemini_api_response.candidates) > 0:
562
- candidate = gemini_api_response.candidates[0]
563
- finish_reason_info = f"Finish reason: {candidate.finish_reason.name}"
564
- logger.error(f"Gemini main assistant: {finish_reason_info}")
565
- if candidate.safety_ratings:
566
- for rating in candidate.safety_ratings:
567
- logger.error(f" Safety Rating: Category={rating.category.name}, Probability={rating.probability.name}")
568
- user_message = (f"I apologize, but I encountered an issue generating a response. "
569
- f"This might be due to content safety filters. ({finish_reason_info}) " # Simplified for user
570
- f"Please try rephrasing your question.")
571
- return user_message
572
-
573
- verified_response = paper_db.verify_and_correct_response(generated_text)
574
- final_response = judge_agent.judge_response(verified_response, question)
575
- return final_response
576
-
577
- except Exception as e:
578
- logger.error(f"Error with Gemini in main assistant: {e}", exc_info=True)
579
- if "finish_reason is 2" in str(e) or "SAFETY" in str(e).upper() or "finish_reason: SAFETY" in str(e):
580
- return "I apologize, but my response generation was blocked. This might be due to content safety filters. Please try rephrasing your question."
581
- return "I apologize, but I'm having trouble generating a response right now. Could you please try again?"
582
- else:
583
- return "I'm currently running in limited mode without access to a high-quality language model. To get the best responses, please add a Google API key to the Space settings."
584
 
585
- def answer_question(self, message: str, history: List[Tuple[str, str]] = None, session_id: str = None) -> str:
586
- if session_id:
587
- session_info = rate_limiter.get_session_info(session_id)
588
- if not session_info['allowed']: return session_info['reason']
589
 
590
- if self.is_greeting_or_casual(message):
591
- greeting_responses = [
592
- "Hello! I'm here to help you learn about David Van Dijcke, an econometrician on the 2025-26 job market. He develops cutting-edge methods for functional and high-dimensional data. What would you like to know about his research?",
593
- "Hi! Welcome to David Van Dijcke's research assistant. David is an econometrician who combines functional data analysis, optimal transport, and geometric measure theory to develop new causal inference methods. How can I help you learn about his work?",
594
- "Hello! I can tell you about David Van Dijcke's econometric research, including his job market paper on distribution-valued treatment effects and his collaborative work with researchers like Florian Gunsilius and Austin Wright. What aspect of his work interests you?",
595
- ]
596
- response_index = int(hashlib.md5(message.encode()).hexdigest(), 16) % len(greeting_responses)
597
- return greeting_responses[response_index]
598
 
599
- try:
600
- if not self.vector_store:
601
- logger.error("Vector store not initialized!")
602
- return "I'm sorry, there's an issue with my internal knowledge base. Please try again later."
603
-
604
- docs = self.vector_store.similarity_search(message, k=4)
605
- context = "\n".join([doc.page_content for doc in docs])
606
- response = self.generate_response(message, context)
607
-
608
- paper_keywords = ["r3d", "regression discontinuity", "free discontinuity", "fdr", "disco",
609
- "distributional synthetic", "return to office", "rto", "revenue",
610
- "production function", "unmasking", "ukraine", "covid", "pandemic"]
611
- if any(keyword in message.lower() for keyword in paper_keywords) and "davidvandijcke.com" not in response:
612
- response += "\n\n*For more details, you can find David's papers on his website at https://davidvandijcke.com*"
613
- return response
614
- except Exception as e:
615
- logger.error(f"Error in answer_question: {e}", exc_info=True)
616
- return "I apologize, but I'm having trouble accessing the information right now. Please try rephrasing your question or ask about David's research areas, publications, or academic background."
617
-
618
- def create_gradio_interface():
619
- assistant = ImprovedResearchAssistant()
620
- import uuid
621
- def chat_function(message, history, request: gr.Request):
622
- session_id = None
623
- if hasattr(request, 'session_hash') and request.session_hash: session_id = request.session_hash
624
- else:
625
- try:
626
- if hasattr(request, 'client'):
627
- client = request.client
628
- if hasattr(client, 'host'): session_id = f"ip_{client.host}"
629
- elif hasattr(client, 'value') and isinstance(client.value, tuple): session_id = f"ip_{client.value[0]}"
630
- else: session_id = f"ip_{str(client)}"
631
- except: pass
632
- if not session_id: session_id = str(uuid.uuid4())
633
- return assistant.answer_question(message, history, session_id)
634
-
635
- custom_css = """
636
- #chatbot { height: 600px; }
637
- .gradio-container { font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif; max-width: 900px; margin: auto; }
638
- .user-message, .bot-message { padding: 15px; border-radius: 10px; margin: 10px 0; }
639
- """
 
 
 
 
 
 
640
 
641
- demo = gr.ChatInterface(
642
- fn=chat_function,
643
- title="David Van Dijcke - Econometrician | Job Market 2025-26",
644
- description=("Welcome! I'm an AI assistant specializing in David Van Dijcke's econometric research. "
645
- "David develops novel econometric methods for functional and high-dimensional data. Ask me about his job market paper (R3D), "
646
- "the novel aspects of his research, or his collaborative research projects."),
647
- examples=["Hello! Who is David Van Dijcke?", "What econometric methods has David developed?",
648
- "Tell me about his job market paper", "Tell me about the Return to Office paper", "Who are David's coauthors?"],
649
- theme=gr.themes.Soft(primary_hue="blue", secondary_hue="gray", neutral_hue="gray", font=gr.themes.GoogleFont("Inter")),
650
- css=custom_css, retry_btn="Retry", undo_btn="Undo", clear_btn="Clear Chat", submit_btn="Send", autofocus=True
651
- )
652
  return demo
653
 
654
  if __name__ == "__main__":
655
- logger.info(f"Loaded {len(paper_db.papers)} papers using current configuration.")
656
- for key, paper in paper_db.papers.items():
657
- logger.info(f" - {paper['title']} ({', '.join(paper['authors'])})")
658
-
659
- # Try to create cache directory, but don't fail if we can't
660
- try:
661
- os.makedirs("vector_store_cache", exist_ok=True)
662
- except Exception as e:
663
- logger.warning(f"Could not create cache directory (non-critical): {e}")
664
-
665
- demo = create_gradio_interface()
666
- demo.launch(share=False, server_name="0.0.0.0", server_port=7860, show_error=True)
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Professional Research Assistant
4
+ Clean chat interface with expert responses
5
+ """
6
+
7
  import os
8
+ from typing import List, Tuple
9
  import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from dotenv import load_dotenv
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
12
  import google.generativeai as genai
 
 
 
 
 
 
13
 
14
+ # Load environment variables
15
+ load_dotenv()
 
 
16
 
17
+ class ProfessionalAssistant:
18
+ """Professional assistant that speaks as an expert about David's work"""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
+ def __init__(self):
21
+ # Setup Gemini
22
+ api_key = os.getenv("GOOGLE_API_KEY")
23
+ if api_key:
24
+ genai.configure(api_key=api_key)
25
+ try:
26
+ self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
27
+ print("Using Gemini 2.5 Flash Preview")
28
+ except:
29
+ self.model = genai.GenerativeModel('gemini-1.5-flash')
30
+ print("Using Gemini 1.5 Flash")
31
+ else:
32
+ self.model = None
33
+
34
+ # Load all papers
35
+ self.papers = self._load_all_papers()
36
+
37
+ # Pre-load context
38
+ self.context = self._create_context()
39
+
40
+ # Question counter
41
+ self.question_count = 0
42
+ self.question_limit = 15
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
43
 
44
+ def _load_all_papers(self) -> dict:
45
+ """Load all papers completely"""
46
+ papers = {}
47
+ pdf_dir = "documents"
48
+
49
+ paper_files = {
50
+ "r3d": ("r3d_arxiv_4apr2025.pdf", "R3D (Job Market Paper)"),
51
+ "cv": ("CV_DavidVanDijcke.pdf", "CV"),
52
+ "fdr": ("fdr.pdf", "Free Discontinuity Regression"),
53
+ "disco": ("disco.pdf", "Distributional Synthetic Controls"),
54
+ "rto": ("rto.pdf", "Return to Office"),
55
+ "prodf": ("prodf.pdf", "Revenue Production Functions"),
56
+ "unmasking": ("unmasking_partisanship.pdf", "Unmasking Partisanship"),
57
+ "ukraine": ("van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Ukraine Alerts")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
58
  }
59
+
60
+ for key, (filename, title) in paper_files.items():
61
+ pdf_path = os.path.join(pdf_dir, filename)
62
+ if os.path.exists(pdf_path):
63
+ try:
64
+ loader = PyPDFLoader(pdf_path)
65
+ pages = loader.load()
66
+ text = "\n\n".join([p.page_content for p in pages])
67
+ papers[key] = {
68
+ "text": text,
69
+ "title": title,
70
+ "pages": len(pages)
71
+ }
72
+ print(f"Loaded {title}: {len(pages)} pages")
73
+ except Exception as e:
74
+ print(f"Error loading {filename}: {e}")
75
+
76
+ return papers
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
77
 
78
+ def _create_context(self) -> str:
79
+ """Create comprehensive context from all papers"""
80
+ context_parts = []
81
+
82
+ # Add papers in priority order
83
+ priority_order = ["r3d", "cv", "fdr", "disco", "rto", "prodf"]
84
+
85
+ for key in priority_order:
86
+ if key in self.papers:
87
+ paper = self.papers[key]
88
+ # Add substantial excerpts
89
+ excerpt_length = 30000 if key == "r3d" else 15000
90
+ context_parts.append(f"\n[{paper['title']}]\n{paper['text'][:excerpt_length]}")
91
+
92
+ return "\n\n".join(context_parts)
93
 
94
+ def chat(self, message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
95
+ """Chat with proper history handling"""
96
+ if not message.strip():
97
+ return "", history
98
 
99
+ # Check question limit
100
+ if self.question_count >= self.question_limit:
101
+ response = "I've reached the question limit for this session (15 questions). Please refresh the page to start a new conversation."
102
+ history.append((message, response))
103
+ return "", history
104
+
105
+ if not self.model:
106
+ response = "I need a Google API key to provide detailed answers about David's research."
107
+ history.append((message, response))
108
+ return "", history
109
+
110
+ # Build conversation context
111
+ conversation = "Previous conversation:\n"
112
+ for human, assistant in history[-3:]: # Last 3 exchanges
113
+ conversation += f"User: {human}\nAssistant: {assistant}\n\n"
114
+
115
+ # Determine which papers to emphasize based on query
116
+ message_lower = message.lower()
117
+ specific_context = ""
118
+
119
+ if "job market" in message_lower or "r3d" in message_lower:
120
+ if "r3d" in self.papers:
121
+ specific_context = f"\n[R3D - Job Market Paper]\n{self.papers['r3d']['text'][:50000]}\n"
122
+ elif "fdr" in message_lower or "discontinuity" in message_lower:
123
+ if "fdr" in self.papers:
124
+ specific_context = f"\n[FDR Paper]\n{self.papers['fdr']['text'][:30000]}\n"
125
+
126
+ # Create prompt
127
+ prompt = f"""You are an expert assistant helping visitors learn about David Van Dijcke's research.
128
 
129
+ CRITICAL INSTRUCTIONS:
130
+ - You are NOT David - you are an expert explaining his work to website visitors
131
+ - Speak in third person about David (use "David" or "Van Dijcke", not "I" or "my")
132
+ - Be conversational but professional
133
+ - Give concise, informative answers (2-3 paragraphs max unless asked for details)
134
+ - Don't say "based on the provided papers" - just state facts confidently
135
+ - Focus on what makes his work innovative and important
136
 
137
+ Key facts:
138
+ - David is an econometrician on the 2025-26 job market from University of Michigan
139
+ - His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
140
+ - He specializes in functional data analysis and optimal transport methods
 
141
 
142
+ {conversation}
 
143
 
144
+ Full research context:
145
+ {self.context}
146
 
147
+ {specific_context}
 
148
 
149
+ Current question: {message}
150
 
151
+ Provide a concise, expert response:"""
152
+
153
  try:
154
+ response = self.model.generate_content(prompt)
155
+ answer = response.text
 
 
 
 
 
 
 
156
 
157
+ # Increment question counter
158
+ self.question_count += 1
 
 
 
159
 
160
+ # Add remaining questions info if getting close to limit
161
+ remaining = self.question_limit - self.question_count
162
+ if remaining <= 3 and remaining > 0:
163
+ answer += f"\n\n*({remaining} questions remaining in this session)*"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
164
 
165
+ history.append((message, answer))
166
+ return "", history
167
  except Exception as e:
168
+ error_response = f"I encountered an error. Please try rephrasing your question."
169
+ history.append((message, error_response))
170
+ return "", history
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
 
172
+ # Create interface
173
+ def create_interface():
174
+ assistant = ProfessionalAssistant()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
 
176
+ # Custom CSS for a clean look
177
+ custom_css = """
178
+ .gradio-container {
179
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
180
+ max-width: 900px;
181
+ margin: auto;
182
+ }
183
+ .chatbot {
184
+ height: 500px !important;
185
+ }
186
+ .message {
187
+ font-size: 15px !important;
188
+ line-height: 1.6 !important;
189
+ }
190
+ """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
191
 
192
+ with gr.Blocks(title="David Van Dijcke | Research Assistant", css=custom_css) as demo:
193
+ gr.Markdown("""
194
+ ## David Van Dijcke - Research Assistant
 
195
 
196
+ Welcome! I can help you learn about David Van Dijcke's econometric research. David is on the 2025-26 academic job market.
 
 
 
 
 
 
 
197
 
198
+ **Job Market Paper:** R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
199
+
200
+ *Note: This session allows up to 15 questions. Refresh the page to start a new session.*
201
+ """)
202
+
203
+ chatbot = gr.Chatbot(
204
+ value=[],
205
+ elem_classes=["chatbot"],
206
+ bubble_full_width=False,
207
+ avatar_images=(None, None),
208
+ show_label=False
209
+ )
210
+
211
+ with gr.Row():
212
+ msg = gr.Textbox(
213
+ show_label=False,
214
+ placeholder="Ask about David's research, methods, or papers...",
215
+ elem_classes=["message-input"],
216
+ scale=4
217
+ )
218
+ submit = gr.Button("Send", scale=1, variant="primary")
219
+
220
+ # Clear button
221
+ clear = gr.Button("Clear conversation", size="sm")
222
+
223
+ # Examples in a nice layout
224
+ gr.Examples(
225
+ examples=[
226
+ "What is David's job market paper about?",
227
+ "What makes R3D innovative?",
228
+ "What are the practical applications of R3D?",
229
+ "Tell me about David's other research besides R3D",
230
+ "What makes David a strong candidate for an econometrics position?"
231
+ ],
232
+ inputs=msg,
233
+ label="Example questions:"
234
+ )
235
+
236
+ # Event handlers
237
+ msg.submit(assistant.chat, [msg, chatbot], [msg, chatbot])
238
+ submit.click(assistant.chat, [msg, chatbot], [msg, chatbot])
239
+ clear.click(lambda: [], None, chatbot, queue=False)
240
+
241
+ gr.Markdown("""
242
+ ---
243
+ *This assistant has access to David's complete research portfolio including published papers, working papers, and CV.*
244
+ """)
245
 
 
 
 
 
 
 
 
 
 
 
 
246
  return demo
247
 
248
  if __name__ == "__main__":
249
+ interface = create_interface()
250
+ interface.launch(
251
+ server_name="127.0.0.1",
252
+ server_port=7860,
253
+ show_error=True
254
+ )
 
 
 
 
 
 
app_enhanced.py ADDED
@@ -0,0 +1,599 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Enhanced Research Assistant
4
+ Improved version with better context handling, caching, and responses
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import hashlib
10
+ from typing import List, Dict, Optional, Tuple
11
+ import gradio as gr
12
+ from langchain_community.document_loaders import PyPDFLoader
13
+ from langchain_community.embeddings import HuggingFaceEmbeddings
14
+ from langchain_community.vectorstores import FAISS
15
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
16
+ from langchain.schema import Document
17
+ from dotenv import load_dotenv
18
+ import google.generativeai as genai
19
+
20
+ # Load environment variables
21
+ load_dotenv()
22
+
23
+ class EnhancedResearchAssistant:
24
+ """Enhanced assistant with better performance and accuracy"""
25
+
26
+ def __init__(self):
27
+ """Initialize with enhanced features"""
28
+ self.embeddings = HuggingFaceEmbeddings(
29
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
30
+ )
31
+
32
+ # Load papers with caching
33
+ self.papers = self._load_papers_cached()
34
+
35
+ # Create vector stores
36
+ self.vector_store = self._create_vector_store()
37
+
38
+ # Setup LLM
39
+ self.llm = self._setup_llm()
40
+
41
+ # Initialize response cache
42
+ self.response_cache = {}
43
+
44
+ # Pre-compute common contexts
45
+ self.precomputed_contexts = self._precompute_contexts()
46
+
47
+ def _load_papers_cached(self) -> Dict[str, Dict]:
48
+ """Load papers with caching to speed up startup"""
49
+ cache_file = "papers_metadata_cache.json"
50
+
51
+ # Try to load from cache
52
+ if os.path.exists(cache_file):
53
+ try:
54
+ with open(cache_file, 'r') as f:
55
+ print("Loading papers from cache...")
56
+ return json.load(f)
57
+ except:
58
+ pass
59
+
60
+ # Load papers fresh
61
+ papers = self._load_papers()
62
+
63
+ # Save to cache (excluding full text for size)
64
+ cache_data = {}
65
+ for key, paper in papers.items():
66
+ cache_data[key] = {
67
+ k: v for k, v in paper.items()
68
+ if k != "text" or len(v) < 1000 # Only cache short texts
69
+ }
70
+
71
+ try:
72
+ with open(cache_file, 'w') as f:
73
+ json.dump(cache_data, f)
74
+ except:
75
+ pass
76
+
77
+ return papers
78
+
79
+ def _load_papers(self) -> Dict[str, Dict]:
80
+ """Load all papers with enhanced metadata"""
81
+ papers = {}
82
+ pdf_dir = "documents"
83
+
84
+ paper_metadata = {
85
+ "r3d": {
86
+ "file": "r3d_arxiv_4apr2025.pdf",
87
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
88
+ "type": "JOB MARKET PAPER",
89
+ "year": 2025,
90
+ "coauthors": [],
91
+ "abstract_keywords": ["regression discontinuity", "distribution", "optimal transport", "wasserstein", "functional data"],
92
+ "description": "Extends RDD to analyze entire outcome distributions using optimal transport theory"
93
+ },
94
+ "fdr": {
95
+ "file": "fdr.pdf",
96
+ "title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
97
+ "type": "Working Paper",
98
+ "year": 2024,
99
+ "coauthors": [],
100
+ "abstract_keywords": ["free discontinuity", "internet shutdowns", "geometric measure theory", "non-parametric"],
101
+ "description": "Novel econometric method for estimating regression functions with unknown discontinuity locations"
102
+ },
103
+ "disco": {
104
+ "file": "disco.pdf",
105
+ "title": "disco: Distributional Synthetic Controls",
106
+ "type": "Working Paper",
107
+ "year": 2025,
108
+ "coauthors": ["Florian Gunsilius"],
109
+ "abstract_keywords": ["synthetic controls", "distribution", "stata package", "causal inference"],
110
+ "description": "Stata package implementing distributional synthetic control methods"
111
+ },
112
+ "rto": {
113
+ "file": "rto.pdf",
114
+ "title": "Return to Office and the Tenure Distribution",
115
+ "type": "Working Paper",
116
+ "year": 2025,
117
+ "coauthors": ["Florian Gunsilius", "Austin Wright"],
118
+ "abstract_keywords": ["return to office", "tenure", "covid", "remote work", "labor"],
119
+ "description": "Analyzes distributional impacts of return-to-office mandates on employee tenure"
120
+ },
121
+ "prodf": {
122
+ "file": "prodf.pdf",
123
+ "title": "On the Non-Identification of Revenue Production Functions",
124
+ "type": "Working Paper",
125
+ "year": 2023,
126
+ "coauthors": [],
127
+ "abstract_keywords": ["production functions", "identification", "revenue", "productivity"],
128
+ "description": "Proves non-identification of production functions when using revenue as output proxy"
129
+ },
130
+ "unmasking": {
131
+ "file": "unmasking_partisanship.pdf",
132
+ "title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
133
+ "type": "Published Paper",
134
+ "year": 2021,
135
+ "journal": "Journal of Public Economics",
136
+ "coauthors": ["Anton Ivanov", "Kecht Florian", "Marco Giani", "Luke Taylor"],
137
+ "abstract_keywords": ["masks", "covid", "partisanship", "polarization", "public health"],
138
+ "description": "Shows how political polarization undermined mask-wearing during COVID-19"
139
+ },
140
+ "ukraine": {
141
+ "file": "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf",
142
+ "title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
143
+ "type": "Published Paper",
144
+ "year": 2023,
145
+ "journal": "Science Advances",
146
+ "coauthors": ["Yuri Zhukov", "others"],
147
+ "abstract_keywords": ["ukraine", "war", "alerts", "public safety", "mobile data"],
148
+ "description": "Demonstrates effectiveness of air raid alerts in saving lives during Ukraine invasion"
149
+ },
150
+ "cv": {
151
+ "file": "CV_DavidVanDijcke.pdf",
152
+ "title": "Curriculum Vitae",
153
+ "type": "CV",
154
+ "year": 2025,
155
+ "description": "David Van Dijcke's academic CV"
156
+ }
157
+ }
158
+
159
+ for key, metadata in paper_metadata.items():
160
+ pdf_path = os.path.join(pdf_dir, metadata["file"])
161
+ if os.path.exists(pdf_path):
162
+ try:
163
+ loader = PyPDFLoader(pdf_path)
164
+ pages = loader.load()
165
+
166
+ # Extract full text
167
+ full_text = "\n\n".join([p.page_content for p in pages])
168
+
169
+ # Extract abstract if possible
170
+ abstract = self._extract_abstract(full_text)
171
+
172
+ papers[key] = {
173
+ "text": full_text,
174
+ "abstract": abstract,
175
+ "pages": len(pages),
176
+ "filename": metadata["file"],
177
+ **metadata # Include all metadata
178
+ }
179
+ print(f"Loaded {metadata['title']}: {len(pages)} pages")
180
+
181
+ except Exception as e:
182
+ print(f"Error loading {metadata['file']}: {e}")
183
+
184
+ return papers
185
+
186
+ def _extract_abstract(self, text: str) -> str:
187
+ """Extract abstract from paper text"""
188
+ text_lower = text.lower()
189
+
190
+ # Common abstract patterns
191
+ abstract_start_patterns = ["abstract\n", "abstract.", "abstract:", "summary\n"]
192
+ abstract_end_patterns = ["introduction", "keywords:", "jel codes:", "1 introduction", "1. introduction"]
193
+
194
+ for start_pattern in abstract_start_patterns:
195
+ if start_pattern in text_lower:
196
+ start_idx = text_lower.find(start_pattern) + len(start_pattern)
197
+
198
+ # Find end of abstract
199
+ end_idx = len(text)
200
+ for end_pattern in abstract_end_patterns:
201
+ if end_pattern in text_lower[start_idx:start_idx+3000]:
202
+ possible_end = text_lower.find(end_pattern, start_idx)
203
+ if possible_end > start_idx:
204
+ end_idx = min(end_idx, possible_end)
205
+
206
+ abstract = text[start_idx:end_idx].strip()
207
+ if 50 < len(abstract) < 2000: # Reasonable abstract length
208
+ return abstract
209
+
210
+ # Fallback: return first substantive paragraph
211
+ paragraphs = text.split('\n\n')
212
+ for para in paragraphs[1:10]: # Skip first (usually title)
213
+ if 100 < len(para) < 1000:
214
+ return para.strip()
215
+
216
+ return ""
217
+
218
+ def _create_vector_store(self) -> Optional[FAISS]:
219
+ """Create vector store with multiple granularities"""
220
+ try:
221
+ documents = []
222
+
223
+ # Create different chunk sizes for different purposes
224
+ text_splitters = {
225
+ "small": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
226
+ "medium": RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150),
227
+ "large": RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
228
+ }
229
+
230
+ for key, paper in self.papers.items():
231
+ # Add abstract as its own document
232
+ if paper.get("abstract"):
233
+ doc = Document(
234
+ page_content=f"{paper['title']}\n\nAbstract: {paper['abstract']}",
235
+ metadata={"source": key, "type": "abstract", "title": paper['title']}
236
+ )
237
+ documents.append(doc)
238
+
239
+ # Add chunks of different sizes
240
+ for size_name, splitter in text_splitters.items():
241
+ chunks = splitter.split_text(paper["text"])
242
+
243
+ for i, chunk in enumerate(chunks[:20]): # Limit chunks per paper
244
+ doc = Document(
245
+ page_content=chunk,
246
+ metadata={
247
+ "source": key,
248
+ "type": f"chunk_{size_name}",
249
+ "chunk": i,
250
+ "title": paper['title']
251
+ }
252
+ )
253
+ documents.append(doc)
254
+
255
+ if documents:
256
+ return FAISS.from_documents(documents, self.embeddings)
257
+
258
+ except Exception as e:
259
+ print(f"Error creating vector store: {e}")
260
+
261
+ return None
262
+
263
+ def _setup_llm(self):
264
+ """Setup Gemini LLM"""
265
+ api_key = os.getenv("GOOGLE_API_KEY")
266
+
267
+ if api_key:
268
+ try:
269
+ genai.configure(api_key=api_key)
270
+ # Try to use best available model
271
+ try:
272
+ return genai.GenerativeModel('gemini-1.5-pro')
273
+ except:
274
+ return genai.GenerativeModel('gemini-1.5-flash')
275
+ except Exception as e:
276
+ print(f"Error setting up Gemini: {e}")
277
+
278
+ return None
279
+
280
+ def _precompute_contexts(self) -> Dict[str, str]:
281
+ """Precompute contexts for common queries"""
282
+ contexts = {}
283
+
284
+ # Job market paper context
285
+ if "r3d" in self.papers:
286
+ r3d = self.papers["r3d"]
287
+ contexts["job_market"] = f"""[JOB MARKET PAPER - R3D]
288
+
289
+ Title: {r3d['title']}
290
+
291
+ Abstract: {r3d.get('abstract', 'See paper for abstract')}
292
+
293
+ Key Contributions:
294
+ 1. Extends RDD to distribution-valued outcomes
295
+ 2. Uses optimal transport theory and Wasserstein distances
296
+ 3. Develops new identification and estimation procedures
297
+ 4. Applications to income distributions, test scores, etc.
298
+
299
+ This paper addresses the limitation that traditional RDD only examines average effects, enabling analysis of entire outcome distributions."""
300
+
301
+ # Overview context
302
+ overview_parts = ["David Van Dijcke is an econometrician on the 2025-26 academic job market.\n\nPAPERS:"]
303
+ for key, paper in self.papers.items():
304
+ if key != "cv":
305
+ if paper['type'] == "JOB MARKET PAPER":
306
+ overview_parts.append(f"\n• {paper['type']}: {paper['title']}")
307
+ elif paper.get('journal'):
308
+ overview_parts.append(f"\n• {paper['journal']} ({paper['year']}): {paper['title']}")
309
+ else:
310
+ overview_parts.append(f"\n• {paper['type']} ({paper['year']}): {paper['title']}")
311
+
312
+ contexts["overview"] = "\n".join(overview_parts)
313
+
314
+ return contexts
315
+
316
+ def answer_question(self, query: str, chat_history: List = None) -> str:
317
+ """Answer questions with enhanced context and caching"""
318
+ if not query.strip():
319
+ return "Please ask a question about David Van Dijcke's research."
320
+
321
+ # Check cache
322
+ query_hash = hashlib.md5(query.lower().encode()).hexdigest()
323
+ if query_hash in self.response_cache:
324
+ return self.response_cache[query_hash]
325
+
326
+ # Get context
327
+ context = self._get_smart_context(query)
328
+
329
+ # Generate response
330
+ if self.llm:
331
+ response = self._generate_llm_response(query, context)
332
+ else:
333
+ response = self._generate_fallback_response(query, context)
334
+
335
+ # Cache response
336
+ self.response_cache[query_hash] = response
337
+
338
+ return response
339
+
340
+ def _get_smart_context(self, query: str) -> str:
341
+ """Get context with smart routing based on query type"""
342
+ query_lower = query.lower()
343
+
344
+ # Route to precomputed contexts
345
+ if any(phrase in query_lower for phrase in ["job market", "jmp"]):
346
+ return self.precomputed_contexts.get("job_market", "")
347
+
348
+ if any(phrase in query_lower for phrase in ["overview", "papers", "research", "what has david"]):
349
+ return self.precomputed_contexts.get("overview", "")
350
+
351
+ # Build custom context
352
+ contexts = []
353
+
354
+ # Add relevant papers based on keywords
355
+ paper_matches = self._match_papers_to_query(query_lower)
356
+
357
+ for paper_key in paper_matches[:3]: # Top 3 matches
358
+ if paper_key in self.papers:
359
+ paper = self.papers[paper_key]
360
+
361
+ # Create rich context
362
+ paper_context = f"[{paper['type']}: {paper['title']}]"
363
+
364
+ if paper.get('abstract'):
365
+ paper_context += f"\n\nAbstract: {paper['abstract']}"
366
+
367
+ if paper.get('coauthors'):
368
+ paper_context += f"\n\nCoauthors: {', '.join(paper['coauthors'])}"
369
+
370
+ # Add relevant text sections
371
+ relevant_sections = self._extract_relevant_sections(paper['text'], query_lower)
372
+ if relevant_sections:
373
+ paper_context += f"\n\nRelevant excerpts:\n{relevant_sections}"
374
+
375
+ contexts.append(paper_context)
376
+
377
+ # Add vector search results if needed
378
+ if not contexts and self.vector_store:
379
+ try:
380
+ docs = self.vector_store.similarity_search(query, k=5)
381
+ for doc in docs:
382
+ contexts.append(f"[From {doc.metadata['title']}]\n{doc.page_content}")
383
+ except:
384
+ pass
385
+
386
+ return "\n\n---\n\n".join(contexts[:3])
387
+
388
+ def _match_papers_to_query(self, query_lower: str) -> List[str]:
389
+ """Match papers to query using keywords and scoring"""
390
+ scores = {}
391
+
392
+ for key, paper in self.papers.items():
393
+ if key == "cv":
394
+ continue
395
+
396
+ score = 0
397
+
398
+ # Check title
399
+ title_lower = paper['title'].lower()
400
+ title_words = set(title_lower.split())
401
+ query_words = set(query_lower.split())
402
+
403
+ # Word overlap
404
+ overlap = len(title_words.intersection(query_words))
405
+ score += overlap * 2
406
+
407
+ # Check keywords
408
+ for keyword in paper.get('abstract_keywords', []):
409
+ if keyword.lower() in query_lower:
410
+ score += 3
411
+
412
+ # Special cases
413
+ if key == "r3d" and any(term in query_lower for term in ["job market", "jmp", "main paper"]):
414
+ score += 10
415
+
416
+ # Check description
417
+ if paper.get('description'):
418
+ desc_words = set(paper['description'].lower().split())
419
+ desc_overlap = len(desc_words.intersection(query_words))
420
+ score += desc_overlap
421
+
422
+ if score > 0:
423
+ scores[key] = score
424
+
425
+ # Sort by score
426
+ sorted_papers = sorted(scores.items(), key=lambda x: x[1], reverse=True)
427
+ return [paper[0] for paper in sorted_papers]
428
+
429
+ def _extract_relevant_sections(self, text: str, query_lower: str, max_length: int = 2000) -> str:
430
+ """Extract most relevant sections from paper text"""
431
+ # Split into paragraphs
432
+ paragraphs = text.split('\n\n')
433
+
434
+ # Score paragraphs
435
+ scored_paragraphs = []
436
+ query_words = set(query_lower.split())
437
+
438
+ for para in paragraphs:
439
+ if len(para) < 50: # Skip short paragraphs
440
+ continue
441
+
442
+ para_lower = para.lower()
443
+ para_words = set(para_lower.split())
444
+
445
+ # Calculate relevance score
446
+ score = len(query_words.intersection(para_words))
447
+
448
+ # Boost for specific sections
449
+ if any(header in para_lower[:50] for header in ["abstract", "introduction", "conclusion"]):
450
+ score += 2
451
+
452
+ if score > 0:
453
+ scored_paragraphs.append((score, para))
454
+
455
+ # Sort by score and take top paragraphs
456
+ scored_paragraphs.sort(key=lambda x: x[0], reverse=True)
457
+
458
+ relevant_text = []
459
+ total_length = 0
460
+
461
+ for score, para in scored_paragraphs[:5]:
462
+ if total_length + len(para) > max_length:
463
+ break
464
+ relevant_text.append(para)
465
+ total_length += len(para)
466
+
467
+ return "\n\n".join(relevant_text)
468
+
469
+ def _generate_llm_response(self, query: str, context: str) -> str:
470
+ """Generate response using LLM with enhanced prompting"""
471
+ prompt = f"""You are an expert research assistant for David Van Dijcke, an econometrician on the 2025-26 academic job market.
472
+
473
+ Key facts about David:
474
+ - Job Market Paper: R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
475
+ - Specializes in: Functional data analysis, optimal transport, econometric theory
476
+ - From: University of Michigan
477
+ - Research interests: Econometrics, Industrial Organization, Political Economy
478
+
479
+ Context from David's papers:
480
+ {context}
481
+
482
+ Question: {query}
483
+
484
+ Instructions:
485
+ 1. Answer based primarily on the provided context
486
+ 2. Be specific and cite paper titles
487
+ 3. For job market questions, emphasize R3D
488
+ 4. Highlight David's unique contributions and methods
489
+ 5. Keep responses concise but informative
490
+
491
+ Answer:"""
492
+
493
+ try:
494
+ response = self.llm.generate_content(prompt)
495
+ return response.text
496
+ except Exception as e:
497
+ print(f"Error generating response: {e}")
498
+ return self._generate_fallback_response(query, context)
499
+
500
+ def _generate_fallback_response(self, query: str, context: str) -> str:
501
+ """Generate response without LLM"""
502
+ query_lower = query.lower()
503
+
504
+ # Enhanced fallback responses based on context
505
+ if "job market" in query_lower:
506
+ return self.precomputed_contexts.get("job_market", "David's job market paper is R3D.")
507
+
508
+ if any(term in query_lower for term in ["overview", "research", "papers"]):
509
+ return self.precomputed_contexts.get("overview", "David has multiple papers in econometrics.")
510
+
511
+ # Parse context for specific information
512
+ if context:
513
+ lines = context.split('\n')
514
+ for line in lines[:10]:
515
+ if "Abstract:" in line or "JOB MARKET" in line:
516
+ return f"Based on David's papers:\n\n{context[:1000]}..."
517
+
518
+ return "I can help with questions about David Van Dijcke's research. For best results, please ensure Google API key is configured."
519
+
520
+ # Create enhanced Gradio interface
521
+ def create_interface():
522
+ """Create enhanced Gradio interface"""
523
+ assistant = EnhancedResearchAssistant()
524
+
525
+ def chat(message, history):
526
+ response = assistant.answer_question(message, history)
527
+ history.append([message, response])
528
+ return "", history
529
+
530
+ with gr.Blocks(title="David Van Dijcke - Research Assistant", theme=gr.themes.Soft()) as demo:
531
+ gr.Markdown("""
532
+ # David Van Dijcke - Enhanced Research Assistant
533
+
534
+ **Econometrician on the 2025-26 Job Market** | University of Michigan
535
+
536
+ Job Market Paper: **R3D - Regression Discontinuity Design with Distribution-Valued Outcomes**
537
+ """)
538
+
539
+ with gr.Row():
540
+ with gr.Column(scale=3):
541
+ chatbot = gr.Chatbot(height=500)
542
+ msg = gr.Textbox(
543
+ label="Ask about David's research",
544
+ placeholder="Examples: What is his job market paper about? What methods has he developed?",
545
+ lines=2
546
+ )
547
+
548
+ with gr.Row():
549
+ submit = gr.Button("Submit", variant="primary")
550
+ clear = gr.Button("Clear")
551
+
552
+ with gr.Column(scale=1):
553
+ gr.Markdown("### Quick Links")
554
+ gr.Markdown("""
555
+ **Papers:**
556
+ - R3D (Job Market Paper)
557
+ - Free Discontinuity Regression
558
+ - Distributional Synthetic Controls
559
+ - Return to Office
560
+ - Revenue Production Functions
561
+
562
+ **Try asking about:**
563
+ - Job market paper details
564
+ - Econometric methods
565
+ - Optimal transport applications
566
+ - Specific papers
567
+ - Research agenda
568
+ """)
569
+
570
+ # Examples
571
+ gr.Examples(
572
+ examples=[
573
+ "What is David's job market paper about?",
574
+ "Explain the R3D methodology in detail",
575
+ "What econometric methods has David developed?",
576
+ "How does David use optimal transport in his research?",
577
+ "What are the main contributions of the FDR paper?",
578
+ "Tell me about David's coauthors and collaborations",
579
+ "What makes David's research unique?",
580
+ "What are the policy implications of David's work?"
581
+ ],
582
+ inputs=msg
583
+ )
584
+
585
+ # Event handlers
586
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
587
+ submit.click(chat, [msg, chatbot], [msg, chatbot])
588
+ clear.click(lambda: None, None, chatbot, queue=False)
589
+
590
+ return demo
591
+
592
+ if __name__ == "__main__":
593
+ interface = create_interface()
594
+ interface.launch(
595
+ server_name="127.0.0.1",
596
+ server_port=7860,
597
+ share=False,
598
+ quiet=True
599
+ )
app_final.py ADDED
@@ -0,0 +1,246 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Final Research Assistant
4
+ Combines state-of-the-art LLM usage with stable Gradio interface
5
+ """
6
+
7
+ import os
8
+ from typing import List, Dict, Optional
9
+ import gradio as gr
10
+ from pypdf import PdfReader
11
+ from dotenv import load_dotenv
12
+ import google.generativeai as genai
13
+
14
+ # Load environment variables
15
+ load_dotenv()
16
+
17
+ class FinalResearchAssistant:
18
+ """State-of-the-art assistant with stable interface"""
19
+
20
+ def __init__(self):
21
+ """Initialize with full context approach"""
22
+ # Setup Gemini 2.5
23
+ self.llm = self._setup_llm()
24
+
25
+ # Load all papers at once
26
+ self.papers_full_text = self._load_all_papers()
27
+
28
+ # Create mega context
29
+ self.mega_context = self._create_mega_context()
30
+
31
+ # Initialize chat session
32
+ self.chat = None
33
+ if self.llm:
34
+ self._initialize_chat()
35
+
36
+ def _setup_llm(self):
37
+ """Setup Gemini 2.5 Flash"""
38
+ api_key = os.getenv("GOOGLE_API_KEY")
39
+
40
+ if not api_key:
41
+ print("No Google API key found")
42
+ return None
43
+
44
+ try:
45
+ genai.configure(api_key=api_key)
46
+
47
+ # Try Gemini 2.5 Flash Preview first
48
+ try:
49
+ model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
50
+ print("Using Gemini 2.5 Flash Preview")
51
+ return model
52
+ except:
53
+ # Fallback to stable version
54
+ model = genai.GenerativeModel('gemini-1.5-flash')
55
+ print("Using Gemini 1.5 Flash")
56
+ return model
57
+
58
+ except Exception as e:
59
+ print(f"Error setting up Gemini: {e}")
60
+ return None
61
+
62
+ def _load_all_papers(self) -> Dict[str, str]:
63
+ """Load all papers completely"""
64
+ papers = {}
65
+ pdf_dir = "documents"
66
+
67
+ paper_files = [
68
+ ("r3d", "r3d_arxiv_4apr2025.pdf", "JOB MARKET PAPER - R3D"),
69
+ ("cv", "CV_DavidVanDijcke.pdf", "CURRICULUM VITAE"),
70
+ ("fdr", "fdr.pdf", "Free Discontinuity Regression"),
71
+ ("disco", "disco.pdf", "Distributional Synthetic Controls"),
72
+ ("rto", "rto.pdf", "Return to Office"),
73
+ ("prodf", "prodf.pdf", "Revenue Production Functions"),
74
+ ]
75
+
76
+ for key, filename, title in paper_files:
77
+ pdf_path = os.path.join(pdf_dir, filename)
78
+ if os.path.exists(pdf_path):
79
+ try:
80
+ with open(pdf_path, 'rb') as file:
81
+ pdf_reader = PdfReader(file)
82
+
83
+ full_text = f"\n{'='*60}\n{title}\n{'='*60}\n\n"
84
+
85
+ for page_num, page in enumerate(pdf_reader.pages, 1):
86
+ text = page.extract_text()
87
+ if text.strip():
88
+ full_text += f"[Page {page_num}]\n{text}\n\n"
89
+
90
+ papers[key] = full_text
91
+ print(f"Loaded {title}: {len(full_text):,} chars")
92
+
93
+ except Exception as e:
94
+ print(f"Error loading {filename}: {e}")
95
+
96
+ return papers
97
+
98
+ def _create_mega_context(self) -> str:
99
+ """Create single context with all papers"""
100
+ context = "DAVID VAN DIJCKE - COMPLETE RESEARCH PORTFOLIO\n\n"
101
+
102
+ for key, text in self.papers_full_text.items():
103
+ context += text + "\n\n"
104
+
105
+ print(f"Total context: {len(context):,} characters")
106
+ return context
107
+
108
+ def _initialize_chat(self):
109
+ """Initialize chat with full context"""
110
+ try:
111
+ self.chat = self.llm.start_chat(history=[
112
+ {
113
+ "role": "user",
114
+ "parts": [f"""You are David Van Dijcke's research assistant. I'm giving you his complete research portfolio.
115
+
116
+ {self.mega_context}
117
+
118
+ Key facts:
119
+ - David is on the 2025-26 economics job market
120
+ - His JOB MARKET PAPER is R3D
121
+ - He's from University of Michigan
122
+ - He specializes in econometric methods
123
+
124
+ Please acknowledge you've loaded all papers."""]
125
+ },
126
+ {
127
+ "role": "model",
128
+ "parts": ["I've successfully loaded David Van Dijcke's complete research portfolio including his job market paper R3D, CV, and all other papers. I'm ready to answer any questions about his research, methods, or background."]
129
+ }
130
+ ])
131
+ print("Chat initialized with full context")
132
+ except Exception as e:
133
+ print(f"Could not initialize chat: {e}")
134
+ self.chat = None
135
+
136
+ def answer_question(self, query: str) -> str:
137
+ """Answer using full context"""
138
+ if not query.strip():
139
+ return "What would you like to know about David's research?"
140
+
141
+ if not self.llm:
142
+ return self._fallback_response(query)
143
+
144
+ try:
145
+ if self.chat:
146
+ # Use pre-loaded context
147
+ prompt = f"""Based on the papers I have loaded, please answer this question:
148
+
149
+ {query}
150
+
151
+ Remember to:
152
+ - Be conversational but accurate
153
+ - Reference specific papers when relevant
154
+ - For job market questions, focus on R3D
155
+ - Explain both intuition and technical details when appropriate"""
156
+
157
+ response = self.chat.send_message(prompt)
158
+ return response.text
159
+ else:
160
+ # Send everything in one request
161
+ prompt = f"""You are David Van Dijcke's research assistant. Based on his papers below, answer the question.
162
+
163
+ {self.mega_context}
164
+
165
+ Question: {query}
166
+
167
+ Be conversational, accurate, and highlight what makes David's work unique."""
168
+
169
+ response = self.llm.generate_content(prompt)
170
+ return response.text
171
+
172
+ except Exception as e:
173
+ print(f"Error: {e}")
174
+ return self._fallback_response(query)
175
+
176
+ def _fallback_response(self, query: str) -> str:
177
+ """Fallback without API"""
178
+ query_lower = query.lower()
179
+
180
+ if "job market" in query_lower:
181
+ return """David's job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
182
+
183
+ It extends RDD to analyze entire outcome distributions using optimal transport theory. This allows researchers to see not just if a policy works on average, but WHO it works for - crucial for understanding distributional effects and inequality."""
184
+
185
+ if "david" in query_lower or "who" in query_lower:
186
+ return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He develops novel methods for functional and distributional data analysis, with applications to important policy questions."""
187
+
188
+ return "I can help with questions about David's research. Please add a Google API key for best results."
189
+
190
+ # Simple interface
191
+ def create_interface():
192
+ """Create simple, stable interface"""
193
+ assistant = FinalResearchAssistant()
194
+
195
+ def chat(message, history):
196
+ response = assistant.answer_question(message)
197
+ history.append([message, response])
198
+ return "", history
199
+
200
+ with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
201
+ gr.Markdown("""
202
+ # David Van Dijcke - Research Assistant
203
+
204
+ **Econometrician | 2025-26 Job Market | University of Michigan**
205
+
206
+ Job Market Paper: **R3D - Regression Discontinuity Design with Distribution-Valued Outcomes**
207
+ """)
208
+
209
+ chatbot = gr.Chatbot(height=450)
210
+ msg = gr.Textbox(
211
+ label="Ask about David's research",
212
+ placeholder="What is his job market paper about? What methods has he developed?",
213
+ lines=2
214
+ )
215
+
216
+ with gr.Row():
217
+ submit = gr.Button("Send", variant="primary")
218
+ clear = gr.Button("Clear")
219
+
220
+ gr.Examples(
221
+ examples=[
222
+ "What is David's job market paper about?",
223
+ "Explain R3D's methodology - both intuition and technical details",
224
+ "What real-world problems can R3D solve?",
225
+ "How does David use optimal transport in his research?",
226
+ "What makes David's research unique?",
227
+ "Tell me about his other papers besides R3D"
228
+ ],
229
+ inputs=msg
230
+ )
231
+
232
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
233
+ submit.click(chat, [msg, chatbot], [msg, chatbot])
234
+ clear.click(lambda: None, None, chatbot, queue=False)
235
+
236
+ return demo
237
+
238
+ if __name__ == "__main__":
239
+ interface = create_interface()
240
+ # Use same launch config as stable version
241
+ interface.launch(
242
+ server_name="127.0.0.1",
243
+ server_port=7860,
244
+ share=False,
245
+ quiet=True
246
+ )
app_full_context.py ADDED
@@ -0,0 +1,401 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Research Assistant with Full Paper Context
4
+ Loads complete papers and uses Gemini's large context window for comprehensive responses
5
+ """
6
+
7
+ import os
8
+ import time
9
+ from typing import List, Dict, Any, Optional
10
+ import gradio as gr
11
+ from langchain_community.document_loaders import PyPDFLoader
12
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
13
+ from langchain_community.embeddings import HuggingFaceEmbeddings
14
+ from langchain_community.vectorstores import FAISS
15
+ from langchain.schema import Document
16
+ from dotenv import load_dotenv
17
+ import google.generativeai as genai
18
+
19
+ # Load environment variables
20
+ load_dotenv()
21
+
22
+ class FullContextResearchAssistant:
23
+ """Research assistant that loads full papers and uses large context windows"""
24
+
25
+ def __init__(self):
26
+ """Initialize the assistant with full document loading"""
27
+ self.embeddings = HuggingFaceEmbeddings(
28
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
29
+ )
30
+ self.documents = self._load_all_documents()
31
+ self.vector_store = self._create_vector_store()
32
+ self.llm = self._setup_llm()
33
+
34
+ # Cache full paper texts for direct retrieval
35
+ self.full_papers = self._load_full_papers()
36
+
37
+ def _load_full_papers(self) -> Dict[str, str]:
38
+ """Load complete text of each paper"""
39
+ papers = {}
40
+ pdf_dir = "documents"
41
+
42
+ paper_metadata = {
43
+ "r3d_arxiv_4apr2025.pdf": {
44
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
45
+ "type": "Job Market Paper",
46
+ "key": "r3d"
47
+ },
48
+ "fdr.pdf": {
49
+ "title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
50
+ "type": "Working Paper",
51
+ "key": "fdr"
52
+ },
53
+ "disco.pdf": {
54
+ "title": "Data-driven Inference on Optimal Stochastic Restrictions",
55
+ "type": "Working Paper",
56
+ "key": "disco"
57
+ },
58
+ "rto.pdf": {
59
+ "title": "Return to Office and the Tenure Distribution",
60
+ "type": "Working Paper",
61
+ "key": "rto"
62
+ },
63
+ "prodf.pdf": {
64
+ "title": "From output to outcomes: Productivity and the distributions it generates",
65
+ "type": "Working Paper",
66
+ "key": "prodf"
67
+ },
68
+ "unmasking_partisanship.pdf": {
69
+ "title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
70
+ "type": "Published Paper",
71
+ "key": "unmasking"
72
+ },
73
+ "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf": {
74
+ "title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
75
+ "type": "Published Paper",
76
+ "key": "ukraine"
77
+ },
78
+ "BrzezinskiKechtDeianaVanDijcke_18042020_CEPR_2.pdf": {
79
+ "title": "The Cost of Staying Open: Voluntary Social Distancing and Lockdowns in the US",
80
+ "type": "Published Paper",
81
+ "key": "staying_open"
82
+ },
83
+ "ssrn-3776854.pdf": {
84
+ "title": "Belief in Science Influences Physical Distancing in Response to COVID-19 Lockdown Policies",
85
+ "type": "Working Paper",
86
+ "key": "belief_science"
87
+ },
88
+ "BOE_revision_8dec2022.pdf": {
89
+ "title": "What Drives International Portfolio Flows?",
90
+ "type": "Working Paper",
91
+ "key": "portfolio_flows"
92
+ },
93
+ "CV_DavidVanDijcke.pdf": {
94
+ "title": "Curriculum Vitae",
95
+ "type": "CV",
96
+ "key": "cv"
97
+ }
98
+ }
99
+
100
+ for pdf_file, metadata in paper_metadata.items():
101
+ pdf_path = os.path.join(pdf_dir, pdf_file)
102
+ if os.path.exists(pdf_path):
103
+ try:
104
+ loader = PyPDFLoader(pdf_path)
105
+ # Load ALL pages
106
+ pages = loader.load()
107
+ full_text = "\n\n".join([page.page_content for page in pages])
108
+
109
+ papers[metadata["key"]] = {
110
+ "text": full_text,
111
+ "title": metadata["title"],
112
+ "type": metadata["type"],
113
+ "file": pdf_file,
114
+ "num_pages": len(pages),
115
+ "length": len(full_text)
116
+ }
117
+
118
+ print(f"Loaded full paper: {metadata['title']} ({len(pages)} pages, {len(full_text):,} chars)")
119
+
120
+ except Exception as e:
121
+ print(f"Error loading {pdf_file}: {e}")
122
+
123
+ return papers
124
+
125
+ def _load_all_documents(self) -> List[Document]:
126
+ """Load documents for vector store - using larger chunks"""
127
+ documents = []
128
+ pdf_dir = "documents"
129
+
130
+ # Use larger chunks for better context preservation
131
+ text_splitter = RecursiveCharacterTextSplitter(
132
+ chunk_size=2000, # Increased from 500
133
+ chunk_overlap=200, # Increased from 50
134
+ separators=["\n\n", "\n", " ", ""]
135
+ )
136
+
137
+ for pdf_file in os.listdir(pdf_dir):
138
+ if pdf_file.endswith('.pdf'):
139
+ pdf_path = os.path.join(pdf_dir, pdf_file)
140
+ try:
141
+ loader = PyPDFLoader(pdf_path)
142
+ pages = loader.load() # Load ALL pages
143
+
144
+ # Add metadata
145
+ for page in pages:
146
+ page.metadata['source'] = pdf_file
147
+ page.metadata['type'] = 'full_paper'
148
+
149
+ # Split into larger chunks
150
+ chunks = text_splitter.split_documents(pages)
151
+ documents.extend(chunks)
152
+
153
+ except Exception as e:
154
+ print(f"Error loading {pdf_file}: {e}")
155
+
156
+ return documents
157
+
158
+ def _create_vector_store(self) -> FAISS:
159
+ """Create or load vector store"""
160
+ cache_dir = "vector_store_cache_full"
161
+
162
+ if os.path.exists(cache_dir):
163
+ print("Loading cached vector store...")
164
+ return FAISS.load_local(cache_dir, self.embeddings, allow_dangerous_deserialization=True)
165
+
166
+ print(f"Creating vector store from {len(self.documents)} chunks...")
167
+ vector_store = FAISS.from_documents(self.documents, self.embeddings)
168
+
169
+ # Save for future use
170
+ os.makedirs(cache_dir, exist_ok=True)
171
+ vector_store.save_local(cache_dir)
172
+
173
+ return vector_store
174
+
175
+ def _setup_llm(self):
176
+ """Setup Gemini with large context window"""
177
+ api_key = os.getenv("GOOGLE_API_KEY")
178
+
179
+ if api_key:
180
+ genai.configure(api_key=api_key)
181
+ # Use Gemini 2.0 Flash for even larger context window
182
+ return genai.GenerativeModel('gemini-2.0-flash-exp')
183
+ else:
184
+ print("Warning: No GOOGLE_API_KEY found. Using limited mode.")
185
+ return None
186
+
187
+ def _get_relevant_papers(self, query: str) -> List[Dict[str, Any]]:
188
+ """Determine which full papers are most relevant to the query"""
189
+ # First use vector search to identify relevant papers
190
+ relevant_chunks = self.vector_store.similarity_search(query, k=10)
191
+
192
+ # Identify unique papers from chunks
193
+ relevant_paper_keys = set()
194
+ for chunk in relevant_chunks:
195
+ source = chunk.metadata.get('source', '')
196
+ # Map source file to paper key
197
+ for key, paper_info in self.full_papers.items():
198
+ if paper_info['file'] == source:
199
+ relevant_paper_keys.add(key)
200
+ break
201
+
202
+ # Also check for specific paper mentions in query
203
+ query_lower = query.lower()
204
+ keyword_map = {
205
+ 'r3d': ['r3d', 'regression discontinuity', 'distribution', 'job market'],
206
+ 'fdr': ['fdr', 'free discontinuity', 'internet shutdown'],
207
+ 'disco': ['disco', 'stochastic restriction', 'optimal transport'],
208
+ 'rto': ['rto', 'return to office', 'tenure'],
209
+ 'prodf': ['productivity', 'production function', 'revenue'],
210
+ 'unmasking': ['mask', 'partisan', 'polarization', 'covid'],
211
+ 'ukraine': ['ukraine', 'alert', 'invasion'],
212
+ 'staying_open': ['staying open', 'lockdown', 'voluntary'],
213
+ 'belief_science': ['belief', 'science', 'compliance'],
214
+ 'portfolio_flows': ['portfolio', 'flow', 'international'],
215
+ 'cv': ['cv', 'curriculum', 'job market', 'econometrician', 'david']
216
+ }
217
+
218
+ for key, keywords in keyword_map.items():
219
+ if any(keyword in query_lower for keyword in keywords):
220
+ relevant_paper_keys.add(key)
221
+
222
+ # Return full paper info for relevant papers
223
+ relevant_papers = []
224
+ for key in relevant_paper_keys:
225
+ if key in self.full_papers:
226
+ paper_info = self.full_papers[key].copy()
227
+ paper_info['key'] = key
228
+ relevant_papers.append(paper_info)
229
+
230
+ return relevant_papers
231
+
232
+ def answer_question(self, query: str) -> str:
233
+ """Answer questions using full paper context"""
234
+ if not query.strip():
235
+ return "Please ask a question about David Van Dijcke's research."
236
+
237
+ # Get relevant full papers
238
+ relevant_papers = self._get_relevant_papers(query)
239
+
240
+ if not relevant_papers and self.llm is None:
241
+ return self._get_fallback_response(query)
242
+
243
+ # Construct context with full papers
244
+ context_parts = []
245
+ total_chars = 0
246
+ max_chars = 1000000 # Gemini 2.0 Flash supports up to 1M tokens
247
+
248
+ # Always include CV first if available
249
+ if 'cv' in self.full_papers and total_chars < max_chars:
250
+ cv_text = self.full_papers['cv']['text'][:50000] # First 50k chars of CV
251
+ context_parts.append(f"=== CURRICULUM VITAE ===\n{cv_text}\n")
252
+ total_chars += len(cv_text)
253
+
254
+ # Add relevant papers
255
+ for paper in relevant_papers:
256
+ if total_chars >= max_chars:
257
+ break
258
+
259
+ paper_text = paper['text']
260
+ if total_chars + len(paper_text) > max_chars:
261
+ # Truncate if needed
262
+ paper_text = paper_text[:max_chars - total_chars]
263
+
264
+ context_parts.append(f"=== {paper['title'].upper()} ===\n{paper_text}\n")
265
+ total_chars += len(paper_text)
266
+
267
+ full_context = "\n\n".join(context_parts)
268
+
269
+ # Create prompt
270
+ prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market
271
+ who develops novel methods for analyzing functional and high-dimensional data.
272
+
273
+ You have access to David's FULL papers and CV. Use this comprehensive information to provide detailed, accurate answers.
274
+
275
+ Context (Full Papers):
276
+ {full_context}
277
+
278
+ Question: {query}
279
+
280
+ Instructions:
281
+ 1. Provide specific, detailed information from the papers
282
+ 2. Quote exact passages when relevant
283
+ 3. Explain technical concepts clearly
284
+ 4. Make connections across different papers when applicable
285
+ 5. Be precise about David's contributions and methods
286
+
287
+ Answer:"""
288
+
289
+ if self.llm:
290
+ try:
291
+ response = self.llm.generate_content(prompt)
292
+ return response.text
293
+ except Exception as e:
294
+ print(f"Error with Gemini API: {e}")
295
+ return self._get_fallback_response(query)
296
+ else:
297
+ return self._get_fallback_response(query)
298
+
299
+ def _get_fallback_response(self, query: str) -> str:
300
+ """Fallback response when API is not available"""
301
+ query_lower = query.lower()
302
+
303
+ responses = {
304
+ "r3d": """R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's job market paper.
305
+
306
+ Key features:
307
+ - Extends RDD to distribution-valued outcomes
308
+ - Uses optimal transport and Wasserstein distances
309
+ - Allows testing effects on entire outcome distributions
310
+ - Applications to income distributions, test scores, etc.""",
311
+
312
+ "david": """David Van Dijcke is an econometrician on the 2025-26 job market. He specializes in:
313
+ - Functional data analysis
314
+ - High-dimensional econometrics
315
+ - Optimal transport methods
316
+ - Applications to big data
317
+
318
+ Currently at University of Michigan, completing his PhD.""",
319
+
320
+ "methods": """David develops econometric methods for:
321
+ 1. Distribution-valued outcomes (R3D)
322
+ 2. Free discontinuity problems (FDR)
323
+ 3. Stochastic restrictions (DISCO)
324
+ 4. High-dimensional productivity analysis
325
+
326
+ His work bridges mathematical theory and practical applications."""
327
+ }
328
+
329
+ # Check for keywords
330
+ for key, response in responses.items():
331
+ if key in query_lower:
332
+ return response
333
+
334
+ return """I'm David Van Dijcke's research assistant. I can help with questions about:
335
+ - His job market paper (R3D)
336
+ - Econometric methods he's developed
337
+ - His research papers and applications
338
+ - His background and expertise
339
+
340
+ For best results, please add a Google API key."""
341
+
342
+ # Create Gradio interface
343
+ def create_interface():
344
+ """Create the Gradio interface"""
345
+ assistant = FullContextResearchAssistant()
346
+
347
+ # Header with API key info
348
+ with gr.Blocks(title="David Van Dijcke - Research Assistant") as interface:
349
+ gr.Markdown("""
350
+ # David Van Dijcke - Econometric Research Assistant (Full Context Version)
351
+
352
+ This enhanced version loads COMPLETE papers to provide comprehensive, detailed answers about David's research.
353
+
354
+ **Features:**
355
+ - Full paper context (not just excerpts)
356
+ - Detailed technical explanations
357
+ - Comprehensive method descriptions
358
+ - Cross-paper connections
359
+
360
+ For best performance, add your Google API key in the Space settings.
361
+ """)
362
+
363
+ # Check API status
364
+ api_status = "✅ Google API configured - Full context mode active" if os.getenv("GOOGLE_API_KEY") else "⚠️ No API key - Limited mode"
365
+ gr.Markdown(f"**Status:** {api_status}")
366
+
367
+ # Chat interface
368
+ chatbot = gr.Chatbot(height=500)
369
+ msg = gr.Textbox(
370
+ label="Ask about David's research",
371
+ placeholder="What is David's job market paper about? What methods does he develop?",
372
+ lines=2
373
+ )
374
+ clear = gr.Button("Clear")
375
+
376
+ # Examples
377
+ gr.Examples(
378
+ examples=[
379
+ "What is David's job market paper R3D about? Explain the technical details.",
380
+ "How does David use optimal transport in his research?",
381
+ "What are the main contributions of the FDR paper?",
382
+ "Explain David's work on productivity and distributional outcomes.",
383
+ "What policy applications does David's research have?",
384
+ "Tell me about David's background and why he's suited for an econometrics position."
385
+ ],
386
+ inputs=msg
387
+ )
388
+
389
+ def respond(message, chat_history):
390
+ bot_message = assistant.answer_question(message)
391
+ chat_history.append((message, bot_message))
392
+ return "", chat_history
393
+
394
+ msg.submit(respond, [msg, chatbot], [msg, chatbot])
395
+ clear.click(lambda: None, None, chatbot, queue=False)
396
+
397
+ return interface
398
+
399
+ if __name__ == "__main__":
400
+ interface = create_interface()
401
+ interface.launch()
app_natural.py ADDED
@@ -0,0 +1,355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Natural Research Assistant
4
+ Focuses on clear, accessible, and accurate responses
5
+ """
6
+
7
+ import os
8
+ from typing import List, Dict, Optional
9
+ import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from langchain_community.embeddings import HuggingFaceEmbeddings
12
+ from langchain_community.vectorstores import FAISS
13
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
14
+ from langchain.schema import Document
15
+ from dotenv import load_dotenv
16
+ import google.generativeai as genai
17
+
18
+ # Load environment variables
19
+ load_dotenv()
20
+
21
+ class NaturalResearchAssistant:
22
+ """Assistant focused on natural, accessible communication"""
23
+
24
+ def __init__(self):
25
+ """Initialize with focus on clarity"""
26
+ self.embeddings = HuggingFaceEmbeddings(
27
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
28
+ )
29
+
30
+ # Load papers
31
+ self.papers = self._load_papers_simple()
32
+
33
+ # Simple vector store
34
+ self.vector_store = self._create_simple_vector_store()
35
+
36
+ # Setup LLM
37
+ self.llm = self._setup_llm()
38
+
39
+ # Pre-written clear explanations
40
+ self.clear_explanations = self._create_clear_explanations()
41
+
42
+ def _load_papers_simple(self) -> Dict[str, Dict]:
43
+ """Load papers with focus on key information"""
44
+ papers = {}
45
+ pdf_dir = "documents"
46
+
47
+ # Essential paper info
48
+ paper_info = {
49
+ "r3d": {
50
+ "file": "r3d_arxiv_4apr2025.pdf",
51
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
52
+ "simple_explanation": "This paper extends a popular causal inference method (RDD) to study not just average effects but entire distributions - like how a policy affects income inequality, not just average income.",
53
+ "main_contribution": "Allows researchers to see how policies affect different parts of the population differently",
54
+ "real_world_use": "Can show if a minimum wage increase helps low earners more than high earners, or if a school policy helps struggling students catch up"
55
+ },
56
+ "fdr": {
57
+ "file": "fdr.pdf",
58
+ "title": "Free Discontinuity Regression",
59
+ "simple_explanation": "Develops a method to find sudden changes in data when you don't know where they occur - like finding where internet shutdowns hurt the economy most.",
60
+ "main_contribution": "Automatically detects breakpoints in data without pre-specifying them",
61
+ "real_world_use": "Measures economic damage from internet shutdowns, finds structural breaks in markets"
62
+ },
63
+ "rto": {
64
+ "file": "rto.pdf",
65
+ "title": "Return to Office and the Tenure Distribution",
66
+ "simple_explanation": "Studies how return-to-office mandates affect employee retention, finding that senior employees are more likely to leave.",
67
+ "main_contribution": "Shows RTO policies can backfire by driving away experienced talent",
68
+ "real_world_use": "Helps companies understand the hidden costs of ending remote work"
69
+ }
70
+ }
71
+
72
+ for key, info in paper_info.items():
73
+ pdf_path = os.path.join(pdf_dir, info["file"])
74
+ if os.path.exists(pdf_path):
75
+ try:
76
+ loader = PyPDFLoader(pdf_path)
77
+ pages = loader.load()
78
+ full_text = "\n\n".join([p.page_content for p in pages])
79
+
80
+ papers[key] = {
81
+ "text": full_text,
82
+ "pages": len(pages),
83
+ **info
84
+ }
85
+ except Exception as e:
86
+ print(f"Error loading {info['file']}: {e}")
87
+
88
+ return papers
89
+
90
+ def _create_simple_vector_store(self) -> Optional[FAISS]:
91
+ """Create simple vector store"""
92
+ try:
93
+ documents = []
94
+ text_splitter = RecursiveCharacterTextSplitter(
95
+ chunk_size=1000,
96
+ chunk_overlap=100
97
+ )
98
+
99
+ for key, paper in self.papers.items():
100
+ # Add the simple explanations as documents
101
+ if paper.get("simple_explanation"):
102
+ doc = Document(
103
+ page_content=f"{paper['title']}\n\n{paper['simple_explanation']}\n\nMain contribution: {paper['main_contribution']}\n\nReal-world use: {paper['real_world_use']}",
104
+ metadata={"source": key, "type": "explanation"}
105
+ )
106
+ documents.append(doc)
107
+
108
+ # Add some text chunks
109
+ chunks = text_splitter.split_text(paper["text"])[:10]
110
+ for i, chunk in enumerate(chunks):
111
+ doc = Document(
112
+ page_content=chunk,
113
+ metadata={"source": key, "type": "text", "chunk": i}
114
+ )
115
+ documents.append(doc)
116
+
117
+ return FAISS.from_documents(documents, self.embeddings) if documents else None
118
+
119
+ except Exception as e:
120
+ print(f"Error creating vector store: {e}")
121
+ return None
122
+
123
+ def _setup_llm(self):
124
+ """Setup Gemini LLM"""
125
+ api_key = os.getenv("GOOGLE_API_KEY")
126
+
127
+ if api_key:
128
+ try:
129
+ genai.configure(api_key=api_key)
130
+ return genai.GenerativeModel('gemini-1.5-flash')
131
+ except Exception as e:
132
+ print(f"Error setting up Gemini: {e}")
133
+
134
+ return None
135
+
136
+ def _create_clear_explanations(self) -> Dict[str, str]:
137
+ """Pre-written clear explanations for common questions"""
138
+ return {
139
+ "greeting": """Hi! I'm here to help explain David Van Dijcke's research in clear, accessible terms.
140
+
141
+ David is an econometrician on the job market who develops new statistical methods to answer important policy questions. His work helps us understand how policies affect different people differently - not just averages.
142
+
143
+ Feel free to ask about his job market paper (R3D), his other research, or what makes his work unique!""",
144
+
145
+ "job_market": """David's job market paper, R3D, solves an important problem in economics.
146
+
147
+ Traditional methods can tell us if a policy works "on average" - like whether a job training program increases average wages. But averages hide important details. Maybe the program helps low earners a lot but doesn't help high earners at all.
148
+
149
+ R3D lets researchers see the full picture - how a policy affects the entire distribution of outcomes. This means we can answer questions like:
150
+ - Does this education policy help struggling students catch up?
151
+ - Does this labor policy reduce inequality?
152
+ - Do subsidies benefit small firms more than large ones?
153
+
154
+ The technical innovation uses "optimal transport theory" - basically finding the most efficient way to compare whole distributions before and after a policy change.""",
155
+
156
+ "use_cases": """The R3D method has several important applications:
157
+
158
+ **Education Policy**: Instead of just asking "does this program raise test scores?", we can ask "does it help struggling students more than advanced students?"
159
+
160
+ **Labor Economics**: When studying minimum wage effects, we can see if it compresses the wage distribution (reduces inequality) beyond just raising the average.
161
+
162
+ **Development Economics**: For anti-poverty programs, we can see if they help the poorest households escape poverty or just slightly improve everyone's situation.
163
+
164
+ **Finance**: In studying financial regulations, we can see if they reduce extreme risks, not just average risk.
165
+
166
+ The key insight is that the same average effect can hide very different distributional stories - and those differences matter for policy.""",
167
+
168
+ "what_makes_unique": """What makes David's research unique:
169
+
170
+ 1. **Practical Focus**: While the methods are sophisticated, they're designed to answer real policy questions that matter to people's lives.
171
+
172
+ 2. **Distribution Thinking**: Most economics focuses on averages. David's work recognizes that how effects are distributed across people often matters more than the average.
173
+
174
+ 3. **Technical Innovation**: He brings tools from other fields (like optimal transport from mathematics) to solve economic problems in new ways.
175
+
176
+ 4. **Policy Relevance**: His papers directly address current issues - internet shutdowns, return-to-office policies, COVID responses - not just theoretical questions.
177
+
178
+ 5. **Clear Applications**: Each method comes with real examples showing how it helps answer important questions."""
179
+ }
180
+
181
+ def answer_question(self, query: str, chat_history: List = None) -> str:
182
+ """Answer with focus on clarity and accuracy"""
183
+ if not query.strip():
184
+ return "What would you like to know about David's research?"
185
+
186
+ query_lower = query.lower()
187
+
188
+ # Check for pre-written explanations
189
+ if any(greeting in query_lower for greeting in ["hi", "hello", "hey", "what's up"]):
190
+ return self.clear_explanations["greeting"]
191
+
192
+ if any(term in query_lower for term in ["job market", "jmp", "r3d"]) and "paper" in query_lower:
193
+ return self.clear_explanations["job_market"]
194
+
195
+ if any(term in query_lower for term in ["use", "application", "why", "purpose"]):
196
+ return self.clear_explanations["use_cases"]
197
+
198
+ if any(term in query_lower for term in ["unique", "special", "different"]):
199
+ return self.clear_explanations["what_makes_unique"]
200
+
201
+ # For other questions, use LLM with better prompting
202
+ if self.llm:
203
+ context = self._get_relevant_context(query)
204
+ return self._generate_natural_response(query, context)
205
+ else:
206
+ return self._get_simple_fallback(query)
207
+
208
+ def _get_relevant_context(self, query: str) -> str:
209
+ """Get relevant context focusing on explanations"""
210
+ contexts = []
211
+
212
+ # First, try to match specific papers
213
+ query_lower = query.lower()
214
+
215
+ for key, paper in self.papers.items():
216
+ paper_mentioned = False
217
+
218
+ # Check if paper is mentioned
219
+ if key in query_lower or any(word in query_lower for word in paper['title'].lower().split()):
220
+ paper_mentioned = True
221
+
222
+ if paper_mentioned:
223
+ context = f"Paper: {paper['title']}\n"
224
+ context += f"Simple explanation: {paper.get('simple_explanation', '')}\n"
225
+ context += f"Main contribution: {paper.get('main_contribution', '')}\n"
226
+ context += f"Real-world use: {paper.get('real_world_use', '')}"
227
+ contexts.append(context)
228
+
229
+ # If no specific paper matched, use vector search
230
+ if not contexts and self.vector_store:
231
+ try:
232
+ docs = self.vector_store.similarity_search(query, k=3)
233
+ for doc in docs:
234
+ contexts.append(doc.page_content)
235
+ except:
236
+ pass
237
+
238
+ return "\n\n---\n\n".join(contexts)
239
+
240
+ def _generate_natural_response(self, query: str, context: str) -> str:
241
+ """Generate natural, accessible response"""
242
+ prompt = f"""You are explaining David Van Dijcke's econometric research to someone who may not have a technical background.
243
+
244
+ David is on the 2025-26 economics job market. His job market paper is R3D.
245
+
246
+ Context about his work:
247
+ {context}
248
+
249
+ Question: {query}
250
+
251
+ Instructions:
252
+ 1. Give a clear, conversational answer in 2-3 paragraphs maximum
253
+ 2. Avoid technical jargon - explain concepts simply
254
+ 3. Use concrete examples when possible
255
+ 4. Focus on why this research matters, not just what it does
256
+ 5. Be friendly and approachable
257
+ 6. If discussing methods, explain the intuition, not the math
258
+
259
+ Answer in a natural, conversational tone:"""
260
+
261
+ try:
262
+ response = self.llm.generate_content(prompt)
263
+ return response.text
264
+ except Exception as e:
265
+ return self._get_simple_fallback(query)
266
+
267
+ def _get_simple_fallback(self, query: str) -> str:
268
+ """Simple fallback responses"""
269
+ query_lower = query.lower()
270
+
271
+ if "who" in query_lower or "david" in query_lower:
272
+ return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan.
273
+
274
+ He develops new statistical methods that help us understand how policies affect different people differently - going beyond simple averages to see the full picture. His job market paper (R3D) is about measuring distributional effects in policy evaluation."""
275
+
276
+ if "r3d" in query_lower or "job market" in query_lower:
277
+ return """R3D is David's job market paper. It extends regression discontinuity design to study entire distributions.
278
+
279
+ In simple terms: Traditional methods tell us if a policy works "on average." R3D shows us WHO it works for - whether it helps the poor more than the rich, struggling students more than advanced ones, etc. This matters because the same "average" effect can hide very different realities."""
280
+
281
+ return """I can help explain David Van Dijcke's research! He's an econometrician who develops methods to understand how policies affect different people differently.
282
+
283
+ Try asking about:
284
+ - His job market paper (R3D)
285
+ - What makes his research unique
286
+ - How his methods are used in practice"""
287
+
288
+ # Create interface
289
+ def create_interface():
290
+ """Create user-friendly interface"""
291
+ assistant = NaturalResearchAssistant()
292
+
293
+ def chat(message, history):
294
+ if history is None:
295
+ history = []
296
+ response = assistant.answer_question(message, history)
297
+ history.append([message, response])
298
+ return "", history
299
+
300
+ with gr.Blocks(title="David Van Dijcke - Research Assistant", css="""
301
+ .gradio-container {font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;}
302
+ """) as demo:
303
+
304
+ gr.Markdown("""
305
+ # Chat with David Van Dijcke's Research Assistant
306
+
307
+ **David Van Dijcke** | Econometrician | 2025-26 Job Market Candidate | University of Michigan
308
+ """)
309
+
310
+ chatbot = gr.Chatbot(
311
+ height=450,
312
+ show_label=False,
313
+ avatar_images=None
314
+ )
315
+
316
+ msg = gr.Textbox(
317
+ label="Your question",
318
+ placeholder="Ask me about David's research in plain English...",
319
+ lines=2
320
+ )
321
+
322
+ with gr.Row():
323
+ submit = gr.Button("Send", variant="primary")
324
+ clear = gr.Button("Clear Chat")
325
+
326
+ # Suggested questions
327
+ gr.Markdown("### Try asking:")
328
+ examples = gr.Examples(
329
+ examples=[
330
+ "What is David's job market paper about?",
331
+ "Why does R3D matter for policy?",
332
+ "What real-world problems does David's research solve?",
333
+ "How is David's work different from typical economics research?",
334
+ "Can you explain R3D without the technical details?",
335
+ "What are some applications of the R3D method?"
336
+ ],
337
+ inputs=msg,
338
+ label=""
339
+ )
340
+
341
+ # Event handlers
342
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
343
+ submit.click(chat, [msg, chatbot], [msg, chatbot])
344
+ clear.click(lambda: [], None, chatbot)
345
+
346
+ return demo
347
+
348
+ if __name__ == "__main__":
349
+ interface = create_interface()
350
+ interface.launch(
351
+ server_name="127.0.0.1",
352
+ server_port=7860,
353
+ share=False,
354
+ quiet=True
355
+ )
app_optimized.py ADDED
@@ -0,0 +1,554 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Optimized Research Assistant
4
+ Combines full paper loading with smart retrieval and caching
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import time
10
+ from typing import List, Dict, Any, Optional, Tuple
11
+ import gradio as gr
12
+ from langchain_community.document_loaders import PyPDFLoader
13
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
14
+ from langchain_community.embeddings import HuggingFaceEmbeddings
15
+ from langchain_community.vectorstores import FAISS
16
+ from langchain.schema import Document
17
+ from dotenv import load_dotenv
18
+ import google.generativeai as genai
19
+
20
+ # Load environment variables
21
+ load_dotenv()
22
+
23
+ class OptimizedResearchAssistant:
24
+ """Optimized assistant with full papers and smart retrieval"""
25
+
26
+ def __init__(self):
27
+ """Initialize with optimized loading and caching"""
28
+ self.embeddings = HuggingFaceEmbeddings(
29
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
30
+ )
31
+
32
+ # Load papers with smart caching
33
+ self.papers_metadata = self._load_papers_metadata()
34
+ self.full_papers = self._load_full_papers_cached()
35
+
36
+ # Create hierarchical vector stores
37
+ self.vector_store_chunks = self._create_vector_store("chunks")
38
+ self.vector_store_sections = self._create_vector_store("sections")
39
+
40
+ self.llm = self._setup_llm()
41
+
42
+ # Cache for responses
43
+ self.response_cache = {}
44
+
45
+ def _load_papers_metadata(self) -> Dict[str, Dict]:
46
+ """Load metadata about papers"""
47
+ return {
48
+ "r3d": {
49
+ "file": "r3d_arxiv_4apr2025.pdf",
50
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
51
+ "type": "Job Market Paper",
52
+ "year": 2025,
53
+ "keywords": ["regression discontinuity", "distribution", "optimal transport", "wasserstein", "job market"],
54
+ "sections": ["introduction", "theory", "identification", "estimation", "applications", "conclusion"]
55
+ },
56
+ "fdr": {
57
+ "file": "fdr.pdf",
58
+ "title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
59
+ "type": "Working Paper",
60
+ "year": 2024,
61
+ "keywords": ["free discontinuity", "internet shutdowns", "geometric measure theory"],
62
+ "sections": ["introduction", "methodology", "application", "results"]
63
+ },
64
+ "disco": {
65
+ "file": "disco.pdf",
66
+ "title": "disco: Distributional Synthetic Controls",
67
+ "type": "Working Paper",
68
+ "year": 2025,
69
+ "keywords": ["distributional", "synthetic controls", "stata", "package"],
70
+ "sections": ["introduction", "methodology", "implementation", "application", "conclusion"]
71
+ },
72
+ "rto": {
73
+ "file": "rto.pdf",
74
+ "title": "Return to Office and the Tenure Distribution",
75
+ "type": "Working Paper",
76
+ "year": 2025,
77
+ "keywords": ["return to office", "tenure", "distribution", "covid", "remote work"],
78
+ "sections": ["introduction", "data", "methodology", "results", "conclusion"]
79
+ },
80
+ "prodf": {
81
+ "file": "prodf.pdf",
82
+ "title": "On the Non-Identification of Revenue Production Functions",
83
+ "type": "Working Paper",
84
+ "year": 2023,
85
+ "keywords": ["production functions", "revenue", "identification", "productivity"],
86
+ "sections": ["introduction", "theory", "identification", "conclusion"]
87
+ },
88
+ "unmasking": {
89
+ "file": "unmasking_partisanship.pdf",
90
+ "title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
91
+ "type": "Published Paper",
92
+ "year": 2021,
93
+ "keywords": ["masks", "partisanship", "polarization", "covid", "public health"],
94
+ "sections": ["introduction", "data", "methodology", "results", "conclusion"]
95
+ },
96
+ "ukraine": {
97
+ "file": "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf",
98
+ "title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
99
+ "type": "Published Paper",
100
+ "year": 2023,
101
+ "keywords": ["ukraine", "alerts", "invasion", "public response", "war"],
102
+ "sections": ["introduction", "data", "methodology", "results", "conclusion"]
103
+ },
104
+ "staying_open": {
105
+ "file": "BrzezinskiKechtDeianaVanDijcke_18042020_CEPR_2.pdf",
106
+ "title": "The Cost of Staying Open: Voluntary Social Distancing and Lockdowns in the US",
107
+ "type": "Published Paper",
108
+ "year": 2020,
109
+ "keywords": ["covid", "lockdown", "staying open", "voluntary", "social distancing"],
110
+ "sections": ["introduction", "data", "methodology", "results", "conclusion"]
111
+ },
112
+ "belief_science": {
113
+ "file": "ssrn-3776854.pdf",
114
+ "title": "Belief in Science Influences Physical Distancing in Response to COVID-19 Lockdown Policies",
115
+ "type": "Working Paper",
116
+ "year": 2021,
117
+ "keywords": ["belief", "science", "covid", "compliance", "physical distancing"],
118
+ "sections": ["introduction", "data", "methodology", "results", "conclusion"]
119
+ },
120
+ "portfolio_flows": {
121
+ "file": "BOE_revision_8dec2022.pdf",
122
+ "title": "What Drives International Portfolio Flows?",
123
+ "type": "Working Paper",
124
+ "year": 2022,
125
+ "keywords": ["portfolio", "flows", "international", "finance", "investment"],
126
+ "sections": ["introduction", "theory", "data", "results", "conclusion"]
127
+ },
128
+ "cv": {
129
+ "file": "CV_DavidVanDijcke.pdf",
130
+ "title": "Curriculum Vitae",
131
+ "type": "CV",
132
+ "year": 2025,
133
+ "keywords": ["cv", "resume", "background", "econometrician", "david"],
134
+ "sections": ["education", "research", "teaching", "awards"]
135
+ }
136
+ }
137
+
138
+ def _load_full_papers_cached(self) -> Dict[str, Dict]:
139
+ """Load full papers with caching"""
140
+ cache_file = "papers_cache.json"
141
+
142
+ # Try to load from cache
143
+ if os.path.exists(cache_file):
144
+ try:
145
+ with open(cache_file, 'r') as f:
146
+ return json.load(f)
147
+ except:
148
+ pass
149
+
150
+ # Load papers
151
+ papers = {}
152
+ pdf_dir = "documents"
153
+
154
+ for key, metadata in self.papers_metadata.items():
155
+ pdf_path = os.path.join(pdf_dir, metadata["file"])
156
+ if os.path.exists(pdf_path):
157
+ try:
158
+ loader = PyPDFLoader(pdf_path)
159
+ pages = loader.load()
160
+
161
+ # Extract sections intelligently
162
+ sections = self._extract_sections(pages, metadata["sections"])
163
+
164
+ papers[key] = {
165
+ "full_text": "\n\n".join([p.page_content for p in pages]),
166
+ "sections": sections,
167
+ "num_pages": len(pages),
168
+ "metadata": metadata
169
+ }
170
+
171
+ print(f"Loaded: {metadata['title']} ({len(pages)} pages)")
172
+
173
+ except Exception as e:
174
+ print(f"Error loading {metadata['file']}: {e}")
175
+
176
+ # Cache for next time
177
+ try:
178
+ # Create a serializable version
179
+ cache_data = {}
180
+ for key, paper in papers.items():
181
+ cache_data[key] = {
182
+ "full_text": paper["full_text"],
183
+ "sections": paper["sections"],
184
+ "num_pages": paper["num_pages"],
185
+ "metadata": paper["metadata"]
186
+ }
187
+
188
+ with open(cache_file, 'w') as f:
189
+ json.dump(cache_data, f)
190
+ except:
191
+ pass
192
+
193
+ return papers
194
+
195
+ def _extract_sections(self, pages: List[Document], expected_sections: List[str]) -> Dict[str, str]:
196
+ """Extract paper sections intelligently"""
197
+ full_text = "\n\n".join([p.page_content for p in pages])
198
+ sections = {}
199
+
200
+ # Common section patterns
201
+ section_patterns = {
202
+ "introduction": ["introduction", "1 introduction", "1. introduction"],
203
+ "theory": ["theory", "theoretical", "model", "2 theory", "2. theory"],
204
+ "methodology": ["methodology", "method", "empirical strategy", "3 method"],
205
+ "data": ["data", "dataset", "4 data"],
206
+ "results": ["results", "findings", "5 results"],
207
+ "conclusion": ["conclusion", "concluding", "6 conclusion"]
208
+ }
209
+
210
+ # Extract sections
211
+ for section_key in expected_sections:
212
+ patterns = section_patterns.get(section_key, [section_key])
213
+
214
+ for pattern in patterns:
215
+ # Find section start
216
+ import re
217
+ regex = re.compile(rf"\n+\s*({re.escape(pattern)})\s*\n", re.IGNORECASE)
218
+ match = regex.search(full_text)
219
+
220
+ if match:
221
+ start = match.end()
222
+ # Find next section or end
223
+ next_match = None
224
+ for next_key in expected_sections:
225
+ if next_key != section_key:
226
+ next_patterns = section_patterns.get(next_key, [next_key])
227
+ for next_pattern in next_patterns:
228
+ next_regex = re.compile(rf"\n+\s*({re.escape(next_pattern)})\s*\n", re.IGNORECASE)
229
+ next_match = next_regex.search(full_text[start:])
230
+ if next_match:
231
+ break
232
+ if next_match:
233
+ break
234
+
235
+ end = start + next_match.start() if next_match else len(full_text)
236
+ sections[section_key] = full_text[start:end].strip()
237
+ break
238
+
239
+ return sections
240
+
241
+ def _create_vector_store(self, store_type: str) -> FAISS:
242
+ """Create or load vector stores"""
243
+ cache_dir = f"vector_store_cache_{store_type}"
244
+
245
+ if os.path.exists(cache_dir):
246
+ try:
247
+ # Try with newer langchain version parameter
248
+ return FAISS.load_local(cache_dir, self.embeddings, allow_dangerous_deserialization=True)
249
+ except TypeError:
250
+ # Fall back to older version without the parameter
251
+ return FAISS.load_local(cache_dir, self.embeddings)
252
+
253
+ documents = []
254
+
255
+ if store_type == "chunks":
256
+ # Smaller chunks for detailed retrieval
257
+ splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
258
+ else:
259
+ # Larger chunks for section-level retrieval
260
+ splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
261
+
262
+ for key, paper in self.full_papers.items():
263
+ # Create documents from sections
264
+ for section_name, section_text in paper["sections"].items():
265
+ if section_text:
266
+ doc = Document(
267
+ page_content=section_text,
268
+ metadata={
269
+ "paper_key": key,
270
+ "section": section_name,
271
+ "title": paper["metadata"]["title"],
272
+ "type": paper["metadata"]["type"]
273
+ }
274
+ )
275
+
276
+ # Split if needed
277
+ if store_type == "chunks":
278
+ chunks = splitter.split_documents([doc])
279
+ documents.extend(chunks)
280
+ else:
281
+ documents.append(doc)
282
+
283
+ vector_store = FAISS.from_documents(documents, self.embeddings)
284
+ os.makedirs(cache_dir, exist_ok=True)
285
+ vector_store.save_local(cache_dir)
286
+
287
+ return vector_store
288
+
289
+ def _setup_llm(self):
290
+ """Setup Gemini model"""
291
+ api_key = os.getenv("GOOGLE_API_KEY")
292
+
293
+ if api_key:
294
+ genai.configure(api_key=api_key)
295
+ # Use latest Gemini model
296
+ return genai.GenerativeModel('gemini-2.0-flash-exp')
297
+
298
+ return None
299
+
300
+ def _get_query_type(self, query: str) -> str:
301
+ """Determine query type for optimal retrieval"""
302
+ query_lower = query.lower()
303
+
304
+ if any(term in query_lower for term in ["technical", "method", "econometric", "detail"]):
305
+ return "technical"
306
+ elif any(term in query_lower for term in ["overview", "summary", "about", "who is"]):
307
+ return "overview"
308
+ elif any(term in query_lower for term in ["application", "policy", "empirical"]):
309
+ return "application"
310
+ elif any(term in query_lower for term in ["job market", "cv", "background"]):
311
+ return "biographical"
312
+ else:
313
+ return "general"
314
+
315
+ def _smart_retrieval(self, query: str) -> Tuple[str, List[str]]:
316
+ """Smart retrieval based on query type"""
317
+ query_type = self._get_query_type(query)
318
+
319
+ # Determine which papers are most relevant
320
+ relevant_papers = self._identify_relevant_papers(query)
321
+
322
+ context_parts = []
323
+ paper_list = []
324
+
325
+ # Always include CV summary for biographical queries
326
+ if query_type == "biographical" and "cv" in self.full_papers:
327
+ cv_sections = self.full_papers["cv"]["sections"]
328
+ context_parts.append(f"=== CV HIGHLIGHTS ===\n{cv_sections.get('education', '')}\n{cv_sections.get('research', '')}")
329
+ paper_list.append("CV")
330
+
331
+ # Add relevant papers based on query type
332
+ if query_type == "technical":
333
+ # For technical queries, include theory and methodology sections
334
+ for paper_key in relevant_papers[:3]: # Top 3 papers
335
+ if paper_key in self.full_papers:
336
+ paper = self.full_papers[paper_key]
337
+ sections = paper["sections"]
338
+
339
+ title = paper["metadata"]["title"]
340
+ theory = sections.get("theory", sections.get("methodology", ""))
341
+
342
+ if theory:
343
+ context_parts.append(f"=== {title} - TECHNICAL DETAILS ===\n{theory[:20000]}")
344
+ paper_list.append(title)
345
+
346
+ elif query_type == "overview":
347
+ # For overview queries, include introductions and conclusions
348
+ for paper_key in relevant_papers[:4]: # Top 4 papers
349
+ if paper_key in self.full_papers:
350
+ paper = self.full_papers[paper_key]
351
+ sections = paper["sections"]
352
+
353
+ title = paper["metadata"]["title"]
354
+ intro = sections.get("introduction", "")[:5000]
355
+ conclusion = sections.get("conclusion", "")[:3000]
356
+
357
+ context_parts.append(f"=== {title} ===\nIntroduction:\n{intro}\n\nConclusion:\n{conclusion}")
358
+ paper_list.append(title)
359
+
360
+ else:
361
+ # For general queries, use hybrid approach
362
+ # Get relevant chunks
363
+ chunks = self.vector_store_chunks.similarity_search(query, k=6)
364
+
365
+ # Group by paper
366
+ paper_chunks = {}
367
+ for chunk in chunks:
368
+ paper_key = chunk.metadata.get("paper_key")
369
+ if paper_key:
370
+ if paper_key not in paper_chunks:
371
+ paper_chunks[paper_key] = []
372
+ paper_chunks[paper_key].append(chunk.page_content)
373
+
374
+ # Add grouped chunks
375
+ for paper_key, chunks in paper_chunks.items():
376
+ if paper_key in self.full_papers:
377
+ title = self.full_papers[paper_key]["metadata"]["title"]
378
+ combined_chunks = "\n\n".join(chunks)
379
+ context_parts.append(f"=== {title} - RELEVANT EXCERPTS ===\n{combined_chunks}")
380
+ paper_list.append(title)
381
+
382
+ return "\n\n".join(context_parts), paper_list
383
+
384
+ def _identify_relevant_papers(self, query: str) -> List[str]:
385
+ """Identify most relevant papers for a query"""
386
+ query_lower = query.lower()
387
+ scores = {}
388
+
389
+ for key, metadata in self.papers_metadata.items():
390
+ score = 0
391
+
392
+ # Check keywords
393
+ for keyword in metadata["keywords"]:
394
+ if keyword in query_lower:
395
+ score += 2
396
+
397
+ # Check title
398
+ if any(word in query_lower for word in metadata["title"].lower().split()):
399
+ score += 1
400
+
401
+ # Special cases
402
+ if key == "r3d" and any(term in query_lower for term in ["job market", "jmp", "main paper"]):
403
+ score += 5
404
+
405
+ if score > 0:
406
+ scores[key] = score
407
+
408
+ # Sort by score
409
+ sorted_papers = sorted(scores.items(), key=lambda x: x[1], reverse=True)
410
+
411
+ return [paper[0] for paper in sorted_papers]
412
+
413
+ def answer_question(self, query: str) -> str:
414
+ """Answer questions with optimized retrieval"""
415
+ if not query.strip():
416
+ return "Please ask a question about David Van Dijcke's research."
417
+
418
+ # Check cache
419
+ cache_key = query.lower().strip()
420
+ if cache_key in self.response_cache:
421
+ return self.response_cache[cache_key]
422
+
423
+ # Get relevant context
424
+ context, papers_used = self._smart_retrieval(query)
425
+
426
+ if not self.llm:
427
+ return self._get_fallback_response(query)
428
+
429
+ # Create optimized prompt
430
+ prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market.
431
+
432
+ Context from papers: {', '.join(papers_used)}
433
+
434
+ {context}
435
+
436
+ Question: {query}
437
+
438
+ Instructions:
439
+ - Provide accurate, detailed answers based on the context
440
+ - Use specific examples and technical details when relevant
441
+ - Be clear and precise about David's contributions
442
+ - If discussing methods, explain both the intuition and technical aspects
443
+
444
+ Answer:"""
445
+
446
+ try:
447
+ response = self.llm.generate_content(prompt)
448
+ answer = response.text
449
+
450
+ # Cache response
451
+ self.response_cache[cache_key] = answer
452
+
453
+ return answer
454
+
455
+ except Exception as e:
456
+ print(f"Error: {e}")
457
+ return self._get_fallback_response(query)
458
+
459
+ def _get_fallback_response(self, query: str) -> str:
460
+ """Enhanced fallback responses"""
461
+ query_lower = query.lower()
462
+
463
+ # Check for specific paper mentions
464
+ if "r3d" in query_lower or "job market" in query_lower:
465
+ return """R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David Van Dijcke's job market paper.
466
+
467
+ Key innovations:
468
+ • Extends RDD to analyze entire outcome distributions, not just means
469
+ • Uses optimal transport theory and Wasserstein distances
470
+ • Develops new estimation and inference procedures
471
+ • Applications to income distributions, test score distributions
472
+
473
+ The paper addresses a fundamental limitation of traditional RDD that only looks at average effects, enabling researchers to study distributional impacts of policies."""
474
+
475
+ elif "fdr" in query_lower or "free discontinuity" in query_lower:
476
+ return """Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuities.
477
+
478
+ Key contributions:
479
+ • Develops methods for when discontinuity locations are unknown
480
+ • Uses geometric measure theory and free discontinuity problems
481
+ • Application to internet shutdowns' economic effects
482
+ • Shows traditional methods can be severely biased when discontinuities are misspecified"""
483
+
484
+ elif "david" in query_lower or "who" in query_lower:
485
+ return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan.
486
+
487
+ Specializations:
488
+ • Functional data analysis and high-dimensional econometrics
489
+ • Optimal transport methods in economics
490
+ • Distribution-valued outcomes and treatment effects
491
+ • Novel applications of geometric measure theory
492
+
493
+ His research develops cutting-edge econometric methods for modern data challenges, with applications to labor, development, and public policy."""
494
+
495
+ return "I'm David Van Dijcke's research assistant. Please ask about his econometric methods, papers, or background. For best results, configure a Google API key."
496
+
497
+ # Create interface
498
+ def create_interface():
499
+ """Create Gradio interface"""
500
+ assistant = OptimizedResearchAssistant()
501
+
502
+ with gr.Blocks(title="David Van Dijcke - Research Assistant") as interface:
503
+ gr.Markdown("""
504
+ # David Van Dijcke - Optimized Research Assistant
505
+
506
+ **Advanced Features:**
507
+ - Full paper loading with intelligent section extraction
508
+ - Smart retrieval based on query type
509
+ - Response caching for instant repeated queries
510
+ - Hierarchical search (sections + chunks)
511
+
512
+ Ask about David's econometric methods, research papers, or academic background.
513
+ """)
514
+
515
+ # API status
516
+ api_status = "✅ Full functionality enabled" if os.getenv("GOOGLE_API_KEY") else "⚠️ Limited mode"
517
+ gr.Markdown(f"**Status:** {api_status}")
518
+
519
+ chatbot = gr.Chatbot(height=500)
520
+ msg = gr.Textbox(
521
+ label="Your question",
522
+ placeholder="Example: Explain the technical innovations in David's job market paper",
523
+ lines=2
524
+ )
525
+ clear = gr.Button("Clear")
526
+
527
+ # Advanced examples
528
+ gr.Examples(
529
+ examples=[
530
+ "What are the key technical innovations in R3D? Explain the methodology in detail.",
531
+ "How does David apply optimal transport theory across his different papers?",
532
+ "Compare the identification strategies used in R3D versus FDR.",
533
+ "What makes David's approach to functional data analysis unique?",
534
+ "Explain how David's work on productivity relates to distributional outcomes.",
535
+ "What are David's main contributions to econometric theory and methods?",
536
+ "How do David's papers address policy-relevant questions?",
537
+ "What is David's research agenda and how do his papers fit together?"
538
+ ],
539
+ inputs=msg
540
+ )
541
+
542
+ def respond(message, chat_history):
543
+ bot_message = assistant.answer_question(message)
544
+ chat_history.append((message, bot_message))
545
+ return "", chat_history
546
+
547
+ msg.submit(respond, [msg, chatbot], [msg, chatbot])
548
+ clear.click(lambda: None, None, chatbot, queue=False)
549
+
550
+ return interface
551
+
552
+ if __name__ == "__main__":
553
+ interface = create_interface()
554
+ interface.launch()
app_professional.py ADDED
@@ -0,0 +1,233 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Professional Research Assistant
4
+ Clean chat interface with expert responses
5
+ """
6
+
7
+ import os
8
+ from typing import List, Tuple
9
+ import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from dotenv import load_dotenv
12
+ import google.generativeai as genai
13
+
14
+ # Load environment variables
15
+ load_dotenv()
16
+
17
+ class ProfessionalAssistant:
18
+ """Professional assistant that speaks as an expert about David's work"""
19
+
20
+ def __init__(self):
21
+ # Setup Gemini
22
+ api_key = os.getenv("GOOGLE_API_KEY")
23
+ if api_key:
24
+ genai.configure(api_key=api_key)
25
+ try:
26
+ self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
27
+ print("Using Gemini 2.5 Flash Preview")
28
+ except:
29
+ self.model = genai.GenerativeModel('gemini-1.5-flash')
30
+ print("Using Gemini 1.5 Flash")
31
+ else:
32
+ self.model = None
33
+
34
+ # Load all papers
35
+ self.papers = self._load_all_papers()
36
+
37
+ # Pre-load context
38
+ self.context = self._create_context()
39
+
40
+ def _load_all_papers(self) -> dict:
41
+ """Load all papers completely"""
42
+ papers = {}
43
+ pdf_dir = "documents"
44
+
45
+ paper_files = {
46
+ "r3d": ("r3d_arxiv_4apr2025.pdf", "R3D (Job Market Paper)"),
47
+ "cv": ("CV_DavidVanDijcke.pdf", "CV"),
48
+ "fdr": ("fdr.pdf", "Free Discontinuity Regression"),
49
+ "disco": ("disco.pdf", "Distributional Synthetic Controls"),
50
+ "rto": ("rto.pdf", "Return to Office"),
51
+ "prodf": ("prodf.pdf", "Revenue Production Functions"),
52
+ "unmasking": ("unmasking_partisanship.pdf", "Unmasking Partisanship"),
53
+ "ukraine": ("van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Ukraine Alerts")
54
+ }
55
+
56
+ for key, (filename, title) in paper_files.items():
57
+ pdf_path = os.path.join(pdf_dir, filename)
58
+ if os.path.exists(pdf_path):
59
+ try:
60
+ loader = PyPDFLoader(pdf_path)
61
+ pages = loader.load()
62
+ text = "\n\n".join([p.page_content for p in pages])
63
+ papers[key] = {
64
+ "text": text,
65
+ "title": title,
66
+ "pages": len(pages)
67
+ }
68
+ print(f"Loaded {title}: {len(pages)} pages")
69
+ except Exception as e:
70
+ print(f"Error loading {filename}: {e}")
71
+
72
+ return papers
73
+
74
+ def _create_context(self) -> str:
75
+ """Create comprehensive context from all papers"""
76
+ context_parts = []
77
+
78
+ # Add papers in priority order
79
+ priority_order = ["r3d", "cv", "fdr", "disco", "rto", "prodf"]
80
+
81
+ for key in priority_order:
82
+ if key in self.papers:
83
+ paper = self.papers[key]
84
+ # Add substantial excerpts
85
+ excerpt_length = 30000 if key == "r3d" else 15000
86
+ context_parts.append(f"\n[{paper['title']}]\n{paper['text'][:excerpt_length]}")
87
+
88
+ return "\n\n".join(context_parts)
89
+
90
+ def chat(self, message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
91
+ """Chat with proper history handling"""
92
+ if not message.strip():
93
+ return "", history
94
+
95
+ if not self.model:
96
+ response = "I need a Google API key to provide detailed answers about David's research."
97
+ history.append((message, response))
98
+ return "", history
99
+
100
+ # Build conversation context
101
+ conversation = "Previous conversation:\n"
102
+ for human, assistant in history[-3:]: # Last 3 exchanges
103
+ conversation += f"User: {human}\nAssistant: {assistant}\n\n"
104
+
105
+ # Determine which papers to emphasize based on query
106
+ message_lower = message.lower()
107
+ specific_context = ""
108
+
109
+ if "job market" in message_lower or "r3d" in message_lower:
110
+ if "r3d" in self.papers:
111
+ specific_context = f"\n[R3D - Job Market Paper]\n{self.papers['r3d']['text'][:50000]}\n"
112
+ elif "fdr" in message_lower or "discontinuity" in message_lower:
113
+ if "fdr" in self.papers:
114
+ specific_context = f"\n[FDR Paper]\n{self.papers['fdr']['text'][:30000]}\n"
115
+
116
+ # Create prompt
117
+ prompt = f"""You are an expert assistant helping visitors learn about David Van Dijcke's research.
118
+
119
+ CRITICAL INSTRUCTIONS:
120
+ - You are NOT David - you are an expert explaining his work to website visitors
121
+ - Speak in third person about David (use "David" or "Van Dijcke", not "I" or "my")
122
+ - Be conversational but professional
123
+ - Give concise, informative answers (2-3 paragraphs max unless asked for details)
124
+ - Don't say "based on the provided papers" - just state facts confidently
125
+ - Focus on what makes his work innovative and important
126
+
127
+ Key facts:
128
+ - David is an econometrician on the 2025-26 job market from University of Michigan
129
+ - His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
130
+ - He specializes in functional data analysis and optimal transport methods
131
+
132
+ {conversation}
133
+
134
+ Full research context:
135
+ {self.context}
136
+
137
+ {specific_context}
138
+
139
+ Current question: {message}
140
+
141
+ Provide a concise, expert response:"""
142
+
143
+ try:
144
+ response = self.model.generate_content(prompt)
145
+ answer = response.text
146
+ history.append((message, answer))
147
+ return "", history
148
+ except Exception as e:
149
+ error_response = f"I encountered an error. Please try rephrasing your question."
150
+ history.append((message, error_response))
151
+ return "", history
152
+
153
+ # Create interface
154
+ def create_interface():
155
+ assistant = ProfessionalAssistant()
156
+
157
+ # Custom CSS for a clean look
158
+ custom_css = """
159
+ .gradio-container {
160
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
161
+ max-width: 900px;
162
+ margin: auto;
163
+ }
164
+ .chatbot {
165
+ height: 500px !important;
166
+ }
167
+ .message {
168
+ font-size: 15px !important;
169
+ line-height: 1.6 !important;
170
+ }
171
+ """
172
+
173
+ with gr.Blocks(title="David Van Dijcke | Research Assistant", css=custom_css) as demo:
174
+ gr.Markdown("""
175
+ ## David Van Dijcke - Research Assistant
176
+
177
+ Welcome! I can help you learn about David Van Dijcke's econometric research. David is on the 2025-26 academic job market.
178
+
179
+ **Job Market Paper:** R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
180
+ """)
181
+
182
+ chatbot = gr.Chatbot(
183
+ value=[],
184
+ elem_classes=["chatbot"],
185
+ bubble_full_width=False,
186
+ avatar_images=(None, None),
187
+ show_label=False
188
+ )
189
+
190
+ with gr.Row():
191
+ msg = gr.Textbox(
192
+ show_label=False,
193
+ placeholder="Ask about David's research, methods, or papers...",
194
+ elem_classes=["message-input"],
195
+ scale=4
196
+ )
197
+ submit = gr.Button("Send", scale=1, variant="primary")
198
+
199
+ # Clear button
200
+ clear = gr.Button("Clear conversation", size="sm")
201
+
202
+ # Examples in a nice layout
203
+ gr.Examples(
204
+ examples=[
205
+ "What is David's job market paper about?",
206
+ "What makes R3D innovative?",
207
+ "What are the practical applications of R3D?",
208
+ "Tell me about David's other research besides R3D",
209
+ "What makes David a strong candidate for an econometrics position?"
210
+ ],
211
+ inputs=msg,
212
+ label="Example questions:"
213
+ )
214
+
215
+ # Event handlers
216
+ msg.submit(assistant.chat, [msg, chatbot], [msg, chatbot])
217
+ submit.click(assistant.chat, [msg, chatbot], [msg, chatbot])
218
+ clear.click(lambda: [], None, chatbot, queue=False)
219
+
220
+ gr.Markdown("""
221
+ ---
222
+ *This assistant has access to David's complete research portfolio including published papers, working papers, and CV.*
223
+ """)
224
+
225
+ return demo
226
+
227
+ if __name__ == "__main__":
228
+ interface = create_interface()
229
+ interface.launch(
230
+ server_name="127.0.0.1",
231
+ server_port=7860,
232
+ show_error=True
233
+ )
app_simple_chat.py ADDED
@@ -0,0 +1,124 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Simple Chat Assistant
4
+ Minimal implementation without Gradio complications
5
+ """
6
+
7
+ import os
8
+ from typing import List, Dict
9
+ import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from dotenv import load_dotenv
12
+ import google.generativeai as genai
13
+
14
+ # Load environment variables
15
+ load_dotenv()
16
+
17
+ class SimpleChatAssistant:
18
+ """Simple assistant without complex features"""
19
+
20
+ def __init__(self):
21
+ # Setup Gemini
22
+ api_key = os.getenv("GOOGLE_API_KEY")
23
+ if api_key:
24
+ genai.configure(api_key=api_key)
25
+ try:
26
+ self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
27
+ print("Using Gemini 2.5 Flash Preview")
28
+ except:
29
+ self.model = genai.GenerativeModel('gemini-1.5-flash')
30
+ print("Using Gemini 1.5 Flash")
31
+ else:
32
+ self.model = None
33
+ print("No API key found")
34
+
35
+ # Load papers
36
+ self.papers = self._load_papers()
37
+
38
+ def _load_papers(self) -> Dict[str, str]:
39
+ """Load key papers"""
40
+ papers = {}
41
+ pdf_dir = "documents"
42
+
43
+ key_files = [
44
+ ("r3d", "r3d_arxiv_4apr2025.pdf"),
45
+ ("cv", "CV_DavidVanDijcke.pdf"),
46
+ ("fdr", "fdr.pdf")
47
+ ]
48
+
49
+ for key, filename in key_files:
50
+ pdf_path = os.path.join(pdf_dir, filename)
51
+ if os.path.exists(pdf_path):
52
+ try:
53
+ loader = PyPDFLoader(pdf_path)
54
+ pages = loader.load()
55
+ text = "\n\n".join([p.page_content for p in pages])
56
+ papers[key] = text
57
+ print(f"Loaded {filename}: {len(pages)} pages")
58
+ except Exception as e:
59
+ print(f"Error loading {filename}: {e}")
60
+
61
+ return papers
62
+
63
+ def chat(self, message: str) -> str:
64
+ """Simple chat function"""
65
+ if not message.strip():
66
+ return "What would you like to know about David's research?"
67
+
68
+ if not self.model:
69
+ return "Please set up your Google API key to use the assistant."
70
+
71
+ # Build context
72
+ context = ""
73
+
74
+ # Add relevant paper based on query
75
+ message_lower = message.lower()
76
+
77
+ if "job market" in message_lower or "jmp" in message_lower:
78
+ if "r3d" in self.papers:
79
+ context = f"[JOB MARKET PAPER - R3D]\n\n{self.papers['r3d'][:50000]}"
80
+ elif "cv" in message_lower or "background" in message_lower:
81
+ if "cv" in self.papers:
82
+ context = f"[CV]\n\n{self.papers['cv'][:20000]}"
83
+ else:
84
+ # Add some context from each paper
85
+ for key, text in self.papers.items():
86
+ context += f"\n[{key.upper()}]\n{text[:10000]}\n"
87
+
88
+ # Create prompt
89
+ prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
90
+
91
+ His JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
92
+
93
+ Context from papers:
94
+ {context}
95
+
96
+ Question: {message}
97
+
98
+ Provide a helpful, conversational response:"""
99
+
100
+ try:
101
+ response = self.model.generate_content(prompt)
102
+ return response.text
103
+ except Exception as e:
104
+ return f"Error: {str(e)}"
105
+
106
+ # Create simple interface
107
+ assistant = SimpleChatAssistant()
108
+
109
+ # Create the most basic Gradio interface possible
110
+ iface = gr.Interface(
111
+ fn=assistant.chat,
112
+ inputs=gr.Textbox(lines=2, placeholder="Ask about David's research..."),
113
+ outputs="text",
114
+ title="David Van Dijcke - Research Assistant",
115
+ description="Ask about David's job market paper (R3D) and research",
116
+ examples=[
117
+ "What is David's job market paper about?",
118
+ "What makes R3D innovative?",
119
+ "What is the use of R3D?"
120
+ ]
121
+ )
122
+
123
+ if __name__ == "__main__":
124
+ iface.launch(server_name="127.0.0.1", server_port=7860)
app_sota.py ADDED
@@ -0,0 +1,341 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - State-of-the-Art Research Assistant
4
+ Uses modern LLM capabilities: full document context, native PDF handling, and advanced prompting
5
+ """
6
+
7
+ import os
8
+ import base64
9
+ from typing import List, Dict, Optional, Tuple
10
+ import gradio as gr
11
+ from pathlib import Path
12
+ from pypdf import PdfReader
13
+ from dotenv import load_dotenv
14
+ import google.generativeai as genai
15
+
16
+ # Load environment variables
17
+ load_dotenv()
18
+
19
+ class StateOfTheArtAssistant:
20
+ """Uses Gemini's full capabilities - large context window and native understanding"""
21
+
22
+ def __init__(self):
23
+ """Initialize with modern approach"""
24
+ # Setup Gemini with best model
25
+ self.llm = self._setup_advanced_llm()
26
+
27
+ # Load all papers into memory at once
28
+ self.papers_full_text = self._load_all_papers_full()
29
+
30
+ # Create a single mega-context with all papers
31
+ self.mega_context = self._create_mega_context()
32
+
33
+ # Pre-load common contexts into Gemini's memory
34
+ self.initialized = False
35
+ self._initialize_assistant()
36
+
37
+ def _setup_advanced_llm(self):
38
+ """Setup most capable Gemini model"""
39
+ api_key = os.getenv("GOOGLE_API_KEY")
40
+
41
+ if not api_key:
42
+ raise ValueError("Google API key is required for state-of-the-art performance")
43
+
44
+ genai.configure(api_key=api_key)
45
+
46
+ # Try to use the most capable model available
47
+ try:
48
+ # Gemini 2.5 Flash Preview - Latest and most capable
49
+ model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
50
+ print("Using Gemini 2.5 Flash Preview - Latest model with enhanced capabilities")
51
+ return model
52
+ except Exception as e:
53
+ print(f"Could not load Gemini 2.5 Flash Preview: {e}")
54
+ try:
55
+ # Fallback to 1.5 Pro
56
+ model = genai.GenerativeModel('gemini-1.5-pro-002')
57
+ print("Using Gemini 1.5 Pro as fallback")
58
+ return model
59
+ except:
60
+ try:
61
+ # Second fallback
62
+ model = genai.GenerativeModel('gemini-1.5-flash-002')
63
+ print("Using Gemini 1.5 Flash as fallback")
64
+ return model
65
+ except:
66
+ # Last resort
67
+ model = genai.GenerativeModel('gemini-1.5-flash')
68
+ print("Using Gemini 1.5 Flash (base) as last resort")
69
+ return model
70
+
71
+ def _load_all_papers_full(self) -> Dict[str, str]:
72
+ """Load complete papers without chunking"""
73
+ papers = {}
74
+ pdf_dir = "documents"
75
+
76
+ # Define papers with priority order (job market paper first)
77
+ paper_files = [
78
+ ("r3d", "r3d_arxiv_4apr2025.pdf", "JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"),
79
+ ("cv", "CV_DavidVanDijcke.pdf", "CURRICULUM VITAE"),
80
+ ("fdr", "fdr.pdf", "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns"),
81
+ ("disco", "disco.pdf", "disco: Distributional Synthetic Controls"),
82
+ ("rto", "rto.pdf", "Return to Office and the Tenure Distribution"),
83
+ ("prodf", "prodf.pdf", "On the Non-Identification of Revenue Production Functions"),
84
+ ("unmasking", "unmasking_partisanship.pdf", "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk"),
85
+ ("ukraine", "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine")
86
+ ]
87
+
88
+ for key, filename, title in paper_files:
89
+ pdf_path = os.path.join(pdf_dir, filename)
90
+ if os.path.exists(pdf_path):
91
+ try:
92
+ # Read PDF completely
93
+ with open(pdf_path, 'rb') as file:
94
+ pdf_reader = PdfReader(file)
95
+
96
+ # Extract all text at once
97
+ full_text = f"\n\n{'='*80}\n{title}\n{'='*80}\n\n"
98
+
99
+ for page_num, page in enumerate(pdf_reader.pages, 1):
100
+ text = page.extract_text()
101
+ if text.strip():
102
+ full_text += f"\n[Page {page_num}]\n{text}\n"
103
+
104
+ papers[key] = full_text
105
+ print(f"Loaded {title}: {len(full_text):,} characters")
106
+
107
+ except Exception as e:
108
+ print(f"Error loading {filename}: {e}")
109
+
110
+ return papers
111
+
112
+ def _create_mega_context(self) -> str:
113
+ """Create single context with all papers for Gemini to process"""
114
+ mega_context = """COMPLETE RESEARCH PORTFOLIO OF DAVID VAN DIJCKE
115
+ Econometrician on the 2025-26 Job Market
116
+ University of Michigan
117
+
118
+ IMPORTANT: David's JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
119
+
120
+ Below are ALL of David's papers in full text:
121
+
122
+ """
123
+
124
+ # Add papers in priority order
125
+ for key, full_text in self.papers_full_text.items():
126
+ mega_context += full_text + "\n\n"
127
+
128
+ print(f"Total context size: {len(mega_context):,} characters (~{len(mega_context)//4:,} tokens)")
129
+
130
+ return mega_context
131
+
132
+ def _initialize_assistant(self):
133
+ """Pre-load context into Gemini for faster responses"""
134
+ if self.initialized or not self.llm:
135
+ return
136
+
137
+ try:
138
+ # Create a chat session with the full context pre-loaded
139
+ self.chat = self.llm.start_chat(history=[
140
+ {
141
+ "role": "user",
142
+ "parts": [f"""You are David Van Dijcke's research assistant. I'm providing you with his COMPLETE research portfolio to answer questions about.
143
+
144
+ {self.mega_context}
145
+
146
+ REMEMBER:
147
+ 1. David is on the 2025-26 economics job market
148
+ 2. His JOB MARKET PAPER is R3D
149
+ 3. He's from University of Michigan
150
+ 4. He specializes in econometric methods for functional and distributional data
151
+
152
+ Please confirm you've loaded all the papers."""]
153
+ },
154
+ {
155
+ "role": "model",
156
+ "parts": ["""I've successfully loaded David Van Dijcke's complete research portfolio. I have access to:
157
+
158
+ 1. **JOB MARKET PAPER**: R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
159
+ 2. His CV
160
+ 3. Free Discontinuity Regression (FDR)
161
+ 4. disco: Distributional Synthetic Controls
162
+ 5. Return to Office and the Tenure Distribution
163
+ 6. Revenue Production Functions paper
164
+ 7. Published work on COVID/masks and Ukraine
165
+
166
+ I'm ready to answer any questions about David's research, methods, contributions, or background with full context from all his papers."""]
167
+ }
168
+ ])
169
+
170
+ self.initialized = True
171
+ print("Assistant initialized with full paper context")
172
+
173
+ except Exception as e:
174
+ print(f"Could not pre-initialize: {e}")
175
+ self.chat = None
176
+
177
+ def answer_question(self, query: str, chat_history: List = None) -> str:
178
+ """Answer using full context already loaded in Gemini"""
179
+ if not query.strip():
180
+ return "What would you like to know about David's research?"
181
+
182
+ try:
183
+ if self.chat:
184
+ # Use existing chat with pre-loaded context
185
+ response = self.chat.send_message(f"""Based on the complete papers I have loaded, please answer this question:
186
+
187
+ {query}
188
+
189
+ Important guidelines:
190
+ - Be conversational and accessible
191
+ - For technical questions, explain both intuition AND technical details
192
+ - Always specify which paper you're referencing
193
+ - For job market paper questions, focus on R3D
194
+ - Highlight what makes David's work unique and impactful
195
+ - Use specific examples from the papers""")
196
+
197
+ return response.text
198
+
199
+ else:
200
+ # Fallback: Send everything in one request
201
+ prompt = f"""You are David Van Dijcke's research assistant. Based on his complete research portfolio below, answer the question.
202
+
203
+ {self.mega_context}
204
+
205
+ Question: {query}
206
+
207
+ Guidelines:
208
+ - Be conversational and accessible
209
+ - For technical questions, explain both intuition AND technical details
210
+ - Always specify which paper you're referencing
211
+ - For job market paper questions, focus on R3D
212
+ - Highlight what makes David's work unique and impactful
213
+
214
+ Answer:"""
215
+
216
+ response = self.llm.generate_content(prompt)
217
+ return response.text
218
+
219
+ except Exception as e:
220
+ print(f"Error: {e}")
221
+
222
+ # Try with truncated context if we hit limits
223
+ try:
224
+ # Focus on job market paper and CV
225
+ limited_context = self.papers_full_text.get("r3d", "")[:50000] + "\n\n" + self.papers_full_text.get("cv", "")[:20000]
226
+
227
+ prompt = f"""Answer based on David Van Dijcke's job market paper (R3D) and CV:
228
+
229
+ {limited_context}
230
+
231
+ Question: {query}
232
+
233
+ Answer conversationally:"""
234
+
235
+ response = self.llm.generate_content(prompt)
236
+ return response.text
237
+
238
+ except:
239
+ return "I'm having trouble processing that request. Please try rephrasing or asking about a specific paper."
240
+
241
+ # Create modern interface
242
+ def create_interface():
243
+ """Create state-of-the-art interface"""
244
+
245
+ # Initialize assistant
246
+ try:
247
+ assistant = StateOfTheArtAssistant()
248
+ status_message = "✅ Assistant loaded with full paper context"
249
+ except Exception as e:
250
+ print(f"Initialization error: {e}")
251
+ assistant = None
252
+ status_message = "❌ Error: Please check your Google API key"
253
+
254
+ def chat(message, history):
255
+ if not assistant:
256
+ return "", history + [[message, "Please set up your Google API key to use the assistant."]]
257
+
258
+ if history is None:
259
+ history = []
260
+
261
+ response = assistant.answer_question(message, history)
262
+ history.append([message, response])
263
+ return "", history
264
+
265
+ # Custom CSS for modern look
266
+ custom_css = """
267
+ .gradio-container {
268
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
269
+ }
270
+ .user-message, .bot-message {
271
+ padding: 12px 16px !important;
272
+ border-radius: 8px !important;
273
+ }
274
+ """
275
+
276
+ with gr.Blocks(title="David Van Dijcke - Research Assistant", css=custom_css) as demo:
277
+
278
+ gr.Markdown(f"""
279
+ # David Van Dijcke - AI Research Assistant
280
+
281
+ **Econometrician** | **2025-26 Job Market** | **University of Michigan**
282
+
283
+ {status_message}
284
+ """)
285
+
286
+ chatbot = gr.Chatbot(
287
+ height=500,
288
+ show_label=False,
289
+ elem_classes=["chatbot"]
290
+ )
291
+
292
+ with gr.Row():
293
+ msg = gr.Textbox(
294
+ label="Ask anything about David's research",
295
+ placeholder="What makes R3D innovative? How does David use optimal transport? What are his main contributions?",
296
+ lines=2,
297
+ scale=4
298
+ )
299
+ submit = gr.Button("Send", variant="primary", scale=1)
300
+
301
+ clear = gr.Button("Clear Conversation")
302
+
303
+ # Example queries organized by category
304
+ with gr.Accordion("Example Questions", open=True):
305
+ gr.Examples(
306
+ examples=[
307
+ "What is David's job market paper about and why is it important?",
308
+ "Explain R3D's methodology - both the intuition and technical details",
309
+ "How does David's work on optimal transport connect across his papers?",
310
+ "What real-world policy questions can R3D help answer?",
311
+ "Compare David's approach in R3D versus traditional RDD",
312
+ "What makes David uniquely qualified for an econometrics position?",
313
+ "How does the FDR paper relate to the job market paper?",
314
+ "What are the key identification strategies across David's papers?",
315
+ "Explain the practical applications of distributional synthetic controls",
316
+ "What broader research agenda do David's papers represent?"
317
+ ],
318
+ inputs=msg,
319
+ label="Click any example to try it"
320
+ )
321
+
322
+ # Event handlers
323
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
324
+ submit.click(chat, [msg, chatbot], [msg, chatbot])
325
+ clear.click(lambda: [], None, chatbot)
326
+
327
+ gr.Markdown("""
328
+ ---
329
+ 💡 **Tip**: This assistant has David's complete papers loaded. Ask technical questions, request comparisons across papers, or explore specific methodological details.
330
+ """)
331
+
332
+ return demo
333
+
334
+ if __name__ == "__main__":
335
+ interface = create_interface()
336
+ interface.launch(
337
+ server_name="0.0.0.0",
338
+ server_port=7860,
339
+ share=False,
340
+ quiet=True
341
+ )
app_stable.py ADDED
@@ -0,0 +1,355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Stable Research Assistant
4
+ A simplified, stable version that avoids dependency conflicts
5
+ """
6
+
7
+ import os
8
+ from typing import List, Dict, Optional
9
+ import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from langchain_community.embeddings import HuggingFaceEmbeddings
12
+ from langchain_community.vectorstores import FAISS
13
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
14
+ from dotenv import load_dotenv
15
+ import google.generativeai as genai
16
+
17
+ # Load environment variables
18
+ load_dotenv()
19
+
20
+ class StableResearchAssistant:
21
+ """Stable assistant with minimal dependencies"""
22
+
23
+ def __init__(self):
24
+ """Initialize with stable configuration"""
25
+ self.embeddings = HuggingFaceEmbeddings(
26
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
27
+ )
28
+
29
+ # Load all papers into memory
30
+ self.papers = self._load_papers()
31
+
32
+ # Create simple vector store
33
+ self.vector_store = self._create_vector_store()
34
+
35
+ # Setup LLM
36
+ self.llm = self._setup_llm()
37
+
38
+ def _load_papers(self) -> Dict[str, Dict]:
39
+ """Load all papers into memory"""
40
+ papers = {}
41
+ pdf_dir = "documents"
42
+
43
+ paper_metadata = {
44
+ "r3d": {
45
+ "file": "r3d_arxiv_4apr2025.pdf",
46
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
47
+ "type": "JOB MARKET PAPER",
48
+ "description": "Extends RDD to analyze entire outcome distributions using optimal transport"
49
+ },
50
+ "fdr": {
51
+ "file": "fdr.pdf",
52
+ "title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
53
+ "type": "Working Paper",
54
+ "description": "Estimates regression functions with unknown discontinuity locations"
55
+ },
56
+ "disco": {
57
+ "file": "disco.pdf",
58
+ "title": "disco: Distributional Synthetic Controls",
59
+ "type": "Working Paper",
60
+ "description": "Stata package for distributional synthetic control methods"
61
+ },
62
+ "rto": {
63
+ "file": "rto.pdf",
64
+ "title": "Return to Office and the Tenure Distribution",
65
+ "type": "Working Paper",
66
+ "description": "Analyzes impact of return-to-office mandates on employee tenure"
67
+ },
68
+ "prodf": {
69
+ "file": "prodf.pdf",
70
+ "title": "On the Non-Identification of Revenue Production Functions",
71
+ "type": "Working Paper",
72
+ "description": "Shows non-identification of production functions with revenue data"
73
+ },
74
+ "cv": {
75
+ "file": "CV_DavidVanDijcke.pdf",
76
+ "title": "Curriculum Vitae",
77
+ "type": "CV",
78
+ "description": "David Van Dijcke's academic CV"
79
+ }
80
+ }
81
+
82
+ for key, metadata in paper_metadata.items():
83
+ pdf_path = os.path.join(pdf_dir, metadata["file"])
84
+ if os.path.exists(pdf_path):
85
+ try:
86
+ loader = PyPDFLoader(pdf_path)
87
+ pages = loader.load()
88
+
89
+ # Store full text with metadata
90
+ full_text = "\n\n".join([p.page_content for p in pages])
91
+ papers[key] = {
92
+ "text": full_text,
93
+ "pages": len(pages),
94
+ "filename": metadata["file"],
95
+ "title": metadata["title"],
96
+ "type": metadata["type"],
97
+ "description": metadata["description"]
98
+ }
99
+ print(f"Loaded {metadata['title']}: {len(pages)} pages")
100
+
101
+ except Exception as e:
102
+ print(f"Error loading {metadata['file']}: {e}")
103
+
104
+ return papers
105
+
106
+ def _create_vector_store(self) -> Optional[FAISS]:
107
+ """Create vector store from papers"""
108
+ try:
109
+ # Create documents with larger chunks
110
+ documents = []
111
+ text_splitter = RecursiveCharacterTextSplitter(
112
+ chunk_size=1500,
113
+ chunk_overlap=150
114
+ )
115
+
116
+ for key, paper in self.papers.items():
117
+ # Split text
118
+ chunks = text_splitter.split_text(paper["text"])
119
+
120
+ # Create documents
121
+ for i, chunk in enumerate(chunks):
122
+ from langchain.schema import Document
123
+ doc = Document(
124
+ page_content=chunk,
125
+ metadata={"source": key, "chunk": i}
126
+ )
127
+ documents.append(doc)
128
+
129
+ # Create vector store
130
+ if documents:
131
+ return FAISS.from_documents(documents, self.embeddings)
132
+
133
+ except Exception as e:
134
+ print(f"Error creating vector store: {e}")
135
+
136
+ return None
137
+
138
+ def _setup_llm(self):
139
+ """Setup Gemini LLM"""
140
+ api_key = os.getenv("GOOGLE_API_KEY")
141
+
142
+ if api_key:
143
+ try:
144
+ genai.configure(api_key=api_key)
145
+ return genai.GenerativeModel('gemini-1.5-flash')
146
+ except Exception as e:
147
+ print(f"Error setting up Gemini: {e}")
148
+
149
+ return None
150
+
151
+ def answer_question(self, query: str, chat_history: List = None) -> str:
152
+ """Answer questions about David's research"""
153
+ if not query.strip():
154
+ return "Please ask a question about David Van Dijcke's research."
155
+
156
+ # Get relevant context
157
+ context = self._get_context(query)
158
+
159
+ # Generate response
160
+ if self.llm:
161
+ prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market.
162
+
163
+ IMPORTANT: The context below contains labeled sections from David's actual papers. Pay attention to the labels like [JOB MARKET PAPER], [CURRICULUM VITAE], etc.
164
+
165
+ Context:
166
+ {context}
167
+
168
+ Question: {query}
169
+
170
+ Instructions:
171
+ - Answer based ONLY on the provided context
172
+ - If the context mentions "JOB MARKET PAPER", that refers to R3D
173
+ - Be specific and cite the paper titles when relevant
174
+ - For job market paper questions, focus on the R3D paper
175
+
176
+ Answer:"""
177
+
178
+ try:
179
+ response = self.llm.generate_content(prompt)
180
+ return response.text
181
+ except Exception as e:
182
+ print(f"Error generating response: {e}")
183
+ return self._fallback_response(query)
184
+ else:
185
+ return self._fallback_response(query)
186
+
187
+ def _get_context(self, query: str) -> str:
188
+ """Get relevant context for query"""
189
+ query_lower = query.lower()
190
+ contexts = []
191
+
192
+ # CRITICAL: Check for job market paper queries FIRST
193
+ if any(phrase in query_lower for phrase in ["job market", "jmp", "job market paper", "what is his job market"]):
194
+ # Add R3D paper info with clear labeling
195
+ if "r3d" in self.papers:
196
+ paper = self.papers["r3d"]
197
+ context = f"[JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes]\n\n"
198
+ context += f"This is David Van Dijcke's JOB MARKET PAPER.\n\n"
199
+ context += paper["text"][:20000] # Get first ~20k chars
200
+ contexts.append(context)
201
+ # Return immediately for job market paper queries
202
+ return context
203
+
204
+ # Check for general "what's up" or greeting
205
+ if any(phrase in query_lower for phrase in ["what's up", "whats up", "hello", "hi"]):
206
+ intro = """David Van Dijcke is an econometrician on the 2025-26 academic job market from the University of Michigan.
207
+
208
+ His JOB MARKET PAPER is "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes" which extends regression discontinuity design to analyze entire outcome distributions using optimal transport theory.
209
+
210
+ He has developed several innovative econometric methods including:
211
+ - R3D (Job Market Paper): Distribution-valued RDD
212
+ - Free Discontinuity Regression (FDR): Estimating regressions with unknown discontinuities
213
+ - Distributional Synthetic Controls (disco): A Stata package
214
+ - Work on non-identification of revenue production functions
215
+
216
+ Ask me about any of his papers or methods!"""
217
+ contexts.append(intro)
218
+
219
+ # Check for David/CV queries
220
+ if any(word in query_lower for word in ["david", "who", "background", "cv", "about"]):
221
+ if "cv" in self.papers:
222
+ cv_context = f"[CURRICULUM VITAE]\n\n{self.papers['cv']['text'][:8000]}"
223
+ contexts.append(cv_context)
224
+
225
+ # Check for specific paper mentions
226
+ paper_keywords = {
227
+ "r3d": ["r3d", "regression discontinuity", "distribution", "optimal transport", "wasserstein"],
228
+ "fdr": ["fdr", "free discontinuity", "internet shutdown"],
229
+ "rto": ["return to office", "tenure", "rto"],
230
+ "disco": ["disco", "synthetic control", "distributional"],
231
+ "prodf": ["production function", "revenue", "identification"]
232
+ }
233
+
234
+ for key, keywords in paper_keywords.items():
235
+ if any(kw in query_lower for kw in keywords):
236
+ if key in self.papers:
237
+ paper = self.papers[key]
238
+ paper_context = f"[{paper['type']}: {paper['title']}]\n\n"
239
+ paper_context += paper["text"][:15000]
240
+ contexts.append(paper_context)
241
+
242
+ # If no specific match, try vector search
243
+ if not contexts and self.vector_store:
244
+ try:
245
+ docs = self.vector_store.similarity_search(query, k=4)
246
+ for doc in docs:
247
+ source = doc.metadata.get("source", "")
248
+ if source in self.papers:
249
+ paper = self.papers[source]
250
+ chunk_context = f"[From {paper['title']}]\n{doc.page_content}"
251
+ contexts.append(chunk_context)
252
+ except:
253
+ pass
254
+
255
+ # Always include paper list if no context found
256
+ if not contexts:
257
+ paper_list = "David Van Dijcke's papers:\n"
258
+ for key, paper in self.papers.items():
259
+ if key != "cv":
260
+ paper_list += f"- {paper['type']}: {paper['title']}\n"
261
+ contexts.append(paper_list)
262
+
263
+ return "\n\n---\n\n".join(contexts[:3])
264
+
265
+ def _fallback_response(self, query: str) -> str:
266
+ """Fallback response without LLM"""
267
+ query_lower = query.lower()
268
+
269
+ # Job market paper query
270
+ if any(phrase in query_lower for phrase in ["job market", "jmp"]):
271
+ return """David Van Dijcke's JOB MARKET PAPER is:
272
+
273
+ "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"
274
+
275
+ This paper extends regression discontinuity design (RDD) to analyze entire outcome distributions rather than just means. Key innovations:
276
+ - Uses optimal transport theory and Wasserstein distances
277
+ - Allows testing of distributional effects of policies
278
+ - Applications to income distributions, test score distributions
279
+ - Provides new identification and estimation procedures
280
+
281
+ This addresses a fundamental limitation of traditional RDD that only examines average treatment effects."""
282
+
283
+ # General greeting
284
+ if any(phrase in query_lower for phrase in ["what's up", "hello", "hi"]):
285
+ return """Hello! I'm David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
286
+
287
+ His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
288
+
289
+ I can tell you about:
290
+ - His job market paper (R3D)
291
+ - His other papers (FDR, disco, RTO, etc.)
292
+ - His econometric methods
293
+ - His background and CV
294
+
295
+ What would you like to know?"""
296
+
297
+ # Specific paper queries
298
+ if "r3d" in query_lower:
299
+ return "R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's JOB MARKET PAPER. It extends RDD to analyze entire outcome distributions using optimal transport theory and Wasserstein distances."
300
+
301
+ if "fdr" in query_lower:
302
+ return "Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuity locations. It uses geometric measure theory with applications to measuring economic impacts of internet shutdowns."
303
+
304
+ if "david" in query_lower or "who" in query_lower:
305
+ return "David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He specializes in functional data analysis, optimal transport methods, and develops novel econometric techniques for modern data challenges."
306
+
307
+ return "I can help with questions about David Van Dijcke's research. Try asking about his job market paper (R3D), his methods, or his background. For best results, please add a Google API key."
308
+
309
+ # Create Gradio interface
310
+ def create_interface():
311
+ """Create simple Gradio interface"""
312
+ assistant = StableResearchAssistant()
313
+
314
+ def chat(message, history):
315
+ response = assistant.answer_question(message, history)
316
+ history.append([message, response])
317
+ return "", history
318
+
319
+ with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
320
+ gr.Markdown("""
321
+ # David Van Dijcke - Research Assistant (Stable Version)
322
+
323
+ Ask questions about David's econometric research and papers.
324
+ """)
325
+
326
+ chatbot = gr.Chatbot(height=400)
327
+ msg = gr.Textbox(label="Your question", placeholder="What is David's job market paper about?")
328
+ clear = gr.Button("Clear")
329
+
330
+ # Examples
331
+ gr.Examples(
332
+ examples=[
333
+ "What is David's job market paper R3D about?",
334
+ "What econometric methods has David developed?",
335
+ "Tell me about David's background",
336
+ "How does David use optimal transport in his research?",
337
+ "What is the FDR paper about?"
338
+ ],
339
+ inputs=msg
340
+ )
341
+
342
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
343
+ clear.click(lambda: None, None, chatbot, queue=False)
344
+
345
+ return demo
346
+
347
+ if __name__ == "__main__":
348
+ # Simple launch without API endpoint issues
349
+ interface = create_interface()
350
+ interface.launch(
351
+ server_name="127.0.0.1",
352
+ server_port=7860,
353
+ share=False,
354
+ quiet=True
355
+ )
app_working.py ADDED
@@ -0,0 +1,368 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ David Van Dijcke - Stable Research Assistant
4
+ A simplified, stable version that avoids dependency conflicts
5
+ """
6
+
7
+ import os
8
+ from typing import List, Dict, Optional
9
+ import gradio as gr
10
+ from langchain_community.document_loaders import PyPDFLoader
11
+ from langchain_community.embeddings import HuggingFaceEmbeddings
12
+ from langchain_community.vectorstores import FAISS
13
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
14
+ from dotenv import load_dotenv
15
+ import google.generativeai as genai
16
+
17
+ # Load environment variables
18
+ load_dotenv()
19
+
20
+ class StableResearchAssistant:
21
+ """Stable assistant with minimal dependencies"""
22
+
23
+ def __init__(self):
24
+ """Initialize with stable configuration"""
25
+ self.embeddings = HuggingFaceEmbeddings(
26
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
27
+ )
28
+
29
+ # Load all papers into memory
30
+ self.papers = self._load_papers()
31
+
32
+ # Create simple vector store
33
+ self.vector_store = self._create_vector_store()
34
+
35
+ # Setup LLM
36
+ self.llm = self._setup_llm()
37
+
38
+ def _load_papers(self) -> Dict[str, Dict]:
39
+ """Load all papers into memory"""
40
+ papers = {}
41
+ pdf_dir = "documents"
42
+
43
+ paper_metadata = {
44
+ "r3d": {
45
+ "file": "r3d_arxiv_4apr2025.pdf",
46
+ "title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
47
+ "type": "JOB MARKET PAPER",
48
+ "description": "Extends RDD to analyze entire outcome distributions using optimal transport"
49
+ },
50
+ "fdr": {
51
+ "file": "fdr.pdf",
52
+ "title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
53
+ "type": "Working Paper",
54
+ "description": "Estimates regression functions with unknown discontinuity locations"
55
+ },
56
+ "disco": {
57
+ "file": "disco.pdf",
58
+ "title": "disco: Distributional Synthetic Controls",
59
+ "type": "Working Paper",
60
+ "description": "Stata package for distributional synthetic control methods"
61
+ },
62
+ "rto": {
63
+ "file": "rto.pdf",
64
+ "title": "Return to Office and the Tenure Distribution",
65
+ "type": "Working Paper",
66
+ "description": "Analyzes impact of return-to-office mandates on employee tenure"
67
+ },
68
+ "prodf": {
69
+ "file": "prodf.pdf",
70
+ "title": "On the Non-Identification of Revenue Production Functions",
71
+ "type": "Working Paper",
72
+ "description": "Shows non-identification of production functions with revenue data"
73
+ },
74
+ "cv": {
75
+ "file": "CV_DavidVanDijcke.pdf",
76
+ "title": "Curriculum Vitae",
77
+ "type": "CV",
78
+ "description": "David Van Dijcke's academic CV"
79
+ }
80
+ }
81
+
82
+ for key, metadata in paper_metadata.items():
83
+ pdf_path = os.path.join(pdf_dir, metadata["file"])
84
+ if os.path.exists(pdf_path):
85
+ try:
86
+ loader = PyPDFLoader(pdf_path)
87
+ pages = loader.load() # Load ALL pages
88
+
89
+ # Store full text with metadata
90
+ full_text = "\n\n".join([p.page_content for p in pages])
91
+ papers[key] = {
92
+ "text": full_text,
93
+ "pages": len(pages),
94
+ "filename": metadata["file"],
95
+ "title": metadata["title"],
96
+ "type": metadata["type"],
97
+ "description": metadata["description"]
98
+ }
99
+ print(f"Loaded {metadata['title']}: {len(pages)} pages, {len(full_text):,} characters")
100
+
101
+ except Exception as e:
102
+ print(f"Error loading {metadata['file']}: {e}")
103
+
104
+ return papers
105
+
106
+ def _create_vector_store(self) -> Optional[FAISS]:
107
+ """Create vector store from papers"""
108
+ try:
109
+ # Create documents with larger chunks
110
+ documents = []
111
+ text_splitter = RecursiveCharacterTextSplitter(
112
+ chunk_size=1500,
113
+ chunk_overlap=150
114
+ )
115
+
116
+ for key, paper in self.papers.items():
117
+ # Split text
118
+ chunks = text_splitter.split_text(paper["text"])
119
+
120
+ # Create documents
121
+ for i, chunk in enumerate(chunks):
122
+ from langchain.schema import Document
123
+ doc = Document(
124
+ page_content=chunk,
125
+ metadata={"source": key, "chunk": i}
126
+ )
127
+ documents.append(doc)
128
+
129
+ # Create vector store
130
+ if documents:
131
+ return FAISS.from_documents(documents, self.embeddings)
132
+
133
+ except Exception as e:
134
+ print(f"Error creating vector store: {e}")
135
+
136
+ return None
137
+
138
+ def _setup_llm(self):
139
+ """Setup Gemini LLM"""
140
+ api_key = os.getenv("GOOGLE_API_KEY")
141
+
142
+ if api_key:
143
+ try:
144
+ genai.configure(api_key=api_key)
145
+ # Use Gemini 2.5 Flash Preview
146
+ try:
147
+ model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
148
+ print("Using Gemini 2.5 Flash Preview")
149
+ return model
150
+ except:
151
+ model = genai.GenerativeModel('gemini-1.5-flash')
152
+ print("Using Gemini 1.5 Flash")
153
+ return model
154
+ except Exception as e:
155
+ print(f"Error setting up Gemini: {e}")
156
+
157
+ return None
158
+
159
+ def answer_question(self, query: str, chat_history: List = None) -> str:
160
+ """Answer questions about David's research"""
161
+ if not query.strip():
162
+ return "Please ask a question about David Van Dijcke's research."
163
+
164
+ # Get relevant context
165
+ context = self._get_context(query)
166
+
167
+ # Generate response
168
+ if self.llm:
169
+ prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market from the University of Michigan.
170
+
171
+ Key facts about David:
172
+ - His JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
173
+ - He specializes in functional data analysis, optimal transport, and econometric theory
174
+ - He develops methods for analyzing distributional effects, not just averages
175
+
176
+ Context from his papers:
177
+ {context}
178
+
179
+ Question: {query}
180
+
181
+ Instructions:
182
+ - Provide a conversational yet informative response
183
+ - Be specific and accurate based on the papers
184
+ - For technical questions, explain both the intuition AND the technical details
185
+ - Highlight what makes David's work unique and important
186
+ - For "what is the use" questions, focus on real-world applications and policy relevance
187
+
188
+ Answer:"""
189
+
190
+ try:
191
+ response = self.llm.generate_content(prompt)
192
+ return response.text
193
+ except Exception as e:
194
+ print(f"Error generating response: {e}")
195
+ return self._fallback_response(query)
196
+ else:
197
+ return self._fallback_response(query)
198
+
199
+ def _get_context(self, query: str) -> str:
200
+ """Get relevant context for query"""
201
+ query_lower = query.lower()
202
+ contexts = []
203
+
204
+ # CRITICAL: Check for job market paper queries FIRST
205
+ if any(phrase in query_lower for phrase in ["job market", "jmp", "job market paper", "what is his job market"]):
206
+ # Add R3D paper info with clear labeling
207
+ if "r3d" in self.papers:
208
+ paper = self.papers["r3d"]
209
+ context = f"[JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes]\n\n"
210
+ context += f"This is David Van Dijcke's JOB MARKET PAPER.\n\n"
211
+ # Provide more context for Gemini 2.5's larger window
212
+ context += paper["text"][:100000] # Get first ~100k chars (about 25k tokens)
213
+ contexts.append(context)
214
+ # Return immediately for job market paper queries
215
+ return context
216
+
217
+ # Check for general "what's up" or greeting
218
+ if any(phrase in query_lower for phrase in ["what's up", "whats up", "hello", "hi"]):
219
+ intro = """David Van Dijcke is an econometrician on the 2025-26 academic job market from the University of Michigan.
220
+
221
+ His JOB MARKET PAPER is "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes" which extends regression discontinuity design to analyze entire outcome distributions using optimal transport theory.
222
+
223
+ He has developed several innovative econometric methods including:
224
+ - R3D (Job Market Paper): Distribution-valued RDD
225
+ - Free Discontinuity Regression (FDR): Estimating regressions with unknown discontinuities
226
+ - Distributional Synthetic Controls (disco): A Stata package
227
+ - Work on non-identification of revenue production functions
228
+
229
+ Ask me about any of his papers or methods!"""
230
+ contexts.append(intro)
231
+
232
+ # Check for David/CV queries
233
+ if any(word in query_lower for word in ["david", "who", "background", "cv", "about"]):
234
+ if "cv" in self.papers:
235
+ cv_context = f"[CURRICULUM VITAE]\n\n{self.papers['cv']['text'][:8000]}"
236
+ contexts.append(cv_context)
237
+
238
+ # Check for specific paper mentions
239
+ paper_keywords = {
240
+ "r3d": ["r3d", "regression discontinuity", "distribution", "optimal transport", "wasserstein"],
241
+ "fdr": ["fdr", "free discontinuity", "internet shutdown"],
242
+ "rto": ["return to office", "tenure", "rto"],
243
+ "disco": ["disco", "synthetic control", "distributional"],
244
+ "prodf": ["production function", "revenue", "identification"]
245
+ }
246
+
247
+ for key, keywords in paper_keywords.items():
248
+ if any(kw in query_lower for kw in keywords):
249
+ if key in self.papers:
250
+ paper = self.papers[key]
251
+ paper_context = f"[{paper['type']}: {paper['title']}]\n\n"
252
+ paper_context += paper["text"][:15000]
253
+ contexts.append(paper_context)
254
+
255
+ # If no specific match, try vector search
256
+ if not contexts and self.vector_store:
257
+ try:
258
+ docs = self.vector_store.similarity_search(query, k=4)
259
+ for doc in docs:
260
+ source = doc.metadata.get("source", "")
261
+ if source in self.papers:
262
+ paper = self.papers[source]
263
+ chunk_context = f"[From {paper['title']}]\n{doc.page_content}"
264
+ contexts.append(chunk_context)
265
+ except:
266
+ pass
267
+
268
+ # Always include paper list if no context found
269
+ if not contexts:
270
+ paper_list = "David Van Dijcke's papers:\n"
271
+ for key, paper in self.papers.items():
272
+ if key != "cv":
273
+ paper_list += f"- {paper['type']}: {paper['title']}\n"
274
+ contexts.append(paper_list)
275
+
276
+ return "\n\n---\n\n".join(contexts[:3])
277
+
278
+ def _fallback_response(self, query: str) -> str:
279
+ """Fallback response without LLM"""
280
+ query_lower = query.lower()
281
+
282
+ # Job market paper query
283
+ if any(phrase in query_lower for phrase in ["job market", "jmp"]):
284
+ return """David Van Dijcke's JOB MARKET PAPER is:
285
+
286
+ "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"
287
+
288
+ This paper extends regression discontinuity design (RDD) to analyze entire outcome distributions rather than just means. Key innovations:
289
+ - Uses optimal transport theory and Wasserstein distances
290
+ - Allows testing of distributional effects of policies
291
+ - Applications to income distributions, test score distributions
292
+ - Provides new identification and estimation procedures
293
+
294
+ This addresses a fundamental limitation of traditional RDD that only examines average treatment effects."""
295
+
296
+ # General greeting
297
+ if any(phrase in query_lower for phrase in ["what's up", "hello", "hi"]):
298
+ return """Hello! I'm David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
299
+
300
+ His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
301
+
302
+ I can tell you about:
303
+ - His job market paper (R3D)
304
+ - His other papers (FDR, disco, RTO, etc.)
305
+ - His econometric methods
306
+ - His background and CV
307
+
308
+ What would you like to know?"""
309
+
310
+ # Specific paper queries
311
+ if "r3d" in query_lower:
312
+ return "R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's JOB MARKET PAPER. It extends RDD to analyze entire outcome distributions using optimal transport theory and Wasserstein distances."
313
+
314
+ if "fdr" in query_lower:
315
+ return "Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuity locations. It uses geometric measure theory with applications to measuring economic impacts of internet shutdowns."
316
+
317
+ if "david" in query_lower or "who" in query_lower:
318
+ return "David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He specializes in functional data analysis, optimal transport methods, and develops novel econometric techniques for modern data challenges."
319
+
320
+ return "I can help with questions about David Van Dijcke's research. Try asking about his job market paper (R3D), his methods, or his background. For best results, please add a Google API key."
321
+
322
+ # Create Gradio interface
323
+ def create_interface():
324
+ """Create simple Gradio interface"""
325
+ assistant = StableResearchAssistant()
326
+
327
+ def chat(message, history):
328
+ response = assistant.answer_question(message, history)
329
+ history.append([message, response])
330
+ return "", history
331
+
332
+ with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
333
+ gr.Markdown("""
334
+ # David Van Dijcke - Research Assistant (Stable Version)
335
+
336
+ Ask questions about David's econometric research and papers.
337
+ """)
338
+
339
+ chatbot = gr.Chatbot(height=400)
340
+ msg = gr.Textbox(label="Your question", placeholder="What is David's job market paper about?")
341
+ clear = gr.Button("Clear")
342
+
343
+ # Examples
344
+ gr.Examples(
345
+ examples=[
346
+ "What is David's job market paper R3D about?",
347
+ "What econometric methods has David developed?",
348
+ "Tell me about David's background",
349
+ "How does David use optimal transport in his research?",
350
+ "What is the FDR paper about?"
351
+ ],
352
+ inputs=msg
353
+ )
354
+
355
+ msg.submit(chat, [msg, chatbot], [msg, chatbot])
356
+ clear.click(lambda: None, None, chatbot, queue=False)
357
+
358
+ return demo
359
+
360
+ if __name__ == "__main__":
361
+ # Simple launch without API endpoint issues
362
+ interface = create_interface()
363
+ interface.launch(
364
+ server_name="127.0.0.1",
365
+ server_port=7860,
366
+ share=False,
367
+ quiet=True
368
+ )
pyproject.toml ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [project]
2
+ name = "david-research-assistant"
3
+ version = "0.1.0"
4
+ description = "AI Research Assistant for David Van Dijcke's academic website"
5
+ requires-python = ">=3.9"
6
+ dependencies = [
7
+ "gradio>=4.44.0",
8
+ "langchain>=0.1.9",
9
+ "langchain-community>=0.0.24",
10
+ "sentence-transformers==2.5.1",
11
+ "faiss-cpu==1.7.4",
12
+ "pypdf==4.0.2",
13
+ "google-generativeai>=0.8.3",
14
+ "python-dotenv==1.0.1",
15
+ "pyyaml==6.0.1",
16
+ "pydantic>=2.0,<3.0",
17
+ "fastapi>=0.100.0",
18
+ ]
19
+
20
+ [project.optional-dependencies]
21
+ improved = [
22
+ "gradio>=4.44.0",
23
+ "langchain==0.1.9",
24
+ "langchain-community==0.0.24",
25
+ "sentence-transformers==2.5.1",
26
+ "faiss-cpu==1.7.4",
27
+ "pypdf==4.0.2",
28
+ "huggingface-hub==0.20.3",
29
+ "python-dotenv==1.0.1",
30
+ "pydantic>=2.0,<3.0",
31
+ "fastapi>=0.100.0",
32
+ ]
33
+ full-context = [
34
+ "gradio>=4.44.0",
35
+ "langchain==0.1.9",
36
+ "langchain-community==0.0.24",
37
+ "sentence-transformers==2.5.1",
38
+ "faiss-cpu==1.7.4",
39
+ "pypdf==4.0.2",
40
+ "google-generativeai>=0.8.3",
41
+ "python-dotenv==1.0.1",
42
+ "pyyaml==6.0.1",
43
+ "pydantic>=2.0,<3.0",
44
+ "fastapi>=0.100.0",
45
+ ]
46
+ test = [
47
+ "pytest>=7.0",
48
+ "pytest-asyncio",
49
+ ]
50
+
51
+ [build-system]
52
+ requires = ["hatchling"]
53
+ build-backend = "hatchling.build"
54
+
55
+ [tool.hatch.build.targets.wheel]
56
+ packages = ["."]
57
+ include = ["*.py", "documents/", "requirements*.txt", "*.md"]
58
+
59
+ [tool.uv]
60
+ dev-dependencies = [
61
+ "ipython>=8.0",
62
+ "black>=23.0",
63
+ "ruff>=0.1.0",
64
+ ]
pyproject_stable.toml ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [project]
2
+ name = "david-research-assistant"
3
+ version = "0.1.0"
4
+ description = "AI Research Assistant for David Van Dijcke's academic website"
5
+ requires-python = ">=3.9,<3.11"
6
+ dependencies = [
7
+ "gradio==4.19.2",
8
+ "langchain==0.1.9",
9
+ "langchain-community==0.0.24",
10
+ "sentence-transformers==2.5.1",
11
+ "faiss-cpu==1.7.4",
12
+ "pypdf==4.0.2",
13
+ "google-generativeai==0.3.2",
14
+ "python-dotenv==1.0.1",
15
+ "pydantic==2.5.3",
16
+ "pydantic-core==2.14.6",
17
+ "fastapi==0.109.0",
18
+ "httpx==0.26.0",
19
+ "typing-extensions==4.9.0",
20
+ ]
21
+
22
+ [build-system]
23
+ requires = ["setuptools>=61.0"]
24
+ build-backend = "setuptools.build_meta"
setup_stable.sh ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+
3
+ echo "Setting up stable David Research Assistant environment..."
4
+
5
+ # Clean up existing environment
6
+ echo "Cleaning up..."
7
+ rm -rf .venv uv.lock __pycache__ *.pyc
8
+
9
+ # Create fresh virtual environment
10
+ echo "Creating virtual environment..."
11
+ uv venv
12
+
13
+ # Activate it
14
+ source .venv/bin/activate
15
+
16
+ # Install specific versions that work together
17
+ echo "Installing dependencies..."
18
+ uv pip install \
19
+ gradio==4.19.2 \
20
+ langchain==0.1.9 \
21
+ langchain-community==0.0.24 \
22
+ sentence-transformers==2.5.1 \
23
+ faiss-cpu==1.7.4 \
24
+ pypdf==4.0.2 \
25
+ google-generativeai==0.3.2 \
26
+ python-dotenv==1.0.1 \
27
+ pydantic==2.5.3 \
28
+ pydantic-core==2.14.6 \
29
+ fastapi==0.109.0 \
30
+ httpx==0.26.0 \
31
+ typing-extensions==4.9.0
32
+
33
+ echo "Setup complete! Run with:"
34
+ echo "source .venv/bin/activate"
35
+ echo "python app_stable.py"
test_full_context.py ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script comparing original and full context versions
4
+ """
5
+
6
+ import os
7
+ import time
8
+ from app import ImprovedResearchAssistant
9
+ from app_full_context import FullContextResearchAssistant
10
+
11
+ def test_both_versions():
12
+ """Compare responses from both versions"""
13
+ print("Comparing Research Assistant Versions\n")
14
+ print("="*80)
15
+
16
+ # Initialize both assistants
17
+ print("Loading original assistant...")
18
+ original = ImprovedResearchAssistant()
19
+
20
+ print("Loading full context assistant...")
21
+ full_context = FullContextResearchAssistant()
22
+
23
+ # Test queries that benefit from full context
24
+ test_queries = [
25
+ "What specific econometric methods does David develop in R3D? Give technical details.",
26
+ "Explain the theoretical framework of optimal transport in David's research.",
27
+ "What are the main results and contributions across all of David's papers?",
28
+ "How does the FDR paper relate to David's other work on discontinuities?",
29
+ "What makes David uniquely qualified for an econometrics position? Use specific examples from his papers.",
30
+ "Describe the empirical applications in David's job market paper with specific details.",
31
+ "What are the identification strategies used across David's different papers?",
32
+ "How does David's work on productivity relate to distributional outcomes?"
33
+ ]
34
+
35
+ for i, query in enumerate(test_queries, 1):
36
+ print(f"\n{'='*80}")
37
+ print(f"Test {i}: {query}")
38
+ print('='*80)
39
+
40
+ # Original version
41
+ print("\n--- ORIGINAL VERSION (Chunked) ---")
42
+ start_time = time.time()
43
+ try:
44
+ original_response = original.answer_question(query)
45
+ original_time = time.time() - start_time
46
+ print(f"Response ({original_time:.2f}s):")
47
+ print(original_response[:500] + "..." if len(original_response) > 500 else original_response)
48
+ except Exception as e:
49
+ print(f"Error: {e}")
50
+ original_response = "Error"
51
+ original_time = 0
52
+
53
+ # Full context version
54
+ print("\n--- FULL CONTEXT VERSION ---")
55
+ start_time = time.time()
56
+ try:
57
+ full_response = full_context.answer_question(query)
58
+ full_time = time.time() - start_time
59
+ print(f"Response ({full_time:.2f}s):")
60
+ print(full_response[:500] + "..." if len(full_response) > 500 else full_response)
61
+ except Exception as e:
62
+ print(f"Error: {e}")
63
+ full_response = "Error"
64
+ full_time = 0
65
+
66
+ # Compare
67
+ print("\n--- COMPARISON ---")
68
+ print(f"Original length: {len(original_response)} chars")
69
+ print(f"Full context length: {len(full_response)} chars")
70
+ print(f"Length improvement: {len(full_response) / max(len(original_response), 1):.1f}x")
71
+
72
+ # Check for specific technical terms
73
+ technical_terms = ["optimal transport", "Wasserstein", "distribution", "discontinuity",
74
+ "identification", "econometric", "functional data", "geometric measure"]
75
+
76
+ original_terms = sum(1 for term in technical_terms if term.lower() in original_response.lower())
77
+ full_terms = sum(1 for term in technical_terms if term.lower() in full_response.lower())
78
+
79
+ print(f"Technical terms - Original: {original_terms}, Full: {full_terms}")
80
+
81
+ def analyze_paper_coverage():
82
+ """Analyze how much of each paper is loaded"""
83
+ print("\n" + "="*80)
84
+ print("PAPER COVERAGE ANALYSIS")
85
+ print("="*80)
86
+
87
+ assistant = FullContextResearchAssistant()
88
+
89
+ print("\nFull papers loaded:")
90
+ total_chars = 0
91
+ for key, paper_info in assistant.full_papers.items():
92
+ print(f"\n{key}:")
93
+ print(f" Title: {paper_info['title']}")
94
+ print(f" Pages: {paper_info['num_pages']}")
95
+ print(f" Characters: {paper_info['length']:,}")
96
+ total_chars += paper_info['length']
97
+
98
+ print(f"\nTotal characters across all papers: {total_chars:,}")
99
+ print(f"Approximate tokens (chars/4): {total_chars//4:,}")
100
+ print(f"Well within Gemini 2.0 Flash context window (1M+ tokens)")
101
+
102
+ if __name__ == "__main__":
103
+ # Check API key
104
+ if not os.getenv("GOOGLE_API_KEY"):
105
+ print("Warning: No GOOGLE_API_KEY found. Results will be limited.\n")
106
+
107
+ # Run tests
108
+ test_both_versions()
109
+ analyze_paper_coverage()
110
+
111
+ print("\n" + "="*80)
112
+ print("Testing complete!")
113
+ print("\nKey improvements in full context version:")
114
+ print("- Loads complete papers instead of just first few pages")
115
+ print("- Larger chunk sizes (2000 vs 500 chars)")
116
+ print("- Better context preservation")
117
+ print("- More comprehensive responses")
118
+ print("- Ability to make cross-paper connections")