Davidvandijcke commited on
Commit ·
816ce76
1
Parent(s): 627fbbe
Update to professional assistant with Gemini 2.5 Flash Preview
Browse files- Uses Gemini 2.5 Flash Preview for better responses
- Professional chat interface
- 15 question limit per session
- Assistant speaks as expert about David (third person)
- Improved prompting for concise, informative responses
- Full paper loading for comprehensive context
- .env.example +6 -1
- .gitignore +4 -0
- IMPROVEMENTS_SUMMARY.md +75 -0
- README.md +46 -1
- README_UV_SETUP.md +85 -0
- app.py +222 -634
- app_enhanced.py +599 -0
- app_final.py +246 -0
- app_full_context.py +401 -0
- app_natural.py +355 -0
- app_optimized.py +554 -0
- app_professional.py +233 -0
- app_simple_chat.py +124 -0
- app_sota.py +341 -0
- app_stable.py +355 -0
- app_working.py +368 -0
- pyproject.toml +64 -0
- pyproject_stable.toml +24 -0
- setup_stable.sh +35 -0
- test_full_context.py +118 -0
.env.example
CHANGED
|
@@ -1,4 +1,9 @@
|
|
| 1 |
# Google AI API Key (optional but recommended)
|
| 2 |
# Get your API key from https://aistudio.google.com/app/apikey
|
| 3 |
# If not provided, the app will use a limited mode with lower quality
|
| 4 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Google AI API Key (optional but recommended)
|
| 2 |
# Get your API key from https://aistudio.google.com/app/apikey
|
| 3 |
# If not provided, the app will use a limited mode with lower quality
|
| 4 |
+
GOOGLE_API_KEY=your_google_api_key_here
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
# Optional: Override default model names
|
| 8 |
+
# EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
|
| 9 |
+
# LLM_MODEL=gemini-1.5-flash
|
.gitignore
CHANGED
|
@@ -3,6 +3,10 @@
|
|
| 3 |
.env.local
|
| 4 |
.env.*.local
|
| 5 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
# Cache
|
| 7 |
vector_store_cache/
|
| 8 |
__pycache__/
|
|
|
|
| 3 |
.env.local
|
| 4 |
.env.*.local
|
| 5 |
|
| 6 |
+
# uv
|
| 7 |
+
.venv/
|
| 8 |
+
uv.lock
|
| 9 |
+
|
| 10 |
# Cache
|
| 11 |
vector_store_cache/
|
| 12 |
__pycache__/
|
IMPROVEMENTS_SUMMARY.md
ADDED
|
@@ -0,0 +1,75 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Research Assistant Improvements Summary
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
I've significantly improved the David Van Dijcke Research Assistant to provide more comprehensive and accurate responses by leveraging Gemini's large context window and implementing smart retrieval strategies.
|
| 6 |
+
|
| 7 |
+
## Key Improvements
|
| 8 |
+
|
| 9 |
+
### 1. Full Paper Loading (`app_full_context.py`)
|
| 10 |
+
- **Before**: Only loaded first 3-10 pages of each PDF
|
| 11 |
+
- **After**: Loads complete papers (all pages)
|
| 12 |
+
- **Impact**: Complete context for accurate, detailed responses
|
| 13 |
+
|
| 14 |
+
### 2. Smart Retrieval (`app_optimized.py`)
|
| 15 |
+
- **Query Type Detection**: Identifies technical vs overview vs application queries
|
| 16 |
+
- **Section Extraction**: Intelligently parses papers into sections (intro, theory, results, etc.)
|
| 17 |
+
- **Hierarchical Search**: Uses both section-level and chunk-level retrieval
|
| 18 |
+
- **Response Caching**: Instant responses for repeated queries
|
| 19 |
+
|
| 20 |
+
### 3. Enhanced Context Window Usage
|
| 21 |
+
- **Chunk Size**: Increased from 500 to 2000 characters
|
| 22 |
+
- **Context Limit**: Up to 1M characters (250k tokens) for Gemini 2.0 Flash
|
| 23 |
+
- **Paper Selection**: Smart selection of most relevant papers based on query
|
| 24 |
+
|
| 25 |
+
### 4. UV Package Management
|
| 26 |
+
- **Faster Installation**: UV is significantly faster than pip
|
| 27 |
+
- **Better Dependency Resolution**: More reliable builds
|
| 28 |
+
- **Multiple Configurations**: Easy switching between versions
|
| 29 |
+
- **Lock File Support**: Reproducible environments
|
| 30 |
+
|
| 31 |
+
## Performance Comparison
|
| 32 |
+
|
| 33 |
+
| Metric | Original | Full Context | Optimized |
|
| 34 |
+
|--------|----------|--------------|-----------|
|
| 35 |
+
| Pages Loaded | 3-10 | All | All |
|
| 36 |
+
| Chunk Size | 500 chars | 2000 chars | 1000-3000 chars |
|
| 37 |
+
| Context Window | ~2k chars | ~1M chars | Smart selection |
|
| 38 |
+
| Response Quality | Basic | Comprehensive | Targeted & Detailed |
|
| 39 |
+
| Speed | Fast | Slower | Fast (with caching) |
|
| 40 |
+
|
| 41 |
+
## Usage Recommendations
|
| 42 |
+
|
| 43 |
+
1. **For General Q&A**: Use `app_optimized.py` (best balance)
|
| 44 |
+
2. **For Deep Technical Questions**: Use `app_full_context.py`
|
| 45 |
+
3. **For Quick Testing**: Use original `app.py`
|
| 46 |
+
4. **For Production**: Deploy `app_optimized.py` with caching
|
| 47 |
+
|
| 48 |
+
## Technical Details
|
| 49 |
+
|
| 50 |
+
### Vector Store Strategy
|
| 51 |
+
- **Chunks Store**: Smaller chunks (1000 chars) for detailed retrieval
|
| 52 |
+
- **Sections Store**: Larger chunks (3000 chars) for context preservation
|
| 53 |
+
- **Caching**: Separate caches for different chunking strategies
|
| 54 |
+
|
| 55 |
+
### Query Processing Pipeline
|
| 56 |
+
1. Query type classification
|
| 57 |
+
2. Relevant paper identification (keyword + embedding search)
|
| 58 |
+
3. Section/chunk retrieval based on query type
|
| 59 |
+
4. Context assembly with priority ordering
|
| 60 |
+
5. Response generation with Gemini
|
| 61 |
+
6. Response caching for efficiency
|
| 62 |
+
|
| 63 |
+
### Memory Optimization
|
| 64 |
+
- Lazy loading of papers
|
| 65 |
+
- JSON caching of processed papers
|
| 66 |
+
- Separate vector stores by granularity
|
| 67 |
+
- Response cache with query normalization
|
| 68 |
+
|
| 69 |
+
## Next Steps
|
| 70 |
+
|
| 71 |
+
1. **Fine-tune Retrieval**: Adjust weights for different query types
|
| 72 |
+
2. **Add Conversation Memory**: Track context across multiple queries
|
| 73 |
+
3. **Implement Streaming**: Stream responses for better UX
|
| 74 |
+
4. **Add Citations**: Include specific page/section references
|
| 75 |
+
5. **Multi-modal Support**: Include figures and tables from papers
|
README.md
CHANGED
|
@@ -14,6 +14,13 @@ license: mit
|
|
| 14 |
|
| 15 |
An AI-powered assistant specializing in David Van Dijcke's econometric research. David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
|
| 16 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
## Features
|
| 18 |
|
| 19 |
- **Econometric Methods Focus**: Detailed information about David's methodological contributions
|
|
@@ -22,6 +29,15 @@ An AI-powered assistant specializing in David Van Dijcke's econometric research.
|
|
| 22 |
- **Policy Applications**: How David applies econometric tools to answer questions with big data
|
| 23 |
- **Research Portfolio**: Information on FDR, DISCO, RTO, and other papers
|
| 24 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
## Getting the Best Performance
|
| 26 |
|
| 27 |
For high quality, accurate responses at very low cost, use Google's Gemini 2.5 Flash:
|
|
@@ -54,6 +70,33 @@ This space is designed to run on Hugging Face Spaces with CPU inference.
|
|
| 54 |
|
| 55 |
## Local Development
|
| 56 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 57 |
1. Install requirements:
|
| 58 |
```bash
|
| 59 |
pip install -r requirements.txt
|
|
@@ -62,4 +105,6 @@ pip install -r requirements.txt
|
|
| 62 |
2. Run the app:
|
| 63 |
```bash
|
| 64 |
python app.py
|
| 65 |
-
```
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
An AI-powered assistant specializing in David Van Dijcke's econometric research. David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
|
| 16 |
|
| 17 |
+
## Available Versions
|
| 18 |
+
|
| 19 |
+
1. **app.py** - Original version with basic chunking
|
| 20 |
+
2. **app_improved.py** - Enhanced version with better prompts
|
| 21 |
+
3. **app_full_context.py** - Full paper loading with Gemini's large context window
|
| 22 |
+
4. **app_optimized.py** - Smart retrieval with section extraction and caching
|
| 23 |
+
|
| 24 |
## Features
|
| 25 |
|
| 26 |
- **Econometric Methods Focus**: Detailed information about David's methodological contributions
|
|
|
|
| 29 |
- **Policy Applications**: How David applies econometric tools to answer questions with big data
|
| 30 |
- **Research Portfolio**: Information on FDR, DISCO, RTO, and other papers
|
| 31 |
|
| 32 |
+
### New Improvements
|
| 33 |
+
|
| 34 |
+
- **Full Paper Loading**: Reads complete PDFs instead of just first few pages
|
| 35 |
+
- **Large Context Window**: Leverages Gemini 2.0 Flash's 1M+ token context
|
| 36 |
+
- **Smart Retrieval**: Query-type based retrieval (technical, overview, application)
|
| 37 |
+
- **Section Extraction**: Intelligent parsing of paper sections
|
| 38 |
+
- **Response Caching**: Instant responses for repeated queries
|
| 39 |
+
- **Hierarchical Search**: Both section-level and chunk-level retrieval
|
| 40 |
+
|
| 41 |
## Getting the Best Performance
|
| 42 |
|
| 43 |
For high quality, accurate responses at very low cost, use Google's Gemini 2.5 Flash:
|
|
|
|
| 70 |
|
| 71 |
## Local Development
|
| 72 |
|
| 73 |
+
### Option 1: Using UV (Recommended)
|
| 74 |
+
|
| 75 |
+
1. Install UV:
|
| 76 |
+
```bash
|
| 77 |
+
curl -LsSf https://astral.sh/uv/install.sh | sh
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
2. Create virtual environment and install dependencies:
|
| 81 |
+
```bash
|
| 82 |
+
uv venv
|
| 83 |
+
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
| 84 |
+
uv pip install -e .
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
3. Copy environment file and add your API key:
|
| 88 |
+
```bash
|
| 89 |
+
cp .env.example .env
|
| 90 |
+
# Edit .env and add your GOOGLE_API_KEY
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
4. Run the app:
|
| 94 |
+
```bash
|
| 95 |
+
python app.py
|
| 96 |
+
```
|
| 97 |
+
|
| 98 |
+
### Option 2: Using pip
|
| 99 |
+
|
| 100 |
1. Install requirements:
|
| 101 |
```bash
|
| 102 |
pip install -r requirements.txt
|
|
|
|
| 105 |
2. Run the app:
|
| 106 |
```bash
|
| 107 |
python app.py
|
| 108 |
+
```
|
| 109 |
+
|
| 110 |
+
See `README_UV_SETUP.md` for detailed UV setup instructions.
|
README_UV_SETUP.md
ADDED
|
@@ -0,0 +1,85 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# UV Setup Guide for David Research Assistant
|
| 2 |
+
|
| 3 |
+
This guide explains how to set up the development environment using `uv` instead of `pip`.
|
| 4 |
+
|
| 5 |
+
## Prerequisites
|
| 6 |
+
|
| 7 |
+
Install `uv` if you haven't already:
|
| 8 |
+
```bash
|
| 9 |
+
# macOS/Linux
|
| 10 |
+
curl -LsSf https://astral.sh/uv/install.sh | sh
|
| 11 |
+
|
| 12 |
+
# or using pip
|
| 13 |
+
pip install uv
|
| 14 |
+
```
|
| 15 |
+
|
| 16 |
+
## Setup Instructions
|
| 17 |
+
|
| 18 |
+
1. **Create and activate virtual environment:**
|
| 19 |
+
```bash
|
| 20 |
+
cd david-research-assistant
|
| 21 |
+
uv venv
|
| 22 |
+
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
| 23 |
+
```
|
| 24 |
+
|
| 25 |
+
2. **Install dependencies:**
|
| 26 |
+
|
| 27 |
+
For the standard version (with Google Generative AI):
|
| 28 |
+
```bash
|
| 29 |
+
uv pip install -e .
|
| 30 |
+
```
|
| 31 |
+
|
| 32 |
+
For the improved version (with Hugging Face):
|
| 33 |
+
```bash
|
| 34 |
+
uv pip install -e ".[improved]"
|
| 35 |
+
```
|
| 36 |
+
|
| 37 |
+
For development (includes testing and linting tools):
|
| 38 |
+
```bash
|
| 39 |
+
uv pip install -e . --all-extras
|
| 40 |
+
uv pip install -e ".[test]"
|
| 41 |
+
```
|
| 42 |
+
|
| 43 |
+
3. **Set up environment variables:**
|
| 44 |
+
```bash
|
| 45 |
+
cp .env.example .env
|
| 46 |
+
# Edit .env and add your GOOGLE_API_KEY if using the standard version
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
## Running the Application
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
# Original version (basic chunking)
|
| 53 |
+
python app.py
|
| 54 |
+
|
| 55 |
+
# Improved version (better prompts)
|
| 56 |
+
python app_improved.py
|
| 57 |
+
|
| 58 |
+
# Full context version (complete papers)
|
| 59 |
+
python app_full_context.py
|
| 60 |
+
|
| 61 |
+
# Optimized version (smart retrieval)
|
| 62 |
+
python app_optimized.py
|
| 63 |
+
|
| 64 |
+
# Run tests
|
| 65 |
+
python test_assistant.py
|
| 66 |
+
python test_full_context.py
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
## Benefits of using UV
|
| 70 |
+
|
| 71 |
+
- **Faster installation**: UV is written in Rust and is significantly faster than pip
|
| 72 |
+
- **Better dependency resolution**: More reliable and predictable
|
| 73 |
+
- **Lock file support**: `uv.lock` ensures reproducible builds
|
| 74 |
+
- **Built-in virtual environment management**: No need for separate venv/virtualenv
|
| 75 |
+
|
| 76 |
+
## Switching between versions
|
| 77 |
+
|
| 78 |
+
To switch between standard and improved versions:
|
| 79 |
+
```bash
|
| 80 |
+
# Standard version
|
| 81 |
+
uv pip install -e .
|
| 82 |
+
|
| 83 |
+
# Improved version
|
| 84 |
+
uv pip install -e ".[improved]"
|
| 85 |
+
```
|
app.py
CHANGED
|
@@ -1,666 +1,254 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
import os
|
|
|
|
| 2 |
import gradio as gr
|
| 3 |
-
from
|
| 4 |
-
import
|
| 5 |
-
from datetime import datetime, timedelta
|
| 6 |
-
import hashlib
|
| 7 |
-
import threading
|
| 8 |
-
from collections import defaultdict
|
| 9 |
-
import time
|
| 10 |
-
import re
|
| 11 |
-
try:
|
| 12 |
-
import yaml
|
| 13 |
-
except ImportError:
|
| 14 |
-
yaml = None
|
| 15 |
-
logger.warning("PyYAML not installed. Markdown parsing will be disabled.")
|
| 16 |
-
from pathlib import Path
|
| 17 |
-
|
| 18 |
-
# Import only what we need for better performance
|
| 19 |
-
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 20 |
-
from langchain.document_loaders import PyPDFLoader
|
| 21 |
-
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 22 |
-
from langchain_community.vectorstores import FAISS
|
| 23 |
-
from langchain.schema import Document
|
| 24 |
import google.generativeai as genai
|
| 25 |
-
from google.generativeai.types import HarmCategory, HarmBlockThreshold # Ensure this is present
|
| 26 |
-
import logging
|
| 27 |
-
|
| 28 |
-
# Set up logging
|
| 29 |
-
logging.basicConfig(level=logging.INFO)
|
| 30 |
-
logger = logging.getLogger(__name__)
|
| 31 |
|
| 32 |
-
#
|
| 33 |
-
|
| 34 |
-
MAX_CONCURRENT_SESSIONS = 10 # Maximum simultaneous sessions
|
| 35 |
-
SESSION_TIMEOUT_HOURS = 2 # Sessions expire after 2 hours of inactivity
|
| 36 |
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
# If you intend to use this global variable, ensure its values match your intent.
|
| 40 |
-
safety_settings_block_none_for_all_categories = [ # Renamed for clarity based on its content
|
| 41 |
-
{
|
| 42 |
-
"category": HarmCategory.HARM_CATEGORY_HARASSMENT,
|
| 43 |
-
"threshold": HarmBlockThreshold.BLOCK_NONE,
|
| 44 |
-
},
|
| 45 |
-
{
|
| 46 |
-
"category": HarmCategory.HARM_CATEGORY_HATE_SPEECH,
|
| 47 |
-
"threshold": HarmBlockThreshold.BLOCK_NONE,
|
| 48 |
-
},
|
| 49 |
-
{
|
| 50 |
-
"category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
|
| 51 |
-
"threshold": HarmBlockThreshold.BLOCK_NONE,
|
| 52 |
-
},
|
| 53 |
-
{
|
| 54 |
-
"category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
|
| 55 |
-
"threshold": HarmBlockThreshold.BLOCK_NONE,
|
| 56 |
-
},
|
| 57 |
-
]
|
| 58 |
-
|
| 59 |
-
class DynamicPaperDatabase:
|
| 60 |
-
"""Database that dynamically loads papers from markdown files"""
|
| 61 |
-
def __init__(self, base_path: str = None):
|
| 62 |
-
self.papers = {}
|
| 63 |
-
self.base_path = base_path
|
| 64 |
-
self.load_papers_from_markdown()
|
| 65 |
-
self.create_lookup_indices()
|
| 66 |
|
| 67 |
-
def
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
os.path.join(self.base_path, "_wps")
|
| 91 |
-
]
|
| 92 |
-
found_any = False
|
| 93 |
-
for directory in directories:
|
| 94 |
-
if os.path.exists(directory):
|
| 95 |
-
found_any = True
|
| 96 |
-
for filename in os.listdir(directory):
|
| 97 |
-
if filename.endswith('.md'):
|
| 98 |
-
filepath = os.path.join(directory, filename)
|
| 99 |
-
paper_data = self.parse_markdown_front_matter(filepath)
|
| 100 |
-
if paper_data and 'title' in paper_data:
|
| 101 |
-
paper_key = filename.replace('.md', '').lower()
|
| 102 |
-
coauthors = [author.strip() for author in paper_data.get('coauthors', '').split(',') if author.strip()]
|
| 103 |
-
authors = ["David Van Dijcke"] + coauthors
|
| 104 |
-
seen = set()
|
| 105 |
-
authors = [x for x in authors if not (x in seen or seen.add(x))]
|
| 106 |
-
year = None
|
| 107 |
-
if 'date' in paper_data:
|
| 108 |
-
year = str(paper_data['date']).split('-')[0]
|
| 109 |
-
elif filename.startswith('20'):
|
| 110 |
-
year = filename[:4]
|
| 111 |
-
paper_type = "working_paper" if "_wps" in directory else "publication"
|
| 112 |
-
if 'job market' in paper_data.get('title', '').lower():
|
| 113 |
-
paper_type = "job_market_paper"
|
| 114 |
-
keywords = self.extract_keywords(paper_data)
|
| 115 |
-
self.papers[paper_key] = {
|
| 116 |
-
"title": paper_data['title'], "authors": authors, "year": int(year) if year else None,
|
| 117 |
-
"type": paper_type, "keywords": keywords, "venue": paper_data.get('venue', ''),
|
| 118 |
-
"excerpt": paper_data.get('excerpt', ''), "paperurl": paper_data.get('paperurl', ''),
|
| 119 |
-
"citation": paper_data.get('citation', ''), "field": paper_data.get('field', ''),
|
| 120 |
-
"full_content": paper_data.get('full_content', '')
|
| 121 |
-
}
|
| 122 |
-
logger.info(f"Loaded paper: {paper_data['title']} with authors: {authors}")
|
| 123 |
-
if found_any: return
|
| 124 |
-
logger.info("Using hardcoded paper database")
|
| 125 |
-
self.load_hardcoded_papers()
|
| 126 |
|
| 127 |
-
def
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
"
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
"fdr": {
|
| 142 |
-
"title": "Free Discontinuity Regression", "authors": ["Florian Gunsilius", "David Van Dijcke"], "year": 2025, "type": "working_paper",
|
| 143 |
-
"keywords": ["free discontinuity", "mumford-shah", "internet shutdown", "india", "multivariate", "causal inference", "geometric measure theory"],
|
| 144 |
-
"excerpt": "This paper develops a new method for detecting and estimating multivariate discontinuities without prior knowledge of their location. Using a convex relaxation of the Mumford-Shah functional from geometric measure theory, FDR automatically identifies discontinuity sets and estimates treatment effects. Applied to internet shutdowns in India to show heterogeneous effects across regions.",
|
| 145 |
-
"field": "Econometrics", "full_content": "FDR introduces methods from geometric measure theory to econometrics. The paper solves the problem of estimating causal effects when the discontinuity location is unknown and potentially complex (curves, surfaces). Applications include geographic regression discontinuities and policy boundaries."
|
| 146 |
-
},
|
| 147 |
-
"revenue-production": {
|
| 148 |
-
"title": "On the Non-Identification of Revenue Production Functions", "authors": ["David Van Dijcke"], "year": 2023, "type": "working_paper",
|
| 149 |
-
"keywords": ["revenue", "production function", "identification"], "field": "Econometrics"
|
| 150 |
-
},
|
| 151 |
-
"disco": {
|
| 152 |
-
"title": "Distributional Synthetic Controls", "authors": ["Florian Gunsilius", "David Van Dijcke"], "year": 2025, "type": "working_paper",
|
| 153 |
-
"keywords": ["distributional synthetic", "optimal transport", "synthetic control", "distribution", "causal inference", "quantile effects"],
|
| 154 |
-
"excerpt": "This paper extends synthetic control methods to estimate effects on entire outcome distributions. Using optimal transport theory, DISCO creates synthetic controls that match the pre-treatment distribution of the treated unit. This enables estimation of quantile treatment effects and other distributional parameters. Includes an R package implementation.",
|
| 155 |
-
"field": "Econometrics", "full_content": "DISCO combines synthetic controls with optimal transport to analyze distributional treatment effects. Key innovation: matching entire pre-treatment distributions rather than just means. Applications include analyzing distributional effects of minimum wage policies and other interventions."
|
| 156 |
-
},
|
| 157 |
-
"ukraine": {
|
| 158 |
-
"title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine", "authors": ["David Van Dijcke", "Austin L. Wright", "Maria Polyak"], "year": 2023, "journal": "Proceedings of the National Academy of Sciences", "type": "publication",
|
| 159 |
-
"keywords": ["ukraine", "air raid", "alerts", "casualties", "mobility"], "field": "Policy"
|
| 160 |
-
},
|
| 161 |
-
"unmasking": {
|
| 162 |
-
"title": "Unmasking Partisanship: Polarization undermines public response to collective risk", "authors": ["Maria Milosh", "Marcus Painter", "Konstantin Sonin", "David Van Dijcke", "Austin Wright"], "year": 2021, "journal": "Journal of Public Economics", "type": "publication",
|
| 163 |
-
"keywords": ["partisanship", "polarization", "covid", "mask", "social distancing"],
|
| 164 |
-
"excerpt": "Political polarization and competing narratives can undermine public policy implementation. Partisanship may play a particularly important role in shaping heterogeneous responses to collective risk during periods of crisis when political agents manipulate signals received by the public.", "field": "Policy"
|
| 165 |
-
},
|
| 166 |
-
"science-skepticism": {
|
| 167 |
-
"title": "Science Skepticism Reduced Compliance with COVID-19 Shelter-in-Place Policies", "authors": ["Adam Brzezinski", "Valentin Kecht", "David Van Dijcke", "Austin L. Wright"], "year": 2021, "journal": "Nature Human Behaviour", "citations": 226, "type": "publication",
|
| 168 |
-
"keywords": ["covid", "science skepticism", "compliance", "shelter in place"], "field": "Policy"
|
| 169 |
-
},
|
| 170 |
-
"government-community": {
|
| 171 |
-
"title": "The COVID-19 Pandemic: Government versus Community Action Across the United States", "authors": ["Adam Brzezinski", "Guido Deiana", "Valentin Kecht", "David Van Dijcke"], "year": 2020, "journal": "Covid Economics", "citations": 160, "type": "publication",
|
| 172 |
-
"keywords": ["covid", "government", "community", "mandates", "voluntary"], "field": "Policy"
|
| 173 |
-
},
|
| 174 |
-
"work-effort": {
|
| 175 |
-
"title": "Work Effort and the Cycle: Evidence from Survey Data", "authors": ["Vivien Lewis", "David van Dijcke"], "year": 2019, "journal": "Deutsche Bundesbank Discussion Papers", "type": "publication",
|
| 176 |
-
"keywords": ["work effort", "business cycle", "survey"], "field": "Macro"
|
| 177 |
-
}
|
| 178 |
}
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
for key, paper in self.papers.items():
|
| 198 |
-
normalized_title = paper["title"].lower().strip()
|
| 199 |
-
self.title_to_key[normalized_title] = key
|
| 200 |
-
title_words = normalized_title.split()
|
| 201 |
-
if len(title_words) > 3:
|
| 202 |
-
self.title_to_key[" ".join(title_words[:3])] = key
|
| 203 |
-
for keyword in paper.get("keywords", []):
|
| 204 |
-
self.keyword_to_papers[keyword.lower()].append(key)
|
| 205 |
-
|
| 206 |
-
def find_paper(self, text: str) -> List[str]:
|
| 207 |
-
text_lower = text.lower()
|
| 208 |
-
found_papers = []
|
| 209 |
-
for key in self.papers.keys():
|
| 210 |
-
if key in text_lower: found_papers.append(key)
|
| 211 |
-
for title_fragment, key in self.title_to_key.items():
|
| 212 |
-
if title_fragment in text_lower and key not in found_papers: found_papers.append(key)
|
| 213 |
-
keyword_matches = defaultdict(int)
|
| 214 |
-
for keyword, paper_keys in self.keyword_to_papers.items():
|
| 215 |
-
if keyword in text_lower:
|
| 216 |
-
for paper_key in paper_keys: keyword_matches[paper_key] += 1
|
| 217 |
-
for paper_key, match_count in keyword_matches.items():
|
| 218 |
-
if match_count >= 2 and paper_key not in found_papers: found_papers.append(paper_key)
|
| 219 |
-
return found_papers
|
| 220 |
-
|
| 221 |
-
def verify_and_correct_response(self, response: str) -> str:
|
| 222 |
-
mentioned_papers = self.find_paper(response)
|
| 223 |
-
if not mentioned_papers: return response
|
| 224 |
-
corrected_response = response
|
| 225 |
-
for paper_key in mentioned_papers:
|
| 226 |
-
paper = self.papers[paper_key]
|
| 227 |
-
correct_authors = paper["authors"]
|
| 228 |
-
paper_title = paper["title"]
|
| 229 |
-
if len(correct_authors) == 1: author_str = correct_authors[0]
|
| 230 |
-
elif len(correct_authors) == 2: author_str = " and ".join(correct_authors)
|
| 231 |
-
else: author_str = ", ".join(correct_authors[:-1]) + ", and " + correct_authors[-1]
|
| 232 |
-
title_pattern = re.escape(paper_title)
|
| 233 |
-
patterns = [
|
| 234 |
-
rf"({title_pattern})[^.]*?by\s+([^.]+?)(?:\.|,|\))", rf"({title_pattern})[^.]*?with\s+([^.]+?)(?:\.|,|\))",
|
| 235 |
-
rf"({title_pattern})[^.]*?\(([^)]+?)\)", rf"({title_pattern})[^.]*?-\s*Authors:\s*([^.]+?)(?:\.|,|\n)",
|
| 236 |
-
]
|
| 237 |
-
for pattern in patterns:
|
| 238 |
-
matches = re.finditer(pattern, corrected_response, re.IGNORECASE)
|
| 239 |
-
for match in matches:
|
| 240 |
-
full_match = match.group(0)
|
| 241 |
-
author_part = match.group(2)
|
| 242 |
-
mentioned_authors = [a.strip() for a in re.split(r',|and', author_part)]
|
| 243 |
-
if len(correct_authors) > 1 and len(mentioned_authors) == 1 and "David" in mentioned_authors[0]:
|
| 244 |
-
if "by" in full_match: new_match = full_match.replace(f"by {author_part}", f"by {author_str}")
|
| 245 |
-
elif "with" in full_match: new_match = full_match.replace(f"with {author_part}", f"with {author_str}")
|
| 246 |
-
elif "(" in full_match and ")" in full_match: new_match = full_match.replace(f"({author_part})", f"({author_str})")
|
| 247 |
-
elif "Authors:" in full_match: new_match = full_match.replace(f"Authors: {author_part}", f"Authors: {author_str}")
|
| 248 |
-
else: new_match = full_match
|
| 249 |
-
corrected_response = corrected_response.replace(full_match, new_match)
|
| 250 |
-
for paper_key in mentioned_papers:
|
| 251 |
-
paper = self.papers[paper_key]
|
| 252 |
-
if len(paper["authors"]) > 1:
|
| 253 |
-
possessive_patterns = [rf"David's\s+{re.escape(paper['title'])}", rf"his\s+{re.escape(paper['title'])}"]
|
| 254 |
-
for pattern in possessive_patterns:
|
| 255 |
-
if re.search(pattern, corrected_response, re.IGNORECASE):
|
| 256 |
-
author_str_coauthors = " and ".join([a for a in paper["authors"] if a != "David Van Dijcke"]) # Corrected variable name
|
| 257 |
-
if author_str_coauthors and author_str_coauthors not in corrected_response: # Check if coauthor_str is not empty
|
| 258 |
-
sentences = corrected_response.split('.')
|
| 259 |
-
for i, sentence in enumerate(sentences):
|
| 260 |
-
if re.search(pattern, sentence, re.IGNORECASE):
|
| 261 |
-
sentences[i] = sentence + f" (joint work with {author_str_coauthors})" # Corrected variable name
|
| 262 |
-
corrected_response = '.'.join(sentences)
|
| 263 |
-
break
|
| 264 |
-
return corrected_response
|
| 265 |
-
|
| 266 |
-
class JudgeAgent:
|
| 267 |
-
def __init__(self, paper_db: DynamicPaperDatabase):
|
| 268 |
-
self.paper_db = paper_db
|
| 269 |
-
gemini_api_key = os.getenv("GOOGLE_API_KEY")
|
| 270 |
-
self.use_gemini = False
|
| 271 |
-
if gemini_api_key:
|
| 272 |
-
try:
|
| 273 |
-
genai.configure(api_key=gemini_api_key)
|
| 274 |
-
model_preference = ['gemini-1.5-flash-002', 'gemini-1.5-flash', 'gemini-1.5-pro']
|
| 275 |
-
for model_name in model_preference:
|
| 276 |
-
try:
|
| 277 |
-
self.judge_model = genai.GenerativeModel(model_name)
|
| 278 |
-
self.judge_model.generate_content("Hello") # Test call
|
| 279 |
-
self.use_gemini = True
|
| 280 |
-
logger.info(f"Judge agent initialized with {model_name}")
|
| 281 |
-
break
|
| 282 |
-
except Exception:
|
| 283 |
-
logger.warning(f"Failed to initialize judge with {model_name}, trying next.")
|
| 284 |
-
if not self.use_gemini:
|
| 285 |
-
logger.error("Failed to initialize judge agent with any Gemini model.")
|
| 286 |
-
except Exception as e:
|
| 287 |
-
logger.error(f"Failed to configure Gemini for judge agent: {e}")
|
| 288 |
-
self.use_gemini = False
|
| 289 |
|
| 290 |
-
def
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 299 |
|
| 300 |
-
def
|
| 301 |
-
|
| 302 |
-
|
|
|
|
| 303 |
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
|
| 307 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 308 |
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 312 |
|
| 313 |
-
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
4. Are the claims supported by the paper database below?
|
| 318 |
|
| 319 |
-
|
| 320 |
-
{paper_context}
|
| 321 |
|
| 322 |
-
|
|
|
|
| 323 |
|
| 324 |
-
|
| 325 |
-
{original_response}
|
| 326 |
|
| 327 |
-
|
| 328 |
|
|
|
|
|
|
|
| 329 |
try:
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
)
|
| 333 |
-
judge_safety_settings = [
|
| 334 |
-
{"category": HarmCategory.HARM_CATEGORY_HARASSMENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 335 |
-
{"category": HarmCategory.HARM_CATEGORY_HATE_SPEECH, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 336 |
-
{"category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 337 |
-
{"category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 338 |
-
]
|
| 339 |
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
generation_config=generation_config,
|
| 343 |
-
safety_settings=judge_safety_settings
|
| 344 |
-
)
|
| 345 |
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
judge_gemini_response.candidates[0].content.parts and
|
| 351 |
-
len(judge_gemini_response.candidates[0].content.parts) > 0):
|
| 352 |
-
judged_text = judge_gemini_response.text.strip()
|
| 353 |
-
else:
|
| 354 |
-
block_reason_info = "Reason unknown."
|
| 355 |
-
finish_reason_info = "Finish reason unknown."
|
| 356 |
-
if judge_gemini_response.prompt_feedback and judge_gemini_response.prompt_feedback.block_reason:
|
| 357 |
-
block_reason_info = f"Prompt blocked for judge due to: {judge_gemini_response.prompt_feedback.block_reason.name}"
|
| 358 |
-
if judge_gemini_response.prompt_feedback.block_reason_message:
|
| 359 |
-
block_reason_info += f" (Message: {judge_gemini_response.prompt_feedback.block_reason_message})"
|
| 360 |
-
logger.error(f"Gemini judge agent: {block_reason_info}")
|
| 361 |
-
if judge_gemini_response.candidates and len(judge_gemini_response.candidates) > 0:
|
| 362 |
-
candidate = judge_gemini_response.candidates[0]
|
| 363 |
-
finish_reason_info = f"Finish reason for judge: {candidate.finish_reason.name}"
|
| 364 |
-
logger.error(f"Gemini judge agent: {finish_reason_info}")
|
| 365 |
-
if candidate.safety_ratings:
|
| 366 |
-
for rating in candidate.safety_ratings:
|
| 367 |
-
logger.error(f" Judge Safety Rating: Category={rating.category.name}, Probability={rating.probability.name}")
|
| 368 |
-
logger.warning(f"Judge agent could not generate a refined response ({block_reason_info}, {finish_reason_info}). Falling back to pre-judge verified response.")
|
| 369 |
-
return self.paper_db.verify_and_correct_response(original_response)
|
| 370 |
-
|
| 371 |
-
final_response = self.paper_db.verify_and_correct_response(judged_text)
|
| 372 |
-
return final_response
|
| 373 |
|
|
|
|
|
|
|
| 374 |
except Exception as e:
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
class RateLimiter:
|
| 379 |
-
def __init__(self):
|
| 380 |
-
self.sessions = {}
|
| 381 |
-
self.lock = threading.Lock()
|
| 382 |
-
def get_session_info(self, session_id: str) -> Dict:
|
| 383 |
-
with self.lock:
|
| 384 |
-
current_time = datetime.now()
|
| 385 |
-
expired_sessions = [sid for sid, info in self.sessions.items() if current_time - info['last_activity'] > timedelta(hours=SESSION_TIMEOUT_HOURS)]
|
| 386 |
-
for sid in expired_sessions: del self.sessions[sid]; logger.info(f"Expired session: {sid}")
|
| 387 |
-
if session_id not in self.sessions:
|
| 388 |
-
if len(self.sessions) >= MAX_CONCURRENT_SESSIONS: return {'allowed': False, 'reason': 'Too many active sessions. Please try again later.'}
|
| 389 |
-
self.sessions[session_id] = {'message_count': 0, 'created': current_time, 'last_activity': current_time}
|
| 390 |
-
session = self.sessions[session_id]
|
| 391 |
-
session['last_activity'] = current_time
|
| 392 |
-
if session['message_count'] >= MAX_MESSAGES_PER_SESSION:
|
| 393 |
-
return {'allowed': False, 'reason': f'You have reached the limit of {MAX_MESSAGES_PER_SESSION} messages. Please email David at dvdijcke@umich.edu for further questions.'}
|
| 394 |
-
session['message_count'] += 1
|
| 395 |
-
return {'allowed': True, 'message_count': session['message_count'], 'remaining': MAX_MESSAGES_PER_SESSION - session['message_count']}
|
| 396 |
-
|
| 397 |
-
paper_db = DynamicPaperDatabase() # Use default base_path (None) for Hugging Face
|
| 398 |
-
judge_agent = JudgeAgent(paper_db)
|
| 399 |
-
rate_limiter = RateLimiter()
|
| 400 |
|
| 401 |
-
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
gemini_api_key = os.getenv("GOOGLE_API_KEY")
|
| 405 |
-
self.use_gemini = False
|
| 406 |
-
if gemini_api_key:
|
| 407 |
-
try:
|
| 408 |
-
genai.configure(api_key=gemini_api_key)
|
| 409 |
-
logger.info("Attempting to use Google Gemini for high quality responses")
|
| 410 |
-
model_preference = ['gemini-1.5-flash-002', 'gemini-1.5-flash', 'gemini-1.5-pro']
|
| 411 |
-
for model_name in model_preference:
|
| 412 |
-
try:
|
| 413 |
-
self.gemini_model = genai.GenerativeModel(model_name)
|
| 414 |
-
self.gemini_model.generate_content("Hello") # Test call
|
| 415 |
-
self.use_gemini = True
|
| 416 |
-
logger.info(f"Successfully connected to {model_name}")
|
| 417 |
-
break
|
| 418 |
-
except Exception:
|
| 419 |
-
logger.warning(f"Failed to connect to {model_name}, trying next.")
|
| 420 |
-
if not self.use_gemini:
|
| 421 |
-
logger.error("Failed to connect to any Gemini model. Using limited mode.")
|
| 422 |
-
except Exception as e:
|
| 423 |
-
logger.error(f"Failed to initialize Gemini: {e}")
|
| 424 |
-
self.use_gemini = False
|
| 425 |
-
else:
|
| 426 |
-
logger.warning("No Google API key found. Using limited mode.")
|
| 427 |
-
self.use_gemini = False
|
| 428 |
-
self.vector_store = None
|
| 429 |
-
self.cache_path = "vector_store_cache"
|
| 430 |
-
logger.info("Building vector store from documents and markdown files...")
|
| 431 |
-
self.load_documents()
|
| 432 |
-
|
| 433 |
-
def load_documents(self):
|
| 434 |
-
documents = []
|
| 435 |
-
research_info = """
|
| 436 |
-
David Van Dijcke is a PhD candidate in Economics at the University of Michigan, Ann Arbor.
|
| 437 |
-
He is on the job market for the 2025-26 academic year as an ECONOMETRICIAN.
|
| 438 |
-
RESEARCH PROFILE: David's research has two main components:
|
| 439 |
-
1. ECONOMETRIC THEORY: Developing novel methods for functional and high-dimensional data, combining tools from functional data analysis, optimal transport, and geometric measure theory
|
| 440 |
-
2. POLICY APPLICATIONS: Applying these methods to answer important policy questions using big data, from labor markets to public health to conflict zones
|
| 441 |
-
IMPORTANT: Always credit coauthors when discussing papers. Economics papers typically use alphabetical author order.
|
| 442 |
-
CONTACT: Email: dvdijcke@umich.edu, Website: https://davidvandijcke.com, Book a meeting: https://calendar.app.google/dKeDaigmFwnJPm8s6
|
| 443 |
-
"""
|
| 444 |
-
documents.append(Document(page_content=research_info, metadata={"source": "website_overview", "type": "general_info"}))
|
| 445 |
-
for paper_key, paper in paper_db.papers.items():
|
| 446 |
-
paper_content = f"Paper: {paper['title']}\nAuthors: {', '.join(paper['authors'])}\nYear: {paper.get('year', 'forthcoming')}\nType: {paper['type']}\nField: {paper.get('field', 'Economics')}\n"
|
| 447 |
-
if paper.get('venue'): paper_content += f"Venue: {paper['venue']}\n"
|
| 448 |
-
if paper.get('citation'): paper_content += f"Citation: {paper['citation']}\n"
|
| 449 |
-
if paper.get('excerpt'): paper_content += f"\nAbstract/Summary: {paper['excerpt']}\n"
|
| 450 |
-
if paper.get('full_content'): paper_content += f"\nDetails: {paper['full_content']}\n"
|
| 451 |
-
if paper['type'] == 'job_market_paper': paper_content += "\nNOTE: This is David's JOB MARKET PAPER for 2025-26.\n"
|
| 452 |
-
documents.append(Document(page_content=paper_content, metadata={"source": f"paper_{paper_key}", "type": "research"}))
|
| 453 |
-
key_pdfs = ["CV_DavidVanDijcke.pdf", "disco.pdf", "fdr.pdf", "r3d_arxiv_4apr2025.pdf", "rto.pdf", "unmasking_partisanship.pdf"]
|
| 454 |
-
possible_dirs = ["documents", "./documents", os.path.join(os.getcwd(), "documents")]
|
| 455 |
-
documents_dir = next((dir_path for dir_path in possible_dirs if os.path.exists(dir_path)), None)
|
| 456 |
-
if documents_dir:
|
| 457 |
-
logger.info(f"Found documents directory at: {documents_dir}")
|
| 458 |
-
for filename in key_pdfs:
|
| 459 |
-
filepath = os.path.join(documents_dir, filename)
|
| 460 |
-
if os.path.exists(filepath):
|
| 461 |
-
try:
|
| 462 |
-
loader = PyPDFLoader(filepath)
|
| 463 |
-
pdf_docs = loader.load()
|
| 464 |
-
pages_to_load = 10 if "r3d" in filename.lower() else 5
|
| 465 |
-
documents.extend(pdf_docs[:pages_to_load])
|
| 466 |
-
logger.info(f"Loaded {filename} ({pages_to_load} pages)")
|
| 467 |
-
except Exception as e:
|
| 468 |
-
logger.warning(f"Error loading {filename}: {e}")
|
| 469 |
-
else:
|
| 470 |
-
logger.warning("No documents directory found. PDF loading skipped.")
|
| 471 |
-
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, length_function=len)
|
| 472 |
-
splits = text_splitter.split_documents(documents)
|
| 473 |
-
self.vector_store = FAISS.from_documents(splits, self.embeddings)
|
| 474 |
-
try:
|
| 475 |
-
if not os.path.exists(self.cache_path):
|
| 476 |
-
os.makedirs(self.cache_path)
|
| 477 |
-
self.vector_store.save_local(self.cache_path)
|
| 478 |
-
logger.info("Vector store cached successfully")
|
| 479 |
-
except Exception as e:
|
| 480 |
-
logger.warning(f"Failed to cache vector store (non-critical): {e}")
|
| 481 |
-
|
| 482 |
-
def is_greeting_or_casual(self, message: str) -> bool:
|
| 483 |
-
greetings = ["hello", "hi", "hey", "good morning", "good afternoon", "good evening", "how are you", "what's up", "greetings", "howdy", "hola", "bonjour"]
|
| 484 |
-
message_lower = message.lower().strip()
|
| 485 |
-
starts_with_greeting = any(message_lower.startswith(greeting) for greeting in greetings)
|
| 486 |
-
is_very_short = len(message_lower.split()) <= 2 and not any(word in message_lower for word in ["r3d", "paper", "research", "method", "econometric", "about", "tell", "what", "how"])
|
| 487 |
-
return starts_with_greeting or is_very_short
|
| 488 |
|
| 489 |
-
|
| 490 |
-
|
| 491 |
-
|
| 492 |
-
|
| 493 |
-
|
| 494 |
-
|
| 495 |
-
|
| 496 |
-
|
| 497 |
-
|
| 498 |
-
|
| 499 |
-
|
| 500 |
-
|
| 501 |
-
|
| 502 |
-
|
| 503 |
-
|
| 504 |
-
- Explain the main contributions and innovations
|
| 505 |
-
- Mention applications and empirical examples when available
|
| 506 |
-
- For the job market paper (R3D), emphasize its importance and innovations
|
| 507 |
-
|
| 508 |
-
3. RESEARCH PROFILE:
|
| 509 |
-
- David is an ECONOMETRICIAN who develops new statistical methods
|
| 510 |
-
- His job market paper is R3D: Regression Discontinuity Design with Distribution-Valued Outcomes (sole authored)
|
| 511 |
-
- He combines functional data analysis, optimal transport, and geometric measure theory
|
| 512 |
-
- He applies these methods to answer policy questions with big data
|
| 513 |
-
- His work extends causal inference beyond scalar outcomes to distribution-valued outcomes
|
| 514 |
-
|
| 515 |
-
4. KEY COLLABORATORS:
|
| 516 |
-
- Florian Gunsilius (frequent coauthor on FDR, DISCO)
|
| 517 |
-
- Austin Wright (Return to Office, Ukraine, COVID papers)
|
| 518 |
-
- Other coauthors should be mentioned by name when discussing their joint work
|
| 519 |
-
|
| 520 |
-
Be precise about technical details and provide substantive information. If uncertain about details, suggest emailing David at dvdijcke@umich.edu.
|
| 521 |
-
|
| 522 |
-
Context about David Van Dijcke:
|
| 523 |
-
{context}
|
| 524 |
-
|
| 525 |
-
User's question: {question}
|
| 526 |
-
|
| 527 |
-
Provide an accurate, detailed, and professional response. Remember to ALWAYS credit ALL coauthors and provide substantive information about the research."""
|
| 528 |
-
|
| 529 |
-
try:
|
| 530 |
-
generation_config = genai.types.GenerationConfig(
|
| 531 |
-
temperature=0.2, top_p=0.9, max_output_tokens=500,
|
| 532 |
-
)
|
| 533 |
-
safety_settings_for_call = [
|
| 534 |
-
{"category": HarmCategory.HARM_CATEGORY_HARASSMENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 535 |
-
{"category": HarmCategory.HARM_CATEGORY_HATE_SPEECH, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 536 |
-
{"category": HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 537 |
-
{"category": HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, "threshold": HarmBlockThreshold.BLOCK_ONLY_HIGH},
|
| 538 |
-
]
|
| 539 |
-
|
| 540 |
-
gemini_api_response = self.gemini_model.generate_content(
|
| 541 |
-
prompt,
|
| 542 |
-
generation_config=generation_config,
|
| 543 |
-
safety_settings=safety_settings_for_call
|
| 544 |
-
)
|
| 545 |
-
|
| 546 |
-
generated_text = ""
|
| 547 |
-
if (gemini_api_response.candidates and
|
| 548 |
-
len(gemini_api_response.candidates) > 0 and
|
| 549 |
-
gemini_api_response.candidates[0].content and
|
| 550 |
-
gemini_api_response.candidates[0].content.parts and
|
| 551 |
-
len(gemini_api_response.candidates[0].content.parts) > 0):
|
| 552 |
-
generated_text = gemini_api_response.text.strip()
|
| 553 |
-
else:
|
| 554 |
-
block_reason_info = "Reason unknown."
|
| 555 |
-
finish_reason_info = "Finish reason unknown."
|
| 556 |
-
if gemini_api_response.prompt_feedback and gemini_api_response.prompt_feedback.block_reason:
|
| 557 |
-
block_reason_info = f"Prompt blocked due to: {gemini_api_response.prompt_feedback.block_reason.name}"
|
| 558 |
-
if gemini_api_response.prompt_feedback.block_reason_message:
|
| 559 |
-
block_reason_info += f" (Message: {gemini_api_response.prompt_feedback.block_reason_message})"
|
| 560 |
-
logger.error(f"Gemini main assistant: {block_reason_info}")
|
| 561 |
-
if gemini_api_response.candidates and len(gemini_api_response.candidates) > 0:
|
| 562 |
-
candidate = gemini_api_response.candidates[0]
|
| 563 |
-
finish_reason_info = f"Finish reason: {candidate.finish_reason.name}"
|
| 564 |
-
logger.error(f"Gemini main assistant: {finish_reason_info}")
|
| 565 |
-
if candidate.safety_ratings:
|
| 566 |
-
for rating in candidate.safety_ratings:
|
| 567 |
-
logger.error(f" Safety Rating: Category={rating.category.name}, Probability={rating.probability.name}")
|
| 568 |
-
user_message = (f"I apologize, but I encountered an issue generating a response. "
|
| 569 |
-
f"This might be due to content safety filters. ({finish_reason_info}) " # Simplified for user
|
| 570 |
-
f"Please try rephrasing your question.")
|
| 571 |
-
return user_message
|
| 572 |
-
|
| 573 |
-
verified_response = paper_db.verify_and_correct_response(generated_text)
|
| 574 |
-
final_response = judge_agent.judge_response(verified_response, question)
|
| 575 |
-
return final_response
|
| 576 |
-
|
| 577 |
-
except Exception as e:
|
| 578 |
-
logger.error(f"Error with Gemini in main assistant: {e}", exc_info=True)
|
| 579 |
-
if "finish_reason is 2" in str(e) or "SAFETY" in str(e).upper() or "finish_reason: SAFETY" in str(e):
|
| 580 |
-
return "I apologize, but my response generation was blocked. This might be due to content safety filters. Please try rephrasing your question."
|
| 581 |
-
return "I apologize, but I'm having trouble generating a response right now. Could you please try again?"
|
| 582 |
-
else:
|
| 583 |
-
return "I'm currently running in limited mode without access to a high-quality language model. To get the best responses, please add a Google API key to the Space settings."
|
| 584 |
|
| 585 |
-
|
| 586 |
-
|
| 587 |
-
|
| 588 |
-
if not session_info['allowed']: return session_info['reason']
|
| 589 |
|
| 590 |
-
|
| 591 |
-
greeting_responses = [
|
| 592 |
-
"Hello! I'm here to help you learn about David Van Dijcke, an econometrician on the 2025-26 job market. He develops cutting-edge methods for functional and high-dimensional data. What would you like to know about his research?",
|
| 593 |
-
"Hi! Welcome to David Van Dijcke's research assistant. David is an econometrician who combines functional data analysis, optimal transport, and geometric measure theory to develop new causal inference methods. How can I help you learn about his work?",
|
| 594 |
-
"Hello! I can tell you about David Van Dijcke's econometric research, including his job market paper on distribution-valued treatment effects and his collaborative work with researchers like Florian Gunsilius and Austin Wright. What aspect of his work interests you?",
|
| 595 |
-
]
|
| 596 |
-
response_index = int(hashlib.md5(message.encode()).hexdigest(), 16) % len(greeting_responses)
|
| 597 |
-
return greeting_responses[response_index]
|
| 598 |
|
| 599 |
-
|
| 600 |
-
|
| 601 |
-
|
| 602 |
-
|
| 603 |
-
|
| 604 |
-
|
| 605 |
-
|
| 606 |
-
|
| 607 |
-
|
| 608 |
-
|
| 609 |
-
|
| 610 |
-
|
| 611 |
-
|
| 612 |
-
|
| 613 |
-
|
| 614 |
-
|
| 615 |
-
|
| 616 |
-
|
| 617 |
-
|
| 618 |
-
|
| 619 |
-
|
| 620 |
-
|
| 621 |
-
|
| 622 |
-
|
| 623 |
-
|
| 624 |
-
|
| 625 |
-
|
| 626 |
-
|
| 627 |
-
|
| 628 |
-
|
| 629 |
-
|
| 630 |
-
|
| 631 |
-
|
| 632 |
-
|
| 633 |
-
|
| 634 |
-
|
| 635 |
-
|
| 636 |
-
|
| 637 |
-
|
| 638 |
-
|
| 639 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 640 |
|
| 641 |
-
demo = gr.ChatInterface(
|
| 642 |
-
fn=chat_function,
|
| 643 |
-
title="David Van Dijcke - Econometrician | Job Market 2025-26",
|
| 644 |
-
description=("Welcome! I'm an AI assistant specializing in David Van Dijcke's econometric research. "
|
| 645 |
-
"David develops novel econometric methods for functional and high-dimensional data. Ask me about his job market paper (R3D), "
|
| 646 |
-
"the novel aspects of his research, or his collaborative research projects."),
|
| 647 |
-
examples=["Hello! Who is David Van Dijcke?", "What econometric methods has David developed?",
|
| 648 |
-
"Tell me about his job market paper", "Tell me about the Return to Office paper", "Who are David's coauthors?"],
|
| 649 |
-
theme=gr.themes.Soft(primary_hue="blue", secondary_hue="gray", neutral_hue="gray", font=gr.themes.GoogleFont("Inter")),
|
| 650 |
-
css=custom_css, retry_btn="Retry", undo_btn="Undo", clear_btn="Clear Chat", submit_btn="Send", autofocus=True
|
| 651 |
-
)
|
| 652 |
return demo
|
| 653 |
|
| 654 |
if __name__ == "__main__":
|
| 655 |
-
|
| 656 |
-
|
| 657 |
-
|
| 658 |
-
|
| 659 |
-
|
| 660 |
-
|
| 661 |
-
os.makedirs("vector_store_cache", exist_ok=True)
|
| 662 |
-
except Exception as e:
|
| 663 |
-
logger.warning(f"Could not create cache directory (non-critical): {e}")
|
| 664 |
-
|
| 665 |
-
demo = create_gradio_interface()
|
| 666 |
-
demo.launch(share=False, server_name="0.0.0.0", server_port=7860, show_error=True)
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Professional Research Assistant
|
| 4 |
+
Clean chat interface with expert responses
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
import os
|
| 8 |
+
from typing import List, Tuple
|
| 9 |
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from dotenv import load_dotenv
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
import google.generativeai as genai
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
+
# Load environment variables
|
| 15 |
+
load_dotenv()
|
|
|
|
|
|
|
| 16 |
|
| 17 |
+
class ProfessionalAssistant:
|
| 18 |
+
"""Professional assistant that speaks as an expert about David's work"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
+
def __init__(self):
|
| 21 |
+
# Setup Gemini
|
| 22 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 23 |
+
if api_key:
|
| 24 |
+
genai.configure(api_key=api_key)
|
| 25 |
+
try:
|
| 26 |
+
self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 27 |
+
print("Using Gemini 2.5 Flash Preview")
|
| 28 |
+
except:
|
| 29 |
+
self.model = genai.GenerativeModel('gemini-1.5-flash')
|
| 30 |
+
print("Using Gemini 1.5 Flash")
|
| 31 |
+
else:
|
| 32 |
+
self.model = None
|
| 33 |
+
|
| 34 |
+
# Load all papers
|
| 35 |
+
self.papers = self._load_all_papers()
|
| 36 |
+
|
| 37 |
+
# Pre-load context
|
| 38 |
+
self.context = self._create_context()
|
| 39 |
+
|
| 40 |
+
# Question counter
|
| 41 |
+
self.question_count = 0
|
| 42 |
+
self.question_limit = 15
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
|
| 44 |
+
def _load_all_papers(self) -> dict:
|
| 45 |
+
"""Load all papers completely"""
|
| 46 |
+
papers = {}
|
| 47 |
+
pdf_dir = "documents"
|
| 48 |
+
|
| 49 |
+
paper_files = {
|
| 50 |
+
"r3d": ("r3d_arxiv_4apr2025.pdf", "R3D (Job Market Paper)"),
|
| 51 |
+
"cv": ("CV_DavidVanDijcke.pdf", "CV"),
|
| 52 |
+
"fdr": ("fdr.pdf", "Free Discontinuity Regression"),
|
| 53 |
+
"disco": ("disco.pdf", "Distributional Synthetic Controls"),
|
| 54 |
+
"rto": ("rto.pdf", "Return to Office"),
|
| 55 |
+
"prodf": ("prodf.pdf", "Revenue Production Functions"),
|
| 56 |
+
"unmasking": ("unmasking_partisanship.pdf", "Unmasking Partisanship"),
|
| 57 |
+
"ukraine": ("van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Ukraine Alerts")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
}
|
| 59 |
+
|
| 60 |
+
for key, (filename, title) in paper_files.items():
|
| 61 |
+
pdf_path = os.path.join(pdf_dir, filename)
|
| 62 |
+
if os.path.exists(pdf_path):
|
| 63 |
+
try:
|
| 64 |
+
loader = PyPDFLoader(pdf_path)
|
| 65 |
+
pages = loader.load()
|
| 66 |
+
text = "\n\n".join([p.page_content for p in pages])
|
| 67 |
+
papers[key] = {
|
| 68 |
+
"text": text,
|
| 69 |
+
"title": title,
|
| 70 |
+
"pages": len(pages)
|
| 71 |
+
}
|
| 72 |
+
print(f"Loaded {title}: {len(pages)} pages")
|
| 73 |
+
except Exception as e:
|
| 74 |
+
print(f"Error loading {filename}: {e}")
|
| 75 |
+
|
| 76 |
+
return papers
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 77 |
|
| 78 |
+
def _create_context(self) -> str:
|
| 79 |
+
"""Create comprehensive context from all papers"""
|
| 80 |
+
context_parts = []
|
| 81 |
+
|
| 82 |
+
# Add papers in priority order
|
| 83 |
+
priority_order = ["r3d", "cv", "fdr", "disco", "rto", "prodf"]
|
| 84 |
+
|
| 85 |
+
for key in priority_order:
|
| 86 |
+
if key in self.papers:
|
| 87 |
+
paper = self.papers[key]
|
| 88 |
+
# Add substantial excerpts
|
| 89 |
+
excerpt_length = 30000 if key == "r3d" else 15000
|
| 90 |
+
context_parts.append(f"\n[{paper['title']}]\n{paper['text'][:excerpt_length]}")
|
| 91 |
+
|
| 92 |
+
return "\n\n".join(context_parts)
|
| 93 |
|
| 94 |
+
def chat(self, message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
|
| 95 |
+
"""Chat with proper history handling"""
|
| 96 |
+
if not message.strip():
|
| 97 |
+
return "", history
|
| 98 |
|
| 99 |
+
# Check question limit
|
| 100 |
+
if self.question_count >= self.question_limit:
|
| 101 |
+
response = "I've reached the question limit for this session (15 questions). Please refresh the page to start a new conversation."
|
| 102 |
+
history.append((message, response))
|
| 103 |
+
return "", history
|
| 104 |
+
|
| 105 |
+
if not self.model:
|
| 106 |
+
response = "I need a Google API key to provide detailed answers about David's research."
|
| 107 |
+
history.append((message, response))
|
| 108 |
+
return "", history
|
| 109 |
+
|
| 110 |
+
# Build conversation context
|
| 111 |
+
conversation = "Previous conversation:\n"
|
| 112 |
+
for human, assistant in history[-3:]: # Last 3 exchanges
|
| 113 |
+
conversation += f"User: {human}\nAssistant: {assistant}\n\n"
|
| 114 |
+
|
| 115 |
+
# Determine which papers to emphasize based on query
|
| 116 |
+
message_lower = message.lower()
|
| 117 |
+
specific_context = ""
|
| 118 |
+
|
| 119 |
+
if "job market" in message_lower or "r3d" in message_lower:
|
| 120 |
+
if "r3d" in self.papers:
|
| 121 |
+
specific_context = f"\n[R3D - Job Market Paper]\n{self.papers['r3d']['text'][:50000]}\n"
|
| 122 |
+
elif "fdr" in message_lower or "discontinuity" in message_lower:
|
| 123 |
+
if "fdr" in self.papers:
|
| 124 |
+
specific_context = f"\n[FDR Paper]\n{self.papers['fdr']['text'][:30000]}\n"
|
| 125 |
+
|
| 126 |
+
# Create prompt
|
| 127 |
+
prompt = f"""You are an expert assistant helping visitors learn about David Van Dijcke's research.
|
| 128 |
|
| 129 |
+
CRITICAL INSTRUCTIONS:
|
| 130 |
+
- You are NOT David - you are an expert explaining his work to website visitors
|
| 131 |
+
- Speak in third person about David (use "David" or "Van Dijcke", not "I" or "my")
|
| 132 |
+
- Be conversational but professional
|
| 133 |
+
- Give concise, informative answers (2-3 paragraphs max unless asked for details)
|
| 134 |
+
- Don't say "based on the provided papers" - just state facts confidently
|
| 135 |
+
- Focus on what makes his work innovative and important
|
| 136 |
|
| 137 |
+
Key facts:
|
| 138 |
+
- David is an econometrician on the 2025-26 job market from University of Michigan
|
| 139 |
+
- His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
|
| 140 |
+
- He specializes in functional data analysis and optimal transport methods
|
|
|
|
| 141 |
|
| 142 |
+
{conversation}
|
|
|
|
| 143 |
|
| 144 |
+
Full research context:
|
| 145 |
+
{self.context}
|
| 146 |
|
| 147 |
+
{specific_context}
|
|
|
|
| 148 |
|
| 149 |
+
Current question: {message}
|
| 150 |
|
| 151 |
+
Provide a concise, expert response:"""
|
| 152 |
+
|
| 153 |
try:
|
| 154 |
+
response = self.model.generate_content(prompt)
|
| 155 |
+
answer = response.text
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 156 |
|
| 157 |
+
# Increment question counter
|
| 158 |
+
self.question_count += 1
|
|
|
|
|
|
|
|
|
|
| 159 |
|
| 160 |
+
# Add remaining questions info if getting close to limit
|
| 161 |
+
remaining = self.question_limit - self.question_count
|
| 162 |
+
if remaining <= 3 and remaining > 0:
|
| 163 |
+
answer += f"\n\n*({remaining} questions remaining in this session)*"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 164 |
|
| 165 |
+
history.append((message, answer))
|
| 166 |
+
return "", history
|
| 167 |
except Exception as e:
|
| 168 |
+
error_response = f"I encountered an error. Please try rephrasing your question."
|
| 169 |
+
history.append((message, error_response))
|
| 170 |
+
return "", history
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 171 |
|
| 172 |
+
# Create interface
|
| 173 |
+
def create_interface():
|
| 174 |
+
assistant = ProfessionalAssistant()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 175 |
|
| 176 |
+
# Custom CSS for a clean look
|
| 177 |
+
custom_css = """
|
| 178 |
+
.gradio-container {
|
| 179 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
|
| 180 |
+
max-width: 900px;
|
| 181 |
+
margin: auto;
|
| 182 |
+
}
|
| 183 |
+
.chatbot {
|
| 184 |
+
height: 500px !important;
|
| 185 |
+
}
|
| 186 |
+
.message {
|
| 187 |
+
font-size: 15px !important;
|
| 188 |
+
line-height: 1.6 !important;
|
| 189 |
+
}
|
| 190 |
+
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 191 |
|
| 192 |
+
with gr.Blocks(title="David Van Dijcke | Research Assistant", css=custom_css) as demo:
|
| 193 |
+
gr.Markdown("""
|
| 194 |
+
## David Van Dijcke - Research Assistant
|
|
|
|
| 195 |
|
| 196 |
+
Welcome! I can help you learn about David Van Dijcke's econometric research. David is on the 2025-26 academic job market.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
|
| 198 |
+
**Job Market Paper:** R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
|
| 199 |
+
|
| 200 |
+
*Note: This session allows up to 15 questions. Refresh the page to start a new session.*
|
| 201 |
+
""")
|
| 202 |
+
|
| 203 |
+
chatbot = gr.Chatbot(
|
| 204 |
+
value=[],
|
| 205 |
+
elem_classes=["chatbot"],
|
| 206 |
+
bubble_full_width=False,
|
| 207 |
+
avatar_images=(None, None),
|
| 208 |
+
show_label=False
|
| 209 |
+
)
|
| 210 |
+
|
| 211 |
+
with gr.Row():
|
| 212 |
+
msg = gr.Textbox(
|
| 213 |
+
show_label=False,
|
| 214 |
+
placeholder="Ask about David's research, methods, or papers...",
|
| 215 |
+
elem_classes=["message-input"],
|
| 216 |
+
scale=4
|
| 217 |
+
)
|
| 218 |
+
submit = gr.Button("Send", scale=1, variant="primary")
|
| 219 |
+
|
| 220 |
+
# Clear button
|
| 221 |
+
clear = gr.Button("Clear conversation", size="sm")
|
| 222 |
+
|
| 223 |
+
# Examples in a nice layout
|
| 224 |
+
gr.Examples(
|
| 225 |
+
examples=[
|
| 226 |
+
"What is David's job market paper about?",
|
| 227 |
+
"What makes R3D innovative?",
|
| 228 |
+
"What are the practical applications of R3D?",
|
| 229 |
+
"Tell me about David's other research besides R3D",
|
| 230 |
+
"What makes David a strong candidate for an econometrics position?"
|
| 231 |
+
],
|
| 232 |
+
inputs=msg,
|
| 233 |
+
label="Example questions:"
|
| 234 |
+
)
|
| 235 |
+
|
| 236 |
+
# Event handlers
|
| 237 |
+
msg.submit(assistant.chat, [msg, chatbot], [msg, chatbot])
|
| 238 |
+
submit.click(assistant.chat, [msg, chatbot], [msg, chatbot])
|
| 239 |
+
clear.click(lambda: [], None, chatbot, queue=False)
|
| 240 |
+
|
| 241 |
+
gr.Markdown("""
|
| 242 |
+
---
|
| 243 |
+
*This assistant has access to David's complete research portfolio including published papers, working papers, and CV.*
|
| 244 |
+
""")
|
| 245 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 246 |
return demo
|
| 247 |
|
| 248 |
if __name__ == "__main__":
|
| 249 |
+
interface = create_interface()
|
| 250 |
+
interface.launch(
|
| 251 |
+
server_name="127.0.0.1",
|
| 252 |
+
server_port=7860,
|
| 253 |
+
show_error=True
|
| 254 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
app_enhanced.py
ADDED
|
@@ -0,0 +1,599 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Enhanced Research Assistant
|
| 4 |
+
Improved version with better context handling, caching, and responses
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import json
|
| 9 |
+
import hashlib
|
| 10 |
+
from typing import List, Dict, Optional, Tuple
|
| 11 |
+
import gradio as gr
|
| 12 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 13 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 14 |
+
from langchain_community.vectorstores import FAISS
|
| 15 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 16 |
+
from langchain.schema import Document
|
| 17 |
+
from dotenv import load_dotenv
|
| 18 |
+
import google.generativeai as genai
|
| 19 |
+
|
| 20 |
+
# Load environment variables
|
| 21 |
+
load_dotenv()
|
| 22 |
+
|
| 23 |
+
class EnhancedResearchAssistant:
|
| 24 |
+
"""Enhanced assistant with better performance and accuracy"""
|
| 25 |
+
|
| 26 |
+
def __init__(self):
|
| 27 |
+
"""Initialize with enhanced features"""
|
| 28 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 29 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 30 |
+
)
|
| 31 |
+
|
| 32 |
+
# Load papers with caching
|
| 33 |
+
self.papers = self._load_papers_cached()
|
| 34 |
+
|
| 35 |
+
# Create vector stores
|
| 36 |
+
self.vector_store = self._create_vector_store()
|
| 37 |
+
|
| 38 |
+
# Setup LLM
|
| 39 |
+
self.llm = self._setup_llm()
|
| 40 |
+
|
| 41 |
+
# Initialize response cache
|
| 42 |
+
self.response_cache = {}
|
| 43 |
+
|
| 44 |
+
# Pre-compute common contexts
|
| 45 |
+
self.precomputed_contexts = self._precompute_contexts()
|
| 46 |
+
|
| 47 |
+
def _load_papers_cached(self) -> Dict[str, Dict]:
|
| 48 |
+
"""Load papers with caching to speed up startup"""
|
| 49 |
+
cache_file = "papers_metadata_cache.json"
|
| 50 |
+
|
| 51 |
+
# Try to load from cache
|
| 52 |
+
if os.path.exists(cache_file):
|
| 53 |
+
try:
|
| 54 |
+
with open(cache_file, 'r') as f:
|
| 55 |
+
print("Loading papers from cache...")
|
| 56 |
+
return json.load(f)
|
| 57 |
+
except:
|
| 58 |
+
pass
|
| 59 |
+
|
| 60 |
+
# Load papers fresh
|
| 61 |
+
papers = self._load_papers()
|
| 62 |
+
|
| 63 |
+
# Save to cache (excluding full text for size)
|
| 64 |
+
cache_data = {}
|
| 65 |
+
for key, paper in papers.items():
|
| 66 |
+
cache_data[key] = {
|
| 67 |
+
k: v for k, v in paper.items()
|
| 68 |
+
if k != "text" or len(v) < 1000 # Only cache short texts
|
| 69 |
+
}
|
| 70 |
+
|
| 71 |
+
try:
|
| 72 |
+
with open(cache_file, 'w') as f:
|
| 73 |
+
json.dump(cache_data, f)
|
| 74 |
+
except:
|
| 75 |
+
pass
|
| 76 |
+
|
| 77 |
+
return papers
|
| 78 |
+
|
| 79 |
+
def _load_papers(self) -> Dict[str, Dict]:
|
| 80 |
+
"""Load all papers with enhanced metadata"""
|
| 81 |
+
papers = {}
|
| 82 |
+
pdf_dir = "documents"
|
| 83 |
+
|
| 84 |
+
paper_metadata = {
|
| 85 |
+
"r3d": {
|
| 86 |
+
"file": "r3d_arxiv_4apr2025.pdf",
|
| 87 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 88 |
+
"type": "JOB MARKET PAPER",
|
| 89 |
+
"year": 2025,
|
| 90 |
+
"coauthors": [],
|
| 91 |
+
"abstract_keywords": ["regression discontinuity", "distribution", "optimal transport", "wasserstein", "functional data"],
|
| 92 |
+
"description": "Extends RDD to analyze entire outcome distributions using optimal transport theory"
|
| 93 |
+
},
|
| 94 |
+
"fdr": {
|
| 95 |
+
"file": "fdr.pdf",
|
| 96 |
+
"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
|
| 97 |
+
"type": "Working Paper",
|
| 98 |
+
"year": 2024,
|
| 99 |
+
"coauthors": [],
|
| 100 |
+
"abstract_keywords": ["free discontinuity", "internet shutdowns", "geometric measure theory", "non-parametric"],
|
| 101 |
+
"description": "Novel econometric method for estimating regression functions with unknown discontinuity locations"
|
| 102 |
+
},
|
| 103 |
+
"disco": {
|
| 104 |
+
"file": "disco.pdf",
|
| 105 |
+
"title": "disco: Distributional Synthetic Controls",
|
| 106 |
+
"type": "Working Paper",
|
| 107 |
+
"year": 2025,
|
| 108 |
+
"coauthors": ["Florian Gunsilius"],
|
| 109 |
+
"abstract_keywords": ["synthetic controls", "distribution", "stata package", "causal inference"],
|
| 110 |
+
"description": "Stata package implementing distributional synthetic control methods"
|
| 111 |
+
},
|
| 112 |
+
"rto": {
|
| 113 |
+
"file": "rto.pdf",
|
| 114 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 115 |
+
"type": "Working Paper",
|
| 116 |
+
"year": 2025,
|
| 117 |
+
"coauthors": ["Florian Gunsilius", "Austin Wright"],
|
| 118 |
+
"abstract_keywords": ["return to office", "tenure", "covid", "remote work", "labor"],
|
| 119 |
+
"description": "Analyzes distributional impacts of return-to-office mandates on employee tenure"
|
| 120 |
+
},
|
| 121 |
+
"prodf": {
|
| 122 |
+
"file": "prodf.pdf",
|
| 123 |
+
"title": "On the Non-Identification of Revenue Production Functions",
|
| 124 |
+
"type": "Working Paper",
|
| 125 |
+
"year": 2023,
|
| 126 |
+
"coauthors": [],
|
| 127 |
+
"abstract_keywords": ["production functions", "identification", "revenue", "productivity"],
|
| 128 |
+
"description": "Proves non-identification of production functions when using revenue as output proxy"
|
| 129 |
+
},
|
| 130 |
+
"unmasking": {
|
| 131 |
+
"file": "unmasking_partisanship.pdf",
|
| 132 |
+
"title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
|
| 133 |
+
"type": "Published Paper",
|
| 134 |
+
"year": 2021,
|
| 135 |
+
"journal": "Journal of Public Economics",
|
| 136 |
+
"coauthors": ["Anton Ivanov", "Kecht Florian", "Marco Giani", "Luke Taylor"],
|
| 137 |
+
"abstract_keywords": ["masks", "covid", "partisanship", "polarization", "public health"],
|
| 138 |
+
"description": "Shows how political polarization undermined mask-wearing during COVID-19"
|
| 139 |
+
},
|
| 140 |
+
"ukraine": {
|
| 141 |
+
"file": "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf",
|
| 142 |
+
"title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
|
| 143 |
+
"type": "Published Paper",
|
| 144 |
+
"year": 2023,
|
| 145 |
+
"journal": "Science Advances",
|
| 146 |
+
"coauthors": ["Yuri Zhukov", "others"],
|
| 147 |
+
"abstract_keywords": ["ukraine", "war", "alerts", "public safety", "mobile data"],
|
| 148 |
+
"description": "Demonstrates effectiveness of air raid alerts in saving lives during Ukraine invasion"
|
| 149 |
+
},
|
| 150 |
+
"cv": {
|
| 151 |
+
"file": "CV_DavidVanDijcke.pdf",
|
| 152 |
+
"title": "Curriculum Vitae",
|
| 153 |
+
"type": "CV",
|
| 154 |
+
"year": 2025,
|
| 155 |
+
"description": "David Van Dijcke's academic CV"
|
| 156 |
+
}
|
| 157 |
+
}
|
| 158 |
+
|
| 159 |
+
for key, metadata in paper_metadata.items():
|
| 160 |
+
pdf_path = os.path.join(pdf_dir, metadata["file"])
|
| 161 |
+
if os.path.exists(pdf_path):
|
| 162 |
+
try:
|
| 163 |
+
loader = PyPDFLoader(pdf_path)
|
| 164 |
+
pages = loader.load()
|
| 165 |
+
|
| 166 |
+
# Extract full text
|
| 167 |
+
full_text = "\n\n".join([p.page_content for p in pages])
|
| 168 |
+
|
| 169 |
+
# Extract abstract if possible
|
| 170 |
+
abstract = self._extract_abstract(full_text)
|
| 171 |
+
|
| 172 |
+
papers[key] = {
|
| 173 |
+
"text": full_text,
|
| 174 |
+
"abstract": abstract,
|
| 175 |
+
"pages": len(pages),
|
| 176 |
+
"filename": metadata["file"],
|
| 177 |
+
**metadata # Include all metadata
|
| 178 |
+
}
|
| 179 |
+
print(f"Loaded {metadata['title']}: {len(pages)} pages")
|
| 180 |
+
|
| 181 |
+
except Exception as e:
|
| 182 |
+
print(f"Error loading {metadata['file']}: {e}")
|
| 183 |
+
|
| 184 |
+
return papers
|
| 185 |
+
|
| 186 |
+
def _extract_abstract(self, text: str) -> str:
|
| 187 |
+
"""Extract abstract from paper text"""
|
| 188 |
+
text_lower = text.lower()
|
| 189 |
+
|
| 190 |
+
# Common abstract patterns
|
| 191 |
+
abstract_start_patterns = ["abstract\n", "abstract.", "abstract:", "summary\n"]
|
| 192 |
+
abstract_end_patterns = ["introduction", "keywords:", "jel codes:", "1 introduction", "1. introduction"]
|
| 193 |
+
|
| 194 |
+
for start_pattern in abstract_start_patterns:
|
| 195 |
+
if start_pattern in text_lower:
|
| 196 |
+
start_idx = text_lower.find(start_pattern) + len(start_pattern)
|
| 197 |
+
|
| 198 |
+
# Find end of abstract
|
| 199 |
+
end_idx = len(text)
|
| 200 |
+
for end_pattern in abstract_end_patterns:
|
| 201 |
+
if end_pattern in text_lower[start_idx:start_idx+3000]:
|
| 202 |
+
possible_end = text_lower.find(end_pattern, start_idx)
|
| 203 |
+
if possible_end > start_idx:
|
| 204 |
+
end_idx = min(end_idx, possible_end)
|
| 205 |
+
|
| 206 |
+
abstract = text[start_idx:end_idx].strip()
|
| 207 |
+
if 50 < len(abstract) < 2000: # Reasonable abstract length
|
| 208 |
+
return abstract
|
| 209 |
+
|
| 210 |
+
# Fallback: return first substantive paragraph
|
| 211 |
+
paragraphs = text.split('\n\n')
|
| 212 |
+
for para in paragraphs[1:10]: # Skip first (usually title)
|
| 213 |
+
if 100 < len(para) < 1000:
|
| 214 |
+
return para.strip()
|
| 215 |
+
|
| 216 |
+
return ""
|
| 217 |
+
|
| 218 |
+
def _create_vector_store(self) -> Optional[FAISS]:
|
| 219 |
+
"""Create vector store with multiple granularities"""
|
| 220 |
+
try:
|
| 221 |
+
documents = []
|
| 222 |
+
|
| 223 |
+
# Create different chunk sizes for different purposes
|
| 224 |
+
text_splitters = {
|
| 225 |
+
"small": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
|
| 226 |
+
"medium": RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150),
|
| 227 |
+
"large": RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
|
| 228 |
+
}
|
| 229 |
+
|
| 230 |
+
for key, paper in self.papers.items():
|
| 231 |
+
# Add abstract as its own document
|
| 232 |
+
if paper.get("abstract"):
|
| 233 |
+
doc = Document(
|
| 234 |
+
page_content=f"{paper['title']}\n\nAbstract: {paper['abstract']}",
|
| 235 |
+
metadata={"source": key, "type": "abstract", "title": paper['title']}
|
| 236 |
+
)
|
| 237 |
+
documents.append(doc)
|
| 238 |
+
|
| 239 |
+
# Add chunks of different sizes
|
| 240 |
+
for size_name, splitter in text_splitters.items():
|
| 241 |
+
chunks = splitter.split_text(paper["text"])
|
| 242 |
+
|
| 243 |
+
for i, chunk in enumerate(chunks[:20]): # Limit chunks per paper
|
| 244 |
+
doc = Document(
|
| 245 |
+
page_content=chunk,
|
| 246 |
+
metadata={
|
| 247 |
+
"source": key,
|
| 248 |
+
"type": f"chunk_{size_name}",
|
| 249 |
+
"chunk": i,
|
| 250 |
+
"title": paper['title']
|
| 251 |
+
}
|
| 252 |
+
)
|
| 253 |
+
documents.append(doc)
|
| 254 |
+
|
| 255 |
+
if documents:
|
| 256 |
+
return FAISS.from_documents(documents, self.embeddings)
|
| 257 |
+
|
| 258 |
+
except Exception as e:
|
| 259 |
+
print(f"Error creating vector store: {e}")
|
| 260 |
+
|
| 261 |
+
return None
|
| 262 |
+
|
| 263 |
+
def _setup_llm(self):
|
| 264 |
+
"""Setup Gemini LLM"""
|
| 265 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 266 |
+
|
| 267 |
+
if api_key:
|
| 268 |
+
try:
|
| 269 |
+
genai.configure(api_key=api_key)
|
| 270 |
+
# Try to use best available model
|
| 271 |
+
try:
|
| 272 |
+
return genai.GenerativeModel('gemini-1.5-pro')
|
| 273 |
+
except:
|
| 274 |
+
return genai.GenerativeModel('gemini-1.5-flash')
|
| 275 |
+
except Exception as e:
|
| 276 |
+
print(f"Error setting up Gemini: {e}")
|
| 277 |
+
|
| 278 |
+
return None
|
| 279 |
+
|
| 280 |
+
def _precompute_contexts(self) -> Dict[str, str]:
|
| 281 |
+
"""Precompute contexts for common queries"""
|
| 282 |
+
contexts = {}
|
| 283 |
+
|
| 284 |
+
# Job market paper context
|
| 285 |
+
if "r3d" in self.papers:
|
| 286 |
+
r3d = self.papers["r3d"]
|
| 287 |
+
contexts["job_market"] = f"""[JOB MARKET PAPER - R3D]
|
| 288 |
+
|
| 289 |
+
Title: {r3d['title']}
|
| 290 |
+
|
| 291 |
+
Abstract: {r3d.get('abstract', 'See paper for abstract')}
|
| 292 |
+
|
| 293 |
+
Key Contributions:
|
| 294 |
+
1. Extends RDD to distribution-valued outcomes
|
| 295 |
+
2. Uses optimal transport theory and Wasserstein distances
|
| 296 |
+
3. Develops new identification and estimation procedures
|
| 297 |
+
4. Applications to income distributions, test scores, etc.
|
| 298 |
+
|
| 299 |
+
This paper addresses the limitation that traditional RDD only examines average effects, enabling analysis of entire outcome distributions."""
|
| 300 |
+
|
| 301 |
+
# Overview context
|
| 302 |
+
overview_parts = ["David Van Dijcke is an econometrician on the 2025-26 academic job market.\n\nPAPERS:"]
|
| 303 |
+
for key, paper in self.papers.items():
|
| 304 |
+
if key != "cv":
|
| 305 |
+
if paper['type'] == "JOB MARKET PAPER":
|
| 306 |
+
overview_parts.append(f"\n• {paper['type']}: {paper['title']}")
|
| 307 |
+
elif paper.get('journal'):
|
| 308 |
+
overview_parts.append(f"\n• {paper['journal']} ({paper['year']}): {paper['title']}")
|
| 309 |
+
else:
|
| 310 |
+
overview_parts.append(f"\n• {paper['type']} ({paper['year']}): {paper['title']}")
|
| 311 |
+
|
| 312 |
+
contexts["overview"] = "\n".join(overview_parts)
|
| 313 |
+
|
| 314 |
+
return contexts
|
| 315 |
+
|
| 316 |
+
def answer_question(self, query: str, chat_history: List = None) -> str:
|
| 317 |
+
"""Answer questions with enhanced context and caching"""
|
| 318 |
+
if not query.strip():
|
| 319 |
+
return "Please ask a question about David Van Dijcke's research."
|
| 320 |
+
|
| 321 |
+
# Check cache
|
| 322 |
+
query_hash = hashlib.md5(query.lower().encode()).hexdigest()
|
| 323 |
+
if query_hash in self.response_cache:
|
| 324 |
+
return self.response_cache[query_hash]
|
| 325 |
+
|
| 326 |
+
# Get context
|
| 327 |
+
context = self._get_smart_context(query)
|
| 328 |
+
|
| 329 |
+
# Generate response
|
| 330 |
+
if self.llm:
|
| 331 |
+
response = self._generate_llm_response(query, context)
|
| 332 |
+
else:
|
| 333 |
+
response = self._generate_fallback_response(query, context)
|
| 334 |
+
|
| 335 |
+
# Cache response
|
| 336 |
+
self.response_cache[query_hash] = response
|
| 337 |
+
|
| 338 |
+
return response
|
| 339 |
+
|
| 340 |
+
def _get_smart_context(self, query: str) -> str:
|
| 341 |
+
"""Get context with smart routing based on query type"""
|
| 342 |
+
query_lower = query.lower()
|
| 343 |
+
|
| 344 |
+
# Route to precomputed contexts
|
| 345 |
+
if any(phrase in query_lower for phrase in ["job market", "jmp"]):
|
| 346 |
+
return self.precomputed_contexts.get("job_market", "")
|
| 347 |
+
|
| 348 |
+
if any(phrase in query_lower for phrase in ["overview", "papers", "research", "what has david"]):
|
| 349 |
+
return self.precomputed_contexts.get("overview", "")
|
| 350 |
+
|
| 351 |
+
# Build custom context
|
| 352 |
+
contexts = []
|
| 353 |
+
|
| 354 |
+
# Add relevant papers based on keywords
|
| 355 |
+
paper_matches = self._match_papers_to_query(query_lower)
|
| 356 |
+
|
| 357 |
+
for paper_key in paper_matches[:3]: # Top 3 matches
|
| 358 |
+
if paper_key in self.papers:
|
| 359 |
+
paper = self.papers[paper_key]
|
| 360 |
+
|
| 361 |
+
# Create rich context
|
| 362 |
+
paper_context = f"[{paper['type']}: {paper['title']}]"
|
| 363 |
+
|
| 364 |
+
if paper.get('abstract'):
|
| 365 |
+
paper_context += f"\n\nAbstract: {paper['abstract']}"
|
| 366 |
+
|
| 367 |
+
if paper.get('coauthors'):
|
| 368 |
+
paper_context += f"\n\nCoauthors: {', '.join(paper['coauthors'])}"
|
| 369 |
+
|
| 370 |
+
# Add relevant text sections
|
| 371 |
+
relevant_sections = self._extract_relevant_sections(paper['text'], query_lower)
|
| 372 |
+
if relevant_sections:
|
| 373 |
+
paper_context += f"\n\nRelevant excerpts:\n{relevant_sections}"
|
| 374 |
+
|
| 375 |
+
contexts.append(paper_context)
|
| 376 |
+
|
| 377 |
+
# Add vector search results if needed
|
| 378 |
+
if not contexts and self.vector_store:
|
| 379 |
+
try:
|
| 380 |
+
docs = self.vector_store.similarity_search(query, k=5)
|
| 381 |
+
for doc in docs:
|
| 382 |
+
contexts.append(f"[From {doc.metadata['title']}]\n{doc.page_content}")
|
| 383 |
+
except:
|
| 384 |
+
pass
|
| 385 |
+
|
| 386 |
+
return "\n\n---\n\n".join(contexts[:3])
|
| 387 |
+
|
| 388 |
+
def _match_papers_to_query(self, query_lower: str) -> List[str]:
|
| 389 |
+
"""Match papers to query using keywords and scoring"""
|
| 390 |
+
scores = {}
|
| 391 |
+
|
| 392 |
+
for key, paper in self.papers.items():
|
| 393 |
+
if key == "cv":
|
| 394 |
+
continue
|
| 395 |
+
|
| 396 |
+
score = 0
|
| 397 |
+
|
| 398 |
+
# Check title
|
| 399 |
+
title_lower = paper['title'].lower()
|
| 400 |
+
title_words = set(title_lower.split())
|
| 401 |
+
query_words = set(query_lower.split())
|
| 402 |
+
|
| 403 |
+
# Word overlap
|
| 404 |
+
overlap = len(title_words.intersection(query_words))
|
| 405 |
+
score += overlap * 2
|
| 406 |
+
|
| 407 |
+
# Check keywords
|
| 408 |
+
for keyword in paper.get('abstract_keywords', []):
|
| 409 |
+
if keyword.lower() in query_lower:
|
| 410 |
+
score += 3
|
| 411 |
+
|
| 412 |
+
# Special cases
|
| 413 |
+
if key == "r3d" and any(term in query_lower for term in ["job market", "jmp", "main paper"]):
|
| 414 |
+
score += 10
|
| 415 |
+
|
| 416 |
+
# Check description
|
| 417 |
+
if paper.get('description'):
|
| 418 |
+
desc_words = set(paper['description'].lower().split())
|
| 419 |
+
desc_overlap = len(desc_words.intersection(query_words))
|
| 420 |
+
score += desc_overlap
|
| 421 |
+
|
| 422 |
+
if score > 0:
|
| 423 |
+
scores[key] = score
|
| 424 |
+
|
| 425 |
+
# Sort by score
|
| 426 |
+
sorted_papers = sorted(scores.items(), key=lambda x: x[1], reverse=True)
|
| 427 |
+
return [paper[0] for paper in sorted_papers]
|
| 428 |
+
|
| 429 |
+
def _extract_relevant_sections(self, text: str, query_lower: str, max_length: int = 2000) -> str:
|
| 430 |
+
"""Extract most relevant sections from paper text"""
|
| 431 |
+
# Split into paragraphs
|
| 432 |
+
paragraphs = text.split('\n\n')
|
| 433 |
+
|
| 434 |
+
# Score paragraphs
|
| 435 |
+
scored_paragraphs = []
|
| 436 |
+
query_words = set(query_lower.split())
|
| 437 |
+
|
| 438 |
+
for para in paragraphs:
|
| 439 |
+
if len(para) < 50: # Skip short paragraphs
|
| 440 |
+
continue
|
| 441 |
+
|
| 442 |
+
para_lower = para.lower()
|
| 443 |
+
para_words = set(para_lower.split())
|
| 444 |
+
|
| 445 |
+
# Calculate relevance score
|
| 446 |
+
score = len(query_words.intersection(para_words))
|
| 447 |
+
|
| 448 |
+
# Boost for specific sections
|
| 449 |
+
if any(header in para_lower[:50] for header in ["abstract", "introduction", "conclusion"]):
|
| 450 |
+
score += 2
|
| 451 |
+
|
| 452 |
+
if score > 0:
|
| 453 |
+
scored_paragraphs.append((score, para))
|
| 454 |
+
|
| 455 |
+
# Sort by score and take top paragraphs
|
| 456 |
+
scored_paragraphs.sort(key=lambda x: x[0], reverse=True)
|
| 457 |
+
|
| 458 |
+
relevant_text = []
|
| 459 |
+
total_length = 0
|
| 460 |
+
|
| 461 |
+
for score, para in scored_paragraphs[:5]:
|
| 462 |
+
if total_length + len(para) > max_length:
|
| 463 |
+
break
|
| 464 |
+
relevant_text.append(para)
|
| 465 |
+
total_length += len(para)
|
| 466 |
+
|
| 467 |
+
return "\n\n".join(relevant_text)
|
| 468 |
+
|
| 469 |
+
def _generate_llm_response(self, query: str, context: str) -> str:
|
| 470 |
+
"""Generate response using LLM with enhanced prompting"""
|
| 471 |
+
prompt = f"""You are an expert research assistant for David Van Dijcke, an econometrician on the 2025-26 academic job market.
|
| 472 |
+
|
| 473 |
+
Key facts about David:
|
| 474 |
+
- Job Market Paper: R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
|
| 475 |
+
- Specializes in: Functional data analysis, optimal transport, econometric theory
|
| 476 |
+
- From: University of Michigan
|
| 477 |
+
- Research interests: Econometrics, Industrial Organization, Political Economy
|
| 478 |
+
|
| 479 |
+
Context from David's papers:
|
| 480 |
+
{context}
|
| 481 |
+
|
| 482 |
+
Question: {query}
|
| 483 |
+
|
| 484 |
+
Instructions:
|
| 485 |
+
1. Answer based primarily on the provided context
|
| 486 |
+
2. Be specific and cite paper titles
|
| 487 |
+
3. For job market questions, emphasize R3D
|
| 488 |
+
4. Highlight David's unique contributions and methods
|
| 489 |
+
5. Keep responses concise but informative
|
| 490 |
+
|
| 491 |
+
Answer:"""
|
| 492 |
+
|
| 493 |
+
try:
|
| 494 |
+
response = self.llm.generate_content(prompt)
|
| 495 |
+
return response.text
|
| 496 |
+
except Exception as e:
|
| 497 |
+
print(f"Error generating response: {e}")
|
| 498 |
+
return self._generate_fallback_response(query, context)
|
| 499 |
+
|
| 500 |
+
def _generate_fallback_response(self, query: str, context: str) -> str:
|
| 501 |
+
"""Generate response without LLM"""
|
| 502 |
+
query_lower = query.lower()
|
| 503 |
+
|
| 504 |
+
# Enhanced fallback responses based on context
|
| 505 |
+
if "job market" in query_lower:
|
| 506 |
+
return self.precomputed_contexts.get("job_market", "David's job market paper is R3D.")
|
| 507 |
+
|
| 508 |
+
if any(term in query_lower for term in ["overview", "research", "papers"]):
|
| 509 |
+
return self.precomputed_contexts.get("overview", "David has multiple papers in econometrics.")
|
| 510 |
+
|
| 511 |
+
# Parse context for specific information
|
| 512 |
+
if context:
|
| 513 |
+
lines = context.split('\n')
|
| 514 |
+
for line in lines[:10]:
|
| 515 |
+
if "Abstract:" in line or "JOB MARKET" in line:
|
| 516 |
+
return f"Based on David's papers:\n\n{context[:1000]}..."
|
| 517 |
+
|
| 518 |
+
return "I can help with questions about David Van Dijcke's research. For best results, please ensure Google API key is configured."
|
| 519 |
+
|
| 520 |
+
# Create enhanced Gradio interface
|
| 521 |
+
def create_interface():
|
| 522 |
+
"""Create enhanced Gradio interface"""
|
| 523 |
+
assistant = EnhancedResearchAssistant()
|
| 524 |
+
|
| 525 |
+
def chat(message, history):
|
| 526 |
+
response = assistant.answer_question(message, history)
|
| 527 |
+
history.append([message, response])
|
| 528 |
+
return "", history
|
| 529 |
+
|
| 530 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant", theme=gr.themes.Soft()) as demo:
|
| 531 |
+
gr.Markdown("""
|
| 532 |
+
# David Van Dijcke - Enhanced Research Assistant
|
| 533 |
+
|
| 534 |
+
**Econometrician on the 2025-26 Job Market** | University of Michigan
|
| 535 |
+
|
| 536 |
+
Job Market Paper: **R3D - Regression Discontinuity Design with Distribution-Valued Outcomes**
|
| 537 |
+
""")
|
| 538 |
+
|
| 539 |
+
with gr.Row():
|
| 540 |
+
with gr.Column(scale=3):
|
| 541 |
+
chatbot = gr.Chatbot(height=500)
|
| 542 |
+
msg = gr.Textbox(
|
| 543 |
+
label="Ask about David's research",
|
| 544 |
+
placeholder="Examples: What is his job market paper about? What methods has he developed?",
|
| 545 |
+
lines=2
|
| 546 |
+
)
|
| 547 |
+
|
| 548 |
+
with gr.Row():
|
| 549 |
+
submit = gr.Button("Submit", variant="primary")
|
| 550 |
+
clear = gr.Button("Clear")
|
| 551 |
+
|
| 552 |
+
with gr.Column(scale=1):
|
| 553 |
+
gr.Markdown("### Quick Links")
|
| 554 |
+
gr.Markdown("""
|
| 555 |
+
**Papers:**
|
| 556 |
+
- R3D (Job Market Paper)
|
| 557 |
+
- Free Discontinuity Regression
|
| 558 |
+
- Distributional Synthetic Controls
|
| 559 |
+
- Return to Office
|
| 560 |
+
- Revenue Production Functions
|
| 561 |
+
|
| 562 |
+
**Try asking about:**
|
| 563 |
+
- Job market paper details
|
| 564 |
+
- Econometric methods
|
| 565 |
+
- Optimal transport applications
|
| 566 |
+
- Specific papers
|
| 567 |
+
- Research agenda
|
| 568 |
+
""")
|
| 569 |
+
|
| 570 |
+
# Examples
|
| 571 |
+
gr.Examples(
|
| 572 |
+
examples=[
|
| 573 |
+
"What is David's job market paper about?",
|
| 574 |
+
"Explain the R3D methodology in detail",
|
| 575 |
+
"What econometric methods has David developed?",
|
| 576 |
+
"How does David use optimal transport in his research?",
|
| 577 |
+
"What are the main contributions of the FDR paper?",
|
| 578 |
+
"Tell me about David's coauthors and collaborations",
|
| 579 |
+
"What makes David's research unique?",
|
| 580 |
+
"What are the policy implications of David's work?"
|
| 581 |
+
],
|
| 582 |
+
inputs=msg
|
| 583 |
+
)
|
| 584 |
+
|
| 585 |
+
# Event handlers
|
| 586 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 587 |
+
submit.click(chat, [msg, chatbot], [msg, chatbot])
|
| 588 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 589 |
+
|
| 590 |
+
return demo
|
| 591 |
+
|
| 592 |
+
if __name__ == "__main__":
|
| 593 |
+
interface = create_interface()
|
| 594 |
+
interface.launch(
|
| 595 |
+
server_name="127.0.0.1",
|
| 596 |
+
server_port=7860,
|
| 597 |
+
share=False,
|
| 598 |
+
quiet=True
|
| 599 |
+
)
|
app_final.py
ADDED
|
@@ -0,0 +1,246 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Final Research Assistant
|
| 4 |
+
Combines state-of-the-art LLM usage with stable Gradio interface
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Dict, Optional
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from pypdf import PdfReader
|
| 11 |
+
from dotenv import load_dotenv
|
| 12 |
+
import google.generativeai as genai
|
| 13 |
+
|
| 14 |
+
# Load environment variables
|
| 15 |
+
load_dotenv()
|
| 16 |
+
|
| 17 |
+
class FinalResearchAssistant:
|
| 18 |
+
"""State-of-the-art assistant with stable interface"""
|
| 19 |
+
|
| 20 |
+
def __init__(self):
|
| 21 |
+
"""Initialize with full context approach"""
|
| 22 |
+
# Setup Gemini 2.5
|
| 23 |
+
self.llm = self._setup_llm()
|
| 24 |
+
|
| 25 |
+
# Load all papers at once
|
| 26 |
+
self.papers_full_text = self._load_all_papers()
|
| 27 |
+
|
| 28 |
+
# Create mega context
|
| 29 |
+
self.mega_context = self._create_mega_context()
|
| 30 |
+
|
| 31 |
+
# Initialize chat session
|
| 32 |
+
self.chat = None
|
| 33 |
+
if self.llm:
|
| 34 |
+
self._initialize_chat()
|
| 35 |
+
|
| 36 |
+
def _setup_llm(self):
|
| 37 |
+
"""Setup Gemini 2.5 Flash"""
|
| 38 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 39 |
+
|
| 40 |
+
if not api_key:
|
| 41 |
+
print("No Google API key found")
|
| 42 |
+
return None
|
| 43 |
+
|
| 44 |
+
try:
|
| 45 |
+
genai.configure(api_key=api_key)
|
| 46 |
+
|
| 47 |
+
# Try Gemini 2.5 Flash Preview first
|
| 48 |
+
try:
|
| 49 |
+
model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 50 |
+
print("Using Gemini 2.5 Flash Preview")
|
| 51 |
+
return model
|
| 52 |
+
except:
|
| 53 |
+
# Fallback to stable version
|
| 54 |
+
model = genai.GenerativeModel('gemini-1.5-flash')
|
| 55 |
+
print("Using Gemini 1.5 Flash")
|
| 56 |
+
return model
|
| 57 |
+
|
| 58 |
+
except Exception as e:
|
| 59 |
+
print(f"Error setting up Gemini: {e}")
|
| 60 |
+
return None
|
| 61 |
+
|
| 62 |
+
def _load_all_papers(self) -> Dict[str, str]:
|
| 63 |
+
"""Load all papers completely"""
|
| 64 |
+
papers = {}
|
| 65 |
+
pdf_dir = "documents"
|
| 66 |
+
|
| 67 |
+
paper_files = [
|
| 68 |
+
("r3d", "r3d_arxiv_4apr2025.pdf", "JOB MARKET PAPER - R3D"),
|
| 69 |
+
("cv", "CV_DavidVanDijcke.pdf", "CURRICULUM VITAE"),
|
| 70 |
+
("fdr", "fdr.pdf", "Free Discontinuity Regression"),
|
| 71 |
+
("disco", "disco.pdf", "Distributional Synthetic Controls"),
|
| 72 |
+
("rto", "rto.pdf", "Return to Office"),
|
| 73 |
+
("prodf", "prodf.pdf", "Revenue Production Functions"),
|
| 74 |
+
]
|
| 75 |
+
|
| 76 |
+
for key, filename, title in paper_files:
|
| 77 |
+
pdf_path = os.path.join(pdf_dir, filename)
|
| 78 |
+
if os.path.exists(pdf_path):
|
| 79 |
+
try:
|
| 80 |
+
with open(pdf_path, 'rb') as file:
|
| 81 |
+
pdf_reader = PdfReader(file)
|
| 82 |
+
|
| 83 |
+
full_text = f"\n{'='*60}\n{title}\n{'='*60}\n\n"
|
| 84 |
+
|
| 85 |
+
for page_num, page in enumerate(pdf_reader.pages, 1):
|
| 86 |
+
text = page.extract_text()
|
| 87 |
+
if text.strip():
|
| 88 |
+
full_text += f"[Page {page_num}]\n{text}\n\n"
|
| 89 |
+
|
| 90 |
+
papers[key] = full_text
|
| 91 |
+
print(f"Loaded {title}: {len(full_text):,} chars")
|
| 92 |
+
|
| 93 |
+
except Exception as e:
|
| 94 |
+
print(f"Error loading {filename}: {e}")
|
| 95 |
+
|
| 96 |
+
return papers
|
| 97 |
+
|
| 98 |
+
def _create_mega_context(self) -> str:
|
| 99 |
+
"""Create single context with all papers"""
|
| 100 |
+
context = "DAVID VAN DIJCKE - COMPLETE RESEARCH PORTFOLIO\n\n"
|
| 101 |
+
|
| 102 |
+
for key, text in self.papers_full_text.items():
|
| 103 |
+
context += text + "\n\n"
|
| 104 |
+
|
| 105 |
+
print(f"Total context: {len(context):,} characters")
|
| 106 |
+
return context
|
| 107 |
+
|
| 108 |
+
def _initialize_chat(self):
|
| 109 |
+
"""Initialize chat with full context"""
|
| 110 |
+
try:
|
| 111 |
+
self.chat = self.llm.start_chat(history=[
|
| 112 |
+
{
|
| 113 |
+
"role": "user",
|
| 114 |
+
"parts": [f"""You are David Van Dijcke's research assistant. I'm giving you his complete research portfolio.
|
| 115 |
+
|
| 116 |
+
{self.mega_context}
|
| 117 |
+
|
| 118 |
+
Key facts:
|
| 119 |
+
- David is on the 2025-26 economics job market
|
| 120 |
+
- His JOB MARKET PAPER is R3D
|
| 121 |
+
- He's from University of Michigan
|
| 122 |
+
- He specializes in econometric methods
|
| 123 |
+
|
| 124 |
+
Please acknowledge you've loaded all papers."""]
|
| 125 |
+
},
|
| 126 |
+
{
|
| 127 |
+
"role": "model",
|
| 128 |
+
"parts": ["I've successfully loaded David Van Dijcke's complete research portfolio including his job market paper R3D, CV, and all other papers. I'm ready to answer any questions about his research, methods, or background."]
|
| 129 |
+
}
|
| 130 |
+
])
|
| 131 |
+
print("Chat initialized with full context")
|
| 132 |
+
except Exception as e:
|
| 133 |
+
print(f"Could not initialize chat: {e}")
|
| 134 |
+
self.chat = None
|
| 135 |
+
|
| 136 |
+
def answer_question(self, query: str) -> str:
|
| 137 |
+
"""Answer using full context"""
|
| 138 |
+
if not query.strip():
|
| 139 |
+
return "What would you like to know about David's research?"
|
| 140 |
+
|
| 141 |
+
if not self.llm:
|
| 142 |
+
return self._fallback_response(query)
|
| 143 |
+
|
| 144 |
+
try:
|
| 145 |
+
if self.chat:
|
| 146 |
+
# Use pre-loaded context
|
| 147 |
+
prompt = f"""Based on the papers I have loaded, please answer this question:
|
| 148 |
+
|
| 149 |
+
{query}
|
| 150 |
+
|
| 151 |
+
Remember to:
|
| 152 |
+
- Be conversational but accurate
|
| 153 |
+
- Reference specific papers when relevant
|
| 154 |
+
- For job market questions, focus on R3D
|
| 155 |
+
- Explain both intuition and technical details when appropriate"""
|
| 156 |
+
|
| 157 |
+
response = self.chat.send_message(prompt)
|
| 158 |
+
return response.text
|
| 159 |
+
else:
|
| 160 |
+
# Send everything in one request
|
| 161 |
+
prompt = f"""You are David Van Dijcke's research assistant. Based on his papers below, answer the question.
|
| 162 |
+
|
| 163 |
+
{self.mega_context}
|
| 164 |
+
|
| 165 |
+
Question: {query}
|
| 166 |
+
|
| 167 |
+
Be conversational, accurate, and highlight what makes David's work unique."""
|
| 168 |
+
|
| 169 |
+
response = self.llm.generate_content(prompt)
|
| 170 |
+
return response.text
|
| 171 |
+
|
| 172 |
+
except Exception as e:
|
| 173 |
+
print(f"Error: {e}")
|
| 174 |
+
return self._fallback_response(query)
|
| 175 |
+
|
| 176 |
+
def _fallback_response(self, query: str) -> str:
|
| 177 |
+
"""Fallback without API"""
|
| 178 |
+
query_lower = query.lower()
|
| 179 |
+
|
| 180 |
+
if "job market" in query_lower:
|
| 181 |
+
return """David's job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
|
| 182 |
+
|
| 183 |
+
It extends RDD to analyze entire outcome distributions using optimal transport theory. This allows researchers to see not just if a policy works on average, but WHO it works for - crucial for understanding distributional effects and inequality."""
|
| 184 |
+
|
| 185 |
+
if "david" in query_lower or "who" in query_lower:
|
| 186 |
+
return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He develops novel methods for functional and distributional data analysis, with applications to important policy questions."""
|
| 187 |
+
|
| 188 |
+
return "I can help with questions about David's research. Please add a Google API key for best results."
|
| 189 |
+
|
| 190 |
+
# Simple interface
|
| 191 |
+
def create_interface():
|
| 192 |
+
"""Create simple, stable interface"""
|
| 193 |
+
assistant = FinalResearchAssistant()
|
| 194 |
+
|
| 195 |
+
def chat(message, history):
|
| 196 |
+
response = assistant.answer_question(message)
|
| 197 |
+
history.append([message, response])
|
| 198 |
+
return "", history
|
| 199 |
+
|
| 200 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
|
| 201 |
+
gr.Markdown("""
|
| 202 |
+
# David Van Dijcke - Research Assistant
|
| 203 |
+
|
| 204 |
+
**Econometrician | 2025-26 Job Market | University of Michigan**
|
| 205 |
+
|
| 206 |
+
Job Market Paper: **R3D - Regression Discontinuity Design with Distribution-Valued Outcomes**
|
| 207 |
+
""")
|
| 208 |
+
|
| 209 |
+
chatbot = gr.Chatbot(height=450)
|
| 210 |
+
msg = gr.Textbox(
|
| 211 |
+
label="Ask about David's research",
|
| 212 |
+
placeholder="What is his job market paper about? What methods has he developed?",
|
| 213 |
+
lines=2
|
| 214 |
+
)
|
| 215 |
+
|
| 216 |
+
with gr.Row():
|
| 217 |
+
submit = gr.Button("Send", variant="primary")
|
| 218 |
+
clear = gr.Button("Clear")
|
| 219 |
+
|
| 220 |
+
gr.Examples(
|
| 221 |
+
examples=[
|
| 222 |
+
"What is David's job market paper about?",
|
| 223 |
+
"Explain R3D's methodology - both intuition and technical details",
|
| 224 |
+
"What real-world problems can R3D solve?",
|
| 225 |
+
"How does David use optimal transport in his research?",
|
| 226 |
+
"What makes David's research unique?",
|
| 227 |
+
"Tell me about his other papers besides R3D"
|
| 228 |
+
],
|
| 229 |
+
inputs=msg
|
| 230 |
+
)
|
| 231 |
+
|
| 232 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 233 |
+
submit.click(chat, [msg, chatbot], [msg, chatbot])
|
| 234 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 235 |
+
|
| 236 |
+
return demo
|
| 237 |
+
|
| 238 |
+
if __name__ == "__main__":
|
| 239 |
+
interface = create_interface()
|
| 240 |
+
# Use same launch config as stable version
|
| 241 |
+
interface.launch(
|
| 242 |
+
server_name="127.0.0.1",
|
| 243 |
+
server_port=7860,
|
| 244 |
+
share=False,
|
| 245 |
+
quiet=True
|
| 246 |
+
)
|
app_full_context.py
ADDED
|
@@ -0,0 +1,401 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Research Assistant with Full Paper Context
|
| 4 |
+
Loads complete papers and uses Gemini's large context window for comprehensive responses
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import time
|
| 9 |
+
from typing import List, Dict, Any, Optional
|
| 10 |
+
import gradio as gr
|
| 11 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 12 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 13 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 14 |
+
from langchain_community.vectorstores import FAISS
|
| 15 |
+
from langchain.schema import Document
|
| 16 |
+
from dotenv import load_dotenv
|
| 17 |
+
import google.generativeai as genai
|
| 18 |
+
|
| 19 |
+
# Load environment variables
|
| 20 |
+
load_dotenv()
|
| 21 |
+
|
| 22 |
+
class FullContextResearchAssistant:
|
| 23 |
+
"""Research assistant that loads full papers and uses large context windows"""
|
| 24 |
+
|
| 25 |
+
def __init__(self):
|
| 26 |
+
"""Initialize the assistant with full document loading"""
|
| 27 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 28 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 29 |
+
)
|
| 30 |
+
self.documents = self._load_all_documents()
|
| 31 |
+
self.vector_store = self._create_vector_store()
|
| 32 |
+
self.llm = self._setup_llm()
|
| 33 |
+
|
| 34 |
+
# Cache full paper texts for direct retrieval
|
| 35 |
+
self.full_papers = self._load_full_papers()
|
| 36 |
+
|
| 37 |
+
def _load_full_papers(self) -> Dict[str, str]:
|
| 38 |
+
"""Load complete text of each paper"""
|
| 39 |
+
papers = {}
|
| 40 |
+
pdf_dir = "documents"
|
| 41 |
+
|
| 42 |
+
paper_metadata = {
|
| 43 |
+
"r3d_arxiv_4apr2025.pdf": {
|
| 44 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 45 |
+
"type": "Job Market Paper",
|
| 46 |
+
"key": "r3d"
|
| 47 |
+
},
|
| 48 |
+
"fdr.pdf": {
|
| 49 |
+
"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
|
| 50 |
+
"type": "Working Paper",
|
| 51 |
+
"key": "fdr"
|
| 52 |
+
},
|
| 53 |
+
"disco.pdf": {
|
| 54 |
+
"title": "Data-driven Inference on Optimal Stochastic Restrictions",
|
| 55 |
+
"type": "Working Paper",
|
| 56 |
+
"key": "disco"
|
| 57 |
+
},
|
| 58 |
+
"rto.pdf": {
|
| 59 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 60 |
+
"type": "Working Paper",
|
| 61 |
+
"key": "rto"
|
| 62 |
+
},
|
| 63 |
+
"prodf.pdf": {
|
| 64 |
+
"title": "From output to outcomes: Productivity and the distributions it generates",
|
| 65 |
+
"type": "Working Paper",
|
| 66 |
+
"key": "prodf"
|
| 67 |
+
},
|
| 68 |
+
"unmasking_partisanship.pdf": {
|
| 69 |
+
"title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
|
| 70 |
+
"type": "Published Paper",
|
| 71 |
+
"key": "unmasking"
|
| 72 |
+
},
|
| 73 |
+
"van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf": {
|
| 74 |
+
"title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
|
| 75 |
+
"type": "Published Paper",
|
| 76 |
+
"key": "ukraine"
|
| 77 |
+
},
|
| 78 |
+
"BrzezinskiKechtDeianaVanDijcke_18042020_CEPR_2.pdf": {
|
| 79 |
+
"title": "The Cost of Staying Open: Voluntary Social Distancing and Lockdowns in the US",
|
| 80 |
+
"type": "Published Paper",
|
| 81 |
+
"key": "staying_open"
|
| 82 |
+
},
|
| 83 |
+
"ssrn-3776854.pdf": {
|
| 84 |
+
"title": "Belief in Science Influences Physical Distancing in Response to COVID-19 Lockdown Policies",
|
| 85 |
+
"type": "Working Paper",
|
| 86 |
+
"key": "belief_science"
|
| 87 |
+
},
|
| 88 |
+
"BOE_revision_8dec2022.pdf": {
|
| 89 |
+
"title": "What Drives International Portfolio Flows?",
|
| 90 |
+
"type": "Working Paper",
|
| 91 |
+
"key": "portfolio_flows"
|
| 92 |
+
},
|
| 93 |
+
"CV_DavidVanDijcke.pdf": {
|
| 94 |
+
"title": "Curriculum Vitae",
|
| 95 |
+
"type": "CV",
|
| 96 |
+
"key": "cv"
|
| 97 |
+
}
|
| 98 |
+
}
|
| 99 |
+
|
| 100 |
+
for pdf_file, metadata in paper_metadata.items():
|
| 101 |
+
pdf_path = os.path.join(pdf_dir, pdf_file)
|
| 102 |
+
if os.path.exists(pdf_path):
|
| 103 |
+
try:
|
| 104 |
+
loader = PyPDFLoader(pdf_path)
|
| 105 |
+
# Load ALL pages
|
| 106 |
+
pages = loader.load()
|
| 107 |
+
full_text = "\n\n".join([page.page_content for page in pages])
|
| 108 |
+
|
| 109 |
+
papers[metadata["key"]] = {
|
| 110 |
+
"text": full_text,
|
| 111 |
+
"title": metadata["title"],
|
| 112 |
+
"type": metadata["type"],
|
| 113 |
+
"file": pdf_file,
|
| 114 |
+
"num_pages": len(pages),
|
| 115 |
+
"length": len(full_text)
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
print(f"Loaded full paper: {metadata['title']} ({len(pages)} pages, {len(full_text):,} chars)")
|
| 119 |
+
|
| 120 |
+
except Exception as e:
|
| 121 |
+
print(f"Error loading {pdf_file}: {e}")
|
| 122 |
+
|
| 123 |
+
return papers
|
| 124 |
+
|
| 125 |
+
def _load_all_documents(self) -> List[Document]:
|
| 126 |
+
"""Load documents for vector store - using larger chunks"""
|
| 127 |
+
documents = []
|
| 128 |
+
pdf_dir = "documents"
|
| 129 |
+
|
| 130 |
+
# Use larger chunks for better context preservation
|
| 131 |
+
text_splitter = RecursiveCharacterTextSplitter(
|
| 132 |
+
chunk_size=2000, # Increased from 500
|
| 133 |
+
chunk_overlap=200, # Increased from 50
|
| 134 |
+
separators=["\n\n", "\n", " ", ""]
|
| 135 |
+
)
|
| 136 |
+
|
| 137 |
+
for pdf_file in os.listdir(pdf_dir):
|
| 138 |
+
if pdf_file.endswith('.pdf'):
|
| 139 |
+
pdf_path = os.path.join(pdf_dir, pdf_file)
|
| 140 |
+
try:
|
| 141 |
+
loader = PyPDFLoader(pdf_path)
|
| 142 |
+
pages = loader.load() # Load ALL pages
|
| 143 |
+
|
| 144 |
+
# Add metadata
|
| 145 |
+
for page in pages:
|
| 146 |
+
page.metadata['source'] = pdf_file
|
| 147 |
+
page.metadata['type'] = 'full_paper'
|
| 148 |
+
|
| 149 |
+
# Split into larger chunks
|
| 150 |
+
chunks = text_splitter.split_documents(pages)
|
| 151 |
+
documents.extend(chunks)
|
| 152 |
+
|
| 153 |
+
except Exception as e:
|
| 154 |
+
print(f"Error loading {pdf_file}: {e}")
|
| 155 |
+
|
| 156 |
+
return documents
|
| 157 |
+
|
| 158 |
+
def _create_vector_store(self) -> FAISS:
|
| 159 |
+
"""Create or load vector store"""
|
| 160 |
+
cache_dir = "vector_store_cache_full"
|
| 161 |
+
|
| 162 |
+
if os.path.exists(cache_dir):
|
| 163 |
+
print("Loading cached vector store...")
|
| 164 |
+
return FAISS.load_local(cache_dir, self.embeddings, allow_dangerous_deserialization=True)
|
| 165 |
+
|
| 166 |
+
print(f"Creating vector store from {len(self.documents)} chunks...")
|
| 167 |
+
vector_store = FAISS.from_documents(self.documents, self.embeddings)
|
| 168 |
+
|
| 169 |
+
# Save for future use
|
| 170 |
+
os.makedirs(cache_dir, exist_ok=True)
|
| 171 |
+
vector_store.save_local(cache_dir)
|
| 172 |
+
|
| 173 |
+
return vector_store
|
| 174 |
+
|
| 175 |
+
def _setup_llm(self):
|
| 176 |
+
"""Setup Gemini with large context window"""
|
| 177 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 178 |
+
|
| 179 |
+
if api_key:
|
| 180 |
+
genai.configure(api_key=api_key)
|
| 181 |
+
# Use Gemini 2.0 Flash for even larger context window
|
| 182 |
+
return genai.GenerativeModel('gemini-2.0-flash-exp')
|
| 183 |
+
else:
|
| 184 |
+
print("Warning: No GOOGLE_API_KEY found. Using limited mode.")
|
| 185 |
+
return None
|
| 186 |
+
|
| 187 |
+
def _get_relevant_papers(self, query: str) -> List[Dict[str, Any]]:
|
| 188 |
+
"""Determine which full papers are most relevant to the query"""
|
| 189 |
+
# First use vector search to identify relevant papers
|
| 190 |
+
relevant_chunks = self.vector_store.similarity_search(query, k=10)
|
| 191 |
+
|
| 192 |
+
# Identify unique papers from chunks
|
| 193 |
+
relevant_paper_keys = set()
|
| 194 |
+
for chunk in relevant_chunks:
|
| 195 |
+
source = chunk.metadata.get('source', '')
|
| 196 |
+
# Map source file to paper key
|
| 197 |
+
for key, paper_info in self.full_papers.items():
|
| 198 |
+
if paper_info['file'] == source:
|
| 199 |
+
relevant_paper_keys.add(key)
|
| 200 |
+
break
|
| 201 |
+
|
| 202 |
+
# Also check for specific paper mentions in query
|
| 203 |
+
query_lower = query.lower()
|
| 204 |
+
keyword_map = {
|
| 205 |
+
'r3d': ['r3d', 'regression discontinuity', 'distribution', 'job market'],
|
| 206 |
+
'fdr': ['fdr', 'free discontinuity', 'internet shutdown'],
|
| 207 |
+
'disco': ['disco', 'stochastic restriction', 'optimal transport'],
|
| 208 |
+
'rto': ['rto', 'return to office', 'tenure'],
|
| 209 |
+
'prodf': ['productivity', 'production function', 'revenue'],
|
| 210 |
+
'unmasking': ['mask', 'partisan', 'polarization', 'covid'],
|
| 211 |
+
'ukraine': ['ukraine', 'alert', 'invasion'],
|
| 212 |
+
'staying_open': ['staying open', 'lockdown', 'voluntary'],
|
| 213 |
+
'belief_science': ['belief', 'science', 'compliance'],
|
| 214 |
+
'portfolio_flows': ['portfolio', 'flow', 'international'],
|
| 215 |
+
'cv': ['cv', 'curriculum', 'job market', 'econometrician', 'david']
|
| 216 |
+
}
|
| 217 |
+
|
| 218 |
+
for key, keywords in keyword_map.items():
|
| 219 |
+
if any(keyword in query_lower for keyword in keywords):
|
| 220 |
+
relevant_paper_keys.add(key)
|
| 221 |
+
|
| 222 |
+
# Return full paper info for relevant papers
|
| 223 |
+
relevant_papers = []
|
| 224 |
+
for key in relevant_paper_keys:
|
| 225 |
+
if key in self.full_papers:
|
| 226 |
+
paper_info = self.full_papers[key].copy()
|
| 227 |
+
paper_info['key'] = key
|
| 228 |
+
relevant_papers.append(paper_info)
|
| 229 |
+
|
| 230 |
+
return relevant_papers
|
| 231 |
+
|
| 232 |
+
def answer_question(self, query: str) -> str:
|
| 233 |
+
"""Answer questions using full paper context"""
|
| 234 |
+
if not query.strip():
|
| 235 |
+
return "Please ask a question about David Van Dijcke's research."
|
| 236 |
+
|
| 237 |
+
# Get relevant full papers
|
| 238 |
+
relevant_papers = self._get_relevant_papers(query)
|
| 239 |
+
|
| 240 |
+
if not relevant_papers and self.llm is None:
|
| 241 |
+
return self._get_fallback_response(query)
|
| 242 |
+
|
| 243 |
+
# Construct context with full papers
|
| 244 |
+
context_parts = []
|
| 245 |
+
total_chars = 0
|
| 246 |
+
max_chars = 1000000 # Gemini 2.0 Flash supports up to 1M tokens
|
| 247 |
+
|
| 248 |
+
# Always include CV first if available
|
| 249 |
+
if 'cv' in self.full_papers and total_chars < max_chars:
|
| 250 |
+
cv_text = self.full_papers['cv']['text'][:50000] # First 50k chars of CV
|
| 251 |
+
context_parts.append(f"=== CURRICULUM VITAE ===\n{cv_text}\n")
|
| 252 |
+
total_chars += len(cv_text)
|
| 253 |
+
|
| 254 |
+
# Add relevant papers
|
| 255 |
+
for paper in relevant_papers:
|
| 256 |
+
if total_chars >= max_chars:
|
| 257 |
+
break
|
| 258 |
+
|
| 259 |
+
paper_text = paper['text']
|
| 260 |
+
if total_chars + len(paper_text) > max_chars:
|
| 261 |
+
# Truncate if needed
|
| 262 |
+
paper_text = paper_text[:max_chars - total_chars]
|
| 263 |
+
|
| 264 |
+
context_parts.append(f"=== {paper['title'].upper()} ===\n{paper_text}\n")
|
| 265 |
+
total_chars += len(paper_text)
|
| 266 |
+
|
| 267 |
+
full_context = "\n\n".join(context_parts)
|
| 268 |
+
|
| 269 |
+
# Create prompt
|
| 270 |
+
prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market
|
| 271 |
+
who develops novel methods for analyzing functional and high-dimensional data.
|
| 272 |
+
|
| 273 |
+
You have access to David's FULL papers and CV. Use this comprehensive information to provide detailed, accurate answers.
|
| 274 |
+
|
| 275 |
+
Context (Full Papers):
|
| 276 |
+
{full_context}
|
| 277 |
+
|
| 278 |
+
Question: {query}
|
| 279 |
+
|
| 280 |
+
Instructions:
|
| 281 |
+
1. Provide specific, detailed information from the papers
|
| 282 |
+
2. Quote exact passages when relevant
|
| 283 |
+
3. Explain technical concepts clearly
|
| 284 |
+
4. Make connections across different papers when applicable
|
| 285 |
+
5. Be precise about David's contributions and methods
|
| 286 |
+
|
| 287 |
+
Answer:"""
|
| 288 |
+
|
| 289 |
+
if self.llm:
|
| 290 |
+
try:
|
| 291 |
+
response = self.llm.generate_content(prompt)
|
| 292 |
+
return response.text
|
| 293 |
+
except Exception as e:
|
| 294 |
+
print(f"Error with Gemini API: {e}")
|
| 295 |
+
return self._get_fallback_response(query)
|
| 296 |
+
else:
|
| 297 |
+
return self._get_fallback_response(query)
|
| 298 |
+
|
| 299 |
+
def _get_fallback_response(self, query: str) -> str:
|
| 300 |
+
"""Fallback response when API is not available"""
|
| 301 |
+
query_lower = query.lower()
|
| 302 |
+
|
| 303 |
+
responses = {
|
| 304 |
+
"r3d": """R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's job market paper.
|
| 305 |
+
|
| 306 |
+
Key features:
|
| 307 |
+
- Extends RDD to distribution-valued outcomes
|
| 308 |
+
- Uses optimal transport and Wasserstein distances
|
| 309 |
+
- Allows testing effects on entire outcome distributions
|
| 310 |
+
- Applications to income distributions, test scores, etc.""",
|
| 311 |
+
|
| 312 |
+
"david": """David Van Dijcke is an econometrician on the 2025-26 job market. He specializes in:
|
| 313 |
+
- Functional data analysis
|
| 314 |
+
- High-dimensional econometrics
|
| 315 |
+
- Optimal transport methods
|
| 316 |
+
- Applications to big data
|
| 317 |
+
|
| 318 |
+
Currently at University of Michigan, completing his PhD.""",
|
| 319 |
+
|
| 320 |
+
"methods": """David develops econometric methods for:
|
| 321 |
+
1. Distribution-valued outcomes (R3D)
|
| 322 |
+
2. Free discontinuity problems (FDR)
|
| 323 |
+
3. Stochastic restrictions (DISCO)
|
| 324 |
+
4. High-dimensional productivity analysis
|
| 325 |
+
|
| 326 |
+
His work bridges mathematical theory and practical applications."""
|
| 327 |
+
}
|
| 328 |
+
|
| 329 |
+
# Check for keywords
|
| 330 |
+
for key, response in responses.items():
|
| 331 |
+
if key in query_lower:
|
| 332 |
+
return response
|
| 333 |
+
|
| 334 |
+
return """I'm David Van Dijcke's research assistant. I can help with questions about:
|
| 335 |
+
- His job market paper (R3D)
|
| 336 |
+
- Econometric methods he's developed
|
| 337 |
+
- His research papers and applications
|
| 338 |
+
- His background and expertise
|
| 339 |
+
|
| 340 |
+
For best results, please add a Google API key."""
|
| 341 |
+
|
| 342 |
+
# Create Gradio interface
|
| 343 |
+
def create_interface():
|
| 344 |
+
"""Create the Gradio interface"""
|
| 345 |
+
assistant = FullContextResearchAssistant()
|
| 346 |
+
|
| 347 |
+
# Header with API key info
|
| 348 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant") as interface:
|
| 349 |
+
gr.Markdown("""
|
| 350 |
+
# David Van Dijcke - Econometric Research Assistant (Full Context Version)
|
| 351 |
+
|
| 352 |
+
This enhanced version loads COMPLETE papers to provide comprehensive, detailed answers about David's research.
|
| 353 |
+
|
| 354 |
+
**Features:**
|
| 355 |
+
- Full paper context (not just excerpts)
|
| 356 |
+
- Detailed technical explanations
|
| 357 |
+
- Comprehensive method descriptions
|
| 358 |
+
- Cross-paper connections
|
| 359 |
+
|
| 360 |
+
For best performance, add your Google API key in the Space settings.
|
| 361 |
+
""")
|
| 362 |
+
|
| 363 |
+
# Check API status
|
| 364 |
+
api_status = "✅ Google API configured - Full context mode active" if os.getenv("GOOGLE_API_KEY") else "⚠️ No API key - Limited mode"
|
| 365 |
+
gr.Markdown(f"**Status:** {api_status}")
|
| 366 |
+
|
| 367 |
+
# Chat interface
|
| 368 |
+
chatbot = gr.Chatbot(height=500)
|
| 369 |
+
msg = gr.Textbox(
|
| 370 |
+
label="Ask about David's research",
|
| 371 |
+
placeholder="What is David's job market paper about? What methods does he develop?",
|
| 372 |
+
lines=2
|
| 373 |
+
)
|
| 374 |
+
clear = gr.Button("Clear")
|
| 375 |
+
|
| 376 |
+
# Examples
|
| 377 |
+
gr.Examples(
|
| 378 |
+
examples=[
|
| 379 |
+
"What is David's job market paper R3D about? Explain the technical details.",
|
| 380 |
+
"How does David use optimal transport in his research?",
|
| 381 |
+
"What are the main contributions of the FDR paper?",
|
| 382 |
+
"Explain David's work on productivity and distributional outcomes.",
|
| 383 |
+
"What policy applications does David's research have?",
|
| 384 |
+
"Tell me about David's background and why he's suited for an econometrics position."
|
| 385 |
+
],
|
| 386 |
+
inputs=msg
|
| 387 |
+
)
|
| 388 |
+
|
| 389 |
+
def respond(message, chat_history):
|
| 390 |
+
bot_message = assistant.answer_question(message)
|
| 391 |
+
chat_history.append((message, bot_message))
|
| 392 |
+
return "", chat_history
|
| 393 |
+
|
| 394 |
+
msg.submit(respond, [msg, chatbot], [msg, chatbot])
|
| 395 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 396 |
+
|
| 397 |
+
return interface
|
| 398 |
+
|
| 399 |
+
if __name__ == "__main__":
|
| 400 |
+
interface = create_interface()
|
| 401 |
+
interface.launch()
|
app_natural.py
ADDED
|
@@ -0,0 +1,355 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Natural Research Assistant
|
| 4 |
+
Focuses on clear, accessible, and accurate responses
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Dict, Optional
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 12 |
+
from langchain_community.vectorstores import FAISS
|
| 13 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 14 |
+
from langchain.schema import Document
|
| 15 |
+
from dotenv import load_dotenv
|
| 16 |
+
import google.generativeai as genai
|
| 17 |
+
|
| 18 |
+
# Load environment variables
|
| 19 |
+
load_dotenv()
|
| 20 |
+
|
| 21 |
+
class NaturalResearchAssistant:
|
| 22 |
+
"""Assistant focused on natural, accessible communication"""
|
| 23 |
+
|
| 24 |
+
def __init__(self):
|
| 25 |
+
"""Initialize with focus on clarity"""
|
| 26 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 27 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 28 |
+
)
|
| 29 |
+
|
| 30 |
+
# Load papers
|
| 31 |
+
self.papers = self._load_papers_simple()
|
| 32 |
+
|
| 33 |
+
# Simple vector store
|
| 34 |
+
self.vector_store = self._create_simple_vector_store()
|
| 35 |
+
|
| 36 |
+
# Setup LLM
|
| 37 |
+
self.llm = self._setup_llm()
|
| 38 |
+
|
| 39 |
+
# Pre-written clear explanations
|
| 40 |
+
self.clear_explanations = self._create_clear_explanations()
|
| 41 |
+
|
| 42 |
+
def _load_papers_simple(self) -> Dict[str, Dict]:
|
| 43 |
+
"""Load papers with focus on key information"""
|
| 44 |
+
papers = {}
|
| 45 |
+
pdf_dir = "documents"
|
| 46 |
+
|
| 47 |
+
# Essential paper info
|
| 48 |
+
paper_info = {
|
| 49 |
+
"r3d": {
|
| 50 |
+
"file": "r3d_arxiv_4apr2025.pdf",
|
| 51 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 52 |
+
"simple_explanation": "This paper extends a popular causal inference method (RDD) to study not just average effects but entire distributions - like how a policy affects income inequality, not just average income.",
|
| 53 |
+
"main_contribution": "Allows researchers to see how policies affect different parts of the population differently",
|
| 54 |
+
"real_world_use": "Can show if a minimum wage increase helps low earners more than high earners, or if a school policy helps struggling students catch up"
|
| 55 |
+
},
|
| 56 |
+
"fdr": {
|
| 57 |
+
"file": "fdr.pdf",
|
| 58 |
+
"title": "Free Discontinuity Regression",
|
| 59 |
+
"simple_explanation": "Develops a method to find sudden changes in data when you don't know where they occur - like finding where internet shutdowns hurt the economy most.",
|
| 60 |
+
"main_contribution": "Automatically detects breakpoints in data without pre-specifying them",
|
| 61 |
+
"real_world_use": "Measures economic damage from internet shutdowns, finds structural breaks in markets"
|
| 62 |
+
},
|
| 63 |
+
"rto": {
|
| 64 |
+
"file": "rto.pdf",
|
| 65 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 66 |
+
"simple_explanation": "Studies how return-to-office mandates affect employee retention, finding that senior employees are more likely to leave.",
|
| 67 |
+
"main_contribution": "Shows RTO policies can backfire by driving away experienced talent",
|
| 68 |
+
"real_world_use": "Helps companies understand the hidden costs of ending remote work"
|
| 69 |
+
}
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
for key, info in paper_info.items():
|
| 73 |
+
pdf_path = os.path.join(pdf_dir, info["file"])
|
| 74 |
+
if os.path.exists(pdf_path):
|
| 75 |
+
try:
|
| 76 |
+
loader = PyPDFLoader(pdf_path)
|
| 77 |
+
pages = loader.load()
|
| 78 |
+
full_text = "\n\n".join([p.page_content for p in pages])
|
| 79 |
+
|
| 80 |
+
papers[key] = {
|
| 81 |
+
"text": full_text,
|
| 82 |
+
"pages": len(pages),
|
| 83 |
+
**info
|
| 84 |
+
}
|
| 85 |
+
except Exception as e:
|
| 86 |
+
print(f"Error loading {info['file']}: {e}")
|
| 87 |
+
|
| 88 |
+
return papers
|
| 89 |
+
|
| 90 |
+
def _create_simple_vector_store(self) -> Optional[FAISS]:
|
| 91 |
+
"""Create simple vector store"""
|
| 92 |
+
try:
|
| 93 |
+
documents = []
|
| 94 |
+
text_splitter = RecursiveCharacterTextSplitter(
|
| 95 |
+
chunk_size=1000,
|
| 96 |
+
chunk_overlap=100
|
| 97 |
+
)
|
| 98 |
+
|
| 99 |
+
for key, paper in self.papers.items():
|
| 100 |
+
# Add the simple explanations as documents
|
| 101 |
+
if paper.get("simple_explanation"):
|
| 102 |
+
doc = Document(
|
| 103 |
+
page_content=f"{paper['title']}\n\n{paper['simple_explanation']}\n\nMain contribution: {paper['main_contribution']}\n\nReal-world use: {paper['real_world_use']}",
|
| 104 |
+
metadata={"source": key, "type": "explanation"}
|
| 105 |
+
)
|
| 106 |
+
documents.append(doc)
|
| 107 |
+
|
| 108 |
+
# Add some text chunks
|
| 109 |
+
chunks = text_splitter.split_text(paper["text"])[:10]
|
| 110 |
+
for i, chunk in enumerate(chunks):
|
| 111 |
+
doc = Document(
|
| 112 |
+
page_content=chunk,
|
| 113 |
+
metadata={"source": key, "type": "text", "chunk": i}
|
| 114 |
+
)
|
| 115 |
+
documents.append(doc)
|
| 116 |
+
|
| 117 |
+
return FAISS.from_documents(documents, self.embeddings) if documents else None
|
| 118 |
+
|
| 119 |
+
except Exception as e:
|
| 120 |
+
print(f"Error creating vector store: {e}")
|
| 121 |
+
return None
|
| 122 |
+
|
| 123 |
+
def _setup_llm(self):
|
| 124 |
+
"""Setup Gemini LLM"""
|
| 125 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 126 |
+
|
| 127 |
+
if api_key:
|
| 128 |
+
try:
|
| 129 |
+
genai.configure(api_key=api_key)
|
| 130 |
+
return genai.GenerativeModel('gemini-1.5-flash')
|
| 131 |
+
except Exception as e:
|
| 132 |
+
print(f"Error setting up Gemini: {e}")
|
| 133 |
+
|
| 134 |
+
return None
|
| 135 |
+
|
| 136 |
+
def _create_clear_explanations(self) -> Dict[str, str]:
|
| 137 |
+
"""Pre-written clear explanations for common questions"""
|
| 138 |
+
return {
|
| 139 |
+
"greeting": """Hi! I'm here to help explain David Van Dijcke's research in clear, accessible terms.
|
| 140 |
+
|
| 141 |
+
David is an econometrician on the job market who develops new statistical methods to answer important policy questions. His work helps us understand how policies affect different people differently - not just averages.
|
| 142 |
+
|
| 143 |
+
Feel free to ask about his job market paper (R3D), his other research, or what makes his work unique!""",
|
| 144 |
+
|
| 145 |
+
"job_market": """David's job market paper, R3D, solves an important problem in economics.
|
| 146 |
+
|
| 147 |
+
Traditional methods can tell us if a policy works "on average" - like whether a job training program increases average wages. But averages hide important details. Maybe the program helps low earners a lot but doesn't help high earners at all.
|
| 148 |
+
|
| 149 |
+
R3D lets researchers see the full picture - how a policy affects the entire distribution of outcomes. This means we can answer questions like:
|
| 150 |
+
- Does this education policy help struggling students catch up?
|
| 151 |
+
- Does this labor policy reduce inequality?
|
| 152 |
+
- Do subsidies benefit small firms more than large ones?
|
| 153 |
+
|
| 154 |
+
The technical innovation uses "optimal transport theory" - basically finding the most efficient way to compare whole distributions before and after a policy change.""",
|
| 155 |
+
|
| 156 |
+
"use_cases": """The R3D method has several important applications:
|
| 157 |
+
|
| 158 |
+
**Education Policy**: Instead of just asking "does this program raise test scores?", we can ask "does it help struggling students more than advanced students?"
|
| 159 |
+
|
| 160 |
+
**Labor Economics**: When studying minimum wage effects, we can see if it compresses the wage distribution (reduces inequality) beyond just raising the average.
|
| 161 |
+
|
| 162 |
+
**Development Economics**: For anti-poverty programs, we can see if they help the poorest households escape poverty or just slightly improve everyone's situation.
|
| 163 |
+
|
| 164 |
+
**Finance**: In studying financial regulations, we can see if they reduce extreme risks, not just average risk.
|
| 165 |
+
|
| 166 |
+
The key insight is that the same average effect can hide very different distributional stories - and those differences matter for policy.""",
|
| 167 |
+
|
| 168 |
+
"what_makes_unique": """What makes David's research unique:
|
| 169 |
+
|
| 170 |
+
1. **Practical Focus**: While the methods are sophisticated, they're designed to answer real policy questions that matter to people's lives.
|
| 171 |
+
|
| 172 |
+
2. **Distribution Thinking**: Most economics focuses on averages. David's work recognizes that how effects are distributed across people often matters more than the average.
|
| 173 |
+
|
| 174 |
+
3. **Technical Innovation**: He brings tools from other fields (like optimal transport from mathematics) to solve economic problems in new ways.
|
| 175 |
+
|
| 176 |
+
4. **Policy Relevance**: His papers directly address current issues - internet shutdowns, return-to-office policies, COVID responses - not just theoretical questions.
|
| 177 |
+
|
| 178 |
+
5. **Clear Applications**: Each method comes with real examples showing how it helps answer important questions."""
|
| 179 |
+
}
|
| 180 |
+
|
| 181 |
+
def answer_question(self, query: str, chat_history: List = None) -> str:
|
| 182 |
+
"""Answer with focus on clarity and accuracy"""
|
| 183 |
+
if not query.strip():
|
| 184 |
+
return "What would you like to know about David's research?"
|
| 185 |
+
|
| 186 |
+
query_lower = query.lower()
|
| 187 |
+
|
| 188 |
+
# Check for pre-written explanations
|
| 189 |
+
if any(greeting in query_lower for greeting in ["hi", "hello", "hey", "what's up"]):
|
| 190 |
+
return self.clear_explanations["greeting"]
|
| 191 |
+
|
| 192 |
+
if any(term in query_lower for term in ["job market", "jmp", "r3d"]) and "paper" in query_lower:
|
| 193 |
+
return self.clear_explanations["job_market"]
|
| 194 |
+
|
| 195 |
+
if any(term in query_lower for term in ["use", "application", "why", "purpose"]):
|
| 196 |
+
return self.clear_explanations["use_cases"]
|
| 197 |
+
|
| 198 |
+
if any(term in query_lower for term in ["unique", "special", "different"]):
|
| 199 |
+
return self.clear_explanations["what_makes_unique"]
|
| 200 |
+
|
| 201 |
+
# For other questions, use LLM with better prompting
|
| 202 |
+
if self.llm:
|
| 203 |
+
context = self._get_relevant_context(query)
|
| 204 |
+
return self._generate_natural_response(query, context)
|
| 205 |
+
else:
|
| 206 |
+
return self._get_simple_fallback(query)
|
| 207 |
+
|
| 208 |
+
def _get_relevant_context(self, query: str) -> str:
|
| 209 |
+
"""Get relevant context focusing on explanations"""
|
| 210 |
+
contexts = []
|
| 211 |
+
|
| 212 |
+
# First, try to match specific papers
|
| 213 |
+
query_lower = query.lower()
|
| 214 |
+
|
| 215 |
+
for key, paper in self.papers.items():
|
| 216 |
+
paper_mentioned = False
|
| 217 |
+
|
| 218 |
+
# Check if paper is mentioned
|
| 219 |
+
if key in query_lower or any(word in query_lower for word in paper['title'].lower().split()):
|
| 220 |
+
paper_mentioned = True
|
| 221 |
+
|
| 222 |
+
if paper_mentioned:
|
| 223 |
+
context = f"Paper: {paper['title']}\n"
|
| 224 |
+
context += f"Simple explanation: {paper.get('simple_explanation', '')}\n"
|
| 225 |
+
context += f"Main contribution: {paper.get('main_contribution', '')}\n"
|
| 226 |
+
context += f"Real-world use: {paper.get('real_world_use', '')}"
|
| 227 |
+
contexts.append(context)
|
| 228 |
+
|
| 229 |
+
# If no specific paper matched, use vector search
|
| 230 |
+
if not contexts and self.vector_store:
|
| 231 |
+
try:
|
| 232 |
+
docs = self.vector_store.similarity_search(query, k=3)
|
| 233 |
+
for doc in docs:
|
| 234 |
+
contexts.append(doc.page_content)
|
| 235 |
+
except:
|
| 236 |
+
pass
|
| 237 |
+
|
| 238 |
+
return "\n\n---\n\n".join(contexts)
|
| 239 |
+
|
| 240 |
+
def _generate_natural_response(self, query: str, context: str) -> str:
|
| 241 |
+
"""Generate natural, accessible response"""
|
| 242 |
+
prompt = f"""You are explaining David Van Dijcke's econometric research to someone who may not have a technical background.
|
| 243 |
+
|
| 244 |
+
David is on the 2025-26 economics job market. His job market paper is R3D.
|
| 245 |
+
|
| 246 |
+
Context about his work:
|
| 247 |
+
{context}
|
| 248 |
+
|
| 249 |
+
Question: {query}
|
| 250 |
+
|
| 251 |
+
Instructions:
|
| 252 |
+
1. Give a clear, conversational answer in 2-3 paragraphs maximum
|
| 253 |
+
2. Avoid technical jargon - explain concepts simply
|
| 254 |
+
3. Use concrete examples when possible
|
| 255 |
+
4. Focus on why this research matters, not just what it does
|
| 256 |
+
5. Be friendly and approachable
|
| 257 |
+
6. If discussing methods, explain the intuition, not the math
|
| 258 |
+
|
| 259 |
+
Answer in a natural, conversational tone:"""
|
| 260 |
+
|
| 261 |
+
try:
|
| 262 |
+
response = self.llm.generate_content(prompt)
|
| 263 |
+
return response.text
|
| 264 |
+
except Exception as e:
|
| 265 |
+
return self._get_simple_fallback(query)
|
| 266 |
+
|
| 267 |
+
def _get_simple_fallback(self, query: str) -> str:
|
| 268 |
+
"""Simple fallback responses"""
|
| 269 |
+
query_lower = query.lower()
|
| 270 |
+
|
| 271 |
+
if "who" in query_lower or "david" in query_lower:
|
| 272 |
+
return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan.
|
| 273 |
+
|
| 274 |
+
He develops new statistical methods that help us understand how policies affect different people differently - going beyond simple averages to see the full picture. His job market paper (R3D) is about measuring distributional effects in policy evaluation."""
|
| 275 |
+
|
| 276 |
+
if "r3d" in query_lower or "job market" in query_lower:
|
| 277 |
+
return """R3D is David's job market paper. It extends regression discontinuity design to study entire distributions.
|
| 278 |
+
|
| 279 |
+
In simple terms: Traditional methods tell us if a policy works "on average." R3D shows us WHO it works for - whether it helps the poor more than the rich, struggling students more than advanced ones, etc. This matters because the same "average" effect can hide very different realities."""
|
| 280 |
+
|
| 281 |
+
return """I can help explain David Van Dijcke's research! He's an econometrician who develops methods to understand how policies affect different people differently.
|
| 282 |
+
|
| 283 |
+
Try asking about:
|
| 284 |
+
- His job market paper (R3D)
|
| 285 |
+
- What makes his research unique
|
| 286 |
+
- How his methods are used in practice"""
|
| 287 |
+
|
| 288 |
+
# Create interface
|
| 289 |
+
def create_interface():
|
| 290 |
+
"""Create user-friendly interface"""
|
| 291 |
+
assistant = NaturalResearchAssistant()
|
| 292 |
+
|
| 293 |
+
def chat(message, history):
|
| 294 |
+
if history is None:
|
| 295 |
+
history = []
|
| 296 |
+
response = assistant.answer_question(message, history)
|
| 297 |
+
history.append([message, response])
|
| 298 |
+
return "", history
|
| 299 |
+
|
| 300 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant", css="""
|
| 301 |
+
.gradio-container {font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;}
|
| 302 |
+
""") as demo:
|
| 303 |
+
|
| 304 |
+
gr.Markdown("""
|
| 305 |
+
# Chat with David Van Dijcke's Research Assistant
|
| 306 |
+
|
| 307 |
+
**David Van Dijcke** | Econometrician | 2025-26 Job Market Candidate | University of Michigan
|
| 308 |
+
""")
|
| 309 |
+
|
| 310 |
+
chatbot = gr.Chatbot(
|
| 311 |
+
height=450,
|
| 312 |
+
show_label=False,
|
| 313 |
+
avatar_images=None
|
| 314 |
+
)
|
| 315 |
+
|
| 316 |
+
msg = gr.Textbox(
|
| 317 |
+
label="Your question",
|
| 318 |
+
placeholder="Ask me about David's research in plain English...",
|
| 319 |
+
lines=2
|
| 320 |
+
)
|
| 321 |
+
|
| 322 |
+
with gr.Row():
|
| 323 |
+
submit = gr.Button("Send", variant="primary")
|
| 324 |
+
clear = gr.Button("Clear Chat")
|
| 325 |
+
|
| 326 |
+
# Suggested questions
|
| 327 |
+
gr.Markdown("### Try asking:")
|
| 328 |
+
examples = gr.Examples(
|
| 329 |
+
examples=[
|
| 330 |
+
"What is David's job market paper about?",
|
| 331 |
+
"Why does R3D matter for policy?",
|
| 332 |
+
"What real-world problems does David's research solve?",
|
| 333 |
+
"How is David's work different from typical economics research?",
|
| 334 |
+
"Can you explain R3D without the technical details?",
|
| 335 |
+
"What are some applications of the R3D method?"
|
| 336 |
+
],
|
| 337 |
+
inputs=msg,
|
| 338 |
+
label=""
|
| 339 |
+
)
|
| 340 |
+
|
| 341 |
+
# Event handlers
|
| 342 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 343 |
+
submit.click(chat, [msg, chatbot], [msg, chatbot])
|
| 344 |
+
clear.click(lambda: [], None, chatbot)
|
| 345 |
+
|
| 346 |
+
return demo
|
| 347 |
+
|
| 348 |
+
if __name__ == "__main__":
|
| 349 |
+
interface = create_interface()
|
| 350 |
+
interface.launch(
|
| 351 |
+
server_name="127.0.0.1",
|
| 352 |
+
server_port=7860,
|
| 353 |
+
share=False,
|
| 354 |
+
quiet=True
|
| 355 |
+
)
|
app_optimized.py
ADDED
|
@@ -0,0 +1,554 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Optimized Research Assistant
|
| 4 |
+
Combines full paper loading with smart retrieval and caching
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import json
|
| 9 |
+
import time
|
| 10 |
+
from typing import List, Dict, Any, Optional, Tuple
|
| 11 |
+
import gradio as gr
|
| 12 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 13 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 14 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 15 |
+
from langchain_community.vectorstores import FAISS
|
| 16 |
+
from langchain.schema import Document
|
| 17 |
+
from dotenv import load_dotenv
|
| 18 |
+
import google.generativeai as genai
|
| 19 |
+
|
| 20 |
+
# Load environment variables
|
| 21 |
+
load_dotenv()
|
| 22 |
+
|
| 23 |
+
class OptimizedResearchAssistant:
|
| 24 |
+
"""Optimized assistant with full papers and smart retrieval"""
|
| 25 |
+
|
| 26 |
+
def __init__(self):
|
| 27 |
+
"""Initialize with optimized loading and caching"""
|
| 28 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 29 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 30 |
+
)
|
| 31 |
+
|
| 32 |
+
# Load papers with smart caching
|
| 33 |
+
self.papers_metadata = self._load_papers_metadata()
|
| 34 |
+
self.full_papers = self._load_full_papers_cached()
|
| 35 |
+
|
| 36 |
+
# Create hierarchical vector stores
|
| 37 |
+
self.vector_store_chunks = self._create_vector_store("chunks")
|
| 38 |
+
self.vector_store_sections = self._create_vector_store("sections")
|
| 39 |
+
|
| 40 |
+
self.llm = self._setup_llm()
|
| 41 |
+
|
| 42 |
+
# Cache for responses
|
| 43 |
+
self.response_cache = {}
|
| 44 |
+
|
| 45 |
+
def _load_papers_metadata(self) -> Dict[str, Dict]:
|
| 46 |
+
"""Load metadata about papers"""
|
| 47 |
+
return {
|
| 48 |
+
"r3d": {
|
| 49 |
+
"file": "r3d_arxiv_4apr2025.pdf",
|
| 50 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 51 |
+
"type": "Job Market Paper",
|
| 52 |
+
"year": 2025,
|
| 53 |
+
"keywords": ["regression discontinuity", "distribution", "optimal transport", "wasserstein", "job market"],
|
| 54 |
+
"sections": ["introduction", "theory", "identification", "estimation", "applications", "conclusion"]
|
| 55 |
+
},
|
| 56 |
+
"fdr": {
|
| 57 |
+
"file": "fdr.pdf",
|
| 58 |
+
"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
|
| 59 |
+
"type": "Working Paper",
|
| 60 |
+
"year": 2024,
|
| 61 |
+
"keywords": ["free discontinuity", "internet shutdowns", "geometric measure theory"],
|
| 62 |
+
"sections": ["introduction", "methodology", "application", "results"]
|
| 63 |
+
},
|
| 64 |
+
"disco": {
|
| 65 |
+
"file": "disco.pdf",
|
| 66 |
+
"title": "disco: Distributional Synthetic Controls",
|
| 67 |
+
"type": "Working Paper",
|
| 68 |
+
"year": 2025,
|
| 69 |
+
"keywords": ["distributional", "synthetic controls", "stata", "package"],
|
| 70 |
+
"sections": ["introduction", "methodology", "implementation", "application", "conclusion"]
|
| 71 |
+
},
|
| 72 |
+
"rto": {
|
| 73 |
+
"file": "rto.pdf",
|
| 74 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 75 |
+
"type": "Working Paper",
|
| 76 |
+
"year": 2025,
|
| 77 |
+
"keywords": ["return to office", "tenure", "distribution", "covid", "remote work"],
|
| 78 |
+
"sections": ["introduction", "data", "methodology", "results", "conclusion"]
|
| 79 |
+
},
|
| 80 |
+
"prodf": {
|
| 81 |
+
"file": "prodf.pdf",
|
| 82 |
+
"title": "On the Non-Identification of Revenue Production Functions",
|
| 83 |
+
"type": "Working Paper",
|
| 84 |
+
"year": 2023,
|
| 85 |
+
"keywords": ["production functions", "revenue", "identification", "productivity"],
|
| 86 |
+
"sections": ["introduction", "theory", "identification", "conclusion"]
|
| 87 |
+
},
|
| 88 |
+
"unmasking": {
|
| 89 |
+
"file": "unmasking_partisanship.pdf",
|
| 90 |
+
"title": "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk",
|
| 91 |
+
"type": "Published Paper",
|
| 92 |
+
"year": 2021,
|
| 93 |
+
"keywords": ["masks", "partisanship", "polarization", "covid", "public health"],
|
| 94 |
+
"sections": ["introduction", "data", "methodology", "results", "conclusion"]
|
| 95 |
+
},
|
| 96 |
+
"ukraine": {
|
| 97 |
+
"file": "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf",
|
| 98 |
+
"title": "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine",
|
| 99 |
+
"type": "Published Paper",
|
| 100 |
+
"year": 2023,
|
| 101 |
+
"keywords": ["ukraine", "alerts", "invasion", "public response", "war"],
|
| 102 |
+
"sections": ["introduction", "data", "methodology", "results", "conclusion"]
|
| 103 |
+
},
|
| 104 |
+
"staying_open": {
|
| 105 |
+
"file": "BrzezinskiKechtDeianaVanDijcke_18042020_CEPR_2.pdf",
|
| 106 |
+
"title": "The Cost of Staying Open: Voluntary Social Distancing and Lockdowns in the US",
|
| 107 |
+
"type": "Published Paper",
|
| 108 |
+
"year": 2020,
|
| 109 |
+
"keywords": ["covid", "lockdown", "staying open", "voluntary", "social distancing"],
|
| 110 |
+
"sections": ["introduction", "data", "methodology", "results", "conclusion"]
|
| 111 |
+
},
|
| 112 |
+
"belief_science": {
|
| 113 |
+
"file": "ssrn-3776854.pdf",
|
| 114 |
+
"title": "Belief in Science Influences Physical Distancing in Response to COVID-19 Lockdown Policies",
|
| 115 |
+
"type": "Working Paper",
|
| 116 |
+
"year": 2021,
|
| 117 |
+
"keywords": ["belief", "science", "covid", "compliance", "physical distancing"],
|
| 118 |
+
"sections": ["introduction", "data", "methodology", "results", "conclusion"]
|
| 119 |
+
},
|
| 120 |
+
"portfolio_flows": {
|
| 121 |
+
"file": "BOE_revision_8dec2022.pdf",
|
| 122 |
+
"title": "What Drives International Portfolio Flows?",
|
| 123 |
+
"type": "Working Paper",
|
| 124 |
+
"year": 2022,
|
| 125 |
+
"keywords": ["portfolio", "flows", "international", "finance", "investment"],
|
| 126 |
+
"sections": ["introduction", "theory", "data", "results", "conclusion"]
|
| 127 |
+
},
|
| 128 |
+
"cv": {
|
| 129 |
+
"file": "CV_DavidVanDijcke.pdf",
|
| 130 |
+
"title": "Curriculum Vitae",
|
| 131 |
+
"type": "CV",
|
| 132 |
+
"year": 2025,
|
| 133 |
+
"keywords": ["cv", "resume", "background", "econometrician", "david"],
|
| 134 |
+
"sections": ["education", "research", "teaching", "awards"]
|
| 135 |
+
}
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
+
def _load_full_papers_cached(self) -> Dict[str, Dict]:
|
| 139 |
+
"""Load full papers with caching"""
|
| 140 |
+
cache_file = "papers_cache.json"
|
| 141 |
+
|
| 142 |
+
# Try to load from cache
|
| 143 |
+
if os.path.exists(cache_file):
|
| 144 |
+
try:
|
| 145 |
+
with open(cache_file, 'r') as f:
|
| 146 |
+
return json.load(f)
|
| 147 |
+
except:
|
| 148 |
+
pass
|
| 149 |
+
|
| 150 |
+
# Load papers
|
| 151 |
+
papers = {}
|
| 152 |
+
pdf_dir = "documents"
|
| 153 |
+
|
| 154 |
+
for key, metadata in self.papers_metadata.items():
|
| 155 |
+
pdf_path = os.path.join(pdf_dir, metadata["file"])
|
| 156 |
+
if os.path.exists(pdf_path):
|
| 157 |
+
try:
|
| 158 |
+
loader = PyPDFLoader(pdf_path)
|
| 159 |
+
pages = loader.load()
|
| 160 |
+
|
| 161 |
+
# Extract sections intelligently
|
| 162 |
+
sections = self._extract_sections(pages, metadata["sections"])
|
| 163 |
+
|
| 164 |
+
papers[key] = {
|
| 165 |
+
"full_text": "\n\n".join([p.page_content for p in pages]),
|
| 166 |
+
"sections": sections,
|
| 167 |
+
"num_pages": len(pages),
|
| 168 |
+
"metadata": metadata
|
| 169 |
+
}
|
| 170 |
+
|
| 171 |
+
print(f"Loaded: {metadata['title']} ({len(pages)} pages)")
|
| 172 |
+
|
| 173 |
+
except Exception as e:
|
| 174 |
+
print(f"Error loading {metadata['file']}: {e}")
|
| 175 |
+
|
| 176 |
+
# Cache for next time
|
| 177 |
+
try:
|
| 178 |
+
# Create a serializable version
|
| 179 |
+
cache_data = {}
|
| 180 |
+
for key, paper in papers.items():
|
| 181 |
+
cache_data[key] = {
|
| 182 |
+
"full_text": paper["full_text"],
|
| 183 |
+
"sections": paper["sections"],
|
| 184 |
+
"num_pages": paper["num_pages"],
|
| 185 |
+
"metadata": paper["metadata"]
|
| 186 |
+
}
|
| 187 |
+
|
| 188 |
+
with open(cache_file, 'w') as f:
|
| 189 |
+
json.dump(cache_data, f)
|
| 190 |
+
except:
|
| 191 |
+
pass
|
| 192 |
+
|
| 193 |
+
return papers
|
| 194 |
+
|
| 195 |
+
def _extract_sections(self, pages: List[Document], expected_sections: List[str]) -> Dict[str, str]:
|
| 196 |
+
"""Extract paper sections intelligently"""
|
| 197 |
+
full_text = "\n\n".join([p.page_content for p in pages])
|
| 198 |
+
sections = {}
|
| 199 |
+
|
| 200 |
+
# Common section patterns
|
| 201 |
+
section_patterns = {
|
| 202 |
+
"introduction": ["introduction", "1 introduction", "1. introduction"],
|
| 203 |
+
"theory": ["theory", "theoretical", "model", "2 theory", "2. theory"],
|
| 204 |
+
"methodology": ["methodology", "method", "empirical strategy", "3 method"],
|
| 205 |
+
"data": ["data", "dataset", "4 data"],
|
| 206 |
+
"results": ["results", "findings", "5 results"],
|
| 207 |
+
"conclusion": ["conclusion", "concluding", "6 conclusion"]
|
| 208 |
+
}
|
| 209 |
+
|
| 210 |
+
# Extract sections
|
| 211 |
+
for section_key in expected_sections:
|
| 212 |
+
patterns = section_patterns.get(section_key, [section_key])
|
| 213 |
+
|
| 214 |
+
for pattern in patterns:
|
| 215 |
+
# Find section start
|
| 216 |
+
import re
|
| 217 |
+
regex = re.compile(rf"\n+\s*({re.escape(pattern)})\s*\n", re.IGNORECASE)
|
| 218 |
+
match = regex.search(full_text)
|
| 219 |
+
|
| 220 |
+
if match:
|
| 221 |
+
start = match.end()
|
| 222 |
+
# Find next section or end
|
| 223 |
+
next_match = None
|
| 224 |
+
for next_key in expected_sections:
|
| 225 |
+
if next_key != section_key:
|
| 226 |
+
next_patterns = section_patterns.get(next_key, [next_key])
|
| 227 |
+
for next_pattern in next_patterns:
|
| 228 |
+
next_regex = re.compile(rf"\n+\s*({re.escape(next_pattern)})\s*\n", re.IGNORECASE)
|
| 229 |
+
next_match = next_regex.search(full_text[start:])
|
| 230 |
+
if next_match:
|
| 231 |
+
break
|
| 232 |
+
if next_match:
|
| 233 |
+
break
|
| 234 |
+
|
| 235 |
+
end = start + next_match.start() if next_match else len(full_text)
|
| 236 |
+
sections[section_key] = full_text[start:end].strip()
|
| 237 |
+
break
|
| 238 |
+
|
| 239 |
+
return sections
|
| 240 |
+
|
| 241 |
+
def _create_vector_store(self, store_type: str) -> FAISS:
|
| 242 |
+
"""Create or load vector stores"""
|
| 243 |
+
cache_dir = f"vector_store_cache_{store_type}"
|
| 244 |
+
|
| 245 |
+
if os.path.exists(cache_dir):
|
| 246 |
+
try:
|
| 247 |
+
# Try with newer langchain version parameter
|
| 248 |
+
return FAISS.load_local(cache_dir, self.embeddings, allow_dangerous_deserialization=True)
|
| 249 |
+
except TypeError:
|
| 250 |
+
# Fall back to older version without the parameter
|
| 251 |
+
return FAISS.load_local(cache_dir, self.embeddings)
|
| 252 |
+
|
| 253 |
+
documents = []
|
| 254 |
+
|
| 255 |
+
if store_type == "chunks":
|
| 256 |
+
# Smaller chunks for detailed retrieval
|
| 257 |
+
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
|
| 258 |
+
else:
|
| 259 |
+
# Larger chunks for section-level retrieval
|
| 260 |
+
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
|
| 261 |
+
|
| 262 |
+
for key, paper in self.full_papers.items():
|
| 263 |
+
# Create documents from sections
|
| 264 |
+
for section_name, section_text in paper["sections"].items():
|
| 265 |
+
if section_text:
|
| 266 |
+
doc = Document(
|
| 267 |
+
page_content=section_text,
|
| 268 |
+
metadata={
|
| 269 |
+
"paper_key": key,
|
| 270 |
+
"section": section_name,
|
| 271 |
+
"title": paper["metadata"]["title"],
|
| 272 |
+
"type": paper["metadata"]["type"]
|
| 273 |
+
}
|
| 274 |
+
)
|
| 275 |
+
|
| 276 |
+
# Split if needed
|
| 277 |
+
if store_type == "chunks":
|
| 278 |
+
chunks = splitter.split_documents([doc])
|
| 279 |
+
documents.extend(chunks)
|
| 280 |
+
else:
|
| 281 |
+
documents.append(doc)
|
| 282 |
+
|
| 283 |
+
vector_store = FAISS.from_documents(documents, self.embeddings)
|
| 284 |
+
os.makedirs(cache_dir, exist_ok=True)
|
| 285 |
+
vector_store.save_local(cache_dir)
|
| 286 |
+
|
| 287 |
+
return vector_store
|
| 288 |
+
|
| 289 |
+
def _setup_llm(self):
|
| 290 |
+
"""Setup Gemini model"""
|
| 291 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 292 |
+
|
| 293 |
+
if api_key:
|
| 294 |
+
genai.configure(api_key=api_key)
|
| 295 |
+
# Use latest Gemini model
|
| 296 |
+
return genai.GenerativeModel('gemini-2.0-flash-exp')
|
| 297 |
+
|
| 298 |
+
return None
|
| 299 |
+
|
| 300 |
+
def _get_query_type(self, query: str) -> str:
|
| 301 |
+
"""Determine query type for optimal retrieval"""
|
| 302 |
+
query_lower = query.lower()
|
| 303 |
+
|
| 304 |
+
if any(term in query_lower for term in ["technical", "method", "econometric", "detail"]):
|
| 305 |
+
return "technical"
|
| 306 |
+
elif any(term in query_lower for term in ["overview", "summary", "about", "who is"]):
|
| 307 |
+
return "overview"
|
| 308 |
+
elif any(term in query_lower for term in ["application", "policy", "empirical"]):
|
| 309 |
+
return "application"
|
| 310 |
+
elif any(term in query_lower for term in ["job market", "cv", "background"]):
|
| 311 |
+
return "biographical"
|
| 312 |
+
else:
|
| 313 |
+
return "general"
|
| 314 |
+
|
| 315 |
+
def _smart_retrieval(self, query: str) -> Tuple[str, List[str]]:
|
| 316 |
+
"""Smart retrieval based on query type"""
|
| 317 |
+
query_type = self._get_query_type(query)
|
| 318 |
+
|
| 319 |
+
# Determine which papers are most relevant
|
| 320 |
+
relevant_papers = self._identify_relevant_papers(query)
|
| 321 |
+
|
| 322 |
+
context_parts = []
|
| 323 |
+
paper_list = []
|
| 324 |
+
|
| 325 |
+
# Always include CV summary for biographical queries
|
| 326 |
+
if query_type == "biographical" and "cv" in self.full_papers:
|
| 327 |
+
cv_sections = self.full_papers["cv"]["sections"]
|
| 328 |
+
context_parts.append(f"=== CV HIGHLIGHTS ===\n{cv_sections.get('education', '')}\n{cv_sections.get('research', '')}")
|
| 329 |
+
paper_list.append("CV")
|
| 330 |
+
|
| 331 |
+
# Add relevant papers based on query type
|
| 332 |
+
if query_type == "technical":
|
| 333 |
+
# For technical queries, include theory and methodology sections
|
| 334 |
+
for paper_key in relevant_papers[:3]: # Top 3 papers
|
| 335 |
+
if paper_key in self.full_papers:
|
| 336 |
+
paper = self.full_papers[paper_key]
|
| 337 |
+
sections = paper["sections"]
|
| 338 |
+
|
| 339 |
+
title = paper["metadata"]["title"]
|
| 340 |
+
theory = sections.get("theory", sections.get("methodology", ""))
|
| 341 |
+
|
| 342 |
+
if theory:
|
| 343 |
+
context_parts.append(f"=== {title} - TECHNICAL DETAILS ===\n{theory[:20000]}")
|
| 344 |
+
paper_list.append(title)
|
| 345 |
+
|
| 346 |
+
elif query_type == "overview":
|
| 347 |
+
# For overview queries, include introductions and conclusions
|
| 348 |
+
for paper_key in relevant_papers[:4]: # Top 4 papers
|
| 349 |
+
if paper_key in self.full_papers:
|
| 350 |
+
paper = self.full_papers[paper_key]
|
| 351 |
+
sections = paper["sections"]
|
| 352 |
+
|
| 353 |
+
title = paper["metadata"]["title"]
|
| 354 |
+
intro = sections.get("introduction", "")[:5000]
|
| 355 |
+
conclusion = sections.get("conclusion", "")[:3000]
|
| 356 |
+
|
| 357 |
+
context_parts.append(f"=== {title} ===\nIntroduction:\n{intro}\n\nConclusion:\n{conclusion}")
|
| 358 |
+
paper_list.append(title)
|
| 359 |
+
|
| 360 |
+
else:
|
| 361 |
+
# For general queries, use hybrid approach
|
| 362 |
+
# Get relevant chunks
|
| 363 |
+
chunks = self.vector_store_chunks.similarity_search(query, k=6)
|
| 364 |
+
|
| 365 |
+
# Group by paper
|
| 366 |
+
paper_chunks = {}
|
| 367 |
+
for chunk in chunks:
|
| 368 |
+
paper_key = chunk.metadata.get("paper_key")
|
| 369 |
+
if paper_key:
|
| 370 |
+
if paper_key not in paper_chunks:
|
| 371 |
+
paper_chunks[paper_key] = []
|
| 372 |
+
paper_chunks[paper_key].append(chunk.page_content)
|
| 373 |
+
|
| 374 |
+
# Add grouped chunks
|
| 375 |
+
for paper_key, chunks in paper_chunks.items():
|
| 376 |
+
if paper_key in self.full_papers:
|
| 377 |
+
title = self.full_papers[paper_key]["metadata"]["title"]
|
| 378 |
+
combined_chunks = "\n\n".join(chunks)
|
| 379 |
+
context_parts.append(f"=== {title} - RELEVANT EXCERPTS ===\n{combined_chunks}")
|
| 380 |
+
paper_list.append(title)
|
| 381 |
+
|
| 382 |
+
return "\n\n".join(context_parts), paper_list
|
| 383 |
+
|
| 384 |
+
def _identify_relevant_papers(self, query: str) -> List[str]:
|
| 385 |
+
"""Identify most relevant papers for a query"""
|
| 386 |
+
query_lower = query.lower()
|
| 387 |
+
scores = {}
|
| 388 |
+
|
| 389 |
+
for key, metadata in self.papers_metadata.items():
|
| 390 |
+
score = 0
|
| 391 |
+
|
| 392 |
+
# Check keywords
|
| 393 |
+
for keyword in metadata["keywords"]:
|
| 394 |
+
if keyword in query_lower:
|
| 395 |
+
score += 2
|
| 396 |
+
|
| 397 |
+
# Check title
|
| 398 |
+
if any(word in query_lower for word in metadata["title"].lower().split()):
|
| 399 |
+
score += 1
|
| 400 |
+
|
| 401 |
+
# Special cases
|
| 402 |
+
if key == "r3d" and any(term in query_lower for term in ["job market", "jmp", "main paper"]):
|
| 403 |
+
score += 5
|
| 404 |
+
|
| 405 |
+
if score > 0:
|
| 406 |
+
scores[key] = score
|
| 407 |
+
|
| 408 |
+
# Sort by score
|
| 409 |
+
sorted_papers = sorted(scores.items(), key=lambda x: x[1], reverse=True)
|
| 410 |
+
|
| 411 |
+
return [paper[0] for paper in sorted_papers]
|
| 412 |
+
|
| 413 |
+
def answer_question(self, query: str) -> str:
|
| 414 |
+
"""Answer questions with optimized retrieval"""
|
| 415 |
+
if not query.strip():
|
| 416 |
+
return "Please ask a question about David Van Dijcke's research."
|
| 417 |
+
|
| 418 |
+
# Check cache
|
| 419 |
+
cache_key = query.lower().strip()
|
| 420 |
+
if cache_key in self.response_cache:
|
| 421 |
+
return self.response_cache[cache_key]
|
| 422 |
+
|
| 423 |
+
# Get relevant context
|
| 424 |
+
context, papers_used = self._smart_retrieval(query)
|
| 425 |
+
|
| 426 |
+
if not self.llm:
|
| 427 |
+
return self._get_fallback_response(query)
|
| 428 |
+
|
| 429 |
+
# Create optimized prompt
|
| 430 |
+
prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market.
|
| 431 |
+
|
| 432 |
+
Context from papers: {', '.join(papers_used)}
|
| 433 |
+
|
| 434 |
+
{context}
|
| 435 |
+
|
| 436 |
+
Question: {query}
|
| 437 |
+
|
| 438 |
+
Instructions:
|
| 439 |
+
- Provide accurate, detailed answers based on the context
|
| 440 |
+
- Use specific examples and technical details when relevant
|
| 441 |
+
- Be clear and precise about David's contributions
|
| 442 |
+
- If discussing methods, explain both the intuition and technical aspects
|
| 443 |
+
|
| 444 |
+
Answer:"""
|
| 445 |
+
|
| 446 |
+
try:
|
| 447 |
+
response = self.llm.generate_content(prompt)
|
| 448 |
+
answer = response.text
|
| 449 |
+
|
| 450 |
+
# Cache response
|
| 451 |
+
self.response_cache[cache_key] = answer
|
| 452 |
+
|
| 453 |
+
return answer
|
| 454 |
+
|
| 455 |
+
except Exception as e:
|
| 456 |
+
print(f"Error: {e}")
|
| 457 |
+
return self._get_fallback_response(query)
|
| 458 |
+
|
| 459 |
+
def _get_fallback_response(self, query: str) -> str:
|
| 460 |
+
"""Enhanced fallback responses"""
|
| 461 |
+
query_lower = query.lower()
|
| 462 |
+
|
| 463 |
+
# Check for specific paper mentions
|
| 464 |
+
if "r3d" in query_lower or "job market" in query_lower:
|
| 465 |
+
return """R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David Van Dijcke's job market paper.
|
| 466 |
+
|
| 467 |
+
Key innovations:
|
| 468 |
+
• Extends RDD to analyze entire outcome distributions, not just means
|
| 469 |
+
• Uses optimal transport theory and Wasserstein distances
|
| 470 |
+
• Develops new estimation and inference procedures
|
| 471 |
+
• Applications to income distributions, test score distributions
|
| 472 |
+
|
| 473 |
+
The paper addresses a fundamental limitation of traditional RDD that only looks at average effects, enabling researchers to study distributional impacts of policies."""
|
| 474 |
+
|
| 475 |
+
elif "fdr" in query_lower or "free discontinuity" in query_lower:
|
| 476 |
+
return """Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuities.
|
| 477 |
+
|
| 478 |
+
Key contributions:
|
| 479 |
+
• Develops methods for when discontinuity locations are unknown
|
| 480 |
+
• Uses geometric measure theory and free discontinuity problems
|
| 481 |
+
• Application to internet shutdowns' economic effects
|
| 482 |
+
• Shows traditional methods can be severely biased when discontinuities are misspecified"""
|
| 483 |
+
|
| 484 |
+
elif "david" in query_lower or "who" in query_lower:
|
| 485 |
+
return """David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan.
|
| 486 |
+
|
| 487 |
+
Specializations:
|
| 488 |
+
• Functional data analysis and high-dimensional econometrics
|
| 489 |
+
• Optimal transport methods in economics
|
| 490 |
+
• Distribution-valued outcomes and treatment effects
|
| 491 |
+
• Novel applications of geometric measure theory
|
| 492 |
+
|
| 493 |
+
His research develops cutting-edge econometric methods for modern data challenges, with applications to labor, development, and public policy."""
|
| 494 |
+
|
| 495 |
+
return "I'm David Van Dijcke's research assistant. Please ask about his econometric methods, papers, or background. For best results, configure a Google API key."
|
| 496 |
+
|
| 497 |
+
# Create interface
|
| 498 |
+
def create_interface():
|
| 499 |
+
"""Create Gradio interface"""
|
| 500 |
+
assistant = OptimizedResearchAssistant()
|
| 501 |
+
|
| 502 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant") as interface:
|
| 503 |
+
gr.Markdown("""
|
| 504 |
+
# David Van Dijcke - Optimized Research Assistant
|
| 505 |
+
|
| 506 |
+
**Advanced Features:**
|
| 507 |
+
- Full paper loading with intelligent section extraction
|
| 508 |
+
- Smart retrieval based on query type
|
| 509 |
+
- Response caching for instant repeated queries
|
| 510 |
+
- Hierarchical search (sections + chunks)
|
| 511 |
+
|
| 512 |
+
Ask about David's econometric methods, research papers, or academic background.
|
| 513 |
+
""")
|
| 514 |
+
|
| 515 |
+
# API status
|
| 516 |
+
api_status = "✅ Full functionality enabled" if os.getenv("GOOGLE_API_KEY") else "⚠️ Limited mode"
|
| 517 |
+
gr.Markdown(f"**Status:** {api_status}")
|
| 518 |
+
|
| 519 |
+
chatbot = gr.Chatbot(height=500)
|
| 520 |
+
msg = gr.Textbox(
|
| 521 |
+
label="Your question",
|
| 522 |
+
placeholder="Example: Explain the technical innovations in David's job market paper",
|
| 523 |
+
lines=2
|
| 524 |
+
)
|
| 525 |
+
clear = gr.Button("Clear")
|
| 526 |
+
|
| 527 |
+
# Advanced examples
|
| 528 |
+
gr.Examples(
|
| 529 |
+
examples=[
|
| 530 |
+
"What are the key technical innovations in R3D? Explain the methodology in detail.",
|
| 531 |
+
"How does David apply optimal transport theory across his different papers?",
|
| 532 |
+
"Compare the identification strategies used in R3D versus FDR.",
|
| 533 |
+
"What makes David's approach to functional data analysis unique?",
|
| 534 |
+
"Explain how David's work on productivity relates to distributional outcomes.",
|
| 535 |
+
"What are David's main contributions to econometric theory and methods?",
|
| 536 |
+
"How do David's papers address policy-relevant questions?",
|
| 537 |
+
"What is David's research agenda and how do his papers fit together?"
|
| 538 |
+
],
|
| 539 |
+
inputs=msg
|
| 540 |
+
)
|
| 541 |
+
|
| 542 |
+
def respond(message, chat_history):
|
| 543 |
+
bot_message = assistant.answer_question(message)
|
| 544 |
+
chat_history.append((message, bot_message))
|
| 545 |
+
return "", chat_history
|
| 546 |
+
|
| 547 |
+
msg.submit(respond, [msg, chatbot], [msg, chatbot])
|
| 548 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 549 |
+
|
| 550 |
+
return interface
|
| 551 |
+
|
| 552 |
+
if __name__ == "__main__":
|
| 553 |
+
interface = create_interface()
|
| 554 |
+
interface.launch()
|
app_professional.py
ADDED
|
@@ -0,0 +1,233 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Professional Research Assistant
|
| 4 |
+
Clean chat interface with expert responses
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Tuple
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from dotenv import load_dotenv
|
| 12 |
+
import google.generativeai as genai
|
| 13 |
+
|
| 14 |
+
# Load environment variables
|
| 15 |
+
load_dotenv()
|
| 16 |
+
|
| 17 |
+
class ProfessionalAssistant:
|
| 18 |
+
"""Professional assistant that speaks as an expert about David's work"""
|
| 19 |
+
|
| 20 |
+
def __init__(self):
|
| 21 |
+
# Setup Gemini
|
| 22 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 23 |
+
if api_key:
|
| 24 |
+
genai.configure(api_key=api_key)
|
| 25 |
+
try:
|
| 26 |
+
self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 27 |
+
print("Using Gemini 2.5 Flash Preview")
|
| 28 |
+
except:
|
| 29 |
+
self.model = genai.GenerativeModel('gemini-1.5-flash')
|
| 30 |
+
print("Using Gemini 1.5 Flash")
|
| 31 |
+
else:
|
| 32 |
+
self.model = None
|
| 33 |
+
|
| 34 |
+
# Load all papers
|
| 35 |
+
self.papers = self._load_all_papers()
|
| 36 |
+
|
| 37 |
+
# Pre-load context
|
| 38 |
+
self.context = self._create_context()
|
| 39 |
+
|
| 40 |
+
def _load_all_papers(self) -> dict:
|
| 41 |
+
"""Load all papers completely"""
|
| 42 |
+
papers = {}
|
| 43 |
+
pdf_dir = "documents"
|
| 44 |
+
|
| 45 |
+
paper_files = {
|
| 46 |
+
"r3d": ("r3d_arxiv_4apr2025.pdf", "R3D (Job Market Paper)"),
|
| 47 |
+
"cv": ("CV_DavidVanDijcke.pdf", "CV"),
|
| 48 |
+
"fdr": ("fdr.pdf", "Free Discontinuity Regression"),
|
| 49 |
+
"disco": ("disco.pdf", "Distributional Synthetic Controls"),
|
| 50 |
+
"rto": ("rto.pdf", "Return to Office"),
|
| 51 |
+
"prodf": ("prodf.pdf", "Revenue Production Functions"),
|
| 52 |
+
"unmasking": ("unmasking_partisanship.pdf", "Unmasking Partisanship"),
|
| 53 |
+
"ukraine": ("van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Ukraine Alerts")
|
| 54 |
+
}
|
| 55 |
+
|
| 56 |
+
for key, (filename, title) in paper_files.items():
|
| 57 |
+
pdf_path = os.path.join(pdf_dir, filename)
|
| 58 |
+
if os.path.exists(pdf_path):
|
| 59 |
+
try:
|
| 60 |
+
loader = PyPDFLoader(pdf_path)
|
| 61 |
+
pages = loader.load()
|
| 62 |
+
text = "\n\n".join([p.page_content for p in pages])
|
| 63 |
+
papers[key] = {
|
| 64 |
+
"text": text,
|
| 65 |
+
"title": title,
|
| 66 |
+
"pages": len(pages)
|
| 67 |
+
}
|
| 68 |
+
print(f"Loaded {title}: {len(pages)} pages")
|
| 69 |
+
except Exception as e:
|
| 70 |
+
print(f"Error loading {filename}: {e}")
|
| 71 |
+
|
| 72 |
+
return papers
|
| 73 |
+
|
| 74 |
+
def _create_context(self) -> str:
|
| 75 |
+
"""Create comprehensive context from all papers"""
|
| 76 |
+
context_parts = []
|
| 77 |
+
|
| 78 |
+
# Add papers in priority order
|
| 79 |
+
priority_order = ["r3d", "cv", "fdr", "disco", "rto", "prodf"]
|
| 80 |
+
|
| 81 |
+
for key in priority_order:
|
| 82 |
+
if key in self.papers:
|
| 83 |
+
paper = self.papers[key]
|
| 84 |
+
# Add substantial excerpts
|
| 85 |
+
excerpt_length = 30000 if key == "r3d" else 15000
|
| 86 |
+
context_parts.append(f"\n[{paper['title']}]\n{paper['text'][:excerpt_length]}")
|
| 87 |
+
|
| 88 |
+
return "\n\n".join(context_parts)
|
| 89 |
+
|
| 90 |
+
def chat(self, message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
|
| 91 |
+
"""Chat with proper history handling"""
|
| 92 |
+
if not message.strip():
|
| 93 |
+
return "", history
|
| 94 |
+
|
| 95 |
+
if not self.model:
|
| 96 |
+
response = "I need a Google API key to provide detailed answers about David's research."
|
| 97 |
+
history.append((message, response))
|
| 98 |
+
return "", history
|
| 99 |
+
|
| 100 |
+
# Build conversation context
|
| 101 |
+
conversation = "Previous conversation:\n"
|
| 102 |
+
for human, assistant in history[-3:]: # Last 3 exchanges
|
| 103 |
+
conversation += f"User: {human}\nAssistant: {assistant}\n\n"
|
| 104 |
+
|
| 105 |
+
# Determine which papers to emphasize based on query
|
| 106 |
+
message_lower = message.lower()
|
| 107 |
+
specific_context = ""
|
| 108 |
+
|
| 109 |
+
if "job market" in message_lower or "r3d" in message_lower:
|
| 110 |
+
if "r3d" in self.papers:
|
| 111 |
+
specific_context = f"\n[R3D - Job Market Paper]\n{self.papers['r3d']['text'][:50000]}\n"
|
| 112 |
+
elif "fdr" in message_lower or "discontinuity" in message_lower:
|
| 113 |
+
if "fdr" in self.papers:
|
| 114 |
+
specific_context = f"\n[FDR Paper]\n{self.papers['fdr']['text'][:30000]}\n"
|
| 115 |
+
|
| 116 |
+
# Create prompt
|
| 117 |
+
prompt = f"""You are an expert assistant helping visitors learn about David Van Dijcke's research.
|
| 118 |
+
|
| 119 |
+
CRITICAL INSTRUCTIONS:
|
| 120 |
+
- You are NOT David - you are an expert explaining his work to website visitors
|
| 121 |
+
- Speak in third person about David (use "David" or "Van Dijcke", not "I" or "my")
|
| 122 |
+
- Be conversational but professional
|
| 123 |
+
- Give concise, informative answers (2-3 paragraphs max unless asked for details)
|
| 124 |
+
- Don't say "based on the provided papers" - just state facts confidently
|
| 125 |
+
- Focus on what makes his work innovative and important
|
| 126 |
+
|
| 127 |
+
Key facts:
|
| 128 |
+
- David is an econometrician on the 2025-26 job market from University of Michigan
|
| 129 |
+
- His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
|
| 130 |
+
- He specializes in functional data analysis and optimal transport methods
|
| 131 |
+
|
| 132 |
+
{conversation}
|
| 133 |
+
|
| 134 |
+
Full research context:
|
| 135 |
+
{self.context}
|
| 136 |
+
|
| 137 |
+
{specific_context}
|
| 138 |
+
|
| 139 |
+
Current question: {message}
|
| 140 |
+
|
| 141 |
+
Provide a concise, expert response:"""
|
| 142 |
+
|
| 143 |
+
try:
|
| 144 |
+
response = self.model.generate_content(prompt)
|
| 145 |
+
answer = response.text
|
| 146 |
+
history.append((message, answer))
|
| 147 |
+
return "", history
|
| 148 |
+
except Exception as e:
|
| 149 |
+
error_response = f"I encountered an error. Please try rephrasing your question."
|
| 150 |
+
history.append((message, error_response))
|
| 151 |
+
return "", history
|
| 152 |
+
|
| 153 |
+
# Create interface
|
| 154 |
+
def create_interface():
|
| 155 |
+
assistant = ProfessionalAssistant()
|
| 156 |
+
|
| 157 |
+
# Custom CSS for a clean look
|
| 158 |
+
custom_css = """
|
| 159 |
+
.gradio-container {
|
| 160 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', sans-serif;
|
| 161 |
+
max-width: 900px;
|
| 162 |
+
margin: auto;
|
| 163 |
+
}
|
| 164 |
+
.chatbot {
|
| 165 |
+
height: 500px !important;
|
| 166 |
+
}
|
| 167 |
+
.message {
|
| 168 |
+
font-size: 15px !important;
|
| 169 |
+
line-height: 1.6 !important;
|
| 170 |
+
}
|
| 171 |
+
"""
|
| 172 |
+
|
| 173 |
+
with gr.Blocks(title="David Van Dijcke | Research Assistant", css=custom_css) as demo:
|
| 174 |
+
gr.Markdown("""
|
| 175 |
+
## David Van Dijcke - Research Assistant
|
| 176 |
+
|
| 177 |
+
Welcome! I can help you learn about David Van Dijcke's econometric research. David is on the 2025-26 academic job market.
|
| 178 |
+
|
| 179 |
+
**Job Market Paper:** R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
|
| 180 |
+
""")
|
| 181 |
+
|
| 182 |
+
chatbot = gr.Chatbot(
|
| 183 |
+
value=[],
|
| 184 |
+
elem_classes=["chatbot"],
|
| 185 |
+
bubble_full_width=False,
|
| 186 |
+
avatar_images=(None, None),
|
| 187 |
+
show_label=False
|
| 188 |
+
)
|
| 189 |
+
|
| 190 |
+
with gr.Row():
|
| 191 |
+
msg = gr.Textbox(
|
| 192 |
+
show_label=False,
|
| 193 |
+
placeholder="Ask about David's research, methods, or papers...",
|
| 194 |
+
elem_classes=["message-input"],
|
| 195 |
+
scale=4
|
| 196 |
+
)
|
| 197 |
+
submit = gr.Button("Send", scale=1, variant="primary")
|
| 198 |
+
|
| 199 |
+
# Clear button
|
| 200 |
+
clear = gr.Button("Clear conversation", size="sm")
|
| 201 |
+
|
| 202 |
+
# Examples in a nice layout
|
| 203 |
+
gr.Examples(
|
| 204 |
+
examples=[
|
| 205 |
+
"What is David's job market paper about?",
|
| 206 |
+
"What makes R3D innovative?",
|
| 207 |
+
"What are the practical applications of R3D?",
|
| 208 |
+
"Tell me about David's other research besides R3D",
|
| 209 |
+
"What makes David a strong candidate for an econometrics position?"
|
| 210 |
+
],
|
| 211 |
+
inputs=msg,
|
| 212 |
+
label="Example questions:"
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
# Event handlers
|
| 216 |
+
msg.submit(assistant.chat, [msg, chatbot], [msg, chatbot])
|
| 217 |
+
submit.click(assistant.chat, [msg, chatbot], [msg, chatbot])
|
| 218 |
+
clear.click(lambda: [], None, chatbot, queue=False)
|
| 219 |
+
|
| 220 |
+
gr.Markdown("""
|
| 221 |
+
---
|
| 222 |
+
*This assistant has access to David's complete research portfolio including published papers, working papers, and CV.*
|
| 223 |
+
""")
|
| 224 |
+
|
| 225 |
+
return demo
|
| 226 |
+
|
| 227 |
+
if __name__ == "__main__":
|
| 228 |
+
interface = create_interface()
|
| 229 |
+
interface.launch(
|
| 230 |
+
server_name="127.0.0.1",
|
| 231 |
+
server_port=7860,
|
| 232 |
+
show_error=True
|
| 233 |
+
)
|
app_simple_chat.py
ADDED
|
@@ -0,0 +1,124 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Simple Chat Assistant
|
| 4 |
+
Minimal implementation without Gradio complications
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Dict
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from dotenv import load_dotenv
|
| 12 |
+
import google.generativeai as genai
|
| 13 |
+
|
| 14 |
+
# Load environment variables
|
| 15 |
+
load_dotenv()
|
| 16 |
+
|
| 17 |
+
class SimpleChatAssistant:
|
| 18 |
+
"""Simple assistant without complex features"""
|
| 19 |
+
|
| 20 |
+
def __init__(self):
|
| 21 |
+
# Setup Gemini
|
| 22 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 23 |
+
if api_key:
|
| 24 |
+
genai.configure(api_key=api_key)
|
| 25 |
+
try:
|
| 26 |
+
self.model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 27 |
+
print("Using Gemini 2.5 Flash Preview")
|
| 28 |
+
except:
|
| 29 |
+
self.model = genai.GenerativeModel('gemini-1.5-flash')
|
| 30 |
+
print("Using Gemini 1.5 Flash")
|
| 31 |
+
else:
|
| 32 |
+
self.model = None
|
| 33 |
+
print("No API key found")
|
| 34 |
+
|
| 35 |
+
# Load papers
|
| 36 |
+
self.papers = self._load_papers()
|
| 37 |
+
|
| 38 |
+
def _load_papers(self) -> Dict[str, str]:
|
| 39 |
+
"""Load key papers"""
|
| 40 |
+
papers = {}
|
| 41 |
+
pdf_dir = "documents"
|
| 42 |
+
|
| 43 |
+
key_files = [
|
| 44 |
+
("r3d", "r3d_arxiv_4apr2025.pdf"),
|
| 45 |
+
("cv", "CV_DavidVanDijcke.pdf"),
|
| 46 |
+
("fdr", "fdr.pdf")
|
| 47 |
+
]
|
| 48 |
+
|
| 49 |
+
for key, filename in key_files:
|
| 50 |
+
pdf_path = os.path.join(pdf_dir, filename)
|
| 51 |
+
if os.path.exists(pdf_path):
|
| 52 |
+
try:
|
| 53 |
+
loader = PyPDFLoader(pdf_path)
|
| 54 |
+
pages = loader.load()
|
| 55 |
+
text = "\n\n".join([p.page_content for p in pages])
|
| 56 |
+
papers[key] = text
|
| 57 |
+
print(f"Loaded {filename}: {len(pages)} pages")
|
| 58 |
+
except Exception as e:
|
| 59 |
+
print(f"Error loading {filename}: {e}")
|
| 60 |
+
|
| 61 |
+
return papers
|
| 62 |
+
|
| 63 |
+
def chat(self, message: str) -> str:
|
| 64 |
+
"""Simple chat function"""
|
| 65 |
+
if not message.strip():
|
| 66 |
+
return "What would you like to know about David's research?"
|
| 67 |
+
|
| 68 |
+
if not self.model:
|
| 69 |
+
return "Please set up your Google API key to use the assistant."
|
| 70 |
+
|
| 71 |
+
# Build context
|
| 72 |
+
context = ""
|
| 73 |
+
|
| 74 |
+
# Add relevant paper based on query
|
| 75 |
+
message_lower = message.lower()
|
| 76 |
+
|
| 77 |
+
if "job market" in message_lower or "jmp" in message_lower:
|
| 78 |
+
if "r3d" in self.papers:
|
| 79 |
+
context = f"[JOB MARKET PAPER - R3D]\n\n{self.papers['r3d'][:50000]}"
|
| 80 |
+
elif "cv" in message_lower or "background" in message_lower:
|
| 81 |
+
if "cv" in self.papers:
|
| 82 |
+
context = f"[CV]\n\n{self.papers['cv'][:20000]}"
|
| 83 |
+
else:
|
| 84 |
+
# Add some context from each paper
|
| 85 |
+
for key, text in self.papers.items():
|
| 86 |
+
context += f"\n[{key.upper()}]\n{text[:10000]}\n"
|
| 87 |
+
|
| 88 |
+
# Create prompt
|
| 89 |
+
prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
|
| 90 |
+
|
| 91 |
+
His JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
|
| 92 |
+
|
| 93 |
+
Context from papers:
|
| 94 |
+
{context}
|
| 95 |
+
|
| 96 |
+
Question: {message}
|
| 97 |
+
|
| 98 |
+
Provide a helpful, conversational response:"""
|
| 99 |
+
|
| 100 |
+
try:
|
| 101 |
+
response = self.model.generate_content(prompt)
|
| 102 |
+
return response.text
|
| 103 |
+
except Exception as e:
|
| 104 |
+
return f"Error: {str(e)}"
|
| 105 |
+
|
| 106 |
+
# Create simple interface
|
| 107 |
+
assistant = SimpleChatAssistant()
|
| 108 |
+
|
| 109 |
+
# Create the most basic Gradio interface possible
|
| 110 |
+
iface = gr.Interface(
|
| 111 |
+
fn=assistant.chat,
|
| 112 |
+
inputs=gr.Textbox(lines=2, placeholder="Ask about David's research..."),
|
| 113 |
+
outputs="text",
|
| 114 |
+
title="David Van Dijcke - Research Assistant",
|
| 115 |
+
description="Ask about David's job market paper (R3D) and research",
|
| 116 |
+
examples=[
|
| 117 |
+
"What is David's job market paper about?",
|
| 118 |
+
"What makes R3D innovative?",
|
| 119 |
+
"What is the use of R3D?"
|
| 120 |
+
]
|
| 121 |
+
)
|
| 122 |
+
|
| 123 |
+
if __name__ == "__main__":
|
| 124 |
+
iface.launch(server_name="127.0.0.1", server_port=7860)
|
app_sota.py
ADDED
|
@@ -0,0 +1,341 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - State-of-the-Art Research Assistant
|
| 4 |
+
Uses modern LLM capabilities: full document context, native PDF handling, and advanced prompting
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import base64
|
| 9 |
+
from typing import List, Dict, Optional, Tuple
|
| 10 |
+
import gradio as gr
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from pypdf import PdfReader
|
| 13 |
+
from dotenv import load_dotenv
|
| 14 |
+
import google.generativeai as genai
|
| 15 |
+
|
| 16 |
+
# Load environment variables
|
| 17 |
+
load_dotenv()
|
| 18 |
+
|
| 19 |
+
class StateOfTheArtAssistant:
|
| 20 |
+
"""Uses Gemini's full capabilities - large context window and native understanding"""
|
| 21 |
+
|
| 22 |
+
def __init__(self):
|
| 23 |
+
"""Initialize with modern approach"""
|
| 24 |
+
# Setup Gemini with best model
|
| 25 |
+
self.llm = self._setup_advanced_llm()
|
| 26 |
+
|
| 27 |
+
# Load all papers into memory at once
|
| 28 |
+
self.papers_full_text = self._load_all_papers_full()
|
| 29 |
+
|
| 30 |
+
# Create a single mega-context with all papers
|
| 31 |
+
self.mega_context = self._create_mega_context()
|
| 32 |
+
|
| 33 |
+
# Pre-load common contexts into Gemini's memory
|
| 34 |
+
self.initialized = False
|
| 35 |
+
self._initialize_assistant()
|
| 36 |
+
|
| 37 |
+
def _setup_advanced_llm(self):
|
| 38 |
+
"""Setup most capable Gemini model"""
|
| 39 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 40 |
+
|
| 41 |
+
if not api_key:
|
| 42 |
+
raise ValueError("Google API key is required for state-of-the-art performance")
|
| 43 |
+
|
| 44 |
+
genai.configure(api_key=api_key)
|
| 45 |
+
|
| 46 |
+
# Try to use the most capable model available
|
| 47 |
+
try:
|
| 48 |
+
# Gemini 2.5 Flash Preview - Latest and most capable
|
| 49 |
+
model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 50 |
+
print("Using Gemini 2.5 Flash Preview - Latest model with enhanced capabilities")
|
| 51 |
+
return model
|
| 52 |
+
except Exception as e:
|
| 53 |
+
print(f"Could not load Gemini 2.5 Flash Preview: {e}")
|
| 54 |
+
try:
|
| 55 |
+
# Fallback to 1.5 Pro
|
| 56 |
+
model = genai.GenerativeModel('gemini-1.5-pro-002')
|
| 57 |
+
print("Using Gemini 1.5 Pro as fallback")
|
| 58 |
+
return model
|
| 59 |
+
except:
|
| 60 |
+
try:
|
| 61 |
+
# Second fallback
|
| 62 |
+
model = genai.GenerativeModel('gemini-1.5-flash-002')
|
| 63 |
+
print("Using Gemini 1.5 Flash as fallback")
|
| 64 |
+
return model
|
| 65 |
+
except:
|
| 66 |
+
# Last resort
|
| 67 |
+
model = genai.GenerativeModel('gemini-1.5-flash')
|
| 68 |
+
print("Using Gemini 1.5 Flash (base) as last resort")
|
| 69 |
+
return model
|
| 70 |
+
|
| 71 |
+
def _load_all_papers_full(self) -> Dict[str, str]:
|
| 72 |
+
"""Load complete papers without chunking"""
|
| 73 |
+
papers = {}
|
| 74 |
+
pdf_dir = "documents"
|
| 75 |
+
|
| 76 |
+
# Define papers with priority order (job market paper first)
|
| 77 |
+
paper_files = [
|
| 78 |
+
("r3d", "r3d_arxiv_4apr2025.pdf", "JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"),
|
| 79 |
+
("cv", "CV_DavidVanDijcke.pdf", "CURRICULUM VITAE"),
|
| 80 |
+
("fdr", "fdr.pdf", "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns"),
|
| 81 |
+
("disco", "disco.pdf", "disco: Distributional Synthetic Controls"),
|
| 82 |
+
("rto", "rto.pdf", "Return to Office and the Tenure Distribution"),
|
| 83 |
+
("prodf", "prodf.pdf", "On the Non-Identification of Revenue Production Functions"),
|
| 84 |
+
("unmasking", "unmasking_partisanship.pdf", "Unmasking Partisanship: Polarization Undermines Public Response to Collective Risk"),
|
| 85 |
+
("ukraine", "van-dijcke-et-al-public-response-to-government-alerts-saves-lives-during-russian-invasion-of-ukraine.pdf", "Public Response to Government Alerts Saves Lives During Russian Invasion of Ukraine")
|
| 86 |
+
]
|
| 87 |
+
|
| 88 |
+
for key, filename, title in paper_files:
|
| 89 |
+
pdf_path = os.path.join(pdf_dir, filename)
|
| 90 |
+
if os.path.exists(pdf_path):
|
| 91 |
+
try:
|
| 92 |
+
# Read PDF completely
|
| 93 |
+
with open(pdf_path, 'rb') as file:
|
| 94 |
+
pdf_reader = PdfReader(file)
|
| 95 |
+
|
| 96 |
+
# Extract all text at once
|
| 97 |
+
full_text = f"\n\n{'='*80}\n{title}\n{'='*80}\n\n"
|
| 98 |
+
|
| 99 |
+
for page_num, page in enumerate(pdf_reader.pages, 1):
|
| 100 |
+
text = page.extract_text()
|
| 101 |
+
if text.strip():
|
| 102 |
+
full_text += f"\n[Page {page_num}]\n{text}\n"
|
| 103 |
+
|
| 104 |
+
papers[key] = full_text
|
| 105 |
+
print(f"Loaded {title}: {len(full_text):,} characters")
|
| 106 |
+
|
| 107 |
+
except Exception as e:
|
| 108 |
+
print(f"Error loading {filename}: {e}")
|
| 109 |
+
|
| 110 |
+
return papers
|
| 111 |
+
|
| 112 |
+
def _create_mega_context(self) -> str:
|
| 113 |
+
"""Create single context with all papers for Gemini to process"""
|
| 114 |
+
mega_context = """COMPLETE RESEARCH PORTFOLIO OF DAVID VAN DIJCKE
|
| 115 |
+
Econometrician on the 2025-26 Job Market
|
| 116 |
+
University of Michigan
|
| 117 |
+
|
| 118 |
+
IMPORTANT: David's JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
|
| 119 |
+
|
| 120 |
+
Below are ALL of David's papers in full text:
|
| 121 |
+
|
| 122 |
+
"""
|
| 123 |
+
|
| 124 |
+
# Add papers in priority order
|
| 125 |
+
for key, full_text in self.papers_full_text.items():
|
| 126 |
+
mega_context += full_text + "\n\n"
|
| 127 |
+
|
| 128 |
+
print(f"Total context size: {len(mega_context):,} characters (~{len(mega_context)//4:,} tokens)")
|
| 129 |
+
|
| 130 |
+
return mega_context
|
| 131 |
+
|
| 132 |
+
def _initialize_assistant(self):
|
| 133 |
+
"""Pre-load context into Gemini for faster responses"""
|
| 134 |
+
if self.initialized or not self.llm:
|
| 135 |
+
return
|
| 136 |
+
|
| 137 |
+
try:
|
| 138 |
+
# Create a chat session with the full context pre-loaded
|
| 139 |
+
self.chat = self.llm.start_chat(history=[
|
| 140 |
+
{
|
| 141 |
+
"role": "user",
|
| 142 |
+
"parts": [f"""You are David Van Dijcke's research assistant. I'm providing you with his COMPLETE research portfolio to answer questions about.
|
| 143 |
+
|
| 144 |
+
{self.mega_context}
|
| 145 |
+
|
| 146 |
+
REMEMBER:
|
| 147 |
+
1. David is on the 2025-26 economics job market
|
| 148 |
+
2. His JOB MARKET PAPER is R3D
|
| 149 |
+
3. He's from University of Michigan
|
| 150 |
+
4. He specializes in econometric methods for functional and distributional data
|
| 151 |
+
|
| 152 |
+
Please confirm you've loaded all the papers."""]
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"role": "model",
|
| 156 |
+
"parts": ["""I've successfully loaded David Van Dijcke's complete research portfolio. I have access to:
|
| 157 |
+
|
| 158 |
+
1. **JOB MARKET PAPER**: R3D - Regression Discontinuity Design with Distribution-Valued Outcomes
|
| 159 |
+
2. His CV
|
| 160 |
+
3. Free Discontinuity Regression (FDR)
|
| 161 |
+
4. disco: Distributional Synthetic Controls
|
| 162 |
+
5. Return to Office and the Tenure Distribution
|
| 163 |
+
6. Revenue Production Functions paper
|
| 164 |
+
7. Published work on COVID/masks and Ukraine
|
| 165 |
+
|
| 166 |
+
I'm ready to answer any questions about David's research, methods, contributions, or background with full context from all his papers."""]
|
| 167 |
+
}
|
| 168 |
+
])
|
| 169 |
+
|
| 170 |
+
self.initialized = True
|
| 171 |
+
print("Assistant initialized with full paper context")
|
| 172 |
+
|
| 173 |
+
except Exception as e:
|
| 174 |
+
print(f"Could not pre-initialize: {e}")
|
| 175 |
+
self.chat = None
|
| 176 |
+
|
| 177 |
+
def answer_question(self, query: str, chat_history: List = None) -> str:
|
| 178 |
+
"""Answer using full context already loaded in Gemini"""
|
| 179 |
+
if not query.strip():
|
| 180 |
+
return "What would you like to know about David's research?"
|
| 181 |
+
|
| 182 |
+
try:
|
| 183 |
+
if self.chat:
|
| 184 |
+
# Use existing chat with pre-loaded context
|
| 185 |
+
response = self.chat.send_message(f"""Based on the complete papers I have loaded, please answer this question:
|
| 186 |
+
|
| 187 |
+
{query}
|
| 188 |
+
|
| 189 |
+
Important guidelines:
|
| 190 |
+
- Be conversational and accessible
|
| 191 |
+
- For technical questions, explain both intuition AND technical details
|
| 192 |
+
- Always specify which paper you're referencing
|
| 193 |
+
- For job market paper questions, focus on R3D
|
| 194 |
+
- Highlight what makes David's work unique and impactful
|
| 195 |
+
- Use specific examples from the papers""")
|
| 196 |
+
|
| 197 |
+
return response.text
|
| 198 |
+
|
| 199 |
+
else:
|
| 200 |
+
# Fallback: Send everything in one request
|
| 201 |
+
prompt = f"""You are David Van Dijcke's research assistant. Based on his complete research portfolio below, answer the question.
|
| 202 |
+
|
| 203 |
+
{self.mega_context}
|
| 204 |
+
|
| 205 |
+
Question: {query}
|
| 206 |
+
|
| 207 |
+
Guidelines:
|
| 208 |
+
- Be conversational and accessible
|
| 209 |
+
- For technical questions, explain both intuition AND technical details
|
| 210 |
+
- Always specify which paper you're referencing
|
| 211 |
+
- For job market paper questions, focus on R3D
|
| 212 |
+
- Highlight what makes David's work unique and impactful
|
| 213 |
+
|
| 214 |
+
Answer:"""
|
| 215 |
+
|
| 216 |
+
response = self.llm.generate_content(prompt)
|
| 217 |
+
return response.text
|
| 218 |
+
|
| 219 |
+
except Exception as e:
|
| 220 |
+
print(f"Error: {e}")
|
| 221 |
+
|
| 222 |
+
# Try with truncated context if we hit limits
|
| 223 |
+
try:
|
| 224 |
+
# Focus on job market paper and CV
|
| 225 |
+
limited_context = self.papers_full_text.get("r3d", "")[:50000] + "\n\n" + self.papers_full_text.get("cv", "")[:20000]
|
| 226 |
+
|
| 227 |
+
prompt = f"""Answer based on David Van Dijcke's job market paper (R3D) and CV:
|
| 228 |
+
|
| 229 |
+
{limited_context}
|
| 230 |
+
|
| 231 |
+
Question: {query}
|
| 232 |
+
|
| 233 |
+
Answer conversationally:"""
|
| 234 |
+
|
| 235 |
+
response = self.llm.generate_content(prompt)
|
| 236 |
+
return response.text
|
| 237 |
+
|
| 238 |
+
except:
|
| 239 |
+
return "I'm having trouble processing that request. Please try rephrasing or asking about a specific paper."
|
| 240 |
+
|
| 241 |
+
# Create modern interface
|
| 242 |
+
def create_interface():
|
| 243 |
+
"""Create state-of-the-art interface"""
|
| 244 |
+
|
| 245 |
+
# Initialize assistant
|
| 246 |
+
try:
|
| 247 |
+
assistant = StateOfTheArtAssistant()
|
| 248 |
+
status_message = "✅ Assistant loaded with full paper context"
|
| 249 |
+
except Exception as e:
|
| 250 |
+
print(f"Initialization error: {e}")
|
| 251 |
+
assistant = None
|
| 252 |
+
status_message = "❌ Error: Please check your Google API key"
|
| 253 |
+
|
| 254 |
+
def chat(message, history):
|
| 255 |
+
if not assistant:
|
| 256 |
+
return "", history + [[message, "Please set up your Google API key to use the assistant."]]
|
| 257 |
+
|
| 258 |
+
if history is None:
|
| 259 |
+
history = []
|
| 260 |
+
|
| 261 |
+
response = assistant.answer_question(message, history)
|
| 262 |
+
history.append([message, response])
|
| 263 |
+
return "", history
|
| 264 |
+
|
| 265 |
+
# Custom CSS for modern look
|
| 266 |
+
custom_css = """
|
| 267 |
+
.gradio-container {
|
| 268 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 269 |
+
}
|
| 270 |
+
.user-message, .bot-message {
|
| 271 |
+
padding: 12px 16px !important;
|
| 272 |
+
border-radius: 8px !important;
|
| 273 |
+
}
|
| 274 |
+
"""
|
| 275 |
+
|
| 276 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant", css=custom_css) as demo:
|
| 277 |
+
|
| 278 |
+
gr.Markdown(f"""
|
| 279 |
+
# David Van Dijcke - AI Research Assistant
|
| 280 |
+
|
| 281 |
+
**Econometrician** | **2025-26 Job Market** | **University of Michigan**
|
| 282 |
+
|
| 283 |
+
{status_message}
|
| 284 |
+
""")
|
| 285 |
+
|
| 286 |
+
chatbot = gr.Chatbot(
|
| 287 |
+
height=500,
|
| 288 |
+
show_label=False,
|
| 289 |
+
elem_classes=["chatbot"]
|
| 290 |
+
)
|
| 291 |
+
|
| 292 |
+
with gr.Row():
|
| 293 |
+
msg = gr.Textbox(
|
| 294 |
+
label="Ask anything about David's research",
|
| 295 |
+
placeholder="What makes R3D innovative? How does David use optimal transport? What are his main contributions?",
|
| 296 |
+
lines=2,
|
| 297 |
+
scale=4
|
| 298 |
+
)
|
| 299 |
+
submit = gr.Button("Send", variant="primary", scale=1)
|
| 300 |
+
|
| 301 |
+
clear = gr.Button("Clear Conversation")
|
| 302 |
+
|
| 303 |
+
# Example queries organized by category
|
| 304 |
+
with gr.Accordion("Example Questions", open=True):
|
| 305 |
+
gr.Examples(
|
| 306 |
+
examples=[
|
| 307 |
+
"What is David's job market paper about and why is it important?",
|
| 308 |
+
"Explain R3D's methodology - both the intuition and technical details",
|
| 309 |
+
"How does David's work on optimal transport connect across his papers?",
|
| 310 |
+
"What real-world policy questions can R3D help answer?",
|
| 311 |
+
"Compare David's approach in R3D versus traditional RDD",
|
| 312 |
+
"What makes David uniquely qualified for an econometrics position?",
|
| 313 |
+
"How does the FDR paper relate to the job market paper?",
|
| 314 |
+
"What are the key identification strategies across David's papers?",
|
| 315 |
+
"Explain the practical applications of distributional synthetic controls",
|
| 316 |
+
"What broader research agenda do David's papers represent?"
|
| 317 |
+
],
|
| 318 |
+
inputs=msg,
|
| 319 |
+
label="Click any example to try it"
|
| 320 |
+
)
|
| 321 |
+
|
| 322 |
+
# Event handlers
|
| 323 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 324 |
+
submit.click(chat, [msg, chatbot], [msg, chatbot])
|
| 325 |
+
clear.click(lambda: [], None, chatbot)
|
| 326 |
+
|
| 327 |
+
gr.Markdown("""
|
| 328 |
+
---
|
| 329 |
+
💡 **Tip**: This assistant has David's complete papers loaded. Ask technical questions, request comparisons across papers, or explore specific methodological details.
|
| 330 |
+
""")
|
| 331 |
+
|
| 332 |
+
return demo
|
| 333 |
+
|
| 334 |
+
if __name__ == "__main__":
|
| 335 |
+
interface = create_interface()
|
| 336 |
+
interface.launch(
|
| 337 |
+
server_name="0.0.0.0",
|
| 338 |
+
server_port=7860,
|
| 339 |
+
share=False,
|
| 340 |
+
quiet=True
|
| 341 |
+
)
|
app_stable.py
ADDED
|
@@ -0,0 +1,355 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Stable Research Assistant
|
| 4 |
+
A simplified, stable version that avoids dependency conflicts
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Dict, Optional
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 12 |
+
from langchain_community.vectorstores import FAISS
|
| 13 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 14 |
+
from dotenv import load_dotenv
|
| 15 |
+
import google.generativeai as genai
|
| 16 |
+
|
| 17 |
+
# Load environment variables
|
| 18 |
+
load_dotenv()
|
| 19 |
+
|
| 20 |
+
class StableResearchAssistant:
|
| 21 |
+
"""Stable assistant with minimal dependencies"""
|
| 22 |
+
|
| 23 |
+
def __init__(self):
|
| 24 |
+
"""Initialize with stable configuration"""
|
| 25 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 26 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 27 |
+
)
|
| 28 |
+
|
| 29 |
+
# Load all papers into memory
|
| 30 |
+
self.papers = self._load_papers()
|
| 31 |
+
|
| 32 |
+
# Create simple vector store
|
| 33 |
+
self.vector_store = self._create_vector_store()
|
| 34 |
+
|
| 35 |
+
# Setup LLM
|
| 36 |
+
self.llm = self._setup_llm()
|
| 37 |
+
|
| 38 |
+
def _load_papers(self) -> Dict[str, Dict]:
|
| 39 |
+
"""Load all papers into memory"""
|
| 40 |
+
papers = {}
|
| 41 |
+
pdf_dir = "documents"
|
| 42 |
+
|
| 43 |
+
paper_metadata = {
|
| 44 |
+
"r3d": {
|
| 45 |
+
"file": "r3d_arxiv_4apr2025.pdf",
|
| 46 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 47 |
+
"type": "JOB MARKET PAPER",
|
| 48 |
+
"description": "Extends RDD to analyze entire outcome distributions using optimal transport"
|
| 49 |
+
},
|
| 50 |
+
"fdr": {
|
| 51 |
+
"file": "fdr.pdf",
|
| 52 |
+
"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
|
| 53 |
+
"type": "Working Paper",
|
| 54 |
+
"description": "Estimates regression functions with unknown discontinuity locations"
|
| 55 |
+
},
|
| 56 |
+
"disco": {
|
| 57 |
+
"file": "disco.pdf",
|
| 58 |
+
"title": "disco: Distributional Synthetic Controls",
|
| 59 |
+
"type": "Working Paper",
|
| 60 |
+
"description": "Stata package for distributional synthetic control methods"
|
| 61 |
+
},
|
| 62 |
+
"rto": {
|
| 63 |
+
"file": "rto.pdf",
|
| 64 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 65 |
+
"type": "Working Paper",
|
| 66 |
+
"description": "Analyzes impact of return-to-office mandates on employee tenure"
|
| 67 |
+
},
|
| 68 |
+
"prodf": {
|
| 69 |
+
"file": "prodf.pdf",
|
| 70 |
+
"title": "On the Non-Identification of Revenue Production Functions",
|
| 71 |
+
"type": "Working Paper",
|
| 72 |
+
"description": "Shows non-identification of production functions with revenue data"
|
| 73 |
+
},
|
| 74 |
+
"cv": {
|
| 75 |
+
"file": "CV_DavidVanDijcke.pdf",
|
| 76 |
+
"title": "Curriculum Vitae",
|
| 77 |
+
"type": "CV",
|
| 78 |
+
"description": "David Van Dijcke's academic CV"
|
| 79 |
+
}
|
| 80 |
+
}
|
| 81 |
+
|
| 82 |
+
for key, metadata in paper_metadata.items():
|
| 83 |
+
pdf_path = os.path.join(pdf_dir, metadata["file"])
|
| 84 |
+
if os.path.exists(pdf_path):
|
| 85 |
+
try:
|
| 86 |
+
loader = PyPDFLoader(pdf_path)
|
| 87 |
+
pages = loader.load()
|
| 88 |
+
|
| 89 |
+
# Store full text with metadata
|
| 90 |
+
full_text = "\n\n".join([p.page_content for p in pages])
|
| 91 |
+
papers[key] = {
|
| 92 |
+
"text": full_text,
|
| 93 |
+
"pages": len(pages),
|
| 94 |
+
"filename": metadata["file"],
|
| 95 |
+
"title": metadata["title"],
|
| 96 |
+
"type": metadata["type"],
|
| 97 |
+
"description": metadata["description"]
|
| 98 |
+
}
|
| 99 |
+
print(f"Loaded {metadata['title']}: {len(pages)} pages")
|
| 100 |
+
|
| 101 |
+
except Exception as e:
|
| 102 |
+
print(f"Error loading {metadata['file']}: {e}")
|
| 103 |
+
|
| 104 |
+
return papers
|
| 105 |
+
|
| 106 |
+
def _create_vector_store(self) -> Optional[FAISS]:
|
| 107 |
+
"""Create vector store from papers"""
|
| 108 |
+
try:
|
| 109 |
+
# Create documents with larger chunks
|
| 110 |
+
documents = []
|
| 111 |
+
text_splitter = RecursiveCharacterTextSplitter(
|
| 112 |
+
chunk_size=1500,
|
| 113 |
+
chunk_overlap=150
|
| 114 |
+
)
|
| 115 |
+
|
| 116 |
+
for key, paper in self.papers.items():
|
| 117 |
+
# Split text
|
| 118 |
+
chunks = text_splitter.split_text(paper["text"])
|
| 119 |
+
|
| 120 |
+
# Create documents
|
| 121 |
+
for i, chunk in enumerate(chunks):
|
| 122 |
+
from langchain.schema import Document
|
| 123 |
+
doc = Document(
|
| 124 |
+
page_content=chunk,
|
| 125 |
+
metadata={"source": key, "chunk": i}
|
| 126 |
+
)
|
| 127 |
+
documents.append(doc)
|
| 128 |
+
|
| 129 |
+
# Create vector store
|
| 130 |
+
if documents:
|
| 131 |
+
return FAISS.from_documents(documents, self.embeddings)
|
| 132 |
+
|
| 133 |
+
except Exception as e:
|
| 134 |
+
print(f"Error creating vector store: {e}")
|
| 135 |
+
|
| 136 |
+
return None
|
| 137 |
+
|
| 138 |
+
def _setup_llm(self):
|
| 139 |
+
"""Setup Gemini LLM"""
|
| 140 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 141 |
+
|
| 142 |
+
if api_key:
|
| 143 |
+
try:
|
| 144 |
+
genai.configure(api_key=api_key)
|
| 145 |
+
return genai.GenerativeModel('gemini-1.5-flash')
|
| 146 |
+
except Exception as e:
|
| 147 |
+
print(f"Error setting up Gemini: {e}")
|
| 148 |
+
|
| 149 |
+
return None
|
| 150 |
+
|
| 151 |
+
def answer_question(self, query: str, chat_history: List = None) -> str:
|
| 152 |
+
"""Answer questions about David's research"""
|
| 153 |
+
if not query.strip():
|
| 154 |
+
return "Please ask a question about David Van Dijcke's research."
|
| 155 |
+
|
| 156 |
+
# Get relevant context
|
| 157 |
+
context = self._get_context(query)
|
| 158 |
+
|
| 159 |
+
# Generate response
|
| 160 |
+
if self.llm:
|
| 161 |
+
prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market.
|
| 162 |
+
|
| 163 |
+
IMPORTANT: The context below contains labeled sections from David's actual papers. Pay attention to the labels like [JOB MARKET PAPER], [CURRICULUM VITAE], etc.
|
| 164 |
+
|
| 165 |
+
Context:
|
| 166 |
+
{context}
|
| 167 |
+
|
| 168 |
+
Question: {query}
|
| 169 |
+
|
| 170 |
+
Instructions:
|
| 171 |
+
- Answer based ONLY on the provided context
|
| 172 |
+
- If the context mentions "JOB MARKET PAPER", that refers to R3D
|
| 173 |
+
- Be specific and cite the paper titles when relevant
|
| 174 |
+
- For job market paper questions, focus on the R3D paper
|
| 175 |
+
|
| 176 |
+
Answer:"""
|
| 177 |
+
|
| 178 |
+
try:
|
| 179 |
+
response = self.llm.generate_content(prompt)
|
| 180 |
+
return response.text
|
| 181 |
+
except Exception as e:
|
| 182 |
+
print(f"Error generating response: {e}")
|
| 183 |
+
return self._fallback_response(query)
|
| 184 |
+
else:
|
| 185 |
+
return self._fallback_response(query)
|
| 186 |
+
|
| 187 |
+
def _get_context(self, query: str) -> str:
|
| 188 |
+
"""Get relevant context for query"""
|
| 189 |
+
query_lower = query.lower()
|
| 190 |
+
contexts = []
|
| 191 |
+
|
| 192 |
+
# CRITICAL: Check for job market paper queries FIRST
|
| 193 |
+
if any(phrase in query_lower for phrase in ["job market", "jmp", "job market paper", "what is his job market"]):
|
| 194 |
+
# Add R3D paper info with clear labeling
|
| 195 |
+
if "r3d" in self.papers:
|
| 196 |
+
paper = self.papers["r3d"]
|
| 197 |
+
context = f"[JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes]\n\n"
|
| 198 |
+
context += f"This is David Van Dijcke's JOB MARKET PAPER.\n\n"
|
| 199 |
+
context += paper["text"][:20000] # Get first ~20k chars
|
| 200 |
+
contexts.append(context)
|
| 201 |
+
# Return immediately for job market paper queries
|
| 202 |
+
return context
|
| 203 |
+
|
| 204 |
+
# Check for general "what's up" or greeting
|
| 205 |
+
if any(phrase in query_lower for phrase in ["what's up", "whats up", "hello", "hi"]):
|
| 206 |
+
intro = """David Van Dijcke is an econometrician on the 2025-26 academic job market from the University of Michigan.
|
| 207 |
+
|
| 208 |
+
His JOB MARKET PAPER is "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes" which extends regression discontinuity design to analyze entire outcome distributions using optimal transport theory.
|
| 209 |
+
|
| 210 |
+
He has developed several innovative econometric methods including:
|
| 211 |
+
- R3D (Job Market Paper): Distribution-valued RDD
|
| 212 |
+
- Free Discontinuity Regression (FDR): Estimating regressions with unknown discontinuities
|
| 213 |
+
- Distributional Synthetic Controls (disco): A Stata package
|
| 214 |
+
- Work on non-identification of revenue production functions
|
| 215 |
+
|
| 216 |
+
Ask me about any of his papers or methods!"""
|
| 217 |
+
contexts.append(intro)
|
| 218 |
+
|
| 219 |
+
# Check for David/CV queries
|
| 220 |
+
if any(word in query_lower for word in ["david", "who", "background", "cv", "about"]):
|
| 221 |
+
if "cv" in self.papers:
|
| 222 |
+
cv_context = f"[CURRICULUM VITAE]\n\n{self.papers['cv']['text'][:8000]}"
|
| 223 |
+
contexts.append(cv_context)
|
| 224 |
+
|
| 225 |
+
# Check for specific paper mentions
|
| 226 |
+
paper_keywords = {
|
| 227 |
+
"r3d": ["r3d", "regression discontinuity", "distribution", "optimal transport", "wasserstein"],
|
| 228 |
+
"fdr": ["fdr", "free discontinuity", "internet shutdown"],
|
| 229 |
+
"rto": ["return to office", "tenure", "rto"],
|
| 230 |
+
"disco": ["disco", "synthetic control", "distributional"],
|
| 231 |
+
"prodf": ["production function", "revenue", "identification"]
|
| 232 |
+
}
|
| 233 |
+
|
| 234 |
+
for key, keywords in paper_keywords.items():
|
| 235 |
+
if any(kw in query_lower for kw in keywords):
|
| 236 |
+
if key in self.papers:
|
| 237 |
+
paper = self.papers[key]
|
| 238 |
+
paper_context = f"[{paper['type']}: {paper['title']}]\n\n"
|
| 239 |
+
paper_context += paper["text"][:15000]
|
| 240 |
+
contexts.append(paper_context)
|
| 241 |
+
|
| 242 |
+
# If no specific match, try vector search
|
| 243 |
+
if not contexts and self.vector_store:
|
| 244 |
+
try:
|
| 245 |
+
docs = self.vector_store.similarity_search(query, k=4)
|
| 246 |
+
for doc in docs:
|
| 247 |
+
source = doc.metadata.get("source", "")
|
| 248 |
+
if source in self.papers:
|
| 249 |
+
paper = self.papers[source]
|
| 250 |
+
chunk_context = f"[From {paper['title']}]\n{doc.page_content}"
|
| 251 |
+
contexts.append(chunk_context)
|
| 252 |
+
except:
|
| 253 |
+
pass
|
| 254 |
+
|
| 255 |
+
# Always include paper list if no context found
|
| 256 |
+
if not contexts:
|
| 257 |
+
paper_list = "David Van Dijcke's papers:\n"
|
| 258 |
+
for key, paper in self.papers.items():
|
| 259 |
+
if key != "cv":
|
| 260 |
+
paper_list += f"- {paper['type']}: {paper['title']}\n"
|
| 261 |
+
contexts.append(paper_list)
|
| 262 |
+
|
| 263 |
+
return "\n\n---\n\n".join(contexts[:3])
|
| 264 |
+
|
| 265 |
+
def _fallback_response(self, query: str) -> str:
|
| 266 |
+
"""Fallback response without LLM"""
|
| 267 |
+
query_lower = query.lower()
|
| 268 |
+
|
| 269 |
+
# Job market paper query
|
| 270 |
+
if any(phrase in query_lower for phrase in ["job market", "jmp"]):
|
| 271 |
+
return """David Van Dijcke's JOB MARKET PAPER is:
|
| 272 |
+
|
| 273 |
+
"R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"
|
| 274 |
+
|
| 275 |
+
This paper extends regression discontinuity design (RDD) to analyze entire outcome distributions rather than just means. Key innovations:
|
| 276 |
+
- Uses optimal transport theory and Wasserstein distances
|
| 277 |
+
- Allows testing of distributional effects of policies
|
| 278 |
+
- Applications to income distributions, test score distributions
|
| 279 |
+
- Provides new identification and estimation procedures
|
| 280 |
+
|
| 281 |
+
This addresses a fundamental limitation of traditional RDD that only examines average treatment effects."""
|
| 282 |
+
|
| 283 |
+
# General greeting
|
| 284 |
+
if any(phrase in query_lower for phrase in ["what's up", "hello", "hi"]):
|
| 285 |
+
return """Hello! I'm David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
|
| 286 |
+
|
| 287 |
+
His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
|
| 288 |
+
|
| 289 |
+
I can tell you about:
|
| 290 |
+
- His job market paper (R3D)
|
| 291 |
+
- His other papers (FDR, disco, RTO, etc.)
|
| 292 |
+
- His econometric methods
|
| 293 |
+
- His background and CV
|
| 294 |
+
|
| 295 |
+
What would you like to know?"""
|
| 296 |
+
|
| 297 |
+
# Specific paper queries
|
| 298 |
+
if "r3d" in query_lower:
|
| 299 |
+
return "R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's JOB MARKET PAPER. It extends RDD to analyze entire outcome distributions using optimal transport theory and Wasserstein distances."
|
| 300 |
+
|
| 301 |
+
if "fdr" in query_lower:
|
| 302 |
+
return "Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuity locations. It uses geometric measure theory with applications to measuring economic impacts of internet shutdowns."
|
| 303 |
+
|
| 304 |
+
if "david" in query_lower or "who" in query_lower:
|
| 305 |
+
return "David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He specializes in functional data analysis, optimal transport methods, and develops novel econometric techniques for modern data challenges."
|
| 306 |
+
|
| 307 |
+
return "I can help with questions about David Van Dijcke's research. Try asking about his job market paper (R3D), his methods, or his background. For best results, please add a Google API key."
|
| 308 |
+
|
| 309 |
+
# Create Gradio interface
|
| 310 |
+
def create_interface():
|
| 311 |
+
"""Create simple Gradio interface"""
|
| 312 |
+
assistant = StableResearchAssistant()
|
| 313 |
+
|
| 314 |
+
def chat(message, history):
|
| 315 |
+
response = assistant.answer_question(message, history)
|
| 316 |
+
history.append([message, response])
|
| 317 |
+
return "", history
|
| 318 |
+
|
| 319 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
|
| 320 |
+
gr.Markdown("""
|
| 321 |
+
# David Van Dijcke - Research Assistant (Stable Version)
|
| 322 |
+
|
| 323 |
+
Ask questions about David's econometric research and papers.
|
| 324 |
+
""")
|
| 325 |
+
|
| 326 |
+
chatbot = gr.Chatbot(height=400)
|
| 327 |
+
msg = gr.Textbox(label="Your question", placeholder="What is David's job market paper about?")
|
| 328 |
+
clear = gr.Button("Clear")
|
| 329 |
+
|
| 330 |
+
# Examples
|
| 331 |
+
gr.Examples(
|
| 332 |
+
examples=[
|
| 333 |
+
"What is David's job market paper R3D about?",
|
| 334 |
+
"What econometric methods has David developed?",
|
| 335 |
+
"Tell me about David's background",
|
| 336 |
+
"How does David use optimal transport in his research?",
|
| 337 |
+
"What is the FDR paper about?"
|
| 338 |
+
],
|
| 339 |
+
inputs=msg
|
| 340 |
+
)
|
| 341 |
+
|
| 342 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 343 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 344 |
+
|
| 345 |
+
return demo
|
| 346 |
+
|
| 347 |
+
if __name__ == "__main__":
|
| 348 |
+
# Simple launch without API endpoint issues
|
| 349 |
+
interface = create_interface()
|
| 350 |
+
interface.launch(
|
| 351 |
+
server_name="127.0.0.1",
|
| 352 |
+
server_port=7860,
|
| 353 |
+
share=False,
|
| 354 |
+
quiet=True
|
| 355 |
+
)
|
app_working.py
ADDED
|
@@ -0,0 +1,368 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
David Van Dijcke - Stable Research Assistant
|
| 4 |
+
A simplified, stable version that avoids dependency conflicts
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from typing import List, Dict, Optional
|
| 9 |
+
import gradio as gr
|
| 10 |
+
from langchain_community.document_loaders import PyPDFLoader
|
| 11 |
+
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 12 |
+
from langchain_community.vectorstores import FAISS
|
| 13 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 14 |
+
from dotenv import load_dotenv
|
| 15 |
+
import google.generativeai as genai
|
| 16 |
+
|
| 17 |
+
# Load environment variables
|
| 18 |
+
load_dotenv()
|
| 19 |
+
|
| 20 |
+
class StableResearchAssistant:
|
| 21 |
+
"""Stable assistant with minimal dependencies"""
|
| 22 |
+
|
| 23 |
+
def __init__(self):
|
| 24 |
+
"""Initialize with stable configuration"""
|
| 25 |
+
self.embeddings = HuggingFaceEmbeddings(
|
| 26 |
+
model_name="sentence-transformers/all-MiniLM-L6-v2"
|
| 27 |
+
)
|
| 28 |
+
|
| 29 |
+
# Load all papers into memory
|
| 30 |
+
self.papers = self._load_papers()
|
| 31 |
+
|
| 32 |
+
# Create simple vector store
|
| 33 |
+
self.vector_store = self._create_vector_store()
|
| 34 |
+
|
| 35 |
+
# Setup LLM
|
| 36 |
+
self.llm = self._setup_llm()
|
| 37 |
+
|
| 38 |
+
def _load_papers(self) -> Dict[str, Dict]:
|
| 39 |
+
"""Load all papers into memory"""
|
| 40 |
+
papers = {}
|
| 41 |
+
pdf_dir = "documents"
|
| 42 |
+
|
| 43 |
+
paper_metadata = {
|
| 44 |
+
"r3d": {
|
| 45 |
+
"file": "r3d_arxiv_4apr2025.pdf",
|
| 46 |
+
"title": "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes",
|
| 47 |
+
"type": "JOB MARKET PAPER",
|
| 48 |
+
"description": "Extends RDD to analyze entire outcome distributions using optimal transport"
|
| 49 |
+
},
|
| 50 |
+
"fdr": {
|
| 51 |
+
"file": "fdr.pdf",
|
| 52 |
+
"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns",
|
| 53 |
+
"type": "Working Paper",
|
| 54 |
+
"description": "Estimates regression functions with unknown discontinuity locations"
|
| 55 |
+
},
|
| 56 |
+
"disco": {
|
| 57 |
+
"file": "disco.pdf",
|
| 58 |
+
"title": "disco: Distributional Synthetic Controls",
|
| 59 |
+
"type": "Working Paper",
|
| 60 |
+
"description": "Stata package for distributional synthetic control methods"
|
| 61 |
+
},
|
| 62 |
+
"rto": {
|
| 63 |
+
"file": "rto.pdf",
|
| 64 |
+
"title": "Return to Office and the Tenure Distribution",
|
| 65 |
+
"type": "Working Paper",
|
| 66 |
+
"description": "Analyzes impact of return-to-office mandates on employee tenure"
|
| 67 |
+
},
|
| 68 |
+
"prodf": {
|
| 69 |
+
"file": "prodf.pdf",
|
| 70 |
+
"title": "On the Non-Identification of Revenue Production Functions",
|
| 71 |
+
"type": "Working Paper",
|
| 72 |
+
"description": "Shows non-identification of production functions with revenue data"
|
| 73 |
+
},
|
| 74 |
+
"cv": {
|
| 75 |
+
"file": "CV_DavidVanDijcke.pdf",
|
| 76 |
+
"title": "Curriculum Vitae",
|
| 77 |
+
"type": "CV",
|
| 78 |
+
"description": "David Van Dijcke's academic CV"
|
| 79 |
+
}
|
| 80 |
+
}
|
| 81 |
+
|
| 82 |
+
for key, metadata in paper_metadata.items():
|
| 83 |
+
pdf_path = os.path.join(pdf_dir, metadata["file"])
|
| 84 |
+
if os.path.exists(pdf_path):
|
| 85 |
+
try:
|
| 86 |
+
loader = PyPDFLoader(pdf_path)
|
| 87 |
+
pages = loader.load() # Load ALL pages
|
| 88 |
+
|
| 89 |
+
# Store full text with metadata
|
| 90 |
+
full_text = "\n\n".join([p.page_content for p in pages])
|
| 91 |
+
papers[key] = {
|
| 92 |
+
"text": full_text,
|
| 93 |
+
"pages": len(pages),
|
| 94 |
+
"filename": metadata["file"],
|
| 95 |
+
"title": metadata["title"],
|
| 96 |
+
"type": metadata["type"],
|
| 97 |
+
"description": metadata["description"]
|
| 98 |
+
}
|
| 99 |
+
print(f"Loaded {metadata['title']}: {len(pages)} pages, {len(full_text):,} characters")
|
| 100 |
+
|
| 101 |
+
except Exception as e:
|
| 102 |
+
print(f"Error loading {metadata['file']}: {e}")
|
| 103 |
+
|
| 104 |
+
return papers
|
| 105 |
+
|
| 106 |
+
def _create_vector_store(self) -> Optional[FAISS]:
|
| 107 |
+
"""Create vector store from papers"""
|
| 108 |
+
try:
|
| 109 |
+
# Create documents with larger chunks
|
| 110 |
+
documents = []
|
| 111 |
+
text_splitter = RecursiveCharacterTextSplitter(
|
| 112 |
+
chunk_size=1500,
|
| 113 |
+
chunk_overlap=150
|
| 114 |
+
)
|
| 115 |
+
|
| 116 |
+
for key, paper in self.papers.items():
|
| 117 |
+
# Split text
|
| 118 |
+
chunks = text_splitter.split_text(paper["text"])
|
| 119 |
+
|
| 120 |
+
# Create documents
|
| 121 |
+
for i, chunk in enumerate(chunks):
|
| 122 |
+
from langchain.schema import Document
|
| 123 |
+
doc = Document(
|
| 124 |
+
page_content=chunk,
|
| 125 |
+
metadata={"source": key, "chunk": i}
|
| 126 |
+
)
|
| 127 |
+
documents.append(doc)
|
| 128 |
+
|
| 129 |
+
# Create vector store
|
| 130 |
+
if documents:
|
| 131 |
+
return FAISS.from_documents(documents, self.embeddings)
|
| 132 |
+
|
| 133 |
+
except Exception as e:
|
| 134 |
+
print(f"Error creating vector store: {e}")
|
| 135 |
+
|
| 136 |
+
return None
|
| 137 |
+
|
| 138 |
+
def _setup_llm(self):
|
| 139 |
+
"""Setup Gemini LLM"""
|
| 140 |
+
api_key = os.getenv("GOOGLE_API_KEY")
|
| 141 |
+
|
| 142 |
+
if api_key:
|
| 143 |
+
try:
|
| 144 |
+
genai.configure(api_key=api_key)
|
| 145 |
+
# Use Gemini 2.5 Flash Preview
|
| 146 |
+
try:
|
| 147 |
+
model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')
|
| 148 |
+
print("Using Gemini 2.5 Flash Preview")
|
| 149 |
+
return model
|
| 150 |
+
except:
|
| 151 |
+
model = genai.GenerativeModel('gemini-1.5-flash')
|
| 152 |
+
print("Using Gemini 1.5 Flash")
|
| 153 |
+
return model
|
| 154 |
+
except Exception as e:
|
| 155 |
+
print(f"Error setting up Gemini: {e}")
|
| 156 |
+
|
| 157 |
+
return None
|
| 158 |
+
|
| 159 |
+
def answer_question(self, query: str, chat_history: List = None) -> str:
|
| 160 |
+
"""Answer questions about David's research"""
|
| 161 |
+
if not query.strip():
|
| 162 |
+
return "Please ask a question about David Van Dijcke's research."
|
| 163 |
+
|
| 164 |
+
# Get relevant context
|
| 165 |
+
context = self._get_context(query)
|
| 166 |
+
|
| 167 |
+
# Generate response
|
| 168 |
+
if self.llm:
|
| 169 |
+
prompt = f"""You are David Van Dijcke's research assistant. David is an econometrician on the 2025-26 academic job market from the University of Michigan.
|
| 170 |
+
|
| 171 |
+
Key facts about David:
|
| 172 |
+
- His JOB MARKET PAPER is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
|
| 173 |
+
- He specializes in functional data analysis, optimal transport, and econometric theory
|
| 174 |
+
- He develops methods for analyzing distributional effects, not just averages
|
| 175 |
+
|
| 176 |
+
Context from his papers:
|
| 177 |
+
{context}
|
| 178 |
+
|
| 179 |
+
Question: {query}
|
| 180 |
+
|
| 181 |
+
Instructions:
|
| 182 |
+
- Provide a conversational yet informative response
|
| 183 |
+
- Be specific and accurate based on the papers
|
| 184 |
+
- For technical questions, explain both the intuition AND the technical details
|
| 185 |
+
- Highlight what makes David's work unique and important
|
| 186 |
+
- For "what is the use" questions, focus on real-world applications and policy relevance
|
| 187 |
+
|
| 188 |
+
Answer:"""
|
| 189 |
+
|
| 190 |
+
try:
|
| 191 |
+
response = self.llm.generate_content(prompt)
|
| 192 |
+
return response.text
|
| 193 |
+
except Exception as e:
|
| 194 |
+
print(f"Error generating response: {e}")
|
| 195 |
+
return self._fallback_response(query)
|
| 196 |
+
else:
|
| 197 |
+
return self._fallback_response(query)
|
| 198 |
+
|
| 199 |
+
def _get_context(self, query: str) -> str:
|
| 200 |
+
"""Get relevant context for query"""
|
| 201 |
+
query_lower = query.lower()
|
| 202 |
+
contexts = []
|
| 203 |
+
|
| 204 |
+
# CRITICAL: Check for job market paper queries FIRST
|
| 205 |
+
if any(phrase in query_lower for phrase in ["job market", "jmp", "job market paper", "what is his job market"]):
|
| 206 |
+
# Add R3D paper info with clear labeling
|
| 207 |
+
if "r3d" in self.papers:
|
| 208 |
+
paper = self.papers["r3d"]
|
| 209 |
+
context = f"[JOB MARKET PAPER - R3D: Regression Discontinuity Design with Distribution-Valued Outcomes]\n\n"
|
| 210 |
+
context += f"This is David Van Dijcke's JOB MARKET PAPER.\n\n"
|
| 211 |
+
# Provide more context for Gemini 2.5's larger window
|
| 212 |
+
context += paper["text"][:100000] # Get first ~100k chars (about 25k tokens)
|
| 213 |
+
contexts.append(context)
|
| 214 |
+
# Return immediately for job market paper queries
|
| 215 |
+
return context
|
| 216 |
+
|
| 217 |
+
# Check for general "what's up" or greeting
|
| 218 |
+
if any(phrase in query_lower for phrase in ["what's up", "whats up", "hello", "hi"]):
|
| 219 |
+
intro = """David Van Dijcke is an econometrician on the 2025-26 academic job market from the University of Michigan.
|
| 220 |
+
|
| 221 |
+
His JOB MARKET PAPER is "R3D: Regression Discontinuity Design with Distribution-Valued Outcomes" which extends regression discontinuity design to analyze entire outcome distributions using optimal transport theory.
|
| 222 |
+
|
| 223 |
+
He has developed several innovative econometric methods including:
|
| 224 |
+
- R3D (Job Market Paper): Distribution-valued RDD
|
| 225 |
+
- Free Discontinuity Regression (FDR): Estimating regressions with unknown discontinuities
|
| 226 |
+
- Distributional Synthetic Controls (disco): A Stata package
|
| 227 |
+
- Work on non-identification of revenue production functions
|
| 228 |
+
|
| 229 |
+
Ask me about any of his papers or methods!"""
|
| 230 |
+
contexts.append(intro)
|
| 231 |
+
|
| 232 |
+
# Check for David/CV queries
|
| 233 |
+
if any(word in query_lower for word in ["david", "who", "background", "cv", "about"]):
|
| 234 |
+
if "cv" in self.papers:
|
| 235 |
+
cv_context = f"[CURRICULUM VITAE]\n\n{self.papers['cv']['text'][:8000]}"
|
| 236 |
+
contexts.append(cv_context)
|
| 237 |
+
|
| 238 |
+
# Check for specific paper mentions
|
| 239 |
+
paper_keywords = {
|
| 240 |
+
"r3d": ["r3d", "regression discontinuity", "distribution", "optimal transport", "wasserstein"],
|
| 241 |
+
"fdr": ["fdr", "free discontinuity", "internet shutdown"],
|
| 242 |
+
"rto": ["return to office", "tenure", "rto"],
|
| 243 |
+
"disco": ["disco", "synthetic control", "distributional"],
|
| 244 |
+
"prodf": ["production function", "revenue", "identification"]
|
| 245 |
+
}
|
| 246 |
+
|
| 247 |
+
for key, keywords in paper_keywords.items():
|
| 248 |
+
if any(kw in query_lower for kw in keywords):
|
| 249 |
+
if key in self.papers:
|
| 250 |
+
paper = self.papers[key]
|
| 251 |
+
paper_context = f"[{paper['type']}: {paper['title']}]\n\n"
|
| 252 |
+
paper_context += paper["text"][:15000]
|
| 253 |
+
contexts.append(paper_context)
|
| 254 |
+
|
| 255 |
+
# If no specific match, try vector search
|
| 256 |
+
if not contexts and self.vector_store:
|
| 257 |
+
try:
|
| 258 |
+
docs = self.vector_store.similarity_search(query, k=4)
|
| 259 |
+
for doc in docs:
|
| 260 |
+
source = doc.metadata.get("source", "")
|
| 261 |
+
if source in self.papers:
|
| 262 |
+
paper = self.papers[source]
|
| 263 |
+
chunk_context = f"[From {paper['title']}]\n{doc.page_content}"
|
| 264 |
+
contexts.append(chunk_context)
|
| 265 |
+
except:
|
| 266 |
+
pass
|
| 267 |
+
|
| 268 |
+
# Always include paper list if no context found
|
| 269 |
+
if not contexts:
|
| 270 |
+
paper_list = "David Van Dijcke's papers:\n"
|
| 271 |
+
for key, paper in self.papers.items():
|
| 272 |
+
if key != "cv":
|
| 273 |
+
paper_list += f"- {paper['type']}: {paper['title']}\n"
|
| 274 |
+
contexts.append(paper_list)
|
| 275 |
+
|
| 276 |
+
return "\n\n---\n\n".join(contexts[:3])
|
| 277 |
+
|
| 278 |
+
def _fallback_response(self, query: str) -> str:
|
| 279 |
+
"""Fallback response without LLM"""
|
| 280 |
+
query_lower = query.lower()
|
| 281 |
+
|
| 282 |
+
# Job market paper query
|
| 283 |
+
if any(phrase in query_lower for phrase in ["job market", "jmp"]):
|
| 284 |
+
return """David Van Dijcke's JOB MARKET PAPER is:
|
| 285 |
+
|
| 286 |
+
"R3D: Regression Discontinuity Design with Distribution-Valued Outcomes"
|
| 287 |
+
|
| 288 |
+
This paper extends regression discontinuity design (RDD) to analyze entire outcome distributions rather than just means. Key innovations:
|
| 289 |
+
- Uses optimal transport theory and Wasserstein distances
|
| 290 |
+
- Allows testing of distributional effects of policies
|
| 291 |
+
- Applications to income distributions, test score distributions
|
| 292 |
+
- Provides new identification and estimation procedures
|
| 293 |
+
|
| 294 |
+
This addresses a fundamental limitation of traditional RDD that only examines average treatment effects."""
|
| 295 |
+
|
| 296 |
+
# General greeting
|
| 297 |
+
if any(phrase in query_lower for phrase in ["what's up", "hello", "hi"]):
|
| 298 |
+
return """Hello! I'm David Van Dijcke's research assistant. David is an econometrician on the 2025-26 job market.
|
| 299 |
+
|
| 300 |
+
His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes).
|
| 301 |
+
|
| 302 |
+
I can tell you about:
|
| 303 |
+
- His job market paper (R3D)
|
| 304 |
+
- His other papers (FDR, disco, RTO, etc.)
|
| 305 |
+
- His econometric methods
|
| 306 |
+
- His background and CV
|
| 307 |
+
|
| 308 |
+
What would you like to know?"""
|
| 309 |
+
|
| 310 |
+
# Specific paper queries
|
| 311 |
+
if "r3d" in query_lower:
|
| 312 |
+
return "R3D (Regression Discontinuity Design with Distribution-Valued Outcomes) is David's JOB MARKET PAPER. It extends RDD to analyze entire outcome distributions using optimal transport theory and Wasserstein distances."
|
| 313 |
+
|
| 314 |
+
if "fdr" in query_lower:
|
| 315 |
+
return "Free Discontinuity Regression (FDR) is David's paper on estimating regression functions with unknown discontinuity locations. It uses geometric measure theory with applications to measuring economic impacts of internet shutdowns."
|
| 316 |
+
|
| 317 |
+
if "david" in query_lower or "who" in query_lower:
|
| 318 |
+
return "David Van Dijcke is an econometrician on the 2025-26 job market from the University of Michigan. He specializes in functional data analysis, optimal transport methods, and develops novel econometric techniques for modern data challenges."
|
| 319 |
+
|
| 320 |
+
return "I can help with questions about David Van Dijcke's research. Try asking about his job market paper (R3D), his methods, or his background. For best results, please add a Google API key."
|
| 321 |
+
|
| 322 |
+
# Create Gradio interface
|
| 323 |
+
def create_interface():
|
| 324 |
+
"""Create simple Gradio interface"""
|
| 325 |
+
assistant = StableResearchAssistant()
|
| 326 |
+
|
| 327 |
+
def chat(message, history):
|
| 328 |
+
response = assistant.answer_question(message, history)
|
| 329 |
+
history.append([message, response])
|
| 330 |
+
return "", history
|
| 331 |
+
|
| 332 |
+
with gr.Blocks(title="David Van Dijcke - Research Assistant") as demo:
|
| 333 |
+
gr.Markdown("""
|
| 334 |
+
# David Van Dijcke - Research Assistant (Stable Version)
|
| 335 |
+
|
| 336 |
+
Ask questions about David's econometric research and papers.
|
| 337 |
+
""")
|
| 338 |
+
|
| 339 |
+
chatbot = gr.Chatbot(height=400)
|
| 340 |
+
msg = gr.Textbox(label="Your question", placeholder="What is David's job market paper about?")
|
| 341 |
+
clear = gr.Button("Clear")
|
| 342 |
+
|
| 343 |
+
# Examples
|
| 344 |
+
gr.Examples(
|
| 345 |
+
examples=[
|
| 346 |
+
"What is David's job market paper R3D about?",
|
| 347 |
+
"What econometric methods has David developed?",
|
| 348 |
+
"Tell me about David's background",
|
| 349 |
+
"How does David use optimal transport in his research?",
|
| 350 |
+
"What is the FDR paper about?"
|
| 351 |
+
],
|
| 352 |
+
inputs=msg
|
| 353 |
+
)
|
| 354 |
+
|
| 355 |
+
msg.submit(chat, [msg, chatbot], [msg, chatbot])
|
| 356 |
+
clear.click(lambda: None, None, chatbot, queue=False)
|
| 357 |
+
|
| 358 |
+
return demo
|
| 359 |
+
|
| 360 |
+
if __name__ == "__main__":
|
| 361 |
+
# Simple launch without API endpoint issues
|
| 362 |
+
interface = create_interface()
|
| 363 |
+
interface.launch(
|
| 364 |
+
server_name="127.0.0.1",
|
| 365 |
+
server_port=7860,
|
| 366 |
+
share=False,
|
| 367 |
+
quiet=True
|
| 368 |
+
)
|
pyproject.toml
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[project]
|
| 2 |
+
name = "david-research-assistant"
|
| 3 |
+
version = "0.1.0"
|
| 4 |
+
description = "AI Research Assistant for David Van Dijcke's academic website"
|
| 5 |
+
requires-python = ">=3.9"
|
| 6 |
+
dependencies = [
|
| 7 |
+
"gradio>=4.44.0",
|
| 8 |
+
"langchain>=0.1.9",
|
| 9 |
+
"langchain-community>=0.0.24",
|
| 10 |
+
"sentence-transformers==2.5.1",
|
| 11 |
+
"faiss-cpu==1.7.4",
|
| 12 |
+
"pypdf==4.0.2",
|
| 13 |
+
"google-generativeai>=0.8.3",
|
| 14 |
+
"python-dotenv==1.0.1",
|
| 15 |
+
"pyyaml==6.0.1",
|
| 16 |
+
"pydantic>=2.0,<3.0",
|
| 17 |
+
"fastapi>=0.100.0",
|
| 18 |
+
]
|
| 19 |
+
|
| 20 |
+
[project.optional-dependencies]
|
| 21 |
+
improved = [
|
| 22 |
+
"gradio>=4.44.0",
|
| 23 |
+
"langchain==0.1.9",
|
| 24 |
+
"langchain-community==0.0.24",
|
| 25 |
+
"sentence-transformers==2.5.1",
|
| 26 |
+
"faiss-cpu==1.7.4",
|
| 27 |
+
"pypdf==4.0.2",
|
| 28 |
+
"huggingface-hub==0.20.3",
|
| 29 |
+
"python-dotenv==1.0.1",
|
| 30 |
+
"pydantic>=2.0,<3.0",
|
| 31 |
+
"fastapi>=0.100.0",
|
| 32 |
+
]
|
| 33 |
+
full-context = [
|
| 34 |
+
"gradio>=4.44.0",
|
| 35 |
+
"langchain==0.1.9",
|
| 36 |
+
"langchain-community==0.0.24",
|
| 37 |
+
"sentence-transformers==2.5.1",
|
| 38 |
+
"faiss-cpu==1.7.4",
|
| 39 |
+
"pypdf==4.0.2",
|
| 40 |
+
"google-generativeai>=0.8.3",
|
| 41 |
+
"python-dotenv==1.0.1",
|
| 42 |
+
"pyyaml==6.0.1",
|
| 43 |
+
"pydantic>=2.0,<3.0",
|
| 44 |
+
"fastapi>=0.100.0",
|
| 45 |
+
]
|
| 46 |
+
test = [
|
| 47 |
+
"pytest>=7.0",
|
| 48 |
+
"pytest-asyncio",
|
| 49 |
+
]
|
| 50 |
+
|
| 51 |
+
[build-system]
|
| 52 |
+
requires = ["hatchling"]
|
| 53 |
+
build-backend = "hatchling.build"
|
| 54 |
+
|
| 55 |
+
[tool.hatch.build.targets.wheel]
|
| 56 |
+
packages = ["."]
|
| 57 |
+
include = ["*.py", "documents/", "requirements*.txt", "*.md"]
|
| 58 |
+
|
| 59 |
+
[tool.uv]
|
| 60 |
+
dev-dependencies = [
|
| 61 |
+
"ipython>=8.0",
|
| 62 |
+
"black>=23.0",
|
| 63 |
+
"ruff>=0.1.0",
|
| 64 |
+
]
|
pyproject_stable.toml
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[project]
|
| 2 |
+
name = "david-research-assistant"
|
| 3 |
+
version = "0.1.0"
|
| 4 |
+
description = "AI Research Assistant for David Van Dijcke's academic website"
|
| 5 |
+
requires-python = ">=3.9,<3.11"
|
| 6 |
+
dependencies = [
|
| 7 |
+
"gradio==4.19.2",
|
| 8 |
+
"langchain==0.1.9",
|
| 9 |
+
"langchain-community==0.0.24",
|
| 10 |
+
"sentence-transformers==2.5.1",
|
| 11 |
+
"faiss-cpu==1.7.4",
|
| 12 |
+
"pypdf==4.0.2",
|
| 13 |
+
"google-generativeai==0.3.2",
|
| 14 |
+
"python-dotenv==1.0.1",
|
| 15 |
+
"pydantic==2.5.3",
|
| 16 |
+
"pydantic-core==2.14.6",
|
| 17 |
+
"fastapi==0.109.0",
|
| 18 |
+
"httpx==0.26.0",
|
| 19 |
+
"typing-extensions==4.9.0",
|
| 20 |
+
]
|
| 21 |
+
|
| 22 |
+
[build-system]
|
| 23 |
+
requires = ["setuptools>=61.0"]
|
| 24 |
+
build-backend = "setuptools.build_meta"
|
setup_stable.sh
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
|
| 2 |
+
|
| 3 |
+
echo "Setting up stable David Research Assistant environment..."
|
| 4 |
+
|
| 5 |
+
# Clean up existing environment
|
| 6 |
+
echo "Cleaning up..."
|
| 7 |
+
rm -rf .venv uv.lock __pycache__ *.pyc
|
| 8 |
+
|
| 9 |
+
# Create fresh virtual environment
|
| 10 |
+
echo "Creating virtual environment..."
|
| 11 |
+
uv venv
|
| 12 |
+
|
| 13 |
+
# Activate it
|
| 14 |
+
source .venv/bin/activate
|
| 15 |
+
|
| 16 |
+
# Install specific versions that work together
|
| 17 |
+
echo "Installing dependencies..."
|
| 18 |
+
uv pip install \
|
| 19 |
+
gradio==4.19.2 \
|
| 20 |
+
langchain==0.1.9 \
|
| 21 |
+
langchain-community==0.0.24 \
|
| 22 |
+
sentence-transformers==2.5.1 \
|
| 23 |
+
faiss-cpu==1.7.4 \
|
| 24 |
+
pypdf==4.0.2 \
|
| 25 |
+
google-generativeai==0.3.2 \
|
| 26 |
+
python-dotenv==1.0.1 \
|
| 27 |
+
pydantic==2.5.3 \
|
| 28 |
+
pydantic-core==2.14.6 \
|
| 29 |
+
fastapi==0.109.0 \
|
| 30 |
+
httpx==0.26.0 \
|
| 31 |
+
typing-extensions==4.9.0
|
| 32 |
+
|
| 33 |
+
echo "Setup complete! Run with:"
|
| 34 |
+
echo "source .venv/bin/activate"
|
| 35 |
+
echo "python app_stable.py"
|
test_full_context.py
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Test script comparing original and full context versions
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import os
|
| 7 |
+
import time
|
| 8 |
+
from app import ImprovedResearchAssistant
|
| 9 |
+
from app_full_context import FullContextResearchAssistant
|
| 10 |
+
|
| 11 |
+
def test_both_versions():
|
| 12 |
+
"""Compare responses from both versions"""
|
| 13 |
+
print("Comparing Research Assistant Versions\n")
|
| 14 |
+
print("="*80)
|
| 15 |
+
|
| 16 |
+
# Initialize both assistants
|
| 17 |
+
print("Loading original assistant...")
|
| 18 |
+
original = ImprovedResearchAssistant()
|
| 19 |
+
|
| 20 |
+
print("Loading full context assistant...")
|
| 21 |
+
full_context = FullContextResearchAssistant()
|
| 22 |
+
|
| 23 |
+
# Test queries that benefit from full context
|
| 24 |
+
test_queries = [
|
| 25 |
+
"What specific econometric methods does David develop in R3D? Give technical details.",
|
| 26 |
+
"Explain the theoretical framework of optimal transport in David's research.",
|
| 27 |
+
"What are the main results and contributions across all of David's papers?",
|
| 28 |
+
"How does the FDR paper relate to David's other work on discontinuities?",
|
| 29 |
+
"What makes David uniquely qualified for an econometrics position? Use specific examples from his papers.",
|
| 30 |
+
"Describe the empirical applications in David's job market paper with specific details.",
|
| 31 |
+
"What are the identification strategies used across David's different papers?",
|
| 32 |
+
"How does David's work on productivity relate to distributional outcomes?"
|
| 33 |
+
]
|
| 34 |
+
|
| 35 |
+
for i, query in enumerate(test_queries, 1):
|
| 36 |
+
print(f"\n{'='*80}")
|
| 37 |
+
print(f"Test {i}: {query}")
|
| 38 |
+
print('='*80)
|
| 39 |
+
|
| 40 |
+
# Original version
|
| 41 |
+
print("\n--- ORIGINAL VERSION (Chunked) ---")
|
| 42 |
+
start_time = time.time()
|
| 43 |
+
try:
|
| 44 |
+
original_response = original.answer_question(query)
|
| 45 |
+
original_time = time.time() - start_time
|
| 46 |
+
print(f"Response ({original_time:.2f}s):")
|
| 47 |
+
print(original_response[:500] + "..." if len(original_response) > 500 else original_response)
|
| 48 |
+
except Exception as e:
|
| 49 |
+
print(f"Error: {e}")
|
| 50 |
+
original_response = "Error"
|
| 51 |
+
original_time = 0
|
| 52 |
+
|
| 53 |
+
# Full context version
|
| 54 |
+
print("\n--- FULL CONTEXT VERSION ---")
|
| 55 |
+
start_time = time.time()
|
| 56 |
+
try:
|
| 57 |
+
full_response = full_context.answer_question(query)
|
| 58 |
+
full_time = time.time() - start_time
|
| 59 |
+
print(f"Response ({full_time:.2f}s):")
|
| 60 |
+
print(full_response[:500] + "..." if len(full_response) > 500 else full_response)
|
| 61 |
+
except Exception as e:
|
| 62 |
+
print(f"Error: {e}")
|
| 63 |
+
full_response = "Error"
|
| 64 |
+
full_time = 0
|
| 65 |
+
|
| 66 |
+
# Compare
|
| 67 |
+
print("\n--- COMPARISON ---")
|
| 68 |
+
print(f"Original length: {len(original_response)} chars")
|
| 69 |
+
print(f"Full context length: {len(full_response)} chars")
|
| 70 |
+
print(f"Length improvement: {len(full_response) / max(len(original_response), 1):.1f}x")
|
| 71 |
+
|
| 72 |
+
# Check for specific technical terms
|
| 73 |
+
technical_terms = ["optimal transport", "Wasserstein", "distribution", "discontinuity",
|
| 74 |
+
"identification", "econometric", "functional data", "geometric measure"]
|
| 75 |
+
|
| 76 |
+
original_terms = sum(1 for term in technical_terms if term.lower() in original_response.lower())
|
| 77 |
+
full_terms = sum(1 for term in technical_terms if term.lower() in full_response.lower())
|
| 78 |
+
|
| 79 |
+
print(f"Technical terms - Original: {original_terms}, Full: {full_terms}")
|
| 80 |
+
|
| 81 |
+
def analyze_paper_coverage():
|
| 82 |
+
"""Analyze how much of each paper is loaded"""
|
| 83 |
+
print("\n" + "="*80)
|
| 84 |
+
print("PAPER COVERAGE ANALYSIS")
|
| 85 |
+
print("="*80)
|
| 86 |
+
|
| 87 |
+
assistant = FullContextResearchAssistant()
|
| 88 |
+
|
| 89 |
+
print("\nFull papers loaded:")
|
| 90 |
+
total_chars = 0
|
| 91 |
+
for key, paper_info in assistant.full_papers.items():
|
| 92 |
+
print(f"\n{key}:")
|
| 93 |
+
print(f" Title: {paper_info['title']}")
|
| 94 |
+
print(f" Pages: {paper_info['num_pages']}")
|
| 95 |
+
print(f" Characters: {paper_info['length']:,}")
|
| 96 |
+
total_chars += paper_info['length']
|
| 97 |
+
|
| 98 |
+
print(f"\nTotal characters across all papers: {total_chars:,}")
|
| 99 |
+
print(f"Approximate tokens (chars/4): {total_chars//4:,}")
|
| 100 |
+
print(f"Well within Gemini 2.0 Flash context window (1M+ tokens)")
|
| 101 |
+
|
| 102 |
+
if __name__ == "__main__":
|
| 103 |
+
# Check API key
|
| 104 |
+
if not os.getenv("GOOGLE_API_KEY"):
|
| 105 |
+
print("Warning: No GOOGLE_API_KEY found. Results will be limited.\n")
|
| 106 |
+
|
| 107 |
+
# Run tests
|
| 108 |
+
test_both_versions()
|
| 109 |
+
analyze_paper_coverage()
|
| 110 |
+
|
| 111 |
+
print("\n" + "="*80)
|
| 112 |
+
print("Testing complete!")
|
| 113 |
+
print("\nKey improvements in full context version:")
|
| 114 |
+
print("- Loads complete papers instead of just first few pages")
|
| 115 |
+
print("- Larger chunk sizes (2000 vs 500 chars)")
|
| 116 |
+
print("- Better context preservation")
|
| 117 |
+
print("- More comprehensive responses")
|
| 118 |
+
print("- Ability to make cross-paper connections")
|