Davidvandijcke Claude commited on
Commit
1d552a0
·
1 Parent(s): 402874c

Major upgrade: Transform assistant into specialized econometric research showcase

Browse files

This commit comprehensively improves the AI assistant to properly represent David Van Dijcke as an econometrician on the 2025-26 job market, emphasizing his methodological contributions to functional data analysis and optimal transport.

Key improvements:
- Enhanced econometric focus with detailed paper summaries (R3D, FDR, DISCO, RTO)
- Professional prompts emphasizing methodological contributions
- Improved greetings that immediately identify David as an econometrician
- Better document loading with more content for job market paper
- Comprehensive deployment documentation and testing framework
- Security improvements (proper .env handling, .gitignore)

Technical enhancements:
- Optimized Gemini 2.0/1.5 Flash integration for accurate responses
- Enhanced context about functional data analysis and optimal transport
- Distribution-valued treatment effects and geometric measure theory focus
- Policy applications using big data emphasized alongside theoretical work

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

.env.example CHANGED
@@ -1,4 +1,4 @@
1
- # Google AI API Key (optional)
2
  # Get your API key from https://aistudio.google.com/app/apikey
3
  # If not provided, the app will use a limited mode with lower quality
4
- GOOGLE_API_KEY=your_api_key_here
 
1
+ # Google AI API Key (optional but recommended)
2
  # Get your API key from https://aistudio.google.com/app/apikey
3
  # If not provided, the app will use a limited mode with lower quality
4
+ # GOOGLE_API_KEY=your_api_key_here
.gitignore ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Environment variables
2
+ .env
3
+ .env.local
4
+ .env.*.local
5
+
6
+ # Cache
7
+ vector_store_cache/
8
+ __pycache__/
9
+ *.pyc
10
+ *.pyo
11
+ *.pyd
12
+ .Python
13
+
14
+ # IDE
15
+ .vscode/
16
+ .idea/
17
+ *.swp
18
+ *.swo
19
+ *~
20
+
21
+ # OS
22
+ .DS_Store
23
+ Thumbs.db
24
+
25
+ # Logs
26
+ *.log
27
+
28
+ # Testing
29
+ .pytest_cache/
30
+ .coverage
31
+ htmlcov/
32
+
33
+ # Gradio
34
+ gradio_cached_examples/
35
+ flagged/
DEPLOYMENT_GUIDE.md ADDED
@@ -0,0 +1,133 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Deployment Guide for David Van Dijcke's Econometric Research Assistant
2
+
3
+ ## Overview
4
+
5
+ This assistant specializes in David Van Dijcke's econometric research, emphasizing his contributions to functional data analysis, optimal transport, and causal inference methods. The assistant is optimized for the 2025-26 economics job market.
6
+
7
+ ## Key Features
8
+
9
+ - **Econometric Focus**: Emphasizes David's methodological contributions
10
+ - **Job Market Ready**: Highlights R3D paper and econometric innovations
11
+ - **Technical Accuracy**: Detailed information about functional data analysis and optimal transport
12
+ - **Policy Applications**: Shows how methods apply to real-world big data problems
13
+
14
+ ## Deployment Options
15
+
16
+ ### Option 1: Hugging Face Spaces (Recommended)
17
+
18
+ 1. **Create a new Space**:
19
+ - Go to https://huggingface.co/new-space
20
+ - Choose Gradio SDK
21
+ - Set to Public
22
+
23
+ 2. **Upload files**:
24
+ - `app.py` (the main application)
25
+ - `requirements.txt`
26
+ - `documents/` folder with PDFs
27
+
28
+ 3. **Add Google API Key** (for best performance):
29
+ - Go to Space Settings > Repository secrets
30
+ - Add secret: `GOOGLE_API_KEY`
31
+ - Get key from: https://aistudio.google.com/app/apikey
32
+
33
+ 4. **The Space will auto-deploy**
34
+
35
+ ### Option 2: Local Development
36
+
37
+ ```bash
38
+ # Clone the repository
39
+ git clone https://huggingface.co/spaces/dvdijcke/david-research-assistant
40
+
41
+ # Install dependencies
42
+ pip install -r requirements.txt
43
+
44
+ # Set up environment
45
+ echo "GOOGLE_API_KEY=your_key_here" > .env
46
+
47
+ # Run the app
48
+ python app.py
49
+ ```
50
+
51
+ ## Performance Optimization
52
+
53
+ ### Using Google Gemini (Recommended)
54
+ - **Model**: Gemini 2.0 Flash (falls back to 1.5 Flash)
55
+ - **Cost**: ~$0.001-0.005 per conversation
56
+ - **Quality**: High accuracy, understands technical econometric concepts
57
+ - **Setup**: Just add GOOGLE_API_KEY to environment
58
+
59
+ ### Without API Key
60
+ - Falls back to limited mode
61
+ - Lower quality responses
62
+ - Still functional but less accurate
63
+
64
+ ## Content Updates
65
+
66
+ ### To update research information:
67
+
68
+ 1. **Edit `app.py`** and modify the `research_info` section:
69
+ - Update paper titles and descriptions
70
+ - Add new methodological contributions
71
+ - Update job market status
72
+
73
+ 2. **Update paper summaries** in the `paper_summaries` section:
74
+ - Add new papers
75
+ - Update findings
76
+ - Emphasize econometric innovations
77
+
78
+ 3. **Add new PDFs** to `documents/` folder:
79
+ - Job market paper should be prominently featured
80
+ - Include recent working papers
81
+ - CV should be up to date
82
+
83
+ ## Testing
84
+
85
+ Run the test script to verify functionality:
86
+
87
+ ```bash
88
+ python test_assistant.py
89
+ ```
90
+
91
+ Key things to verify:
92
+ - Correctly identifies David as an econometrician
93
+ - Accurately describes R3D and other papers
94
+ - Emphasizes methodological contributions
95
+ - Links theory to applications
96
+
97
+ ## Common Issues
98
+
99
+ 1. **"No module named 'langchain'"**
100
+ - Solution: `pip install -r requirements.txt`
101
+
102
+ 2. **Slow responses**
103
+ - Add Google API key for faster Gemini responses
104
+ - Check if vector store cache exists
105
+
106
+ 3. **Incorrect information**
107
+ - Update the context in `app.py`
108
+ - Ensure PDFs are loading correctly
109
+ - Check paper summaries are accurate
110
+
111
+ ## Customization
112
+
113
+ ### Adjusting the tone:
114
+ Edit the prompt in `generate_response()` to adjust formality and focus.
115
+
116
+ ### Adding new examples:
117
+ Update the `examples` list in `create_gradio_interface()`.
118
+
119
+ ### Changing the model:
120
+ Modify the `genai.GenerativeModel()` initialization to use different models.
121
+
122
+ ## Monitoring
123
+
124
+ - Check Space logs for errors
125
+ - Monitor API usage in Google AI Studio
126
+ - Test with various econometric questions regularly
127
+
128
+ ## Support
129
+
130
+ For issues or updates:
131
+ - Check Hugging Face Space logs
132
+ - Verify API key is correctly set
133
+ - Ensure all PDFs are in the documents folder
DEPLOYMENT_IMPROVED.md ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Deploying the Improved Research Assistant
2
+
3
+ ## Quick Start
4
+
5
+ 1. **Set up environment variables**:
6
+ Create a `.env` file with your Hugging Face token:
7
+ ```
8
+ HUGGINGFACE_TOKEN=your_token_here
9
+ ```
10
+
11
+ 2. **Install dependencies**:
12
+ ```bash
13
+ pip install -r requirements_improved.txt
14
+ ```
15
+
16
+ 3. **Run the improved app**:
17
+ ```bash
18
+ python app_improved.py
19
+ ```
20
+
21
+ ## Key Improvements
22
+
23
+ ### 1. **Performance Enhancements**
24
+ - Removed heavy BART classifier that was slowing down responses
25
+ - Added vector store caching to avoid reloading documents
26
+ - Using Hugging Face Inference API for faster text generation
27
+ - Reduced PDF processing to only essential pages
28
+
29
+ ### 2. **Better Conversation Flow**
30
+ - Now responds warmly to greetings like "hello"
31
+ - More conversational and friendly tone
32
+ - Doesn't restrict topics unnecessarily
33
+ - Provides helpful suggestions for what users can ask
34
+
35
+ ### 3. **Technical Optimizations**
36
+ - Smaller chunk sizes (500 chars) for faster retrieval
37
+ - Caching mechanism for vector store
38
+ - Streaming responses for better user experience
39
+ - Removed unnecessary dependencies (torch, transformers)
40
+
41
+ ## Deployment on Hugging Face Spaces
42
+
43
+ 1. Update your `app.py` file with the contents of `app_improved.py`
44
+ 2. Update your `requirements.txt` with the contents of `requirements_improved.txt`
45
+ 3. Add the `HUGGINGFACE_TOKEN` secret in your Space settings
46
+ 4. The app will automatically rebuild and deploy
47
+
48
+ ## Testing Locally
49
+
50
+ Try these test messages to see the improvements:
51
+ - "Hello!" - Should get a warm, helpful response
52
+ - "Tell me about David" - Should provide a comprehensive overview
53
+ - "What's functional difference-in-differences?" - Should give technical details
54
+
55
+ The assistant should now be much faster and more conversational!
README.md CHANGED
@@ -10,16 +10,17 @@ pinned: false
10
  license: mit
11
  ---
12
 
13
- # David Van Dijcke - Research Assistant
14
 
15
- An AI-powered assistant that answers questions about David Van Dijcke's academic research, publications, and career in economics.
16
 
17
  ## Features
18
 
19
- - Answers questions about research interests and publications
20
- - Provides information about econometric methods and causal inference work
21
- - Friendly conversational interface that welcomes casual greetings
22
- - Built with LangChain and Gradio for an interactive experience
 
23
 
24
  ## Getting the Best Performance
25
 
 
10
  license: mit
11
  ---
12
 
13
+ # David Van Dijcke - Econometric Research Assistant
14
 
15
+ An AI-powered assistant specializing in David Van Dijcke's econometric research. David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
16
 
17
  ## Features
18
 
19
+ - **Econometric Methods Focus**: Detailed information about David's methodological contributions
20
+ - **Job Market Paper (R3D)**: Regression Discontinuity Design with Distribution-Valued Outcomes
21
+ - **Technical Expertise**: Functional data analysis, optimal transport, and geometric measure theory
22
+ - **Policy Applications**: How David applies econometric tools to answer questions with big data
23
+ - **Research Portfolio**: Information on FDR, DISCO, RTO, and other papers
24
 
25
  ## Getting the Best Performance
26
 
app.py CHANGED
@@ -81,21 +81,49 @@ class ImprovedResearchAssistant:
81
 
82
  # Enhanced research information
83
  research_info = """
84
- David Van Dijcke is a PhD student in Economics at the University of Michigan, Ann Arbor.
85
- He is on the job market for the 2025-26 academic year.
86
 
87
- RESEARCH FOCUS:
88
- David's research develops novel econometric methods combining causal inference with functional
89
- data analysis and optimal transport to study settings where the outcomes and/or covariates
90
- are functional and high-dimensional.
91
 
92
- KEY RESEARCH AREAS:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
93
  - Econometric methods and theory
94
- - Causal inference with high-dimensional data
95
- - Functional data analysis
96
- - Optimal transport applications in economics
97
- - Labor market policies and their effects
98
- - Mobility patterns in response to crises (COVID-19, conflicts)
99
 
100
  CURRENT POSITIONS:
101
  - Rackham Graduate School Predoctoral Fellow at the University of Michigan (2024-25)
@@ -104,19 +132,12 @@ class ImprovedResearchAssistant:
104
  EDUCATION:
105
  - PhD in Economics, University of Michigan (expected 2026)
106
  - MA in Economics, University of Michigan
107
- - Previous education includes a BA in Theatre, showcasing his interdisciplinary background
108
-
109
- RESEARCH PAPERS:
110
- 1. "Revenue and Production Functions" - Work on firm-level analysis
111
- 2. "Return to Office" - Research on workplace policies post-COVID
112
- 3. "Unmasking Partisanship" - Analysis of political behavior during COVID-19
113
- 4. Work on public response to government alerts during the Russian invasion of Ukraine
114
- 5. Research on econometric methods combining causal inference with functional data analysis
115
 
116
  PERSONALITY:
117
- David is approachable and values clear communication. His background in theatre gives him
118
- unique presentation skills. He enjoys discussing both technical econometric details and
119
- broader policy implications of his work.
120
 
121
  CONTACT:
122
  Email: dvdijcke@umich.edu
@@ -130,25 +151,43 @@ class ImprovedResearchAssistant:
130
 
131
  # Add information about his background
132
  background_info = """
133
- UNIQUE BACKGROUND:
134
- David has an unusual path to economics - he holds a BA in Theatre, which gives him strong
135
- communication and presentation skills. This interdisciplinary background helps him explain
136
- complex econometric concepts in accessible ways.
 
 
137
 
138
- TEACHING:
139
- David has experience teaching various economics courses at the University of Michigan.
140
- He is known for making complex statistical concepts accessible to students.
 
 
 
 
 
 
 
 
 
141
 
142
  TECHNICAL SKILLS:
143
- - Advanced econometric theory
144
- - Programming in R, Python, Stata
145
- - Machine learning applications in economics
146
- - Functional data analysis
147
- - Optimal transport theory
148
-
149
- COLLABORATIONS:
150
- David frequently collaborates with other researchers and is open to new research partnerships.
151
- His work often involves interdisciplinary approaches combining economics with data science.
 
 
 
 
 
 
 
152
  """
153
 
154
  documents.append(Document(
@@ -156,6 +195,57 @@ class ImprovedResearchAssistant:
156
  metadata={"source": "background_info", "type": "personal"}
157
  ))
158
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
159
  # Load PDFs efficiently - only key documents
160
  key_pdfs = [
161
  "CV_DavidVanDijcke.pdf",
@@ -181,8 +271,11 @@ class ImprovedResearchAssistant:
181
  try:
182
  loader = PyPDFLoader(filepath)
183
  pdf_docs = loader.load()
184
- # Add first few pages only for faster loading
185
- documents.extend(pdf_docs[:3])
 
 
 
186
  logger.info(f"Loaded {filename}")
187
  except Exception as e:
188
  logger.error(f"Error loading {filename}: {e}")
@@ -220,22 +313,30 @@ class ImprovedResearchAssistant:
220
 
221
  if self.use_gemini:
222
  # Create prompt for Gemini
223
- prompt = f"""You are a helpful AI assistant for David Van Dijcke's academic website.
224
- You help visitors learn about David's research, publications, and academic career.
 
 
 
 
 
 
 
225
 
226
- Key instructions:
227
- - Be accurate and only use information provided in the context
228
- - If you don't have specific information, say so clearly
229
- - Be conversational and friendly, but precise
230
- - Don't make up papers, publications, or details not in the context
231
- - If asked about papers not mentioned in the context, say you don't have information about that specific paper
 
232
 
233
  Context about David Van Dijcke:
234
  {context}
235
 
236
  User's question: {question}
237
 
238
- Please provide an accurate response based only on the context provided. If the context doesn't contain the information needed to answer the question, please say so clearly."""
239
 
240
  try:
241
  # Configure generation parameters for accuracy
@@ -263,9 +364,9 @@ Please provide an accurate response based only on the context provided. If the c
263
  # Handle greetings and casual conversation
264
  if self.is_greeting_or_casual(message):
265
  greeting_responses = [
266
- "Hello! I'm here to help you learn about David Van Dijcke's research and academic work. What would you like to know?",
267
- "Hi there! Welcome to David's research assistant. I can tell you about his econometric methods, publications, or academic journey. What interests you?",
268
- "Hello! Great to meet you. I'd be happy to share information about David's work in economics, his research papers, or his background. What would you like to explore?",
269
  ]
270
 
271
  # Use message hash to select consistent greeting
@@ -283,9 +384,9 @@ Please provide an accurate response based only on the context provided. If the c
283
  response = self.generate_response(message, context)
284
 
285
  # Add source information if specific papers were referenced
286
- paper_keywords = ["functional difference", "revenue production", "return to office", "unmasking", "ukraine"]
287
  if any(keyword in message.lower() for keyword in paper_keywords):
288
- response += "\n\n*For more details, you can find David's papers on his website.*"
289
 
290
  return response
291
 
@@ -320,19 +421,21 @@ def create_gradio_interface():
320
  # Create the interface with better examples
321
  demo = gr.ChatInterface(
322
  fn=chat_function,
323
- title="Chat with David Van Dijcke's Research Assistant",
324
  description=(
325
- "Hi! I'm here to help you learn about David's research in economics. "
326
- "Feel free to ask about his work, papers, or just say hello! 👋"
 
 
327
  ),
328
  examples=[
329
- "Hello! Who is David?",
330
- "What are David's main research interests?",
331
- "Tell me about functional difference-in-differences",
332
- "What's David's background?",
333
- "Which papers has David published?",
334
  "Is David on the job market?",
335
- "What econometric methods has he developed?"
336
  ],
337
  theme=gr.themes.Soft(
338
  primary_hue="blue",
 
81
 
82
  # Enhanced research information
83
  research_info = """
84
+ David Van Dijcke is a PhD candidate in Economics at the University of Michigan, Ann Arbor.
85
+ He is on the job market for the 2025-26 academic year as an ECONOMETRICIAN.
86
 
87
+ RESEARCH PROFILE:
88
+ David develops cutting-edge econometric methods for functional and high-dimensional data,
89
+ combining tools from functional data analysis, optimal transport, and geometric measure theory.
90
+ He applies these methods to answer important policy questions using big data.
91
 
92
+ CORE ECONOMETRIC CONTRIBUTIONS:
93
+ 1. **R3D: Regression Discontinuity Design with Distribution-Valued Outcomes** (JOB MARKET PAPER)
94
+ - Extends RDD to settings where outcomes are entire distributions
95
+ - Introduces local average quantile treatment effects
96
+ - Applies to income distribution effects of gubernatorial elections
97
+
98
+ 2. **Free Discontinuity Regression (FDR)**
99
+ - Non-parametric method to detect and estimate multivariate discontinuities
100
+ - Based on convex relaxation of the Mumford-Shah functional
101
+ - Applied to estimate economic costs of internet shutdowns in India
102
+
103
+ 3. **Distributional Synthetic Controls (DISCO)**
104
+ - Software implementation for studying distributional policy effects
105
+ - Uses optimal transport to match entire distributions
106
+ - Provides both quantile and CDF-based approaches
107
+
108
+ 4. **Return to Office and the Tenure Distribution**
109
+ - Applies distributional synthetic controls to 260 million resumes
110
+ - Studies effects of RTO mandates on employee tenure distributions
111
+ - Develops bootstrapped uniform confidence intervals
112
+
113
+ KEY TECHNICAL INNOVATIONS:
114
+ - Functional data analysis: Working with distribution-valued outcomes
115
+ - Optimal transport theory: Matching and comparing distributions
116
+ - Geometric measure theory: Detecting discontinuities in multivariate settings
117
+ - Asymptotic theory: Establishing inference for novel estimators
118
+ - Big data applications: Scalable methods for massive datasets
119
+
120
+ RESEARCH AREAS:
121
  - Econometric methods and theory
122
+ - Causal inference with functional data
123
+ - Distribution-valued treatment effects
124
+ - Spatial and geographic discontinuities
125
+ - Labor market dynamics and firm policies
126
+ - Economic impacts of digital infrastructure
127
 
128
  CURRENT POSITIONS:
129
  - Rackham Graduate School Predoctoral Fellow at the University of Michigan (2024-25)
 
132
  EDUCATION:
133
  - PhD in Economics, University of Michigan (expected 2026)
134
  - MA in Economics, University of Michigan
135
+ - BA in Theatre (demonstrating communication skills and creativity)
 
 
 
 
 
 
 
136
 
137
  PERSONALITY:
138
+ David combines rigorous technical expertise with strong communication skills.
139
+ His theatre background helps him present complex econometric concepts clearly.
140
+ He values both theoretical rigor and practical policy relevance.
141
 
142
  CONTACT:
143
  Email: dvdijcke@umich.edu
 
151
 
152
  # Add information about his background
153
  background_info = """
154
+ ECONOMETRIC EXPERTISE:
155
+ David specializes in developing econometric methods at the intersection of:
156
+ - Functional data analysis (working with curve and distribution-valued data)
157
+ - Optimal transport theory (comparing and matching distributions)
158
+ - Geometric measure theory (detecting discontinuities and boundaries)
159
+ - Causal inference (identifying treatment effects)
160
 
161
+ METHODOLOGICAL CONTRIBUTIONS:
162
+ 1. **Distribution-valued treatment effects**: Extending causal inference beyond scalar outcomes
163
+ 2. **Discontinuity detection**: Finding unknown boundaries in multivariate settings
164
+ 3. **Functional regression**: Adapting RDD and synthetic controls to functional data
165
+ 4. **Big data econometrics**: Scalable methods for massive datasets
166
+
167
+ APPLIED WORK:
168
+ David applies his econometric tools to important policy questions:
169
+ - Labor market dynamics (return-to-office policies, tenure distributions)
170
+ - Digital infrastructure (economic costs of internet shutdowns)
171
+ - Political economy (distributional effects of elections)
172
+ - Crisis responses (COVID-19, Ukraine conflict)
173
 
174
  TECHNICAL SKILLS:
175
+ - Advanced econometric theory and asymptotics
176
+ - Programming: R, Python, Stata, Julia
177
+ - Functional data analysis packages
178
+ - Optimal transport algorithms
179
+ - High-performance computing for big data
180
+
181
+ TEACHING & COMMUNICATION:
182
+ - Makes complex econometric concepts accessible
183
+ - Theatre background enhances presentation skills
184
+ - Experience teaching econometrics and statistics
185
+ - Clear technical writing for top journals
186
+
187
+ RESEARCH PHILOSOPHY:
188
+ David believes in developing rigorous econometric theory that solves real-world problems.
189
+ He combines mathematical sophistication with practical relevance, ensuring his methods
190
+ are both theoretically sound and empirically useful.
191
  """
192
 
193
  documents.append(Document(
 
195
  metadata={"source": "background_info", "type": "personal"}
196
  ))
197
 
198
+ # Add detailed paper summaries
199
+ paper_summaries = """
200
+ DETAILED PAPER SUMMARIES:
201
+
202
+ 1. **R3D: Regression Discontinuity Design with Distribution-Valued Outcomes** (JOB MARKET PAPER)
203
+ - Problem: Standard RDD only estimates effects on mean outcomes, missing distributional impacts
204
+ - Innovation: Extends RDD to estimate effects on entire outcome distributions
205
+ - Method: Local polynomial regression on random quantiles with uniform confidence bands
206
+ - Theory: Establishes local average quantile treatment effects (LAQTE)
207
+ - Application: Studies how gubernatorial party affects state income distributions
208
+ - Finding: Democratic governors reduce income inequality, especially at lower quantiles
209
+
210
+ 2. **Free Discontinuity Regression (FDR)**
211
+ - Problem: Unknown location of multivariate discontinuities (e.g., geographic borders)
212
+ - Innovation: Detects and estimates discontinuities without prior knowledge of location
213
+ - Method: Convex relaxation of Mumford-Shah functional from image processing
214
+ - Theory: Proves identification and convergence of the segmented regression surface
215
+ - Application: Internet shutdowns in India using 48 billion mobile transactions
216
+ - Finding: Shutdowns reduce economic activity by 25-35% in affected regions
217
+
218
+ 3. **Distributional Synthetic Controls (DISCO)**
219
+ - Problem: Standard synthetic controls only match on means, not distributions
220
+ - Innovation: Constructs synthetic distributions using optimal transport
221
+ - Method: Matches entire CDFs or quantile functions across units
222
+ - Software: R package with quantile and CDF approaches, bootstrap inference
223
+ - Features: Multiple aggregation schemes, permutation tests, visualization tools
224
+
225
+ 4. **Return to Office and the Tenure Distribution**
226
+ - Problem: How do RTO mandates affect employee tenure beyond just averages?
227
+ - Innovation: First application of distributional synthetic controls to labor markets
228
+ - Method: Analyzes 260 million resumes to construct tenure distributions
229
+ - Theory: Develops bootstrapped uniform confidence intervals for DiSCo
230
+ - Finding: RTO mandates significantly alter tenure distributions at tech firms
231
+
232
+ 5. **Revenue and Production Functions**
233
+ - Focus: Functional data analysis in firm-level production economics
234
+ - Innovation: Treats production processes as functional objects
235
+ - Method: Applies functional regression to production function estimation
236
+
237
+ COMMON THEMES:
238
+ - Moving beyond scalar outcomes to functional/distributional outcomes
239
+ - Rigorous asymptotic theory for novel estimators
240
+ - Large-scale empirical applications with big data
241
+ - Bridging pure econometric theory with policy relevance
242
+ """
243
+
244
+ documents.append(Document(
245
+ page_content=paper_summaries,
246
+ metadata={"source": "paper_summaries", "type": "research"}
247
+ ))
248
+
249
  # Load PDFs efficiently - only key documents
250
  key_pdfs = [
251
  "CV_DavidVanDijcke.pdf",
 
271
  try:
272
  loader = PyPDFLoader(filepath)
273
  pdf_docs = loader.load()
274
+ # For job market paper, load more pages
275
+ if "r3d" in filename.lower():
276
+ documents.extend(pdf_docs[:10]) # Abstract, intro, and key sections
277
+ else:
278
+ documents.extend(pdf_docs[:5]) # First 5 pages for other papers
279
  logger.info(f"Loaded {filename}")
280
  except Exception as e:
281
  logger.error(f"Error loading {filename}: {e}")
 
313
 
314
  if self.use_gemini:
315
  # Create prompt for Gemini
316
+ prompt = f"""You are an expert AI assistant for David Van Dijcke's academic website, specializing in his ECONOMETRIC research.
317
+ David is an econometrician on the 2025-26 job market who develops novel methods for functional and high-dimensional data.
318
+
319
+ Key points to emphasize:
320
+ - David is an ECONOMETRICIAN who develops new statistical methods
321
+ - His job market paper is R3D (Regression Discontinuity Design with Distribution-Valued Outcomes)
322
+ - He combines functional data analysis, optimal transport, and geometric measure theory
323
+ - He applies these methods to answer policy questions with big data
324
+ - His work extends beyond scalar outcomes to distribution-valued outcomes
325
 
326
+ Instructions:
327
+ - Emphasize his econometric contributions and methodological innovations
328
+ - Highlight how his methods combine theory with policy applications
329
+ - Be precise about technical details when discussing his papers
330
+ - Make clear he develops econometric TOOLS, not just applications
331
+ - If asked about specific papers, provide technical details from the context
332
+ - Be friendly but professional, as befits an academic website
333
 
334
  Context about David Van Dijcke:
335
  {context}
336
 
337
  User's question: {question}
338
 
339
+ Provide an accurate, professional response that emphasizes David's econometric expertise and contributions to the field."""
340
 
341
  try:
342
  # Configure generation parameters for accuracy
 
364
  # Handle greetings and casual conversation
365
  if self.is_greeting_or_casual(message):
366
  greeting_responses = [
367
+ "Hello! I'm here to help you learn about David Van Dijcke, an econometrician on the 2025-26 job market. He develops cutting-edge methods for functional and high-dimensional data. What would you like to know about his research?",
368
+ "Hi! Welcome to David Van Dijcke's research assistant. David is an econometrician who combines functional data analysis, optimal transport, and geometric measure theory to develop new causal inference methods. How can I help you learn about his work?",
369
+ "Hello! I can tell you about David Van Dijcke's econometric research, including his job market paper on distribution-valued treatment effects (R3D) and his other methodological contributions. What aspect of his work interests you?",
370
  ]
371
 
372
  # Use message hash to select consistent greeting
 
384
  response = self.generate_response(message, context)
385
 
386
  # Add source information if specific papers were referenced
387
+ paper_keywords = ["r3d", "regression discontinuity", "free discontinuity", "fdr", "disco", "distributional synthetic", "return to office", "rto", "revenue", "production function", "unmasking", "ukraine"]
388
  if any(keyword in message.lower() for keyword in paper_keywords):
389
+ response += "\n\n*For more details, you can find David's papers on his website at https://dvandijcke.github.io*"
390
 
391
  return response
392
 
 
421
  # Create the interface with better examples
422
  demo = gr.ChatInterface(
423
  fn=chat_function,
424
+ title="David Van Dijcke - Econometrician | Job Market 2025-26",
425
  description=(
426
+ "Welcome! I'm an AI assistant specializing in David Van Dijcke's econometric research. "
427
+ "David develops novel methods for functional and high-dimensional data, combining functional data analysis, "
428
+ "optimal transport, and geometric measure theory. Ask me about his job market paper (R3D), "
429
+ "his econometric innovations, or how he applies these methods to policy questions with big data."
430
  ),
431
  examples=[
432
+ "Hello! Who is David Van Dijcke?",
433
+ "What econometric methods has David developed?",
434
+ "Tell me about R3D (his job market paper)",
435
+ "How does David use optimal transport in econometrics?",
436
+ "What is functional data analysis in David's work?",
437
  "Is David on the job market?",
438
+ "What are distribution-valued treatment effects?"
439
  ],
440
  theme=gr.themes.Soft(
441
  primary_hue="blue",
app_improved.py ADDED
@@ -0,0 +1,321 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import gradio as gr
3
+ from typing import List, Tuple
4
+ import json
5
+ from datetime import datetime
6
+ import hashlib
7
+
8
+ # Import only what we need for better performance
9
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
10
+ from langchain.document_loaders import PyPDFLoader
11
+ from langchain_community.embeddings import HuggingFaceEmbeddings
12
+ from langchain_community.vectorstores import FAISS
13
+ from langchain.schema import Document
14
+ from huggingface_hub import InferenceClient
15
+ import logging
16
+
17
+ # Set up logging
18
+ logging.basicConfig(level=logging.INFO)
19
+ logger = logging.getLogger(__name__)
20
+
21
+ class ImprovedResearchAssistant:
22
+ def __init__(self):
23
+ # Use a lightweight embedding model
24
+ self.embeddings = HuggingFaceEmbeddings(
25
+ model_name="sentence-transformers/all-MiniLM-L6-v2",
26
+ model_kwargs={'device': 'cpu'},
27
+ encode_kwargs={'normalize_embeddings': True}
28
+ )
29
+
30
+ # Initialize InferenceClient for faster responses
31
+ self.client = InferenceClient(
32
+ "mistralai/Mixtral-8x7B-Instruct-v0.1",
33
+ token=os.getenv("HUGGINGFACE_TOKEN")
34
+ )
35
+
36
+ self.vector_store = None
37
+ self.conversation_history = []
38
+
39
+ # Check if we have a cached vector store
40
+ self.cache_path = "vector_store_cache"
41
+ if os.path.exists(self.cache_path):
42
+ logger.info("Loading cached vector store...")
43
+ self.vector_store = FAISS.load_local(self.cache_path, self.embeddings)
44
+ else:
45
+ logger.info("Building vector store from documents...")
46
+ self.load_documents()
47
+
48
+ def load_documents(self):
49
+ """Load all documents about the researcher with caching"""
50
+ documents = []
51
+
52
+ # Enhanced research information
53
+ research_info = """
54
+ David Van Dijcke is a PhD student in Economics at the University of Michigan, Ann Arbor.
55
+ He is on the job market for the 2025-26 academic year.
56
+
57
+ RESEARCH FOCUS:
58
+ David's research develops novel econometric methods combining causal inference with functional
59
+ data analysis and optimal transport to study settings where the outcomes and/or covariates
60
+ are functional and high-dimensional.
61
+
62
+ KEY RESEARCH AREAS:
63
+ - Econometric methods and theory
64
+ - Causal inference with high-dimensional data
65
+ - Functional data analysis
66
+ - Optimal transport applications in economics
67
+ - Labor market policies and their effects
68
+ - Mobility patterns in response to crises (COVID-19, conflicts)
69
+
70
+ CURRENT POSITIONS:
71
+ - Rackham Graduate School Predoctoral Fellow at the University of Michigan (2024-25)
72
+ - Academic Visitor at the Bank of England
73
+
74
+ EDUCATION:
75
+ - PhD in Economics, University of Michigan (expected 2026)
76
+ - MA in Economics, University of Michigan
77
+ - Previous education includes a BA in Theatre, showcasing his interdisciplinary background
78
+
79
+ RESEARCH PAPERS:
80
+ 1. "Functional Difference-in-Differences" - His job market paper developing new econometric methods
81
+ 2. "Revenue and Production Functions" - Work on firm-level analysis
82
+ 3. "Return to Office" - Research on workplace policies post-COVID
83
+ 4. "Unmasking Partisanship" - Analysis of political behavior during COVID-19
84
+ 5. Work on public response to government alerts during the Russian invasion of Ukraine
85
+
86
+ PERSONALITY:
87
+ David is approachable and values clear communication. His background in theatre gives him
88
+ unique presentation skills. He enjoys discussing both technical econometric details and
89
+ broader policy implications of his work.
90
+
91
+ CONTACT:
92
+ Email: dvdijcke@umich.edu
93
+ Website: https://dvandijcke.github.io
94
+ """
95
+
96
+ documents.append(Document(
97
+ page_content=research_info,
98
+ metadata={"source": "website_overview", "type": "general_info"}
99
+ ))
100
+
101
+ # Add information about his background
102
+ background_info = """
103
+ UNIQUE BACKGROUND:
104
+ David has an unusual path to economics - he holds a BA in Theatre, which gives him strong
105
+ communication and presentation skills. This interdisciplinary background helps him explain
106
+ complex econometric concepts in accessible ways.
107
+
108
+ TEACHING:
109
+ David has experience teaching various economics courses at the University of Michigan.
110
+ He is known for making complex statistical concepts accessible to students.
111
+
112
+ TECHNICAL SKILLS:
113
+ - Advanced econometric theory
114
+ - Programming in R, Python, Stata
115
+ - Machine learning applications in economics
116
+ - Functional data analysis
117
+ - Optimal transport theory
118
+
119
+ COLLABORATIONS:
120
+ David frequently collaborates with other researchers and is open to new research partnerships.
121
+ His work often involves interdisciplinary approaches combining economics with data science.
122
+ """
123
+
124
+ documents.append(Document(
125
+ page_content=background_info,
126
+ metadata={"source": "background_info", "type": "personal"}
127
+ ))
128
+
129
+ # Load PDFs efficiently - only key documents
130
+ key_pdfs = [
131
+ "CV_DavidVanDijcke.pdf",
132
+ "disco.pdf",
133
+ "fdr.pdf",
134
+ "r3d_arxiv_4apr2025.pdf",
135
+ "rto.pdf",
136
+ "unmasking_partisanship.pdf"
137
+ ]
138
+
139
+ documents_dir = "documents"
140
+ if os.path.exists(documents_dir):
141
+ for filename in key_pdfs:
142
+ filepath = os.path.join(documents_dir, filename)
143
+ if os.path.exists(filepath):
144
+ try:
145
+ loader = PyPDFLoader(filepath)
146
+ pdf_docs = loader.load()
147
+ # Add first few pages only for faster loading
148
+ documents.extend(pdf_docs[:3])
149
+ logger.info(f"Loaded {filename}")
150
+ except Exception as e:
151
+ logger.error(f"Error loading {filename}: {e}")
152
+
153
+ # Split documents with optimized chunk size
154
+ text_splitter = RecursiveCharacterTextSplitter(
155
+ chunk_size=500, # Smaller chunks for faster retrieval
156
+ chunk_overlap=50,
157
+ length_function=len
158
+ )
159
+ splits = text_splitter.split_documents(documents)
160
+
161
+ # Create and cache vector store
162
+ self.vector_store = FAISS.from_documents(splits, self.embeddings)
163
+
164
+ # Save to cache
165
+ try:
166
+ self.vector_store.save_local(self.cache_path)
167
+ logger.info("Vector store cached successfully")
168
+ except Exception as e:
169
+ logger.error(f"Failed to cache vector store: {e}")
170
+
171
+ def is_greeting_or_casual(self, message: str) -> bool:
172
+ """Check if the message is a greeting or casual conversation starter"""
173
+ greetings = [
174
+ "hello", "hi", "hey", "good morning", "good afternoon", "good evening",
175
+ "how are you", "what's up", "greetings", "howdy", "hola", "bonjour"
176
+ ]
177
+
178
+ message_lower = message.lower().strip()
179
+ return any(greeting in message_lower for greeting in greetings) or len(message_lower.split()) <= 3
180
+
181
+ def generate_response(self, question: str, context: str) -> str:
182
+ """Generate response using the Inference API for faster results"""
183
+
184
+ # Create a conversational prompt
185
+ prompt = f"""You are a friendly AI assistant for David Van Dijcke's academic website.
186
+ You help visitors learn about David's research, publications, and academic career in a warm,
187
+ conversational manner. Be helpful and engaging, not overly formal.
188
+
189
+ Context about David:
190
+ {context}
191
+
192
+ User's question: {question}
193
+
194
+ Instructions:
195
+ - If it's a greeting, respond warmly and offer to help
196
+ - For research questions, provide detailed, accurate information
197
+ - Be conversational and friendly, not stiff or robotic
198
+ - If you don't have specific information, acknowledge it politely
199
+ - Feel free to suggest related topics the user might be interested in
200
+
201
+ Response:"""
202
+
203
+ try:
204
+ # Use streaming for faster perceived response
205
+ response = self.client.text_generation(
206
+ prompt,
207
+ max_new_tokens=300,
208
+ temperature=0.7,
209
+ top_p=0.95,
210
+ repetition_penalty=1.1,
211
+ do_sample=True
212
+ )
213
+ return response
214
+ except Exception as e:
215
+ logger.error(f"Error generating response: {e}")
216
+ return "I apologize, but I'm having trouble generating a response right now. Could you please try again?"
217
+
218
+ def answer_question(self, message: str, history: List[Tuple[str, str]] = None) -> str:
219
+ """Answer a question about the researcher"""
220
+
221
+ # Handle greetings and casual conversation
222
+ if self.is_greeting_or_casual(message):
223
+ greeting_responses = [
224
+ "Hello! I'm here to help you learn about David Van Dijcke's research and academic work. What would you like to know?",
225
+ "Hi there! Welcome to David's research assistant. I can tell you about his econometric methods, publications, or academic journey. What interests you?",
226
+ "Hello! Great to meet you. I'd be happy to share information about David's work in economics, his research papers, or his background. What would you like to explore?",
227
+ ]
228
+
229
+ # Use message hash to select consistent greeting
230
+ response_index = int(hashlib.md5(message.encode()).hexdigest(), 16) % len(greeting_responses)
231
+ return greeting_responses[response_index]
232
+
233
+ try:
234
+ # Retrieve relevant documents
235
+ docs = self.vector_store.similarity_search(message, k=3)
236
+
237
+ # Combine context from retrieved documents
238
+ context = "\n".join([doc.page_content for doc in docs])
239
+
240
+ # Generate response
241
+ response = self.generate_response(message, context)
242
+
243
+ # Add source information if specific papers were referenced
244
+ paper_keywords = ["functional difference", "revenue production", "return to office", "unmasking", "ukraine"]
245
+ if any(keyword in message.lower() for keyword in paper_keywords):
246
+ response += "\n\n*For more details, you can find David's papers on his website.*"
247
+
248
+ return response
249
+
250
+ except Exception as e:
251
+ logger.error(f"Error in answer_question: {e}")
252
+ return "I apologize, but I'm having trouble accessing the information right now. Please try rephrasing your question or ask about David's research areas, publications, or academic background."
253
+
254
+ # Create optimized Gradio interface
255
+ def create_gradio_interface():
256
+ assistant = ImprovedResearchAssistant()
257
+
258
+ def chat_function(message, history):
259
+ return assistant.answer_question(message, history)
260
+
261
+ # Modern, clean CSS
262
+ custom_css = """
263
+ #chatbot {
264
+ height: 600px;
265
+ }
266
+ .gradio-container {
267
+ font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
268
+ max-width: 900px;
269
+ margin: auto;
270
+ }
271
+ .user-message, .bot-message {
272
+ padding: 15px;
273
+ border-radius: 10px;
274
+ margin: 10px 0;
275
+ }
276
+ """
277
+
278
+ # Create the interface with better examples
279
+ demo = gr.ChatInterface(
280
+ fn=chat_function,
281
+ title="Chat with David Van Dijcke's Research Assistant",
282
+ description=(
283
+ "Hi! I'm here to help you learn about David's research in economics. "
284
+ "Feel free to ask about his work, papers, or just say hello! 👋"
285
+ ),
286
+ examples=[
287
+ "Hello! Who is David?",
288
+ "What are David's main research interests?",
289
+ "Tell me about functional difference-in-differences",
290
+ "What's David's background?",
291
+ "Which papers has David published?",
292
+ "Is David on the job market?",
293
+ "What econometric methods has he developed?"
294
+ ],
295
+ theme=gr.themes.Soft(
296
+ primary_hue="blue",
297
+ secondary_hue="gray",
298
+ neutral_hue="gray",
299
+ font=gr.themes.GoogleFont("Inter")
300
+ ),
301
+ css=custom_css,
302
+ retry_btn="Retry",
303
+ undo_btn="Undo",
304
+ clear_btn="Clear Chat",
305
+ submit_btn="Send",
306
+ autofocus=True
307
+ )
308
+
309
+ return demo
310
+
311
+ if __name__ == "__main__":
312
+ # Set cache directory
313
+ os.makedirs("vector_store_cache", exist_ok=True)
314
+
315
+ demo = create_gradio_interface()
316
+ demo.launch(
317
+ share=False,
318
+ server_name="0.0.0.0",
319
+ server_port=7860,
320
+ show_error=True
321
+ )
requirements_improved.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ gradio==4.19.2
2
+ langchain==0.1.9
3
+ langchain-community==0.0.24
4
+ sentence-transformers==2.5.1
5
+ faiss-cpu==1.7.4
6
+ pypdf==4.0.2
7
+ huggingface-hub==0.20.3
8
+ python-dotenv==1.0.1
requirements_simple.txt ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ gradio==4.19.2
2
+ langchain==0.1.9
3
+ langchain-community==0.0.24
4
+ sentence-transformers==2.5.1
5
+ faiss-cpu==1.7.4
6
+ pypdf==4.0.2
7
+ google-generativeai==0.8.3
8
+ python-dotenv==1.0.1
test_assistant.py ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for David Van Dijcke's Research Assistant
4
+ Tests key functionality and econometric focus
5
+ """
6
+
7
+ import os
8
+ from app import ImprovedResearchAssistant
9
+
10
+ def test_assistant():
11
+ """Test the assistant with various queries"""
12
+ print("Testing David Van Dijcke's Research Assistant...\n")
13
+
14
+ # Initialize assistant
15
+ assistant = ImprovedResearchAssistant()
16
+
17
+ # Test queries
18
+ test_queries = [
19
+ "Hello!",
20
+ "Who is David Van Dijcke?",
21
+ "What is David's job market paper about?",
22
+ "Tell me about R3D",
23
+ "What econometric methods has David developed?",
24
+ "How does David use optimal transport in his research?",
25
+ "What is functional data analysis?",
26
+ "Tell me about the Free Discontinuity Regression paper",
27
+ "What policy applications does David's research have?",
28
+ "Is David on the job market?"
29
+ ]
30
+
31
+ for i, query in enumerate(test_queries, 1):
32
+ print(f"\n{'='*60}")
33
+ print(f"Test {i}: {query}")
34
+ print('='*60)
35
+
36
+ try:
37
+ response = assistant.answer_question(query)
38
+ print(f"Response: {response}")
39
+
40
+ # Check for key terms in responses
41
+ if i == 2: # "Who is David" query
42
+ assert "econometrician" in response.lower() or "econometric" in response.lower()
43
+ print("✓ Correctly identifies David as an econometrician")
44
+
45
+ if i == 4: # R3D query
46
+ assert "distribution" in response.lower() or "r3d" in response.lower()
47
+ print("✓ Mentions distribution-valued outcomes")
48
+
49
+ except Exception as e:
50
+ print(f"❌ Error: {e}")
51
+
52
+ print("\n" + "="*60)
53
+ print("Testing complete!")
54
+
55
+ if __name__ == "__main__":
56
+ # Set up environment
57
+ if not os.getenv("GOOGLE_API_KEY"):
58
+ print("Warning: No GOOGLE_API_KEY found. Using limited mode.")
59
+ print("For best results, add your API key to .env file\n")
60
+
61
+ test_assistant()