keefereuther committed
Commit 0911b7f · 1 Parent(s): 53f4246

Update to Responses API with GPT-5.1 support, web search, and improved streaming

Files changed (4)
  1. README.md +79 -24
  2. app.py +167 -8
  3. config.py +19 -12
  4. terms.csv +17 -126
README.md CHANGED
@@ -13,56 +13,111 @@ short_description: AI-enhanced study app for biology students
 
 # Schema Study: An AI-Enhanced Study App for Biology Students
 
- Schema Study is a modern, interactive study app designed to help biology students master core course concepts through AI-powered conversations. The app leverages OpenAI's GPT models to provide instant feedback, Socratic questioning, and personalized study support.
+ Schema Study is a modern, interactive study app designed to help biology students master core course concepts through AI-powered conversations. The app leverages OpenAI's latest GPT models via the Responses API to provide instant feedback, Socratic questioning, and personalized study support.
 
 ## Features
- - **Password Protection:** Secure access for your class or group.
+ 
+ - **API Key Authentication:** Enter your OpenAI API key in the sidebar to enable chat functionality.
 - **Customizable Terms:** Use your own CSV file of terms and definitions.
 - **Prompt Templates:** Engage with the material using creative, research-based prompts.
- - **AI-Enhanced Feedback:** Get instant, formative feedback and guidance.
+ - **AI-Enhanced Feedback:** Get instant, formative feedback and guidance using GPT-5.1 (default) or GPT-4.1.
+ - **Web Search Support:** Optional web search functionality for current information and citations (configurable in `config.py`).
+ - **Real-Time Streaming:** Live token-by-token response streaming with a visual typing indicator.
 - **Professional, Accessible UI:** Clean, modern design with a color palette for clarity and focus.
 
 ## How to Use (Students)
- 1. **Access the App:** Go to your Hugging Face Space URL. Enter the password provided by your instructor.
- 2. **Select a Term:** Use the dropdown to pick a course term.
- 3. **Start Studying:** Respond to the prompt or use a template button to begin your session.
- 4. **Chat with the AI:** Ask questions, answer prompts, and explore the term in depth.
+ 
+ 1. **Access the App:** Go to your Hugging Face Space URL.
+ 2. **Enter API Key:** Provide your OpenAI API key in the sidebar configuration section.
+ 3. **Select a Term:** Use the dropdown to pick a course term.
+ 4. **Start Studying:** Respond to the prompt or use a template button to begin your session.
+ 5. **Chat with the AI:** Ask questions, answer prompts, and explore the term in depth.
 
 ## How to Use (Instructors)
+ 
+ ### Setup
+ 
 1. **Clone or Fork the Space:**
+ 
    ```bash
    git clone https://huggingface.co/spaces/<your-username>/<your-space-name>
    cd <your-space-name>
    ```
+ 
 2. **Edit Configuration:**
+ 
    - Update `config.py` for your course (title, instructions, prompt templates, etc.).
-    - Place your terms CSV (e.g., `all_terms.csv`) in the root directory. Format: first column = term, second column = context/definition.
- 3. **Set Secrets:**
-    - In your Space, go to **Settings > Repository secrets** and add:
-      - `OPENAI_API_KEY` (your OpenAI API key)
-      - `username` (for app login)
-      - `password` (for app login)
+    - Configure AI model settings:
+      - `ai_model`: Choose "gpt-5.1" (default) or "gpt-4.1"
+      - `reasoning_effort`: For GPT-5.1, set to "none" (fastest), "minimal", "low", or "medium"
+      - `enable_web_search`: Set to `True` or `False` (default: `True`)
+    - Place your terms CSV (e.g., `terms.csv`) in the root directory. Format: first column = term, second column = context/definition.
+ 
+ 3. **Set Secrets (Optional):**
+ 
+    - To use Streamlit secrets instead of the sidebar API key input, create a `.streamlit/secrets.toml` file locally or use Hugging Face Space secrets:
+      ```toml
+      OPENAI_API_KEY = "your_openai_api_key"
+      ```
+    - For Hugging Face Spaces, go to **Settings > Repository secrets** and add the API key.
+ 
 4. **Push Changes:**
+ 
    ```bash
    git add .
    git commit -m "Update configuration and terms"
    git push
    ```
 
+ ### Model Selection Guide
+ 
+ - **GPT-5.1** (default): Best for most use cases; fastest with reasoning="none"; supports web search.
+ - **GPT-4.1**: Use if you need temperature control or prefer a non-reasoning model; excellent web search support.
+ 
 ## Configuration
- - All settings are in `config.py` (title, instructions, prompt templates, resources, AI model parameters, etc).
- - Theming is managed via `.streamlit/config.toml` and custom CSS in `app.py`.
- - Dependencies are listed in `requirements.txt`.
- 
- ## File Structure
- - `app.py` — Main Streamlit app
- - `config.py` — All app settings and customization
- - `.streamlit/config.toml` — Theme colors
- - `requirements.txt` — Python dependencies
- - `all_terms.csv` — Your course terms and definitions
- - `BILD_5_Syllabus_Reuther_SP25.pdf` — Example resource
+ 
+ ### AI Model Settings (`config.py`)
+ 
+ - **Default Model:** GPT-5.1 with reasoning="none" for faster responses
+ - **Alternative Model:** GPT-4.1 with temperature control
+ - **Web Search:** Configurable via `enable_web_search` (default: `True`)
+ - **Reasoning Effort:** Configurable for GPT-5.1 (options: "none", "minimal", "low", "medium")
+ - **Temperature:** Configurable for GPT-4.1 (0.0-2.0)
+ 
+ ### Other Settings
+ 
+ All settings are in `config.py` (title, instructions, prompt templates, resources, AI model parameters, etc.).
+ 
+ Theming is managed via `.streamlit/config.toml` and custom CSS in `app.py`.
+ 
+ Dependencies are listed in `requirements.txt`.
+ 
+ ## Technical Details
+ 
+ ### API & Models
+ 
+ - **API Framework:** OpenAI Responses API (streaming-enabled)
+ - **Supported Models:** GPT-5.1 (default), GPT-4.1
+ - **Streaming:** Real-time token-by-token response streaming with event-based handling
+ - **Inactivity Guard:** Streaming stops after 60s with no server deltas
+ - **Error Handling:** Comprehensive error handling for streaming events and recovery
+ 
+ ### File Structure
+ 
+ `app.py` — Main Streamlit app with Responses API integration
+ 
+ `config.py` — All app settings and customization (model selection, web search, prompt templates)
+ 
+ `.streamlit/secrets.toml` — Optional API key storage (not tracked in git)
+ 
+ `requirements.txt` — Python dependencies
+ 
+ `terms.csv` — Your course terms and definitions (CSV format: term, context)
+ 
+ `BILD_5_Syllabus_Reuther_F25.pdf` — Example resource
 
 ## License
+ 
 This project is licensed under the GNU GPL-3 License. See the [LICENSE](LICENSE) file for details.
 
 ## Acknowledgments
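The terms CSV format in the README above (first column = term, second column = context/definition, with a `TERM,CONTEXT` header) can be sanity-checked before uploading. A minimal sketch using Python's standard `csv` module; the `load_terms` helper and sample rows are illustrative, not part of the app:

```python
import csv
import io

# Illustrative two-column terms file in the format the README describes:
# first column = term, second column = context/definition.
SAMPLE = """TERM,CONTEXT
Null Hypothesis,Assumes no difference or association; target for statistical rejection.
tidy data,Each row is an observation and each column a variable.
"""

def load_terms(text: str) -> dict:
    """Parse a terms CSV into a {term: context} mapping, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the TERM,CONTEXT header
    return {row[0]: row[1] for row in reader if len(row) >= 2}

terms = load_terms(SAMPLE)
print(sorted(terms))  # -> ['Null Hypothesis', 'tidy data']
```

Rows with fewer than two columns are skipped rather than crashing the load, which mirrors the defensive handling a classroom CSV usually needs.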
app.py CHANGED
@@ -9,8 +9,10 @@ import streamlit as st  # Web app framework
 import pandas as pd  # Data handling
 import os  # File operations
 import logging  # Logging functionality
+ import time  # Time operations for the streaming inactivity guard
 import config  # Local configuration module
 from openai import OpenAI  # OpenAI API client
+ from typing import Dict, List, Any, Optional  # Type hints
 
 # Set up logging to track app activity
 logging.basicConfig(filename='app.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
@@ -160,6 +162,79 @@ st.markdown("""
 </style>
 """, unsafe_allow_html=True)
 
+ ############################################################################################################
+ # Model Configuration - Responses API model capabilities
+ ############################################################################################################
+ 
+ # Model configurations with capability flags
+ MODEL_CONFIGS = {
+     "gpt-5.1": {
+         "api_type": "responses",
+         "supports_reasoning": True,
+         "supports_verbosity": False,
+         "supports_temperature": False,  # do NOT send temperature
+         "supports_max_tokens": True,    # maps to max_output_tokens
+         "supports_web_search": True,    # Supported
+     },
+     "gpt-4.1": {
+         "api_type": "responses",
+         "supports_reasoning": False,
+         "supports_verbosity": False,
+         "supports_temperature": True,
+         "supports_max_tokens": True,    # maps to max_output_tokens
+         "supports_web_search": True,    # Confirmed working
+     },
+ }
+ 
+ def get_model_config(model: str) -> dict:
+     """Get configuration for a specific model."""
+     return MODEL_CONFIGS.get(
+         model,
+         {
+             "api_type": "responses",
+             "supports_reasoning": False,
+             "supports_verbosity": False,
+             "supports_temperature": True,
+             "supports_max_tokens": True,
+             "supports_web_search": False,
+         },
+     )
+ 
+ def build_request_data(
+     model: str,
+     messages: List[Dict[str, str]],
+     reasoning_effort: Optional[str] = None,
+     temperature: Optional[float] = None,
+     max_tokens: Optional[int] = None,
+     enable_web_search: bool = False,
+ ) -> dict:
+     """Build a Responses API request body using capability flags."""
+     model_config = get_model_config(model)
+ 
+     # Base payload
+     request_data = {"model": model, "input": messages}
+ 
+     # Web search tool (if enabled and supported by the model)
+     if enable_web_search and model_config.get("supports_web_search", False):
+         request_data["tools"] = [{"type": "web_search", "search_context_size": "low"}]
+         request_data["tool_choice"] = "auto"  # Let the model decide when to search
+         # Web search is incompatible with reasoning effort, so disable it
+         reasoning_effort = None
+ 
+     # GPT-5.1: reasoning (top-level) - only if web search is not enabled
+     if model_config["supports_reasoning"] and reasoning_effort and not enable_web_search:
+         request_data["reasoning"] = {"effort": reasoning_effort}
+ 
+     # Temperature (only if supported)
+     if model_config["supports_temperature"] and temperature is not None:
+         request_data["temperature"] = temperature
+ 
+     # Map max_tokens -> Responses API max_output_tokens
+     if model_config["supports_max_tokens"] and max_tokens is not None:
+         request_data["max_output_tokens"] = max_tokens
+ 
+     return request_data
+ 
 with st.sidebar:
     st.header("Configuration")
 
@@ -357,21 +432,105 @@ with right_col:
         for m in st.session_state["display_messages"]
     ]
 
-     # Create streaming completion
-     stream = client.chat.completions.create(
+     # Get model configuration
+     model_config = get_model_config(st.session_state["openai_model"])
+ 
+     # Prepare parameters based on model support
+     reasoning_effort_param = None
+     temperature_param = None
+     enable_web_search_param = False
+ 
+     # Check if web search is enabled and supported
+     if config.enable_web_search and model_config.get("supports_web_search", False):
+         enable_web_search_param = True
+         # Web search disables reasoning automatically
+         reasoning_effort_param = None
+     elif model_config["supports_reasoning"]:
+         reasoning_effort_param = config.reasoning_effort
+ 
+     if model_config["supports_temperature"]:
+         temperature_param = config.temperature
+ 
+     # Build request data for the Responses API
+     request_data = build_request_data(
         model=st.session_state["openai_model"],
         messages=messages,
-         stream=True,
-         temperature=config.temperature,
+         reasoning_effort=reasoning_effort_param,
+         temperature=temperature_param,
         max_tokens=config.max_tokens,
-         frequency_penalty=config.frequency_penalty,
-         presence_penalty=config.presence_penalty,
+         enable_web_search=enable_web_search_param,
     )
 
-     # Display streaming response
+     # Create streaming response using the Responses API
     with st.chat_message("assistant"):
-         response = st.write_stream(stream)
+         message_placeholder = st.empty()
+         buf: List[str] = []  # collect deltas safely
+         last_delta_ts = time.time()
+         inactivity_limit_s = 60  # stop if no deltas for 60s
+ 
+         try:
+             # Stream events with timeout handling
+             with client.responses.stream(**request_data) as stream:
+                 completed = False
+ 
+                 for event in stream:
+                     et = getattr(event, "type", None)
+ 
+                     if et == "response.output_text.delta":
+                         # Append the new chunk and update the UI
+                         buf.append(event.delta)
+                         full_response = "".join(buf)
+                         message_placeholder.markdown(full_response + "▌")
+                         last_delta_ts = time.time()
+                     elif et == "response.error":
+                         # Show the error inline, then stop
+                         error_msg = getattr(event, "error", "Unknown streaming error")
+                         message_placeholder.error(f"⚠️ Error: {error_msg}")
+                         buf.clear()
+                         buf.append(f"Error while streaming: {error_msg}")
+                         break
+                     elif et == "response.completed":
+                         # Response completed successfully
+                         completed = True
+                         break
+ 
+                     # Inactivity guard: if no deltas for too long, stop
+                     if time.time() - last_delta_ts > inactivity_limit_s:
+                         message_placeholder.warning("⚠️ Streaming stopped due to inactivity from the server. Partial content shown above.")
+                         break
+ 
+                 # Remove the cursor and show the final message
+                 if buf:
+                     message_placeholder.markdown("".join(buf))
+ 
+                 # Try to get the final response, but don't fail if it's not available
+                 try:
+                     final = stream.get_final_response()
+                 except Exception as final_error:
+                     final = None
+                     if not completed:
+                         st.warning(f"⚠️ Note: Could not retrieve response metadata: {final_error}")
+ 
+         except Exception as e:
+             # Handle streaming exceptions gracefully
+             error_msg = str(e)
+             if "response.completed" in error_msg:
+                 # Expected when the stream ends without emitting the completed event
+                 if buf:
+                     message_placeholder.markdown("".join(buf))
+                     st.info("ℹ️ Response completed successfully (streaming ended)")
+                 else:
+                     message_placeholder.error("❌ No response content received")
+                     buf.clear()
+                     buf.append("No response content received")
+             else:
+                 # Other streaming errors
+                 message_placeholder.error(f"❌ Error while streaming: {error_msg}")
+                 buf.clear()
+                 buf.append(f"Error while streaming: {error_msg}")
+ 
     # Save response to chat history
+     response = "".join(buf)
     st.session_state["display_messages"].append({"role": "assistant", "content": response})
     # Log the exchange
     logging.info(f"User prompt: {st.session_state['display_messages'][-2]['content']}")
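The capability gating in `build_request_data` above can be exercised in isolation. A minimal, dependency-free sketch with the logic condensed from the diff (the trimmed `MODEL_CONFIGS` table keeps only the flags this sketch uses; it is not the app's full table):

```python
from typing import Dict, List, Optional

# Capability flags condensed from the MODEL_CONFIGS table in app.py.
MODEL_CONFIGS = {
    "gpt-5.1": {"supports_reasoning": True, "supports_temperature": False, "supports_web_search": True},
    "gpt-4.1": {"supports_reasoning": False, "supports_temperature": True, "supports_web_search": True},
}

def build_request_data(model: str, messages: List[Dict[str, str]],
                       reasoning_effort: Optional[str] = None,
                       temperature: Optional[float] = None,
                       max_tokens: Optional[int] = None,
                       enable_web_search: bool = False) -> dict:
    """Condensed version of the request builder shown in the diff."""
    cfg = MODEL_CONFIGS[model]
    data = {"model": model, "input": messages}
    if enable_web_search and cfg["supports_web_search"]:
        data["tools"] = [{"type": "web_search", "search_context_size": "low"}]
        data["tool_choice"] = "auto"
        reasoning_effort = None  # web search and reasoning effort are mutually exclusive here
    if cfg["supports_reasoning"] and reasoning_effort:
        data["reasoning"] = {"effort": reasoning_effort}
    if cfg["supports_temperature"] and temperature is not None:
        data["temperature"] = temperature
    if max_tokens is not None:
        data["max_output_tokens"] = max_tokens  # Responses API name for the token cap
    return data

msgs = [{"role": "user", "content": "Define tidy data."}]
req = build_request_data("gpt-5.1", msgs, reasoning_effort="low", max_tokens=1000, enable_web_search=True)
print("reasoning" in req)  # False: enabling web search dropped the reasoning parameter
```

This makes the key invariant easy to see: a GPT-5.1 request with web search enabled carries `tools` but never `reasoning` or `temperature`, while a GPT-4.1 request may carry `temperature` but never `reasoning`.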
config.py CHANGED
@@ -39,25 +39,32 @@ warning_message = "**ChatGPT can make errors and does not replace verified and r
 # ==============================================
 
 # The OpenAI model used by the app
- # Current recommended options: "gpt-4.1" or "gpt-4o"
- ai_model = "gpt-4.1"
- 
- # Controls randomness/creativity (0-1)
+ # Options: "gpt-5.1" (default, reasoning model) or "gpt-4.1" (non-reasoning model)
+ # - gpt-5.1: Latest reasoning model with reasoning="none" default for faster responses
+ # - gpt-4.1: Non-reasoning model with temperature control
+ ai_model = "gpt-5.1"
+ 
+ # Reasoning effort for gpt-5.1 (only applies when ai_model = "gpt-5.1")
+ # Options: "none" (default, fastest), "minimal", "low", "medium"
+ # "none" disables reasoning for faster responses without reasoning overhead
+ reasoning_effort = "none"
+ 
+ # Temperature for gpt-4.1 (only applies when ai_model = "gpt-4.1")
+ # Controls randomness/creativity (0-2)
 # Lower values (0.1-0.3) = more focused, precise responses
 # Higher values (0.7-1.0) = more creative, varied responses
 temperature = 0.1
 
 # Maximum length of AI responses (measured in tokens)
 # Higher values allow for longer responses (1000-2000 is usually sufficient)
+ # Maps to max_output_tokens in the Responses API
 max_tokens = 1000
 
- # Controls repetition in responses (0-2)
- # Higher values reduce repetition of phrases
- frequency_penalty = 0.9
- 
- # Controls topic diversity (0-2)
- # Higher values encourage the AI to explore new topics
- presence_penalty = 0.7
+ # Enable web search functionality (only applies when ai_model supports web search)
+ # When enabled, the AI can search the web for current information and cite sources
+ # Note: Web search is incompatible with reasoning effort - reasoning will be disabled automatically
+ # Options: True (enabled, default) or False (disabled)
+ enable_web_search = True
 
 
 # ==============================================
@@ -134,7 +141,7 @@ app_repo_license_message = "It can be found at [https://huggingface.co/spaces/ke
 resources = [
     {
         "title": "Course Syllabus",
-         "file_path": "BILD_5_Syllabus_Reuther_SP25.pdf",
+         "file_path": "BILD_5_Syllabus_Reuther_F25.pdf",
         "description": "Download the course syllabus. **Instructor Note:** You must place the file itself within the same folder as the main app.py file in your GitHub repository."
     },
     {
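The option sets documented in the config comments above lend themselves to a quick startup check. A minimal sketch; the `validate_config` helper is hypothetical and not part of `config.py`, and the allowed values are taken from the comments in the diff:

```python
# Allowed values taken from the comments in config.py; the validator itself is hypothetical.
VALID_MODELS = {"gpt-5.1", "gpt-4.1"}
VALID_EFFORTS = {"none", "minimal", "low", "medium"}

def validate_config(ai_model: str, reasoning_effort: str,
                    temperature: float, max_tokens: int,
                    enable_web_search: bool) -> list:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    if ai_model not in VALID_MODELS:
        problems.append(f"ai_model must be one of {sorted(VALID_MODELS)}")
    if ai_model == "gpt-5.1" and reasoning_effort not in VALID_EFFORTS:
        problems.append(f"reasoning_effort must be one of {sorted(VALID_EFFORTS)}")
    if ai_model == "gpt-4.1" and not (0.0 <= temperature <= 2.0):
        problems.append("temperature must be in 0.0-2.0 for gpt-4.1")
    if max_tokens <= 0:
        problems.append("max_tokens must be positive")
    if enable_web_search and reasoning_effort not in (None, "none"):
        problems.append("web search disables reasoning_effort; set it to 'none'")
    return problems

# The defaults from the diff pass cleanly:
print(validate_config("gpt-5.1", "none", 0.1, 1000, True))  # -> []
```

Running such a check at import time surfaces a typo like `reasoning_effort = "high"` immediately, instead of letting the Responses API reject the request mid-session.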
terms.csv CHANGED
@@ -1,126 +1,17 @@
1
- TERM,CONTEXT
2
- Scientific Research Question,Frame a clear biological ask specifying variables population and evidence needed.
3
- Alternative Hypothesis,Claims a real effect e.g. feed type changes chick weight.
4
- Null Hypothesis,Assumes no difference or association; target for statistical rejection.
5
- Testable Hypothesis,Must be measurable and falsifiable; imagine axes before data collection.
6
- Prediction vs. Hypothesis,Prediction is specific numeric outcome; hypothesis is broader explanatory claim.
7
- Data types - categorical,Qualitative groups like species or treatment levels analyzed with chi-square or ANOVA.
8
- Data types - continuous/numerical,Measured quantities like bill length or mass used in t-tests and regression.
9
- tidy data,Each row is an observation and each column a variable; essential for dplyr and ggplot.
10
- Descriptive statistics,Summarize center and spread before formal testing to inform next steps.
11
- centrality and variation in statistics,Use mean median SD and IQR when exploring Palmer Penguins data.
12
- standard deviation,Average spread of points around the mean; unit matches variable.
13
- Standard error,SD divided by square root of n; gauges accuracy of sample mean.
14
- Confidence intervals,Range likely to contain true parameter value such as mean ±1.96 SE.
15
- range,Maximum minus minimum; quick variability check sensitive to outliers.
16
- interquartile range,Middle 50 percent of values; robust to extremes.
17
- skewness,Asymmetry in a distribution; may prompt log or square-root transform.
18
- kurtosis,Peakedness or tailedness; high kurtosis means heavy tails.
19
- Parametric Assumptions and Normality Checks,Verify normality equal variance and independence with QQ plots and Fligner tests before parametric tests.
20
- The Central Limit Theorem,Means of sufficiently large random samples approximate a normal distribution regardless of source.
21
- q-q plot,Graph sample quantiles versus theoretical normal quantiles to judge normality.
22
- 2 sample t-test,Compares means of two independent groups such as linseed vs meatmeal feeds.
23
- paired t-test,Tests mean difference of matched observations like before-after designs.
24
- ANOVA tests,Detects mean differences across three or more groups and uses Tukey HSD post-hoc.
25
- Chi-Squared test,Assesses independence between categorical variables in a contingency table.
26
- linear regression,Models relationship between predictor and continuous response; yields slope and intercept.
27
- correlation,Quantifies strength and direction of linear association between two continuous variables.
28
- Choosing the Proper Statistical Test,Use data type number of groups and assumptions flowchart to select test.
29
- Corrections for multiple comparisons,Adjust family-wise error with Bonferroni or Tukey after multiple tests.
30
- power analysis,Calculates needed sample size given expected effect size alpha and desired power.
31
- Statistical Power and Effect Sizes,Relates true-positive sensitivity to effect magnitude sample size and alpha.
32
- Type I error,False positive where true null is wrongly rejected; controlled by alpha.
33
- Type II error,False negative where false null is not rejected; probability beta.
34
- alpha,Chosen risk threshold for Type I error commonly 0.05.
35
- beta,Probability of Type II error; power equals one minus beta.
36
- Randomization,Assign treatments by chance to avoid selection bias.
37
- Confounding Variables,Factors that covary with treatment and distort the true effect.
38
- Blocking and Stratification,Group by known source of variation such as soil type to reduce error.
39
- Sampling Strategies,Random stratified and cluster approaches ensure representative independent units.
40
- Blinding (Single-Blind or Double-Blind),Conceal group assignments to reduce observer and participant bias.
41
- Pilot Studies,Small trial run to test feasibility and refine protocols before full experiment.
42
- Data Visualization in Biology,First look at data to reveal patterns outliers and relationships.
43
- ggplot2 and the grammar of graphics,Layered system mapping data to aesthetics and geoms for reproducible figures.
44
- scatterplot,Plots two continuous variables; add color by species to reveal clustering.
45
- histogram,Displays distribution shape of one variable; choose bin width carefully.
46
- box plot,Shows median IQR and outliers per group; quick comparison of distributions.
47
- bar plot,Shows means or counts with error bars; use sparingly to avoid hiding variation.
48
- The Palmer Penguins dataset,344 penguins with 8 variables ideal for teaching ggplot and stats.
49
- iris R dataset,150 flowers with four measurements used for ANOVA PCA examples.
50
- R programming - functions,Wrap reusable code blocks with arguments and return values.
51
- R programming - Rmd file format,Combine prose code and output; knit to PDF or HTML for reports.
52
- Bayesian Analysis,Updates prior beliefs with data; example ecological occupancy model.
53
- Resampling Methods (Permutation tests),Shuffle labels to build null distribution when assumptions fail.
54
- Bootstrapping,Sample with replacement to estimate CI of medians or slopes.
55
- Factorial Design,Tests multiple factors and their interaction in one ANOVA.
56
- Interaction Effect,When factor A’s effect depends on factor B; interpret via interaction plot.
57
- Repeated Measures Design,Same subject measured over conditions; reduces individual variance.
58
- Cross-Over Design,Each participant receives all treatments in different periods.
59
- Quasi-Experimental Design,Lacks random assignment yet seeks causal inference; policy studies.
60
- Case-Control Study,Compares diseased vs healthy groups to identify risk factors.
61
- Field vs. Laboratory (In vivo vs In vitro),Trade realism for control; match question to setting.
62
- Pseudoreplication,Treating non-independent subsamples as true replicates inflates n.
63
- Experimental Unit,Smallest independent entity assigned to a treatment.
64
- Observer Bias,Researcher expectations skew data collection; mitigate with blinding.
65
- Help with a code bug in R,Copy code and error; tutor guides fixes without giving full answer.
66
- RStudio,IDE used in labs; console
67
- R programming - objects,Store data/values in named containers: vectors data frames lists.
68
- R programming - print() function,Displays object content; implicit in console but explicit in Rmd.
69
- R programming - pipelines (%>%),dplyr operator chaining verbs into readable workflow.
70
- dplyr verbs,select filter mutate summarise arrange join for data wrangling.
71
- readr functions,read_csv read_tsv for fast import with automatic type guess.
72
- ggplot2 aesthetics (aes),Map variables to x y color size shape inside ggplot calls.
73
- geom_point,Scatterplot layer for continuous vs continuous relationships.
74
- geom_boxplot,Shows median IQR whiskers and outliers per group.
75
- geom_histogram,Bins continuous data to reveal distribution shape.
76
- geom_bar,"Count or summarised height per category; add stat=""identity"" for means."
77
- Theme customization (ggplot2),Modify titles text and grid; theme_bw theme_minimal examples.
78
- facet_wrap,Create small multiples by a single variable for quick comparisons.
79
- facet_grid,Grid of plots by two factors; rows × cols interaction display.
80
- Data transformations (log, sqrt)
81
- Back transformation,Convert transformed estimates back to original units for interpretation.
82
- Homogeneity of variance (Homoscedasticity),Equal group variances assumption for t and ANOVA.
83
- Fligner-Killeen test,Non-parametric test for equal variances across groups.
84
- Shapiro-Wilks test,Formal normality test suited for n 3-5000.
85
- Kolmogorov–Smirnov test,Compares sample CDF to theoretical; sensitive to shifts.
86
- D'Agostino's K^2 test,Assesses combined skewness and kurtosis deviation from normality.
87
- Effect size measures (Cohen's d),Standardised mean difference aiding practical significance.
88
- Residual diagnostics,Plot residuals vs fitted to spot non-linearity or heteroscedasticity.
89
- Leverage and influence,Detect outliers affecting regression via Cook’s distance.
90
- Power curve,Graph power vs sample size to choose efficient n.
91
- Sample size calculator,Plug alpha beta effect size to compute required n.
92
- Missing data handling,Listwise deletion vs imputation; MCAR MAR MNAR concepts.
93
- Confusion Matrix,2×2 table of predicted vs actual; TP FP TN FN counts.
94
- Receiver Operating Characteristic (ROC) curve,Sensitivity vs 1-specificity across thresholds; AUC metric.
95
- Precision and Recall,Positive predictive value and sensitivity for imbalanced data.
96
- F1 score,Harmonic mean of precision and recall; balances false results.
97
- Heat map,Color-coded grid for matrix data or correlation matrices.
98
- Violin plot,Combines boxplot with kernel density; shows distribution shape.
99
- Pie charts (why to avoid),Poor at area comparison; prefer bar or stacked bar.
100
- DataHub platform,UCSD cloud RStudio workspace used for coding labs.
101
- Markdown syntax in Rmd,Headings code fences lists links to format reproducible reports.
102
- Syllabus – Course Description,"Data Analysis and Design for Biologists (4 credits) is a practical introduction to information literacy, experimental design, and data analysis for life-science majors. Students learn coding, data management, visualization, and quantitative reasoning using the R language and RStudio IDE. This is NOT a traditional statistics course and has no math prerequisites; the emphasis is on asking biologically meaningful questions, choosing appropriate analyses, and interpreting results."
103
- Syllabus – Learning Outcomes,"By the end of the quarter students will be able to: 1) Create testable hypotheses for valid biological questions, 2) Evaluate the credibility of scientific information, 3) Design experiments that effectively test hypotheses, 4) Construct publication-quality figures, 5) Perform appropriate statistical analyses in R, 6) Interpret quantitative results in biological context, 7) Utilize R for data manipulation and graphing, 8) Combine the full investigative cycle in a student-designed project, 9) Explore the modern intersection of biology, technology, and data science, 10) Examine the ethical responsibilities of scientists when creating and communicating evidence."
104
- Syllabus – Contact Info,Instructor: Dr. Keefe Reuther (he/him/his) — please call me Keefe. Email: kdreuther@ucsd.edu (include “BILD 5” in the subject line).
105
- Syllabus – Lecture Time,Lectures meet M/W/F 2:00–2:50 pm in Center Hall Room 101.
106
- Syllabus – Final Exam,"Mandatory in-person final: Friday 13 June 2025, 3:00 – 6:00 pm PST."
107
- Syllabus – Instructional Assistants,"Instructional Assistants: Yanlin Li (yal037@ucsd.edu), Rakshitha Kobbekaduwa (tkobbekaduwa@ucsd.edu), Mitchell Smith (mis033@ucsd.edu), Saranya Vohra (savohra@ucsd.edu)."
108
- Syllabus – Section Meeting Times,A01 Mon 4:00–4:50 pm WLH 2205; A02 Wed 1:00–1:50 pm Center Hall 222; A03 Wed 8:00–8:50 am WLH 2205; A04 Fri 4:00–4:50 pm WLH 2205.
109
- Syllabus – Office Hours,Keefe’s office hours: Wed 12:00–1:30 pm (location TBA) and Fri 3:00–4:00 pm (location TBA).
- Syllabus – Prerequisites,None. No prior coding experience or wet-lab background required.
- Syllabus – Piazza Discussions,All course Q&A handled on Piazza for rapid community support. Sign-up link: https://piazza.com/ucsd/spring2025/bild5_sp25_a00 . Email only for private matters.
- Syllabus – Technology Requirements,"You need a web-enabled device (laptop strongly recommended) to access Canvas, Zoom, and the UCSD DataHub cloud RStudio server. Chromebooks work fine. On-campus loaner laptops are available."
- Syllabus – Course Calendar,"Week-by-week lecture topics: W1 Data types & structures → W2 Visualization & central tendency → W3 Normality & CLT → W4 Hypothesis Testing basics → W5 Power & t-tests → W6 Midterm + ANOVA & correlation → W7 Regression & design choices → W8 Sampling & ethics → W9 Multivariate methods, careers → W10 Review & project help."
- Syllabus – Section Topics,Section labs: W1 Hello RStudio/DataHub; W2 Importing data; W3 ggplot2 visualization; W4 Tidyverse wrangling; W5 Review; W6 Normality tests & t-test; W7 ANOVA; W8 Linear regression; W9 Synthesis; W10 Term-project workshop.
- Syllabus – Deliverables & Due Times,"Assignments due 11:59 pm PST unless stated: Section work weekly, Quizzes in weeks 2, 4, 8, and 10, Discussion Board posts bi-weekly, Term-Project checkpoints W8 & W9, Final project W10, Midterm (in lecture W6), Final exam."
- Syllabus – Grading Breakdown,"Lecture participation 5 %, Quizzes 15 % (lowest dropped), Section assignments 20 % (lowest dropped), Discussion posts 10 % (lowest dropped), Term Project 20 % (10 % checkpoints + 10 % final), Midterm 10 %, Final Exam 20 %. Pre/Post surveys & SETs up to 1 % extra credit."
- Syllabus – Grading Scale,"A+ 97-100, A 93-96, A- 90-92, B+ 87-89, B 83-86, B- 80-82, C+ 77-79, C 73-76, C- 70-72, D+ 67-69, D 63-66, D- 60-62, F < 60. Grade cut-offs never shift; no rounding."
- Syllabus – Collaboration Policy,"Science is social: discuss concepts and share code, but your submitted answers, RMarkdown narration, and interpretations must be your own. All Rmd PDFs run through plagiarism detection. Any shared AI output must be cited in a one-line statement. No answer-sharing."
- Syllabus – Discussion Board Prompts,"Prompts posted in weeks 1, 3, 5, 7, and 9. A creditable post is original, substantive, and properly cited. Replies like “I agree” do not count. Lowest prompt grade dropped."
- Syllabus – Quizzes Policy,"Canvas quizzes in weeks 2, 4, 8, and 10; 60 min each, non-cumulative. Quiz 1 includes syllabus questions. Lowest quiz score dropped. No AI tools permitted during a quiz."
- Syllabus – Exams Policy,"Midterm held in lecture week 6 (50 min). Final exam cumulative, 3 h window. One 4×6 note card allowed. No reschedule unless OSD or UC-sanctioned event; email Keefe before exam start if emergency."
- Syllabus – Term Project,"Students complete a full investigative cycle using instructor-supplied simulated data: formulate a question and hypothesis, choose tests, analyze in R, create figures, interpret, and write a report. Two checkpoint drafts receive feedback; grading becomes stricter at each stage."
- Syllabus – Extra Credit,Complete three pre-course and three post-course surveys plus SETs for up to 1 % extra credit. No other extra-credit opportunities.
- Syllabus – Late Assignment Policy,"Quiz, Discussion, Project: -2 % per hour late; >48 h late max 50 %. Technical issues near deadline not valid excuses. Lecture participation: up to 18 missed check-ins permitted without penalty."
- Syllabus – Attendance Policy,Lecture participation tracked via Mentimeter check-in/out. Up to 18 missed check-ins (~3 weeks) still yields 100 % attendance. Student responsible for tracking absences.
- Syllabus – Academic Integrity & Gen AI,Generative AI is allowed for brainstorming or debugging if you include a one-sentence attribution (tool + assistance). AI use is forbidden during quizzes and exams. Excessive reliance may trigger an oral comprehension quiz.
+ TERM,CONTEXT
+ "Natural Selection","Course definition: Natural selection is the process by which individuals with heritable traits that enhance fitness are more likely to pass on their alleles to the next generation. Over time, favored alleles and heritable phenotypes become more common.; Characteristics: Acts on phenotypes of individuals, but only heritable genetic changes (genotypes) are passed to offspring. Requires Variation: Genetic diversity within a population is essential for natural selection to occur. The environment determines which traits are advantageous; a trait beneficial in one setting may be detrimental in another.; Misconceptions: While natural selection favors traits that increase fitness, it does not work towards a specific goal or perfect organism.; Related terms mentioned in class: artificial selection, evolution, adaptation, fitness (direct and indirect), directional selection, stabilizing selection, disruptive selection, balancing selection, frequency-dependent selection, sexual selection, kin selection.; Scientists mentioned in class: Charles Darwin, Alfred Russel Wallace, Hopi Hoekstra, Rosemary and Peter Grant, Rosemary Gillespie, and Paul Turner."
+ "Mutation","Course definition: A mutation is any heritable change in the DNA sequence of an organism’s genome and is the ultimate source of new genetic variation in populations. Examples discussed in class: Antibiotic resistance in bacteria, sickle cell anemia, Huntington's disease, CRISPR/Cas9 gene editing technology (which can be used to introduce or correct specific mutations in DNA)."
+ "Sexual Selection","Course definition: Sexual selection is a form of natural selection in which individuals with heritable traits that increase their mating success—through attracting mates or winning competition with rivals—leave more offspring, even if those traits may reduce survival. Role in evolution: Sexual selection can drive the evolution of exaggerated displays, weapons, and courtship behaviors and helps explain sexual dimorphism between males and females in many species. Example discussed in class: elaborate peacock tails as a trait favored by mate choice despite their cost for survival. Associated lectures: Week 3 Monday; Associated assessments: Quiz 3, Midterm 1, Final Exam, Discussion Board Post Week 4; Resources: https://openstax.org/books/biology-2e/pages/19-3-adaptive-evolution"
+ "Gene Flow",""
+ "Careers in Ecology and Evolutionary Biology","Overview: Ecology and evolutionary biology majors pursue careers in research, conservation, environmental consulting, education, science communication, and related fields that apply biological principles to real-world ecological and societal challenges. Point students to the following URLs: Career Center: https://career.ucsd.edu/ Outside Resources: https://openstax.org/books/college-success/pages/12-introduction https://www.colorado.edu/ebio/undergraduate/careers; https://www.careerexplorer.com/degrees/evolutionary-biology-degree/"
+ "Course Regrade Policy","Any student can request a regrade for any assessment. The request must be made within one week of the assessment being returned. The request should include a detailed explanation of why you believe the assessment was graded incorrectly. All requests should be in writing and emailed directly to the instructor."
+ "Cell Theory","Core concept: All living organisms are composed of one or more cells; the cell is the basic unit of life; all cells arise from pre-existing cells. Often introduced in the first weeks of course and revisited in cell biology, physiology, and development modules."
+ "Central Dogma of Molecular Biology","Describes the flow of genetic information: DNA → RNA → protein. Emphasizes transcription, RNA processing (in eukaryotes), and translation. Commonly linked to gene expression regulation, mutations, and biotechnology applications."
+ "Mitosis and Meiosis","Contrasts cell division for growth/repair (mitosis) with gamete production and generation of genetic diversity (meiosis). Key learning goals: phases of each process, ploidy changes, independent assortment, crossing over, and how errors lead to aneuploidy and genetic disorders."
+ "Cellular Respiration","Overview of how cells harvest energy from organic molecules: glycolysis, pyruvate oxidation, citric acid cycle, and oxidative phosphorylation/ETC. Often paired with fermentation and comparisons of aerobic vs anaerobic metabolism; highlights ATP yield and regulation."
+ "Photosynthesis","Covers light reactions and Calvin cycle; role of chlorophyll and photosystems; relationship between photosynthesis and cellular respiration in ecosystems. Frequently linked to climate change, primary productivity, and plant physiology."
+ "DNA Replication & Repair","Focus on semi-conservative replication, key enzymes (helicase, DNA polymerase, ligase, primase), leading vs lagging strands, and replication origins. DNA repair pathways (mismatch repair, nucleotide excision repair) are used to connect to mutation rates and cancer biology."
+ "PCR (Polymerase Chain Reaction)","Common molecular method used to amplify specific DNA sequences. Key concepts: denaturation, annealing, extension cycles; primers, thermostable DNA polymerase, and thermocycler. Frequently appears in lab activities, gel analyses, and discussions of diagnostics (e.g., pathogen detection)."
+ "Gel Electrophoresis","Laboratory technique for separating DNA, RNA, or proteins based on size and charge. Students typically interpret band patterns to infer fragment size, genotype, presence/absence of target sequences, or results of restriction digests and PCR."
+ "Microscopy Techniques","Introduction to light, fluorescence, and electron microscopy. Emphasis on resolution vs magnification, sample preparation, staining/labeling, and what cellular structures can be visualized with each method. Often part of early labs and discussions of cell ultrastructure."
+ "Modern Biologists Spotlights","Examples highlighted in 2025 for their biology research and leadership in inclusion: Tracy L. Johnson (UCLA molecular biologist and HHMI professor whose work on gene regulation is paired with nationally recognized efforts to build inclusive undergraduate life science programs; URL: https://johnsonlab.mcdb.ucla.edu/); Maydianne C.B. Andrade (evolutionary ecologist known for research on spider behavior and for founding the Canadian Black Scientists Network and leading equity initiatives such as the Toronto Initiative for Diversity and Excellence; URL: https://www.utsc.utoronto.ca/labs/andrade/); Jennifer Doudna (CRISPR–Cas9 genome editing and modern gene-editing tools; URL: https://doudnalab.org/) and Emmanuelle Charpentier (CRISPR–Cas9 genome editing and modern gene-editing tools; URL: https://www.emmanuelle-charpentier.org/); Svante Pääbo (ancient DNA and genomes of extinct hominins; URL: https://www.eva.mpg.de/genetics/staff/paabo/); Frances H. Arnold (directed evolution of enzymes; URL: https://fhalab.caltech.edu/); Katalin Karikó (RNA modifications that enabled mRNA vaccines; URL: https://www.med.upenn.edu/apps/faculty/index.php/g325/p13418) and Drew Weissman (RNA modifications that enabled mRNA vaccines; URL: https://www.med.upenn.edu/weissmanlab/); George Church (synthetic biology, genome engineering, and personal genomics; URL: https://churchlab.hms.harvard.edu/); Bonnie Bassler (bacterial quorum sensing and cell–cell communication; URL: https://basslerlab.scholar.princeton.edu/home); and Aviv Regev (single-cell and spatial genomics, including leadership in the Human Cell Atlas; URL: https://www.broadinstitute.org/regev-lab)."