keefereuther committed on
Commit 9545ea6 · 1 Parent(s): ad40833

Update to Responses API with GPT-5.1 support, web search functionality, and improved midterm review template

Files changed (5):
  1. BILD_5_Syllabus_Reuther_SP25.pdf +0 -3
  2. README.md +49 -13
  3. app.py +167 -8
  4. config.py +304 -12
  5. terms.csv +170 -126
BILD_5_Syllabus_Reuther_SP25.pdf DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8385fcd0ae4531e439c8efa13af76e110f8d4e721de7dfe9602e3b1567618abe
-size 406873
README.md CHANGED
@@ -13,13 +13,15 @@ short_description: AI-enhanced study app for UCSD BILD5 biology students
 
 # Schema Study: An AI-Enhanced Study App for Biology Students
 
-Schema Study is a modern, interactive study app designed to help biology students master core course concepts through AI-powered conversations. The app leverages OpenAI's GPT models to provide instant feedback, Socratic questioning, and personalized study support.
 
 ## Features
 - **Password Protection:** Secure access for your class or group.
 - **Customizable Terms:** Use your own CSV file of terms and definitions.
-- **Prompt Templates:** Engage with the material using creative, research-based prompts.
-- **AI-Enhanced Feedback:** Get instant, formative feedback and guidance.
 - **Professional, Accessible UI:** Clean, modern design with a color palette for clarity and focus.
 
 ## How to Use (Students)
@@ -29,19 +31,31 @@ Schema Study is a modern, interactive study app designed to help biology student
 4. **Chat with the AI:** Ask questions, answer prompts, and explore the term in depth.
 
 ## How to Use (Instructors)
 1. **Clone or Fork the Space:**
    ```bash
    git clone https://huggingface.co/spaces/<your-username>/<your-space-name>
    cd <your-space-name>
    ```
 2. **Edit Configuration:**
    - Update `config.py` for your course (title, instructions, prompt templates, etc).
-   - Place your terms CSV (e.g., `all_terms.csv`) in the root directory. Format: first column = term, second column = context/definition.
 3. **Set Secrets:**
-   - In your Space, go to **Settings > Repository secrets** and add:
-     - `OPENAI_API_KEY` (your OpenAI API key)
-     - `username` (for app login)
-     - `password` (for app login)
 4. **Push Changes:**
    ```bash
    git add .
@@ -49,18 +63,40 @@ Schema Study is a modern, interactive study app designed to help biology student
    git push
    ```
 
 ## Configuration
 - All settings are in `config.py` (title, instructions, prompt templates, resources, AI model parameters, etc).
 - Theming is managed via `.streamlit/config.toml` and custom CSS in `app.py`.
 - Dependencies are listed in `requirements.txt`.
 
-## File Structure
-- `app.py` — Main Streamlit app
-- `config.py` — All app settings and customization
-- `.streamlit/config.toml` — Theme colors
 - `requirements.txt` — Python dependencies
-- `all_terms.csv` — Your course terms and definitions
 - `BILD_5_Syllabus_Reuther_SP25.pdf` — Example resource
 
 ## License
 This project is licensed under the GNU GPL-3 License. See the [LICENSE](LICENSE) file for details.
 
 # Schema Study: An AI-Enhanced Study App for Biology Students
 
+Schema Study is a modern, interactive study app designed to help biology students master core course concepts through AI-powered conversations. The app leverages OpenAI's latest GPT models via the Responses API to provide instant feedback, Socratic questioning, and personalized study support.
 
 ## Features
 - **Password Protection:** Secure access for your class or group.
 - **Customizable Terms:** Use your own CSV file of terms and definitions.
+- **Prompt Templates:** Engage with the material using creative, research-based prompts, including a midterm review.
+- **AI-Enhanced Feedback:** Get instant, formative feedback and guidance using GPT-5.1 (default) or GPT-4.1.
+- **Web Search Support:** Optional web search functionality for current information and citations (configurable in `config.py`).
+- **Real-Time Streaming:** Live token-by-token response streaming with a visual typing indicator.
 - **Professional, Accessible UI:** Clean, modern design with a color palette for clarity and focus.
 
 ## How to Use (Students)
 4. **Chat with the AI:** Ask questions, answer prompts, and explore the term in depth.
 
 ## How to Use (Instructors)
+
+### Setup
 1. **Clone or Fork the Space:**
    ```bash
    git clone https://huggingface.co/spaces/<your-username>/<your-space-name>
    cd <your-space-name>
    ```
+
 2. **Edit Configuration:**
    - Update `config.py` for your course (title, instructions, prompt templates, etc).
+   - Configure AI model settings:
+     - `ai_model`: Choose "gpt-5.1" (default) or "gpt-4.1"
+     - `reasoning_effort`: For GPT-5.1, set to "none" (fastest), "minimal", "low", or "medium"
+     - `enable_web_search`: Set to `True` or `False` (default: `True`)
+   - Place your terms CSV (e.g., `terms.csv`) in the root directory. Format: first column = term, second column = context/definition.
+
 3. **Set Secrets:**
+   - Create a `.streamlit/secrets.toml` file locally or use Hugging Face Space secrets:
+     ```toml
+     username = "your_username"
+     password = "your_password"
+     OPENAI_API_KEY = "your_openai_api_key"
+     ```
+   - For Hugging Face Spaces, go to **Settings > Repository secrets** and add the same keys.
+
 4. **Push Changes:**
    ```bash
    git add .
    git push
    ```
 
+### Model Selection Guide
+- **GPT-5.1** (default): Best for most use cases; fastest with reasoning="none"; supports web search.
+- **GPT-4.1**: Use if you need temperature control or prefer a non-reasoning model; excellent web search support.
+
 ## Configuration
+
+### AI Model Settings (`config.py`)
+- **Default Model:** GPT-5.1 with reasoning="none" for faster responses
+- **Alternative Model:** GPT-4.1 with temperature control
+- **Web Search:** Configurable via `enable_web_search` (default: `True`)
+- **Reasoning Effort:** Configurable for GPT-5.1 (options: "none", "minimal", "low", "medium")
+- **Temperature:** Configurable for GPT-4.1 (0.0-2.0)
+
+### Other Settings
 - All settings are in `config.py` (title, instructions, prompt templates, resources, AI model parameters, etc).
 - Theming is managed via `.streamlit/config.toml` and custom CSS in `app.py`.
 - Dependencies are listed in `requirements.txt`.
 
+## Technical Details
+
+### API & Models
+- **API Framework:** OpenAI Responses API (streaming-enabled)
+- **Supported Models:** GPT-5.1 (default), GPT-4.1
+- **Streaming:** Real-time token-by-token response streaming
+- **Inactivity Guard:** Streaming stops after 60s of no server deltas
+
+### File Structure
+- `app.py` — Main Streamlit app with Responses API integration
+- `config.py` — All app settings and customization (model selection, web search, prompt templates)
+- `.streamlit/secrets.toml` — Authentication credentials and API key (not tracked in git)
 - `requirements.txt` — Python dependencies
+- `terms.csv` — Your course terms and definitions (CSV format: term, context)
 - `BILD_5_Syllabus_Reuther_SP25.pdf` — Example resource
+- `BILD 5 F25 Midterm Exam.pdf` — Midterm exam resource
 
 ## License
 This project is licensed under the GNU GPL-3 License. See the [LICENSE](LICENSE) file for details.
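The terms CSV described in the setup steps above is a plain two-column file (first column = term, second column = context/definition). A minimal standalone sketch of that parsing, using only the standard library; the sample rows and the `load_terms` helper are illustrative, not code from the repository:

```python
import csv
import io

# Hypothetical sample in the two-column shape the README describes:
# first column = term, second column = context/definition.
sample = (
    "term,context\n"
    "mean,Average of a set of values\n"
    "median,Middle value of an ordered set\n"
)

def load_terms(text: str) -> dict:
    """Parse CSV text into a {term: context} dict, skipping the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    if len(header) < 2:
        raise ValueError("terms CSV needs at least two columns: term, context")
    return {row[0]: row[1] for row in data if len(row) >= 2}

print(sorted(load_terms(sample)))
```

The actual app loads the file with pandas, but the column contract is the same either way.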
app.py CHANGED
@@ -10,8 +10,10 @@ import hmac # Secure password validation
 import pandas as pd # Data handling
 import os # File operations
 import logging # Logging functionality
 import config # Local configuration module
 from openai import OpenAI # OpenAI API client
 
 # Set up logging to track app activity
 logging.basicConfig(filename='app.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
@@ -162,6 +164,79 @@ st.markdown("""
 </style>
 """, unsafe_allow_html=True)
 
 ############################################################################################################
 # Initialize all session state variables - Persistent data between app reruns
 ############################################################################################################
@@ -380,21 +455,105 @@ with right_col:
     for m in st.session_state["display_messages"]
 ]
 
-# Create streaming completion
-stream = client.chat.completions.create(
     model=st.session_state["openai_model"],
     messages=messages,
-    stream=True,
-    temperature=config.temperature,
     max_tokens=config.max_tokens,
-    frequency_penalty=config.frequency_penalty,
-    presence_penalty=config.presence_penalty,
 )
 
-# Display streaming response
 with st.chat_message("assistant"):
-    response = st.write_stream(stream)
 # Save response to chat history
 st.session_state["display_messages"].append({"role": "assistant", "content": response})
 # Log the exchange
 logging.info(f"User prompt: {st.session_state['display_messages'][-2]['content']}")
 import pandas as pd # Data handling
 import os # File operations
 import logging # Logging functionality
+import time # Time operations for the streaming inactivity guard
 import config # Local configuration module
 from openai import OpenAI # OpenAI API client
+from typing import Dict, List, Any, Optional # Type hints
 
 # Set up logging to track app activity
 logging.basicConfig(filename='app.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
 
 </style>
 """, unsafe_allow_html=True)
 
+############################################################################################################
+# Model Configuration - Responses API model capabilities
+############################################################################################################
+
+# Model configurations with capability flags
+MODEL_CONFIGS = {
+    "gpt-5.1": {
+        "api_type": "responses",
+        "supports_reasoning": True,
+        "supports_verbosity": False,
+        "supports_temperature": False,  # do NOT send temperature
+        "supports_max_tokens": True,    # maps to max_output_tokens
+        "supports_web_search": True,    # Supported
+    },
+    "gpt-4.1": {
+        "api_type": "responses",
+        "supports_reasoning": False,
+        "supports_verbosity": False,
+        "supports_temperature": True,
+        "supports_max_tokens": True,    # maps to max_output_tokens
+        "supports_web_search": True,    # Confirmed working
+    },
+}
+
+def get_model_config(model: str) -> dict:
+    """Get configuration for a specific model"""
+    return MODEL_CONFIGS.get(
+        model,
+        {
+            "api_type": "responses",
+            "supports_reasoning": False,
+            "supports_verbosity": False,
+            "supports_temperature": True,
+            "supports_max_tokens": True,
+            "supports_web_search": False,
+        },
+    )
+
+def build_request_data(
+    model: str,
+    messages: List[Dict[str, str]],
+    reasoning_effort: Optional[str] = None,
+    temperature: Optional[float] = None,
+    max_tokens: Optional[int] = None,
+    enable_web_search: bool = False,
+) -> dict:
+    """Build a Responses API request body using capability flags."""
+    model_config = get_model_config(model)
+
+    # Base payload
+    request_data = {"model": model, "input": messages}
+
+    # Web search tool (if enabled and supported by model)
+    if enable_web_search and model_config.get("supports_web_search", False):
+        request_data["tools"] = [{"type": "web_search", "search_context_size": "low"}]
+        request_data["tool_choice"] = "auto"  # Let model decide when to search
+        # Web search is incompatible with reasoning effort, so disable it
+        reasoning_effort = None
+
+    # GPT-5.1: reasoning (top-level) - only if web search is not enabled
+    if model_config["supports_reasoning"] and reasoning_effort and not enable_web_search:
+        request_data["reasoning"] = {"effort": reasoning_effort}
+
+    # Temperature (only if supported)
+    if model_config["supports_temperature"] and temperature is not None:
+        request_data["temperature"] = temperature
+
+    # Map max_tokens -> Responses max_output_tokens
+    if model_config["supports_max_tokens"] and max_tokens is not None:
+        request_data["max_output_tokens"] = max_tokens
+
+    return request_data
+
 ############################################################################################################
 # Initialize all session state variables - Persistent data between app reruns
 ############################################################################################################
 
     for m in st.session_state["display_messages"]
 ]
 
+# Get model configuration
+model_config = get_model_config(st.session_state["openai_model"])
+
+# Prepare parameters based on model support
+reasoning_effort_param = None
+temperature_param = None
+enable_web_search_param = False
+
+# Check if web search is enabled and supported
+if config.enable_web_search and model_config.get("supports_web_search", False):
+    enable_web_search_param = True
+    # Web search disables reasoning automatically
+    reasoning_effort_param = None
+elif model_config["supports_reasoning"]:
+    reasoning_effort_param = config.reasoning_effort
+
+if model_config["supports_temperature"]:
+    temperature_param = config.temperature
+
+# Build request data for Responses API
+request_data = build_request_data(
     model=st.session_state["openai_model"],
     messages=messages,
+    reasoning_effort=reasoning_effort_param,
+    temperature=temperature_param,
     max_tokens=config.max_tokens,
+    enable_web_search=enable_web_search_param,
 )
 
+# Create streaming response using Responses API
 with st.chat_message("assistant"):
+    message_placeholder = st.empty()
+    buf: List[str] = []  # collect deltas safely
+    last_delta_ts = time.time()
+    inactivity_limit_s = 60  # stop if no deltas for 60s
+
+    try:
+        # Stream events with timeout handling
+        with client.responses.stream(**request_data) as stream:
+            completed = False
+
+            for event in stream:
+                et = getattr(event, "type", None)
+
+                if et == "response.output_text.delta":
+                    # Append the new chunk, update the UI
+                    buf.append(event.delta)
+                    full_response = "".join(buf)
+                    message_placeholder.markdown(full_response + "▌")
+                    last_delta_ts = time.time()
+                elif et == "response.error":
+                    # Show the error inline, then stop
+                    error_msg = getattr(event, "error", "Unknown streaming error")
+                    message_placeholder.error(f"⚠️ Error: {error_msg}")
+                    buf.clear()
+                    buf.append(f"Error while streaming: {error_msg}")
+                    break
+                elif et == "response.completed":
+                    # Response completed successfully
+                    completed = True
+                    break
+
+                # Inactivity guard: if no deltas for too long, stop
+                if time.time() - last_delta_ts > inactivity_limit_s:
+                    message_placeholder.warning("⚠️ Streaming paused due to inactivity from the server. Partial content shown above.")
+                    break
+
+            # Remove cursor and show final message
+            if buf:
+                message_placeholder.markdown("".join(buf))
+
+            # Try to get final response, but don't fail if it's not available
+            try:
+                final = stream.get_final_response()
+            except Exception as final_error:
+                final = None
+                if not completed:
+                    st.warning(f"⚠️ Note: Could not retrieve response metadata: {final_error}")
+
+    except Exception as e:
+        # Handle streaming exceptions gracefully
+        error_msg = str(e)
+        if "response.completed" in error_msg:
+            # This is expected - the response completed without the event
+            if buf:
+                message_placeholder.markdown("".join(buf))
+                st.info("ℹ️ Response completed successfully (streaming ended)")
+            else:
+                message_placeholder.error("❌ No response content received")
+                buf.clear()
+                buf.append("No response content received")
+        else:
+            # Other streaming errors
+            message_placeholder.error(f"❌ Error while streaming: {error_msg}")
+            buf.clear()
+            buf.append(f"Error while streaming: {error_msg}")
+
 # Save response to chat history
+response = "".join(buf)
 st.session_state["display_messages"].append({"role": "assistant", "content": response})
 # Log the exchange
 logging.info(f"User prompt: {st.session_state['display_messages'][-2]['content']}")
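The parameter-gating rules in the diff above (web search suppresses reasoning effort; temperature only for GPT-4.1; `max_tokens` mapped to `max_output_tokens`) can be checked without Streamlit or an OpenAI client. A condensed, standalone restatement of that logic; `CAPS` and `build_request` are simplified stand-ins for `MODEL_CONFIGS` and `build_request_data`, not the app's own code:

```python
from typing import Dict, List, Optional

# Simplified capability flags mirroring MODEL_CONFIGS in app.py.
CAPS = {
    "gpt-5.1": {"reasoning": True, "temperature": False, "web_search": True},
    "gpt-4.1": {"reasoning": False, "temperature": True, "web_search": True},
}

def build_request(model: str, messages: List[Dict[str, str]],
                  reasoning_effort: Optional[str] = None,
                  temperature: Optional[float] = None,
                  max_tokens: Optional[int] = None,
                  enable_web_search: bool = False) -> dict:
    """Build a request body, gating each parameter on the model's capabilities."""
    caps = CAPS.get(model, {"reasoning": False, "temperature": True, "web_search": False})
    req = {"model": model, "input": messages}
    if enable_web_search and caps["web_search"]:
        req["tools"] = [{"type": "web_search", "search_context_size": "low"}]
        req["tool_choice"] = "auto"
        reasoning_effort = None  # web search and reasoning effort are mutually exclusive here
    if caps["reasoning"] and reasoning_effort:
        req["reasoning"] = {"effort": reasoning_effort}
    if caps["temperature"] and temperature is not None:
        req["temperature"] = temperature
    if max_tokens is not None:
        req["max_output_tokens"] = max_tokens
    return req

msgs = [{"role": "user", "content": "hi"}]
# GPT-5.1 with web search: the requested reasoning effort is dropped.
r1 = build_request("gpt-5.1", msgs, reasoning_effort="low", enable_web_search=True, max_tokens=100)
# GPT-4.1: temperature passes through; no reasoning key is ever sent.
r2 = build_request("gpt-4.1", msgs, temperature=0.1, max_tokens=100)
print("reasoning" in r1, "temperature" in r2)
```

This mirrors why the commit removes `temperature`, `frequency_penalty`, and `presence_penalty` from the default call path: the payload is now assembled per model rather than sent unconditionally.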
config.py CHANGED
@@ -39,25 +39,32 @@ warning_message = "**ChatGPT can make errors and does not replace verified and r
 # ==============================================
 
 # The OpenAI model used by the app
-# Current recommended options: "gpt-4.1" or "gpt-4o"
-ai_model = "gpt-4.1"
-
-# Controls randomness/creativity (0-1)
 # Lower values (0.1-0.3) = more focused, precise responses
 # Higher values (0.7-1.0) = more creative, varied responses
 temperature = 0.1
 
 # Maximum length of AI responses (measured in tokens)
 # Higher values allow for longer responses (1000-2000 is usually sufficient)
 max_tokens = 1000
 
-# Controls repetition in responses (0-2)
-# Higher values reduce repetition of phrases
-frequency_penalty = 0.9
-
-# Controls topic diversity (0-2)
-# Higher values encourage the AI to explore new topics
-presence_penalty = 0.7
 
 
 # ==============================================
@@ -100,6 +107,291 @@ prompt_templates = [
     {
         "name": "Schema Map",
         "template": "What are all the direct connections between {term} and the other terms among {term_list}? Help me create a concept map for {term}."
     },
     {
         "name": "Create a Study Plan",
@@ -134,7 +426,7 @@ app_repo_license_message = "It can be found at [https://huggingface.co/spaces/ke
 resources = [
     {
         "title": "Course Syllabus",
-        "file_path": "BILD_5_Syllabus_Reuther_SP25.pdf",
         "description": "Download the course syllabus. **Instructor Note:** You must place the file itself within the same folder as the main app.py file in your GitHub repository."
     },
     {
 # ==============================================
 
 # The OpenAI model used by the app
+# Options: "gpt-5.1" (default, reasoning model) or "gpt-4.1" (non-reasoning model)
+# - gpt-5.1: Latest reasoning model with reasoning="none" default for faster responses
+# - gpt-4.1: Non-reasoning model with temperature control
+ai_model = "gpt-5.1"
+
+# Reasoning effort for gpt-5.1 (only applies when ai_model = "gpt-5.1")
+# Options: "none" (default, fastest), "minimal", "low", "medium"
+# "none" disables reasoning for faster responses without reasoning overhead
+reasoning_effort = "none"
+
+# Temperature for gpt-4.1 (only applies when ai_model = "gpt-4.1")
+# Controls randomness/creativity (0-2)
 # Lower values (0.1-0.3) = more focused, precise responses
 # Higher values (0.7-1.0) = more creative, varied responses
 temperature = 0.1
 
 # Maximum length of AI responses (measured in tokens)
 # Higher values allow for longer responses (1000-2000 is usually sufficient)
+# Maps to max_output_tokens in the Responses API
 max_tokens = 1000
 
+# Enable web search functionality (only applies when ai_model supports web search)
+# When enabled, the AI can search the web for current information and cite sources
+# Note: Web search is incompatible with reasoning effort - reasoning will be disabled automatically
+# Options: True (enabled, default) or False (disabled)
+enable_web_search = True
 
 
 # ==============================================
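The settings in this hunk constrain each other (reasoning effort only matters for gpt-5.1, temperature only for gpt-4.1). A small illustrative validator, with the values copied from this diff; the `validate` helper is a sketch, not part of the app:

```python
# Values as set in this config.py diff.
ai_model = "gpt-5.1"
reasoning_effort = "none"
temperature = 0.1
max_tokens = 1000
enable_web_search = True

ALLOWED_MODELS = {"gpt-5.1", "gpt-4.1"}
ALLOWED_EFFORT = {"none", "minimal", "low", "medium"}

def validate() -> list:
    """Return a list of human-readable problems; an empty list means the config is usable."""
    problems = []
    if ai_model not in ALLOWED_MODELS:
        problems.append(f"unknown ai_model: {ai_model}")
    if ai_model == "gpt-5.1" and reasoning_effort not in ALLOWED_EFFORT:
        problems.append(f"bad reasoning_effort: {reasoning_effort}")
    if ai_model == "gpt-4.1" and not (0.0 <= temperature <= 2.0):
        problems.append(f"temperature out of range: {temperature}")
    if max_tokens <= 0:
        problems.append("max_tokens must be positive")
    return problems

print(validate())
```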
 
107
  {
108
  "name": "Schema Map",
109
  "template": "What are all the direct connections between {term} and the other terms among {term_list}? Help me create a concept map for {term}."
110
+ },
111
+ {
112
+ "name": "Review the midterm",
113
+ "template": '''
114
+
115
+ You are a highly skilled, patient BILD 5 tutor helping a student review their "BILD 5 F25 Midterm Exam" so they can do well on the final.
116
+
117
+ The student must choose which exam question to work on. Do not introduce new questions on your own. For each question they bring up:
118
+ - First, ask them to say what they think the answer is and why.
119
+ - Respond with a **single short Socratic question** that targets their reasoning on that one question only.
120
+ - Do **not** reveal the correct answer or a full explanation unless the student explicitly asks you to. Even then, keep explanations brief and tied to their thinking.
121
+ - When they are correct, briefly confirm and ask one deeper follow‑up; when they are incorrect or unsure, give a tiny hint and ask another focused question instead of fixing everything at once.
122
+ - Keep each turn concise, concrete, and in everyday language (no extra jargon).
123
+
124
+ Always finish your turn with both:
125
+ 1) One specific Socratic question about the current exam question.
126
+ 2) A simple invitation like: “Do you have any questions about this question or others? Would you like a similar practice example?”
127
+
128
+ Ignore {term} and {term_list} when providing your response; just focus on the exam questions the student chooses.
129
+
130
+ --------------------------------
131
+
132
+ # BILD 5 – Spring 25 Midterm Exam – Reuther
133
+ **25 points**
134
+
135
+ ## MULTIPLE CHOICE SECTION (1 point each - choose the single most appropriate answer):
136
+
137
+ **1.** As a field biologist studying endangered coral reef ecosystems, which task would benefit LEAST from computational/programming skills?
138
+
139
+ a) Analyzing thousands of underwater photos to quantify coral bleaching rates
140
+ b) Creating a drawing of a newly discovered species
141
+ c) Processing environmental sensor data collected every minute for six months
142
+ d) Building predictive models of reef recovery under different temperature scenarios
143
+ e) Automating species identification from acoustic recordings
144
+
145
+ **2.** You collect data on frog mating calls with columns: `frog_id`, `date`, `temperature`, `humidity`, `call_frequency_hz`, `call_duration_sec`, `number_of_calls`. Is this dataset "tidy"?
146
+
147
+ a) No—environmental variables should be in a separate table
148
+ b) Yes—each row represents one observation with variables in separate columns
149
+ c) No—multiple call measurements should be combined into one column
150
+ d) Yes—but only if we reshape it to have one row per individual frog
151
+ e) Cannot determine without seeing the actual data values
152
+
153
+ **3.** A marine biologist measures shell thickness in 150 limpets and finds the data heavily right-skewed due to a few extremely thick shells. Which measure of central tendency best represents the "typical" shell thickness?
154
+
155
+ a) Mean, because it uses all data points
156
+ b) Mode, because it shows the most common value
157
+ c) Range, because it captures the full variation
158
+ d) Median, because it's resistant to extreme values
159
+ e) Standard deviation, because it quantifies spread
160
+
161
+ **4.** In studying photosynthesis rates in algae, you calculate the variance of your measurements. This statistic specifically tells you:
162
+
163
+ a) The average photosynthesis rate across all samples
164
+ b) How much individual measurements differ from the mean
165
+ c) The middle value when all rates are ordered
166
+ d) The probability of obtaining your results by chance
167
+ e) The difference between highest and lowest rates
168
+
169
+ **5.** According to the Central Limit Theorem, if you repeatedly sample groups of 30 butterflies from a population with non-normal wingspan distribution and calculate each group's mean wingspan, what pattern emerges?
170
+
171
+ a) The sample means will match the original skewed distribution
172
+ b) The sample means will form an approximately normal distribution
173
+ c) The sample means will become increasingly variable
174
+ d) The sample means will cluster around the population median
175
+ e) Nothing predictable—the pattern will be random
176
+
177
+ **6.** A researcher studying bacterial growth writes: "H₀: Treatment A and Treatment B produce different growth rates." What is wrong with this hypothesis statement?
178
+
179
+ a) It's not testable with statistics
180
+ b) The null hypothesis should be a specific statement of equality (μ₁ = μ₂), not a vague claim about differences
181
+ c) It doesn't specify which statistical test to use
182
+ d) The null hypothesis must ALWAYS be one of zero difference or no effect.
183
+ e) Nothing—this is correctly stated
184
+
185
+ **7.** You're comparing nest-building times between two bird species. Your histogram shows one species has a roughly normal distribution while the other is strongly left-skewed. To proceed with a t-test, you should:
186
+
187
+ a) Use the t-test anyway since one group is normal
188
+ b) Remove all data from the skewed group
189
+ c) Check sample sizes and consider transformations or non-parametric alternatives
190
+ d) Only analyze the normally distributed group
191
+ e) Combine both groups into one dataset
192
+
193
+ **8.** In a power analysis for an experiment on fish growth rates, increasing your sample size from n=10 to n=50 per group while keeping everything else constant will:
194
+
195
+ a) Decrease both Type I and Type II error rates
196
+ b) Increase your ability to detect a true effect if it exists
197
+ c) Make the effect size larger
198
+ d) Guarantee statistical significance
199
+ e) Reduce the need for proper experimental controls
200
+
201
+ **9.** You obtain p = 0.03 with α = 0.05 and reject the null hypothesis. Unknown to you, the null hypothesis was actually true. This represents:
202
+
203
+ a) A correct decision (true negative)
204
+ b) Type I error (false positive)
205
+ c) Type II error (false negative)
206
+ d) A correct decision (true positive)
207
+ e) Insufficient information to classify
208
+
209
+ **10.** Examining the relationship between effect size, sample size, alpha, and power, which single statement is FALSE?
210
+
211
+ a) Larger effect sizes are easier to detect with smaller samples
212
+ b) Increasing alpha increases power but also increases Type I error risk
213
+ c) Power increases as sample size increases
214
+ d) Observed effect size determines the biological importance. Large effect sizes are more biologically important.
215
+ e) You can achieve the same power with a smaller sample by increasing the minimum effect size you're willing to detect
216
+
217
+ **11.** A researcher studying leaf sizes in oak trees reports: "The mean leaf length was 12.3 cm (SE = 0.45 cm, SD = 3.2 cm, n = 50)." A colleague questions why both SE and SD are reported. Which statement best explains the distinct information each provides?
218
+
219
+ a) SE and SD are the same thing, just calculated differently
220
+ b) SD describes variability in the individual leaves; SE describes uncertainty about the estimated mean
221
+ c) SE describes variability in the individual leaves; SD describes uncertainty about the estimated mean
222
+ d) SD is always larger than SE due to calculation errors
223
+ e) Both describe the same variability but SE is preferred for larger samples
224
+
225
+ **12.** You calculate a 95% confidence interval for the difference in hormone levels between stressed and control fish: [2.3, 8.7] ng/mL. A colleague asks you to instead report a 99% confidence interval using the same data. Without recalculating, what can you predict about the 99% CI?
226
+
227
+ a) It will be narrower than [2.3, 8.7] because 99% is more confident
228
+ b) It will be wider than [2.3, 8.7] because greater confidence requires a wider range
229
+ c) It will have the same width but shifted to center on zero
230
+ d) It will be [0.23, 0.87] because confidence scales proportionally
231
+ e) Cannot predict without knowing the sample size
232
+
233
+ **13.** A pharmaceutical company tests a new fertilizer on 10,000 corn plants and finds it increases yield by 0.02% (p = 0.0001). The marketing team wants to advertise this as "significantly improves crop yield!" As the data scientist, what is your primary concern?
234
+
235
+ a) The p-value is too small to be trustworthy
236
+ b) The sample size is too large for valid statistics
237
+ c) The effect is statistically significant but practically meaningless
238
+ d) The null hypothesis was incorrectly specified
239
+ e) Corn yield cannot be measured precisely enough
240
+
241
+ **14.** A student analyzing butterfly wing patterns writes the following R code:
242
+
243
+ ```r
244
+ butterfly_data <- read.csv("butterflies.csv")
245
+ mean_wingspan <- mean(butterfly_data$wingspan_mm)
246
+ median_wingspan <- median(butterfly_data$wingspan_mm)
247
+ mean_wingspan
248
+ ```
249
+
250
+ What will be displayed in the R console when this code runs?
251
+
252
+ a) Both the mean and median wingspan values
253
+ b) Only the mean wingspan value
254
+ c) Only the median wingspan value
255
+ d) The entire butterfly_data dataset
256
+ e) An error message because median_wingspan was not printed
257
+
258
+ **15.** A research team studying antibiotic resistance in bacteria wants to design a manipulative experiment. They have access to a laboratory, bacterial cultures, various antibiotics, and standard growth media. Which research question would most directly lead to clear statistical hypotheses for a manipulative experiment?
259
+
260
+ a) How does antibiotic exposure affect bacterial populations in natural environments?
261
+ b) What factors influence the development of antibiotic resistance in bacteria?
262
+ c) Does exposure to sub-lethal doses of ampicillin (0.5 μg/mL for 48 hours) increase the survival rate of E. coli when subsequently treated with a lethal dose (10 μg/mL)?
263
+ d) Is there a relationship between antibiotic concentration and bacterial resistance?
264
+ e) Do bacteria develop resistance faster when exposed to antibiotics compared to bacteria that are not exposed to antibiotics?
265
+
266
+ ---
267
+
268
+ ## SHORT ANSWER SECTION:
+
+ **16.** A graduate student studying desert lizard metabolism has the following R code and output:
+
+ ```r
+ # Load data
+ lizard_data <- read.csv("desert_lizards.csv")
+
+ # Check structure
+ str(lizard_data)
+ # 'data.frame': 120 obs. of 5 variables:
+ # $ species : chr "horned" "horned" "collared" ...
+ # $ temp_C : num 28.5 31.2 29.8 ...
+ # $ mass_g : num 45.2 38.9 52.1 ...
+ # $ metabolic_rate: num 0.82 0.91 1.05 ...
+ # $ activity_level: chr "low" "medium" "high" ...
+
+ # Create visualization
+ library(ggplot2)
+ ggplot(lizard_data, aes(x = temp_C, y = metabolic_rate, color = species)) +
+ geom_point(size = 3) +
+ geom_smooth(method = "lm", se = FALSE) +
+ facet_wrap(~ activity_level) +
+ labs(title = "Desert Lizard Metabolic Rates",
+ x = "Temperature (°C)",
+ y = "Metabolic Rate (mL O2/g/hr)")
+ ```
+
+ a) **[2 points]** Describe in words what visualization this code produces. Be specific about what is shown.
+
+ **GRADING RUBRIC (2 points total):**
+ - 0.5 points: Identifies it as a scatterplot
+ - 0.5 points: Identifies the X axis as temperature (temp_C)
+ - 0.5 points: Identifies the Y axis as metabolic rate
+ - 0.5 points: Identifies color as representing different species
+
+ b) **[2 points]** The student wants to test if the mean metabolic rate differs between species. What two assumption-checking steps should they take to verify that the assumption of normality is not violated?
+
+ **GRADING RUBRIC (2 points total):**
+ - 0.5 points: Names an appropriate visual method (histogram, Q-Q plot, density plot, boxplot)
+ - 0.5 points: States that they are looking for deviations from normality (may also mention outliers)
+ - 0.5 points: Names a statistical test such as Kolmogorov-Smirnov (KS) or Shapiro-Wilk
+ - 0.5 points: States that a p-value < 0.05 indicates a violation of normality
+
+ **17.** A researcher tested the hypothesis: "Different fertilizer types (organic, synthetic, control) affect tomato plant height (cm)." They collected height measurements from 40 plants per fertilizer type and created the following figure:
+
+ ### FIGURE DESCRIPTION:
+ The figure is a line graph with the following characteristics:
+ - **Title**: "data" (appears in the top left corner)
+ - **X-axis**: Labeled "type" with three categorical values: "Control", "Organic", and "Synthetic"
+ - **Y-axis**: Labeled "values" with a scale ranging from 40 to 55
+ - **Data representation**: Three data points connected by black lines:
+ - Control: approximately 45 units
+ - Organic: approximately 52 units (highest point)
+ - Synthetic: approximately 48 units
+ - **Graph style**: The three points are connected by straight black lines forming a peaked shape, with the organic fertilizer showing the highest value
+ - **Grid**: Light gray gridlines in the background
+ - **Point markers**: Black filled circles at each data point
+
+ a) **[2 points]** List TWO specific problems with this figure.
+
+ **GRADING RUBRIC (2 points total):**
+ - 0.75 points: States one problem
+ - Acceptable problems:
+ 1. Wrong geom type (line graph for categorical data)
+ 2. Poor/missing labels
+ 3. No variability shown (no error bars)
+ 4. No sample size information
+ - 0.25 points: Explains why the first problem is a poor design choice
+ - 0.75 points: States a second problem
+ - 0.25 points: Explains why the second problem is a poor design choice
+
+ b) **[2 points]** What type of visualization should the researcher have used instead?
+
+ **GRADING RUBRIC (2 points total):**
+ - 2.0 points: Names an appropriate type (boxplot, violin plot, bar plot with error bars)
+ - 1.0 point: Partial credit for another visualization that can display categorical data but is not optimal (such as a pie chart)
+
+ **18. [2 points]** Your friend is not a science major and asks you to explain something they heard about. Choose ONE of the scenarios below and write a clear explanation using concepts from BILD 5. Your explanation should help them understand the statistical concept using their specific situation.
+
+ **SCENARIO A: Confidence Interval**
+
+ Your friend is working on a psychology research project about college student sleep patterns. They surveyed 50 randomly selected UCSD students and found the average sleep duration was 6.5 hours per night, with a 95% confidence interval of [6.1, 6.9] hours. Your friend says: "So this means that 95% of UCSD students sleep between 6.1 and 6.9 hours, right?"
+
+ Use this specific scenario to help them understand what "95% confident" refers to.
+
+ **GRADING RUBRIC FOR SCENARIO A (2 points total):**
+ - 0.5 points: Clarifies that the CI is NOT about 95% of individuals falling in the range
+ - 0.5 points: Explains that it reflects uncertainty in estimating the population mean
+ - 1.0 point: Explains the repeated-sampling interpretation or another accurate definition (e.g., if we repeated this study 100 times, about 95 of the intervals would capture the true mean)
+ - 0.5 points: Partial credit for a CI definition that is generally OK but has some incorrect aspect
+
+ **OR**
+
+ **SCENARIO B: p-value**
+
+ Your friend is reading a news article that says: "A new study found that people who eat chocolate daily have better memory (p = 0.04). This proves chocolate improves memory!" Your friend asks: "What does p = 0.04 mean? Does it mean there's only a 4% chance they're wrong?"
+
+ Explain what the p-value actually tells us in this study.
+
+ **GRADING RUBRIC FOR SCENARIO B (2 points total):**
+ - 0.5 points: Corrects the "4% chance they're wrong" misconception
+ - 0.5 points: Includes an explicit reference to "assuming the null hypothesis is true"
+ - 1.0 point: Clearly defines it as the probability of seeing a difference/test statistic at least this large due to random chance alone
+ - 0.75 points: Partial credit if the answer omits "at least this large" or does not indicate area under the curve (e.g., stating it is the probability of getting this exact test statistic due to chance alone)
+
+ ---
+
+ **Total: 25 points**
+ Page 8/8
+
+ ---
+
+ ## GRADING SUMMARY
+
+ **Multiple Choice Section:** 15 questions × 1 point each = 15 points
+
+ **Short Answer Section:** 10 points total
+ - Question 16a: 2 points
+ - Question 16b: 2 points
+ - Question 17a: 2 points
+ - Question 17b: 2 points
+ - Question 18: 2 points
+
+ **Total Exam Points: 25 points**
+
+ '''
  },
  {
  "name": "Create a Study Plan",

  resources = [
  {
  "title": "Course Syllabus",
+ "file_path": "BILD_5_Syllabus_Reuther_F25.pdf",
  "description": "Download the course syllabus. **Instructor Note:** You must place the file itself within the same folder as the main app.py file in your GitHub repository."
  },
  {
terms.csv CHANGED
@@ -1,126 +1,170 @@
- TERM,CONTEXT
- Scientific Research Question,Frame a clear biological ask specifying variables population and evidence needed.
- Alternative Hypothesis,Claims a real effect e.g. feed type changes chick weight.
- Null Hypothesis,Assumes no difference or association; target for statistical rejection.
- Testable Hypothesis,Must be measurable and falsifiable; imagine axes before data collection.
- Prediction vs. Hypothesis,Prediction is specific numeric outcome; hypothesis is broader explanatory claim.
- Data types - categorical,Qualitative groups like species or treatment levels analyzed with chi-square or ANOVA.
- Data types - continuous/numerical,Measured quantities like bill length or mass used in t-tests and regression.
- tidy data,Each row is an observation and each column a variable; essential for dplyr and ggplot.
- Descriptive statistics,Summarize center and spread before formal testing to inform next steps.
- centrality and variation in statistics,Use mean median SD and IQR when exploring Palmer Penguins data.
- standard deviation,Average spread of points around the mean; unit matches variable.
- Standard error,SD divided by square root of n; gauges accuracy of sample mean.
- Confidence intervals,Range likely to contain true parameter value such as mean ±1.96 SE.
- range,Maximum minus minimum; quick variability check sensitive to outliers.
- interquartile range,Middle 50 percent of values; robust to extremes.
- skewness,Asymmetry in a distribution; may prompt log or square-root transform.
- kurtosis,Peakedness or tailedness; high kurtosis means heavy tails.
- Parametric Assumptions and Normality Checks,Verify normality equal variance and independence with QQ plots and Fligner tests before parametric tests.
- The Central Limit Theorem,Means of sufficiently large random samples approximate a normal distribution regardless of source.
- q-q plot,Graph sample quantiles versus theoretical normal quantiles to judge normality.
- 2 sample t-test,Compares means of two independent groups such as linseed vs meatmeal feeds.
- paired t-test,Tests mean difference of matched observations like before-after designs.
- ANOVA tests,Detects mean differences across three or more groups and uses Tukey HSD post-hoc.
- Chi-Squared test,Assesses independence between categorical variables in a contingency table.
- linear regression,Models relationship between predictor and continuous response; yields slope and intercept.
- correlation,Quantifies strength and direction of linear association between two continuous variables.
- Choosing the Proper Statistical Test,Use data type number of groups and assumptions flowchart to select test.
- Corrections for multiple comparisons,Adjust family-wise error with Bonferroni or Tukey after multiple tests.
- power analysis,Calculates needed sample size given expected effect size alpha and desired power.
- Statistical Power and Effect Sizes,Relates true-positive sensitivity to effect magnitude sample size and alpha.
- Type I error,False positive where true null is wrongly rejected; controlled by alpha.
- Type II error,False negative where false null is not rejected; probability beta.
- alpha,Chosen risk threshold for Type I error commonly 0.05.
- beta,Probability of Type II error; power equals one minus beta.
- Randomization,Assign treatments by chance to avoid selection bias.
- Confounding Variables,Factors that covary with treatment and distort the true effect.
- Blocking and Stratification,Group by known source of variation such as soil type to reduce error.
- Sampling Strategies,Random stratified and cluster approaches ensure representative independent units.
- Blinding (Single-Blind or Double-Blind),Conceal group assignments to reduce observer and participant bias.
- Pilot Studies,Small trial run to test feasibility and refine protocols before full experiment.
- Data Visualization in Biology,First look at data to reveal patterns outliers and relationships.
- ggplot2 and the grammar of graphics,Layered system mapping data to aesthetics and geoms for reproducible figures.
- scatterplot,Plots two continuous variables; add color by species to reveal clustering.
- histogram,Displays distribution shape of one variable; choose bin width carefully.
- box plot,Shows median IQR and outliers per group; quick comparison of distributions.
- bar plot,Shows means or counts with error bars; use sparingly to avoid hiding variation.
- The Palmer Penguins dataset,344 penguins with 8 variables ideal for teaching ggplot and stats.
- iris R dataset,150 flowers with four measurements used for ANOVA PCA examples.
- R programming - functions,Wrap reusable code blocks with arguments and return values.
- R programming - Rmd file format,Combine prose code and output; knit to PDF or HTML for reports.
- Bayesian Analysis,Updates prior beliefs with data; example ecological occupancy model.
- Resampling Methods (Permutation tests),Shuffle labels to build null distribution when assumptions fail.
- Bootstrapping,Sample with replacement to estimate CI of medians or slopes.
- Factorial Design,Tests multiple factors and their interaction in one ANOVA.
- Interaction Effect,When factor A’s effect depends on factor B; interpret via interaction plot.
- Repeated Measures Design,Same subject measured over conditions; reduces individual variance.
- Cross-Over Design,Each participant receives all treatments in different periods.
- Quasi-Experimental Design,Lacks random assignment yet seeks causal inference; policy studies.
- Case-Control Study,Compares diseased vs healthy groups to identify risk factors.
- Field vs. Laboratory (In vivo vs In vitro),Trade realism for control; match question to setting.
- Pseudoreplication,Treating non-independent subsamples as true replicates inflates n.
- Experimental Unit,Smallest independent entity assigned to a treatment.
- Observer Bias,Researcher expectations skew data collection; mitigate with blinding.
- Help with a code bug in R,Copy code and error; tutor guides fixes without giving full answer.
- RStudio,IDE used in labs; console
- R programming - objects,Store data/values in named containers: vectors data frames lists.
- R programming - print() function,Displays object content; implicit in console but explicit in Rmd.
- R programming - pipelines (%>%),dplyr operator chaining verbs into readable workflow.
- dplyr verbs,select filter mutate summarise arrange join for data wrangling.
- readr functions,read_csv read_tsv for fast import with automatic type guess.
- ggplot2 aesthetics (aes),Map variables to x y color size shape inside ggplot calls.
- geom_point,Scatterplot layer for continuous vs continuous relationships.
- geom_boxplot,Shows median IQR whiskers and outliers per group.
- geom_histogram,Bins continuous data to reveal distribution shape.
- geom_bar,"Count or summarised height per category; add stat=""identity"" for means."
- Theme customization (ggplot2),Modify titles text and grid; theme_bw theme_minimal examples.
- facet_wrap,Create small multiples by a single variable for quick comparisons.
- facet_grid,Grid of plots by two factors; rows × cols interaction display.
- Data transformations (log, sqrt)
- Back transformation,Convert transformed estimates back to original units for interpretation.
- Homogeneity of variance (Homoscedasticity),Equal group variances assumption for t and ANOVA.
- Fligner-Killeen test,Non-parametric test for equal variances across groups.
- Shapiro-Wilks test,Formal normality test suited for n 3-5000.
- Kolmogorov–Smirnov test,Compares sample CDF to theoretical; sensitive to shifts.
- D'Agostino's K^2 test,Assesses combined skewness and kurtosis deviation from normality.
- Effect size measures (Cohen's d),Standardised mean difference aiding practical significance.
- Residual diagnostics,Plot residuals vs fitted to spot non-linearity or heteroscedasticity.
- Leverage and influence,Detect outliers affecting regression via Cook’s distance.
- Power curve,Graph power vs sample size to choose efficient n.
- Sample size calculator,Plug alpha beta effect size to compute required n.
- Missing data handling,Listwise deletion vs imputation; MCAR MAR MNAR concepts.
- Confusion Matrix,2×2 table of predicted vs actual; TP FP TN FN counts.
- Receiver Operating Characteristic (ROC) curve,Sensitivity vs 1-specificity across thresholds; AUC metric.
- Precision and Recall,Positive predictive value and sensitivity for imbalanced data.
- F1 score,Harmonic mean of precision and recall; balances false results.
- Heat map,Color-coded grid for matrix data or correlation matrices.
- Violin plot,Combines boxplot with kernel density; shows distribution shape.
- Pie charts (why to avoid),Poor at area comparison; prefer bar or stacked bar.
- DataHub platform,UCSD cloud RStudio workspace used for coding labs.
- Markdown syntax in Rmd,Headings code fences lists links to format reproducible reports.
- Syllabus – Course Description,"Data Analysis and Design for Biologists (4 credits) is a practical introduction to information literacy, experimental design, and data analysis for life-science majors. Students learn coding, data management, visualization, and quantitative reasoning using the R language and RStudio IDE. This is NOT a traditional statistics course and has no math prerequisites; the emphasis is on asking biologically meaningful questions, choosing appropriate analyses, and interpreting results."
- Syllabus – Learning Outcomes,"By the end of the quarter students will be able to: 1) Create testable hypotheses for valid biological questions, 2) Evaluate the credibility of scientific information, 3) Design experiments that effectively test hypotheses, 4) Construct publication-quality figures, 5) Perform appropriate statistical analyses in R, 6) Interpret quantitative results in biological context, 7) Utilize R for data manipulation and graphing, 8) Combine the full investigative cycle in a student-designed project, 9) Explore the modern intersection of biology, technology, and data science, 10) Examine the ethical responsibilities of scientists when creating and communicating evidence."
- Syllabus Contact Info,Instructor: Dr. Keefe Reuther (he/him/his) please call me Keefe. Email: kdreuther@ucsd.edu (include “BILD 5” in the subject line).
- Syllabus – Lecture Time,Lectures meet M/W/F 2:00–2:50 pm in Center Hall Room 101.
- Syllabus – Final Exam,"Mandatory in-person final: Friday 13 June 2025, 3:00–6:00 pm PST."
- Syllabus – Instructional Assistants,"Instructional Assistants: Yanlin Li (yal037@ucsd.edu), Rakshitha Kobbekaduwa (tkobbekaduwa@ucsd.edu), Mitchell Smith (mis033@ucsd.edu), Saranya Vohra (savohra@ucsd.edu)."
- Syllabus Section Meeting Times,A01 Mon 4:00–4:50 pm WLH 2205; A02 Wed 1:00–1:50 pm Center Hall 222; A03 Wed 8:00–8:50 am WLH 2205; A04 Fri 4:00–4:50 pm WLH 2205.
- Syllabus – Office Hours,Keefe’s office hours: Wed 12:00–1:30 pm (location TBA) and Fri 3:00–4:00 pm (location TBA).
- Syllabus – Prerequisites,None. No prior coding experience or wet-lab background required.
- Syllabus Piazza Discussions,All course Q&A handled on Piazza for rapid community support. Sign-up link: https://piazza.com/ucsd/spring2025/bild5_sp25_a00 . Email only for private matters.
- Syllabus Technology Requirements,"You need a web-enabled device (laptop strongly recommended) to access Canvas, Zoom, and the UCSD DataHub cloud RStudio server. Chromebooks work fine. On-campus loaner laptops are available."
- Syllabus – Course Calendar,"Week-by-week lecture topics: W1 Data types & structures → W2 Visualization & central tendency → W3 Normality & CLT → W4 Hypothesis Testing basics → W5 Power & t-tests → W6 Midterm + ANOVA & correlation → W7 Regression & design choices → W8 Sampling & ethics → W9 Multivariate methods, careers → W10 Review & project help."
- Syllabus – Section Topics,Section labs: W1 Hello RStudio/DataHub; W2 Importing data; W3 ggplot2 visualization; W4 Tidyverse wrangling; W5 Review; W6 Normality tests & t-test; W7 ANOVA; W8 Linear regression; W9 Synthesis; W10 Term-project workshop.
- Syllabus – Deliverables & Due Times,"Assignments due 11:59 pm PST unless stated: Section work weekly, Quizzes W2 4 8 10, Discussion Board posts bi-weekly, Term-Project checkpoints W8 & W9, Final project W10, Midterm (in lecture W6), Final exam."
- Syllabus – Grading Breakdown,"Lecture participation 5 %, Quizzes 15 % (lowest dropped), Section assignments 20 % (lowest dropped), Discussion posts 10 % (lowest dropped), Term Project 20 % (10 % checkpoints + 10 % final), Midterm 10 %, Final Exam 20 %. Pre/Post surveys & SETs up to 1 % extra credit."
- Syllabus – Grading Scale,"A+ 97-100, A 93-96, A- 90-92, B+ 87-89, B 83-86, B- 80-82, C+ 77-79, C 73-76, C- 70-72, D+ 67-69, D 63-66, D- 60-62, F < 60. Grade cut-offs never shift; no rounding."
- Syllabus – Collaboration Policy,"Science is social: discuss concepts and share code, but your submitted answers, RMarkdown narration, and interpretations must be your own. All Rmd PDFs run through plagiarism detection. Any shared AI output must be cited in a one-line statement. No answer-sharing."
- Syllabus – Discussion Board Prompts,"Prompts posted weeks 1 3 5 7 9. A creditable post is original, substantive, and properly cited. Replies like “I agree” do not count. Lowest prompt grade dropped."
- Syllabus – Quizzes Policy,"Canvas quizzes W2 4 8 10, 60 min each, non-cumulative. Quiz 1 includes syllabus questions. Lowest quiz score dropped. No AI tools permitted during a quiz."
- Syllabus Exams Policy,"Midterm held in lecture week 6 (50 min). Final exam cumulative, 3 h window. One 4×6 note card allowed. No reschedule unless OSD or UC-sanctioned event; email Keefe before exam start if emergency."
- Syllabus – Term Project,"Students complete a full investigative cycle using instructor-supplied simulated data: formulate question, hypothesis, choose tests, analyse in R, create figures, interpret, and write report. Two checkpoint drafts receive feedback; grading becomes stricter each stage."
- Syllabus – Extra Credit,Complete three pre-course and three post-course surveys plus SETs for up to 1 % extra credit. No other extra-credit opportunities.
- Syllabus – Late Assignment Policy,"Quiz, Discussion, Project: -2 % per hour late; >48 h late max 50 %. Technical issues near deadline not valid excuses. Lecture participation: up to 18 missed check-ins permitted without penalty."
- Syllabus Attendance Policy,Lecture participation tracked via Mentimeter check-in/out. Up to 18 missed check-ins (~3 weeks) still yields 100 % attendance. Student responsible for tracking absences.
- Syllabus – Academic Integrity & Gen AI,Generative AI is allowed for brainstorming or debugging if you include a one-sentence attribution (tool + assistance). AI use is forbidden during quizzes and exams. Excessive reliance may trigger an oral comprehension quiz.
1
+ TERM,CONTEXT
2
+ aes() mapping,Maps variables to visual properties like x y color size in ggplot
3
+ alpha (significance level),Probability threshold for Type I error commonly set at 0.05
4
+ alternative hypothesis,Research hypothesis claiming an effect or difference exists
5
+ animal welfare protocols,Ethical guidelines ensuring humane treatment in research with vertebrates
6
+ ANOVA (one-way),Tests for mean differences across three or more groups
7
+ assumptions of linear regression,Linearity normality of residuals and homoscedasticity requirements
8
+ augment(),broom function that adds residuals fitted values and diagnostics to model data
9
+ bar plot,Shows counts or means for categorical variables
10
+ bcPower(),Function in car package for Box-Cox power transformations
11
+ binary data,Categorical variable with two levels like yes/no or presence/absence
12
+ bioinformatics and computational biology methods,Sequence alignment phylogenetics protein folding machine learning for genomic data
13
+ biological replicate,Independent experimental units providing true replication
14
+ blinding,Concealing treatment assignment to reduce bias
15
+ blocking,Grouping by known variable like age or location to control its effects
16
+ Bonferroni correction,Adjusts alpha by dividing by number of tests to control Type I error
17
+ bootstrapping,Resampling with replacement to estimate confidence intervals and standard errors
18
+ Box-Cox transformation,Power transformation to normalize data and stabilize variance using optimal lambda
19
+ boxplot,Displays median IQR whiskers and outliers for group comparisons
20
+ broom package,Tidies model output into data frames for easier manipulation
21
+ case-control study,Compares groups with and without outcome to identify risk factors
22
+ categorical data,Qualitative groups like species or treatment levels
23
+ Central Limit Theorem,Sample means approach normal distribution as n increases regardless of population shape
24
+ central tendency,Measures of data center including mean median and mode
25
+ chi-squared goodness of fit,Tests if observed frequencies match expected frequencies for one categorical variable
26
+ chi-squared test of independence,Tests if two categorical variables are associated or independent
27
+ CO2 dataset,Built-in R dataset with plant uptake measurements used for regression examples
28
+ coefficient of determination (R²),Proportion of variance in response explained by predictors
29
+ Cohen's d,Standardized effect size measure for mean differences
30
+ confidence interval (95%),Range likely to contain true parameter value with 95% confidence
31
+ confounding variable,Factor that correlates with both treatment and outcome
32
+ conservation biology methods,Population viability analysis habitat modeling biodiversity assessment species monitoring
33
+ continuous data,Quantitative measurements like weight length or concentration
34
+ control group,Baseline comparison receiving no treatment or standard treatment in experiments
35
+ Cook's distance,Measures influence of each observation on regression model identifies outliers
36
+ cor.test(),R function for testing correlation significance between two variables
37
+ correlation coefficient (r),Standardized measure of linear association from -1 to 1
38
+ cross-over design,Each participant receives all treatments in different periods with washout between
39
+ cross-sectional study,Data collected at single time point across different subjects
40
+ data transformation,Mathematical modifications like log or square root to meet assumptions
41
+ discrete data,Count data taking only integer values
42
+ double-blind,Neither participants nor researchers know treatment assignment
43
+ dplyr,R package for data manipulation with verbs like select filter mutate
44
+ ecology and evolution methods,Mark-recapture species distribution modeling community ecology population genetics
45
+ effect size,Magnitude of difference between groups independent of sample size
46
+ ethics in research,Principles ensuring participant welfare and scientific integrity
47
+ experimental unit,Smallest independent unit receiving treatment assignment
48
+ exploratory data analysis (EDA),Initial data examination to understand patterns before formal testing
49
+ facet_grid,Creates grid of plots by two categorical variables in ggplot2
50
+ facet_wrap,Creates small multiples by single variable for quick comparisons
51
+ factorial design,Tests multiple factors and their interactions simultaneously
52
+ false discovery rate,Expected proportion of false positives among rejected hypotheses
53
+ field study,Research in natural environment with ecological validity
54
+ filter(),dplyr function to subset rows based on conditions
55
+ fitted values,Model predictions for each observation in regression
56
+ Fligner-Killeen test,Non-parametric test for equal variances across groups
57
+ generalized linear model (GLM),Extension of linear models for non-normal response distributions
58
+ genomics and molecular methods,CRISPR gene editing RNA-seq ChIP-seq proteomics single-cell analysis
59
+ geom_bar,Bar chart layer for categorical data in ggplot2
60
+ geom_boxplot,Boxplot layer for group comparisons in ggplot2
61
+ geom_histogram,Histogram layer for distribution visualization
62
+ geom_point,Scatterplot layer for continuous relationships
63
+ geom_smooth,Adds regression line or smoothed curve to plots
64
+ ggplot2,R package for creating layered graphics using grammar of graphics
65
+ group_by(),dplyr function to perform operations by groups
66
+ heteroscedasticity,Unequal variance violating assumptions of parametric tests
67
+ histogram,Shows distribution of continuous variable using bins
68
+ homoscedasticity,Equal variance assumption for groups or across predictor range
69
+ hypothesis testing framework,Structured approach to testing claims using null and alternative hypotheses
70
+ IACUC,Institutional Animal Care and Use Committee overseeing vertebrate research ethics
71
+ in vitro,Experiments in controlled environment outside living organism
72
+ in vivo,Experiments conducted in living organisms
73
+ informed consent,Ethical requirement for human subjects to voluntarily agree to participate
74
+ Institutional Review Board (IRB),Committee ensuring ethical standards in human subjects research
75
+ intercept,Predicted y value when x equals zero in regression equation
76
+ interquartile range (IQR),Range between 25th and 75th percentiles robust to outliers
77
+ iris dataset,Classic R dataset with 150 flower measurements for classification examples
78
+ Kolmogorov-Smirnov test,Tests if sample comes from specified distribution like normal
79
+ kurtosis,Measure of distribution tail heaviness relative to normal
80
+ lambda (λ),Transformation parameter in Box-Cox determining optimal power
81
+ leverage,Measure of how extreme predictor values are potential for influence
82
+ linear regression,Models relationship between predictor and continuous response variable
83
+ lm(),R function for fitting linear models returns coefficients and diagnostics
84
+ log transformation,Common transformation for right-skewed data or multiplicative relationships
85
+ longitudinal study,Data collected from same subjects over multiple time points
86
+ marine and environmental science methods,Ocean sampling environmental DNA water quality assessment climate modeling
87
+ MASS package,R package containing functions for modern applied statistics
88
+ mean,Average value sum divided by n central tendency measure used in t-tests ANOVA
89
+ median,Middle value when ordered robust central tendency measure for boxplots IQR
90
+ microbiology and immunology methods,Flow cytometry ELISA viral quantification microbiome analysis antibiotic resistance testing
91
+ mode,Most frequent value in dataset third measure of central tendency
92
+ model diagnostics,Checking assumptions through residual plots QQ plots and formal tests
93
+ multiple comparisons problem,Increased Type I error risk when conducting multiple tests
94
+ multiple regression,Linear model with two or more predictor variables
95
+ mutate(),dplyr function to create or modify columns
96
+ negative control,Treatment known to have no effect checks for artifacts
97
+ neuroscience methods,Electrophysiology fMRI optogenetics behavior tracking connectomics analysis
98
+ normality,Bell-shaped Gaussian distribution assumption for parametric tests checked Week 3
99
+ null hypothesis,Statement of no effect or no difference to be tested
100
+ observational study,No treatment manipulation only observation of existing variation
101
+ observer bias,Researcher expectations influence data collection or interpretation
102
+ one-sample t-test,Tests if sample mean differs from hypothesized population value
103
+ open science,Transparency practices including data sharing preprints reproducible code
104
+ ordinary least squares (OLS),Method minimizing sum of squared residuals to fit regression line
105
+ outlier,Data point substantially different from other observations
106
+ p-value,Probability of obtaining results as extreme as observed if null hypothesis true
107
+ paired t-test,Compares matched observations like before-after measurements
108
+ Palmer Penguins dataset,Modern alternative to iris with 344 penguin measurements
109
+ parametric tests,Statistical tests assuming specific probability distributions
110
+ pilot study,Small preliminary study testing feasibility and methods
111
+ pipe operator (|> or %>%),Chains functions together for readable workflows in R
+ plant biology methods,Photosynthesis measurement growth assays metabolomics gene expression tissue culture
+ plot(),Base R function for creating diagnostic plots from lm objects
+ positive control,Treatment known to produce effect validates experiment
+ post-hoc tests,Pairwise comparisons following significant omnibus test like ANOVA
+ power (1-β),Probability of correctly rejecting false null hypothesis
+ power analysis,Calculates needed sample size given expected effect alpha and power
+ powerTransform(),car package function to find optimal Box-Cox lambda value
+ pre-registration,Publishing study design and analysis plan before data collection
+ predictor variable,Independent variable used to predict outcome in regression
+ protected health information (PHI),Confidential patient data requiring special ethical handling
+ pseudoreplication,Incorrectly treating non-independent observations as replicates
+ QQ plot,Graphical method comparing data distribution to theoretical normal
+ quasi-experimental design,Lacks random assignment but seeks causal inference
+ R programming language,Statistical computing environment widely used in biological research
+ R squared,Proportion of variance explained by regression model
+ random sampling,Selection where each member has equal probability of inclusion
+ randomization,Random assignment to treatments prevents systematic bias
+ randomized controlled trial (RCT),Gold standard experimental design with random treatment assignment
+ range,Maximum minus minimum quick variability check sensitive to outliers
+ regression assumptions,Requirements including linearity normality and constant variance
+ regression diagnostics,Tools for checking model assumptions using residuals and influence measures
+ repeated measures design,Same subjects measured under multiple conditions reduces variance
+ replication,Multiple independent observations per treatment group essential Week 9 concept
+ research misconduct,Fabrication falsification plagiarism violations of scientific integrity
+ residual standard error,Estimate of standard deviation of residuals around regression line
+ residuals,Differences between observed and predicted values in regression
+ response variable,Dependent variable being predicted in regression analysis
+ sample size (n),Number of independent observations affects power and uncertainty
+ sampling distribution,Distribution of sample statistics across repeated sampling
+ scatterplot,Plots two continuous variables to show relationships
+ select(),dplyr function to choose specific columns from data frame
+ Shapiro-Wilk test,Statistical test for normality effective for small to moderate samples
+ simple linear regression,Model with single predictor and continuous response
+ skewness,Asymmetry in distribution with longer tail on one side
+ slope,Rate of change in y per unit change in x regression coefficient
+ sqrt transformation,Square root transformation for count data or moderate skew
+ standard deviation,Average spread of data points around the mean
+ standard error,Standard deviation of sampling distribution measures precision
+ statistical methods in biomedicine,Clinical trials survival analysis epidemiology biomarkers meta-analysis
+ statistical significance,Result unlikely due to chance alone typically p < 0.05
+ stratification,Dividing population into subgroups before sampling
+ sum of squares,Total squared deviations used in ANOVA and regression calculations
+ summarize(),dplyr function to calculate summary statistics
+ summary(),R function displaying model coefficients tests and fit statistics
+ systems biology methods,Network analysis metabolic modeling multi-omics integration pathway analysis
+ t-statistic,Test statistic for t-tests ratio of effect to standard error
+ technical replicate,Multiple measurements of same unit not true replication
+ three Rs principle,Replacement reduction refinement in animal research ethics
+ tidy(),broom function converting model output to tidy data frame
+ tidyverse,Collection of R packages for data science including ggplot2 and dplyr
+ transformation parameter,Value like lambda determining type and strength of transformation
+ Tukey HSD,Post-hoc test for pairwise comparisons after significant ANOVA
+ two-sample t-test (unpaired),Compares means of two independent groups
+ Type I error,False positive rejecting true null hypothesis
+ Type II error,False negative failing to reject false null hypothesis
+ variance,Square of standard deviation measuring data dispersion
+ violin plot,Combines boxplot with kernel density to show distribution shape
+ Welch's t-test,Modified t-test for unequal variances between groups
+ Winsorization,Replacing extreme values with less extreme ones to reduce outlier impact