TERM,CONTEXT
Scientific Research Question,"Frame a clear biological question that specifies the variables, the population, and the evidence needed."
Alternative Hypothesis,"Claims a real effect, e.g. feed type changes chick weight."
Null Hypothesis,Assumes no difference or association; target for statistical rejection.
Testable Hypothesis,Must be measurable and falsifiable; imagine axes before data collection.
Prediction vs. Hypothesis,Prediction is specific numeric outcome; hypothesis is broader explanatory claim.
Data types - categorical,Qualitative groups like species or treatment levels analyzed with chi-square or ANOVA.
Data types - continuous/numerical,Measured quantities like bill length or mass used in t-tests and regression.
tidy data,Each row is an observation and each column a variable; essential for dplyr and ggplot.
Descriptive statistics,Summarize center and spread before formal testing to inform next steps.
centrality and variation in statistics,"Use mean, median, SD, and IQR when exploring Palmer Penguins data."
standard deviation,Typical spread of points around the mean (square root of the variance); unit matches variable.
Standard error,SD divided by square root of n; gauges precision of sample mean.
Confidence intervals,Range likely to contain true parameter value such as mean ± 1.96 SE for a 95% interval.
range,Maximum minus minimum; quick variability check sensitive to outliers.
interquartile range,Middle 50 percent of values; robust to extremes.
skewness,Asymmetry in a distribution; may prompt log or square-root transform.
kurtosis,Peakedness or tailedness; high kurtosis means heavy tails.
Parametric Assumptions and Normality Checks,"Before parametric tests, check normality with QQ plots, equal variance with Fligner tests, and independence from the study design."
The Central Limit Theorem,Means of sufficiently large random samples approximate a normal distribution regardless of the shape of the source distribution.
q-q plot,Graph sample quantiles versus theoretical normal quantiles to judge normality.
2 sample t-test,Compares means of two independent groups such as linseed vs meatmeal feeds.
paired t-test,Tests mean difference of matched observations like before-after designs.
ANOVA tests,Detects mean differences across three or more groups and uses Tukey HSD post-hoc.
Chi-Squared test,Assesses independence between categorical variables in a contingency table.
linear regression,Models relationship between predictor and continuous response; yields slope and intercept.
correlation,Quantifies strength and direction of linear association between two continuous variables.
Choosing the Proper Statistical Test,"Use a flowchart based on data type, number of groups, and assumptions to select the test."
Corrections for multiple comparisons,Adjust family-wise error with Bonferroni or Tukey after multiple tests.
power analysis,"Calculates needed sample size given expected effect size, alpha, and desired power."
Statistical Power and Effect Sizes,"Relates true-positive sensitivity to effect magnitude, sample size, and alpha."
Type I error,False positive where true null is wrongly rejected; controlled by alpha.
Type II error,False negative where false null is not rejected; probability beta.
alpha,Chosen risk threshold for Type I error commonly 0.05.
beta,Probability of Type II error; power equals one minus beta.
Randomization,Assign treatments by chance to avoid selection bias.
Confounding Variables,Factors that covary with treatment and distort the true effect.
Blocking and Stratification,Group by known source of variation such as soil type to reduce error.
Sampling Strategies,"Random, stratified, and cluster approaches ensure representative independent units."
Blinding (Single-Blind or Double-Blind),Conceal group assignments to reduce observer and participant bias.
Pilot Studies,Small trial run to test feasibility and refine protocols before full experiment.
Data Visualization in Biology,"First look at data to reveal patterns, outliers, and relationships."
ggplot2 and the grammar of graphics,Layered system mapping data to aesthetics and geoms for reproducible figures.
scatterplot,Plots two continuous variables; add color by species to reveal clustering.
histogram,Displays distribution shape of one variable; choose bin width carefully.
box plot,"Shows median, IQR, and outliers per group; quick comparison of distributions."
bar plot,Shows means or counts with error bars; use sparingly to avoid hiding variation.
The Palmer Penguins dataset,344 penguins with 8 variables ideal for teaching ggplot and stats.
iris R dataset,150 flowers with four measurements; used for ANOVA and PCA examples.
R programming - functions,Wrap reusable code blocks with arguments and return values.
R programming - Rmd file format,"Combine prose, code, and output; knit to PDF or HTML for reports."
Bayesian Analysis,Updates prior beliefs with data; example ecological occupancy model.
Resampling Methods (Permutation tests),Shuffle labels to build null distribution when assumptions fail.
Bootstrapping,Sample with replacement to estimate CI of medians or slopes.
Factorial Design,Tests multiple factors and their interaction in one ANOVA.
Interaction Effect,When factor A’s effect depends on factor B; interpret via interaction plot.
Repeated Measures Design,Same subject measured over conditions; reduces individual variance.
Cross-Over Design,Each participant receives all treatments in different periods.
Quasi-Experimental Design,Lacks random assignment yet seeks causal inference; policy studies.
Case-Control Study,Compares diseased vs healthy groups to identify risk factors.
Field vs. Laboratory (In vivo vs In vitro),Trade realism for control; match question to setting.
Pseudoreplication,Treating non-independent subsamples as true replicates inflates n.
Experimental Unit,Smallest independent entity assigned to a treatment.
Observer Bias,Researcher expectations skew data collection; mitigate with blinding.
Help with a code bug in R,Copy code and error; tutor guides fixes without giving full answer.
RStudio,IDE used in labs; console
R programming - objects,"Store data/values in named containers: vectors, data frames, and lists."
R programming - print() function,Displays object content; implicit in console but explicit in Rmd.
R programming - pipelines (%>%),magrittr pipe operator (made available by dplyr) chaining verbs into readable workflow.
dplyr verbs,"select, filter, mutate, summarise, arrange, and joins for data wrangling."
readr functions,read_csv and read_tsv for fast import with automatic column-type guessing.
ggplot2 aesthetics (aes),"Map variables to x, y, color, size, and shape inside ggplot calls."
geom_point,Scatterplot layer for continuous vs continuous relationships.
geom_boxplot,"Shows median, IQR, whiskers, and outliers per group."
geom_histogram,Bins continuous data to reveal distribution shape.
geom_bar,"Count or summarised height per category; add stat=""identity"" for means."
Theme customization (ggplot2),"Modify titles, text, and grid; theme_bw and theme_minimal are examples."
facet_wrap,Create small multiples by a single variable for quick comparisons.
facet_grid,Grid of plots by two factors; rows × cols interaction display.
"Data transformations (log, sqrt)",
Back transformation,Convert transformed estimates back to original units for interpretation.
Homogeneity of variance (Homoscedasticity),Equal group variances assumption for t and ANOVA.
Fligner-Killeen test,Non-parametric test for equal variances across groups.
Shapiro-Wilk test,Formal normality test suited to sample sizes from 3 to 5000.
Kolmogorov–Smirnov test,Compares sample CDF to theoretical; sensitive to shifts.
D'Agostino's K^2 test,Assesses combined skewness and kurtosis deviation from normality.
Effect size measures (Cohen's d),Standardised mean difference aiding practical significance.
Residual diagnostics,Plot residuals vs fitted to spot non-linearity or heteroscedasticity.
Leverage and influence,Detect points that unduly affect a regression fit via Cook’s distance.
Power curve,Graph power vs sample size to choose efficient n.
Sample size calculator,"Plug in alpha, beta, and effect size to compute required n."
Missing data handling,"Listwise deletion vs imputation; MCAR, MAR, and MNAR concepts."
Confusion Matrix,"2×2 table of predicted vs actual; TP, FP, TN, and FN counts."
Receiver Operating Characteristic (ROC) curve,Sensitivity vs 1-specificity across thresholds; AUC metric.
Precision and Recall,Positive predictive value and sensitivity for imbalanced data.
F1 score,Harmonic mean of precision and recall; balances false positives against false negatives.
Heat map,Color-coded grid for matrix data or correlation matrices.
Violin plot,Combines boxplot with kernel density; shows distribution shape.
Pie charts (why to avoid),Poor at area comparison; prefer bar or stacked bar.
DataHub platform,UCSD cloud RStudio workspace used for coding labs.
Markdown syntax in Rmd,"Headings, code fences, lists, and links to format reproducible reports."
Syllabus – Course Description,"Data Analysis and Design for Biologists (4 credits) is a practical introduction to information literacy, experimental design, and data analysis for life-science majors. Students learn coding, data management, visualization, and quantitative reasoning using the R language and RStudio IDE. This is NOT a traditional statistics course and has no math prerequisites; the emphasis is on asking biologically meaningful questions, choosing appropriate analyses, and interpreting results."
Syllabus – Learning Outcomes,"By the end of the quarter students will be able to: 1) Create testable hypotheses for valid biological questions, 2) Evaluate the credibility of scientific information, 3) Design experiments that effectively test hypotheses, 4) Construct publication-quality figures, 5) Perform appropriate statistical analyses in R, 6) Interpret quantitative results in biological context, 7) Utilize R for data manipulation and graphing, 8) Combine the full investigative cycle in a student-designed project, 9) Explore the modern intersection of biology, technology, and data science, 10) Examine the ethical responsibilities of scientists when creating and communicating evidence."
Syllabus – Contact Info,Instructor: Dr. Keefe Reuther (he/him/his); please call me Keefe. Email: kdreuther@ucsd.edu (include “BILD 5” in the subject line).
Syllabus – Lecture Time,Lectures meet M/W/F 2:00–2:50 pm in Center Hall Room 101.
Syllabus – Final Exam,"Mandatory in-person final: Friday 13 June 2025, 3:00 – 6:00 pm PST."
Syllabus – Instructional Assistants,"Instructional Assistants: Yanlin Li (yal037@ucsd.edu), Rakshitha Kobbekaduwa (tkobbekaduwa@ucsd.edu), Mitchell Smith (mis033@ucsd.edu), Saranya Vohra (savohra@ucsd.edu)."
Syllabus – Section Meeting Times,A01 Mon 4:00–4:50 pm WLH 2205; A02 Wed 1:00–1:50 pm Center Hall 222; A03 Wed 8:00–8:50 am WLH 2205; A04 Fri 4:00–4:50 pm WLH 2205.
Syllabus – Office Hours,Keefe’s office hours: Wed 12:00–1:30 pm (location TBA) and Fri 3:00–4:00 pm (location TBA).
Syllabus – Prerequisites,None. No prior coding experience or wet-lab background required.
Syllabus – Piazza Discussions,All course Q&A handled on Piazza for rapid community support. Sign-up link: https://piazza.com/ucsd/spring2025/bild5_sp25_a00 . Email only for private matters.
Syllabus – Technology Requirements,"You need a web-enabled device (laptop strongly recommended) to access Canvas, Zoom, and the UCSD DataHub cloud RStudio server. Chromebooks work fine. On-campus loaner laptops are available."
Syllabus – Course Calendar,"Week-by-week lecture topics: W1 Data types & structures → W2 Visualization & central tendency → W3 Normality & CLT → W4 Hypothesis Testing basics → W5 Power & t-tests → W6 Midterm + ANOVA & correlation → W7 Regression & design choices → W8 Sampling & ethics → W9 Multivariate methods, careers → W10 Review & project help."
Syllabus – Section Topics,Section labs: W1 Hello RStudio/DataHub; W2 Importing data; W3 ggplot2 visualization; W4 Tidyverse wrangling; W5 Review; W6 Normality tests & t-test; W7 ANOVA; W8 Linear regression; W9 Synthesis; W10 Term-project workshop.
Syllabus – Deliverables & Due Times,"Assignments due 11:59 pm PST unless stated: Section work weekly; Quizzes W2, W4, W8, and W10; Discussion Board posts bi-weekly; Term-Project checkpoints W8 & W9; Final project W10; Midterm (in lecture W6); Final exam."
Syllabus – Grading Breakdown,"Lecture participation 5 %, Quizzes 15 % (lowest dropped), Section assignments 20 % (lowest dropped), Discussion posts 10 % (lowest dropped), Term Project 20 % (10 % checkpoints + 10 % final), Midterm 10 %, Final Exam 20 %. Pre/Post surveys & SETs up to 1 % extra credit."
Syllabus – Grading Scale,"A+ 97-100, A 93-96, A- 90-92, B+ 87-89, B 83-86, B- 80-82, C+ 77-79, C 73-76, C- 70-72, D+ 67-69, D 63-66, D- 60-62, F < 60. Grade cut-offs never shift; no rounding."
Syllabus – Collaboration Policy,"Science is social: discuss concepts and share code, but your submitted answers, RMarkdown narration, and interpretations must be your own. All Rmd PDFs run through plagiarism detection. Any shared AI output must be cited in a one-line statement. No answer-sharing."
Syllabus – Discussion Board Prompts,"Prompts posted weeks 1, 3, 5, 7, and 9. A creditable post is original, substantive, and properly cited. Replies like “I agree” do not count. Lowest prompt grade dropped."
Syllabus – Quizzes Policy,"Canvas quizzes W2, W4, W8, and W10; 60 min each, non-cumulative. Quiz 1 includes syllabus questions. Lowest quiz score dropped. No AI tools permitted during a quiz."
Syllabus – Exams Policy,"Midterm held in lecture week 6 (50 min). Final exam cumulative, 3 h window. One 4×6 note card allowed. No reschedule unless OSD or UC-sanctioned event; email Keefe before exam start if emergency."
Syllabus – Term Project,"Students complete a full investigative cycle using instructor-supplied simulated data: formulate question, hypothesis, choose tests, analyse in R, create figures, interpret, and write report. Two checkpoint drafts receive feedback; grading becomes stricter each stage."
Syllabus – Extra Credit,Complete three pre-course and three post-course surveys plus SETs for up to 1 % extra credit. No other extra-credit opportunities.
Syllabus – Late Assignment Policy,"Quiz, Discussion, Project: -2 % per hour late; >48 h late max 50 %. Technical issues near deadline not valid excuses. Lecture participation: up to 18 missed check-ins permitted without penalty."
Syllabus – Attendance Policy,Lecture participation tracked via Mentimeter check-in/out. Up to 18 missed check-ins (~3 weeks) still yields 100 % attendance. Student responsible for tracking absences.
Syllabus – Academic Integrity & Gen AI,Generative AI is allowed for brainstorming or debugging if you include a one-sentence attribution (tool + assistance). AI use is forbidden during quizzes and exams. Excessive reliance may trigger an oral comprehension quiz.