Training Next-Generation Health Data Scientists

Course Description

The MDSH is a two-year, 48-unit program consisting of a public health foundation course (4 units), MDSH core courses (24 units), MDSH electives (16 units), and a data science capstone course (4 units).

Public Health Foundation (4 units)

  • PUBHLT C201 Fundamentals of Public Health Exploration of foundations of public health by examining public health challenges at local, national, and global levels, and current strategies for advancing population health. Analysis of current public health issues and modern public health policies and practices.

MDSH Core Courses (24 units)

  • BIOSTAT 203A,B,C Introduction to Data Science The BIOSTAT 203 three-course sequence introduces practical data science skills (data ingestion, data cleaning, data wrangling, data visualization and reporting, databases) and big data computing skills (parallel, distributed, cluster, and cloud computing) using the computer languages R, Python, SAS, and SQL. Other topics include data ethics.

  • BIOSTAT 201A Introduction to Biostatistics Principles of biostatistics.

  • BIOSTAT 212A,B Statistical Learning The BIOSTAT 212 two-course sequence lays a rigorous foundation for commonly used data analytic tools for prediction, classification, and artificial intelligence (AI), with emphasis on applications to big and complex health data.

MDSH Elective Courses (16 units)

MDSH students take at least 4 elective courses from the following list.

  • BIOSTAT 218 Observational Health Data Science and Informatics An introduction to observational research in the health data sciences. Topics include disease cohort characterization, patient-level prediction, and population-level estimation using administrative claims and electronic health records. Lectures will cover an introduction to observational health databases, a common data model for representing patient trajectories through healthcare systems, tools to manipulate data while preserving patient privacy, the theory of patient-level prediction and causal inference from observational data, and best practices for generating reproducible and reliable observational studies. Introductory theory will demonstrate how linear and generalized linear modeling is used in observational studies. Weekly practical laboratories will demonstrate the methods discussed in lecture. Laboratories will use SQL and R software, and regular homework assignments will reinforce theoretical work with practical application using large-scale synthetic and real-world example databases. Students will design and complete a data analysis project that reflects the best practices covered in this course and translate their results into an oral presentation and written report.

  • BIOSTAT M215 Survival Analysis Data science methods for survival and lifetime data.

  • BIOSTAT 231 Statistical Power and Sample Size Methods for Health Research Sample size and power analysis methods for common study designs, including comparisons of means and proportions, ANOVA, time-to-event data, group sequential trials, linear regression, cluster randomized trials and multilevel data, with emphasis on designing randomized trials. Discussion also of multiple endpoints.

  • BIOSTAT M234 Applied Bayesian Inference Bayesian approach to statistical inference, with emphasis on biomedical applications and concepts rather than mathematical theory. Topics include large sample Bayes inference from likelihoods, noninformative and conjugate priors, empirical Bayes, Bayesian approaches to linear and nonlinear regression, model selection, Bayesian hypothesis testing, and numerical methods.

  • BIOSTAT M236 Longitudinal Data Analysis of continuous responses for which a multivariate normal model may be assumed. Students learn how to think about longitudinal data, how to plot data, and how to specify the mean and variance of a longitudinal response. Advanced topics include introductions to clustered, multivariate, and discrete longitudinal data.

  • BIOSTAT 285 Advanced Topics. Machine Learning: Healthcare, Economics, and LLM This course provides an introduction to modern methods in health data science, focusing on the intersection of machine learning, game theory, and economic principles, particularly in the context of decision-making and interactions between multiple self-interested participants. Topics include advanced patient-level prediction, population-level estimation, and the application of large language models (LLMs) to healthcare data. Homework assignments will focus on applying theoretical concepts to practical scenarios. By the end of the course, students will design and complete a data analysis project that incorporates best practices in machine learning and healthcare, translating their results into both an oral presentation and a written report.

  • BIOSTAT 285 Advanced Topics. Deep Learning: A Statistical Perspective The goal of this course is to study deep learning methodologies and identify related statistical issues. The content includes selected topics from the following: pre-deep-learning methods such as feature extraction and discrimination; components of well-established machine learning tools (support vector machines, reproducing kernel Hilbert spaces, model complexity, sparse models); history of neural networks; multilayer perceptrons; backpropagation; convolutional neural networks; transformer networks; variational inference; generative adversarial networks; optimization and regularization; visualization; Python and deep learning frameworks.

MDSH Capstone (4 units)

  • BIOSTAT 401 Data Science Capstone A capstone project that consists of an original written analysis and an oral presentation that addresses an applied health-related data science topic and advances existing skills and techniques in healthcare or public health. Communication skills for professionals. Data ethics training.