Statistics and Probability: Data, Distributions, and Inference

Statistics and probability form the mathematical backbone of how knowledge is extracted from noisy, incomplete, and uncertain data. This page covers the core definitions, structural mechanics, classification boundaries, and common pitfalls of statistical reasoning — from basic distributions to inferential logic. The material spans descriptive and inferential statistics, classical and Bayesian frameworks, and the tensions that arise when real-world data collides with textbook assumptions.


Definition and scope

A poll predicts an election winner with a 3-point margin of error. A pharmaceutical trial reports a p-value of 0.04. A quality-control engineer flags a batch as defective after sampling 200 units. All three situations rest on the same foundational machinery: using partial information to draw defensible conclusions about something larger.

Probability is the mathematical study of uncertainty — it assigns numerical likelihoods to outcomes, ranging from 0 (impossible) to 1 (certain). Statistics is the discipline that uses those probability models to analyze data: summarizing what was observed (descriptive statistics) and making inferences about what wasn't (inferential statistics). The two fields are formally linked but have different orientations. Probability works forward from a model to predict data. Statistics works backward from data to evaluate or build models.

The American Statistical Association (ASA) defines statistics as "the science of learning from data, and of measuring, controlling, and communicating uncertainty." That definition quietly contains three distinct jobs: design (how data is collected), analysis (what patterns exist), and communication (what those patterns mean given the uncertainty involved). Most confusion in applied statistics comes from collapsing these three into one.

The scope of this subject intersects applied mathematics, experimental science, social science, and machine learning. The statistics and probability curriculum in U.S. high schools and universities is formally organized around this connection — from the Common Core math standards, which introduce data literacy as early as grade 6, through Advanced Placement Statistics, which covers inference for proportions and means for more than 200,000 students annually (College Board AP Program).


Core mechanics or structure

The structural skeleton of statistics runs through four layers: data, distribution, estimation, and inference.

Data is the raw material — observed measurements or counts. Data types determine which methods apply. Quantitative data (heights, temperatures, reaction times) supports arithmetic operations. Categorical data (blood type, zip code, yes/no responses) does not.

Distributions describe how values are spread. The normal distribution — the symmetric bell curve described by Carl Friedrich Gauss in the early 19th century — is the most widely applied, partly because of the Central Limit Theorem (CLT): the sampling distribution of a mean approaches normality as sample size grows, regardless of the population's original shape. The CLT is why standard errors and z-tests work in so many settings. Other workhorses include the binomial distribution (counts of successes in fixed trials), the Poisson distribution (events per unit time or space), and the exponential distribution (waiting times).
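The CLT can be seen directly by simulation. The sketch below (an illustration, not from the source) draws repeated samples from a strongly skewed exponential population and checks that the sample means cluster in a roughly normal way around the population mean:

```python
import random
import statistics

random.seed(42)

# Population: exponential with rate 1, strongly right-skewed,
# with mean 1 and standard deviation 1.
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# 2,000 independent sample means at n = 50. By the CLT their distribution
# is approximately normal with mean 1 and standard error 1/sqrt(50) ≈ 0.141.
means = [sample_mean(50) for _ in range(2000)]

print(round(statistics.fmean(means), 2))  # near 1.0
print(round(statistics.stdev(means), 2))  # near 0.14
```

Despite the skew of the underlying population, the spread of the simulated means matches the CLT's prediction closely.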

Estimation translates sample data into statements about population parameters. A point estimate (e.g., a sample mean of 47.3) gives a single best guess. A confidence interval wraps that guess in a range that would capture the true parameter in a specified proportion of repeated samples — 95% intervals being the most common convention in published research (NIST/SEMATECH e-Handbook of Statistical Methods).
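As a minimal sketch with hypothetical data, a 95% t-interval around a sample mean of about 47.3 can be computed by hand:

```python
import statistics

# Hypothetical sample of n = 10 measurements.
data = [46.2, 48.1, 47.5, 46.9, 47.8, 48.4, 46.5, 47.2, 47.9, 46.8]

n = len(data)
mean = statistics.fmean(data)                 # about 47.3
se = statistics.stdev(data) / n ** 0.5        # standard error of the mean
t_crit = 2.262                                # t critical value, df = 9, 95% two-sided

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval, not the point estimate, is what carries the repeated-sampling guarantee described above.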

Inference uses probability to test hypotheses. In standard null-hypothesis significance testing — a hybrid of Fisher's significance testing and the Neyman-Pearson decision framework — a null hypothesis (H₀) is tested against an alternative (H₁). The test statistic measures how far the observed data sits from what H₀ predicts, and the p-value quantifies how often a result at least that extreme would appear if H₀ were true. A p-value below a pre-set threshold α (commonly 0.05) leads to rejection of H₀.
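A minimal large-sample version of this logic, using only the standard library and hypothetical data, looks like this (a z-approximation sketch, not a full t-test):

```python
from statistics import NormalDist, fmean, stdev

# Hypothetical measurements; H0: population mean mu = 250.
data = [248, 255, 261, 249, 252, 258, 247, 256, 253, 250,
        259, 251, 246, 254, 257, 252, 249, 255, 260, 248]

mu0 = 250
z = (fmean(data) - mu0) / (stdev(data) / len(data) ** 0.5)

# Two-sided p-value: how often a statistic at least this extreme
# would occur under H0, using the normal approximation.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(p_value < 0.05)  # True here: reject H0 at alpha = 0.05
```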


Causal relationships or drivers

Statistical associations and causal relationships are structurally different, and conflating them is one of the most consequential errors in applied work. Two variables can be correlated because one causes the other, because a third variable causes both (confounding), or purely by chance (spurious correlation).

The field of causal inference — formalized substantially through the work of Judea Pearl (Pearl, Causality, 2nd ed., Cambridge University Press, 2009) and the potential outcomes framework developed by Donald Rubin — provides mathematical tools for reasoning about causation from observational data. The key instrument is the directed acyclic graph (DAG), which encodes assumed causal structures and allows analysts to identify which variables to control for and which to leave alone.

Randomized controlled trials (RCTs) remain the design that most directly supports causal inference, because random assignment balances confounders across groups in expectation. When randomization is impossible — in economics, epidemiology, or education research — analysts use quasi-experimental designs: difference-in-differences, regression discontinuity, or instrumental variables.
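The mechanics of random assignment are simple; the sketch below (with hypothetical subject IDs) shuffles subjects into two arms without consulting any of their attributes, which is exactly what balances confounders in expectation:

```python
import random

random.seed(7)

# Twenty hypothetical subject IDs, randomly split into two equal arms.
subjects = [f"S{i:02d}" for i in range(20)]
random.shuffle(subjects)
treatment, control = subjects[:10], subjects[10:]

# Assignment ignored every subject attribute, so any confounder
# (age, severity, ...) is balanced across arms in expectation.
print(len(treatment), len(control))  # 10 10
```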

Sample size is another driver. Underpowered studies (those with too few observations to reliably detect a real effect) produce noisy, unreliable estimates, and they raise the proportion of false positives and exaggerated effects among the significant results they do report. The National Institutes of Health (NIH) requires sample size justification and power calculations in grant applications for exactly this reason.
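A common back-of-the-envelope power calculation uses the normal approximation; the sketch below illustrates the standard formula (it is not an NIH-prescribed method) for the per-group sample size in a two-sample comparison of means:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means
    at standardized effect size d (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Medium effect (d = 0.5) at the conventional alpha and power:
print(n_per_group(0.5))  # 63; exact t-based methods give about 64
```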


Classification boundaries

Statistics divides along several clean axes, each with consequences for method selection.

Descriptive vs. inferential: Descriptive statistics summarize the data in hand — mean, median, standard deviation, interquartile range, histograms. Inferential statistics project conclusions beyond the observed sample to a broader population using probability models. Mixing these up produces claims that look inferential but are only descriptive.

Parametric vs. nonparametric: Parametric methods assume the data follows a distribution with a fixed functional form (e.g., normal, Poisson). Nonparametric methods — Mann-Whitney U, Kruskal-Wallis, Spearman's rank correlation — make weaker distributional assumptions and are more robust when those assumptions fail, at the cost of some statistical power.
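The Mann-Whitney U statistic itself follows directly from its definition; the sketch below pairs it with a simplified large-sample p-value (no tie or continuity correction):

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(x, y):
    """U for group x: count of (xi, yj) pairs with xi > yj, ties counted as 1/2."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

def u_test_p(x, y):
    """Two-sided p-value via the large-sample normal approximation
    (simplified: no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (mann_whitney_u(x, y) - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Two clearly separated hypothetical groups:
print(u_test_p([1.1, 2.3, 2.9, 3.8], [4.0, 4.5, 5.1, 5.9]))  # about 0.02
```

Note that no distributional form is assumed for the data themselves, only for the sampling distribution of U — which is what makes the test nonparametric.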

Frequentist vs. Bayesian: Frequentist inference treats parameters as fixed but unknown and defines probability through long-run frequency. Bayesian inference treats parameters as random variables, combining a prior distribution (encoding prior belief) with the likelihood of observed data to produce a posterior distribution. The posterior summarizes updated beliefs after seeing evidence. Both frameworks are mathematically valid; the choice involves philosophical commitments and practical tradeoffs.
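For a proportion, the Bayesian update has a closed form when the prior is a Beta distribution; the sketch below (hypothetical counts) shows the conjugate update:

```python
# Conjugate update for a proportion: Beta prior + binomial likelihood.
a_prior, b_prior = 1, 1            # Beta(1, 1): uniform prior on [0, 1]
successes, trials = 7, 10          # hypothetical data

# The posterior is again a Beta: Beta(a + successes, b + failures).
a_post = a_prior + successes
b_post = b_prior + (trials - successes)

posterior_mean = a_post / (a_post + b_post)
print(round(posterior_mean, 3))    # 0.667, pulled slightly toward the prior mean 0.5
```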

Univariate vs. multivariate: Univariate analysis examines one variable at a time. Multivariate analysis — multiple regression, principal component analysis (PCA), cluster analysis — handles relationships among two or more variables simultaneously. Modern datasets with hundreds of variables require dimensionality-reduction techniques that are part of the mathematics and artificial intelligence toolkit.


Tradeoffs and tensions

The p-value threshold of 0.05 has generated sustained controversy. The ASA issued a formal statement in 2016 warning against treating p < 0.05 as a binary pass/fail criterion (Wasserstein & Lazar, The American Statistician, 2016). The statement identifies six principles, including the explicit caution that "p-values do not measure the probability that the studied hypothesis is true." In 2019, the ASA followed with a special issue of The American Statistician titled "Statistical Inference in the 21st Century: A World Beyond p < 0.05," which drew contributions from more than 40 researchers.

Frequentist confidence intervals and Bayesian credible intervals sound similar but mean different things. A 95% frequentist confidence interval means that 95% of such intervals constructed from repeated samples would contain the true parameter. A 95% Bayesian credible interval means there is a 95% posterior probability that the parameter lies in that range given the data and prior. These are not the same statement, and treating them as equivalent is a routine misreading.

The bias-variance tradeoff governs model building across statistics and machine learning. A model with too few parameters (high bias) systematically misses structure in the data. A model with too many parameters (high variance) fits training data well but generalizes poorly to new data. Cross-validation, regularization (ridge, lasso), and information criteria like AIC and BIC exist to navigate this tradeoff — all topics covered in depth in mathematical modeling contexts.
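Cross-validation's core mechanism is just a disciplined partition of indices; a minimal k-fold splitter (a sketch, not any particular library's API) looks like:

```python
import random

def kfold_splits(n, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation
    over n observations, after one random shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Every observation lands in exactly one validation fold:
splits = list(kfold_splits(100, 5))
print(sorted(v for _, val in splits for v in val) == list(range(100)))  # True
```

Averaging a model's error across the k held-out folds estimates how it would generalize to new data, which is the quantity the bias-variance tradeoff is about.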


Common misconceptions

Misconception: A larger sample always makes a study reliable. Sample size controls random error, not systematic bias. A study with 50,000 respondents drawn from a biased sampling frame will produce precise estimates of the wrong quantity. The 1936 Literary Digest poll predicted Alf Landon would defeat Franklin Roosevelt — drawing on 2.4 million responses — because its sample was systematically skewed toward wealthier, Republican-leaning households.

Misconception: Correlation coefficient r = 0 means no relationship. A correlation of 0 means no linear relationship. Two variables can be tightly and predictably related in a nonlinear way (e.g., quadratic) while producing r = 0. Always visualize data before computing summaries.

Misconception: Failing to reject H₀ proves H₀ is true. Absence of evidence is not evidence of absence. A non-significant result may reflect a true null effect, an underpowered study, high measurement noise, or all three. Statistical non-significance cannot confirm that an effect does not exist.

Misconception: The p-value is the probability the null hypothesis is true. This is the single most persistent misreading in applied research. The p-value is computed assuming H₀ is true and tells nothing directly about the probability of that assumption — a point the ASA 2016 statement addresses explicitly.


Checklist or steps (non-advisory)

The following sequence reflects the structural phases of a standard statistical analysis, as described in frameworks including the NIST/SEMATECH e-Handbook of Statistical Methods:

  1. Define the research question — state what population parameter or relationship is of interest.
  2. Identify the data type — continuous, discrete, ordinal, nominal; this determines the method space.
  3. Specify the study design — experimental, quasi-experimental, observational, survey.
  4. Determine required sample size — based on effect size, desired power (typically 0.80), and significance level α.
  5. Collect and clean data — document missingness, outliers, and data-entry inconsistencies.
  6. Compute descriptive statistics and visualize — histograms, boxplots, scatterplots, frequency tables.
  7. Check distributional assumptions — normality tests (Shapiro-Wilk), homogeneity of variance (Levene's test).
  8. Select and apply the appropriate test or model — matched to data type and study design.
  9. Compute effect sizes — Cohen's d, η², r, or odds ratios alongside p-values.
  10. Interpret within context — statistical significance, practical significance, and confidence interval width together.
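Step 6 in the sequence above maps directly onto standard-library helpers; a minimal descriptive summary for one quantitative variable (with illustrative data) might look like:

```python
from statistics import fmean, median, quantiles, stdev

def describe(sample):
    """Descriptive summary for one quantitative variable (step 6)."""
    q1, _, q3 = quantiles(sample, n=4)  # quartiles (default exclusive method)
    return {
        "n": len(sample),
        "mean": fmean(sample),
        "median": median(sample),
        "sd": stdev(sample),
        "iqr": q3 - q1,
    }

print(describe([12, 15, 11, 14, 13, 16, 12, 18, 14]))
```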

Reference table or matrix

The mathematics resource index includes pathways to related quantitative topics. The table below maps common inferential situations to their standard methods.

Situation | Data Type | Standard Method | Key Assumption
Compare 1 mean to known value | Continuous | One-sample t-test | Approximately normal sampling distribution
Compare 2 independent means | Continuous | Independent-samples t-test | Equal or unequal variance (Welch variant)
Compare 2 paired means | Continuous | Paired t-test | Differences approximately normal
Compare ≥3 group means | Continuous | One-way ANOVA | Normality, homogeneity of variance
Test association between 2 categorical variables | Categorical | Chi-square test of independence | Expected cell counts ≥5
Model relationship between variables | Continuous/Mixed | Linear or logistic regression | Linearity, independence, homoscedasticity
Compare medians (non-normal data) | Ordinal/Non-normal | Mann-Whitney U or Kruskal-Wallis | Ordinal measurement, independent samples
Estimate population proportion | Categorical | Z-test for proportion / exact binomial | Sufficient sample size for normal approximation
Bayesian parameter estimation | Any | Posterior via Bayes' theorem | Prior specification required

Effect size conventions drawn from Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Erlbaum, 1988: small d = 0.2, medium d = 0.5, large d = 0.8.
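Under those conventions, Cohen's d for two independent groups uses the pooled standard deviation as its denominator; a minimal sketch with hypothetical data:

```python
from math import sqrt
from statistics import fmean, stdev

def cohens_d(x, y):
    """Cohen's d for two independent groups, pooled-SD denominator."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (fmean(x) - fmean(y)) / sqrt(pooled_var)

# Hypothetical groups; the difference here works out to a large effect.
a = [5.1, 5.9, 6.4, 5.5, 6.1, 5.8]
b = [5.0, 5.3, 5.7, 4.9, 5.6, 5.2]
print(round(cohens_d(a, b), 2))  # about 1.3, "large" by Cohen's benchmarks
```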

