Sample Size Calculation: Power, Errors & Clinical Trials

Executive Summary
Sample size determination is a foundational element of clinical trial design. It integrates statistical, clinical, ethical, and practical considerations to ensure that studies are capable of answering key research questions without subjecting patients to unnecessary risk or expense. This report examines why and how statisticians compute sample sizes in clinical trials, tracing historical developments, statistical principles, design-specific formulas, ethical and regulatory frameworks, and emerging practices. We detail the roles of Type I error (α), Type II error (β) and power, effect size, outcome variability, allocation ratios, and other factors in sample size formulas. We compare approaches for different endpoints (e.g. means, proportions, time-to-event) and trial designs (superiority, non-inferiority, equivalence, cluster, cross-over, adaptive, Bayesian). Several case studies illustrate real-world applications, including the large COVID-19 vaccine trials and non-inferiority device trials, with concrete sample size figures and assumptions. We also address common misconceptions (such as naive dropout adjustments and optional stopping), the ethical imperative to avoid underpowered or overpowered trials, and the regulatory requirement to justify sample size (e.g. ICH E9). The report concludes with a discussion of future directions — from adaptive re-estimation and Bayesian designs to pragmatic and precision-medicine trials — emphasizing that sample size planning remains an evolving science that balances statistical rigor with patient welfare and resource constraints.
Introduction and Background
Determining how many patients are needed in a clinical trial is a critical step that directly impacts the validity, feasibility, cost, and ethics of research. A correctly powered study can reliably detect clinically meaningful treatment effects; an underpowered study may miss true effects, rendering the trial inconclusive and unethical; an overpowered study may expose excessive numbers of patients to risk without necessary benefit. Thus, sample size must be pre-specified in trial protocols and justified by statistical and practical considerations ([1]) ([2]). International guidelines (e.g. ICH E9) mandate explicit sample size calculations and documentation for every confirmatory trial ([2]). Regulatory agencies (FDA, EMA and others) scrutinize these justifications during protocol review, ensuring trials are feasibly powered and scientifically credible ([3]) ([2]).
Historically, the formal framework for sample size arose from the development of hypothesis testing. In the early 20th century, Neyman and Pearson introduced Type I and Type II errors and the concept of power (1−β) to the statistical literature. They showed that, for a given hypothesis test (e.g. two-sample t-test or z-test), one can compute the probability of incorrectly rejecting a true null (false positive, α) or failing to reject a false null (false negative, β). These error rates and the true effect size jointly determine how large the sample must be to achieve a desired power ([1]) ([4]). Ronald Fisher emphasized efficient use of data but was less prescriptive about sample size rules; it was Neyman–Pearson theory that formalized the power analysis approach still used today.
In modern clinical research, sample size calculation most often follows a frequentist, power-based paradigm: one specifies a null hypothesis (e.g. no difference between treatments) and an alternative (e.g. minimum clinically important difference), sets acceptable Type I error (α, commonly 0.05) and desired power (typically 80% or 90%), and computes the required N. The calculations also depend on parameters like outcome variance or baseline event rate, often estimated from pilot data or prior literature. The ICH E9 guideline, Statistical Principles for Clinical Trials (1998), explicitly requires trial protocols to define all such hypotheses and calculations ([2]). If assumptions change (e.g. variance larger than expected), adaptive methods like sample size re-estimation can adjust N mid-trial, though such adaptations must be planned carefully to avoid bias.
Precision and ethics go hand-in-hand in sample size planning. As Röhrig et al. emphasize, studies must avoid being “either too small or too large,” as both are ethically and economically unacceptable ([1]). Too-small trials risk failing to detect true effects, wasting patients’ contributions and potentially leading to useless results. Excessively large trials expose more patients than necessary to experimental risks and inflate costs. Statisticians, clinicians, and patient representatives collaborate to set a clinically meaningful effect size (the minimum difference worth detecting) so that the resulting sample is scientifically valid yet ethically justified ([5]). For example, interviewing clinicians or patient advocates can help elicit what magnitude of benefit they consider important, guiding the effect size choice ([5]).
The following sections delve into the key components of sample size calculation. We first review core statistical principles – hypotheses, errors, and power – and outline how these translate into formulas for different trial types. We then examine practical factors (variance, allocation ratio, dropout, clustering, etc.) and design variations (adaptive, non-inferiority, Bayesian). Case studies illustrate these concepts in real trials. Finally, we discuss regulatory and ethical frameworks, computational tools, and future trends.
Statistical Principles Underpinning Sample Size
The fundamental goal of sample size calculation is to ensure the trial can detect a true treatment effect of a specified magnitude with high probability (power), while controlling false positives. This involves four primary parameters: the chosen Type I error rate (α), the desired power (1−β), the expected effect size (Δ), and the variability of the outcome. Alpha (α) is the probability of a false positive (mistaking chance variation for a real effect) and is typically set at 5% (two-sided). Beta (β) is the probability of a false negative (missing a true effect); 1−β is the power of the trial – the chance of correctly detecting the effect. Conventionally, trials aim for 80% or 90% power, corresponding to β=0.20 or 0.10 ([6]). As Das et al. note, “most clinical trials use a power of 80%”, implying one would accept a false negative 1 in 5 times ([6]). Power can be raised (β lowered) at the cost of larger sample; lower α (stricter significance) similarly increases required N ([4]) ([6]).
The effect size (Δ) is the minimum difference between groups that the trial is designed to detect. It should be the smallest clinically important difference. For example, if a new antihypertensive is expected to lower systolic blood pressure by 20 mmHg more than standard therapy, Δ would be 20 mmHg. Effect size may be absolute (delta in means or proportions) or standardized (delta divided by a pooled standard deviation, known as Cohen’s d). We illustrate: a trial comparing an antihypertensive yielding a 20 mmHg BP drop versus 10 mmHg drop has Δ=10 mmHg ([5]). A multicenter diabetes trial may specify Δ=1% HbA1c. Effect sizes in binary outcomes are differences in event rates or risks. Investigators can estimate Δ from pilot studies, previous trials, or clinical judgment ([5]).
Outcome variability also critically affects N. For continuous outcomes, the outcome’s standard deviation (σ) enters the formula: more variability → larger N. Social science and medical research often assume normal-like distributions, so standard formulas (e.g. t-test) use σ. In binary outcomes, the baseline event rate (p in the control group) and the anticipated change under treatment determine variability (p(1−p)). Rarer events (low p) require larger samples to accrue sufficient events for comparison. As an example from COVID-19 vaccine trials, Pfizer assumed a ~0.65% infection rate in the placebo arm; to detect a 30-percentage-point vaccine efficacy difference, tens of thousands of subjects were needed ([7]).
A concise way to present these principles is a summary of power analysis parameters:
- Hypotheses: Define H₀ (e.g. no difference) and H₁ (difference = Δ).
- α (Type I error): Significance level (commonly 0.05), affects critical value (Z or t).
- β (Type II error) & Power (1−β): Desired probability of detecting effect, commonly 0.80–0.90 ([6]).
- Effect Size (Δ): Clinically meaningful difference to detect; smaller Δ requires larger N.
- Variance/Spread: Standard deviation (continuous) or event rate (binary); larger spread (σ, p(1−p)) requires larger N.
- Allocation and Design Features: e.g. equal or unequal randomization, cluster design, chi-square vs t-test.
These inputs feed into analytic formulas or simulation to solve for N. In most cases where closed-form solutions exist, one equates the chosen critical value and power requirement to the sampling distribution of the test statistic under H₁. For example, for two independent groups using a z- or t-test, we require
$$ Z_{1-\alpha/2}\,\sigma_{\rm pooled} + Z_{1-\beta}\,\sigma_{\rm pooled} \;\le\; \Delta, $$
solving for the sample size in each arm. (Here $\sigma_{\rm pooled}$ scales with $\sigma/\sqrt{n}$.) If no simple formula exists (e.g. for complex models), simulation is used ([8]). We discuss specific formulas in the next section.
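Where simulation is needed, the logic is straightforward: generate many hypothetical trials under $H_1$, test each one, and count rejections. The sketch below (base R; Δ = 5, σ = 10, and 5,000 replicates are illustrative assumptions, not values from any trial discussed here) estimates power for a grid of candidate per-arm sizes; in practice one increases $n$ until the estimate reaches the target.

```r
## Monte Carlo power estimate for a two-arm comparison of means
## (all inputs are illustrative assumptions)
set.seed(123)

sim_power <- function(n_per_arm, delta = 5, sd = 10, alpha = 0.05, nsim = 5000) {
  rejections <- replicate(nsim, {
    control   <- rnorm(n_per_arm, mean = 0,     sd = sd)
    treatment <- rnorm(n_per_arm, mean = delta, sd = sd)
    t.test(treatment, control)$p.value < alpha   # two-sided Welch t-test
  })
  mean(rejections)  # proportion of simulated trials rejecting H0 = estimated power
}

for (n in seq(60, 100, by = 10)) {
  cat("n per arm =", n, "-> estimated power =", round(sim_power(n), 2), "\n")
}
```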
Key Statistical Defaults: Unless otherwise specified, trials usually assume a two-sided α=0.05, power=80–90% (β=0.2–0.1) ([6]). Deviating from these defaults (e.g. α=0.01 in genomics) materially affects N. Non-inferiority or equivalence trials often use a one-sided α and require smaller margins (thus larger N) than simple superiority trials (discussed below). The tables and case studies below illustrate these quantitative relationships.
Sample Size Calculations for Common Designs
This section examines how to calculate sample size for various trial objectives and endpoints, under the frequentist paradigm. We cover comparisons of means, comparisons of proportions, and time-to-event outcomes, as well as specialized designs (non-inferiority/equivalence). In each case, the basic approach is similar: translate the research question into a null and alternative hypothesis, determine the test statistic and distribution, specify α, β, effect size and variance assumptions, and solve for N. We give formulas where straightforward but emphasize conceptual understanding and dependencies. Generic formulas are well-covered in statistical texts ([8]), so here we highlight key points and references.
Comparing Two Means or Continuous Endpoints
For a two-arm trial comparing a continuous outcome (e.g. blood pressure, biomarker), a common test is an independent-samples t-test (or z-test if variance known). Suppose the outcome in group 1 has (unknown) mean $\mu_1$ and standard deviation $\sigma$, and group 2 has mean $\mu_2$ (same $\sigma$). The null hypothesis is $H_0: \mu_1 = \mu_2$ (difference 0), and $H_1: \mu_1 - \mu_2 = \Delta$ (the anticipated difference to detect). For equal group sizes $n$, a standard approximation yields the required $n$ per group:
$$ n \;\approx\; \frac{2\sigma^2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\mu_1 - \mu_2)^2}. $$
Here $Z_{1-\alpha/2}$ and $Z_{1-\beta}$ are the normal critical values for the two-sided α and for the power $1-\beta$. (If using a one-sided test, replace $Z_{1-\alpha/2}$ by $Z_{1-\alpha}$.) If $\sigma$ must be estimated, a slightly larger $n$ is needed to account for the use of the t-distribution. The key dependence is that $n$ grows with $\sigma^2$ and shrinks with the square of the true mean difference. In other words, halving the detectable difference (making it more subtle) requires four times as many patients (all else equal). The same phenomenon holds in equivalent formulations: for example, Cohen’s d (standardized effect) is $(\mu_1 - \mu_2)/\sigma$. With fixed power and α, required $n \propto 1/d^2$.
In practice, one often uses software to compute $n$, but the above formula guides understanding. As Das et al. illustrate with an example of antihypertensive drugs reducing blood pressure, they define “effect size (ES)” as the mean difference one wishes to detect ([5]). If Drug A lowers by 10 mmHg and Drug B by 20 mmHg on average, $\Delta=10$ mmHg and $n$ can be found from the variance of BP. In their example, assuming $\sigma=15$ mmHg, α=0.05 and 90% power yields $n\approx 12$ per group. If practical constraints or higher variability occur, this $n$ scales accordingly. For instance, if $\sigma$ were 20 instead of 15, that same 10 mmHg difference would require roughly $(20/15)^2\approx1.78$ times more subjects per arm.
Key Points: To plan a two-sample mean comparison, the investigator needs (a) an estimate of the population standard deviation (often from previous studies or a pilot), and (b) a clinically relevant mean difference $\Delta$. The choice of $\Delta$ should reflect the minimal effect worth detecting, not the maximum expected benefit ([5]). The more conservative (smaller) the $\Delta$, the larger $n$ will be. Statistical texts (e.g. Chow et al. or Rosner) provide the detailed derivations and exact formulas ([8]).
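As a concrete sketch (σ = 10 and Δ = 5 are hypothetical planning values, i.e. a standardized effect of d = 0.5), the normal-approximation formula above can be coded in a few lines and cross-checked against R’s built-in power.t.test:

```r
## Per-arm n for a two-sample comparison of means (normal approximation)
n_two_means <- function(delta, sd, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)   # two-sided critical value
  z_beta  <- qnorm(power)
  ceiling(2 * sd^2 * (z_alpha + z_beta)^2 / delta^2)
}

n_two_means(delta = 5,   sd = 10)   # 63 per arm (d = 0.5, 80% power)
n_two_means(delta = 2.5, sd = 10)   # 252 per arm: halving delta roughly quadruples n

## t-based version; slightly larger because sigma must be estimated
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)   # n ~ 64 per group
```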
Comparing Two Proportions (Binary Outcomes)
Many trials have a binary endpoint – e.g. success/failure, event/no-event. Commonly one compares two independent proportions $p_1$ and $p_2$. The null $H_0$ is typically $p_1 = p_2$ (no difference), and the alternative is $p_1 - p_2 = \Delta_p$, the absolute risk difference one aims to detect. For example, $p_1$=proportion of patients improving with Drug A, $p_2$ with Drug B.
A standard approximate formula for the required number in each arm is:
$$ n \;\approx\; \frac{\bigl(Z_{1-\alpha/2}\sqrt{2\bar p(1-\bar p)} + Z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)}\bigr)^2}{(p_1 - p_2)^2}, $$
where $\bar p = (p_1+p_2)/2$ is the average proportion under the null. This formula assumes samples large enough for the normal approximation to hold. If $p_1=p_2=p$ under $H_0$, it simplifies with a pooled variance $\bar p(1-\bar p)$. It also shows why rare events (small $p$) inflate the required $n$ for a given relative treatment effect: fewer events accrue per patient, so more patients must be enrolled to observe enough events for comparison. For example, detecting a 50% relative reduction from a 2% control rate requires more than ten times as many patients per arm as detecting the same relative reduction from a 20% control rate.
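A minimal sketch of this formula in base R, cross-checked against power.prop.test (the 20% and 2% control rates and the 50% relative reduction are illustrative assumptions):

```r
## Per-arm n for comparing two independent proportions (normal approximation)
n_two_props <- function(p1, p2, alpha = 0.05, power = 0.80) {
  pbar    <- (p1 + p2) / 2
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)
  num <- (z_alpha * sqrt(2 * pbar * (1 - pbar)) +
          z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2
  ceiling(num / (p1 - p2)^2)
}

n_two_props(0.20, 0.10)   # ~200 per arm: 50% relative reduction from a 20% control rate
n_two_props(0.02, 0.01)   # ~2,300 per arm: same relative reduction from a 2% control rate
power.prop.test(p1 = 0.20, p2 = 0.10, sig.level = 0.05, power = 0.80)  # built-in cross-check
```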
This effect is clear in the real-world trials of COVID-19 vaccines. In Senn’s analysis of five phase III trials ([7]), all used a binary endpoint (COVID infection). Although vaccine efficacy differences (60% vs 30% assumed) were large, the low event rate necessitated huge N. For Pfizer/BioNTech’s trial, the assumed placebo event rate was only 0.65% over 6 months, so even obtaining ~164 cases (for power calculations) required enrolling ~21,999 subjects per arm ([7]). In contrast, a disease with a 20% incidence would require far fewer patients for the same absolute effect. The principle is that under binary outcomes, one must specify both $p_2$ (control rate) and the anticipated $p_1$, then plug into the formula or software.
As with means, one must consider one- vs two-sided tests. If the trial is explicitly one-sided (e.g. superiority with a placebo, direction assumed), $Z_{1-\alpha}$ replaces $Z_{1-\alpha/2}$. Since $Z_{0.975}\approx1.96$ and $Z_{0.95}\approx1.645$, a one-sided design saves some sample (but regulators often want two-sided). For equivalence or non-inferiority studies with binary endpoints, formulas are similar but use non-inferiority margins (see below).
Key Points: Planning for binary outcomes requires an estimate of baseline event risks. If events are rare, trials must be larger. Many published trials report sample sizes along with assumed event rates. For instance, Moderna’s COVID-19 trial assumed 0.7% event rate and planned 15,000 per arm to achieve ~90% power at 1-in-40 one-sided α ([7]) (their H1 efficacy was 60%). Investigators also sometimes convert odds or relative risks to absolute differences for planning. In all cases, sample size calculators for proportions (free online tools or software like PASS, nQuery) are widely used.
Time-to-Event and Survival Outcomes
When the primary endpoint is time-to-event (with possibly censored data), as in many oncology or cardiovascular trials, sample size planning revolves around the number of events rather than just the number of patients. A common approach (Schoenfeld’s formula) relates the required number of events $E$ to the logarithm of the hazard ratio (HR) under $H_1$. For example, to detect a hazard ratio $\theta$ vs the null ($\theta=1$) at two-sided α and power $1-\beta$, with 1:1 allocation, the approximate total number of events needed is
$$ E \;\approx\; \frac{4\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\ln \theta)^2}. $$
The total sample size then depends on the trial’s duration and expected event accrual. For constant hazards and uniform entry, one can back-calculate how many patients and follow-up time are needed to yield $E$ events. In practice, designers use software or simulation for complex accrual/follow-up patterns.
Key factors for time-to-event designs include: expected control survival curve (to estimate control event rate over study period), the target HR (or corresponding $\Delta$ in median survival), and drop-out or competing risks. The group sequential (interim analysis) plans also influence design: interim looks on event counts require adjustment of $E$ while controlling overall α (via spending functions or O’Brien-Fleming boundaries). However, the central idea remains that more stringent α/power or smaller effect (HR closer to 1) increases required events and thus sample.
A fully detailed derivation is beyond this summary, but references such as Chow et al. and Julious’s Sample Sizes for Clinical Trials cover it. In essence, survival trials need large samples if the anticipated treatment effect on hazards is modest. For example, a 6-month median survival improvement versus 5-month (HR≈0.85) would need many more patients than, say, a 12 vs 6-month improvement (HR≈0.5).
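A small sketch of the events calculation above (1:1 allocation, two-sided α = 0.05, 90% power; the hazard ratios echo the medians just mentioned, and the 60% overall event probability used to convert events into patients is an assumption for illustration only):

```r
## Schoenfeld approximation: total events for a log-rank comparison, 1:1 allocation
events_needed <- function(hr, alpha = 0.05, power = 0.90) {
  ceiling(4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / log(hr)^2)
}

events_needed(hr = 0.85)   # modest effect (~6- vs 5-month median): ~1,600 events
events_needed(hr = 0.50)   # large effect (12- vs 6-month median): ~90 events

## Crude patient count if ~60% of enrolled patients are expected to have an event
ceiling(events_needed(hr = 0.85) / 0.60)
```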
Superiority vs Non-Inferiority/Equivalence
Most trials test for superiority: whether a new treatment outperforms control. The above formulas apply directly to superiority. By contrast, non-inferiority and equivalence trials flip hypotheses and often need larger samples. In a non-inferiority trial, the goal is to show the new treatment is not worse than control by more than a prespecified margin $\Delta_{NI}$. Typically:
- H₀ (non-inferiority null): new is worse by ≥$\Delta_{NI}$.
- H₁: new is not worse (true difference $< \Delta_{NI}$).
The margin $\Delta_{NI}$ is often chosen as a fraction of a known active control effect (e.g. retaining 50% of the effect). A smaller (stricter) margin demands a larger sample for the same power.
Practically, one often uses a one-sided α for non-inferiority (since only the possibility that the new treatment is worse than control by more than the margin matters). The essential calculation still mirrors a difference test, with the margin folded into the effect term: the relevant quantity in the denominator becomes the distance between the assumed true difference and $\Delta_{NI}$. The core consequence is that $n$ grows as $\Delta_{NI}$ shrinks. For example, Sayers et al. discuss a device benchmark where a 1% absolute non-inferiority margin was compared against a 5% failure-rate benchmark; achieving even modest power (>60%) required on the order of 3,200+ procedures ([9]), far larger than typical pilot studies. They note that current benchmarking standards often have “limited power to detect non-inferiority, and substantially larger sample sizes, in excess of 3200… are required to achieve a power >60%” ([9]). Thus, non-inferiority designs usually need larger samples than superiority trials for the same level of confidence. Equivalence trials (two one-sided tests) are similar: they test whether the difference lies within ±Δ, again usually needing large $n$ if Δ is small.
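To make the margin’s leverage concrete, here is a stylized one-sample binomial approximation in the spirit of the benchmark example (one-sided α = 0.025 and 80% power are assumptions; the 5% benchmark and 1% margin come from the case study discussed later, but this is not the authors’ survival-based calculation and will not reproduce their exact figures):

```r
## Stylized one-sample non-inferiority calculation for a failure proportion
## H0: true failure rate >= benchmark + margin; power computed assuming the true
## rate equals the benchmark (all inputs are assumptions for illustration)
ni_one_sample_n <- function(p_bench = 0.05, margin = 0.01,
                            alpha = 0.025, power = 0.80) {
  z_alpha <- qnorm(1 - alpha)   # one-sided test
  z_beta  <- qnorm(power)
  ceiling((z_alpha + z_beta)^2 * p_bench * (1 - p_bench) / margin^2)
}

ni_one_sample_n()              # ~3,700 procedures for a 1% margin
ni_one_sample_n(margin = 0.02) # ~930: doubling the margin cuts n roughly four-fold
```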
Adjusting for Analysis Plans
Beyond the basic formulas, additional design features alter sample size:
- One-sided vs two-sided tests: A two-sided test (the standard) splits α across both tails (α/2 in each), so its critical value $Z_{1-\alpha/2}$ exceeds the one-sided $Z_{1-\alpha}$ and the required $N$ is somewhat larger than for a one-sided test at the same α (a one-sided test at α/2 gives the same N). Some trials (especially non-inferiority) use one-sided planning.
- Unequal randomization: Trials sometimes randomize 2:1 or 3:1 (treatment:control) for reasons like exposing more patients to a new therapy, gathering more safety data, or recruitment and cost considerations. Unequal allocation increases total N to attain the same power, because the comparison is less efficient when arms are unbalanced: for a fixed total N, power is maximized at a 1:1 ratio. Conversely, to maintain power with a 2:1 design, total N must be higher. In Senn’s vaccine example ([7]), the AZ/Oxford trial used 2:1 (20k vaccinated vs 10k placebo), which gave more safety data on the vaccine but required a larger total (30,000) than an equivalently powered 1:1 design would have needed (roughly 26,000–27,000; see Table 2).
- Multiple endpoints or multiplicity: If a trial has co-primary endpoints or multiple hypotheses, α must be split (e.g. a Bonferroni-type correction), inflating the Z-value and thus N. Each outcome’s sample size is effectively calculated at a more stringent α. These complications are handled on a case-by-case basis.
- Covariate adjustment: Incorporating a strong prognostic covariate (e.g. baseline risk) into the analysis (via ANCOVA or stratification) can improve efficiency and reduce the required N, though this is more about analysis than design. Conversely, ignoring large covariate variability can underestimate the needed N. Sample size formulas typically assume no covariates (or the worst case); advanced planning might include an adjustment factor if the covariate correlation (R²) is known ([8]).
- Interim analyses and group-sequential design: Trials with planned interim looks for efficacy or futility (e.g. DSMB checks) use spending functions (like O’Brien–Fleming) which allow early stopping. These designs often set a maximum (final) N slightly above the fixed-sample N to account for alpha-spending at interim. In practice, group-sequential designs can allow stopping before the full N if treatment is convincingly effective, but they require meticulous statistical planning. Overall, a group-sequential trial may pre-specify a slightly larger $N_{\max}$, but typically uses the same or a slightly smaller expected number of patients. (We do not detail these complex formulas here.)
- Sample size re-estimation (SSR): In adaptive designs, one may perform an interim check of nuisance parameters (e.g. the true variance or event rate) without unblinding the treatment effect, and then adjust N. This is increasingly used in confirmatory trials. We discuss this in the next section.
In all these cases, the biostatistician carefully derives or simulates the required sample considering the exact analysis plan. Software (e.g. nQuery, PASS, SAS PROC POWER, R packages) is commonly used to implement complex calculations.
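As a quick illustration of how the significance-level convention feeds into these calculations, the sketch below (hypothetical standardized effect d = 0.5 at 80% power) compares the per-arm n for a two-sided test, a one-sided test, and a Bonferroni split across two co-primary endpoints:

```r
## Effect of the alpha convention on per-arm n (two-sample means, d = 0.5, 80% power)
n_per_arm <- function(alpha, power = 0.80, d = 0.5, two_sided = TRUE) {
  z_alpha <- if (two_sided) qnorm(1 - alpha / 2) else qnorm(1 - alpha)
  ceiling(2 * (z_alpha + qnorm(power))^2 / d^2)
}

n_per_arm(0.05)                     # two-sided 0.05:           63 per arm
n_per_arm(0.05, two_sided = FALSE)  # one-sided 0.05:           50 per arm
n_per_arm(0.05 / 2)                 # Bonferroni, 2 endpoints:  77 per arm
```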
Sample Size Re-estimation and Adaptive Methods
Traditional sample size planning fixes N before the trial based on assumed parameters. However, in practice assumptions (e.g. variance, event rate, control response) may turn out inaccurate. Adaptive designs allow some flexibility: one approach is sample size re-estimation (SSR) or re-calculation using interim data.
SSR methods come in two main flavors: blinded SSR (BSSR), which uses pooled data (e.g. overall variance estimate) without breaking treatment codes, and unblinded SSR (UBSSR), which briefly looks at observed treatment difference. These methods can correct underestimates of variability or event rates mid-trial. For example, if an interim analysis shows the outcome variance is 20% higher than planned, the trial can increase N to maintain target power. If done properly, alpha levels can still be controlled.
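A minimal blinded-SSR sketch (the planning SD, Δ, cap, and interim SD below are hypothetical; a real blinded estimate would be computed from pooled interim data without treatment codes):

```r
## Blinded sample size re-estimation: recompute n from an interim pooled SD estimate
alpha <- 0.05; power <- 0.90
delta <- 10         # clinically important difference (assumed)
n_max <- 120        # pre-specified per-arm cap to prevent runaway growth

n_for_sd <- function(sd) {
  ceiling(2 * sd^2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2)
}

n_planned  <- n_for_sd(15)          # design-stage SD assumption -> 48 per arm
sd_interim <- 18                    # pooled (blinded) SD observed at interim
n_revised  <- min(n_for_sd(sd_interim), n_max)
c(planned = n_planned, revised = n_revised)   # 48 -> 69 per arm, within the cap
```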
Surveys and reviews show SSR is expanding in use. A recent systematic review of published trials found SSR was used not only in confirmatory Phase III trials, but also in Phase I/II trials where assumptions are uncertain ([10]). Survey respondents expect SSR will continue to be used in early-phase work where prior information is lacking ([10]). The same review noted modest sample increases: when SSR was applied, sample sizes at trial end were generally only slightly above the original targets ([11]). Across the SSR deployments examined, the final $N$ was typically only a little higher than planned, since some inflation is anticipated at the design stage to avoid major underpowering ([11]). Importantly, investigators also cap the maximum N when using SSR to avoid runaway sample growth.
In sum, SSR provides a data-driven safety net if the initial guesses were off. Regulatory guidance (for example, the FDA guidance on adaptive designs) permits SSR under strict control, highlighting that while SSR can enhance efficiency and ethics (by avoiding wasted effort), the process must be pre-specified and documented.
Bayesian Alternatives
A less common but important perspective is Bayesian sample size planning. Instead of fixing α and β, Bayesian designs may specify a desired posterior probability of efficacy or a credible interval width. In practice, “sample size calculation” in the Bayesian world is often done by simulation under prior distributions. One might say: recruit until the posterior probability of the treatment effect exceeding a threshold reaches X%. The well-known Pfizer/BioNTech COVID-19 vaccine trial nominally used a Bayesian framework to define success ([12]), though for simplicity its final reporting emphasized traditional confidence intervals. Conceptually, a Bayesian approach can allow incorporation of prior data (e.g. from earlier studies) to reduce the needed N. However, it typically still requires a justification of the trial’s assurance (the integrated probability of success under the prior), which is analogous to power, so similar sample determinations go into planning. ([12]) ([8])
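A minimal sketch of the assurance idea (power averaged over a prior on the true effect) under a hypothetical normal prior; this illustrates the concept only and is not any specific trial’s method:

```r
## Assurance: power averaged over a prior for the true difference in means
set.seed(42)
assurance <- function(n_per_arm, sd = 10, alpha = 0.05,
                      prior_mean = 5, prior_sd = 2, ndraws = 10000) {
  deltas <- rnorm(ndraws, prior_mean, prior_sd)  # plausible true effects under the prior
  se     <- sd * sqrt(2 / n_per_arm)             # SE of the difference in means
  zcrit  <- qnorm(1 - alpha / 2)
  mean(pnorm(abs(deltas) / se - zcrit))          # average (approximate) power
}

assurance(n_per_arm = 64)  # lower than the ~80% conventional power at delta = 5,
                           # because the prior admits smaller true effects
```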
Bayesian planning methods are beyond the scope of standard clinical protocols but are gaining traction, especially in adaptive and early-phase contexts. We note them for completeness; they follow the same principle that smaller margins, tighter credible requirements, or less informative priors lead to larger needed samples.
Practical Considerations and Adjustments
Beyond core statistical parameters, practical factors significantly affect required sample sizes:
- Dropout and noncompliance: Real trials rarely retain 100% of enrolled patients through the final analysis. Dropouts are common due to loss to follow-up, withdrawal, side effects, etc. To ensure the analyzable sample meets the target, one inflates $n$ at the design stage. The naive method (“add 10% for a 10% expected dropout”) is incorrect ([13]). Instead, if $N_{\text{final}}$ completers are needed after an attrition rate $d$ (e.g. 10%), then the initial $N_{\text{initial}} = N_{\text{final}}/(1-d)$. For example, needing 500 completers with 10% dropout means $500/0.9 = 556$ starters, not 550 ([13]). Simply adding a percentage leads to underpowered trials; In et al. demonstrate that calculating $500 + 50$ is a miscalculation, and one must solve $0.9N = 500$ instead ([13]). In short, always adjust $N$ by dividing by the expected retention rate, and be transparent about the assumed dropout in the protocol (see the sketch after this list). The Korean J. Anesthesiol. guidelines explicitly warn against this “common error” of naive dropout adjustment ([13]).
- Multiple endpoints and multiplicity: If a trial has several co-primary outcomes or key secondary endpoints requiring α control, the sample size must satisfy all of them. One calculates $N$ for each endpoint (possibly at a lower α due to Bonferroni or gatekeeping) and adopts the largest, so the study is powered for the most demanding test.
- Sequential and group-sequential designs: Planning interim analyses adds complexity. Standard fixed-sample formulas yield the nominal $N_{\max}$, but the effective $N$ may be smaller if the trial stops early. Planning such designs goes beyond basic formulas; typically one uses specialized software (East, rpact, or custom R code) that applies spending functions. The key takeaway is that group-sequential planning must specify the maximum $N$ and stopping boundaries a priori, and often results in a modest increase in $N_{\max}$ to "buy" the ability to stop early.
- Covariate adjustment: Incorporating baseline covariates (e.g. stratification factors) can improve precision. If a covariate explains a fraction $R^2$ of the outcome variance, an ANCOVA-adjusted sample size can be approximately $(1-R^2)$ times the size needed without covariates (subject to certain assumptions). This means strong covariate control can reduce the required N. In practice, however, most sample size programs assume simple analyses without covariates; trialists may manually scale down N if justified.
- Multi-arm and factorial designs: Trials with more than two arms (e.g. three drugs, or a 2×2 factorial) can have complex sample requirements. One approach is to treat comparisons pairwise, often requiring adjustment for multiple pairwise tests if they are co-primary. Moreover, adding arms splits the randomization, so each comparison effectively has fewer subjects per group, increasing total N. Specific formulas exist (or one may simulate) for these cases.
- Cluster randomization: In cluster-randomized trials, whole groups (e.g. schools, clinics) are randomized. Here an intraclass correlation coefficient (ICC, ρ) captures within-cluster similarity. The design effect (DE) inflates the required N: for equal cluster size $m$, $DE=1+(m-1)\rho$ ([14]). The effective sample is $N/DE$ relative to individual randomization. For example, if ICC=0.05 and clusters have m=20, then $DE=1+19\times0.05=1.95$, so roughly double the individually-randomized sample is needed (see the sketch after this list). Rutterford et al. present formulas for continuous and binary outcomes ([14]). In stepped-wedge designs or with variable cluster sizes, more advanced adjustments apply, but the principle remains that a non-zero ρ substantially increases sample size. (We return to clustering in advanced designs.)
- Randomization ratio: As noted, unequal randomization raises total N. For a comparison of means, the variance of the estimated difference is $\sigma^2/n_1 + \sigma^2/n_2$; with a 2:1 allocation ($n_1 = 2n_2$) this is larger, for the same total N, than under a 1:1 split, so both arms must grow to recover the lost power. In general, an allocation ratio $r = n_1{:}n_2$ inflates the required total by a factor $(1+r)^2/(4r)$ relative to 1:1 for fixed power (the minimum is at $r=1$; for 2:1 the factor is 1.125). In practice, planners use a formula or optimizer to solve for both $n_1$ and $n_2$ given $r$.
- Event-driven vs time-driven: Some trials (especially large outcomes trials) are designed around a fixed number of events (e.g. “target accrual of 500 events”), with enrollment and follow-up continuing until those events occur. This decouples “sample size” from calendar time: the biostatistician computes the needed events and then models accrual, and if events accrue more slowly than assumed, more patients or longer follow-up are needed to reach the target.
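A short sketch pulling together three of these adjustments – dropout inflation, the cluster design effect, and the allocation-ratio penalty – using the figures quoted in the bullets above (the 300-subject individually randomized baseline is a hypothetical input):

```r
## 1. Dropout: divide by the retention rate, never just add a percentage
n_completers <- 500
dropout      <- 0.10
ceiling(n_completers / (1 - dropout))      # 556 starters (not 550)

## 2. Cluster randomization: inflate by the design effect 1 + (m - 1) * rho
design_effect <- function(m, rho) 1 + (m - 1) * rho
n_individual  <- 300                       # n under individual randomization (assumed)
ceiling(n_individual * design_effect(m = 20, rho = 0.05))   # 585, i.e. DE = 1.95

## 3. Allocation ratio r:1 inflates total N by (1 + r)^2 / (4r) relative to 1:1
ratio_inflation <- function(r) (1 + r)^2 / (4 * r)
ratio_inflation(2)                         # 1.125 -> about 12.5% more total subjects
```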
Each of these practical factors must be built into the sample size computation. Modern trial protocols usually include a paragraph or two explaining these decisions: anticipated dropout rate and resulting inflation, primary endpoint definition, α/power chosen, and the final per-arm/sample total recommended. A well-justified plan, with all assumptions and formulas stated, is essential for both scientific integrity and regulatory review ([2]).
Data Tools and Software
Numerous software tools assist sample size calculation. Common approaches include:
- Statistical software: Packages like PROC POWER in SAS, or functions in R (e.g. power.prop.test, power.t.test, or specialized packages such as pwr, Hmisc, gsDesign for group-sequential designs, ICC.Sample.Size for clusters, etc.), can compute $n$ given the parameters. These require the user to input the assumed values (control rate, effect size, variance, α, power) and output the needed N or power.
- Specialized software: Dedicated programs (PASS, nQuery) offer graphical interfaces and handle complex designs (non-inferiority margins, cluster randomization, longitudinal outcomes) with multiple options. They often generate power curves and summary reports.
- Online calculators: Many free web calculators exist for simple cases (two means, two proportions, etc.) – for example, MD Anderson’s Biostatistics sample size calculators or university tools. Caution is needed, as not all cover advanced designs or are validated.
- Simulation: For complex designs (non-linear models, adaptive designs, composite endpoints), simulation is used. One writes code to generate many hypothetical trials under $H_1$ and estimates power by counting rejections ([8]). This Monte Carlo approach is the same idea used in Bayesian planning.
All tools should be supplemented by expert review. The biostatistician must ensure input assumptions are realistic and results make sense. For example, tools may not warn if the required N is astronomically large given tiny effect; the statistician should counsel that such a trial may be infeasible.
Case Studies and Real-World Examples
To ground the discussion, we present several illustrative examples of sample size determination in actual clinical research. These highlight how the principles and formulas play out in practice.
1. COVID-19 Vaccine Trials
The Phase III trials of the first COVID-19 vaccines (Pfizer/BioNTech, Moderna, AstraZeneca, etc.) were unprecedented in speed and scale. They exemplify sample size planning under time pressure and uncertain parameters. These trials had a common primary endpoint: virologically-confirmed COVID-19. Setting up the trial required many assumptions about attack rates, vaccine efficacy, α/power, and logistics. Senn (2022) summarized five such trials ([12]). Key figures from these designs are given in Table 1.
| Sponsor | H₀ VE (%) | H₁ VE (%) | Power (%) | Event Targets | Planned N (vaccine / placebo) | Remarks |
|---|---|---|---|---|---|---|
| Pfizer/BioNTech ([7]) | 30 | 60 | 90 | 164 Events | 21,999 / 21,999 | One-sided 2.5% (one-sided efficacy test); 2 doses; Bayesian-inspired design. |
| AstraZeneca/Oxford ([7]) | 30 | 60 | 90 | 150 Events | 20,000 / 10,000 | 2:1 allocation; lower event rate assumption. |
| Moderna ([7]) | 30 | 60 | 90 | 151 Events | 15,000 / 15,000 | Symmetric randomization. |
| Novavax ([7]) | 30 | 70 | 95 | 100 Events | 7,500 / 7,500 | Higher assumed effect (70%). |
| J&J Janssen ([7]) | 30 | 60 | 90 | 154 Events | 30,000 / 30,000 | Single-dose vaccine (truncated SPRT design). |
Table 1: Design parameters for five large COVID-19 vaccine trials (primary efficacy). The columns give the null hypothesis vaccine efficacy (H₀ VE, the minimal clinically acceptable efficacy), the alternative (H₁ VE, the assumed true efficacy), desired power, number of events targeted for decision, and planned sample sizes. Source: Senn (2022) ([7]).
These trials assumed a rather low baseline infection rate (e.g. ~0.6–0.8% in 6 months under placebo ([7])). To have enough COVID cases to evaluate efficacy, each trial needed tens of thousands of volunteers. Even though the assumed vaccine effects (H₁) were large (60–70% efficacy), the low event frequency meant enormous N. For Pfizer, 21,999 per arm were planned – a total of ~44,000 subjects ([7]). Moderna planned 30,000 total, AstraZeneca 30,000 (with a 2:1 randomization), and Novavax about 15,000.
This illustrates how small effect sizes or rare events drive up required sample. If the planned efficacy had been only 50% instead of 60%, or if the actual infection rate turned out lower (say 0.4%), each study would have needed still more subjects. Indeed, all these trials incorporated DSMBs and sequential monitoring, allowing early stopping if efficacy met pre-set boundaries. Pfizer’s design, notably, was Bayesian in spirit and allowed posterior analysis boundaries ([12]) (though for sample size it still targeted a conventional number of events).
The COVID vaccine trials underscore real-world constraints: to reach ~150 events with a ~0.7% incidence, Pfizer needed ~44k participants. The actual accrual for the primary analysis reflected these numbers. From a planning viewpoint, this is a case where sample size was effectively event-driven: the key calculation was “how many subjects are needed to accrue ~164 cases?”, which is equivalent to our binary-outcome formula linking $n$, the event rate, and $\Delta_p$. These calculations were detailed in the publications and FDA briefing documents. For example, Pfizer’s protocol targeted 164 cases to provide 90% power to demonstrate efficacy above the 30% null, assuming a true VE of 60% ([7]).
In summary, the vaccine trials’ large sizes were justified by standard sample size formulas using a small control event rate $p_0$ and a large $\Delta_p$. Table 1 provides a concise comparison of their assumptions and the resulting Ns.
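As a back-of-envelope check only (it ignores the sequential monitoring, Bayesian success criteria, dropout, and non-evaluable participants that the actual protocols handled), one can ask how many subjects per arm are needed to expect roughly 164 cases given the assumed placebo attack rate and a true VE of 60%; the answer lands in the same general range as the planned enrollments:

```r
## Rough subjects-per-arm needed to accrue a target number of COVID-19 cases
## (simplified illustration; not the sponsors' actual calculation)
subjects_per_arm <- function(target_events, p_placebo = 0.0065, ve = 0.60) {
  p_vaccine <- p_placebo * (1 - ve)            # attack rate in the vaccine arm under H1
  ceiling(target_events / (p_placebo + p_vaccine))
}

subjects_per_arm(164)   # on the order of 18,000 per arm before any inflation
```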
2. Non-Inferiority Device Benchmark Trial
In orthopedic device research, one may need to show a new implant is “not worse than” a benchmark rate of failure. Sayers et al. (2017) examined the case of benchmarking an implant’s failure rate against an external standard ([15]) ([9]). They considered an external benchmark 5-year failure rate of 5%. A non-inferiority margin (Δ) might be, for example, 1% absolute (i.e. allow up to 6% failure rate). They found that demonstrating non-inferiority with high confidence requires very large samples. For instance, to have even 60% power with a 5% baseline and 1% margin, one would need over 3,200 procedures ([9]). This calculation assumed a one-sample z-test framework.
Such results reveal how small margins blow up the needed N. The authors note that because modern implants already have low failure rates, any trial aiming to prove a new one is as good (with a tiny margin) is intrinsically costly and long. They argue that current benchmarking has “limited power to detect non-inferiority” unless trials enroll thousands ([9]). In other words, decreasing the margin Δ or requiring higher power pushes N sharply upward – a general lesson for any non-inferiority study.
3. Psychoactive Drug Trial (Hypothetical Example)
Consider a hypothetical Phase II trial testing whether Drug A reduces depression scores more than placebo. Suppose previous literature suggests a placebo-group mean change of −5 units (SD ≈ 8), and Drug A might achieve −8 units (Δ = 3). The investigators choose a two-sided α=0.05 and 80% power. Using the two-sample formula or software (e.g. power.t.test(delta = 3, sd = 8, power = 0.8) in R), one obtains $n \approx 113$ per arm. If we expect 10% dropout, we inflate to about 126 [$=113/(1-0.1)$] per group, so the final plan might say “126 patients per arm” (or a rounded 130). If a more stringent design required 90% power, the needed $n$ rises to roughly 151 per arm – an increase by the factor $(Z_{0.975}+Z_{0.90})^2/(Z_{0.975}+Z_{0.80})^2 \approx 1.34$.
This stylized example is typical in anxiolytic or antipsychotic trials. The principle is transparent: a larger SD or smaller Δ would directly enlarge $n$. It also highlights dropout inflation, an often-neglected step in some published protocols.
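The stylized calculation above, in R (all inputs are the hypothetical values stated in the example):

```r
## Hypothetical Phase II depression trial: delta = 3 points, SD = 8, two-sided alpha = 0.05
power.t.test(delta = 3, sd = 8, sig.level = 0.05, power = 0.80)   # n ~ 113 per arm
power.t.test(delta = 3, sd = 8, sig.level = 0.05, power = 0.90)   # n ~ 151 per arm

## Inflate for an anticipated 10% dropout by dividing by the retention rate
ceiling(113 / (1 - 0.10))   # ~126 to be randomized per arm
```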
4. Crossover Trial (Small N)
In some Phase II studies (e.g. early oncology or rare diseases), a crossover design is used because patients receive both treatments (in random order) with a washout in between. This design can greatly reduce required N because each patient serves as their own control (paired analysis). For a crossover comparing means, the sample size formula uses the within-subject standard deviation, which is typically smaller than the between-subject σ used in parallel trials. In fact, if $\rho$ is the within-subject correlation, the variance of the difference is $2\sigma^2(1-\rho)$, lowering $n$.
For example, a crossover trial in Parkinson’s disease might need only 10–20 patients, compared to hundreds in parallel format, due to this efficiency. However, crossovers have restrictions (no carryover, condition must be stable, short half-life interventions). We mention them briefly as special cases where typical sample size rules are modified by the design.
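A simplified paired-analysis sketch of this efficiency gain (ignoring period and carryover effects; Δ = 5, σ = 10, and a within-subject correlation ρ = 0.8 are assumptions for illustration):

```r
## Crossover vs parallel-group totals for the same delta, sd, alpha and power
z2 <- (qnorm(0.975) + qnorm(0.80))^2          # two-sided alpha = 0.05, 80% power

n_crossover <- function(delta, sd, rho) {
  sd_diff2 <- 2 * sd^2 * (1 - rho)            # variance of the within-subject difference
  ceiling(z2 * sd_diff2 / delta^2)            # total subjects (each receives both treatments)
}
n_parallel_total <- function(delta, sd) ceiling(2 * 2 * sd^2 * z2 / delta^2)

n_crossover(delta = 5, sd = 10, rho = 0.8)    # ~13 subjects in total
n_parallel_total(delta = 5, sd = 10)          # ~126 in total (63 per arm)
```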
Common Pitfalls and Misinterpretations
Beyond the structured calculations, several pitfalls can derail sample size planning:
- Assuming “bigger is always better.” Investigators may be tempted to round up aggressively or recruit extra patients “just to be safe.” But excessively oversizing a trial is wasteful and unethical ([1]). One guideline explicitly warns against “unnecessarily increasing the sample size” to chase statistical significance. As Senn notes, overpowered studies can detect trivial differences that lack clinical relevance. Statisticians advocate aligning sample size with clinically important effects, not with finding any minuscule $p<0.05$.
- Ignoring Dropouts: Many published trials underpower themselves by forgetting attrition. As seen above ([13]), simply adding a percentage often gives too low an N. Thorough protocols should present the anticipated dropout rate and the inflated $N$ clearly. Regulatory reviewers often query this explicitly.
- Post Hoc Adjustments (“peeking”): Continuously monitoring results and adjusting N on the fly (without pre-specification) invalidates the Type I error rate. While planned interim looks are fine if alpha spending is controlled, optional stopping without correction inflates false positives. The same applies to sample size: examining the data and increasing $n$ because results look promising can inadvertently raise the overall α. That practice is strongly discouraged.
- Data-Driven Effect Sizes: Sometimes investigators use results from a pilot or a small prior experience to estimate Δ or σ. If the pilot estimate was distorted by chance, it can mislead the sample size calculation. Best practice is to use conservative (i.e. larger) variance estimates or to incorporate uncertainty (via SSR or Bayesian priors). Blinded re-estimation of nuisance parameters (without unblinding the effect) is one hedge.
- Citing Old Conventions Uncritically: Defaults such as 10–20% dropout, α=0.05, and 80% power are common, but each trial’s context may require deviation. Some fields (e.g. genetic association studies) use far stricter α (genome-wide thresholds near 5×10⁻⁸), in which case the required N skyrockets. Conversely, some small exploratory trials may accept lower power. Context matters.
Overall, meticulous attention to assumptions and transparent justification helps avoid these errors. Sample size sections in papers should clearly state all inputs – effect size, variability, α, power, dropout assumptions – and give the final formula or a citation. Reviews have found that many publications inadequately report or miscalculate these ([16]). As Röhrig et al. (2010) conclude, planners (biostatisticians) and clinicians must collaborate closely on sample size to ensure realism and reliability ([1]).
Broader Perspectives and Constraints
Designing sample size involves stakeholders beyond statisticians:
- Ethics and Patient Welfare: Institutional review boards (IRBs) scrutinize whether study risks are justified. An underpowered trial exposes patients to risk with negligible chance of benefit or knowledge gain; this is seen as unethical, akin to running a study that cannot possibly answer its question. Conversely, recruiting more patients than necessary wastes potential enrollment from future trials. The maxim “as few as possible, as many as necessary” is often invoked. Ethical frameworks and the Declaration of Helsinki implicitly demand scientifically sound sample sizes. From the patients’ perspective, their contribution should be meaningful; patient advocacy groups increasingly emphasize the importance of properly powered trials for real impact.
- Regulatory Authorities: As mentioned, guidelines (e.g. FDA’s statistical guidance, EMA’s guidance on trial design) require clear sample size justification. New drug applications (NDAs) or device submissions usually contain a section vetted by regulators wherein deviations or compromises in sample size must be justified. For pivotal trials, regulators expect high confidence (e.g. 80–90% power) to support approval decisions.
- Sponsors and Funders: Pharmaceutical companies and grant agencies care about budget and feasibility. Large-$N$ trials cost more in recruitment, treatment, and follow-up. There is a trade-off between statistical aims and resource constraints. Often economic models (cost-effectiveness or return-on-investment) are considered alongside sample size to ensure trials are economical. Adaptive designs (like sample size re-estimation) can be appealing here, by not “over-shooting” if the effect is larger than expected.
- Clinicians and Medical Leadership: Investigators want assurance that their trial will answer the question. Small-sample Phase II trials may serve as “learning” studies, but they are labeled exploratory. Confirmatory Phase III trials must have a solid $N$ to be taken seriously by the medical community. The historical context (e.g. previous trial sizes in similar settings) often influences planners’ confidence.
- Academia and Publication: The reproducibility crisis has emphasized that underpowered studies often yield false positives or inflated effect sizes ([1]). Journals increasingly require sample size sections and sometimes supplementary power analyses. The scientific community now expects explicit sample size calculations, backed by data or citations.
In sum, sample size decision-making is an interdisciplinary negotiation: statisticians provide the technical analysis, clinicians define effect sizes and endpoints, ethicists emphasize participant risk/benefit, and funders consider cost. The optimal sample is one that meets statistical aims without undue cost or ethical burden. Good sample size planning therefore reflects a balance of quality of evidence and practical realities ([1]).
Discussion and Future Directions
The field of sample size calculation continues to evolve. Traditional methods assume fairly simple trial structures; modern trials are more complex (biomarker subgroups, adaptive arms, external controls, etc.), prompting new methods. Key future directions include:
- Adaptive and Bayesian Designs: As noted, adaptive trials (seamless phase II/III, multi-arm multi-stage, platform trials) may blur the line between a fixed N and dynamic stopping rules. Bayesian adaptive trials explicitly incorporate accumulating evidence to adjust enrollment. Regulatory agencies have become more open to these designs, with guidance documents describing the allowed flexibility. Power remains a guiding principle, but with more emphasis on Bayesian predictive probabilities of success.
- Use of Real-World Data (RWD): Increasingly, external data (registries, electronic health records) can inform control-arm event rates or effect heterogeneity. Hybrid designs might reduce the needed randomization by borrowing external controls. Statisticians are developing methods to adjust sample size when incorporating such priors. This might reduce sample costs but introduces new assumptions.
- Precision Medicine: Trials targeting narrow biomarker-defined populations naturally have smaller eligible pools. Sample size planning must account for enrichment strategies, possibly requiring fewer subjects if the effect is larger in a selected subgroup, or employing adaptive enrichment. Increasing stratification (genomics, imaging) adds design layers; sample size formulas are being extended to multi-stratum trials.
- Multi-Arm and Platform Trials: Trials like I-SPY COVID or cancer platform trials allow comparing multiple treatments concurrently against a shared control, often with adaptive randomization. These use information efficiently but complicate sample size. New methods (like closing individual arms for futility) require simulation-based sample planning. Here, “sample size” may mean the maximum patients per arm or overall in the platform. Technical work by Berry, Thall, Simon and others is addressing such computations.
- Ethical and Societal Value-Based Endpoints: Some suggest going beyond pure statistical significance. There is discussion of precision (confidence interval width) criteria instead of (or in addition to) power – for example, designing a trial so that the 95% CI for the effect excludes a clinically irrelevant range. This precision-based approach can yield a different $N$ than a test-power approach, especially in equivalence trials. Regulatory interest in such ideas is still nascent.
- Software and Automation: Advanced web tools and R packages continue to emerge. Instant simulation via cloud computing could allow investigators to explore sample size assumptions interactively, making planning more accessible outside biostatistics experts’ domain. There is also a move towards integrating sample size modules into broader trial planning systems.
- Meta-Research on Sample Size: Recent studies (e.g. Barnett & Glasziou, 2021) have cataloged differences between planned and actual sample sizes across many trials ([17]). They find that many trials either fail to reach their planned enrollment or end early for futility. This kind of oversight encourages transparency (e.g. reporting CONSORT flowcharts of enrolled vs planned numbers).
In short, sample size calculation remains a dynamic area. Basic principles (effect size, error rates, variance) persist, but are being applied within ever-more sophisticated trial frameworks. Collaboration across disciplines and continued methodological research will ensure sample size determination stays aligned with both ethical standards and statistical rigor.
Tables of Key Results
| Factor | Effect on Required Sample Size |
|---|---|
| Effect Size (Δ) | Inversely related. Smaller targeted effect → much larger $N$. (Halving Δ roughly quadruples $N$) ([1]) ([5]). |
| Type I Error (α) | Lower α (e.g. 0.01 vs 0.05) → larger $Z_{1-\alpha/2}$ and larger $N$ ([4]). Conversely, relaxing α reduces N, but at cost of false positives. |
| Power (1–β) | Higher power (e.g. 90% vs 80%) → larger $Z_{1-\beta}$ and larger $N$ ([6]) (e.g. going from 80% to 90% power typically adds roughly 30% more N). |
| Outcome Variability | Greater variance (σ² for means, or p(1–p) for proportions) → larger $N$. (Heterogeneous outcomes need more patients for same precision.) |
| Control Event Rate (binary) | Lower baseline event rate → more $N$ needed to observe enough events. (As seen in vaccine trials: 0.7% incidence → ~20k patients per arm ([7])). |
| Non-Inferiority Margin | Smaller margin → much larger $N$. (For instance, a 1% margin required thousands of patients ([15]) ([9]).) |
| Cluster Design (ICC) | Required $N$ multiplied by design effect $=1+(m-1)\rho$. Nonzero ICC (ρ) inflates needed sample approximately by this factor ([14]). |
| Allocation Ratio | Deviating from 1:1 (e.g. 2:1) → larger total $N$ for same power. (Unequal arms dilute information; e.g. AZ’s 2:1 design needed 30k vs ~26k if 1:1). |
| Dropout Rate | Requires inflating $N$ by dividing by (1–dropout%). Example: To end with 500 completers and 10% dropout, start with ≈556 (not 550) ([13]). |
| Covariate Adjustment | Strong covariates (high $R^2$) can reduce needed $N$ via ANCOVA (not usually in simple formulas, but can be approximated as $(1-R^2)$ factor). |
| Interim Analyses | Group-sequential designs may slightly increase maximum $N$ (to allow early looks) but can stop early if effect is strong. Requires spending-function calculations. |
| Multiplicity (multiple endpoints) | Using Bonferroni or multiple α splits effectively raises critical values, increasing $N$ for each test to maintain overall α. |
Table 2: Factors affecting sample size. Each row shows how a design or analysis factor influences the required sample. Citations illustrate key points (e.g. effect size law, α/power trade-offs, dropout correction, design effect).
Conclusion
Sample size calculation is a multifaceted process that requires close collaboration between statisticians, clinicians, ethicists and regulators. The goal is to ensure the trial yields reliable, valid answers with minimum participant burden. In summary:
- Sample size is driven by statistical requirements (error rates, power, effect size, variability) and practical constraints (cost, time, ethics, recruitment potential). Smaller α, higher power, smaller effect sizes, greater variance, low event rates, and stricter non-inferiority margins all inflate the required $N$.
- Proper calculation and justification is ethically mandatory. Both underpowered and overpowered studies risk wasting resources or harming participants. As noted by Röhrig et al. ([1]) and others, scientific and ethical validity hinge on “a justifiable, rational” sample size ([1]).
- Regulatory guidelines (e.g. ICH E9) explicitly demand transparent sample size planning ([2]). Investigators must document all assumptions (hypotheses, effect sizes, variances, dropout rates, etc.) and include them in protocols. Major funders and journals similarly require this rigor.
- Real-world examples (the COVID-19 vaccine trials, a non-inferiority device study) show that sample sizes often reach tens of thousands when detecting modest effects or dealing with rare outcomes. They underscore that planning must realistically account for event rates and dropout.
- Recent trends (adaptive designs, Bayesian methods, big data) offer new tools but also require thoughtful adaptation of traditional power thinking. The fundamentals remain: any alternative approach still needs to ensure that true effects of interest will be detectable at the stated confidence level.
- Throughout, evidence-based judgement is key. Default parameters (α=0.05, 80% power) are typical but not sacrosanct. Clinical input should guide the effect size, and empirical data (or robust priors) should refine variance estimates. When in doubt, simulation or SSR can mitigate uncertainty.
In closing, deciding how many patients are needed for a clinical trial is a science informed by mathematics and data, but also an art guided by clinical relevance and practicality. No universal rule fits all trials; instead, each calculation is custom-tailored. By adhering to rigorous methods and clear justification, trialists can ensure their studies are both scientifically sound and ethically responsible ([1]).
References: Cited sources in this report (peer-reviewed articles, guidelines, and textbooks) provide detailed derivations, case studies, and authoritative guidance on all aspects of sample size planning ([1]) ([12]) ([7]) ([15]) ([9]) ([2]) ([10]) ([11]) ([8]) ([4]) ([6]) ([5]) ([13]) ([18]). Each claim in this report is backed by one or more of these sources.
DISCLAIMER
The information contained in this document is provided for educational and informational purposes only. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability of the information contained herein. Any reliance you place on such information is strictly at your own risk. In no event will IntuitionLabs.ai or its representatives be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from the use of information presented in this document. This document may contain content generated with the assistance of artificial intelligence technologies. AI-generated content may contain errors, omissions, or inaccuracies. Readers are advised to independently verify any critical information before acting upon it. All product names, logos, brands, trademarks, and registered trademarks mentioned in this document are the property of their respective owners. All company, product, and service names used in this document are for identification purposes only. Use of these names, logos, trademarks, and brands does not imply endorsement by the respective trademark holders. IntuitionLabs.ai is an AI software development company specializing in helping life-science companies implement and leverage artificial intelligence solutions. Founded in 2023 by Adrien Laurent and based in San Jose, California. This document does not constitute professional or legal advice. For specific guidance related to your business needs, please consult with appropriate qualified professionals.