EPIDEMIOLOGY

1: BASIC TERMINOLOGY

Odds: case/(total − case) → less intuitive

ex: 0.25

 

Prevalence/Proportion: case/total, unit (%), at a certain point/period → intuitive

By cross-sectional study

Good for chronic Dx / bad at dealing with cured or died cases

ex: 0.2 or 20% at 10am (point), 34% (period)

 

Cumulative Incidence (CI)

=Incidence Proportion, Risk, Average Risk, Case Fatality Rate, Attack Rate

CI = new cases in a certain period / Dx-free population at the start time

(no unit; state the time duration)

by closed cohort study, not by cross-sectional study

bad at dealing with loss to f/u, competing risks, open cohorts

ex: 0.2 or 20% over 4-hour period

 

Survival Proportion (S)

=1-CI

 

Incidence Rate (IR)

= average rate during the study period

IR = new cases in a certain period / sum of person-time

Good for open cohorts; handles competing risks and loss to f/u

ex: 20 cases per 100 person-hours

 

 

* Prevalence/Proportion is a proportion, a Ratio is a ratio, a Rate is x per unit time

*CI/Δt (Δt = duration) = C/(NΔt) ≒ C/PT = IR

*Prevalence ≒ IR × average disease duration

*CI = 1 − e^(−IR×t) ≒ IR×t (when IR×t is small)

 

Frequency: number of cases, odds, proportion, rate

Association: relative scale, expressed as ratios

 

Association

Comparison between groups

 

Causation

Assumption if everybody were exposed

 

Risk difference/Cumulative incidence difference (Absolute/Additive measure) Cohort

CID=CI1-CI2

Associational interpretation: “There was a 20% excess of Dx (20 more cases per 100 people) over the 4-hour period in those exposed compared to those unexposed.”

Causal interpretation: “If everybody had been exposed, there would be a 20% excess risk of Dx/ (20 more Dx cases per 100 people) over the 4-hour period compared to had nobody been exposed.”

 

Risk ratio/Cumulative incidence ratio (Relative measure) Cohort

RR=CIR=CI1/CI2

Associational interpretation: “There was 2 times the risk of Dx (a 100% higher risk of Dx) over the 4-hour period in those exposed compared to those unexposed.”

Causal interpretation: “If everybody had been exposed, the risk of Dx over the 4-hour period would be 2 times (100% higher than) the risk had nobody been exposed.”

 

Incidence rate difference (absolute measure)

IRD=IR1-IR2

 

Incidence rate ratio (Relative measure)

IRR=IR1/IR2

 

Odds ratio (Relative measure) Case-Control

OR=O1/O2

*when Dx is rare, OR≒RR

 

2: STUDY DESIGN

 

Bias and Error in Epidemiology

Population selection

Data collection, analysis and interpretation

confounding

 

Type/Design Merits Demerits
A: Descriptive: to generate hypotheses
 1: Individuals
 Case report
 Case series: “How common is this finding in a Dx?”
 Cross-sectional

“How common is this Dx/condition”

◯Prevalence or Odds(ratio)

•Estimate general population burden (not just those seeking medical care)

•Good for constant exposures (age, sex, etc)

•Cheap and easy

×Cumulative incidence

×Incidence rate

•Description at that point in time

•Cannot determine which came first (cause or outcome)

•Influence of survival factors

 2: Populations
 Ecological

(correlation)

“What explains the difference between groups”

•Generate Dx etiology hypothesis

•Focus on groups

•Inexpensive, little time b/o secondary data

•Can target high-risk pop/period

•Better measurement of exposure than individual study

•Good for policy making

 

•No link to individuals

“Ecological Fallacy” = ecological/aggregation bias, stereotyping: a group-level association does not always reflect the true effect at the individual level

•Exposure and outcome data must come from the same area or time period

•Cannot determine causal relation

•Collinearity: environmental variables are more highly correlated at the group level

B: Analytical to test hypothesis
 1: Observational
 Case-control

“What factors are associated with Dx?”

◯Odds ratio, Logistic regression

Good for rare Dx

Relatively inexpensive

Quick to acquire results

×Cumulative incidence, incidence rate, RR

Bias in exposure assessment, control selection

Examine association only, not causality

Limited to one outcome

 Cohort

“How many people get the Dx when exposed” “What factor predict Dx”

◯Incidence rates, relative risk(risk ratio), Odds ratio, survival curves, hazard ratios

Measure risk factor prior to Dx

Good for multiple Dx outcome

Minimises bias in measuring exposure and survivor bias

No ethical concerns

Difficult to study rare Dx

Costly, Take long time

Examine association only, not causality

Healthy worker effect/ loss to f/u

Potentially large sample size required

Exposure might change during the study

 Nested studies
 2: Experimental
 RCT — Incidence rate ratio is a better measure of association because it takes into account follow-up time and incident cases

 

Data   Primary: collected for this study.   Secondary: existing data, such as routine data

Type of study design Quantitative vs Qualitative, Retrospective vs Prospective, etc.

 

Hierarchy

Case report Case series

Ecological

Cross-sectional

Case-control (matching reduces confounding)

Cohort

RCT

Systematic review/Meta-analysis

 

Control of Confounding factors

Study design (Randomize, Select comparable groups, Matching)

Analysis (Subgroup analysis, adjust data statistically)

 

Evaluating study

Power

Confounders

Biases

Misclassification of exposure

 

 

Case Control Study: Calculation

            Cases          Controls
Exposed       A               B        N_exp = A+B
Not exposed   C               D        N_unexp = C+D
            N_case = A+C   N_cont = B+D

Odds Ratio

OR_exp = odds of exposure among cases / odds of exposure among controls

= ( A/(A+C) ÷ C/(A+C) ) / ( B/(B+D) ÷ D/(B+D) )

= (A/C) / (B/D) = AD/BC

OR_dis = odds of disease among exposed / odds of disease among not-exposed

= ( A/(A+B) ÷ B/(A+B) ) / ( C/(C+D) ÷ D/(C+D) )

= (A/B) / (C/D) = AD/BC = OR_exp

OR = (exposed cases × not-exposed controls) / (exposed controls × not-exposed cases)

“For individuals who are exposed, the odds of Dx is OR times greater than the odds for those who are unexposed.”

OR=1 no assoc, OR>1 positive assoc, OR<1 inverse assoc

Variance for the logarithm of OR: Var(ln OR) = 1/A + 1/B + 1/C + 1/D.   Standard error for ln(OR): SE(ln OR) = √(1/A + 1/B + 1/C + 1/D)

95% CI for OR: exp{ln(OR) ± 1.96·SE(ln OR)}

*√ = square root, ln = natural log, exp = natural exponent
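A minimal Python sketch of the above (hypothetical counts; Woolf's log-based CI):

import math

def odds_ratio_ci(A, B, C, D, z=1.96):
    # A, C = exposed/unexposed cases; B, D = exposed/unexposed controls
    or_ = (A * D) / (B * C)                 # OR = AD/BC
    se = math.sqrt(1/A + 1/B + 1/C + 1/D)   # SE(ln OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

print(odds_ratio_ci(40, 20, 60, 80))        # OR = (40*80)/(20*60) ≒ 2.67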

              Controls   Cases   OR
Not exposed      A         B     1 (reference)
Exposure a       C         D     OR_a = DA/CB
Exposure b       E         F     OR_b = FA/EB

OR for a vs. b = OR_a / OR_b

Case Control Study: Bias

Selection bias: ex. Volunteer bias

Misclassification(information) bias

Recall bias: cases recall and report exposures more carefully than controls.

Interviewer bias: the interviewer knows the Dx status

Exposure/outcome classification bias

Outcome bias: Diagnostic bias, survival bias

 

Odds ratio vs risk ratio in observational studies. In general,

case-control: odds ratio, chi-square test for multiple variables

prospective cohort: risk ratio, Cox proportional hazards

 

 

Cohort Studies

Type

Prospective cohort

Prospective and retrospective cohort

Retrospective cohort: no future f/u, uses existing data, exposure measures were not collected for this study

 

Study population exclusion

At the analysis phase (c.f. at the setting of controls in case-control studies)

 

Calculations

Relative risk / risk ratio (RR)

RR=incidence in the exposed/ incidence in the unexposed

            Cases   Non-cases   Incidence of Dx
Exposed       A         B       I_exp = A/(A+B)
Not exposed   C         D       I_unexp = C/(C+D)

Relative Risk

= I_exp / I_unexp

“Smokers are 4.61 times more likely than non-smokers to develop lung cancer.”

Attributable Risk

= I_exp − I_unexp

“148 per 1000 smokers developed lung cancer because they smoked.”

 

95% CI for RR: exp{ln(RR) ± 1.96·SE(ln RR)}, where SE(ln RR) = √( 1/A − 1/(A+B) + 1/C − 1/(C+D) )
    * ln = natural log, exp = natural exponent
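A minimal Python sketch (made-up cohort counts) computing RR, AR, and the log-based 95% CI for RR from the formulas above:

import math

def cohort_measures(A, B, C, D, z=1.96):
    i_exp, i_unexp = A / (A + B), C / (C + D)
    rr, ar = i_exp / i_unexp, i_exp - i_unexp
    se = math.sqrt(1/A - 1/(A + B) + 1/C - 1/(C + D))   # SE(ln RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, ar, (lo, hi)

print(cohort_measures(150, 850, 30, 970))   # RR = 0.15/0.03 = 5.0, AR = 0.12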

 

Survival Analysis

Cox proportional hazards regression

Include time-to-event

Calculates the hazard ratio (analogous to a risk ratio)

 

Bias in Cohort Studies

Selection bias: Healthy Worker Effect (an employed population tends to be healthier)

Information bias: different quality of information collected from the exposed and unexposed; loss to f/u differs.

Misclassification bias: of exposure status or disease status

Generalisability: external validity may be questioned depending on how the cohort was selected

Nested Studies: Nested Case-Control Studies / Case-Cohort Studies

Control sampling

1 Cumulative sampling (traditional case-control design):

from those who do not develop the outcome, at the time of the nested study. Matched.

2 Incidence density sampling (risk-set sampling, density sampling):

at the time each case is diagnosed. Time-Matched Controls. Matched

3 Case-cohort (case-base; case-referent):

from the entire cohort at baseline. Controls include cases. Random subcohort

 

Measurement of Association

If controls are sampled from those who are still disease-free at the end of f/u, then odds ratio is estimated.

If controls are sampled from those who are still at risk at the time each case is ascertained, then rate ratio is estimated.

If the controls are sampled from those who are initially at risk (case-cohort), then the risk ratio can be estimated.

 

Type/Design Merits Demerits
All Nested Good for rare outcomes

Exposure precedes outcome

Temporality of Exp-Dx relationship

Less bias in exp assessment

Efficient for cost and biospecimen

Flexible: allows testing other hypotheses

Less selection bias than a standard case-control study b/o controls from the same population

Reduce information bias: assess exp blindly

 Nested Case-Control: possibility of risk-set matching / usually only one outcome
 Case-Cohort Can be used for multiple outcomes

When excluding cases from control is logically difficult

Allows estimation of risk factor distributions and prevalence rates; unbiased assessment of correlations among variables

Can include person time in analysis

When a normal cohort study is possible, it is better for statistical efficiency.

RCT   

Question of RCT: superiority, non-inferiority, safety profile

Randomization

Simple: rarely used as it can lead to imbalances; not suitable for small studies

Blocked/Restricted: randomly assign into groups 1 and 2, ensuring balance within blocks. Block size can be set, and blocks can be permuted. Latin square block design.

Stratified: use subgroups by gender, age, etc

Minimization: adjust the allocation probability to balance the number of pts between treatment and control

*Randomization minimises bias at the outset but doesn’t prevent differential treatment later or differential assessment of outcomes; thus blinding is required to reduce bias

Blinding

Single (participants), double (participants, investigators), triple (participants, investigators, assessors)

Cons: cost, complexity, titration.    Pros: removes participant, investigator, and research-team bias

Design

Parallel trial design: simple and commonly used

Crossover design: for chronic conditions; increases efficiency (lower cost, smaller sample size)

A wash-out period is used to prevent carry-over effects

Factorial design: evaluate multiple interventions in one trial; assess independent effects or interaction

Ex: treatments A and B, 2×2

Cluster: groups rather than individuals randomized to the intervention

Adaptive design: allows modification during the trial (sample size, dose, etc), multiple arms, increased efficiency (time, cost), Bayesian statistics

Bias

Population: real world vs recruited to clinical trials; use of a run-in phase

Generalisability: gender/age/region or recruitment/disease severity
Sponsorship: sponsor = “an individual, company, institution or organization which takes responsibility for the initiation, management, and/or financing of a clinical trial”

Loss to f/u

Analyses: analysis population (ITT (includes changes of treatment) or per-protocol (changes excluded))

*ITT is better because it preserves the initial randomization.

Phase I: focus on safety, first in human, Phase II: initial investigation for clinical effect, small scale

Phase III: Full scale treatment evaluation, compare to standard therapy or placebo

Phase IV: post marketing surveillance

CONSORT: consolidated standards of reporting trials

(title, intro, background, objective, methods, results, discussion, registration, funding, protocol)

RCT: planned intervention, outcome carefully measured, drug trial, selected population, confounding reduced, greater confidence in causality

Observational study: no intervention, less selected population, confounders

 

 

Attributable Risk(AR)=Risk Difference

Iexpo – Iunexpo=(background incidence + incidence due to the exposure) – (background incidence)

“A 0.2 or 20% excess risk over the 4-hour period”

 

Number Needed to Treat(NNT)

NNT=1/AR

“5 people need to receive the intervention to prevent 1 case of the disease”

 

Attributable risk percent (AR%)

Proportion of Dx among the exposed which can be attributed to the exposure

AR% = (I_exp − I_unexp) / I_exp × 100 = (RR−1)/RR × 100 = (1 − 1/RR) × 100 (%)

“The proportion of Lung Ca incidence rate due to cigarette smoking is 95%.”

**If the value is negative, the calculation is not meaningful; change the reference group.

 

Preventable fraction(PF)

Proportion of cases that would have occurred if people had not been exposed to the protective factor (RR < 1.0)

PF = (I_unexp − I_exp) / I_unexp = 1 − RR

 

Population attributable risk(PAR)

Excess risk of Dx in the total study population of exposed and unexposed individuals that is attributable to the exposure

PAR = I_population − I_unexp = AR × p   ** p (Pe) = proportion of exposed individuals in the population

 

PAR% (population attributable risk fraction)

Proportion of Dx in the population that could be prevented if one eliminates the risk factor.

PAR% = (I_population − I_unexp) / I_population × 100 = [{(1−p)R0 + pR1} − R0] / {(1−p)R0 + pR1} × 100

= p(RR−1) / {p(RR−1) + 1} × 100   **p = exposed population / total population

Among the general population, 88% of the total risk for fatal lung Ca is due to cigarette smoking.

88% of lung Ca can be prevented by eliminating cigarette smoking.

  • A big AR% does not necessarily equate to a big PAR%
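A minimal Python sketch (illustrative numbers) tying the impact measures above together:

def impact_measures(i_exp, i_unexp, p):
    # p = proportion of the population exposed
    rr = i_exp / i_unexp
    ar = i_exp - i_unexp                       # attributable risk
    nnt = 1 / ar                               # number needed to treat
    ar_pct = (1 - 1 / rr) * 100                # AR%
    par = ar * p                               # PAR
    par_pct = p * (rr - 1) / (p * (rr - 1) + 1) * 100
    return rr, ar, nnt, ar_pct, par, par_pct

print(impact_measures(0.15, 0.03, 0.3))
# RR=5.0, AR=0.12, NNT≒8.3, AR%=80%, PAR=0.036, PAR%≒54.5%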

 

Absolute measures: useful to describe the public health impact; essential for policy making

Relative measures: useful to explore etiology; describe the strength of the association

 

High risk strategy, Population strategy. Prevention paradox

A large number of people exposed to low risk generate more cases than a small number of people exposed to high risk.

A measure that brings large benefits to the community offers little to each participating individual.

*Be careful about shifting the entire population when the association between risk factor and disease is U-shaped.

 

Direct and Indirect standardization

To compare rates between populations with e.g. differing age-sex distributions

Direct standardization/ Directly Standardized Rate(DSR)

Apply the same set of weights to the stratum-specific rates of different populations

=apply the rate from the study population to the standard population

 

Standardized incidence risk = Σ_l (R_l,P × W_l,S)

L = adjustment variable, l = a certain stratum of the adjustment variable, Σ = the sum over all strata l of variable L, R_l,P = the risk in stratum l of the study population P, W_l,S = the proportion of individuals in stratum l of the standard population S

 

No need to know the overall rate for the entire population

For multiple comparisons between populations

×Changing the standard might change the result

×rather high numbers of events required

 

Indirect standardization

Standardized mortality ratio (SMR) / Standardized Incidence Ratio (SIR) when the outcome is disease

SMR = observed events / expected events (calculated by applying stratum-specific rates from the standard population to the study population)

SMR = 1: the same; SMR > 1: more events were observed than expected from the stratum-specific rates in the standard population; SMR < 1: the opposite

 

No need to have stratum-specific counts

often used when rates are low

Can be estimated by Poisson regression

SIR is usually calculated only in cohort studies

×Low expected counts lead to instability of estimates

×SMRs from different study populations are not directly comparable to each other.
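A minimal Python sketch (made-up strata) contrasting the two approaches: the DSR applies study rates to standard weights, and the SMR compares observed events with those expected from standard rates:

study_rates  = {"young": 0.002, "old": 0.020}   # R_l,P per person-year
std_weights  = {"young": 0.7,   "old": 0.3}     # W_l,S
dsr = sum(study_rates[l] * std_weights[l] for l in study_rates)
print(dsr)                                      # 0.0074

std_rates    = {"young": 0.001, "old": 0.015}   # standard-population rates
study_pyears = {"young": 5000,  "old": 2000}    # study person-years
observed = 45
expected = sum(std_rates[l] * study_pyears[l] for l in std_rates)
print(observed / expected)                      # SMR = 45/35 ≒ 1.29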

 

Problem with Standardization

Once you have standardized for a factor, you cannot investigate its impact

Standardized rates are summary measures and can thus mask interesting trends in stratum-specific rates.

 

 

3: DIAGNOSTIC TESTS

Sensitivity: proportion of those truly with disease who test positive

TP/(TP+FN)

         Disease +   Disease −
Test +      TP          FP
Test −      FN          TN

Specificity: Proportion of those without disease who tested negative TN/(TN+FP)

False positive: FP/(TN+FP),   False negative: FN/(TP+FN)

* The denominator is disease status (Dx+ or Dx−) in the above 4

* These 4 are not influenced by Dx prevalence

Positive Predictive Value: proportion of positive tests that correctly identify Dx: TP/(TP+FP)

Negative Predictive Value: TN/(TN+FN)

*These 2 are heavily influenced by Dx prevalence

*PPV and sensitivity are the two most important parameters.

*We can correct misclassification using sensitivity and specificity

 

ROC curves: the area under the curve indicates overall test performance

Likelihood Ratio Test

LR+ = sensitivity / (1 − specificity)

LR− = (1 − sensitivity) / specificity

*Somewhat arbitrary, but LR+ > 10 or LR− < 0.1 is perceived as high diagnostic value
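A minimal Python sketch (hypothetical counts) for the test metrics above, including how PPV falls when the same test is applied at a lower prevalence:

def test_metrics(tp, fp, fn, tn):
    se, sp = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    lr_pos, lr_neg = se / (1 - sp), (1 - se) / sp
    return se, sp, ppv, npv, lr_pos, lr_neg

se, sp, ppv, npv, lrp, lrn = test_metrics(tp=90, fp=45, fn=10, tn=855)
print(se, sp, ppv, lrp)        # 0.90, 0.95, 0.667, 18.0 (at 10% prevalence)

prev = 0.01                    # PPV via Bayes' theorem at 1% prevalence
print(se * prev / (se * prev + (1 - sp) * (1 - prev)))   # ≒ 0.154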

4: SCREENING

Pre-Clinical Phase (PCP): period from when early detection is possible by screening to the usual clinical presentation

Indicates the expected utility of screening and the required minimal frequency of screening

The detectable PCP depends on disease incidence, previous screening, and test sensitivity

Lead time: Time by which diagnosis is advanced or made earlier

Evaluation of screening effectiveness

Lead-time bias: screened cases can seem to have longer survival due to earlier diagnosis

Length-time bias: screening preferentially identifies slower, less progressive cases with better prognosis

Compliance bias (selection bias): screened patients are better educated and more health conscious

Feasibility of screening

Burden of disease: effectiveness of treatment without screening

Acceptability: convenience, comfort, safety, costs

Efficacy of screening: Sensitivity or Specificity of test, potential to reduce mortality

Efficiency: Risk and costs of f/u of test positives, cost effectiveness

Balance of risks(harms) vs benefits

 

5: REVIEW AND META-ANALYSIS

(narrative) Review

Prone to being subjective, Qualitative summary only, Possible influence of publication bias

Systematic Review

Increased objectivity by pre-specified criteria, Possible influence of publication bias

Qualitative summary only

Meta-analysis   Increased objectivity

Possible influence of publication bias

Pooled re-analysis of individual data

Review protocol

Background selection, research question

Study inclusion and exclusion criteria (PICOS = patient/population, intervention, comparison, outcome, study design)

Objectives: synthetic goal & analytic goals (identify differences among study-specific effects; subgroup analysis to examine the effect of a variable; sensitivity analysis to evaluate the effect of a single study on the summary result; meta-regression methods to identify the source of heterogeneity)

Continuous endpoint: Difference in means, The standard error of difference in means(95%CI)

Binary endpoint: the log(odds ratio), the standard error of the log(odds ratio)

Survival endpoint: the log(hazard ratio), the standard error of the log(hazard ratio)

Statistical methods: the combined effect is a weighted average of each study’s effect, T̄ = Σ(wi × Ti) / Σwi

The weight: wi = 1/vi (the inverse of the within-study variance)

Effect size variation: Fixed effect model = variation is assumed to come from random error within studies; gives the best estimate of an assumed common treatment effect.

Random effects model = variation is assumed to come from true variation in effect size from one study to the next; gives the average of the distribution of treatment effects across studies. Use this model when heterogeneity across study effects is evident or likely.        *When there is no heterogeneity, both models give the same summary estimate

Heterogeneity: Cochran’s Q test, Q = Σ wi(Ti − T̄)², with wi = 1/vi

Q follows a chi-square distribution with df = k−1, where k = number of studies

Other measure, I² statistic = (Q − df)/Q × 100: intuitive; the percentage of variation across studies that is due to heterogeneity rather than chance.
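A minimal Python sketch (invented study effects) of inverse-variance fixed-effect pooling with Q and I² as defined above:

import math

effects = [0.30, 0.10, 0.45, 0.25]      # study effects T_i (e.g. log ORs)
ses     = [0.12, 0.15, 0.20, 0.10]      # their standard errors
w = [1 / s**2 for s in ses]             # weights w_i = 1/v_i
t_bar = sum(wi * ti for wi, ti in zip(w, effects)) / sum(w)
q = sum(wi * (ti - t_bar)**2 for wi, ti in zip(w, effects))
df = len(effects) - 1
i2 = max(0.0, (q - df) / q) * 100       # % of variation beyond chance

print(t_bar, math.sqrt(1 / sum(w)))     # pooled effect and its SE
print(q, i2)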

Source of heterogeneity: variation of intervention/exposure/disease, differences in covariates/design/analysis/population, bias

Issues in meta-analysis, dealing with heterogeneity: no heterogeneity, then use the fixed effect model

Some heterogeneity: use the random effects model

Too much heterogeneity: it might not be appropriate to pool the studies; the pooled result may not be interpretable.

Study quality: meta-analysis increases the precision of the estimate, but bias in the included studies is not eliminated

Publication bias: funnel plots to examine patterns of results

 

6: APPRECIATION OF STUDY

Validity: absence of systematic error; estimation of the true association; invalid = inaccurate

Internal: bias(selection, Information), Confounding

External: generalizability

Precision: Absence of Random error, Repeatability of the study result

Bias  (c.f. chance is caused by random error, which may cancel out in the long run)

Selection bias: the chance of enrolment into the study population from the source population depends on both exposure and outcome

*Berkson’s bias (in a hospital-based CC study, controls are more likely to have been exposed)

*Healthy worker effect(cohort), Differential losses to follow-up(in cohort or RCT)

No bias in the OR if sampling fractions differ only by case-control status or only by exposure status, not by both

Misclassification (information) bias: non-differential (errors in determining exposure status or outcome occur equally among cases and controls, or exposed and unexposed) biases the OR towards the null; differential (errors occur unevenly) can increase or decrease the OR

*Information bias (imperfect definitions of variables and flawed data collection procedures), *Interviewer bias (from knowing the outcome), *Recall bias (due to memory)

Mixed bias: selection and information bias

*Detection bias (the medical condition related to the exposure promotes detection of the study outcome; cohort studies)

 

Confounding: the effect of extraneous variables; more prevalent than bias; occurs when the effect of the exposure is mixed with the effect of another variable. A type of systematic bias toward or away from the null.

Identification steps: 1 at study design, 2 at data analysis

1 knowledge of subject matter (causal definition=biological evidence),

2a Examine three conditions (1: association with the exposure; 2: association with the outcome in the absence of the exposure; 3: the confounder is not a consequence of the exposure, i.e. not on the causal pathway)

2b Stratify the data and check that the ORs are similar across categories when stratified by the confounder but differ from the crude OR

*Breslow-Day test for homogeneity of the ORs (H0: OR with confounder = OR without confounder)

*Mantel-Haenszel OR adjusted for confounding, to check whether the OR differs from the crude OR

Positive confounding: ORcrude > ORadjusted, caused by a homogeneous association among exposure, confounder, and Dx

Negative confounding: ORcrude < ORadjusted, caused by a heterogeneous association

*Excessive correlation: over-adjustment by a variable with too close a biological relationship (collinearity), ex. Obesity → Leptin → Cancer

*Residual confounding by imperfect surrogate or misclassification

 

2b’ Examine a multivariable model with the confounder as a covariate

 

Control for Confounding

Method: Advantage / Disadvantage

Design:

Randomization: controls for both known and unknown confounders / only possible in RCTs

Restriction: inexpensive / reduces the pool of available subjects, affects generalizability

Matching: matching on a strong confounder tends to increase statistical power (efficiency) /

•Cannot study the matching variable as a risk factor

•Matching on a non-confounder harms statistical efficiency

•Overmatching when the matching variable is strongly correlated with the exposure

•Cannot match on too many confounders

•Needs statistics for matched designs

Analysis:

Stratification: examine the raw data in detail; can obtain a summary adjusted OR /

•Sparse data when multiple categorical confounders

•Need to change continuous confounders to categorical

•Not feasible with multiple confounders

Multivariable regression models: can manage multiple confounders

OR in Matching

Mantel-Haenszel OR (for matched pairs, as in McNemar’s test) = b/c (discordant pairs only)

OR in Stratification

Breslow-Day test for homogeneity of the ORs (H0: OR with confounder = OR without confounder); p < 0.05 = heterogeneous

Mantel-Haenszel adjusted OR, to check whether the OR differs from the crude OR

OR_MH = weighted average of the stratum-specific ORs, with weights wi = bi·ci/ni, i.e. OR_MH = Σ(ai·di/ni) / Σ(bi·ci/ni)
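A minimal Python sketch of the Mantel-Haenszel summary OR across strata, each stratum given as (a, b, c, d) = exposed case, exposed control, unexposed case, unexposed control:

def mh_or(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Two hypothetical strata of a confounder with similar stratum-specific ORs
print(mh_or([(20, 10, 30, 40), (5, 10, 8, 40)]))   # ≒ 2.6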

Effect Modification (causal meaning) vs Interaction (statistical meaning: can be by chance)

Synergism (accentuating) or antagonism (diminishing) effect on the outcome.

Two methods for Detection and Evaluation

1: Are Stratum-specific ORs heterogeneous?

Evaluate the overlap of the 95% CIs across strata

Breslow-Day test for homogeneity of the ORs

Q-test

2: Use statistical model

Multiplicative model: H0: interaction effect = OR++ / (OR−+ × OR+−) = 1, i.e. OR++ = OR−+ × OR+−

Additive model: RD++ = RD−+ + RD+− (cohort study); OR++ = OR−+ + OR+− − 1 (case-control study)

*In a CC study it is not easy to assess additive interaction, since incidence risk is not available, and this model cannot be tested directly in logistic regression, which is a multiplicative model.

Regression model: include the interaction (= exposure × modifier) as a covariate

 

 

STATISTICS

Type of data

Numerical/Quantitative data: frequency table; histogram (Y axis: frequency or relative frequency density; a gap means no observations; probability density functions represent the whole population); box-and-whisker plot (median, 25th–75th IQR, outliers; good for skewed data)

Continuous / Discrete (ex: 0, 1, 2, 3, ...)

Categorical data:  Frequency table, Bar Charts (drawn with gaps)

Ordinal (ex: cancer stage I, II, III): bar charts, mode or median

Nominal (ex: blood type A, B, O): bar charts, mode

Binary/dichotomous

Amount of information: Continuous>Discrete>Ordinal>Nominal

Mean, median, mode

If N is even, the median is the average of the 2 middle observations.

If there is more than one mode, it is a multimodal distribution

Spread in data, Shape of Distributions

Range: lowest to highest

Percentiles or quantiles: inter-quartile range = 25th to 75th percentile; 95% reference range = 2.5th to 97.5th percentiles

Variance = σ².    Standard deviation (SD; σ in the population): the measure of spread in individual values; describes the variation between people

Skewed: negative, or positive (long right tail); better to use the median and IQR or 95% interval

Normal/Gaussian distributions: symmetric bell shape. 68% within mean ± SD, 95% within mean ± 2SD, 99.7% within mean ± 3SD

Population mean: μ = E(X) = Σ xi·p(xi).    Variance: σ² = V(X) = E[(X−μ)²] = Σ (xi−μ)²·p(xi)

Die example: Pr(X=xi) = p(xi) = 1/6 if xi ∈ {1,2,3,4,5,6}, = 0 otherwise; μ = 3.5, σ² = Σ(xi−3.5)²/6 ≒ 2.92

Z value:   0.5     0.68    1       1.5     1.96    2
P(Z>z):    0.309   0.248   0.159   0.067   0.025   0.023

Z distribution: Z-value (number of SDs from the mean)

X ~ N(μ, σ²) (= normally distributed with mean μ, variance σ², SD = σ)

Z = (X−μ)/σ, where Z ~ N(0,1); Z = number of SDs from the mean
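A minimal Python sketch (assumes scipy) reproducing the upper-tail probabilities tabulated above and a hand-computed z-score:

from scipy.stats import norm

for z in (0.5, 0.68, 1, 1.5, 1.96, 2):
    print(z, round(norm.sf(z), 3))    # sf = 1 - cdf = P(Z > z)

# Z = (X - mu) / sigma: e.g. X = 130 with mu = 100, sigma = 15
print((130 - 100) / 15)               # 2 SDs above the mean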

 

95% Confidence intervals (CI): refer to population mean, not to individual values

Sample mean (x̄) = our best guess of the population mean, as a point estimate.

“We are 95% confident that the population mean lies between 95%CI around our sample mean.”

“When we have many repeated samples and construct 95% CI’s around each sample mean, 95% of these CI’s will contain the population mean.” (Interpretation in terms of frequency of results)

Standard Error (SE): a measure of the precision of our estimate of the population mean when using the sample mean to estimate it. Variation between sample means is measured by SE. SE(x̄) = σ/√n (estimated by s/√n)

95% CI for μ ≒ x̄ ± 2SE = (x̄ − 2s/√n, x̄ + 2s/√n)   *when the sample size is more than ~50

c.f. SD: variability between individual observations is measured by SD.

t distribution

Closely related to the standard normal but with heavier tails; the tails depend on the degrees of freedom (df)

T = (x̄−μ)/SE(x̄) = (x̄−μ)/(s/√n)

95% CI for a mean = (x̄ − t(n−1,0.025)·s/√n, x̄ + t(n−1,0.025)·s/√n)   *t(n−1,0.025) = the t value from the t distribution with n−1 degrees of freedom that marks the 2.5% tail area

A 95% CI does not account for measurement bias or selection bias; it requires random selection.

95% CI for a difference in means

SE(x̄1 − x̄2) = √( SE(x̄1)² + SE(x̄2)² ) = √( s²/n1 + s²/n2 )   *s = combined SD from the 2 samples

95% CI = (x̄1 − x̄2) ± t(n1+n2−2, 0.025)·SE(x̄1 − x̄2)

“We are 95% confident that the mean is between A and B larger in C than in D in the population.”

“If we were to perform this experiment repeatedly, calculating a 95% CI for the difference each time, the true population difference in means would be contained within 95% of these CIs.” (Interpretation in terms of frequency of results)
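A minimal Python sketch (assumes scipy; invented data) of a t-based 95% CI for one mean and an equal-variance two-sample t-test:

import numpy as np
from scipy import stats

x = np.array([4.1, 5.2, 6.3, 5.8, 4.9, 5.5])
m, se = x.mean(), stats.sem(x)                  # sem = s / sqrt(n)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
print(m - t_crit * se, m + t_crit * se)         # 95% CI for the mean

y = np.array([5.9, 6.8, 7.1, 6.2, 7.4, 6.6])
print(stats.ttest_ind(x, y, equal_var=True))    # pooled-SD t-test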

 

 

Key assumptions for the t-test: random sampling, independent observations, normally distributed sample (the sample mean of skewed data is also approximately normal if N is large enough), population variances assumed equal (SDs within 2-fold of each other)

One sample test:    Comparing a population mean (μ) to a fixed value (C).

Null hypothesis H0: μ = C; alternative hypothesis H1: μ ≠ C

P-value = Prob(|T| ≥ |t| assuming H0 is true)

Result nonsignificant: do not reject H0; it might be true = the sample might be drawn from a population with μ = C

Result significant: reject H0, accept H1; conclude that μ ≠ C.

Two sample test:   Comparing population means (μ1, μ2) between 2 independent groups

Null hypothesis H0: μ1 = μ2; alternative hypothesis H1: μ1 ≠ μ2

P-value = Prob(|T| ≥ |t| assuming H0 is true), T = (x̄1 − x̄2)/SE(x̄1 − x̄2)

**For the t-test, s = combined estimate of the SD across both samples. The t-test is used if s1 and s2 are within 2-fold of each other (variances can be assumed equal); otherwise use Welch’s test.

Result nonsignificant: do not reject H0; it might be true.

Result significant: reject H0, accept H1; conclude that μ1 ≠ μ2.

              Reject H0                    Don’t reject H0
H0 is true    Type I error α               True negative 1−α
              (false positive)
H1 is true    True positive 1−β (power)    Type II error β
                                           (false negative)
False negative

0.1>P>0.05 weak evidence, 0.05>P>0.01 evidence, 0.01>P>0.001 strong, 0.001>P very strong

Type I and II errors and power   (power is always greater than α)

*Pr(Type I error) = significance level, α, usually 0.05 (1 in 20 tests are significant by chance)

*Pr(Type II error) = β; depends on the effect size; generally 10–20%

*Power = chance of correctly rejecting H0 = Pr(reject H0 when it is false)

= 1−β   *Power↑ when N↑, α↑, and effect size↑

The probabilities of Type I and Type II error are a trade-off

**SEs, 95% CI and p-value cannot reveal bias
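A minimal sketch (assumes statsmodels) of the power relationships above: solving for the per-group sample size at a given α, power, and standardized effect size:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n)   # ≒ 64 per group; n falls as alpha or the effect size rises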

 

Probability

When A and B are independent, Pr(A and B)=Pr(A)*Pr(B), Pr(A or B or both)=P(A)+P(B)-Pr(A and B)

When A and B are mutually exclusive events, Pr(A and B)=0 and Pr(A or B)=Pr(A)+Pr(B)

When A1,A2,,,An are mutually exclusive and exhaustive, Pr(A1 or A2 or,,,or An)=1

Conditional = Pr(A) given that B has occurred: Pr(A|B) = Pr(A and B)/Pr(B). If A and B are independent, Pr(A|B) = Pr(A)

Bayes Theorem: Pr(A)*Pr(B|A)=Pr(B)*Pr(A|B)=Pr(A and B)

Binomial distribution = proportion out of a fixed number; the result of each patient is independent

Pr(Y=y) = (n choose y)·π^y·(1−π)^(n−y), if Y ~ Binomial(n, π), where π = the probability of success on each of n individual trials

Mean E(X) = nπ when X ~ Binomial(n, π). For all distributions, μ = E(X) = Σ xi·p(xi)

Variance Var(X) = nπ(1−π) when X ~ Binomial(n, π). For all distributions, σ² = V(X) = E[(X−μ)²] = Σ (xi−μ)²·p(xi)
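A minimal Python sketch (assumes scipy) checking the Binomial pmf, mean nπ, and variance nπ(1−π):

from scipy.stats import binom

n, pi = 10, 0.2
print(binom.pmf(3, n, pi))                  # Pr(Y = 3)
print(binom.mean(n, pi), binom.var(n, pi))  # 2.0 and 1.6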

Outcome   Group1   Group2   Total
Yes         a        b      a+b
No          c        d      c+d
Total      a+c      b+d     a+b+c+d

Pearson’s Chi-Squared test

Null hypothesis H0: the row proportions in each column are equal

O = observed, E = expected; e.g. E for Group1–Yes = (a+c)(a+b)/(a+b+c+d)

Test statistic χ² = Σ (O−E)²/E; valid when all E > 1 and E > 5 in more than 80% of cells

Chi-squared distribution, df = (rows−1)×(columns−1)
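A minimal Python sketch (assumes scipy; invented counts) of the test above; chi2_contingency also returns the expected counts E:

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 15],     # outcome yes: group1, group2
                  [70, 85]])    # outcome no
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(chi2, p, df)
print(expected)                 # E = row total * column total / grand total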

 

 

 

 

 

Linear Regression

y = α + βx; α = intercept (average y when x = 0), β = slope (regression coefficient)

**yi = α + βxi + εi; xi, yi = the ith values; ŷi = the fitted value; εi = the random variation in yi = residual = yi − ŷi

Residual Sum of Squares (RSS) = Σ ε̂i² = Σ (yi − α − βxi)²; Least Squares = choose α, β to minimise RSS

Null hypothesis H0: β = 0

Test statistic t = β̂/se(β̂), t-distribution with df = n−2 (with 2 covariates, use n−3)

95% CI = (β̂ − t(n−2,0.975)·se(β̂), β̂ + t(n−2,0.975)·se(β̂))
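A minimal Python sketch (assumes scipy; invented data) of the least-squares fit and the t-test of H0: β = 0:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
res = stats.linregress(x, y)        # slope, intercept, r, p-value, SE(slope)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)
print(res.slope, res.intercept, res.pvalue)
print(res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)  # 95% CI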

Correlation (Correlation Coefficient)

r = 0: no linear association; r = 1: perfect positive assoc.; r = −1: perfect negative assoc. The squared value is denoted r²

The percentage of variability of the response variable explained is given by r squared

“We know that (r²×100)% = 9% of the variation in systolic blood pressure is attributable to (or associated with) variation in sodium intake.”

R² automatically increases when variables are added.

Residuals

Residual plot: residuals vs fitted values (ŷ); points should be scattered evenly

Standardized residuals: residuals divided by their standard deviation; should be normally distributed

Outlier

Bad data from measurement or recording, inadequacies in the model, poor sampling of observations

May check Cook’s distance

Transformation of variables: ln, exp, √, x², 1/x to improve the fitted model

 

 

Binary covariate

yi = α + βxi; x = 0 for the 1st group, x = 1 for the 2nd group (indicator)

α = mean y in the 1st group; α + β = mean y in the 2nd group; β = the difference in y

test of β: equivalent to the t-test for the difference between groups   *α can also be tested the same way

One-way ANOVA: for categorical variables with >2 levels, create k−1 indicator variables for a k-level variable

F test: H0: there is no difference between any of the groups. Used to check for differences in categorical variables with >2 levels

Multiple Linear Regression

Interactions: yi = α + β1xi + β2zi + β3(xi·zi) + εi   *the opposite is the additive model

Allows a different relationship (slope) for each group of a variable; not for the relationship between variables

*When we compare the difference between xi with zi and xi without zi, the odds ratio is e^β1 × e^β3,

and the 95% CI runs from e^(lower β1) × e^(lower β3) to e^(upper β1) × e^(upper β3). We judge significance by whether it includes 1 or not.

ANOVA (Analysis of Variance): to test whether the decrease in RSS is by chance or not when adding a variable

Variance ratio = (difference in residual SS (sum of squares)) / MS (mean square) of the new model

Compare to the F distribution, F(number of added variables, n − number of parameters), and calculate the P value

Model building

Identify potential predictor variables from previous evidence

If possible, include all known predictors

Select the Best subset of predictors

Unnecessary variables will add noise

Need to be aware of collinearity

Variable selection

Forward: Start with null model (y=α), and continue to add variables if they make significant decrease in RSS.

Backward: Start with fullest model, and remove a variable if the model without it is not significantly worse.

Stepwise: based on forward selection, but after adding a new variable, test the effect of dropping each variable

**include variables that are known to be important based on published data even if they are not significant

**eliminate variables with narrow distributions or sparse data in categories

**examine variable clustering

Variables in RCT

Treatment arm, design variables(recruitment centre), important covariates


Logistic Regression

Outcome: binary (coded as 1 or 0); this is why linear regression cannot be used.

Predictor: Continuous or Categorical

odds = p/(1−p), where p = probability of having the outcome event (0 to 1)

ln(odds) = ln(p/(1−p)) = logit(p) = α + βx;    odds = e^(α+βx) = e^α · e^(βx);   p = e^(α+βx) / (1 + e^(α+βx))

Control: x=0, Treatment: x=1

ln(odds ratio) = ln(odds on treatment / odds on control) = ln(p1/(1−p1)) − ln(p0/(1−p0)) = logit(p1) − logit(p0) = (α+β) − α = β; thus e^β = odds ratio   *this is always the same as the direct calculation.

*The coefficient from the regression output is β, so take the exponent to calculate the odds ratio; same for the 95% CI.

*Multivariable: ln(p/(1−p)) = α + β1x1 + β2x2 + β3x3, ..., where p = probability of the outcome

**The test of the null hypothesis β2 = 0 is carried out using the standard error of β2:

Z-value = β2 / SE(β2); p is calculated from the Z distribution (a larger |Z| gives a lower p)

**95% CI: ln(95% CI) = β2 ± 1.96·SE(β2)

**If x is a continuous variable, the odds ratio is per 1-unit change. For a 5-unit change, odds ratio = exp(5β) = (e^β)⁵
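A minimal Python sketch (assumes statsmodels; simulated data) showing that exponentiating the fitted coefficient and its CI gives the OR scale, as above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 200)                     # 0 = control, 1 = treatment
p = 1 / (1 + np.exp(-(-1 + 0.8 * x)))           # true logit: alpha=-1, beta=0.8
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(np.exp(fit.params[1]))                    # odds ratio e^beta
print(np.exp(fit.conf_int()[1]))                # 95% CI on the OR scale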

 

Predicted hb_post = 99.67 − 0.61(hypertension) − 7.95(emergency) − 5.69(urgent)

Where hypertension, emergency and urgent are indicator variables, taking values of 1 for people satisfying these conditions and 0 otherwise.

Those in the hypertension group have mean hb_post values 0.61 lower than the mean in the non-hypertensive group, controlling for operation type.

 

 

 

Poisson Regression

- the total population at risk is not known

Distribution for modelling the number of occurrences of an event across time or over an area

Assumes events occur independently at constant rate

μ: the mean number of events (count) in an interval of time or space

If Y ~ Poisson(μ):

Pr(y events in the interval) = Pr(Y=y) = e^(−μ)·μ^y / y!; the mean of this distribution is E(Y) = μ, and the variance is V(Y) = μ

μ = λ × t,  λ = average number of events per unit of time or space = average rate.

In the case of a rare outcome (π is small) in a large population (n is large), the Binomial is approximated by a Poisson distribution with mean μ = nπ.

Using linear regression for Poisson: μi ranges over 0 to +∞ (per person-year), while ln(μi) ranges over −∞ to +∞. Thus ln(μi) = α + βxi, xi = 0 or 1; so μi = e^α or e^α × e^β; Rate Ratio = e^β; null hypothesis: β = 0

Hypothesis testing: β = 0, using a z test. z statistic = β/SE(β) for the p-value; ln(95% CI) = β ± 1.96·SE(β)

Adding an offset term ln(ti):

ln(μi) = α + βxi + ln(ti); then ln(λi) = α + βxi (α, β = log rates)

Can also include covariates and interaction.
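A minimal Python sketch (assumes statsmodels; invented counts) of Poisson regression with a log person-time offset; e^β is the rate ratio:

import numpy as np
import statsmodels.api as sm

counts = np.array([12, 15, 30, 28])             # events per group-period
ptime  = np.array([1000.0, 1100, 900, 950])     # person-years at risk
x      = np.array([0, 0, 1, 1])                 # exposure indicator

fit = sm.GLM(counts, sm.add_constant(x),
             family=sm.families.Poisson(),
             offset=np.log(ptime)).fit()
print(np.exp(fit.params[1]))                    # rate ratio e^beta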

 

Binomial distribution: proportion; trials are independent

Poisson distribution: count only; counts are independent

 

Survival Data Analysis

Kaplan-Meier curve

We need the data of time origin, the scale for the passage of time and failure.

Censored individuals = those not observed for the full time to failure; not regarded as failures

P(surviving from t to t+1) = 1 − P(dying between t and t+1) = 1 − dt/nt (deaths in the interval / number at risk at t)

P(surviving to t) = the product of all the P(surviving) before that time point (t)
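A minimal Python sketch of the product-limit (Kaplan-Meier) calculation above, with (time, event) pairs where event = 0 marks censoring (assumes untied event times):

data = [(2, 1), (3, 0), (5, 1), (7, 1), (8, 0), (10, 1)]

at_risk, surv = len(data), 1.0
for t, event in sorted(data):
    if event:                    # death: multiply by (1 - d_t / n_t)
        surv *= 1 - 1 / at_risk
        print(f"t={t}: S(t)={surv:.3f}")
    at_risk -= 1                 # censored subjects simply leave the risk set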

Log-rank test

Test for a difference between the curves across the whole range of observed values

EA = sum of expected deaths in A; EB = sum of expected deaths in B = sum of deaths in both groups − EA

OA=observed death in A

χ² = (OA − EA)²/EA + (OB − EB)²/EB, df = 1; if the p-value is less than 0.05, reject the null hypothesis that there is no difference

Cox Regression (Cox’s proportional hazards model): clear start time, time to the outcome known

*Describe the association of different variables with survival

*Adjust the effect of one variable for confounding variables

*Predict individuals with better or worse prognosis

Hazard (rate) function λ(t) = the rate of failure conditional on survival to time t.

λ1(t) = λ0(t) × exp{βx}  (x = 0 or 1 for each group), assuming the hazards are proportional

*The model makes no assumptions about the absolute risk, λ0(t)

If we have covariates, λ1(t) = λ0(t) × exp{β1x1 + β2x2 + ...};  exp{β1x1 + β2x2 + ...} is the relative hazard / risk score

Positive β=worse survival, negative β=more likely to survive

Interpretation: when β1 = −0.78, exp(−0.78) = 0.46, P = 0.04, 95% CI for the HR = (0.22, 0.98): “Patients on * had only 46% of the chance of dying compared to * at any given time, having accounted for the covariate risk scores. Although this does not look like a chance finding, we are not very certain about the magnitude of the hazard ratio.”

The hazard ratio per 1-unit increase in a continuous covariate, or associated with a binary covariate, is exp(β1)

HR = 1: no difference; HR < 1 (or > 1): the numerator group has a smaller (or larger) hazard rate
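A minimal sketch assuming the lifelines package (invented data); exp(coef) in the summary is the hazard ratio:

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 15, 6, 11],
    "event": [1, 1, 0, 1, 0, 1, 1, 0],     # 0 = censored
    "treat": [0, 0, 0, 0, 1, 1, 1, 1],
    "age":   [60, 65, 55, 70, 62, 58, 66, 61],
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()    # HR = exp(coef) per covariate, with 95% CI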

Interpretation

p value

“Under the null hypothesis of no association between X and Y, there is only a 6 in 1000/ 0.6% chance of getting results at least this extreme. This is strong evidence against the null hypothesis/ highly statistically significant, so we can reject it.” (when p=0.006)

Population

The population mean Z is significantly lower/higher in X compared to Y – the evidence for this is weak/ moderate/ strong/ very strong. The best estimate of this is that the population mean Z is C points lower/higher in X, compared to Y. (if C is small) So although the evidence is strong, the estimated difference in population IQs is not very big, and might not be of much practical relevance.

The population mean is not significantly different comparing X with Y, so we have no evidence that the population means of Z are different; they might plausibly be the same. The best estimate is that the population mean Z is C points lower/higher in X compared to Y; (if n is small) however, since this is based on tiny samples of students, this estimate is very imprecise.

95% CI of population mean

“We are 95% confident that the population mean lies between 95%CI around our sample mean.”

“When we have many repeated samples and construct 95% CI’s around each sample mean, 95% of these CI’s will contain the population mean.” (Interpretation in terms of frequency of results)

Linear Regression

There is a significant linear relationship (after adjusting for the pre-surgery Hb value) between weight and post-surgery Hb.

X is on average C units higher for every one-unit increase in Y, after adjusting for Z.

The 95% CI for the estimate means it could be as low as 0.07 units or as high as 0.17 units.

We are 95% confident that the true population mean lies between 0.07 and 0.17

Logistic, Poisson, Cox Regression

Null hypothesis: the odds ratio / incidence rate ratio / hazard ratio of X compared to Y = 1 (OR/IRR/HR = 1)

Alternative hypothesis: the odds ratio / incidence rate ratio / hazard ratio of X compared to Y ≠ 1 (OR/IRR/HR ≠ 1)

 

Interpretation

Value of Ratio (*Ratio is exponential of regression coefficient)

  • (Binary variate) The OR/IRR/HR of X compared to Y is A.

(or: the odds / incidence rate / hazard is (1−A)×100% lower or A times higher in X compared to Y.)

  • (Numerical variate) The odds / incidence rate / hazard of X decreases/increases by (1−A)×100% per unit decrease/increase in Y, adjusted for Z.

P value

  • This is not statistically significant (since p>0.05), so this association may result from random chance (in the absence of any true association in the population). It is plausible that the true OR/ IRR/ HR might be one.
  • Since p<0.001, we have very strong evidence to reject the null hypothesis of no association and to conclude that there is a decrease/increase in the odds / incidence rate per month / hazard of X as Y increases, adjusted for Z.

95% CI (*Value is exponential of 95% CI of regression coefficient)

We are 95% sure/confident that the true odds ratio / incidence rate ratio per month / hazard ratio is between 56% lower and 20% higher in X compared to Y. [when the 95% CI on the OR/IRR/HR = 0.44 to 1.20]

We are 95% sure/confident that the true odds ratio / incidence rate ratio per month / hazard ratio lies between 0.81 and 2.96, so we can’t be sure whether X is slightly better or three times worse than Y. [when the 95% CI on the OR/IRR/HR = 0.81 to 2.96]
