Statistics

Features of a graph

Mode: Unimodal/Bimodal

Center

Mean

Median - not affected by outlier values

Spread: measured by range, inter-quartile range (IQR), standard deviation

Standard deviation

Outliers: observations more than 1.5 x IQR below Q1 or above Q3

Shape: Symmetric/skewed/uniform/bell-shaped

Quantitative data

Represented by stemplots, histograms

Relative standing

Z-score

Percentile

Density curve

Normal distribution (Gaussian Distribution)

Symmetric, single-peaked & bell-shaped

Centered at the mean

68-95-99.7 Rule: about 68% of observations lie within 1σ of the mean, 95% within 2σ, 99.7% within 3σ

Assessed by Normal probability plot

Mean: μ

Standard deviation: σ
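
A minimal Python sketch (SciPy assumed; the values μ = 100, σ = 15 are invented for illustration) of the 68-95-99.7 rule and of z-scores/percentiles for relative standing:

    # Sketch: 68-95-99.7 rule and z-scores for an assumed N(100, 15) population.
    from scipy.stats import norm

    mu, sigma = 100, 15                      # assumed mean and standard deviation
    for k in (1, 2, 3):
        # proportion of observations within k standard deviations of the mean
        p = norm.cdf(mu + k*sigma, mu, sigma) - norm.cdf(mu - k*sigma, mu, sigma)
        print(f"within {k} sd: {p:.4f}")     # ~0.6827, 0.9545, 0.9973

    x = 130
    z = (x - mu) / sigma                     # z-score: relative standing of x
    print(f"z = {z:.2f}, percentile = {norm.cdf(z):.3f}")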

Permutation & Combination

Addition principle: m + n

Multiplication principle: m x n

Permutation: arrangement of objects taken from a set

With n distinct objects: n!

Choosing r objects from n distinct objects: nPr = n!/(n-r)!

With p identical objects: n!/p!

Circular permutation: (n - 1)!

Combination: unordered selection of r objects from a set of n: nCr = n!/(r!(n-r)!)
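
A short Python check of these counting formulas (the values n = 7, r = 3, p = 2 are arbitrary examples; math.perm and math.comb need Python 3.8+):

    # Sketch: counting formulas with arbitrary example values.
    from math import factorial, perm, comb

    n, r, p = 7, 3, 2
    print(factorial(n))                   # arrangements of n distinct objects: n!
    print(perm(n, r))                     # ordered selections: nPr = n!/(n-r)!
    print(factorial(n) // factorial(p))   # n objects of which p are identical: n!/p!
    print(factorial(n - 1))               # circular permutation: (n-1)!
    print(comb(n, r))                     # unordered selections: nCr = n!/(r!(n-r)!)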

Relationship

Variable

Response/dependent variable

Explanatory/independent variable

Lurking variable

Scatterplot

Direction

Form

Strength

Correlation

Only measures linear relationship

Always between -1 and 1

Not a resistant measure

Association

Positive

Negative

Explanations

Causation

Common response

Confounding effect

Least Squares Regression line

Residual: ei = observed y - predicted y

Residual plot

Should show no clear pattern

Curved pattern indicates a non-linear relationship

Standard deviation of the residuals: measures the typical size of the prediction errors

Transforming variables: turns a non-linear relationship into a linear one

Power law model y = ax^p: ln y = ln a + p ln x

Exponential growth model y = ab^x: ln y = ln a + x ln b
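
A NumPy sketch (made-up data) of fitting the exponential growth model by regressing ln y on x and checking the residuals:

    # Sketch: fit y = a * b**x by least squares on the transformed data (made-up values).
    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 3.9, 8.2, 15.8, 32.5])        # roughly doubles each step

    slope, intercept = np.polyfit(x, np.log(y), 1)   # ln y = ln a + x ln b
    a, b = np.exp(intercept), np.exp(slope)
    print(f"y ~ {a:.2f} * {b:.2f}**x")

    residuals = np.log(y) - (intercept + slope * x)  # observed - predicted (transformed scale)
    print(residuals)                                 # should show no clear pattern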

Categorical data

Represented by: Pie charts, dotplots, bar charts

Two-way contingency table

Marginal distribution/marginal frequencies: row and column totals

Conditional distribution

Sampling distribution

Sample

Statistic: Number that can be computed from sample data

Mean

Proportion

Population

Parameter: Number that describes the population

Population proportion: p

Mean of population: μ

Unbiased: Mean of sampling distribution equal to true value of parameter

Variability: Described by spread of sampling distribution

Central limit theorem: for large sample sizes, the sampling distribution of the sample mean is approximately Normal

Mean, μ

Standard deviation: σ/√n
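
A small simulation sketch (skewed exponential population with invented parameters) showing that the sample mean's sampling distribution has mean ≈ μ and standard deviation ≈ σ/√n:

    # Sketch: sampling distribution of the sample mean from a skewed population.
    import numpy as np

    rng = np.random.default_rng(0)
    mu = sigma = 2.0                     # exponential(scale=2) has mean 2 and sd 2
    n = 40                               # sample size
    means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

    print(means.mean())                  # close to mu = 2
    print(means.std())                   # close to sigma / sqrt(n), about 0.316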

Distribution

1) Binomial distribution: X ~ B(n, p)

Conditions

Only 2 outcomes in trial, "success" or "failure"

Fixed number n of trials

Trials are independent

Probability of success for each trial is the same

Mean, μx = np

Standard deviation: σx = √(np(1-p))

Normal approximation

np ≥10 and n(1-p) ≥10
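
A Python sketch (SciPy assumed; n = 50, p = 0.3 are arbitrary) of the binomial mean, standard deviation, and the normal approximation check:

    # Sketch: X ~ B(50, 0.3) with arbitrary example parameters.
    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 50, 0.3
    mean, sd = n * p, sqrt(n * p * (1 - p))          # mu_X = np, sigma_X = sqrt(np(1-p))
    print(mean, sd)

    print(n * p >= 10 and n * (1 - p) >= 10)         # normal approximation justified?
    print(binom.cdf(20, n, p))                       # exact P(X <= 20)
    print(norm.cdf(20, loc=mean, scale=sd))          # normal approximation to it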

2) Geometric distribution

Conditions

Each observation falls into one of two categories, "success" or "failure"

Observations are all independent

Probability of success same for each observation

Variable of interest X is the number of trials required to obtain the first success

Mean: μ = 1/p

Variance: σ² = (1-p)/p²
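
A SciPy sketch of the geometric distribution (p = 0.2 is an arbitrary success probability):

    # Sketch: number of trials required for the first success, p = 0.2 (arbitrary).
    from scipy.stats import geom

    p = 0.2
    print(geom.mean(p))     # mu = 1/p = 5.0
    print(geom.var(p))      # variance = (1-p)/p**2 = 20.0
    print(geom.pmf(3, p))   # P(first success on trial 3) = (1-p)**2 * p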

3) Poisson distribution: X ~ Po (λ)

Conditions

Events occur singly and randomly

Events occur uniformly

Events occur independently

Probability of two or more occurrences within a very small interval is negligible

E(X) = λ

Var(X) = λ

Additive property: if X ~ Po(λ) and Y ~ Po(μ) are independent, X + Y ~ Po(λ+μ)

Approximating Binomial with Poisson: X ~ Po(np)

n is large (>50) and np < 5
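
A SciPy sketch of a Poisson model and of the Poisson approximation to the binomial (λ = 3.5 and the binomial n, p are invented examples):

    # Sketch: X ~ Po(3.5), plus approximating B(n, p) by Po(np).
    from scipy.stats import poisson, binom

    lam = 3.5
    print(poisson.mean(lam), poisson.var(lam))       # both equal lambda
    print(poisson.pmf(2, lam))                       # P(X = 2)

    n, p = 200, 0.01                                 # n large (>50) and np = 2 < 5
    print(binom.pmf(3, n, p), poisson.pmf(3, n * p)) # the two pmfs are close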

Designing samples

Population: Entire group of individuals that we want information about

Sample: Part of population we examine in order to gather information

1) Census: Attempts to contact every individual in the entire population

Advantages: Able to find out all characteristics of the population accurately

Disadvantages: Expensive, time-consuming

2) Sampling: Involves studying a part in order to gain information about the whole

Advantages: Cheaper, less time needed

Disadvantages: May miss certain characteristics of the population

Cautions

Undercoverage - some groups in population left out

Nonresponse - individual can't be contacted or does not cooperate

Response bias - result of behaviour of respondent or interviewer

Types of sampling

1) Voluntary response sampling: People choose themselves by responding to general appeal

Advantages: Easy to sample

Disadvantages: Respondents are self-selected and may not be representative of the population

2) Convenience sampling: Choosing individuals who are easiest to reach

Advantages: Easy to sample

Disadvantages: Those easiest to reach may not be representative of the population

3) Simple Random Sampling (SRS): Sample of n individuals chosen so that every set of n individuals has an equal chance of being selected

Steps

1) Label

2) Table - Use random number table

3) Stopping Rule - Indicate when you should stop sampling

4) Identify Sample

4) Stratified Random sampling: Divide population into strata, choose a separate SRS within each stratum, and combine the SRSs to form the full sample

5) Cluster sampling: Divide population into clusters, then randomly select some of the clusters
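
A minimal Python sketch of a simple random sample and a stratified random sample (the population and the two strata are invented for illustration):

    # Sketch: SRS and stratified random sampling from an invented population.
    import random

    random.seed(1)
    population = [f"person{i}" for i in range(100)]

    srs = random.sample(population, 10)          # every set of 10 is equally likely
    print(srs)

    strata = {"A": population[:60], "B": population[60:]}      # invented strata
    stratified = [unit for group in strata.values()
                  for unit in random.sample(group, 5)]         # separate SRS per stratum
    print(stratified)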

Designing experiment

Design

Experimental units: Individuals on which experiment is being done

Subjects: Human beings used as experimental units

Treatment: Specific experimental condition applied to units

Factor: Explanatory variable of experiment

Control group: Group that does not receive the treatment

Block (design): Group of experimental units that are known before experiment to be similar in some way

Example of block design: Matched pair design

Comparison of responses between treatment group and control group reduces problems posed by confounding and lurking variables

Placebo: dummy treatment

Replication: reduces the role of chance variation and increases the sensitivity of the experiment

Randomization: use of chance to divide experimental units into groups

Cautions

Placebo effect: response to dummy treatment

Lack of realism - sometimes impossible to duplicate conditions

Probability

Terms

Sample space S: Set of all possible outcomes

Event: Any outcome or set of outcomes of random phenomenon

Probability model: Mathematical description of sample space S and way of assigning probabilities

0 ≤ P(A) ≤ 1

P(A) = 1, event A will certainly occur

P(A) = 0, event A will certainly not occur

Both event A and B occur: A ∩ B

Either event A or B occurs (or both occur): A U B

P(A U B) = P(A) + P(B) - P(A ∩ B)

P(A and B) = P(A) x P(B) if A and B are independent

If A and B are mutually exclusive, P(A | B) = 0

If A and B are independent, P (A | B) = P(A)
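
A tiny Python check of these rules on an invented example (one fair die, A = even, B = greater than 3):

    # Sketch: addition rule and conditional probability on a fair die (invented events).
    from fractions import Fraction

    S = {1, 2, 3, 4, 5, 6}
    A = {2, 4, 6}                      # even
    B = {4, 5, 6}                      # greater than 3

    P = lambda E: Fraction(len(E), len(S))
    print(P(A | B))                            # P(A U B) directly
    print(P(A) + P(B) - P(A & B))              # addition rule gives the same value
    print(P(A & B) / P(B))                     # P(A | B) = P(A n B) / P(B)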

Confidence interval

2 parts

1) Confidence interval of form: statistic ± margin of error

2) Confidence level C: success rate for the method

1) Confidence interval for population mean, σ known

Conditions

Sample must be SRS from population of interest

Sampling distribution of the mean is approximately Normal: population is Normal, or n is large (at least 30) so the central limit theorem applies

Individual observations are independent, population size at least 10x sample size

Margin of error

Decreases when

Confidence level C decreases

Population standard deviation σ decreases

Sample size n increases
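
A sketch of the level C z interval for μ with σ known (x̄ = 24.3, σ = 5, n = 40, C = 0.95 are made-up summary values; SciPy assumed):

    # Sketch: z confidence interval for a population mean, sigma known (made-up numbers).
    from math import sqrt
    from scipy.stats import norm

    xbar, sigma, n, C = 24.3, 5.0, 40, 0.95
    z_star = norm.ppf((1 + C) / 2)             # critical value, about 1.96 for C = 0.95
    margin = z_star * sigma / sqrt(n)          # margin of error
    print((xbar - margin, xbar + margin))      # statistic +/- margin of error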

2) Confidence interval for μ (σ unknown): t procedures

Conditions

Data are SRS from population of interest or randomized experiment

Observations come from a Normally distributed population, or the distribution is roughly symmetric and single-peaked

Assumes individual observations are independent, population size at least 10x sample size

Paired t Procedure: Compare responses to two treatments in matched-pair design or before-and-after measurements

μ: Mean difference in response or between before-and-after measurements

3) Confidence interval for population proportion

Conditions

Data are SRS from population of interest

n is so large that np and n(1-p) are 10 or more

Individual observations are independent, population at least 10x sample
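
A sketch of the level C interval for a population proportion (64 successes in n = 200 and C = 0.95 are invented; SciPy assumed):

    # Sketch: confidence interval for a proportion (invented counts).
    from math import sqrt
    from scipy.stats import norm

    successes, n, C = 64, 200, 0.95
    p_hat = successes / n
    assert n * p_hat >= 10 and n * (1 - p_hat) >= 10   # normality condition

    z_star = norm.ppf((1 + C) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    print((p_hat - margin, p_hat + margin))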

Testing claim: Significance test

Conditions

SRS from population of interest

Normality conditions: large sample size (n ≥ 30) for means or np ≥ 10 and n(1-p) ≥ 10 for proportions

Independent observations

Steps

1) Identify population parameter

2) State null hypothesis Ho and alternative hypothesis Ha

3) Calculate statistic that estimates parameter

4) Use of P-value

Result with small P-value (less than 0.05) is statistically significant, strong evidence against Ho

5) Significance level α

P-value as small as or smaller than α: data are statistically significant at level α

Test statistic

One-sample

One-sample z statistic: z = (x̄ - μo) / (σ/√n)

P-value

Ha: μ > μo is P(Z ≥ z)

Ha: μ < μo is P(Z ≤ z)

Ha: μ ≠ μo is 2P(Z ≥ |z|)

One-sample t statistic: t = (x̄ - μo) / (s/√n), with n - 1 degrees of freedom

P-value

Ha: μ > μo is P(T ≥ t)

Ha: μ < μo is P(T ≤ t)

Ha: μ ≠ μo is 2P(T ≥ |t|)

One-proportion z statistic: z = (p̂ - po) / √(po(1-po)/n)

P-value

Ha: p > po is P(Z ≥ z)

Ha: p < po is P(Z ≤ z)

Ha: p ≠ po is 2P(Z ≥ |z|)

Normality condition: npo and n(1-po) are both at least 10
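
A sketch of the one-sample t test and the one-proportion z test (the data, the hypothesized μo = 10, and the counts are all invented; SciPy assumed):

    # Sketch: one-sample t test and one-proportion z test with invented data.
    from math import sqrt
    from scipy.stats import norm, ttest_1samp

    # One-sample t: Ho: mu = 10 vs Ha: mu != 10
    data = [9.1, 10.4, 11.2, 9.8, 10.9, 10.3, 9.5, 10.8]
    t, p_value = ttest_1samp(data, popmean=10)
    print(t, p_value)                          # two-sided P-value = 2P(T >= |t|)

    # One-proportion z: Ho: p = 0.5 vs Ha: p > 0.5, 60 successes in 100 trials
    p0, successes, n = 0.5, 60, 100
    z = (successes / n - p0) / sqrt(p0 * (1 - p0) / n)
    print(z, 1 - norm.cdf(z))                  # P-value = P(Z >= z)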

Two-sample

Two-sample z statistic: z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2)

Two-sample t statistic: t = ((x̄1 - x̄2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2)

level C confidence interval

Conditions

Two SRSs from two distinct populations

Samples are independent, each population must be at least 10x as large as corresponding sample size

Both populations are Normally distributed (in practice: check that the distributions have similar shapes and the data have no strong outliers)

Approximation for degrees of freedom

Both sample sizes are at least 5

Two-proportion z interval

level C confidence interval

Conditions

Two samples can be viewed as SRS from respective populations

Two samples are independent, each population at least 10x as large as corresponding sample size

Counts of "success" and "failure" are all at least 5

Two-proportion z test

Conditions

Two samples can be viewed as SRSs from their respective populations

Two samples are independent, each population at least 10x as large as corresponding sample size

Estimated counts of "successes" and "failures" are all at least 10

Combined sample proportion: p̂ = (total successes in both samples) / (n1 + n2)
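
A sketch of the two-proportion z test using the combined sample proportion (the counts are invented; SciPy assumed for the Normal tail area):

    # Sketch: two-proportion z test with pooled proportion (invented counts).
    from math import sqrt
    from scipy.stats import norm

    x1, n1 = 45, 120                  # successes and sample size, group 1
    x2, n2 = 30, 130                  # successes and sample size, group 2

    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                       # combined sample proportion
    se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z = (p1_hat - p2_hat) / se
    print(z, 2 * (1 - norm.cdf(abs(z))))                 # two-sided P-value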

Confidence interval

Duality: link between two-sided significance test and confidence interval

Two-sided hypothesis test - Significance test and confidence interval will yield same conclusion

level C = 1 - α

Errors

Type I error

Ho is true, reject Ho

α = probability of Type I error

Type II Error

Ha is true, fail to reject Ho

Power: Probability that significance test will reject Ho when particular alternative is true = 1 - β

Increases

Increase α

Consider particular alternative that is farther away from μo

Increase sample size

Decrease σ through improving measurement process

Chi-square

Test statistic: χ² = Σ (observed count - expected count)² / expected count

Condition: All expected counts are at least 5

Test for Homogeneity of populations

Conditions

No more than 20% of expected counts are less than 5

All individual expected counts are at least 1
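
A SciPy sketch of a chi-square test on a two-way table (the counts are invented):

    # Sketch: chi-square test on an invented two-way table.
    from scipy.stats import chi2_contingency

    observed = [[30, 20, 10],         # rows: groups / populations
                [25, 25, 15]]         # columns: response categories

    chi2, p_value, df, expected = chi2_contingency(observed)
    print(chi2, p_value, df)
    print(expected)                   # check: all expected counts at least 5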