Statistics
Features of a graph
Mode: Unimodal/Bimodal
Center
Mean
Median - not affected by outlier values
Spread: how much the values vary, e.g. range or interquartile range
Standard deviation
Outliers: values more than 1.5 x interquartile range below Q1 or above Q3
Shape: Symmetric/skewed/uniform/bell-shaped
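The center, spread, and outlier measures above can be sketched in Python with the standard library. The data values are made up for illustration, and the quartiles use the default method of `statistics.quantiles`, which is only one of several quartile conventions.

```python
import statistics

# Hypothetical data set with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 40]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)            # sample standard deviation
data_range = max(data) - min(data)

# Quartiles (default "exclusive" method; other conventions give slightly different values).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# 1.5 x IQR rule: flag values far below Q1 or far above Q3.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(mean, median, sd, data_range, iqr, outliers)
```

The single large value pulls the mean well above the median, which is why the median is described as not affected by outliers.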
Quantitative data
Represented by stemplots, histograms
Relative standing
Z-score
Percentile
Density curve
Normal distribution (Gaussian Distribution)
Symmetric, single-peaked & bell-shaped
Centered at the mean
68-95-99.7 Rule
Assessed by Normal probability plot
Mean: μ
Standard deviation: σ
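A short sketch of the z-score, percentile, and 68-95-99.7 rule using `statistics.NormalDist`. The mean of 100 and standard deviation of 15 are assumed values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical Normal population: mean 100, standard deviation 15.
mu, sigma = 100, 15
dist = NormalDist(mu, sigma)

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

print(z_score(130))                       # 2.0
print(within(1), within(2), within(3))    # roughly 0.68, 0.95, 0.997

# Percentile of x = 130: percentage of the distribution at or below it.
percentile_130 = dist.cdf(130) * 100
print(percentile_130)
```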
Permutation & Combination
Addition principle: m + n
Multiplication principle: m x n
Permutation: arrangement of objects taken from a set
With n distinct objects: n!
Choosing r objects from n distinct objects: n!/(n - r)!
With p identical objects among the n objects: n!/p!
Circular permutation: (n - 1)!
Combination: unordered selection of r objects from a set of n distinct objects: n!/(r!(n - r)!)
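The counting rules above can be checked with Python's `math` module. The values n = 5 and r = 3 and the word "LEVEL" are arbitrary examples; the formulas in the comments are the standard ones for each case.

```python
import math

n, r = 5, 3

arrangements_all = math.factorial(n)      # n distinct objects in a row: n!
arrangements_r = math.perm(n, r)          # r chosen from n, order matters: n!/(n-r)!
selections_r = math.comb(n, r)            # r chosen from n, order ignored: n!/(r!(n-r)!)
circular = math.factorial(n - 1)          # n distinct objects around a circle: (n-1)!

# Arrangements of the letters of "LEVEL": 5 letters, with 2 identical L's and 2 identical E's.
level = math.factorial(5) // (math.factorial(2) * math.factorial(2))

print(arrangements_all, arrangements_r, selections_r, circular, level)
```

Dividing the permutation count by r! gives the combination count, since each unordered selection of r objects can be arranged in r! ways.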
Relationship
Variable
Response/dependent variable
Explanatory/independent variable
Lurking variable
Scatterplot
Direction
Form
Strength
Correlation
Only measures linear relationship
Always between -1 and 1
Not a resistant measure
Association
Positive
Negative
Explanations
Causation
Common response
Confounding effect
Least Squares Regression line
Residual: observed y - predicted y = ei
Residual plot
Should show no clear pattern
Curved pattern indicates a non-linear relationship
Standard deviation
Transforming variables - non-linear relationship --> linear relationship
Power law model y = ax^p: ln y = ln a + p ln x
Exponential growth model y = ab^x: ln y = ln a + x ln b
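A minimal sketch of a least-squares fit, its residuals, and a log transformation. The helper `least_squares` is written here for illustration (it is not a library function), and the data are generated from an assumed power law y = 3x^2 so the result can be checked by hand.

```python
import math

def least_squares(xs, ys):
    """Return (intercept a, slope b) of the line y = a + b x that minimizes squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data generated from the power law y = 3 x^2.
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Straight-line fit on the raw data: residual = observed y - predicted y.
a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Fit after taking logs of both variables: ln y = ln a + p ln x.
ln_a, p = least_squares([math.log(x) for x in xs], [math.log(y) for y in ys])

print(residuals)                 # positive, negative, negative, negative, positive
print(math.exp(ln_a), p)         # close to 3 and 2
```

The residuals of the straight-line fit are positive at both ends and negative in the middle, the kind of curved pattern that suggests transforming the variables.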
Categorical data
Represented by: Pie charts, dotplots, bar charts
Two-way contingency table
Marginal distribution/marginal frequencies: row and column totals
Conditional distribution
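A small sketch of marginal and conditional distributions for a two-way table. The counts and the row and column labels are invented for illustration.

```python
# Hypothetical two-way table of counts: rows = group, columns = response.
table = {
    "A": {"yes": 30, "no": 20},
    "B": {"yes": 10, "no": 40},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Marginal distribution of the column variable: column totals as proportions of the table total.
marginal_cols = {col: total / grand_total for col, total in col_totals.items()}

# Conditional distribution of the response within each row: cell count / row total.
conditional = {
    row: {col: count / row_totals[row] for col, count in cells.items()}
    for row, cells in table.items()
}

print(marginal_cols)
print(conditional)
```

Comparing the conditional distributions across rows (here 60% "yes" in group A against 20% in group B) is what reveals an association between the two variables.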
Sampling distribution
Sample
Statistic: Number that can be computed from sample data
Mean
Proportion
Population
Parameter: Number that describes the population
Population proportion: p
Mean of population: μ
Unbiased: Mean of sampling distribution equal to true value of parameter
Variability: Described by spread of sampling distribution
Central limit theorem: For large sample size, sampling distribution of the sample mean is approximately Normal
Mean, μ
Standard deviation: σ/√n
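The central limit theorem can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample size of 50, and the 5000 repetitions are all assumed values, and the seed is fixed so that the run is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical skewed population: exponential with mean 1 and standard deviation 1.
mu, sigma = 1.0, 1.0
n = 50            # sample size
reps = 5000       # number of simulated samples

sample_means = [
    statistics.fmean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(reps)
]

# The sampling distribution of the sample mean should be centered near mu
# with standard deviation near sigma / sqrt(n).
print(statistics.fmean(sample_means), mu)
print(statistics.stdev(sample_means), sigma / n ** 0.5)
```

Even though the population is strongly skewed, a histogram of `sample_means` would look roughly bell-shaped.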
Distribution
1) Binomial distribution: X ~ B(n, p)
Conditions
Only 2 outcomes in trial, "success" or "failure"
Fixed number n of trials
Trials are independent
Probability of success for each trial is the same
Mean, μx = np
Standard deviation, σx = √(np(1-p))
Normal approximation
np ≥10 and n(1-p) ≥10
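A sketch comparing an exact binomial probability with its Normal approximation. The values n = 100 and p = 0.3 are assumed, `binom_pmf` is a helper defined here for illustration, and the continuity correction (using 35.5 instead of 35) is an addition that the notes above do not mention.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.3                       # hypothetical values
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Normal approximation is reasonable here: np = 30 and n(1-p) = 70 are both at least 10.
exact = sum(binom_pmf(k, n, p) for k in range(0, 36))     # P(X <= 35)
approx = NormalDist(mean, sd).cdf(35.5)                   # with continuity correction

print(mean, sd, exact, approx)
```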
2) Geometric distribution
Conditions
Each observation falls into one of two categories, "success" or "failure"
Observations are all independent
Probability of success same for each observation
Variable of interest, X is no. of trials required for first success
Mean: 1/p
Variance: (1-p)/p²
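A brief sketch of the geometric distribution, where X counts the trials up to and including the first success. The probability p = 0.2 is an assumed value, `geom_pmf` is a helper defined here, and the mean 1/p and variance (1-p)/p² are the standard results for this definition of X.

```python
def geom_pmf(k, p):
    """P(X = k): the first success occurs on trial k (k = 1, 2, ...)."""
    return (1 - p) ** (k - 1) * p

p = 0.2                                   # hypothetical probability of success
mean = 1 / p                              # expected number of trials
variance = (1 - p) / p ** 2

# P(X > 3): the first three trials are all failures.
p_more_than_3 = 1 - sum(geom_pmf(k, p) for k in range(1, 4))

print(mean, variance, p_more_than_3, (1 - p) ** 3)
```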
3) Poisson distribution: X ~ Po (λ)
Conditions
Events occur singly and randomly
Events occur uniformly
Events occur independently
Probability of two or more events occurring within a very small interval is negligible
E(X) = λ
Var(X) = λ
Additive property: if X ~ Po(λ) and Y ~ Po(μ) are independent, X + Y ~ Po(λ + μ)
Approximating Binomial with Poisson: X ~ Po(np)
n is large (>50) and np < 5
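A sketch of the Poisson approximation to the binomial and a numerical check of the additive property. The values n = 200 and p = 0.01 are assumed, and both probability functions are helpers written here for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical binomial with n large and np small: n = 200, p = 0.01, so np = 2.
n, p = 200, 0.01
lam = n * p

for k in range(5):
    print(k, binom_pmf(k, n, p), poisson_pmf(k, lam))

# Additive property check: for independent X ~ Po(1) and Y ~ Po(2),
# P(X + Y = 2) summed over the cases X = 0, 1, 2 matches Po(3).
lhs = sum(poisson_pmf(j, 1) * poisson_pmf(2 - j, 2) for j in range(3))
rhs = poisson_pmf(2, 3)
print(lhs, rhs)
```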
Designing samples
Population: Entire group of individuals that we want information about
Sample: Part of population we examine in order to gather information
1) Census: Attempts to contact every individual in the entire population
Advantages: Able to find out all characteristics of the population accurately
Disadvantages: Expensive, time-consuming
2) Sampling: Involves studying a part in order to gain information about the whole
Advantages: Cheaper, less time needed
Disadvantages: May miss out certain characteristics of population
Cautions
Undercoverage - some groups in population left out
Nonresponse - individual can't be contacted or does not cooperate
Response bias - result of behaviour of respondent or interviewer
Types of sampling
1) Voluntary response sampling: People choose themselves by responding to general appeal
Advantages: Easy to sample
Disadvantages: People are biased - may not be representative of population
2) Convenience sampling: Choosing individuals who are easiest to reach
Advantages: Easy to sample
Disadvantages: People are biased - may not be representative of population
3) Simple Random Sampling (SRS): Consists of n individuals chosen so that every set of n individuals has an equal chance of being selected
Steps
1) Label
2) Table - Use random number table
3) Stopping Rule - Indicate when you should stop sampling
4) Identify Sample
4) Stratified random sampling: Divide population into strata, choose a separate SRS within each stratum, then combine these SRSs to form the full sample
5) Cluster sampling: Divide population into clusters, then randomly select some of the clusters
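The three random sampling designs above can be sketched with `random.sample`, which here stands in for the random number table. The population of 20 labelled individuals, the two strata, and the clusters of 4 are all invented for illustration.

```python
import random

random.seed(1)  # fixed seed so the example is repeatable

# Hypothetical population of 20 labelled individuals in two strata.
population = [{"id": i, "stratum": "junior" if i < 12 else "senior"} for i in range(20)]

# Simple random sample of n = 5: every set of 5 individuals is equally likely.
srs = random.sample(population, 5)

# Stratified random sample: a separate SRS within each stratum, then combined.
strata = {}
for person in population:
    strata.setdefault(person["stratum"], []).append(person)
stratified = []
for members in strata.values():
    stratified.extend(random.sample(members, 2))

# Cluster sample: split into clusters of 4, then randomly select whole clusters.
clusters = [population[i:i + 4] for i in range(0, 20, 4)]
cluster_sample = [person for cluster in random.sample(clusters, 2) for person in cluster]

print([person["id"] for person in srs])
print([person["id"] for person in stratified])
print([person["id"] for person in cluster_sample])
```

The stratified sample is guaranteed to contain members of every stratum, while the cluster sample keeps or drops each cluster as a whole.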
Designing experiment
Design
Experimental units: Individuals on which experiment is being done
Subjects: Human beings used as experimental units
Treatment: Specific experimental condition applied to units
Factor: Explanatory variable of experiment
Control group: Group that does not receive the treatment
Block (design): Group of experimental units that are known before experiment to be similar in some way
Example of block design: Matched pair design
Comparison of response between treatment group and control group - reduce problems posed by confounding and lurking variables
Placebo: dummy treatment
Replication - reduce role of chance variation and increase sensitivity of experiment
Randomization: use of chance to divide experimental units into groups
Cautions
Placebo effect: response to dummy treatment
Lack of realism - sometimes impossible to duplicate conditions
Probability
Terms
Sample space S: Set of all possible outcomes
Event: Any outcome or set of outcomes of random phenomenon
Probability model: Mathematical description of sample space S and way of assigning probabilities
0 ≤ P(A) ≤ 1
P(A) = 1, event A will certainly occur
P(A) = 0, event A will certainly not occur
Both event A and B occur: A ∩ B
Either event A or B occurs, or both occur: A ∪ B
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
P(A and B) = P(A) x P(B) if A and B are independent
If A and B are mutually exclusive, P(A | B) = 0
If A and B are independent, P(A | B) = P(A)
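The probability rules above can be verified on a small sample space. This sketch assumes one roll of a fair six-sided die with equally likely outcomes; the events A, B, and C are chosen for illustration. Note that in the code, `|` and `&` between sets mean union and intersection, not conditional probability.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (hypothetical example).
S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}        # even number
B = {1, 2, 3, 4}     # at most 4
C = {1, 3, 5}        # odd number

# General addition rule (A | B is the set union here).
assert P(A | B) == P(A) + P(B) - P(A & B)

# A and B are independent: P(A and B) = P(A) x P(B), so P(A given B) = P(A).
assert P(A & B) == P(A) * P(B)
assert P(A & B) / P(B) == P(A)

# A and C are mutually exclusive: they cannot occur together, so P(A given C) = 0.
assert P(A & C) == 0
assert P(A & C) / P(C) == 0

print(P(A), P(B), P(A & B), P(A | B))
```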
Confidence interval
2 parts
1) Confidence interval of form: statistic ± margin of error
2) Confidence level C: success rate for the method
1) Confidence interval for population mean, σ known
Conditions
Sample must be SRS from population of interest
Sampling distribution of the sample mean is approximately Normal: either the population is Normal, or n is large (at least 30) so the central limit theorem applies
Individual observations are independent, population size at least 10x sample size
Margin of error
Decreases when:
Confidence level C decreases
Population standard deviation σ decreases
Sample size n increases
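A sketch of the interval statistic ± margin of error for a population mean with σ known, where the margin of error is z* x σ/√n. The function `z_interval` and all numbers are hypothetical; the repeated calls show how the margin of error responds to n, C, and σ.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, level=0.95):
    """Confidence interval for a population mean when sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)    # critical value z*
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical numbers: sample mean 50, sigma 10, n = 25.
print(z_interval(50, 10, 25))            # baseline
print(z_interval(50, 10, 100))           # larger n: narrower interval
print(z_interval(50, 10, 25, 0.90))      # lower confidence level: narrower interval
print(z_interval(50, 5, 25))             # smaller sigma: narrower interval
```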
2) Confidence interval for population mean, σ unknown (t procedures)
Conditions
Data are SRS from population of interest or randomized experiment
Observations from population have a Normal distribution, or the distribution is at least symmetric and single-peaked
Assumes individual observations are independent, population size at least 10x sample size
Paired t Procedure: Compare responses to two treatments in matched-pair design or before-and-after measurements
μ: Mean difference in response or between before-and-after measurements
3) Confidence interval for population proportion
Conditions
Data are SRS from population of interest
n is so large that np and n(1-p) are 10 or more
Individual observations are independent, population at least 10x sample
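A sketch of the large-sample interval for a population proportion, p̂ ± z* x √(p̂(1-p̂)/n). The function name and the counts are hypothetical, and the large-sample condition is checked using the observed counts of successes and failures, which is an assumption about how np and n(1-p) are estimated in practice.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, level=0.95):
    """Large-sample confidence interval for a population proportion."""
    # Large-sample condition, checked with observed counts (assumed convention).
    if successes < 10 or n - successes < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 successes out of 400.
print(proportion_interval(120, 400))
```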
Testing claim: Significance test
Conditions
SRS from population of interest
Normality conditions: large sample size (n ≥ 30) for means or np ≥ 10 and n(1-p) ≥ 10 for proportions
Independent observations
Steps
1) Identify population parameter
2) State null hypothesis Ho and alternative hypothesis Ha
3) Calculate statistic that estimates parameter
4) Use of P-value
Result with small P-value (less than 0.05) is statistically significant, strong evidence against Ho
5) Significance level α
P-value as small as or smaller than α: data statistically significant at level α
Test statistic
One-sample
One sample z statistic
P-value
Ha: μ > μo is P(Z ≥ z)
Ha: μ < μo is P(Z ≤ z)
Ha: μ ≠ μo is 2P(Z ≥ |z|)
One sample t statistic
P-value
Ha: μ > μo is P(T ≥ t)
Ha: μ < μo is P(T ≤ t)
Ha: μ ≠ μo is 2P(T ≥ |t|)
One-proportion z statistic
P-value
Ha: p > po is P(Z ≥ z)
Ha: p < po is P(Z ≤ z)
Ha: p ≠ po is 2P(Z ≥ |z|)
Normality condition: npo and n(1-po) are both at least 10
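A sketch of the z statistics and the three P-value cases listed above. The formulas z = (x̄ - μo)/(σ/√n) and z = (p̂ - po)/√(po(1-po)/n) are the usual textbook forms and are not written out in these notes; the function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def p_value(z, alternative):
    """P-value for a z statistic; alternative is 'greater', 'less', or 'two-sided'."""
    if alternative == "greater":
        return 1 - Z.cdf(z)               # P(Z >= z)
    if alternative == "less":
        return Z.cdf(z)                   # P(Z <= z)
    return 2 * (1 - Z.cdf(abs(z)))        # 2 P(Z >= |z|)

def one_sample_z(x_bar, mu0, sigma, n):
    """One-sample z statistic for a mean with sigma known."""
    return (x_bar - mu0) / (sigma / math.sqrt(n))

def one_proportion_z(p_hat, p0, n):
    """One-proportion z statistic; the standard error uses the hypothesized p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical test of Ho: mu = 100 against Ha: mu > 100.
z = one_sample_z(x_bar=103, mu0=100, sigma=15, n=100)
print(z, p_value(z, "greater"), p_value(z, "two-sided"))
```

With these numbers z = 2.0, so the one-sided P-value is below 0.05 and the result would be called statistically significant at α = 0.05.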
Two-sample
Two-sample z statistic
Two-sample t-statistic
level C confidence interval
Conditions
Two SRSs from two distinct populations
Samples are independent, each population must be at least 10x as large as corresponding sample size
Both populations are Normally distributed: Distributions have similar shapes and data have no strong outliers
Approximation for degrees of freedom
Both sample sizes are at least 5
Two-proportion z interval
level C confidence interval
Conditions
Two samples can be viewed as SRS from respective populations
Two samples are independent, each population at least 10x as large as corresponding sample size
Counts of "success" and "failure" are all at least 5
Two-proportion z test
Conditions
Two samples can be viewed as SRS from respective populations
Two samples are independent, each population at least 10x as large as corresponding sample size
Estimated counts of "successes" and "failures" are all at least 10
Combined sample proportion
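A sketch of the two-proportion z test using the combined sample proportion. The function name and the counts are hypothetical, and the standard error formula √(p_c(1-p_c)(1/n1 + 1/n2)) is the usual pooled form, which the notes do not spell out.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided test of Ho: p1 = p2 using the combined (pooled) sample proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)                        # combined sample proportion
    se = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60 successes out of 100 versus 45 out of 100.
z, p = two_proportion_z_test(60, 100, 45, 100)
print(z, p)
```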
Confidence interval
Duality: link between two-sided significance test and confidence interval
Two-sided hypothesis test - Significance test and confidence interval will yield same conclusion
level C = 1 - α
Errors
Type I error
Ho is true, reject Ho
α = probability of Type I error
Type II Error
Ha is true, fail to reject Ho
Power: Probability that significance test will reject Ho when particular alternative is true = 1 - β
Increases when:
Increase α
Consider particular alternative that is farther away from μo
Increase sample size
Decrease σ through improving measurement process
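A sketch of how power responds to each of the four changes listed above, for a one-sided z test of Ho: μ = μo against Ha: μ > μo. The function `power_one_sided_z` and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def power_one_sided_z(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the z test of Ho: mu = mu0 against Ha: mu > mu0 when the true mean is mu_alt."""
    z_star = Z.inv_cdf(1 - alpha)                       # reject Ho when z >= z_star
    shift = (mu_alt - mu0) / (sigma / math.sqrt(n))     # how far the alternative moves z
    return 1 - Z.cdf(z_star - shift)

base = power_one_sided_z(mu0=100, mu_alt=103, sigma=15, n=25)
print(base)
print(power_one_sided_z(100, 103, 15, 25, alpha=0.10))   # larger alpha
print(power_one_sided_z(100, 106, 15, 25))               # alternative farther from mu0
print(power_one_sided_z(100, 103, 15, 100))              # larger sample size
print(power_one_sided_z(100, 103, 10, 25))               # smaller sigma
```

Each of the four variations gives a higher power than the baseline, and β (the probability of a Type II error) is 1 minus the printed value.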
Chi-square
Test statistic
Condition: All expected counts are at least 5
Test for Homogeneity of populations
Conditions
No more than 20% of expected counts less than 5
All individual expected counts at least 1
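A sketch of the chi-square statistic for a two-way table, using expected count = row total x column total / table total and the sum of (observed - expected)²/expected over all cells. The observed counts are invented, and finding the P-value (from a chi-square table or software) is left out.

```python
# Hypothetical two-way table of observed counts (rows = populations, columns = categories).
observed = [
    [30, 20, 10],
    [20, 30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total x column total / table total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)    # (rows - 1) x (columns - 1)

# Conditions: no more than 20% of expected counts below 5, and all at least 1.
cells = [e for row in expected for e in row]
conditions_ok = min(cells) >= 1 and sum(e < 5 for e in cells) <= 0.2 * len(cells)

print(expected, chi_square, df, conditions_ok)
```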