Statistics

Exploring relationships between data

Variables

Response variable

Value depends on explanatory variable

Plotted on the y axis

Explanatory variables

Explains the change in the response variable and affects its value

Plotted on the x axis

Scatterplot

Scatterplots are used to study the relationship between 2 quantitative variables

The variables are plotted and the relationship is noted

Relationship

Negative

Larger values of one variable are associated with smaller values of the other variable

Positive

Larger values of one variable are associated with larger values of the other variable

Points to note

Direction

Overall direction of relationship

Strength

Strength of relationship

Outliers

Outliers must be noted

Correlation

Denoted by r

Helps to measure the strength and direction of a relationship between 2 variables

If r > 0 relationship is positive and if r < 0 relationship is negative

If r is near 0, there may be no relationship or a strong non-linear relationship; if r is close to 1 or -1, the relationship is strongly linear

-1≤r≤1

r is not a resistant measure
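
The points above can be checked numerically. The following is a minimal sketch in plain Python, using made-up data; the helper name pearson_r is an assumption for illustration. It shows r close to 1 for a near-linear pattern and r equal to 0 for a strong but non-linear (quadratic) pattern.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: a near-linear positive relationship
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r_linear = pearson_r(hours, score)

# Made-up data: y = x^2, a strong relationship that is not linear
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curve = pearson_r(xs, ys)
```

Here r_linear is close to 1, while r_curve is 0 even though y is completely determined by x.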

Regression Line

A line drawn through a scatter plot so that it is close to as many points as possible

Least Squares Regression Line

The line that minimizes the sum of the squares of the differences between the observed values and the values predicted by the line

y= a + bx

$b=r\frac{S_{y}}{S_{x}}$

Can be used to predict values
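
A minimal sketch of fitting the line y = a + bx and using it to predict, with made-up data. The function name least_squares_line is an assumption. The slope is computed directly from the deviations from the means, and the intercept follows from the fact that the line passes through the point of means.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Made-up data for illustration
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
a, b = least_squares_line(hours, score)

# Predict the response for a new explanatory value
predicted = a + b * 6
```

Predicting far outside the range of the observed x values (extrapolation) is less trustworthy than predicting within it.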

Residuals

Observed Y - Predicted Y

Sum of residuals is 0

Mean of residuals is 0

Plot of residuals

Plot of residuals must have no pattern

If there is a pattern in the residual plot

It means that a non linear relationship would better explain the data

Coefficient of determination

Denoted by r^2

Gives the proportion of the variation in the response variable that can be explained by the regression line
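
A short sketch tying residuals and r^2 together, on made-up data. It fits the least squares line, computes observed minus predicted for each point, and then computes r^2 as 1 minus the ratio of residual variation to total variation. All variable names are illustrative.

```python
# Made-up data for illustration
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(score) / n

# Least squares slope and intercept
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, score))
     / sum((x - mean_x) ** 2 for x in hours))
a = mean_y - b * mean_x

# Residual = observed y - predicted y
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(score, predicted)]

# r^2 = 1 - (sum of squared residuals) / (total sum of squares about the mean)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in score)
r_squared = 1 - ss_res / ss_tot
```

The residuals sum to 0 (up to rounding), and r_squared is about 0.987 for this data.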

Cautions

Outliers

Shows up as a point with a very large residual

Lurking variables

May affect the relationship between variables without being apparent

Measures of Data

Measuring the Center

Mean

Sum of all the elements of a set divided by the number of elements

Can be affected by outliers

Median

The middle value of a data set once it has been arranged in ascending or descending order

Resistant to outliers
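
A small illustration of resistance, using made-up numbers and the standard library statistics module: replacing one value with an outlier moves the mean a lot but leaves the median unchanged.

```python
from statistics import mean, median

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # largest value replaced by an outlier

# The mean moves from 3 to 22, the median stays at 3
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)
```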

Measuring the Spread

Range

The difference between the largest and smallest value

Interquartile Range

IQR can be used to remove influence of extreme values

It spans the middle 50% of the data (i.e. the bottom and top 25% have been removed)

Outliers

Values more than 1.5(IQR) below the 1st quartile or more than 1.5(IQR) above the 3rd quartile are outliers
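
A sketch of the 1.5(IQR) rule on made-up data. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" convention (the overall median is excluded from both halves when the count is odd), and the helper names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (an assumption).

    Needs at least 2 values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return median(lower), median(data), median(upper)

def iqr_outliers(values):
    """Values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [2, 4, 5, 7, 8, 9, 10, 12, 30]
q1, q2, q3 = quartiles(data)
value_range = max(data) - min(data)  # range: largest minus smallest
outliers = iqr_outliers(data)
```

For this data the fences are -5.25 and 20.75, so only 30 is flagged.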

Variance

Average of the squared deviations from the mean

Standard Deviation

Gives a measure of spread around the sample mean

Affected by outliers, use only when the mean is also used

n-1 is the degrees of freedom
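
A minimal sketch of the sample variance and standard deviation with the n-1 divisor, on made-up data, checked against the standard library functions statistics.variance and statistics.stdev (both use n-1).

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [4, 8, 6, 5, 7]
m = mean(data)

# Squared deviations from the mean, divided by the degrees of freedom n - 1
squared_devs = [(x - m) ** 2 for x in data]
sample_var = sum(squared_devs) / (len(data) - 1)
sample_sd = sqrt(sample_var)

# The standard library gives the same values
library_var = variance(data)
library_sd = stdev(data)
```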

Boxplots are very useful

Can give the IQR , Median and Outliers

Plotted to check for outliers usually

Modified boxplot is a boxplot with outliers plotted as separate points instead of being included in the whiskers

Measuring the Position

Simple Ranking

Arranging the elements in some order and noting where in that order a particular value falls

Percentile Ranking

Rank of 80% means that 80% of all values fall below that point or on it

The value at which p% of values fall below or on that point

Quartile ranking splits the data into 4 equal parts of 25% each

Z-Score

The number of standard deviations a value lies from the mean
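
As a one-line sketch (function name and numbers are made up): a value of 85 in a distribution with mean 70 and standard deviation 10 lies 1.5 standard deviations above the mean.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

z = z_score(85, 70, 10)  # positive: the value is above the mean
```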

Empirical Rule

a.k.a. the 68-95-99.7 rule

68% of data lies within 1 standard deviation of the mean

95% of the data lies within 2 standard deviations of the mean

99.7% of the data lies within 3 standard deviations of the mean

Useful for estimating the standard deviation when we know the spread of roughly normal data
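
The three percentages can be reproduced from the normal distribution itself. This sketch uses statistics.NormalDist from the Python standard library; the helper name share_within is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3
shares = {k: share_within(k) for k in (1, 2, 3)}
```

The rule is only a good guide when the data is approximately normal.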

Analysis of Data

Quantitative Data

Histogram

A plot of the observation on the x axis and the frequency on the y axis

Suitable for large quantities of data

Easy to visualize data

Total count of data is given by the sum of frequencies

Cumulative frequency plot

Somewhat similar to a histogram

The relative frequency of the data is summed and plotted

Stemplot

Similar to a histogram but the frequency is represented by leaves on a stem

Good to spot normality

Sometimes the plot may get too big

Trim the stemplot

Split the stem into 2

Very important to indicate legend

Categorical Data

Pie Charts

Each slice of the pie represents the proportion of a particular category

Very good for comparing the proportions of different categories

Bar Chart

Each bar represents a particular category

The height of the bar represents the amount of that category

Important to have the width the same

Dot Plots

Each category is plotted on one side and the frequency or value is indicated with dots on the other side of the graph

Important to state the legend

Shapes of Graphs

Uniform Distribution

Uniform if the histogram of the distribution is flat, with all bars at about the same height

An example is a fair die: the relative frequency of each number is 1/6

Skewed graphs

Skewed right

Skewed right if spreads far and thinly towards the higher values

Skewed left

Skewed left if it spreads far and thinly towards the lower values

Symmetric Distribution

2 halves of the graphs are mirror images

Bell shaped curve

If the distribution is symmetric with 2 sloping tails

Describing Graphs

Center

The point that separates the values (area of graph) roughly in half

Spread

Scope of values from smallest to largest

Clusters

Natural subgroups that the values fall into

Gaps

Holes in the distribution where no values fall

Outliers

Extreme values of a graph

Usually determined by the 1.5IQR rule

Modes

Unimodal

Only has one peak

Bimodal / Multimodal

Has more than one peak

Modes are the highest peaks in the graph

Sampling and Experimental study

Types of sampling

Bias

Bias is created when some factor systematically affects the results and pushes the outcome away from the true value

Nonresponse

Usually occurs in convenience sampling, when the individual does not respond to the survey

Small sample size

A small sample size would not give accurate results

Undercoverage

Some people might have been left out in the sample

Different Sampling Types

Random Sampling

SRS ( Simple random sampling)

A sample of a fixed size is chosen at random from a population, so that every possible sample of that size is equally likely

A random number table can be used to select the sample
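
In software, a random number generator plays the role of the random number table. A minimal sketch, assuming a made-up population of units numbered 1 to 100; random.sample from the standard library draws without replacement.

```python
import random

# Hypothetical population: units numbered 1 to 100
population = list(range(1, 101))

# Fixed seed only so that the sketch is repeatable
rng = random.Random(42)

# 10 distinct units; every subset of size 10 is equally likely
sample = rng.sample(population, k=10)
```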

Stratified Sampling

Population is split into different strata, then a sample is taken from each stratum

Usually split into different strata based on a common characteristic

Clusters

The population is split into clusters

A particular cluster is chosen and all those in it are included

Voluntary Sampling

Only the results of those who respond are considered.

Biased as those who respond may share common trait

Convenience Sampling

Only those who are easiest to reach, such as people who approach the surveyor, get sampled

Biased

Census

A census is when the data of the entire population is obtained

Experimental design

Observational Study

Done by observing the outcomes without influencing the environment

Experimental Study

Done by altering the environment and observing the results

Design Types

Matched Pair

Subjects are paired

Each one is put through a different treatment group

Results are compared

Blocked

Subjects with similar conditions that may affect results are blocked

Randomised

Subjects are allocated at random
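
A minimal sketch of random allocation into two groups, assuming made-up subject labels and a hypothetical helper randomise: shuffle the subjects, then split the shuffled list in half.

```python
import random

def randomise(subjects, rng):
    """Split subjects at random into two groups of (nearly) equal size."""
    shuffled = list(subjects)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects S1..S8; fixed seed only for repeatability
subjects = [f"S{i}" for i in range(1, 9)]
treatment, control = randomise(subjects, random.Random(1))
```

Every subject ends up in exactly one group, and chance alone decides which.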

Blinding

Blinding is important, as either party in the experiment knowing which treatment is being given might induce bias

Single blind

The subjects do not know what they are getting

Double Blind

Neither the subjects nor the researchers know which treatment each subject is getting

Placebo

Used when comparing 2 treatments; the placebo is a false treatment

Used to remove any psychological bias

Steps to design

Suitable design type

Suitable treatments

All variables taken into account

Control

One group may receive the treatment while another control group may not receive any for control purposes

Replication

Experiment needs to be replicated to ensure consistent results

Also helps to remove error brought about by chance

Probability , Permutation and Combinations

Permutations

Different number of ways a set of objects can be arranged

Note that the objects have to be distinct and that order matters

n! gives number of ways for n objects to be arranged

(n-1)! gives the number of ways for n objects to be arranged in a circle

When choosing and arranging r objects from n, the number of ways is nPr = n!/(n-r)!
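
The counts above can be checked with the standard library (math.perm needs Python 3.8 or later). The numbers n = 5 and r = 2 are made up for illustration; math.perm(n, r) computes n!/(n-r)!, the number of ways to choose and arrange r objects from n.

```python
from math import factorial, perm

n, r = 5, 2

in_a_row = factorial(n)          # 5! arrangements of 5 distinct objects in a line
in_a_circle = factorial(n - 1)   # (5-1)! arrangements around a circle
choose_and_arrange = perm(n, r)  # 5!/(5-2)! ordered selections of 2 from 5
```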

Combinations

Order is ignored

Only number of ways to get different items matters

Objects have to be distinct
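
A short sketch of the difference order makes, using made-up letters. The number of combinations of r objects from n is n!/(r!(n-r)!), which math.comb computes (Python 3.8 or later); each combination corresponds to r! permutations.

```python
from itertools import combinations
from math import comb, factorial, perm

letters = ["A", "B", "C", "D", "E"]

# All unordered pairs: ("A", "B") is listed once, ("B", "A") is not listed again
pairs = list(combinations(letters, 2))

n_combinations = comb(5, 2)  # order ignored
n_permutations = perm(5, 2)  # order matters
```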

Probability

Probability of event

P(x) = 1 means that the event will definitely and always occur

P(x) = 0 means that the event would not occur

If A and B are disjoint

P(A U B) = P(A) + P(B)

P(A intersect B) = 0
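
A minimal check of the disjoint rules on one roll of a fair die. The events A and B are made up for illustration, and fractions.Fraction keeps the probabilities exact.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

A = {1, 2}  # roll a 1 or a 2
B = {5, 6}  # roll a 5 or a 6; shares no outcome with A, so A and B are disjoint

p_union = prob(A | B)
p_intersection = prob(A & B)
```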

Hypothesis testing

Hypothesis

Null hypo : Nothing is different

Alternative hypo : Different from null

Errors

Type 1 error

Rejecting the null hypothesis when it is true

Type 2 error

Failing to reject the null hypothesis when it is not true

Power of test

To increase power

Increase alpha

More extreme alternative

Decrease std deviation

Increase sample size

80% power is used as standard
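
The four ways of increasing power can be seen in a small calculation. This is a sketch for one specific case only, assumed here for illustration: a one-sided z test of H0: mu = mu0 against Ha: mu > mu0 with known sigma. The function name and all numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_alt, sigma, n, alpha):
    """Power of a one-sided (upper tail) z test when the true mean is mu_alt."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                    # rejection cutoff in z units
    shift = (mu_alt - mu0) / (sigma / sqrt(n))       # how far the truth sits from H0
    return 1 - z.cdf(z_crit - shift)

base = power_one_sided_z(mu0=100, mu_alt=103, sigma=10, n=25, alpha=0.05)

# Each change below raises the power relative to the base case
bigger_alpha = power_one_sided_z(100, 103, 10, 25, alpha=0.10)
more_extreme = power_one_sided_z(100, 106, 10, 25, alpha=0.05)
smaller_sd = power_one_sided_z(100, 103, 5, 25, alpha=0.05)
bigger_n = power_one_sided_z(100, 103, 10, 100, alpha=0.05)
```

In this made-up case the base power is well below the 80% standard, and quadrupling the sample size lifts it above 80%.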

2 Population

2 Sample Z test

2 Sample T test

Confidence interval

Conservative approach for df

df = min(n1 - 1, n2 - 1)

Conditions

SRS

Independence

Population size is at least 10(n)

Normality

At least 5 for each sample

n1+n2<15

Normal no outliers and no strong skew

15<n1+n2<30

No outliers and no strong skewness must be present in the data

n1+n2>30

Can be used even if distribution is skewed

Hypothesis testing

Degrees of freedom

Conservative

The degrees of freedom of both the samples must be considered and the smaller one chosen
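
A sketch of the two-sample t statistic with the conservative degrees of freedom, the smaller of n1 - 1 and n2 - 1. The data and the function name are made up. The critical value or p-value would still come from a t table or statistical software, which is not shown here.

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t(sample1, sample2):
    """Two-sample t statistic (unpooled) and conservative degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Standard error of the difference in sample means
    se = sqrt(stdev(sample1) ** 2 / n1 + stdev(sample2) ** 2 / n2)
    t = (mean(sample1) - mean(sample2)) / se
    df = min(n1 - 1, n2 - 1)  # conservative choice: the smaller of the two
    return t, df

# Made-up samples for illustration
group_a = [12, 15, 14, 16, 13]
group_b = [10, 11, 9, 12, 10, 8]
t, df = two_sample_t(group_a, group_b)
```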

Close approximation

Population Proportion

2 population z-test for hypo. test

Combined sample proportion

Conditions

SRS

Data must be obtained from an srs

Independence

Population size is at least 10 times the sample size

Normality

The counts of successes and failures are each at least 5
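
A minimal sketch of the 2 population z-test for proportions, using the combined (pooled) sample proportion in the standard error. The counts and the function name are made up, and the p-value shown is two-sided (alternative: the proportions differ).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled proportion, z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # combined sample proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, z, p_value

# Made-up counts: 60 successes out of 100 versus 45 out of 100
pooled, z, p_value = two_proportion_z(60, 100, 45, 100)
```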

2 population z-interval for confidence

Conditions

SRS

Data must be obtained from an srs

Independence

Population size is at least 10 times the sample size

Normality

The counts of successes and failures are each at least 5

1 Population

Test about population mean

Std deviation known

1 Population Z test

Alternate Hypothesis

Condition

SRS

Data must be obtained from an SRS

Independence

Population is at least 10 times bigger than sample

Normality

Sample size at least 30
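
A sketch of the 1 population z test with known standard deviation. The numbers and the function name are made up, and a two-sided alternative is assumed.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """z statistic and two-sided p-value for H0: mu = mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up example: sample mean 52 from n = 36, testing mu0 = 50 with sigma = 6
z, p_value = one_sample_z(sample_mean=52, mu0=50, sigma=6, n=36)
```

Here the sample mean is 2 standard errors above mu0, which gives a two-sided p-value of about 0.046.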

Std deviation not known

1 Population T test

Conditions

SRS

Sample must be by SRS

Independence

Population is at least 10 times bigger than sample

Normality

n < 15

Population must be normally distributed and there must be no outliers

15 ≤ n < 30

No outliers or strong skew must be present

n ≥ 30

T procedure can be applied even if there is skewness present in the data

Test about population proportion

Hypothesis

Conditions

SRS

Data must be from an srs

Normality

np and nq must be greater than 10

Independence

Population is 10 times of sample size

Distribution

Chi Square

Goodness of fit

Each expected count must be more than 5
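
A minimal sketch of the goodness of fit statistic, the sum of (observed - expected)^2 / expected over all categories. The counts are made up: 60 rolls of a die tested against a fair die, so each expected count is 10. The result would be compared against a chi square table with (number of categories - 1) degrees of freedom, which is not shown here.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts from 60 rolls of a die; a fair die expects 10 of each face
observed = [8, 12, 9, 11, 14, 6]
expected = [10] * 6

counts_large_enough = all(e > 5 for e in expected)  # condition on expected counts
statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1
```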

2 way table

Null hypothesis

The distribution in the response variable is the same in all x populations

Alternative hypothesis

The x distributions are not all the same
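
For a 2 way table, the expected count in each cell under the null hypothesis is (row total × column total) / grand total. A sketch with a made-up 2 × 2 table and hypothetical helper names:

```python
def expected_counts(table):
    """Expected cell counts if the distribution is the same in every row."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Chi square statistic and degrees of freedom for a 2 way table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Made-up table: rows are populations, columns are response categories
table = [[20, 30],
         [30, 20]]
expected = expected_counts(table)
statistic, df = chi_square_two_way(table)
```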

Chi Square distribution

Probability Density Function

Cumulative Distribution Function

Normal

Z score

Gives the number of std deviations the value is from the mean

Usually used for large sample sizes

Generally used for distributions where the majority of the observations cluster around the mean

Clear link between mean and standard deviation

Central limit theorem

When the sample size is more than 30

Normality can be assumed

For proportions, do not use the n > 30 rule; check that np and n(1-p) are large enough instead

Can be used to approximate binomial and poisson distribution

Binomial

Conditions

There are only 2 outcomes in the trial, one defined as "Success" and the other as "Failure"

The number of trials n is fixed

The outcome of each trial is independent

Probability of success for each trial is the same

X~B(n,p)

Mean = np

Std deviation = root(np(1-p))
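
A short sketch for X~B(n,p) with made-up values n = 10 and p = 0.5. The probability of exactly k successes is nCk × p^k × (1-p)^(n-k); the function name is an assumption.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
mean = n * p                      # np
sd = sqrt(n * p * (1 - p))        # root(np(1-p))

p_five = binomial_pmf(5, n, p)    # probability of exactly 5 successes
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # all outcomes together
```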

Normal approximation

Can be approximated by a normal distribution

Condition

np and n(1-p) greater than 10

Most accurate when p = 0.5

If p is not close to 0.5 , approximation would be poor

Continuity correction must be performed
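
A sketch of the effect of the continuity correction, with made-up values n = 100 and p = 0.5. To approximate P(X ≤ 55) the normal curve is evaluated at 55.5 rather than 55, because the discrete value 55 covers the interval from 54.5 to 55.5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5

# Exact binomial probability P(X <= 55)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, 56))

# Normal approximation with the same mean and standard deviation
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
with_correction = approx.cdf(55.5)
without_correction = approx.cdf(55)
```

With the correction, the approximation is much closer to the exact value than without it.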

Poisson

Conditions

Events occur singly and randomly

Events occur uniformly

Probability of an event occurring is very low

Events occur independently

If X ~ Po(x) and Y ~ Po(y) are independent, then X + Y ~ Po(x+y)
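
The additive property can be checked numerically for independent variables. In this sketch (made-up rates 2 and 3, hypothetical function name), the probability that the two counts total 3 is built up from every way of splitting 3 between them, and compared with a single Poisson with rate 5.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

lam_x, lam_y = 2.0, 3.0
total = 3

# P(X + Y = 3) for independent X and Y, summing over all splits of the total
convolved = sum(
    poisson_pmf(i, lam_x) * poisson_pmf(total - i, lam_y)
    for i in range(total + 1)
)

# The same probability from a single Poisson with the combined rate
direct = poisson_pmf(total, lam_x + lam_y)
```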

Normal approximation

X ~ N(np, np), since the mean and the variance of a Poisson distribution are equal

np must be large (> 50)

More accurate if n is large and p very low

Geometric

Conditions

There are only 2 outcomes in the trial, one defined as "success" and the other as "Failure"

The number of trials n is not fixed

The outcome of each trial is independent

Probability of success for each trial is the same

The variable of interest X is the number of trials up to and including the first success

Mean is 1/p

Variance is q/p^2, so the standard deviation is root(q)/p
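
A numerical check of the geometric mean and spread, with a made-up p = 0.25 and X counted as the trial on which the first success occurs. Note that q/p^2 is the variance; the standard deviation is its square root. The sums are cut off at 2000 trials, where the remaining probability is negligible.

```python
from math import sqrt

def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.25
q = 1 - p

mean = 1 / p            # expected number of trials
variance = q / p ** 2
sd = sqrt(variance)

# Check both against direct sums over the distribution (truncated at 2000 trials)
ks = range(1, 2001)
mean_check = sum(k * geometric_pmf(k, p) for k in ks)
variance_check = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)
```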

Sampling Distribution

Parameter and statistic

Parameter describes a property of a population such as mean

Statistic is computed from a sample and describes a property of that sample, such as the sample proportion

Sampling variability

Value of a statistic varies over many repeated samples

A larger sample would produce less variability

Sampling distribution

Overall shape should be roughly normal

Acceptable if it is skewed but must not have outliers

No outliers should be present

Center of distribution should be close to mean of population

Bias

How far the mean of the sampling distribution is from the true value of the population parameter

Sample proportion

Used when measuring the proportion of a population

Mean proportion of sample is equal to that of the population

Standard deviation = root(p(1-p)/n)

Sample mean

Mean of the sample means is the mean of the population

Std deviation = σ/root(n), where σ is the population std deviation
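
A short sketch of the standard deviations of the two sampling distributions, using the standard formulas root(p(1-p)/n) for a sample proportion and σ/root(n) for a sample mean. Function names and numbers are made up. It also shows why a larger sample gives less variability: quadrupling n halves the standard deviation.

```python
from math import sqrt

def sd_sample_proportion(p, n):
    """Standard deviation of the sample proportion for samples of size n."""
    return sqrt(p * (1 - p) / n)

def sd_sample_mean(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)

# Made-up values: population sd 10, population proportion 0.5
sd_mean_25 = sd_sample_mean(10, 25)
sd_mean_100 = sd_sample_mean(10, 100)   # 4 times the sample size, half the sd
sd_prop_100 = sd_sample_proportion(0.5, 100)
```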