Statistics Summary

Author: Ben Reynwar Whatever wikipedia uses since there are probably bits cut and pasted. 2011 April 1 2011 May 28

These are notes I made while working through the book "Statistics Explained". They're rather terse but include R examples so I thought they might possibly be useful.

Type I error - finding an effect that isn't real.

Type II error - not finding a real effect.

Some Useful Distributions

Chi-square Distribution

The chi-square distribution with k degrees of freedom is the distribution of the sum of squares of k independent standard normal variables.

It is a special case of the gamma distribution.

Useful for estimating variance of a population. Suppose we have n observations from a normal population $$N(\mu, \sigma^2)$$ and $$S$$ is their standard deviation then $$\frac{(n-1)S^2}{\sigma^2}$$ has a chi squared distribution with n-1 degrees of freedom. Since $$\sigma$$ is the only unknown once can get confidence intervals for the population variance.

> # Number of observations in sample.
> n <- 10
> # Set alpha for 95% confidence interval (two-sided)
> alpha <- 0.05
> # Get a sample.  Pretend we don't know mean or standard deviation.
> sample = rnorm(n, 25, 8)
> top = var(sample)*(n-1)
> # Get bounds for confidence interval for the standard deviation.
> lower = sqrt(top/qchisq(1-alpha/2, n-1))
> upper = sqrt(top/qchisq(alpha/2, n-1))
> c(lower, upper)
[1] 3.531557 9.373243


F-Distribution

The ratio of two independent random variables both of which have chi-squared distributions has an F-distribution.

> # We take two samples from a normal population and take the ratio of variances.
> n1 <- 3; n2 <- 4
> # A 95% confidence interval for our result is:
> alpha <- 0.05
> c(qf(alpha/2, n1, n2), qf(1-alpha/2, n1, n2))
[1] 0.06622087 9.97919853
> mu <- 10; sigma <- 5
> sample1 <- rnorm(n1, mu, sigma)
> sample2 <- rnorm(n2, mu, sigma)
> ratio <- var(sample1)/var(sample2)
> # Calculated ratio is:
> ratio
[1] 0.2974744


Students's t-distribution

• An estimated distibution of sample means. Different from a normal distribution since it takes into account the uncertainty in the standard deviation.
• Depends on number of degrees of freedom of the standard deviation used.
> # Calculate the value of x for which the cumulative distribution of the
> # t-distribution is 0.05, for degrees of freedom = 1000.
> qt(0.05, 1000)
[1] -1.646379
> qt(0.05, 3)
[1] -2.353363
> # Get the density of a t-distribution.
> dt(0.05, 1000)
[1] 0.3983438
> # Get the cumultive distribution
> pt(1, 1000)
[1] 0.8412238


Student's t-test

• Assesses the statistical significance of the difference between two sample means.
• Can have paired or unpaired samples (related or independent).
• Assumes the two samples have equal variances. Often used as long as one is not more than three times as big as the other.
• Welch's t-test is an extension that does not assume equal variances.

Independent Example

> # Generate a random vector containing 10 values from a normal distibution
> # with mean 10 and standard deviation 2.
> x = rnorm(10, 10, 2)
> x
[1]  7.288907 10.679290  7.021853 12.356291 11.973576 13.452286 10.765847
[8] 12.815315 10.776181 11.478132
> # Generate another similar random vector.
> y = rnorm(10, 10, 2)
> y
[1] 10.228506  9.307230  9.007056 11.699393 12.524014 15.061729  9.449236
[8] 14.378682 10.251995  9.862443
> # Perform the t-test on them.
> # The p-value is the chance that the difference between the means would be
> # this large if the null hypothesis were true.
> t.test(x, y, var.equal=TRUE)

Two Sample t-test

data:  x and y
t = -0.3273, df = 18, p-value = 0.7472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.346419  1.713897
sample estimates:
mean of x mean of y
10.86077  11.17703


Analysis of Variance

Variance Ratio(F) = (Between conditions variance)/(Error variance)

Assume samples come from normally distributed populations with equal variances.

F statistic depends on the dof for the two kinds of variances. These should always be given with the F value:

$$F(df_{bt.conds}, df_{error})$$ = calculated value

The p-value is then found from a table or calculation.

One factor independent measures ANOVA

• also called completely randomised design ANOVA
• We have $$K$$ conditions, with $$n_i$$ samples in the ith condition, and $$N$$ samples overall.
• $$\bar{Y}_{i}$$ denotes the sample mean in the ith condition.
• $$\bar{Y}$$ denotes the overall mean of the data.
• $$Y_{ij}$$ is the jth observation in the ith condition.

Between conditions variance = $$\sum_i n_i(\bar{Y}_{i} - \bar{Y})^2/(K-1)$$ you Error variance = $$\sum_{ij} (Y_{ij}-\bar{Y}_{i})^2/(N-K)$$

For two conditions it is mathematically identical to the t-test.

> # Number of points in each data set.
> ni <- 4
> # Generate four sets of data with different means.
> a <- rnorm(ni, 10, 2)
> b <- rnorm(ni, 12, 2)
> c <- rnorm(ni, 14, 2)
> d <- rnorm(ni, 16, 2)
> # Merge them all together into a data frame.
> values <- c(a, b, c, d)
> letters <- c(rep('a', ni), rep('b', ni), rep('c', ni), rep('d', ni))
> df <- data.frame(letter=letters, value=values)
> # Perform an anova analysis.
> fit <- aov(value ~ letter, data=df)
> summary(fit)
Df Sum Sq Mean Sq F value    Pr(>F)
letter       3 91.902 30.6342  24.829 1.964e-05 ***
Residuals   12 14.805  1.2338
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Post-hoc Tests

If we find an effect with the F-test then we need to work out where it is coming from. We do this with post-hoc tests. The risk is the increased chance of a type I error.

Least Significant Difference Test
• takes not account of the number of comparisons being made.
• increased risk of Type I error is simply accepted.
Neuman-Keuls Test
• ??
Duncan Test
• ??
Tukey Test
• We have K conditions, with n samples in each. N=nK.
• Studentized Range is the range of samples divided by an estimate of their standard deviation.
• We find the value of the Studentized Range for which their is some defined chance that the condition results will be under.
• If any two conditions deviate by more than this amount we can say they are significant.
• Depends on number of conditions and dof in standard deviation calculation.
> th <- TukeyHSD(fit)
> th
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = value ~ letter, data = df)

$letter diff lwr upr p adj b-a 0.7224116 -1.609443 3.054266 0.7949923 c-a 4.8536226 2.521768 7.185477 0.0002394 d-a 5.3724864 3.040632 7.704341 0.0000919 c-b 4.1312110 1.799356 6.463066 0.0009963 d-b 4.6500748 2.318220 6.981929 0.0003539 d-c 0.5188638 -1.812991 2.850718 0.9097766 > > # Now try to do the same thing but more manually for (b-a) comparison. > # Variance of residual errors. > vare <- mean(c(var(a), var(b), var(c), var(d))) > # Normalize the difference between the means the estimate of the standard deviation > # of the means. > st_range <- (mean(d)-mean(a))/(sqrt(vare/ni)) > # ptukey: ptukey(q, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE) > # q - a given studentized range > # nmeans - the number of samples (i.e. number of means in this case) > # df - degrees of freedom in the calculation of the stdev. > # It returns the probability that the studentized range of the sample of means is > # less than the given value of q. > # For our example nmeans is clearly 4 (a,b,c and d) > # And df is 4*(ni-1) because we calculated the variance from 4 sets of samples each of > # which had (ni-1) degrees of freedom. > manual <- 1 - ptukey(st_range, 4, 4*(ni-1)) > manual [1] 9.19243e-05 > th$letter["d-a", "p adj"] - manual < 10e-6
[1] TRUE

Scheffé Test
• Very similar to Turkey method except we do not limit to pairwise comparisons.

• Let μ1, ..., μr be the means of some variable in r disjoint populations.

• Begin cut and paste from wikipedia: An arbitrary contrast is defined by

$$C = \sum_{i=1}^r c_i\mu_i$$

where

$$\sum_{i=1}^r c_i = 0.$$

If μ1, ..., μr are all equal to each other, then all contrasts among them are 0. Otherwise, some contrasts differ from 0.

Technically there are infinitely many contrasts. The simultaneous confidence coefficient is exactly 1 − α, whether the factor level sample sizes are equal or unequal. (Usually only a finite number of comparisons are of interest. In this case, Scheffé's method is typically quite conservative, and the experimental error rate will generally be much smaller than α.)

We estimate C by

$$\hat{C} = \sum_{i=1}^r c_i\bar{Y}_i$$

for which the estimated variance is

$$s_{\hat{C}}^2 = \hat{\sigma}_e^2\sum_{i=1}^r \frac{c_i^2}{n_i}.$$

It can be shown that the probability is 1 − α that all confidence limits of the type

$$\hat{C}\pm s_\hat{C}\sqrt{\left(r-1\right)F_{\alpha ; r-1 ; N-r}}$$

> # Perform a Scheffe test to see if the average of samples a and b is significantly
> # different from the average of c and d.
> cs <- c(-0.5, -0.5, 0.5, 0.5)
> means <- tapply(df$value, df$letter, mean)
> c.hat <- sum(cs*means)
> s2 <- vare/ni * sum(cs^2)
> fval = c.hat^2/s2/(4*(ni-1))
> fval
[1] 6.100458
> # The chance of any linear combination deviating by this much if all samples
> # were from the same normal population.
> 1 - pf(fval, 3, 4*(ni-1))
[1] 0.009188318


Analyzing Frequency Data

A population can be divided in c categories. The chance of an observation being of category i is $$p_i$$. We make N observations and find $$n_i$$ in category i.

$$\sum_i{\frac{(n_i-N p_i)^2}{N p_i}}$$ can be approximated by a chi-squared distribution with c-1 degrees of freedom as long as $$N p_i$$ is greater than 5 for all categories.

Possible Uses:

• A goodness of fit test. To see if a sample seems to match a normal distribution bin the observations and compare the observed frequencies to those expected.
• A test of independence. If we have two types of categories and each observation is a member of one of each types of categories then we can check if the categories are independent. We just see how the frequencies deviate from what would be expected if they were independent.

Linear Correlation

We have a sample of (x, y) pairs. Pearson correllation coefficient is:

$$r = \frac{Sum of XY products}{Sum of XX squares + Sum of YY squares}$$

r=1 is perfect positive correlation, r=0 is uncorrelated, r=-1 is perfect negative correlation. It is the slope of the line of best-fit through the reduced variables.

The distribution of r is approximately a Studentized t-distribution. To be exact the variable t ($$t = r\sqrt{\frac{n-2}{1 - r^2}}$$) has this distribution.

> # Create some correlated data.
> n <- 10
> a <- rnorm(n, 12, 1)
> b <- 3*a + 6 + rnorm(n, 0, 3)
> r <- cor(a, b)
> r
[1] 0.5985606
> t <- r * sqrt((n-2)/(1-r^2))
> t
[1] 2.113384
> # The probability of a correlation being this far from 0 by chance is:
> 2 * (1 - pt(t, n-2))
[1] 0.06751678