Chapter 9 Data Distributions

9.1 Normal distribution

A normal distribution is described by two parameters: the mean and the standard deviation.

A normal distribution with mean = 0 and sd = 1 is called the standard normal distribution and is denoted N(0,1).

Changing the mean shifts the distribution along the x-axis.

Changing the standard deviation changes the spread of the distribution, and thus its shape.
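As a minimal sketch of these two effects, the density function dnorm can be evaluated on a grid for a few parameter choices. The grid and the parameter values (mean = 3, sd = 2) are illustrative assumptions, not values taken from this chapter.

```r
# Evaluate normal densities on a common grid (grid and parameters are illustrative)
x <- seq(-10, 10, by = 0.01)

d_std     <- dnorm(x, mean = 0, sd = 1)  # standard normal, N(0,1)
d_shifted <- dnorm(x, mean = 3, sd = 1)  # same spread, centre moved to 3
d_wide    <- dnorm(x, mean = 0, sd = 2)  # same centre, larger spread

# The peak sits at the mean, so changing the mean moves the peak
peak_std     <- x[which.max(d_std)]
peak_shifted <- x[which.max(d_shifted)]

# A larger standard deviation makes the curve lower and wider
height_std  <- max(d_std)
height_wide <- max(d_wide)

# Overlay the three curves
plot(x, d_std, type = "l", xlab = "x", ylab = "density")
lines(x, d_shifted, lty = 2)
lines(x, d_wide, lty = 3)
```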

9.4 Use z score to calculate percentile (area below or lower tail)

A percentile is the percentage of observations that fall below a given data point.

Use pnorm, the cumulative distribution function of the normal distribution.

q = quantile. As q increases, the area under the curve below q increases, and so does the value returned by pnorm.

## [1] 0.3085375
## [1] 0.4207403
## [1] 0.4601722
## [1] 0.5398278
## [1] 0.5792597
## [1] 0.6914625
## [1] 0.1586553
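The calls that produced these values are not shown. The sketch below assumes standard normal quantiles of -0.5, -0.2, -0.1, 0.1, 0.2, 0.5 and -1, which give lower-tail areas of this kind.

```r
# Quantiles (z-scores) assumed for illustration
q <- c(-0.5, -0.2, -0.1, 0.1, 0.2, 0.5)

# pnorm returns the lower-tail area P(Z <= q); defaults are mean = 0, sd = 1
p <- pnorm(q)
print(p)

# Area below one standard deviation under the mean
p_minus1 <- pnorm(-1)
print(p_minus1)

# Symmetry check: the area below -0.5 plus the area below 0.5 equals 1
print(pnorm(-0.5) + pnorm(0.5))
```

Because pnorm is a cumulative function, the values in p never decrease as q grows.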

9.7 Explore the applet

https://gallery.shinyapps.io/dist_calc/

9.8 Evaluating the normal distribution

Although a histogram is a nice way to look at the overall distribution of the data, it is sometimes difficult to check for normality from the plot itself. A useful alternative is to draw a normal probability plot (also called a normal Q-Q plot or quantile-quantile plot).

Below we draw a Q-Q plot for each feature of the iris dataset. For perfectly normal data, the x-axis (theoretical quantiles) and y-axis (observed quantiles) match, so all points fall on the diagonal line. Any departure from normality, such as skewness, shows up as points moving away from the line. Based on which end of the line the deviation occurs at, and in which direction, one can guess whether the data is left- or right-skewed.

A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line.

In the example below
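The original plots are not reproduced here. One way to draw such plots, assuming base R graphics and the four numeric columns of the built-in iris data frame, is sketched below.

```r
# Names of the four numeric columns of iris
features <- names(iris)[1:4]

op <- par(mfrow = c(2, 2))       # arrange the panels in a 2 x 2 grid
for (f in features) {
  qqnorm(iris[[f]], main = f)    # observed quantiles vs theoretical normal quantiles
  qqline(iris[[f]])              # reference line through the first and third quartiles
}
par(op)                          # restore the previous plotting parameters
```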

9.9 Sampling distribution

Let's take sepal length from the iris dataset. There are 150 flowers. Let's treat these 150 sepal length values as the population (which is not true in the real sense).

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

Let's randomly select 50 flowers and look at their mean sepal length.

## [1] 5.932
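A minimal sketch of this step follows. The seed is an arbitrary assumption added only to make the sketch repeatable, so the resulting mean will generally differ from the value shown above.

```r
set.seed(1)                      # arbitrary seed (assumption), for repeatability only

sl  <- iris$Sepal.Length         # the 150 values treated as the population
s50 <- sample(sl, size = 50)     # draw 50 values at random, without replacement

mean(s50)                        # sample mean of the 50 selected flowers
```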

Let's repeat the sampling process 5 times and see how the mean varies. As the result shows, the sample mean changes each time we randomly sample. This variation is due to random sampling.

A sampling distribution is the distribution of a sample statistic.

## [1] 5.934 5.866 5.748 5.838 5.902

Let's repeat this sampling 100 times and plot the distribution of the sample means.
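The repeated sampling can be sketched with replicate. The seed, the sample size of 50 and the plot labels are assumptions for illustration.

```r
set.seed(1)                      # arbitrary seed (assumption)
sl <- iris$Sepal.Length          # treated as the population

# replicate() evaluates the expression repeatedly and collects the results
means_5   <- replicate(5,   mean(sample(sl, size = 50)))
means_100 <- replicate(100, mean(sample(sl, size = 50)))

means_5                          # five sample means, each slightly different

# Histogram of the 100 sample means: an approximate sampling distribution
hist(means_100,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean of sepal length")
abline(v = mean(sl), lty = 2)    # population mean for reference
```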

9.10 Standard Error

## [1] 0.09538724
## [1] 0.1210215

Larger sample sizes have a lower standard error, i.e. a larger sample represents the population more closely.

Let's repeat the sampling 5000 times and explore different sample sizes: n=10, n=30, n=100 and n=1000.

## [1] 0.25022862 0.13585496 0.08228301 0.02609450
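The simulation settings behind these numbers are not shown. The sketch below assumes sampling with replacement for every n (required for n = 1000, which exceeds the 150 available values), so its results will not match the output above exactly.

```r
set.seed(1)                      # arbitrary seed (assumption)
sl    <- iris$Sepal.Length       # treated as the population
sizes <- c(10, 30, 100, 1000)

# For each n, draw 5000 samples with replacement and take the standard
# deviation of the 5000 sample means as the empirical standard error
se_sim <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(sl, size = n, replace = TRUE))))
})

# Theoretical value when sampling with replacement: sigma / sqrt(n)
sigma     <- sqrt(mean((sl - mean(sl))^2))  # population sd (divides by N, not N - 1)
se_theory <- sigma / sqrt(sizes)

rbind(n = sizes, simulated = se_sim, theory = se_theory)
```

The simulated standard error shrinks roughly in proportion to 1 / sqrt(n).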

9.11 Confidence interval

First decide which confidence level you want. Suppose you want to estimate the 95% confidence interval for the mean sepal length of the iris dataset. A 95% interval means that the middle area around the mean is 95%, so the remaining lower and upper tail areas are 2.5% each, i.e. 0.025 (lower 2.5% + upper 2.5% = 5% in total).

Estimate the z-score for 0.025 using the qnorm function. Alternatively, work from the upper tail (0.025 + 0.95 = 0.975); the two z-scores have the same magnitude, about 1.96.

Since the normal distribution has two tails, a 95% confidence level corresponds to the 97.5th percentile of the normal distribution at the upper tail.

## [1] -1.959964
## [1] -1.959964
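The exact calls are not shown. A sketch of the quantile lookup with qnorm follows. Note that qnorm(0.975) with default arguments returns the positive value, while a negative value is returned for 0.025, or for 0.975 with lower.tail = FALSE.

```r
alpha <- 1 - 0.95                    # total area in the two tails

z_lower <- qnorm(alpha / 2)          # 2.5th percentile, negative
z_upper <- qnorm(1 - alpha / 2)      # 97.5th percentile, positive
c(z_lower, z_upper)

# Asking for the point with 97.5% of the area ABOVE it gives the negative value
z_alt <- qnorm(0.975, lower.tail = FALSE)
z_alt
```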

Now estimate the standard error for sepal length (standard deviation divided by the square root of the sample size).

## [1] 0.06761132

Now estimate the confidence interval: mean +/- (z* x SE).

The product z* x SE is called the margin of error.

## [1] 5.710818
## [1] 5.975849

So we are 95% confident that the population mean of sepal length lies between 5.71 and 5.98.
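The steps of this section can be put together as a short sketch. The variable names are illustrative assumptions.

```r
sl <- iris$Sepal.Length

n    <- length(sl)               # 150 observations
xbar <- mean(sl)                 # sample mean
se   <- sd(sl) / sqrt(n)         # standard error of the mean

z_star <- qnorm(0.975)           # critical value for a 95% interval
moe    <- z_star * se            # margin of error

ci <- c(lower = xbar - moe, upper = xbar + moe)
ci
```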

9.12 Hypothesis test for single sample mean

Consider sepal length from the iris dataset. Let's assume that it is a sample of 150 flowers whose sepal length was measured, and that the actual population is much larger than 150 flowers. From this sample of 150 flowers, estimate the sample mean.

The objective is to compare this sample mean with a hypothesized population mean and ask whether the true mean is different from, greater than, or less than that value. If the answer is yes, we also want to know how confident we can be about it.

We will proceed slowly.

For now, first estimate the sample mean.

## [1] 5.843333

So the estimated sample mean is 5.84.

Let's ask whether the mean sepal length of iris flowers is different from 6.

Null hypothesis: population mean = 6

Alternative hypothesis: population mean not equal to 6

Such a hypothesis test is called two-tailed. Let's do a two-tailed t-test.

## 
##  One Sample t-test
## 
## data:  sl
## t = -2.3172, df = 149, p-value = 0.02186
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
##  5.709732 5.976934
## sample estimates:
## mean of x 
##  5.843333
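Output of this form is produced by t.test from the stats package. A sketch of the call, assuming sl holds the 150 sepal length values as the `data:  sl` line suggests:

```r
sl <- iris$Sepal.Length

# One-sample, two-tailed t-test against a hypothesized mean of 6
res <- t.test(sl, mu = 6, alternative = "two.sided")
res

# Individual pieces of the result can be extracted by name
res$statistic    # t value
res$parameter    # degrees of freedom
res$p.value      # two-tailed p-value
res$conf.int     # 95% confidence interval
```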

Let's analyze the results. How do we get a t-statistic of -2.31?

t = (sample mean - hypothesized population mean) / standard error of the mean

Let's calculate it step by step.

## [1] -2.317166
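A sketch of this calculation, with illustrative variable names:

```r
sl  <- iris$Sepal.Length
mu0 <- 6                               # hypothesized population mean

xbar <- mean(sl)                       # sample mean
se   <- sd(sl) / sqrt(length(sl))      # standard error of the mean

t_stat <- (xbar - mu0) / se            # one-sample t statistic
t_stat
```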

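So the test statistic is -2.31.

The degrees of freedom (df) are n-1, i.e. 149.

Now let's estimate the 95% confidence interval around the sample mean. The interval below uses the normal critical value (z* = 1.96), so it is slightly narrower than the t-based interval reported by t.test above (5.709732 to 5.976934).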

## [1] 5.710818 5.975849
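The small gap between this interval and the one printed by t.test can be checked with a sketch that computes both. It assumes the interval above was built with qnorm, while t.test uses the t distribution with n-1 degrees of freedom.

```r
sl <- iris$Sepal.Length
se <- sd(sl) / sqrt(length(sl))
df <- length(sl) - 1

# Interval based on the normal critical value
ci_z <- mean(sl) + c(-1, 1) * qnorm(0.975) * se

# Interval based on the t critical value with 149 degrees of freedom
ci_t <- mean(sl) + c(-1, 1) * qt(0.975, df) * se

rbind(normal = ci_z, t = ci_t)
```

With 149 degrees of freedom the t critical value is only slightly larger than 1.96, so the two intervals are close.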

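Now let's estimate the p-value. For a two-tailed test it is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained, in either direction. The area in one tail is therefore multiplied by 2.

Since we are using a t-test, we can call the pt function.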

## [1] 0.02185662
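A sketch of the two-tailed p-value calculation with pt, the cumulative distribution function of the t distribution:

```r
sl     <- iris$Sepal.Length
t_stat <- (mean(sl) - 6) / (sd(sl) / sqrt(length(sl)))
df     <- length(sl) - 1

# Area in the lower tail below -|t|, doubled to cover both tails
p_two <- 2 * pt(-abs(t_stat), df)
p_two
```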

Since this p-value is < 0.05, we reject the null hypothesis: at the 5% significance level there is evidence that the population mean of sepal length is different from 6.

9.13 One tailed test (Greater)

Null hypothesis: population mean = 6

Alternative hypothesis: population mean > 6

## 
##  One Sample t-test
## 
## data:  sl
## t = -2.3172, df = 149, p-value = 0.9891
## alternative hypothesis: true mean is greater than 6
## 95 percent confidence interval:
##  5.731427      Inf
## sample estimates:
## mean of x 
##  5.843333

The test statistic is -2.3172, as calculated above.

The confidence interval has no upper limit.

What is the probability of observing values > -2.3172 with 149 degrees of freedom?

## [1] 0.9890717

Here we did not multiply by 2 because we are performing a one-tailed test.

The observed p-value is 0.989, which is > 0.05, so we cannot reject the null hypothesis, i.e. there is no evidence that the population mean of sepal length is > 6.
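The upper-tail calculation can be sketched as follows and compared against t.test with alternative = "greater". Variable names are illustrative.

```r
sl     <- iris$Sepal.Length
se     <- sd(sl) / sqrt(length(sl))
df     <- length(sl) - 1
t_stat <- (mean(sl) - 6) / se

# Upper-tail area: P(T > t_stat)
p_greater <- pt(t_stat, df, lower.tail = FALSE)
p_greater

# One-sided 95% interval: lower bound only, using the 95th percentile of t
lower_bound <- mean(sl) - qt(0.95, df) * se
lower_bound

# The same test through t.test
res_g <- t.test(sl, mu = 6, alternative = "greater")
res_g$p.value
```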

9.14 One tailed test (Lesser)

Null hypothesis: population mean = 6

Alternative hypothesis: population mean < 6

## 
##  One Sample t-test
## 
## data:  sl
## t = -2.3172, df = 149, p-value = 0.01093
## alternative hypothesis: true mean is less than 6
## 95 percent confidence interval:
##     -Inf 5.95524
## sample estimates:
## mean of x 
##  5.843333

The test statistic is -2.3172, as calculated above.

The confidence interval has no lower limit.

What is the probability of observing values < -2.3172 with 149 degrees of freedom?

## [1] 0.01092831

Here we did not multiply by 2 because we are performing a one-tailed test.

The observed p-value is 0.01, which is < 0.05, so we can reject the null hypothesis: there is evidence that the population mean of sepal length is < 6.
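The lower-tail calculation can be sketched in the same way and compared against t.test with alternative = "less". Variable names are illustrative.

```r
sl     <- iris$Sepal.Length
se     <- sd(sl) / sqrt(length(sl))
df     <- length(sl) - 1
t_stat <- (mean(sl) - 6) / se

# Lower-tail area: P(T < t_stat)
p_less <- pt(t_stat, df)
p_less

# One-sided 95% interval: upper bound only
upper_bound <- mean(sl) + qt(0.95, df) * se
upper_bound

# The same test through t.test
res_l <- t.test(sl, mu = 6, alternative = "less")
res_l$p.value
```

The two one-tailed p-values add up to 1, since together they cover the whole distribution.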

9.15 Hypothesis test for two sample mean, unpaired

Using this test we can compare two samples and perform three possible tests:

  1. Whether the means of the two samples are the same or not.

  2. Whether the sample 1 mean is > the sample 2 mean.

  3. Whether the sample 1 mean is < the sample 2 mean.

In this case we assume that the two samples are not paired, so the sample sizes may differ. However, we assume that:

  1. Both samples are from a normally distributed population.

  2. Observations are independent.

  3. The samples have equal variance.

Let's take an example from the iris dataset. It has three species (setosa, versicolor and virginica) with 50 flowers each. Our task is to check whether the mean sepal length differs among these three species.

Now let's check whether the mean sepal lengths of setosa and versicolor are different. Note that the output below comes from the Welch version of the test, which does not pool the two variances.

## 
##  Welch Two Sample t-test
## 
## data:  sl_setosa and sl_versicolor
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y 
##     5.006     5.936

Now let's look at the output piece by piece.

Let's compute the mean of x and the mean of y and cross-check them against the output.

## [1] 5.006
## [1] 5.936

Now calculate the difference of the two sample means, which is the numerator of the test statistic.

## [1] -0.93
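To get from this difference to the reported t value, it is divided by the standard error of the difference. The sketch below uses the standard Welch formulas for the standard error and the degrees of freedom and checks them against t.test. The names sl_setosa and sl_versicolor follow the output above, and the subsetting by Species is an assumption about how they were built.

```r
sl_setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
sl_versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

n1 <- length(sl_setosa)
n2 <- length(sl_versicolor)

diff_means <- mean(sl_setosa) - mean(sl_versicolor)   # numerator

# Squared standard error contributed by each sample
v1 <- var(sl_setosa) / n1
v2 <- var(sl_versicolor) / n2

se_diff <- sqrt(v1 + v2)              # standard error of the difference
t_welch <- diff_means / se_diff       # Welch t statistic

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

p_welch <- 2 * pt(-abs(t_welch), df_welch)   # two-tailed p-value

c(t = t_welch, df = df_welch, p = p_welch)

# Cross-check against the built-in test (Welch is the default)
res_w <- t.test(sl_setosa, sl_versicolor)
```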

9.16 Hypothesis test for two sample mean, paired

9.17 Check for normality

9.18 Check for uniform variance

9.19 Hypothesis test for single sample proportion