datasciencepatterns/StatisticalInference/index.Rmd at master · jfaleiro/datasciencepatterns · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Patterns for Statistical Inference"
author: "J Faleiro"
date: "April 2, 2015"
output:
    html_document:
        keep_md: true
        toc: true
        theme: united
---

## Probability Density Functions

Example of a PDF

```{r}
x <- c(-.5, 0, 1, 1, 1.5)
y <- c(0, 0, 2, 0, 0)
plot(x, y, lwd=3, frame=FALSE, type='l')
```

Is this a valid PDF?

* $p(x) >= 0$ everywhere? _yes_

* is $area = 1$? _yes_

this is a right triangle, so $area = (base * height) / 2$

```{r}
((1.0 - 0.0) * (2.0 - 0.0)) / 2 # area of right triangle
```

What is the probability that 75% or less of the calls are addressed?

```{r}
(1.5 * 0.75) / 2 # area of the right triangle from 0..0.75
```

this is also a special distribution called a beta distribution

```{r}
pbeta(0.75, 2, 1) # R code for beta distribution of base 1 and height 2 (note p prefix for probability)
```

Note that prefix p<density-name> in R is for CDFs

What is the probability that 40, 50 and 60% of calls are addressed?

```{r}
pbeta(c(0.4, 0.5, 0.6), 2, 1) # beta distribution of heigth 2 and base 1
```

What is the median of the distribution we are working?

Median on any distribution is the 50% percentile (0.5 quantile)

```{r}
qbeta(0.5, 2, 1) # 0.5 quantile of a beta distribution
```

What gives ~70% - that means in 50% of the days, about 70% of the calls or more are answered

Note a prefix q<dentity-name> prefix in R for quantile distributions

## Variability

Simulation example: simulations over standard normals

```{r}
simulations <- 1000
n <- 10
a <- rnorm(simulations * n) # simulations * n draws from a standard distribution
m <- matrix(a, simulations) # matrix of simulations rows and n columns
sd(apply(m, 1, mean)) # for each row calculate the mean and calculate the SD of them
```

i.e. means of standard normals have $\sigma^2 = 1/\sqrt{n}$, we should get a value that is close enough below

```{r}
1 / sqrt(n)
```

Simulation example: simulations over standard uniforms

```{r}
simulations <- 1000
n <- 10
a <- runif(simulations * n)
m <- matrix(a, simulations)
sd(apply(m, 1, mean))
```

i.e. means of standard uniforms have $\sigma^2 = 1/\sqrt{12n}$, we should get a value that is close enough below

```{r}
1 / sqrt(12 * n)
```

Simulation example: simulations of Poisson distributions (Poisson 4)

```{r}
simulations <- 1000
n <- 10
a <- rpois(simulations * n, 4) # simulations * n draws of a poisson 4
m <- matrix(a, simulations)
sd(apply(m, 1, mean))
```

i.e. means of Poisson 4 distributions $\sigma^2 = 2/\sqrt{n}$, we should get a value that is close enough below

```{r}
2 / sqrt(n)
```

Simulation example: Fair coin flips

```{r}
simulations <- 1000
n <- 10
a <- sample(0:1, simulations * n, replace=TRUE) # simulations * n flips of coin
m <- matrix(a, simulations)
sd(apply(m, 1, mean))
```

i.e. means of coin flips have $\sigma^2 = 1/{2\sqrt{n}}$, we should get a value that is close enough below

```{r}
1 / (2*sqrt(n))
```

Data example

```{r, warning=FALSE, message=FALSE}
library(ggplot2)
library(UsingR); data(father.son)
```

```{r}
x <- father.son$sheight
n <- length(x)
```

Plotting the distribution of son's heights:

```{r}
ggplot(data = father.son, aes(x = sheight)) +
    geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black") +
    geom_density(size = 2, colour = "black")
```

Looks like (is) a gaussian distribution

```{r}
round(c(var(x), var(x)/n, sd(x), sd(x)/sqrt(n)),2) # get some numbers, round them to 2 dec places
```

## Distributions

#### Binomial Trials

$P(X = x) = {n \choose x} p^x . (1 - p)^{n-x}$

What is the probability of getting 7 or more girls out of 8 births, assuming each gender has a 50% probability.

For 7 girls:

$P(X = 7) = {8 \choose 7} 0.5^7 . (1 - 0.5)^1$

For 8 girls:

$P(X = 8) = {8 \choose 8} 0.5^8 . (1 - 0.5)^0$


```{r}
choose(8,7) * .5^8 + choose(8,8) * .5^8
```

The same results using a function for binominal trials

```{r}
pbinom(6, size=8, prob=.5, lower.tail=FALSE)
```

#### Quantiles and Standard Normals

What is the 95th percetile of $X = N(\mu, \sigma^2)$ for $\mu = 10$ and $\sigma = 2$

```{r}
qnorm(.95, mean=10, sd=2)
```

Number of daily clicks on a web site is approx normally distributed with a daily mean of 1020 and SD of 50. What is the probability of getting more than 1160 clicks in a day?

```{r}
pnorm(1160, mean=1020, sd=50, lower.tail=FALSE)
```

Or, different way to solve it: How many SD from mean is 1160?

```{r}
(1160 - 1020) / 50
```

It is 2.8 standard deviations from mean, so, equivalent to a standard normal (mean = 0, SD = 1)

```{r}
pnorm(2.8, lower.tail=FALSE)
```

Same parameters as before, but now what number of clicks would represent the one where 75% of days have fewer clicks?

```{r}
qnorm(.75, mean=1020, sd=50)
```

Note, pnorm and qnorm are inverses, i.e.:

```{r}
pnorm(qnorm(0.53))
```

is 0.53, the same as

```{r}
qnorm(pnorm(0.53))
```

#### Poisson Distributions

The number of people that shows up at bus stop is Poisson with a mean of 2.5 per hour. If watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?

```{r}
ppois(3, lambda=2.5*4)
```

Approximation to Binomial

Poisson can be approximated to binomial when n is large and p is small, $\lambda = n * p$

We flip a coin with a success probability 0.01 five hundred times, what is the probability of 2 or fewer successes?

```{r}
pbinom(2, size=500, prob=0.01)
```

very close result to

```{r}
ppois(2, lambda=500*0.01)
```

## Asymptotics

#### Law of Large Numbers (LLN) in Action

A note on how to use cumsum and a range to generate a running average:

```{r}
cumsum(c(1,3,2,4,5)) # numbers
cumsum(c(1,3,2,4,5))/(1:5) # cumsum of numbers / range is the running average
```

With normal samples

```{r}
n <- 1000
means <- cumsum(rnorm(n))/(1:n)
ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y)) +
    geom_hline(yintercept = 0) + geom_line(size = 2) +
    labs(x = "Number of obs", y = "Cumulative mean")
```

With coin flips

```{r}
means <- cumsum(sample(0:1, n, replace=TRUE))/(1:n) # sample() flips a coin 1000 times
ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y)) +
    geom_hline(yintercept = 0) + geom_line(size = 2) +
    labs(x = "Number of obs", y = "Cumulative mean")

```

You see that the plots converges to 0.5 (the average you are trying to estimate)

#### Confidence Intervals

Give a confidence interval for the average height of sons

Confidence interval for 2 standard deviations: $\bar X \pm {2 \sigma} / {\sqrt{n}}$

Confidence interval for 97.5% percentile: $\bar X \pm {qnorm(0.975) \sigma} / {\sqrt{n}}$

```{r}
x <- father.son$sheight
(mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x)))/12
```

From and to respectivelly. Divided by 12 to have the result in inches, not feet.

In an election your advisor tells you that out of 100 interviewed electors 56 plan in voting for you. Can you assume this election won?

For 95% confidence intervals, $\hat p \pm 1/\sqrt{n}$

For 0.975 quantile, $\hat p \pm \sqrt{p(1-p)/n}$:

```{r}
0.56 + c(-1, 1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)
```

what is equivalent to

```{r}
binom.test(56, 100)$conf.int # 56 successes out of 100 trials
```

Another example: A sample of 9 people have a sample average brain volume of 1100cc and sd of 30cc. What is the complete set of values of $\mu_0$ that a test of $H_0 : \mu = \mu_0$ would fail to reject the null hypothesis in a two sided 5% Students t-test?

```{r}
n <- 9
mu <- 1100
sigma <- 30
quantile <- 0.975 # 95%, i.e. 2.5% on each side of the curve, so right-side tail: 100% - 2.5% = 97.5%
mu + c(-1, 1) * qt(quantile, df=n-1) * sigma/sqrt(n)
```

Another example: some measurement of 9 subjects yielded a 90% confidence interval of 1,077 units to 1,123 units Would you reject in a two sided 5% hypothesis test of $H_0:\mu=1,078 units$?

No, you cannot reject, 1078 is in plausible 90% range of 1077:1123, so it is guaranteed to be in the 95% range.

##### Simulation of a coin flip

Let's see if lower and upper limit of a simulation go above a 95% confidence interval:

```{r}
n <- 20 # number of coin flips on each simulation
pvals <- seq(0.1, 0.9, by=0.05) # success probabilities
simulations <- 1000 # number of simulations
coverage <- sapply(pvals, function(p) { # for each probability
    phats <- rbinom(simulations, prob=p, size=n)/n # generate 1000 binomial simulations with prob p size n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) # lower limit
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) # upper limit
    mean(ll < p & ul > p) # proportion of times they cover the range used to simulate the data
})
ggplot(data.frame(x=pvals, y=coverage), aes(x=x, y=y)) +
    geom_hline(yintercept=0.95) + geom_line(size=1) +
    ggtitle('Coverage for a small N (n=20)') +
    labs(x="probabilities", y="coverage")
```

For n=20 we barely go above a 95% confidence interval.

This is because CLT does not behave well for small numbers of n. Let's try the same with a larger n, say n=100

```{r}
n <- 100 # number of coin flips on each simulation
pvals <- seq(0.1, 0.9, by=0.05) # success probabilities
simulations <- 1000 # number of simulations
coverage <- sapply(pvals, function(p) { # for each probability
    phats <- rbinom(simulations, prob=p, size=n)/n # generate 1000 binomial simulations with prob p size n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) # lower limit
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) # upper limit
    mean(ll < p & ul > p) # proportion of times they cover the range used to simulate the data
})
ggplot(data.frame(x=pvals, y=coverage), aes(x=x, y=y)) +
    geom_hline(yintercept=0.95) + geom_line(size=1) +
    ggtitle('Coverage for a not so small N (n=100)') +
    labs(x="probabilities", y="coverage")
```

We have now several instances where we go above the 95% interval. In each simulation, we have more coin flips, the CLT performs better.

There is a quick fix for CLT on binomial distributions: Agresti/Coull interval. Add 2 success to the random variable and 2 sucesses and 2 failures to number of trials, so that:

$\hat p \pm \sqrt{p(1-p)+2/{n+4}}$:

Let's add 2 successes and 2 failures to the model before and watch how it performs:

```{r}
n <- 20 # number of coin flips on each simulation
pvals <- seq(0.1, 0.9, by=0.05) # success probabilities
simulations <- 1000 # number of simulations
coverage <- sapply(pvals, function(p) { # for each probability
    phats <- (rbinom(simulations, prob=p, size=n) + 2) /(n + 4) # add 2 failures and 2 successes (Agrest)
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n) # lower limit
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n) # upper limit
    mean(ll < p & ul > p) # proportion of times they cover the range used to simulate the data
})
ggplot(data.frame(x=pvals, y=coverage), aes(x=x, y=y)) +
    geom_hline(yintercept=0.95) + geom_line(size=1) +
    ggtitle('Coverage for a small N (n=20) using Agresti/Coull Interval') +
    labs(x="probabilities", y="coverage")
```

So it is kept aabove the 95% confidence interval for all trials of the simulation.

It is recommended the use of Agresti/Coull interval instead of the Wald interval, practical results are better.

##### Confidence Interval of a Poisson Distribution

A nuclear pump failed 5 times out ot 94.32 days. Give a 95% interval for the failure rate per day.

```{r}
x <- 5
t <- 94.32
lambda <- x/t
round(lambda + c(-1, 1) * qnorm(0.975) * sqrt(lambda/t), 3)
```

or in other way:

```{r}
poisson.test(x, T=94.32)$conf
```

Simulation to detect the performance of Poission distributions over different values of lambda

```{r}
lambdavals <- seq(0.005, 0.1, by=0.01)
simulations <- 1000
t <- 100
coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(simulations, lambda=lambda*t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})
ggplot(data.frame(x=lambdavals, y=coverage), aes(x=x, y=y)) +
    geom_hline(yintercept=0.95) + geom_line(size=1) +
    ggtitle('Coverage for a somewhat large N (n=100) in a Poisson distribution') +
    labs(x="lambdas", y="coverage")
```

Never really good, but really bad for small values of lambda. Does it get any better in higher N?

```{r}
lambdavals <- seq(0.005, 0.1, by=0.01)
simulations <- 1000
t <- 1000
coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(simulations, lambda=lambda*t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})
ggplot(data.frame(x=lambdavals, y=coverage), aes(x=x, y=y)) +
    geom_hline(yintercept=0.95) + geom_line(size=1) +
    ggtitle('Coverage for a larger N (n=1000) in a Poisson distribution') +
    labs(x="lambdas", y="coverage")
```


# Student's T Confidence Intervals

## Biometrika Analysis

Quick analysis of the dataframe used by William Gosset (aka Student) in his Biometrika paper, which shows the increase in hours for 10 patients on 2 sleeping drugs.

The group indicates the drug, subjects are repeated, i.e. subject 1 and 11 are the same, so is 2 and 12.

```{r}
data(sleep)
head(sleep)
```

Shows number of extra hours slept, group id and id of the subject.

```{r}
ggplot(sleep, aes(x = group, y = extra, group = factor(ID))) +
    geom_line(size = 1, aes(colour = ID)) +
    geom_point(size =10, pch = 21, fill = "salmon", alpha = .5)
```

Lines connect the same subjects across different drugs, we can clearly see that drug 2 has a stronger impact in hours slept than drug 1.

Let's calculate the Student's T 95% confidence interval.

First calculate the difference between drugs, and the mean and standard deviation of the differences:

```{r}
g1 <- sleep$extra[1:10]; g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10
c(mn, s)
```

The confidence interval of the Student's T distribution. The quantile is calculated by (100-CI%)/2 + CI%, hence for 95% CI the quantile is 0.975 (Why 0.975? Because there's 0.025 to the right, and therefore 0.975 to the left):

```{r}
quantile <- (100-95)/2 + 95 # 97.5
quantile
```
```{r}
mn + c(-1, 1) * qt(quantile/100, n-1) * s / sqrt(n)
```

These should yield similar results:

```{r}
t.test(difference)
```

```{r}
t.test(g2, g1, paired=TRUE)
```

```{r}
t.test(extra ~ I(relevel(group, 2)), paired=TRUE, data=sleep)
```

These all say that with probability .95 the average difference of effects (between the two drugs) for an individual patient is between 0.7 and 2.46 additional hours of sleep.


## Comparing Independent Groups

Comparing SBP (systolic blood pressure) for 8 contraceptive users versus 21 controls:

Contraceptive users: $\bar X = 132.66 mmHg$ and $S = 15.34 mmHg$

Control group: $\bar X = 127.44 mmHg$ and $S = 18.23 mmHg$

Calculate the pooled variance estimate:

```{r}
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 -2))
(132.86 - 127.44) + c(-1, 1) * qt(.975, 27) * sp * (1/8 + 1/21)^.5
```

## Mistakenly Treating Data as Grouped

What happens when you use `t.test` and `paired=TRUE` when data is not paired? Back to the paired database:

```{r}
n1 <- length(g1); n2<- length(g2)
sp <- sqrt( ((n1 -1) * sd(g1)^2 + (n2-1) * sd(g2)^2) / (n1 + n2 - 2))
md <- mean(g2) - mean(g1)
semd <- sp * sqrt(1/n1 + 1/n2)
rbind(
    md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd,
    t.test(g2, g1, paired=FALSE, var.equal=TRUE)$conf,
    t.test(g2, g1, paired=TRUE)$conf
    )
```

## Handling Independent Groups

Load chickweight
```{r, warning=FALSE, message=FALSE}
library(datasets); data("ChickWeight"); library(reshape2); library(dplyr)
head(ChickWeight)
```

Let's look at the data first before we run our tests in a spaguetti plot.

```{r}
ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet, group = Chick)) +
    geom_line() +
    stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +
    facet_grid(. ~ Diet)
```

In dark the averages. We can see that diet 3 and 4 do seem to behave better than 1 and 2, but let's look at confidence intervals of those measurements. Let's do some transformations first.

ChickWeight is in a narrow format, let's make it wide:

```{r}
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var='weight')
head(wideCW)
```

Renaming the time columns:

```{r}
names(wideCW)[-(1:2)] <- paste('time', names(wideCW)[-(1:2)], sep='')
head(wideCW)
```

Make a new column, gain, as the difference between first and last measurements

```{r}
wideCW <- mutate(wideCW, gain=time21-time0)
```

Now analysis: let's look at the gain by diet, and verify by a T confidence interval the variance of 1 and 4:

```{r, warning=FALSE}
ggplot(wideCW, aes(x = factor(Diet), y = gain, fill = factor(Diet))) +
    geom_violin(col = "black", size = 2)
```

We are going to compare diet 1 and diet 4 - our assumption of equal variances does not seem to hold for this case, but let's test them anyways.

```{r}
wideCW14 <- subset(wideCW, Diet %in% c(1,4))
rbind(
    t.test(gain ~ Diet, paired=FALSE, var.equal=TRUE, data=wideCW14)$conf,
    t.test(gain ~ Diet, paired=FALSE, var.equal=FALSE, data=wideCW14)$conf
)
```

On the first case, we considered the variance the same and on the second different.

## T Tests

Going back to our father/son height database. We want to test or $H_0$ that fathers and sons have similar heights. We can start by running `t.test`:

```{r}
t.test(father.son$sheight- father.son$fheight)
```

From that, you can see the test statistic is $t = 11.7885$. The value $t$ is quite large so we REJECT $H_0$ that the true mean of the difference was 0.

The test statisticc $t$ is also the number of estimated std errors between the sample and hypothesized means, i.e.

$mean(differences) = t * sd(differences)/\sqrt{n}$

where `differences` is `father.son$sheight- father.son$fheight`, and $n = df + 1$ i.e.:

```{r}
11.7885 * sd(father.son$sheight- father.son$fheight)/sqrt(1078)
```

What is close enough to `mean of x` from `t.test` above.

Also, note that the confidence interval `CI` does not contain the hypothesized population mean 0 in $H_0$, so we can safely reject $H_0$.

This tells us that either our hypothesis $H_0$ is wrong or we're making a mistake (Type 1) in rejecting it.

Similarly, if a $(1-\alpha)%$ interval contains $\mu_0$ ($H_0$'s $\mu$), then we fail to reject $H_0$.


## P-Values

Suppose you get a $T$ statistic of 2.5 for 15 $df$ testing $H_0 : \mu = \mu_0$ versus $H_a : \mu > \mu_0$.

What is the probability of getting a $T$ statistic as large as 2.5?

```{r}
pt(2.5, 15, lower.tail=FALSE)
```

I.e, 2.27%

And what is the probability of getting a $T$ statistic of less or equal than 2.5?

```{r}
pt(2.5, 15, lower.tail=TRUE)
```

What is around 98.77%

Another example: Consider the test of blood pressure drug, and the following measurements and baseline:

```{r}
subject <- c(1,2,3,4,5)
baseline <- c(140,138,150,148,135)
week2 <- c(132,135,151,146,130)
df <- data.frame(subject, baseline, week2)
df
```

What is the p-value for the associated 2-sided T-test? (considering observations are paired)

```{r}
test <- t.test(x=baseline, y=week2, alternative='two.sided', paired=TRUE)
test$p.value
```


Another example: Coke versus Pepsi. Each of four people was asked which of two drinks they preferred. 3 of the 4 people chose Coke. Find a P-value for a test of the hypothesis that Coke is preferred to Pepsi using a one sided exact test.

```{r}
library(stats)
n <- 4
x <- 3
binom.test(x=x, n=n, p=0.5, alternative='greater')$p.value
```

Another example: Infection rates above 1 infection per 100 person days at risk is a benchmark. A hospital that had previously been above the benchmark had 10 infections over 1,787 person days at risk. What is the one sided P-value test of whether the hospital is below the standard?

```{r}
p <- 1/100 # 1 infection per 100 person days
pbar <- 10/1787 # rate above the benchmark
n <- 1787 # sample period
stdError <- sqrt(p * (1-p)/n)
zTest <- (p - pbar)/stdError
pnorm(zTest, lower.tail = FALSE) # below the standard
```

or

```{r}
poisson.test(10, T=1787, r=1/100, alternative="l")$p.value
```

Another example: 18 subjects were randomized, 9 each, to a new diet pill and a placebo. BMIs were measured at a baseline and again after having received the treatment (or placebo) for four weeks. The average difference from follow-up to the baseline (followup - baseline) was −3 kg/m2 for the treated group and 1 kg/m2 for the placebo group. The corresponding standard deviations of the differences was 1.5 kg/m2 for the treatment group and 1.8 kg/m2 for the placebo group. Does the change in BMI appear to differ between the treated and placebo groups? Assume normality of the underlying data and a common population variance, give a pvalue for a two sided t test.

```{r}
n <- 9
df <- n + n - 2
treatedDiff <- -3
placeboDiff <- 1
meanDiff <- treatedDiff - placeboDiff
sdTreated <- 1.5
sdPlacebo <- 1.8
pooledVariance <- (((n - 1) * sdTreated^2) + ((n - 1) * sdPlacebo^2)) / df
stdErrDiff <- sqrt(pooledVariance/n + pooledVariance/n)
tObserved <- meanDiff / stdErrDiff
2*pt(tObserved, df=df)
```

## Attained Significance Level

Attained significance level is the lowest value of $\alpha$ where you will still reject $H_0$.

Example: A test statistic was 2 (2 standard errors above the hypothesis mean of 30), $H_0 : \mu = 30$ versus $H_a : \mu > 30$. Let's assume a standard normal test statistic. We rejected $\alpha = 0.05$. Will we reject $\alpha = 0.01$? What about $\alpha = 0.001$?

p-value:

(a) The value above the line where SD=2, or in other words:
(b) The smallest value of $\alpha$ where we will still reject the null hypothesis

```{r}
pnorm(2, lower.tail=FALSE)
```

So, we reject $H_0$ for any $\alpha$ > 0.0227

#### Binomial Examples - one sided

A friend has 8 children, 7 are girls and no twins. If each birth is 50% probability boy/girl, what is the probability of getting 7 or more girls in 8 births?

$H_0 : p = 0.5$

$H_a : p > 0.5$

P-value, probability under $H_0$, i.e.:

```{r}
choose(8,7)*0.5^8 + choose(8,8)*0.5^8
```

what is equivalent to:

```{r}
pbinom(6, size=8, prob=0.5, lower.tail=FALSE)
```

So, we will reject any $H_0$ for which p-value > 0.03515

#### Binomial Examples - two sided

Follow a black magic procedure:

1.  Calculate 2 sides probability:

```{r}
leftPValue <- pbinom(6, size=8, prob=0.5, lower.tail=FALSE)
rightPValue <-pbinom(7, size=8, prob=0.5, lower.tail=TRUE)
c(leftPValue, rightPValue)
```

2. Take the smaller side

```{r}
smallerPValue <- min(leftPValue, rightPValue)
```

3. p-value is double the smaller side

```{r}
pValue <- smallerPValue * 2
pValue
```

So, we will reject any two sided $H_0$ for which p-value > 0.07031

#### Poisson Example

Hospital has an infection rate of 10 infections per 100 person/days at risk. Infection rate of 0.05 is a benchmark, if above that we have to implement expensive quality control procedures.

Given that, can a rate > 0.05 be attributed to chance?

$H_0 : \lambda_0 = 0.05$

$H_a : \lambda > 0.05$

```{r}
t <- 100
lambda <- 0.05
events <- 10
leftSideEvents <- events - 1
ppois(leftSideEvents, t * lambda, lower.tail=FALSE)
```

So, the probability of that infection rate be attributed to chance is only 3.18% - so chance can be safely ruled out and this hospital should implement some quality control procedures.

# Power

## Definition

$Power = 1 - \beta$ where $\beta$ is the probability of a type II error. The more power the better (error is a bad thing).

Example: $\mu_a = 32$, $\mu_0 = 30$, $n = 16$, $\sigma = 4$

```{r}
alpha <- 0.05
mu0 <- 30
mua <- 32
sigma <- 4
n <- 16
z <- qnorm(1 - alpha)
pnorm(mu0 + z * sigma/sqrt(n), mean=mu0, sd=sigma/sqrt(n), lower.tail = FALSE)
```

I.e., there is a 5% probability of getting a mean of $\mu_0 = 30$ if we conduct this experiment

```{r}
pnorm(mu0 + z * sigma/sqrt(n), mean=mua, sd=sigma/sqrt(n), lower.tail = FALSE)
```

I.e., there us a 64% probability of getting a mean of $\mu_a = 32$ if we conduct this experiment

## Testing

Since $Power = \sqrt{n} \frac {(\mu_a - \mu_0)}{\sigma}$

`power.t.test` calculates what is missing given what you provide.

As long as you keep the ratio $\frac {(\mu_a - \mu_0)}{\sigma}$ (called effect size) the same, the value of $Power$ does not change:


```{r}
power.t.test(n=16, delta=2, sd=4, type='one.sample', alt='one.sided')$power
```

```{r}
power.t.test(n=16, delta=2/4, sd=1, type='one.sample', alt='one.sided')$power
```

```{r}
power.t.test(n=16, delta=100, sd=200, type='one.sample', alt='one.sided')$power
```

All 60.4%

Calculating $n$ if you have power:

```{r}
power.t.test(power=0.8, delta=2, sd=4, type='one.sample', alt='one.sided')$n
```

```{r}
power.t.test(power=0.8,  delta=2/4, sd=1, type='one.sample', alt='one.sided')$n
```

```{r}
power.t.test(power=0.8, delta=100, sd=200, type='one.sample', alt='one.sided')$n
```

All 60.4%

## Examples

In a study of 100 adults to measure a four year mean brain volume loss of .01 $mm^3$. The $sd$ of four year volume loss is .04 $mm^3$. About what would be the power of the study for a 5% one sided test versus a null hypothesis of no volume loss?

```{r}
n <- 100
mu <- 0.01
sigma <- 0.04
level <- 0.05 # 5%
power.t.test(n=n, delta=mu, sd=sigma, sig.level=level, type='one.sample', alternative='one.sided')$power
```

Same as above, if want to test $n$ adults, what is the value of $n$ needed for 90% power of type one error rate of 5% one sided test versus a null hypothesis of no volume loss?

```{r}
power <- 0.9
power.t.test(power=power, delta=mu, sd=sigma, sig.level=level, type='one.sample', alternative='one.sided')$n
```


# Multiple Comparisons

## Green Jelly Beans Cause Acne

[Green Jelly Beans Cause Acne!](https://xkcd.com/882/)

## Adjusting p-values

After you adjust them, cannot be used as regular p-values for hypothesis testing

### Case study I: no true positives

```{r}
set.seed(1010093)
pValues <- rep(NA, 1000)
for(i in 1:1000) {
    y <- rnorm(20)
    x <- rnorm(20)
    pValues[i] <- summary(lm(y ~ x))$coef[2,4]
}
sum(pValues < 0.05)
```

Controlling FWER (family wide error rate):

```{r}
sum(p.adjust(pValues, method='bonferroni') < 0.05)
```

Controlling FDR (false positive rate), Benjamini–Hochberg correction (BH)

```{r}
sum(p.adjust(pValues, method='BH') < 0.05)
```

### Case study II: 50% true positives

```{r}
set.seed(1010093)
pValues <- rep(NA, 1000)
for(i in 1:1000) {
    x <- rnorm(20)
    if (i <= 500) {
        y <- rnorm(20)
    } else {
        y <- rnorm(20, mean=2*x)
    }
    pValues[i] <- summary(lm(y ~ x))$coef[2,4]
}
trueStatus <- rep(c('zero', 'not zero'), each=500)
table(pValues < 0.05, trueStatus)
```

Controlling FWER (family wide error rate):

```{r}
table(p.adjust(pValues, method='bonferroni') < 0.05, trueStatus)
```

Controlling FDR (false positive rate), Benjamini–Hochberg correction (BH)

```{r}
table(p.adjust(pValues, method='BH') < 0.05, trueStatus)
```

P-values versus adjusted p-values

```{r}
par(mfrow=c(1,2))
plot(pValues, p.adjust(pValues, method='bonferroni'), pch=19)
plot(pValues, p.adjust(pValues, method='BH'), pch=19)
```


# Bootstrapping

## Definition

Basic bootstrap procedure:

* Get a number $n$ of samples
* Put all samples in a bag
* Build $B$ different collection of samples ($B$ different simulations), in which every sample is taken from that bag,  *with replacement* (meaning, after you take an item of the bag, put it back in the bag)

B should be as large as possible, so the montecarlo error is small.

As a result, instead of one collection of samples of size $n$, you have $B$ samples, each with $n$ size (way more data). How to generate it:

```{r}
library(UsingR)
data(father.son)
x <- father.son$sheight # son's height
n <- length(x)
B <- 1000 # number of bootstraped samples
resamples <- matrix(sample(x, n * B, replace=TRUE), B, n) # B rows, n columns
resampledMedians <- apply(resamples, 1, median)
```

Things we can do with it:

Does it look like a normal distribution? Let's look at its density estimate (aka histogram)

```{r}
hist(resampledMedians)
```

or

```{r}
ggplot(data.frame(medians=resampledMedians), aes(x=resampledMedians)) +
    geom_histogram(color='black', fill='lightblue', binwidth=0.05)
```

comparing that to the original distribution

```{r}
ggplot(data.frame(son.heights=x), aes(x=x)) +
    geom_histogram(color='black', fill='lightgreen', binwidth=5)