BIOS601 AGENDA: Tuesday September 18, 2018
[updated Sept 15, 2018]
  -  Discussion of issues in JH's Notes and assignment on mean/quartile of a quantitative variable
 
 Answers to be submitted by Friday 21st: 
  Exercises 0.10, 0.11, 0.12, 0.13, 0.19, 0.20 (see below)
 
 But, first, a few remarks on the Notes themselves:
 
 Section 1. Notice the focus on the shape of a very specific type of distribution,
  or random variable, namely one based on a sum/mean/combination of
  several (usually n) other (usually i.i.d.) random variables.
  This shape is often very different from that of the 'component' random variables.
  You might say that the 'component' random variables have distributions
  generated (made) by nature, whereas those generated by aggregation of samples
  are 'man'-made, i.e., 'researcher'-made, distributions, with shapes determined
  in a small (or even negligible) part by the shape of the
  distribution of the individual r.v.'s, but in a large (and dominant) part by the
  number of, and degree of independence between, the individual r.v.'s aggregated.
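 To see this contrast concretely, here is a minimal simulation sketch (in
  Python/numpy, not part of the Notes; the exponential component is just an
  arbitrary skewed choice) comparing a 'nature-made' component r.v. with the
  'researcher-made' distribution of the mean of 30 of them:

    # Sketch: shape of a skewed component r.v. vs the shape of the mean of 30 of them
    import numpy as np

    rng = np.random.default_rng(601)
    n_sims = 100_000

    component = rng.exponential(scale=1.0, size=n_sims)                      # one 'nature-made' r.v.
    mean_of_30 = rng.exponential(scale=1.0, size=(n_sims, 30)).mean(axis=1)  # 'researcher-made' aggregate

    def skew(x):
        return ((x - x.mean())**3).mean() / x.std()**3

    for label, x in [("single exponential", component), ("mean of 30", mean_of_30)]:
        print(f"{label:>18}: mean={x.mean():.3f}  SD={x.std():.3f}  skewness={skew(x):.3f}")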
 Notice also, in section 1.3, the different terminology used for the standard
  deviation (SD) of an individual component r.v. and the standard
  error (SE) of the aggregate r.v. An SE
  typically involves a 'plug-in' estimate, and always (even if it is not explicit) a
  1/sqrt(n) multiplier of the 'unit' (per-observation) standard deviation. Comfort with,
  and an appreciation of, the central role of SEs is a prerequisite for work in applied
  statistics. When teaching the
  epidemiology students, JH used to say "the SD is for the
  variation of individuals;
  the SE is for the (sampling) variation of a
  statistic".
 Think of a statistic as an observable quantity calculated
  from a sample of n observations. Think of a
  parameter as an unobservable (but estimable) quantity
  relating to a physical population,
  such as the average or median depth of the ocean, the concentration of radon in
  Canadian homes, or (if the object
  of inference is an individual person) the average amount of that person's
  physical activity, or level of blood pressure, or time spent indoors,
  over a year; or think of it as an unobservable
  (but estimable) constant of nature, such as the speed of light, or the ratio
  of the volume of a sphere to the cube of its radius.
 
 DIGRESSION on historical (statistical) uses of the mean
  Some material from a new,  
  entertaining, and interesting book -- by a master of storytelling.
 
 Section 2.
 
 The Notes (and particularly the examples) make
  explicit that
  the number of individual r.v.'s aggregated
  plays a central role
  in determining the shape of
  the aggregate's sampling distribution. The centrality and importance of the Central
  Limit Theorem (CLT) in applied statistics cannot be over-emphasized.
 
 In theoretical statistics, it is often presented as a mere
  mathematical result, and seldom (as 'Student' said in a different context)
  is one given any sense of when (i.e., at what 'n') the law 'kicks in.' JH has the impression,
  from some PhD theses of statistics students he has examined, that
  the 'holy grail' is simply to establish asymptotic normality,
  with no consideration of at what 'n' this becomes an acceptably accurate
  approximation, or whether some change of scale (e.g., log or logit)
  might not make for a better normal approximation.
  He vividly remembers one PhD student who, in his effort to
  prove asymptotic normality, used a Lemma stating that if the log of
  the r.v. goes to Normal, so must the r.v. itself.
  What the student and supervisor hadn't bothered to discover (but what a few plots
  would have readily shown)
  is that, in the case in point, the log of the r.v. went to normal
  faster than the r.v. itself did!
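 The same phenomenon can be seen in a rough simulation sketch (Python; the
  odds-ratio setting and the sample sizes are invented for illustration and are
  not the thesis example): the log of a ratio-type estimator is often much
  closer to normal, at a given n, than the estimator itself.

    # Sketch: skewness of a sample odds ratio vs skewness of its log
    import numpy as np

    rng = np.random.default_rng(601)
    n, p1, p2, n_sims = 50, 0.3, 0.5, 50_000

    a = rng.binomial(n, p1, n_sims)          # events in (hypothetical) group 1
    b = rng.binomial(n, p2, n_sims)          # events in (hypothetical) group 2
    or_hat = (a / (n - a)) / (b / (n - b))   # sample odds ratio
    log_or = np.log(or_hat)

    def skew(x):
        return ((x - x.mean())**3).mean() / x.std()**3

    print(f"skewness of OR:      {skew(or_hat):.2f}")   # clearly right-skewed
    print(f"skewness of log(OR): {skew(log_or):.2f}")   # much closer to 0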
 
 Unfortunately, it takes a bit of experience
  watching the interplay ('battle') between the degree of
  non-normality of the individual (component) r.v.'s and the size of n
  to develop some intuition and appreciation for the 'n' at which
  the CLT kicks in. The point is that it is context-specific:
  with well-behaved individual (component) r.v.'s, it happens well before
  the n=30 threshold that many courses teach. Indeed, when doing simulations,
  one can subtract 6 from the sum of just 12 U(0,1) r.v.'s
  and use the result as an acceptably accurate
  (i.e., accurate enough for government work)
  N(0,1) r.v. Sometimes
  (as in the case of the 'insurance premiums' example), it takes an
  n in the 100s or 1000s, or more.
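 A quick simulation check of the '12 uniforms minus 6' trick (Python sketch;
  the quantiles compared are an arbitrary choice):

    # Sketch: sum of 12 U(0,1) r.v.'s, minus 6, compared with a true N(0,1)
    import numpy as np

    rng = np.random.default_rng(601)
    z_approx = rng.uniform(0, 1, size=(100_000, 12)).sum(axis=1) - 6
    z_exact = rng.standard_normal(100_000)

    for q in (0.025, 0.25, 0.5, 0.75, 0.975):
        print(f"q={q:5.3f}:  12-uniform {np.quantile(z_approx, q):6.3f}   "
              f"true N(0,1) {np.quantile(z_exact, q):6.3f}")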
 
 Thus, it is worth exploring as many
  of the CLT examples, and the simulations, as you have time for.
  If you are really short of time, and can only look at one, look
  at 'The Central Limit Theorem in Action' in the Graphs Figures Tables
  Computing section of the
  Resources website. JH used this example (but without numbers) when he was
  co-teaching with, and handing over 607 to, Lawrence Joseph, and Lawrence
  (and his very artistic wife) gave it numerical and artistic expression.
  The example also shows that
  the CLT is not limited to sums or means of i.i.d. r.v.'s,
  but also applies to i.NOT-i.d. r.v.'s. The key property is the first
  'i', the INDEPENDENCE,
  since it is the independence that helps in the
  Law of Cancellation
  of Extremes (JH's name for the CLT!).
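 A small sketch of that i.NOT-i.d. case (Python; the 40 Bernoulli components
  with different probabilities are just one invented example of independent but
  non-identically distributed r.v.'s):

    # Sketch: the standardized sum of independent, NON-identically distributed
    # Bernoulli r.v.'s still behaves very much like N(0,1)
    import numpy as np

    rng = np.random.default_rng(601)
    n_sims = 100_000
    probs = np.linspace(0.05, 0.95, 40)   # 40 components, each with its own p

    totals = (rng.uniform(size=(n_sims, probs.size)) < probs).sum(axis=1)
    z = (totals - probs.sum()) / np.sqrt((probs * (1 - probs)).sum())

    print("skewness of Z:", round(((z - z.mean())**3).mean() / z.std()**3, 3))
    print("P(|Z| > 1.96):", np.mean(np.abs(z) > 1.96), " (vs 0.05 for N(0,1))")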
 
 Section 3.
 
 Nothing very remarkable here (mostly
  manipulating formulae!) except to say that, on re-reading these
  Notes, JH would now add the qualifier that the formulae
  work correctly only for independent (or, at a minimum, uncorrelated)
  estimates.
 
 For a good example of 'several estimates of a single parameter'
  think of several independent estimates of the speed of light,
  or in the Cavendish example, of the density of the earth.
  In this case, the weights should be precision-based.
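 A minimal sketch of such precision-based (inverse-variance) weighting, with
  made-up numbers standing in for the independent estimates and their SEs
  (Python):

    # Sketch: combine independent estimates of one parameter, weighting by precision
    import numpy as np

    estimates = np.array([299850.0, 299740.0, 299790.0])   # hypothetical estimates
    ses       = np.array([60.0, 40.0, 30.0])               # their standard errors

    weights  = 1.0 / ses**2                                 # precision = 1 / variance
    combined = np.sum(weights * estimates) / np.sum(weights)
    se_comb  = np.sqrt(1.0 / np.sum(weights))
    print(f"combined estimate: {combined:.1f}  (SE {se_comb:.1f})")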
 
 Section 4.
 
 Student's problem was not about the n at which the CLT kicks in
   [he was already assuming the component r.v.'s are N(mu, sigma)]
   but about when the sample standard deviation (s) is a good substitute (proxy) for
   the 'true' but unknown 'population' standard deviation (sigma):
    "no one has yet told 
   us very clearly where the limit between
    'large' and 'small' samples is to be drawn."
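 A small simulation sketch of Student's point (Python, using scipy for the
  Normal and t quantiles; n = 4 and the N(0,1) data are arbitrary choices): with
  sigma unknown and n small, intervals that treat s as if it were sigma
  under-cover, while t-based intervals do not.

    # Sketch: coverage of z-based vs t-based 95% intervals when n = 4
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(601)
    n, mu, sigma, n_sims = 4, 0.0, 1.0, 50_000

    samples = rng.normal(mu, sigma, size=(n_sims, n))
    xbar = samples.mean(axis=1)
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)

    z_mult = stats.norm.ppf(0.975)          # 1.96
    t_mult = stats.t.ppf(0.975, df=n - 1)   # about 3.18 when n = 4

    print("z-based coverage:", np.mean(np.abs(xbar - mu) <= z_mult * se))
    print("t-based coverage:", np.mean(np.abs(xbar - mu) <= t_mult * se))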
 
 In connection with the 100th anniversary of Student's
    ground-breaking 1908 paper, JH and colleagues went back to
    that paper, and to the way he mathematically derived  
    what is now called the 't' distribution.
    Full details, including why Gosset called himself 'Student',
    and his simulations to check out his shaky algebra,
    can be found at Article/Material in connection with 100th 
    Anniversary of Student's 1908 paper.
 
 The two persons in the photo
    with JH at the reception at the Guinness Brewery in Dublin in 2008
    are the grandson and granddaughter of William Gosset ('Student').
    At the unveiling of the plaque, the grandson told us that he
    was pretty sure he alone, of those assembled in 2008, had personally
    met Gosset. He was 6 months old when he was brought into
    the hospital to see his grandfather, a few months before the grandfather died
    in 1937 in London. (Gosset was English-born and educated, but worked
    for Guinness in the Dublin HQ from 1899 onwards. In 1935,
    he moved to London to take charge of the scientific side of production
    at a new Guinness brewery at Park Royal in North West London,
    but died just two years later, at the age of 61.)
    Short bio.
 
 Interestingly, that 1908 paper was of limited use, since it dealt only
    with 1-sample problems. It took Fisher's insights in the 1920s
    to generalize it to not just 2-sample problems, but also
    correlation and regression, indeed to any context where one was dealing with
    a ratio of a mean or correlation or slope to its standard error;
    in turn, the SE involved the sqrt of an independent
    plug-in estimate of the unit variance. Fisher called the no.
    of independent contributions to that estimate the "degrees of freedom".
    In this context, JH usually defines the "d.f." as "the
    number of independent estimates of error": think of the
    number of independent residuals (which one "pools" to
    get one overall estimate of sigma-squared) as a case in point. It
    is no different in spirit from pooling the squared within-group
    deviations from their own means
    [they are also residuals, from each fitted (ie group) mean].
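 A tiny worked sketch of 'd.f. = the number of independent estimates of error'
  (Python; the two small groups are made-up data):

    # Sketch: pooling squared within-group deviations from two groups
    # gives an estimate of sigma-squared on (n1 - 1) + (n2 - 1) d.f.
    import numpy as np

    rng = np.random.default_rng(601)
    g1 = rng.normal(10, 2, size=8)    # hypothetical group 1, n1 = 8
    g2 = rng.normal(12, 2, size=6)    # hypothetical group 2, n2 = 6

    ss_within = ((g1 - g1.mean())**2).sum() + ((g2 - g2.mean())**2).sum()
    df = (len(g1) - 1) + (len(g2) - 1)           # 12 independent residuals
    print(f"pooled estimate of sigma^2: {ss_within / df:.2f} on {df} d.f.")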
 
 In "Another worked Example, with graphic", JH is trying to get
    statisticians and their collaborators to use a better way to display
    paired data: the usual presentations involve separate SE's for the two
    means, as though the one mean was from one sample of n, and the other 
    from and entirely separate (independent) sample of n.
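 A small sketch of the point (Python; the before/after numbers are invented):
  for paired data, the SE of the mean within-pair difference is what matters,
  and it can be far smaller than two 'independent-sample' SEs would suggest.

    # Sketch: SE of the mean difference for paired data vs the 'two independent samples' version
    import numpy as np

    rng = np.random.default_rng(601)
    before = rng.normal(100, 15, size=10)           # hypothetical measurements on 10 subjects
    after = before + rng.normal(-5, 5, size=10)     # re-measurements, correlated with 'before'

    d = after - before
    se_paired = d.std(ddof=1) / np.sqrt(len(d))
    se_naive = np.sqrt(before.var(ddof=1) / len(before) + after.var(ddof=1) / len(after))
    print(f"SE of mean difference (paired):        {se_paired:.2f}")
    print(f"SE pretending two independent samples: {se_naive:.2f}")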
 
 Section 4.3/4.4
 
 Other years, we left sample size and precision issues
   until later in the course, where we planned to deal with
   them 'en masse.' But many years we never had the time
   at the end of the course. So this year, following on from
   the calculations you did
   with the step-counter data,
   Q 0.11 will ask you to visit this section, and Figure 4 in particular.
 
 Remarks on assigned exercises.
 
 0.10 (Planning ahead)
  This relates to the  topic of asymmetry. A factor that slows down the march
  towards a Normal distribution (of the sum or mean) is the lack of
  symmetry of the individual summands. Another slowing factor is any lack
  of independence among the summands.
 
 0.11 Shape of waiting time distribution:

  Using the shorthand 'DU' for 'Discrete Uniform', the wait is DU(22-26) minus
  DU(1-5).
  If the throw on die_1 (die is the singular of dice) could be DU(1-6),
 and the throw on die_2 also DU(1-6), what would be the shape of the
 distribution of die_1 + die_2? of die_1 - die_2?
 What if we had die_1 +/- die_2 +/- die_3? What if we had continuous r.v.'s:
 U_1(0,1) +/- U_2(0,1)?
 U_1(0,1) +/- U_2(0,1) +/- U_3(0,1)? The shape for the sum/difference of 3 continuous U's
 is smooth, whereas the shape for the sum/difference of 2 continuous U's has a sharp mode.
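 If you want to check your answers numerically, here is a rough sketch
  (Python, not required for the exercise) that tabulates the exact dice
  distributions and crudely bins the simulated continuous sums:

    # Sketch: exact distributions for two dice; simulated sums of continuous uniforms
    import numpy as np
    from itertools import product

    faces = range(1, 7)
    for label, vals in [("die_1 + die_2", [a + b for a, b in product(faces, repeat=2)]),
                        ("die_1 - die_2", [a - b for a, b in product(faces, repeat=2)])]:
        values, counts = np.unique(vals, return_counts=True)
        print(label)
        for v, c in zip(values, counts):
            print(f"  {v:3d} | " + "*" * int(c))

    rng = np.random.default_rng(601)
    u = rng.uniform(0, 1, size=(200_000, 3))
    for label, x in [("U_1 + U_2", u[:, 0] + u[:, 1]), ("U_1 + U_2 + U_3", u.sum(axis=1))]:
        hist, _ = np.histogram(x, bins=12, density=True)
        print(label, "density by bin:", np.round(hist, 2))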
 
 0.12 Snail's pace: Given the 'topic of the day',
  you can probably guess the answer to part iii. Without the CLT,
  you wouldn't get very far just using Tchebychev's theorem!
 
 0.13. New this year. So, email JH if
  the wording -- or anything else -- is unclear.
 
 0.19 Bootstrap Investigation of Sampling Variability of an estimator:
  The bootstrap was developed to quantify the
  `difficult to study analytically' behaviour of estimators,
  and so it suits the purpose here: rather than relying on your (or my) intuition
  that 30 is enough or 200 is enough, you can effectively simulate the variation.
  Of course, it would have been a tip-off that the large-sample interval estimate
  was inappropriate if the CI based on the Gaussian model
  (CLT-based) included a negative mu! Interesting that it was a former
   bios601 student, now a faculty member, who
   raised the possibility of doing so. So, I am never too old to learn,
  and I still consider myself -- like Gosset -- a 'student'.
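 A generic bootstrap sketch (Python; the skewed 'data' below are a stand-in,
  not the 0.19 data set):

    # Sketch: resample the observed sample, with replacement, to see the
    # sampling variability of the mean 'empirically'
    import numpy as np

    rng = np.random.default_rng(601)
    sample = rng.exponential(scale=3.0, size=30)   # stand-in data, n = 30, skewed

    boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                           for _ in range(10_000)])
    print(f"sample mean:  {sample.mean():.2f}")
    print(f"bootstrap SE: {boot_means.std(ddof=1):.2f}")
    print("percentile 95% CI:", np.round(np.quantile(boot_means, [0.025, 0.975]), 2))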
 
 I have put a link to Efron's and Diaconis' very readable introduction
  to the bootstrap, in the popular science magazine Scientific American,
  in the Resources, and it is also available
   here
 
 0.20 Planning ahead - the (2015) sequel: JH found himself
  thinking probabilistically on Orientation Day 2015, when he had to leave the lunch early to get
  to his appointment.
 
 0.21 Laplace, before computers: You have
  to both admire and sympathize with Laplace, who could come up with
  an exact, but at the time entirely impractical, probability formula -- one that only
  became more unwieldy as n increased. But his 1810 (normal) approximation for the
  distribution of sums of n i.i.d. r.v.'s from any well-behaved family (not just the uniform)
  got better the greater the n.
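 A rough sketch (Python) contrasting the exact probability for the sum of n
  U(0,1) r.v.'s (the alternating-sign 'Irwin-Hall' expression, whose terms
  multiply as n grows) with the normal approximation; the cut-points are an
  arbitrary choice:

    # Sketch: exact P(U_1 + ... + U_n <= x) vs the normal approximation
    import math

    def exact_cdf(x, n):
        """Exact CDF of the sum of n independent U(0,1) r.v.'s, for 0 <= x <= n."""
        return sum((-1)**k * math.comb(n, k) * (x - k)**n
                   for k in range(int(math.floor(x)) + 1)) / math.factorial(n)

    def normal_cdf(x, mean, sd):
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

    for n in (3, 12, 30):
        x = 0.6 * n                                    # an arbitrary cut-point
        print(f"n={n:2d}: exact {exact_cdf(x, n):.4f}   "
              f"normal approx {normal_cdf(x, n / 2, math.sqrt(n / 12)):.4f}")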
 
 0.22 Dice for Statistical Investigations: Another
  enterprising, but much more practical (and self-taught), statistician and polymath.
  Maybe if we get access to a 3D printer, we can build a pentakisdodecahedron
  (a golf ball would be ok too, but it has too many 'dimples' and they are too small
  to write the values into them!).