BIOS601 AGENDA: week of Monday September 19 & Wednesday September 21, 2022
[updated Sept 16, 2022]
  
  -   Discussion of issues
  in JH's
  Notes and assignment on the mean/quartile of a quantitative variable
 
 Exercises (I = Individual;
 A = Niki + Anji; B = Léa + Stephanie; C = Xianglin + Xing; D = Luce + Misha; E = A+B; F = C+D)
 
 0.01: I
 0.02: A B C D
 0.03: I
 0.04: A B C D
 0.05: E F
 0.06: I
 0.07: I
 0.08: E F
 0.09: I
 0.10: I
 0.11: E F
 0.13: Read-only -- see Q1
 0.14: A B C D
 0.16: E F -- present in class Wednesday
 0.17: E F -- present in class Wednesday
 0.19: E F -- present in class Wednesday
 0.20: E F
 0.21: E F
 0.23: A B C D -- present in class Wednesday
 0.24: E F
 
 Remarks on Questions:
 
 0.1. For ii, you could pretend it is exactly triangular, highest at age 0, lowest at say 90.
  If continuous, that would make it a (scaled) beta(1,2) distribution.
  For iii, be guided by the convolution results (see code & Q21).
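
 As a quick check, here is a minimal R sketch (the upper age of 90 is just the illustrative value above): the triangular density and the scaled beta(1,2) density coincide.

    ## A triangular density on [0, 90], highest at age 0 and falling to 0
    ## at age 90, is a scaled beta(1,2): the two curves lie on top of
    ## each other.
    curve(dbeta(x/90, 1, 2)/90, from = 0, to = 90,
          xlab = "age", ylab = "density")            # scaled beta(1,2)
    curve((2/90) * (1 - x/90), add = TRUE, lty = 2)  # triangular density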
 
 0.2. Things have changed considerably in the last 50 years, and the distribution of
  the no. of authors has become much wilder.
 
 0.3. I am not sure any of the honourable gentlemen were right! What if the minimum wage were
  tied to the median?
 
 0.4. What shape is the distribution of age at death?
  This would explain why
  this author found that more than 50% lived past the mean (link).
  The study design the writer used -- and improvements in medicine and public health --
  also helped increase the proportion!
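
 A rough simulation makes the point (the beta shape here is an arbitrary stand-in for a left-skewed age-at-death distribution, not the author's data):

    ## With a left-skewed distribution, the mean sits below the median,
    ## so more than half of the deaths occur past the mean.
    set.seed(601)
    age <- 100 * rbeta(1e5, 5, 2)     # invented left-skewed ages at death
    mean(age > mean(age))             # noticeably more than 0.5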
 
 0.5. This is an important public health topic even today (see Flint, Michigan).
  The article introduced the geometric mean. We will come back in Ch. 3 to fitting these
  distributions using parametric distributions.
  In 'the old days' it was common to first reduce the data to grouped data,
  using bins and bin-frequencies. Back then, saving on tedious arithmetic was a driving
  principle (and graphs were harder to make).
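
 As a small sketch of the two routes to the geometric mean (the lognormal 'concentration' data and the bin width of 2 are invented for illustration):

    ## Geometric mean from raw data, and from grouped (binned) data as
    ## in 'the old days'.
    set.seed(601)
    x <- rlnorm(500, meanlog = 2, sdlog = 0.5)   # skewed concentrations
    exp(mean(log(x)))                            # geometric mean, raw data
    breaks <- seq(0, ceiling(max(x)), by = 2)
    mid    <- head(breaks, -1) + 1               # bin mid-points
    freq   <- as.numeric(table(cut(x, breaks)))  # bin frequencies
    exp(sum(freq * log(mid)) / sum(freq))        # geometric mean, grouped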
 
 0.6. This is meant to get you to be clear on SD vs. SE (specifically the SEM).
  If you back-calculate the SD in each row, you should find them to be
  fairly constant (apart from sampling variation, of course).
  Part iii brings in the F distribution for the ratio of 2 sample variances.
  It is the non-symmetric shape of the F (plot it!) that makes for the not quite
  50:50 probabilities.
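
 A minimal sketch of the back-calculation and of the F-ratio point (the SEM of 2.1, the n of 25, and the dfs of 9 and 19 are invented numbers):

    ## SD back-calculated from a reported SEM: SD = SEM * sqrt(n).
    sem <- 2.1; n <- 25
    sem * sqrt(n)                     # implied SD
    ## Part iii: plot the (non-symmetric) F, and note the tail split.
    curve(df(x, 9, 19), from = 0, to = 4, ylab = "F(9, 19) density")
    pf(1, 9, 19)                      # P(ratio < 1) is not exactly 0.5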
 
 Beginning biostatisticians often find it
 difficult to give a rough guess for what the SD should be. In the case
 of heights of adult (fe)males,
 JH asks them to think of, say, the middle 95% of the distribution, i.e., from
 someone 'quite short' to someone 'quite tall', then to equate this range
 with 4 SDs and back-calculate. For neck or waist or head sizes, do likewise.
 A favourite question that he uses in interviews for statistical
 research assistants is to give the candidate a number for the SD and ask if it seems
 reasonable. For example, does the range of
 left middle finger lengths
 in Table III of the excerpt from Macdonell's (1901) report seem reasonable to you?
 Remember, you have seen a lot of people, and their fingers, over your lifetime,
 so you are (or should be) an expert in this topic, and be able to
 come up with a rough SD, and in the correct units!
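
 The back-calculation, in one line (the 'quite short' and 'quite tall' height guesses are assumptions, not data):

    ## The middle 95% spans roughly 4 SDs, so guess the endpoints and divide.
    (183 - 155) / 4                   # rough SD of adult female heights, in cm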
 
 In the case of head sizes, think of hat sizes.
 And if you like, don't think of an absolute SD, but of the SD as a percentage
 of the mean. For height, what would it be? For weight (over which
 people have more influence!), would it be a bigger percentage?
 Would it be a bigger percentage for finger length than for height?
 What would this percentage (also called the coefficient of variation, or 'CV')
 be for the head sizes, measured in utero by ultrasound, of babies
 of 13 weeks' gestation? 26 weeks'? 40 weeks'? The much smaller CV at 13 weeks is used
 to 'date' the pregnancy, i.e., to provide a good estimate of the gestational age: after
 13 weeks, head sizes become more individualized and variable.
 
 If asked, many researchers (and even many not-so-young biostatisticians!)
 would confidently answer that the SD of the heights of 100 randomly sampled individuals
 would be bigger than the SD of the heights of 10 randomly sampled individuals.
 Many others would confidently answer that the SD of the 100
 would be smaller than the SD of the 10. In fact, it is nearly impossible
 to consistently predict which would be greater than which -- try it out using R!
 A common but wrong reasoning is based on the fact that the sample SD involves
 an n-1. But in fact, the squared SD is nothing more than an average squared
 deviation, so the n (or n-1) has virtually nothing to do with it.
 If sampling from a N(mu, sigma^2) distribution,
 the sampling distribution of the variance ratio, with df 99 and 9 in our e.g., is
 not quite symmetric -- but the median ratio is close to 1!
 The other common mistake is the same one made by Epstein -- mixing up SD and SE!
 A population SD is determined by nature -- think of the SD of the diameters
 of all of one's red blood cells. The SD of the millions of these cells
 is not determined or influenced by a researcher -- the SD is a property
 of the owner of the cells! The sample SD that the researcher sees in his/her
 sample of n cells is just an estimate of the owner's SD. Clearly,
 if the n is larger, the sample SD is more reliable, and doesn't tend to fall as
 far from the true SD as the sample SD of a smaller sample would.
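
 Here is the suggested R experiment (the N(170, 7^2) height distribution is an assumption, chosen only to be plausible):

    ## The sample SD of 100 observations is not consistently bigger
    ## (or smaller) than that of 10: the 'winner' is close to a coin toss.
    set.seed(601)
    sd100 <- replicate(1e4, sd(rnorm(100, 170, 7)))
    sd10  <- replicate(1e4, sd(rnorm( 10, 170, 7)))
    mean(sd100 > sd10)                # close to, but not exactly, 0.5
    pf(1, 99, 9)                      # P(variance ratio < 1): not quite 0.5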
 
 0.7. How myths get started. This day-of-week variation is now quite pronounced. See
  the link.
 
 This q. emphasizes
  (i) day-to-day variability; (ii) systematic differences between weekdays and weekends;
  (iii) the fact that it takes a big population, with lots of births, to reliably
  see the 'signal', i.e., the difference between weekdays and weekends (see the sketch
  below) -- think
  of the weekday variation as driven by Nature, and the weekend pattern
  by physicians who want to have their weekends free, or would prefer
  that women deliver during the week when the hospital is more fully staffed; and (iv)
  the fact that the weekday variations are already close to Gaussian, so
  the sampling distribution of the mean of several days can be acceptably
  approximated by the Normal without needing a big 'assist' from the CLT.
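
 A sketch of point (iii); the 15% weekend deficit and the mean daily birth counts (3000 vs. 30) are invented figures:

    ## With ~3000 births/day the weekend 'signal' is obvious; with ~30
    ## births/day it is swamped by the Poisson noise.
    set.seed(601)
    big   <- list(weekday = rpois(50, 3000), weekend = rpois(20, 2550))
    small <- list(weekday = rpois(50, 30),   weekend = rpois(20, 25.5))
    par(mfrow = c(1, 2))
    boxplot(big,   main = "big population")    # clear separation
    boxplot(small, main = "small population")  # signal lost in noise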
 
 The NYC blackout led to an amusing 'urban myth', but getting
   a sound answer to the
   question posed in part (iii) concerning the Quebec government incentive
   to increase the birth rate isn't quite so easy, and would take some
   ingenuity -- that's what makes epidemiologic research (with its limited
   opportunities for experimental control) more challenging and more
   interesting.
 
 0.8. CAREFUL: you can treat the 4 tires as being in series rather than in parallel, i.e.,
  the expected no. of ruptures should be 6, not 1.5.
 
 See a modified version here (link), or the original here (link).
 
 This 'JH-homegrown' question focuses on an important link between
  the tail area of the Poisson distribution and the (opposite)
  tail area of the gamma (or chi-sq) distribution.
  Before the universal access to statistical packages we now enjoy,
  this link was important for computational reasons: one could use
  tabulated tail areas of the chi-sq distribution to directly
  obtain Poisson tail areas.
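
 The link, checked numerically in R (lambda = 6 and k = 3 are arbitrary values):

    ## For X ~ Poisson(lambda) and T ~ gamma(shape = k+1, rate = 1):
    ##   P(X <= k) = P(T > lambda) = P(chi-sq on 2(k+1) df > 2*lambda).
    lambda <- 6; k <- 3
    ppois(k, lambda)                                             # Poisson tail
    pgamma(lambda, shape = k + 1, rate = 1, lower.tail = FALSE)  # gamma tail
    pchisq(2 * lambda, df = 2 * (k + 1), lower.tail = FALSE)     # chi-sq tail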
 
 Fisher's 
  1935 derivation of this exact (and at first, surprising)
  link between the tail area of a discrete r.v. and the (opposite)
  tail of a continuous r.v. is not that easy to follow.
  If you want to 'see' the link more clearly, look at the form of the P value in
  Illustration I p 168 or 
  V p 170 in Pearson's 
   classic 1900 paper on the
  chi-sq goodness of fit statistic.
 
 The main purpose of the exercise is to get you to
  derive the link in an applied setting, by pure thought, rather than by
  blind algebra. The link between the continuous and discrete r.v.'s
  is that the distance (to the k-th failure) is a finite sum of distances, and the (discrete)
  number is the number of failures/replacements before a given distance is achieved:
  the k-th failure occurs within distance d exactly when there are at least k failures by distance d.
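
 A small simulation of that 'pure thought' link (the failure rate of 0.6 per unit distance, the horizon d, and k are all invented):

    ## {distance to k-th failure <= d} and {at least k failures by d}
    ## are the same event, so a gamma tail equals a Poisson tail.
    set.seed(601)
    d <- 10; rate <- 0.6; k <- 4
    gaps <- matrix(rexp(1e5 * 20, rate), ncol = 20)  # inter-failure gaps
    Sk <- rowSums(gaps[, 1:k])                       # distance to k-th failure
    N  <- rowSums(t(apply(gaps, 1, cumsum)) <= d)    # failures by distance d
    c(mean(Sk <= d), mean(N >= k))                   # identical proportions
    pgamma(d, shape = k, rate = rate)                # theoretical value
    ppois(k - 1, rate * d, lower.tail = FALSE)       # the same, via Poisson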
 
 The second purpose is to sneak in the CLT. The more items in the
  sum, the closer the gamma is to the Normal. And, since the individual components
  in the sum don't have anything near a symmetric distribution, it takes a good few items in the sum
  to get the sum to 'forget' what its (many) parents are!
  If the individual components had had a distribution whose mode was
  not at the boundary, the CLT would 'kick in' at a lower n.
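
 To watch this happen, compare gamma sums of exponentials with their Normal approximations (the ns of 2, 10 and 50 are arbitrary):

    ## Sums of n Exp(1) r.v.'s (mode at the boundary) vs. the Normal:
    ## poor at n = 2, respectable by n = 50.
    par(mfrow = c(1, 3))
    for (n in c(2, 10, 50)) {
      curve(dgamma(x, shape = n, rate = 1), from = 0, to = n + 4*sqrt(n),
            main = paste("sum of", n, "exponentials"), ylab = "density")
      curve(dnorm(x, mean = n, sd = sqrt(n)), add = TRUE, lty = 2)
    }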
 
 If you read French, or even if you don't but would like to admire what we think is a
  nice diagram, take a look at the
  teaching article (link) which we prepared for a Math magazine that goes
  to all Québec high school students. Instead of by car on land, it's a 10-year
  journey into space, where critical items fail, and must be replaced, if
  the mission is to succeed.
 
 The diagram (pp 28-29, designed to fit
  across 2 pages if printed) was a recent inspiration, but this story 
  has been used in bios601 from the beginning.
 
 0.9. Selecting mice at random is not that easy!
 
 It (again) emphasizes the CLT. How accurate it is depends on the distribution of the individual
  weights sampled from. Given that these are lab animals, all with the same parents,
  and the same boring food, it probably has a central mode and is
  close to symmetric (it would be less so with free-living humans, but even then
  the n=30 would easily counteract the asymmetry).
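
 A quick look, under an invented (mildly skewed) weight distribution with mean 20 g:

    ## Means of n = 30 weights from a mildly skewed gamma distribution
    ## are already close to Gaussian.
    set.seed(601)
    xbar <- replicate(1e4, mean(rgamma(30, shape = 8, rate = 0.4)))
    hist(xbar, breaks = 50, prob = TRUE, main = "means of 30 mouse weights")
    curve(dnorm(x, 20, sqrt(50/30)), add = TRUE)  # matching mean and SE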
 
 What is the point of this (and the other) exercise(s)?
 
 --------------
 
 Remarks on Notes:
 
 Section 1. Notice the focus on the shape of a very specific type of distribution,
  or random variable,
  namely that based on a sum/mean/combination of
  several (usually n) other (usually i.i.d.) random variables.
  This shape is often very different from that of the 'component' random variables.
  You might say that the 'component' random variables have distributions
  generated (made) by nature, whereas those generated by aggregation of samples
  are 'man'-made, i.e., 'researcher'-made, distributions, with shapes determined
  in (a small, or even negligible) part by the shape of the
  distribution of the individual r.v.'s, but in
  (a large, and dominant) part by the
  number of, and degree of independence between, the
  individual r.v.'s aggregated.
 Notice also, in section 1.3, the different terminology used for the standard
  deviation of an individual component r.v.,
  and the standard
  error (SE) of the aggregate r.v. Also, an SE
  typically involves a 'plug-in' estimate, and always (even if it is not explicit) a
  1/sqrt(n) multiplier of the (square root of the) 'unit' variance. Comfort with, and an
  appreciation of the central role of, SEs is a prerequisite for work in applied statistics.
  When teaching the
  epidemiology students, JH used to say "the SD is for the
  variation of individuals;
  the SE is for the (sampling) variation of a
  statistic".
 Think of a statistic as an observable quantity calculated
  from a sample of n observations. Think of a
  parameter as an unobservable (but estimable) quantity
  relating to a physical population,
  such as the average or median depth of the ocean, or the concentration of radon in
  Canadian homes, or (if the object
  of inference is an individual person) the average amount of that person's
  physical activity, or level of blood pressure, or time spent indoors,
  over a year;
  or an unobservable
  (but estimable) constant of nature, such as the speed of light, or the ratio
  of the volume of a sphere to the cube of its radius.
 
 DIGRESSION on historical (statistical) uses of the mean:
  some material from a new,
  entertaining, and interesting book -- by a master of storytelling.
 
 Section 2.
 
 The notes (and particularly the examples) make
  explicit that
  the number of individual r.v.'s aggregated
  has a central role
  in the shape of
  the resulting sampling distribution. The centrality and importance of the Central
  Limit Theorem (CLT) in applied statistics cannot be over-emphasized.
 
 In theoretical statistics, it is often presented as a mere
  mathematical result, and seldom (as 'Student' said in a different context)
  is one given any sense of when (i.e., at what 'n') the law 'kicks in.' JH has the impression,
  from some PhD theses of statistics students he has examined, that
  the 'holy grail' is simply to establish asymptotic normality,
  with no consideration as to at what 'n' this is an acceptably-accurate
  approximation, or whether some change of scale (e.g., log or logit)
  might not make for a better normal approximation.
  He vividly remembers one PhD student, who, in his effort to
  prove asymptotic normality, used a Lemma that stated that if the log of
  the r.v. goes to Normal, so must the r.v. itself.
  What the student and supervisor hadn't bothered to discover (but what a few plots
  would have readily shown)
  is that in the case in point, the log of the r.v. goes to Normal
  faster than the r.v. itself!
 
 Unfortunately, it takes a bit of experience
  watching the interplay ('battle') between the degree of
  non-normality of the individual (component) r.v.'s and the size of n,
  to develop some intuition and appreciation for the 'n' at which
  the CLT kicks in. The point is that it is context-specific:
  with well-behaved individual (component) r.v.'s, it happens well before
  the n=30 threshold that many courses teach. Indeed, when doing simulations,
  one can subtract 6 from the sum of just 12 U(0,1) r.v.'s
  and use the result as an acceptably accurate
  (i.e.,
  accurate enough for government work)
  N(0,1) r.v. Sometimes
  (as in the case of the 'insurance premiums' example), it takes an
  n in the 100s or 1000s or more.
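
 That classic generator, checked against the N(0,1) it approximates:

    ## Sum of 12 U(0,1) minus 6: mean 0, variance 12 x (1/12) = 1.
    set.seed(601)
    z <- replicate(1e5, sum(runif(12)) - 6)
    c(mean(z), sd(z))                 # close to 0 and 1
    qqnorm(z); qqline(z)              # 'accurate enough for government work'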
 
 Thus, it is worth exploring as many
  of the CLT examples, and the simulations, as you have time for.
  If you are really short of time, and can only look at one, look
  at 'The Central Limit Theorem in Action' in the Graphs Figures Tables
  Computing section of the
  Resources website. JH used this example (but without numbers) when he was
  co-teaching with, and handing over 607 to, Lawrence Joseph, and Lawrence
  (and his very artistic wife) gave it numerical and artistic expression.
  The example also shows that
  the CLT is not limited to sums or means of i.i.d. r.v.'s,
  but applies also to i.NOT-i.d. r.v.'s. The key property is the first
 'i', the INDEPENDENCE,
  since it is the independence that helps in the
  Law of Cancellation
  of Extremes (JH's name for the CLT!).
 
 Section 3.
 
 Nothing very remarkable here (mostly
  manipulating formulae!) except to say that on re-reading these
  Notes, JH would now add a qualifier that the formulae
  work correctly only for independent (or, at a minimum, uncorrelated)
  estimates.
 
 For a good example of 'several estimates of a single parameter',
  think of several independent estimates of the speed of light,
  or, in the Cavendish example, of the density of the earth.
  In this case, the weights should be precision-based.
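
 A minimal sketch of such precision-based weighting (the three estimates and their SEs are invented, merely evocative of speed-of-light measurements):

    ## Inverse-variance ('precision') weighting of independent estimates
    ## of a single parameter.
    est <- c(299850, 299740, 299900)  # hypothetical estimates, km/s
    se  <- c(50, 80, 30)              # their (hypothetical) SEs
    w   <- 1 / se^2                   # precision = 1/variance
    sum(w * est) / sum(w)             # pooled estimate
    sqrt(1 / sum(w))                  # SE of the pooled estimate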
 
 Section 4.
 
 Student's problem was not about the n at which the CLT kicks in
   [he was already assuming the component r.v.'s are N(mu, sigma^2)]
   but about when the sample standard deviation (s) is a good substitute (proxy) for
   the 'true' but unknown 'population' standard deviation (sigma):
    "no one has yet told
   us very clearly where the limit between
    'large' and 'small' samples is to be drawn."
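
 One way to 'see' where that limit might be drawn (the sample sizes are arbitrary):

    ## The price of using s in place of sigma: 97.5th percentiles of the
    ## t distribution vs. the Normal's 1.96, at several sample sizes.
    n <- c(5, 10, 30, 100)
    rbind(n = n, t = round(qt(0.975, df = n - 1), 3))  # vs qnorm(0.975)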
 
 In connection with the 100th anniversary of Student's
    ground-breaking 1908 paper, JH and colleagues went back to
    that paper, and to the way he mathematically derived  
    what is now called the 't' distribution.
    Full details, including why Gosset called himself 'Student',
    and his simulations to check out his shaky algebra,
    can be found at Article/Material in connection with 100th 
    Anniversary of Student's 1908 paper.
 
 The 2 persons in the photo
    with JH at the reception at the Guinness Brewery in Dublin in 2008
    are the grandson and granddaughter of William Gosset ('Student').
    At the unveiling of the plaque, the grandson told us that he
    was pretty sure he alone, of those assembled in 2008, had personally
    met Gosset. He was 6 months old when he was brought in to
    the hospital to see his grandfather, a few months before the grandfather died
    in 1937 in London. (Gosset was English-born and educated, but worked
    for Guinness in the Dublin HQ from 1899 onwards. In 1935,
    he moved to London to take charge of the scientific side of production,
    at a new Guinness brewery at Park Royal in North West London,
    but died just two years later, at the age of 61.)
    Short bio.
 
 Interestingly, that 1908 paper was of limited use, since it dealt only
    with 1-sample problems. It took Fisher's insights in the 1920s
    to generalize it to not just 2-sample problems, but also
    correlation and regression, indeed to any context where one was dealing with
    a ratio of a mean or correlation or slope to its standard error;
    in turn, the SE involved the sqrt of an independent
    plug-in estimate of the unit variance. Fisher called the no.
    of independent contributions to that estimate the "degrees of freedom".
    In this context, JH usually defines the "d.f." as "the
    number of independent estimates of error": think of the
    number of independent residuals (which one "pools" to
    get one overall estimate of sigma-squared) as a case in point. It
    is no different in spirit from pooling the squared within-group
    deviations from their own means
    [they are also residuals, from each fitted (ie group) mean].
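
 In code, for the two-group case (the group sizes and data are invented):

    ## 'Pooling' residuals from each group's own mean: n1 + n2 - 2
    ## independent estimates of error give one estimate of sigma^2.
    set.seed(601)
    y1 <- rnorm(8, 10, 2); y2 <- rnorm(12, 14, 2)
    res <- c(y1 - mean(y1), y2 - mean(y2))   # residuals from fitted means
    sum(res^2) / (length(res) - 2)           # pooled estimate of sigma^2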
 
 In "Another worked Example, with graphic", JH is trying to get
    statisticians and their collaborators to use a better way to display
    paired data: the usual presentations involve separate SE's for the two
    means, as though the one mean was from one sample of n, and the other 
    from and entirely separate (independent) sample of n.
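
 A minimal sketch of such a display (the 'before/after' data are invented; this is one possible rendering, not JH's graphic):

    ## Join each subject's two measurements; base the SE on the
    ## within-pair differences, not on two separate SEs.
    set.seed(601)
    before <- rnorm(10, 100, 15); after <- before + rnorm(10, 5, 4)
    matplot(rbind(1, 2), rbind(before, after), type = "b", pch = 1,
            lty = 1, col = 1, xaxt = "n", xlab = "", ylab = "measurement")
    axis(1, at = 1:2, labels = c("before", "after"))
    d <- after - before
    c(mean(d), sd(d) / sqrt(length(d)))      # mean difference and its SE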
 
 Section 4.3/4.4
 
 In other years, we left sample size and precision issues
   until later in the course, where we planned to deal with
   them 'en masse.' But many years we never had the time
   at the end of the course. So this year, following on from
   the calculations you did
   with the step-counter data,
   Q 0.11 will ask you to visit this section, and Figure 4 in particular.