BIOS601 AGENDA: Monday September 26, 2016
[updated Sept 15, 2016]
  
  -  Discussion of issues in JH's Notes and assignment on models for a (sample) proportion
 
 Answers to be handed in for: 
  Exercises 0.1, 0.2, 0.5, 0.8, 0.10, 0.11, 0.16
 
 Remarks on Notes:
 
 These notes are based on those JH developed for the course Principles of Inferential Statistics in Medicine,
  which he taught to incoming students in the epidemiology graduate program from 1980 to 1993. As such,
  they emphasize the 'end-product', rather than how the product was arrived at.
  In bios601, we will emphasize both. [One of the illustrations, via real data, was the frequency of
  duplicate birthdays in the various-sized classes
  he taught -- he started collecting data in 1981, when the class size was already 27!]
 
 Section 1. (Binomial) Model for (Sampling) Variability of a Proportion/Count in a Sample.
 
 Think of a binomial random variable as the sum of n i.i.d.
  Bernoulli random variables with a (common) probability pi of being 'positive'.
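
 A minimal sketch in R of this 'sum of Bernoullis' view (the n and pi below are arbitrary illustrative values):

   set.seed(1)
   n  <- 30
   pi <- 0.6
   bernoullis <- rbinom(n, size = 1, prob = pi)  # n individual 0/1 ('negative'/'positive') outcomes
   sum(bernoullis)                               # one realization of a Binomial(n, pi) count
   rbinom(1, size = n, prob = pi)                # a count generated directly from the Binomial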
 
 And in applied work, don't use the terms 'success' and
  'failure' that mathematical statisticians do; instead, be practical and speak/write
  generically of 'positive' versus 'negative', or better still, if the context is
  appropriate, of the presence/absence of a 'characteristic/trait/state of interest'.
 
 Take note of the various notations.
 
 You will be surprised how often the binomial arises in
  contexts other than the traditional ones. A common example is when comparing two
  rates, where, by conditioning on the sum of the two numerators,
  we arrive at a binomial.
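
 A minimal sketch of that conditioning argument, with made-up rates and person-time values
 (none of these numbers come from the notes): two independent Poisson numerators,
 conditioned on their sum, behave like a binomial.

   set.seed(2)
   lambda1 <- 4;  T1 <- 10          # rate and person-time in group 1
   lambda2 <- 2;  T2 <- 15          # rate and person-time in group 2
   y1 <- rpois(200000, lambda1 * T1)
   y2 <- rpois(200000, lambda2 * T2)
   m <- 60                          # condition on this value of the sum y1 + y2
   p <- (lambda1 * T1) / (lambda1 * T1 + lambda2 * T2)  # the conditional binomial probability
   mean(y1[y1 + y2 == m])           # empirical mean of y1 given the sum ...
   m * p                            # ... close to the mean of a Binomial(m, p)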
 
 The 'requirements' for a binomial are not as simple as they might appear,
  as we will see in some of the assignments. In particular, be clear about
  the difference between lack of independence (but with a common pi)
  and lack of a common pi (but with independence), i.e., between
  'identically but not independently distributed' and 'independently but not identically distributed'!
 
 JH finds the binomial tree (and generalizations of it) very helpful.
  
  If he had time to redo it in R (he made it with some very old software),
  JH would use an example other than the overused pi = 0.5 -- and the frequency of a certain
  outcome with thumbtacks, rather than coins, as the parameter of the inference.
 
 In going through the examples in 1.1, once you are sure that n is
  fixed ahead of time,
  go through the
  'Independent or not? Identical or not?' checklist.
 
 1.2 (Calculation) is not the big issue it was when JH began teaching.
 
 Section 2 (Inference). We will leave this until we have a
  common way of approaching point and interval estimation.
 
 You have already
  dealt with the 'large n' situation in your survey sample of ocean versus land locations.
 
 The
  "Exact, first-principles, Confidence Interval" construction shows that it is not that
  easy to treat frequentist confidence intervals separately from p-values:
  one needs to be comfortable with p-values BEFORE being introduced to CIs,
  even though the modern tendency is to downplay p-values and to
  play up CIs (interval estimates).
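
 As a sketch of that construction (with a made-up count x and sample size n), each exact limit
 is the value of pi whose one-sided binomial tail probability -- a p-value -- equals alpha/2;
 the result agrees with R's Clopper-Pearson interval:

   x <- 7; n <- 20; alpha <- 0.05
   lower <- uniroot(function(p) pbinom(x - 1, n, p, lower.tail = FALSE) - alpha/2,
                    c(1e-6, 1 - 1e-6))$root   # pi for which Prob(X >= x | pi) = alpha/2
   upper <- uniroot(function(p) pbinom(x, n, p) - alpha/2,
                    c(1e-6, 1 - 1e-6))$root   # pi for which Prob(X <= x | pi) = alpha/2
   c(lower, upper)
   binom.test(x, n)$conf.int                  # the 'exact' (Clopper-Pearson) interval, for comparison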
 
 The diagram in Figure 2 is worth studying. In addition to
  'after the fact' calculations of precision and margins of error,
  JH uses it regularly
  in consultations when the question of sample size for a survey
  arises. We could use it to decide what precision we would get, with
  various sample sizes, for estimates of the percentage of the world
  that is under water. Of course, ahead of time,
  just as you had to do with the negative binomial calculations,
  you would have to make a guess as to what the 'true' percentage is.
 
  But you can see the +/- 3 percentage points margin of error that the
  survey agencies use for all their surveys of size 1000 or so. They are a bit
  lazy: they calculate the worst-case margin of error (i.e., the one at 50%)
  and use it for all situations, even when the percentage is
  closer to 0% (e.g., the % of Canadians who have a PhD, or the
  % of physicians over 60 who are female)
  or to 100% (e.g., the % of Canadians who have a cell phone/computer, or the
  % of time they spend indoors), where assuming 50% leads to an overestimate
  of the unit variance [ pi(1-pi) ].
  Between 30% and 70%, the unit variance
  does not change a lot: at 0.7 or 0.3 it is 0.7 x 0.3 = 0.21,
  versus 0.5 x 0.5 = 0.25 at pi = 0.5. It is only when pi gets below,
  say, 0.2 or above 0.8 that the unit variance falls quickly. So using
  the worst-case scenario is a safe, 'conservative' approach,
  in that the actual margin of error, calculated after the data are in,
  has to be smaller than the one based on 0.5 x 0.5.
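
 The margin-of-error arithmetic behind this paragraph, as a minimal sketch
 (the pi values and sample sizes are illustrative):

   moe <- function(pi, n) 1.96 * sqrt(pi * (1 - pi) / n)   # half-width of a 95% CI
   round(100 * outer(c(0.1, 0.3, 0.5, 0.7, 0.9), c(250, 1000, 4000), moe), 1)
   # rows: pi = 0.1, 0.3, 0.5, 0.7, 0.9; columns: n = 250, 1000, 4000.
   # at n = 1000, pi = 0.5 gives the familiar '+/- 3 percentage points';
   # 0.3 and 0.7 give only slightly less, but the margin shrinks quickly near 0 or 1.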
 
 Method C in section 2 is the most logical and sensible one, and it 
  ties in with the logit being the natural ('canonical') link when 
  dealing with binomial regression via generalized linear models.
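
 To see the GLM tie-in, here is a minimal sketch (with a made-up count and sample size) of
 fitting a single proportion as an intercept-only binomial regression; with the canonical
 logit link, the intercept is simply the sample logit, and a Wald interval on that scale
 can be back-transformed to (0,1):

   x <- 7; n <- 20
   fit <- glm(cbind(x, n - x) ~ 1, family = binomial)  # logit is the default (canonical) link
   coef(fit)                       # equals log(x / (n - x)), the sample logit
   log(x / (n - x))                # check
   plogis(confint.default(fit))    # Wald CI on the logit scale, back-transformed to the (0,1) scale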
 
 Section 3.
 
 It took JH many years to find satisfactory
  answers and examples, for those who are baffled by these concepts, as to why
  the proportion of a population that is sampled is (usually) not as important as
  the number who are sampled.
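
 One way to see this is through the finite-population correction; a minimal sketch with
 illustrative numbers:

   p <- 0.3; n <- 1000                      # sample size held fixed
   N <- c(1e4, 1e5, 1e6, 1e7)               # population sizes: n is 10%, 1%, 0.1%, 0.01% of N
   SE <- sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))   # SE with the finite-population correction
   round(SE, 4)                             # barely changes once the sampling fraction is small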
 
 Section 3.3 shows how far the media have come in 30 years!
 
 BE VERY CAREFUL with WORDS
 
 It is meaningless and incorrect to say "this sample is right 95% of the time." 
  Instead, saying
  that "95% of samples this size are right" comes closer to what is 
  claimed when we give a 95% assurance.
 
 Section 4.
 
 At this stage, the key item is 4.2, the Normal approximation to the Binomial.
   A useful way to think of the rules of thumb [ n.pi > 5 and n.(1-pi) > 5, or
   n.pi > 10 and n.(1-pi) > 10 ] is to imagine overlaying
   the Normal curve on the (discrete) binomial distribution, so that
   not too much of one or the other of the two tails flows 'out of bounds', i.e.,
   gives a count or proportion < 0, or a count > n or a
   proportion > 1. The Normal approximation works well
   if the Normal curve's approximation to the 'short tail' of the binomial
   can be overlaid comfortably on that binomial tail.
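
 A minimal sketch of this 'overlay' check, with an n and pi deliberately chosen to violate
 the rule of thumb:

   n <- 20; pi <- 0.1                       # n * pi = 2, well below 5
   x <- 0:n
   plot(x, dbinom(x, n, pi), type = "h", xlab = "count", ylab = "probability")
   curve(dnorm(x, mean = n * pi, sd = sqrt(n * pi * (1 - pi))), add = TRUE)
   # a noticeable part of the Normal curve flows 'out of bounds', below a count of 0:
   pnorm(0, mean = n * pi, sd = sqrt(n * pi * (1 - pi)))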
 
 Section 5 (Sample size, precision/power). We will leave this until we cover a
  common (generic) way of approaching these issues.
 
 Remarks on assigned exercises.
 
 0.1 (m-s) Working with logits and logs of proportions
 The logit is absolutely central to epidemiologic data analysis,
 and so you need to be quite comfortable working in this scale,
 and then going back to the related scales.
 
 You also need to become very comfortable with the 
 'Delta method' for calculating the (approx.) variance
 of a transformed random variable. 
 The topic is usually taught, often with not much motivation
 or intuition,
 under the heading 'Change of variable'.
 JH thinks a better name would be 
 'Change of SCALE', since the
 entity under study remains the same. He gives the example
of variability, over some time period, in Montreal temperatures.
There is really just one r.v., namely temperature. What scale
it is measured on is arbitrary and almost secondary. 
 See more on this topic in the
 teaching article by Hanley and 
 Teltsch. And, when you come to 'Jacobians' in your other
 classes, this article will make them more real and intuitive.
 
 The reason JH wrote this piece is that he was tired of trying to remember
 whether the Jacobian in the new density involved 'dy/dx' or 'dx/dy'.
 Now, he has no trouble remembering. He knows that
 the SD (variance) of temperatures
 on the Fahrenheit scale must be 1.8 (1.8^2) times the SD (variance)
 of these same temperatures on the Celsius scale.
 Conversely, the C scale is only 5/9ths as wide as the F scale.
 The C(old) -> F(new) transformation is a linear 1.8-fold magnification of the C scale.
 So if the temperatures are more spread out on the new scale,
 then, in order to conserve the full probability mass (i.e., pdf integral = 1),
 the y-axis for the pdf on this new scale only goes up
 to 5/9ths of what it did on the C scale. So,
 
 new.pdf(F) = old.pdf(C equivalent of F) x (5/9)
  = old.pdf(C equivalent of F) x dC/dF
 
 or, generically,
 
 new.pdf(*) = old.pdf(equivalent of *) x ( d.Old/d.New )
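
 A quick numerical check of the generic formula in R, using an arbitrary Normal distribution
 of Celsius temperatures (the mean and SD are made up):

   mu.C <- 10; sd.C <- 5
   f <- 50                                  # a Fahrenheit value at which to evaluate the new pdf
   c.equiv <- (f - 32) / 1.8                # the Celsius equivalent of f
   dnorm(c.equiv, mu.C, sd.C) * (1 / 1.8)   # new.pdf(F) = old.pdf(C equivalent of F) x dC/dF
   dnorm(f, 32 + 1.8 * mu.C, 1.8 * sd.C)    # the pdf written directly on the F scale: same value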
 
 In many applications, such as with the logit (and in just about all
 of the examples in textbooks), the magnification is not the same
 at different places on the scale. Moving from pi = 0.5 to pi = 0.6
 moves the odds from 1:1 to 6:4 or 1.5:1, and thus the log-odds from 0 to 0.405;
 moving from pi = 0.8 to pi = 0.9
 moves the odds from 8:2 or 4:1 to 9:1, and thus the log-odds from
 1.386 to 2.197, a difference of 0.811, just about double the
 0.405 closer to the centre.
 [JH doesn't understand why textbooks don't begin with
 linear transformations.]
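
 For exercise 0.1, a minimal sketch (arbitrary n and pi) of what the Delta method gives for
 the logit: since d logit(p)/dp = 1/[p(1-p)], the approximate variance of the sample logit
 is 1/[n pi (1-pi)], which a small simulation confirms:

   set.seed(3)
   n <- 100; pi <- 0.3
   p.hat <- rbinom(50000, n, pi) / n
   var(qlogis(p.hat))             # simulated variance of the sample logit, log(p.hat/(1-p.hat))
   1 / (n * pi * (1 - pi))        # the Delta-method approximation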
 
 0.2 (m-s) Greenwood's formula for the SE of an estimated 
Survival Probability
 
 When JH took mathematical statistics, one of the applications of the Delta
 method was to work out the (approx.) variance for the surface area of a table
 that was nominally W units wide and L units long, with
 a 'manufacturing' error, or variation, of epsilon on W,
 and a similar (independent) one on L. He was told that sigma(epsilon)
 was small relative to W and L (i.e., the coefficients of variation,
 CV_L = 100 x sigma(epsilon_L) / L and
 CV_W = 100 x sigma(epsilon_W) / W,
 were low), so that even if we assumed Normal distributions for the two
 epsilons, the probability of a table with a negative dimension was
 negligible. One way to arrive at an approximate variance
 was to expand the product (W + e_W)(L + e_L)
 and then ignore the small e_W x e_L component.
 Another was to use the Delta method for the log transformation to
 derive the variance of the log of the product, and then
 transform back (again using the Delta method, this time for the antilog transformation).
 
 You can try that same exercise if you want, but I don't expect you
  have that much free time right now!
 (Recently, Amy Liu and I had to
 deal with a variant of this problem, involving correlated errors,
 when dealing not with a product
 but with a quotient of two r.v.'s, in an analysis of the errors
 in quotients caused by using input values extracted from digitized
 images.)
 
 Exercise 0.2 involves  a classic formula that biostatisticians refer to as
 
  "Greenwood's formula"
 and that is central to epidemiologic and biostatistical data analysis.
 I suppose you could represent each component in the product as
 the 'true' amount plus an epsilon, expand the product, and ignore
 lower-order terms, but it is probably easier to work with the
 variance of the log of the product, and then go back to the original scale.
 By the way, there seems to be a fixation on having the variance
 on the original (0,1) scale, even though Gaussian-based confidence intervals
 calculated on this (0,1) scale run the risk of going out of bounds. 
 Maybe we should  obtain the variance in the (0,1) scale and then
 move to the (unbounded) logit scale and calculate the CI there, THEN
 take the anti-logit to return to the (0,1) scale.
 There was a question on this in the 2012 Part A exam for PhD students
 in biostatistics.
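
 A minimal sketch of the formula itself, with made-up risk sets and numbers of deaths (not
 data from the exercise): summing the Delta-method variances of the logs of the conditional
 survival proportions, then transforming back, gives the usual Greenwood form.

   n.risk <- c(20, 17, 15, 12)         # numbers at risk at each distinct event time
   d      <- c( 2,  1,  2,  1)         # deaths at those event times
   p.cond <- (n.risk - d) / n.risk     # conditional survival proportions
   S.hat  <- cumprod(p.cond)           # Kaplan-Meier estimate, a product of proportions
   SE.S   <- S.hat * sqrt(cumsum(d / (n.risk * (n.risk - d))))  # Greenwood SE
   cbind(S.hat, SE.S)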
 
 0.3 (m-s) Link between exact tail areas of Binomial and
F distributions
 
 Since this problem is a first cousin of exercise 0.4, JH has
instead assigned the cleaner 0.4 one. Moreover, as we saw in
the assignment on ruptures over a trip of 7500 km, it is easier
to 'see' the link between the Poisson and the Chi-Square tails
than  between the tails of the Binomial and the F.
 
 0.4 (m-s) Link between exact 
tail areas of Poisson and Chi-Square distribution
 
 Déjà vu -- we saw this last week --
 so you can use an 'intuitive' proof if you like.
 Or, if you wish, how about a
 'proof by induction'?
 [JH doesn't see many proofs done this way any more.]
 Or any
 other method you can find -- if it is from the internet, please
 credit your source!
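
 Whatever proof you choose, a quick numerical check of the link (with arbitrary k and lambda)
 may be reassuring -- it is not, of course, a proof:

   k <- 3; lambda <- 7.5
   ppois(k, lambda)                                          # Prob( Poisson(lambda) <= k )
   pchisq(2 * lambda, df = 2 * (k + 1), lower.tail = FALSE)  # Prob( ChiSq on 2(k+1) df > 2*lambda )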
 
 0.5 Clusters of Miscarriages
 
 Assume that -- even though, within a company,
the risk of miscarriage varies from woman to woman --
the pregnancy outcomes for different women
in the same company are independent. The main point of this
(real) story is what some would call the
law of large numbers -- if there are enough companies,
it will happen in 1 (or more) of them. And of course,
there is also the fact that we tend to notice extremes.
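
 The arithmetic of that 'enough companies' point, as a minimal sketch with hypothetical
 numbers (the per-company probability and the number of companies are made up):

   p.cluster   <- 0.002     # hypothetical probability of a 'striking' cluster in any one company
   n.companies <- 1000      # hypothetical number of companies being (implicitly) watched
   1 - (1 - p.cluster)^n.companies   # Prob(at least one company shows such a cluster), about 0.86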
 
 For more on this issue of co-incidences, and if you want
a break from the 'harder' stuff, you can look at
an article where JH has collected 
several stories
involving the same law of large numbers, and the same
fascination with (benign) co-incidences. Of course, in 
more serious situations, such as clusters of leukemias
and miscarriages and the like, it is not so easy
to convince people that it's all 'just' co-incidence.
And indeed, in any one instance, it is not easy
to distinguish a cluster that was caused by some
noxious agent from one that is merely a 'random' one.
 
 The "Births Case 3" in that collection --
about numbers of twins in a school -- is probably
the closest in structure to the one on miscarriages. 
JH was also struck by the role of
'filtering' that goes on in human-interest stories,
and the tendency of journalists to stretch the details even more
to make the odds even longer and the story all the
more remarkable! And the
fact that the same number in both states
(Lottery case 1) was more easily noticed because
the two states were beside each other in the alphabetical
ordering means that there might well have been other
days when two states that were not near each other in
the list had the same number -- but were not noticed.
 
 0.6 "Prone-ness" to Miscarriages ?
 
 Here we see the (absence of?) one of the
other requirements for a binomial. part 4 of the question 
deals with this.
 
 One could carry out a formal chi-squared test of the goodness
of fit of the expected numbers under a (common) Binomial.
You are not asked to go that far, just to do a visual test.
But think about how many degrees of freedom the statistic
should have: it's not 5 - 1 = 4, because, in addition to the constraint that
the 5 frequencies add to 70, there is also a further
constraint imposed by the fact that the expected (i.e., fitted)
frequencies must give an overall 30%
miscarriage rate.
 If you were going to carry out a formal test,
one other issue would be how accurate the Chi-Sq distribution
is when some of the expected numbers are low, i.e., < 5.
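
 If you did want to go that far, here is a minimal sketch of such a test, using SIMULATED
 data rather than the exercise's actual frequencies, and assuming (for illustration only)
 4 pregnancies per woman:

   set.seed(6)
   y <- rbinom(70, 4, 0.3)                      # simulated numbers of miscarriages (0 to 4) for 70 women
   obs    <- table(factor(y, levels = 0:4))     # observed frequencies in the 5 classes
   pi.hat <- sum(y) / (70 * 4)                  # fitted common miscarriage probability
   exp.f  <- 70 * dbinom(0:4, 4, pi.hat)        # expected (fitted) frequencies
   X2 <- sum((obs - exp.f)^2 / exp.f)
   pchisq(X2, df = 5 - 2, lower.tail = FALSE)   # 5 classes minus 2 constraints (total, fitted rate)
   exp.f                                        # note the small expected count in the '4' class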
 
 To be thinking about, particularly in
light of Dr Moodie's presentation on random-effects models:
imagine (simplistically) every woman
as having been born with a different probability of having a miscarriage,
and that, given that probability, the outcomes
of successive pregnancies in that woman were all Bernoulli
with that same probability. The different probabilities
are called 'random effects'. Question: what would the shape of the distribution be like?
 
 If this example is over-simplistic, think of
how much of the year each person spends indoors,
and what responses you would get if
you selected all (or a sample) of them,
and called each of them at 4 randomly selected times
(from all of the 60 mins x 24 hrs x 7 days x 52 weeks over the year). Some people
would have lower, and some would have higher,
probabilities of being indoors.
What do you think might be the shape of the distribution
of person-specific probabilities?
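
 A minimal sketch of the random-effects idea, using a Beta distribution for the woman-specific
 probabilities (the Beta choice and its parameters are assumptions of mine, not the notes'):
 the resulting counts are more spread out than a common-pi Binomial would allow.

   set.seed(4)
   n.women <- 10000; n.preg <- 4
   pi.i <- rbeta(n.women, 2, 5)              # woman-specific probabilities, mean about 0.29
   y    <- rbinom(n.women, n.preg, pi.i)     # numbers of miscarriages in n.preg pregnancies
   table(y) / n.women                        # more 0s and 4s ('over-dispersion') than ...
   round(dbinom(0:n.preg, n.preg, mean(pi.i)), 3)   # ... a common-pi Binomial would give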
 
 0.7 Automated Chemistries
 
 Here, can you see the absence of one of the
 requirements for a binomial?
 
 In part 3, an informal 'eye fit' is sufficient.
 
 BTW: by 'normal'
Ingelfinger means 'apparently healthy'.
 BTW2: How do you think
hospitals, and companies who sell them equipment for
testing, establish their 'limits of normal' ?
 
 0.8 Binomial or Opportunistic? 
Capitalization on chance... multiple looks at data
 
 This is very much in the same spirit
as the 'law of large numbers'
mentioned above.
 
 JH recently came across
an amusing example involving astronomers (all of them mathematicians) and the pope.
[from Wikipedia:] Pope Clement VI reigned during the period of the Black Death. 
This pandemic swept through Europe (as well as Asia and 
the Middle East) between 1347 and 1350 and is believed 
to have killed between a third and two-thirds of Europe's
population. During the plague, Clement sought the insight
of astronomers for explanation. Johannes de Muris was 
among the team "of three who drew up a treatise explaining 
the plague of 1348 by the conjunction of Saturn, Jupiter, 
and Mars in 1341"
 Clement VI's physicians advised him 
that surrounding himself with torches would block the plague. 
However, he soon became skeptical of this recommendation 
and stayed in Avignon supervising sick care, burials, 
and the pastoral care of the dying. He never contracted 
the disease.
 
 How many candidate years, and how many candidate planets
(and how many other causes?) did  these mathematicians
search before finding the 'co-incidence'?
 
 This is a bit like not knowing about all of the discarded 
(unpublished) instances of P > 0.05,
and only seeing  the 1 (published!)
instance of P < 0.05! The p-value loses its interpretation
if there is selective reporting.
 
 0.9 Can one influence the sex of a baby?
 
 You are not asked to hand in answers to this question, but
(besides its use of the Normal approximation to the binomial --
or you could use the exact binomial tail, using software such as Excel or R)
it is a good example of the need to beware of
selective reporting. 
Imagine that each of 100 researchers tried
a different way to toss 145 coins, (or 
each of 100 biostatistics students used a different random seed to
generate  145 Bernoulli(0.5) realizations)
and only the ones who get
'statistically significant' deviations from 0.5 report their
findings. Many people are worried that that type of selective reporting
is going on in science, aided by the tendency of journals to only
publish 'statistically significant' deviations.
If you have time, Google the name 
John Ioannidis, who has been leading
a campaign for honest reporting, having found that many
so-called findings are not reproducible.
In meta-analysis, 
this phenomenon has been called the 
file-drawer problem.
The 
funnel-plot is a useful way to see if the
p-values that do get reported are
representative.
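
 A minimal sketch of the '100 researchers' (or '100 biostatistics students') scenario
 described above:

   set.seed(5)
   n.researchers <- 100; n.tosses <- 145
   heads  <- rbinom(n.researchers, n.tosses, 0.5)   # each researcher's number of 'heads'
   p.vals <- sapply(heads, function(h) binom.test(h, n.tosses, 0.5)$p.value)
   sum(p.vals < 0.05)     # roughly 5 of the 100 truly-null experiments will look 'significant'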
 
 0.10 It's the 2nd week of the course: it must be Binomial!
 
 Fixed n? And i.i.d.? If you like, suggest some examples of your own!
 
 0.11 Tests of intuition
 
 You can see why we can get more 'extreme' results
in small samples.
 
 0.12 Test of a proposed mosquito repellent
 0.13 Triangle Taste test
 0.14 Variability of, and trends in, proportions
 0.15 A Close Look at Therapeutic Touch
 
 are 4 real applications of the binomial.
 
 0.13 is a particularly good one to teach/learn about sample size and power,
and we will come back to it later.
 
 0.16 We shouldn't entrust statistical calculations to those
who can run a statistical or mathematical package
but do not have training in mathematical statistics and statistical inference! (This is a real case,
involving doping in sport.)