BIOS601 AGENDA: (part of Tuesday Lab) and class Wednesday September 20, 2023
[updated Sept 14, 2023]
- Discussion of issues in JH's
Notes and assignment on models for a (sample) proportion,
with emphasis on when the binomial (and the CLT approximation, if appropriate) does/does not apply, and what to do if it doesn't
Exercises:
0.01
0.03 [PhD]
0.06
[ 0.07, 0.08, 0.09, 0.11, 0.19 in-lab/class-only ]
0.10
0.17 [PhD]
Remarks on Notes:
These notes are based on those JH developed for the course Epid607, Principles of Inferential Statistics in Medicine,
which he taught to incoming students in the epidemiology graduate program from 1980 to 1993. As such,
they emphasize the 'end-product', rather than how the product was arrived at.
In bios601, we will emphasize both. [One of the illustrations, via real data, was the frequency of
duplicate birthdays in the various-sized classes
he taught -- he started collecting data in 1981, when the class size was already 27!]
The notes on proportions from Epid607 have since been updated and can be found in Chapter 13 of this (under construction)
online book:
Section 1.1 (Binomial) Model for (Sampling) Variability of a
Proportion/Count in a Sample.
Think of a binomial as the sum of n i.i.d.
Bernoulli random variables with a common probability pi of being 'positive'.
And in applied work, don't use the terms 'success' and
'failure' that mathematical statisticians do; instead, be practical and speak/write
generically of 'positive' versus 'negative', or better still, if the context is
appropriate, of the presence/absence of a 'characteristic/trait/state of interest'.
Take note of the various notations.
You will be surprised how often the binomial arises in
contexts other than the traditional one. A common example is when comparing two
rates, where, by conditioning on the sum of the two numerators,
we arrive at a binomial.
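As a quick illustration of both points, here is a small R sketch (the probabilities, rates, and person-time totals are made up for illustration):

  # A Binomial as a sum of n i.i.d. Bernoullis
  set.seed(601)
  n <- 30; p <- 0.25
  y1 <- rbinom(10000, size = n, prob = p)            # direct Binomial draws
  y2 <- replicate(10000, sum(rbinom(n, 1, p)))       # sums of n Bernoulli(p) draws
  c(mean(y1), mean(y2), n * p)                       # all close to n*p

  # Comparing two rates: numerators c1 ~ Poisson(lambda1*PT1), c2 ~ Poisson(lambda2*PT2).
  # Conditional on the sum c1 + c2 = s, c1 is Binomial(s, pi) with
  # pi = lambda1*PT1 / (lambda1*PT1 + lambda2*PT2).
  lambda1 <- 2; PT1 <- 50; lambda2 <- 1; PT2 <- 100
  c1 <- rpois(1e5, lambda1 * PT1); s <- c1 + rpois(1e5, lambda2 * PT2)
  mean(c1[s == 200]) / 200          # close to 100/(100+100) = 0.5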
The 'requirements' for a binomial are not as simple as they might appear,
as we will see in some of the assignments. In particular, be clear about
the difference between a lack of independence (but with a common pi)
and a lack of a common pi (but with independence), i.e., between
'not independent but identically distributed' and 'independent but not identically distributed'!
JH finds the binomial tree (and generalizations of it) very helpful.
If he had time to redo it in R (he made it with some very old software),
JH would use an example other than the overused pi = 0.5 -- and the frequency of a certain
outcome in thumbtacks, rather than coins, as the parameter of inference.
In going through the examples in 1.1, once you are sure that n is
fixed ahead of time,
go through the
'Independent or not? Identical or not?' checklist.
1.2 (Calculation) is not the big issue it was when JH began teaching.
Section 2 (Inference) We will leave this until we have a
common way of approaching point and interval estimation.
You have already
dealt with the 'large n' situation in your survey sample of ocean versus land locations.
The
"Exact, first-principles, Confidence Interval" construction shows that it is not that
easy to treat frequentist confidence intervals separately from p-values:
that one needs to be comfortable with p-values BEFORE being introduced to CI's,
even though the modern tendency is to downplay p-values and to
play up CI's (interval estimates).
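For concreteness, here is a small R sketch (made-up data: 8 positives out of 25) of the first-principles construction: each 'exact' limit is the value of pi whose one-sided binomial tail area (p-value) is 0.025, and the limits match what binom.test() reports.

  y <- 8; n <- 25
  lower <- uniroot(function(p) 1 - pbinom(y - 1, n, p) - 0.025, c(1e-6, 1 - 1e-6))$root
  upper <- uniroot(function(p) pbinom(y, n, p) - 0.025,         c(1e-6, 1 - 1e-6))$root
  c(lower, upper)                  # exact 95% limits, from the two tail areas
  binom.test(y, n)$conf.int        # same (Clopper-Pearson) limits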
The diagram in Figure 2 is worth studying. In addition to
'after the fact' calculations of precision and margins of error,
JH uses it regularly
in consultations when the question of sample size for a survey
arises. We could use it to decide what precision we would get with
estimates of the percentage of the world that is under water
obtained with various sample sizes. Of course, ahead of time,
just as you had to do with the negative binomial calculations,
you would have to make a guess as to what the 'true' percentage is.
But you can see the +/- 3 percentage points margin of error that the
survey agencies use for all their surveys of size 1000 or so. They are a bit
lazy and calculate the worst margin of error (i.e., the one at 50%)
and use that for all situations, even though, when the percentage is
closer to 0% (e.g., the % of Canadians who have a PhD, or
the % of physicians over 60 who are female)
or to 100% (e.g., the % of Canadians who have a cell phone/computer, or
the % of time they spend indoors), assuming 50% overestimates
the unit variance [ pi(1-pi) ].
Between 30% and 70%, the unit variance
does not change a lot: at 0.7 or 0.3 it is 0.7 x 0.3 = 0.21,
versus 0.5 x 0.5 = 0.25 at pi = 0.5. It's only when pi gets below,
say, 0.2 or above 0.8 that the unit variance falls quickly. So, using
the worst-case scenario is a safe, 'conservative' approach,
in that the actual margin of error, calculated after the data are in,
has to be smaller than the one based on 0.5 x 0.5.
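Here is a small R sketch of that arithmetic (the values of pi and n are just for illustration):

  moe <- function(pi, n) 1.96 * sqrt(pi * (1 - pi) / n)   # margin of error, proportion scale
  round(100 * outer(c(0.50, 0.30, 0.10, 0.02), c(500, 1000, 1500), moe), 1)
  # with n = 1000: about +/- 3.1 points at pi = 0.5, +/- 1.9 at pi = 0.1, +/- 0.9 at pi = 0.02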
Method C in section 2 is the most logical and sensible one, and it
ties in with the logit being the natural ('canonical') link when
dealing with binomial regression via generalized linear models.
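A small R sketch (simulated data, made-up coefficients) of how the logit appears as the default ('canonical') link when a binomial outcome is modelled with glm():

  set.seed(601)
  x <- rep(0:1, each = 200)                      # e.g., unexposed / exposed
  y <- rbinom(400, 1, plogis(-1 + 0.7 * x))      # true log-odds: -1 + 0.7*x
  fit <- glm(y ~ x, family = binomial)           # binomial family => logit link by default
  coef(fit)                                      # estimates of -1 and 0.7 (a log odds ratio)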
Section 3.
It took JH many years to find satisfactory
answers and examples, for those who are baffled by these concepts, as to why
the proportion of a population that is sampled is (usually) not as important as
the number who are sampled.
Section 3.3 shows how far the media have come in 30 years!
BE VERY CAREFUL with WORDS
It is meaningless and incorrect to say "this sample is right 95% of the time."
Instead, saying
that "95% of samples this size are right" comes closer to what is
claimed when we give a 95% assurance.
Section 4.
At this stage, the key item is 4.2, the Normal approximation to the Binomial.
A useful way to think of the rules of thumb [ n.pi > 5 and n(1-pi)>5, or
n.pi > 10 and n(1-pi)>10] is to think of trying to overlay
the normal curve on the (discrete) binomial distribution, so that
not too much of one or the other of the two tails flows 'out of bounds', i.e.,
gives a count or proportion < 0, or a count > n or
a proportion > 1. The normal approximation works well
if the approximation to the 'short tail' of the binomial
can be overlaid comfortably over the binomial tail.
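A small R sketch of that 'out of bounds' idea (the values of n and pi are arbitrary): it computes how much of the approximating Normal curve sits below 0.

  below0 <- function(n, pi) pnorm(0, mean = n * pi, sd = sqrt(n * pi * (1 - pi)))
  round(c(below0(20, 0.05),     # n*pi = 1 : roughly 15% of the Normal curve is below 0
          below0(50, 0.10),     # n*pi = 5 : about 1%
          below0(200, 0.10)),   # n*pi = 20: essentially none
        4)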
Section 5 (Sample size, precision/power) We will leave this until we
cover a
common (generic) way of approaching these issues.
Remarks on assigned exercises.
0.1 (m-s) Working with logits and logs of proportions
The logit is absolutely central to epidemiologic data analysis,
and so you need to be quite comfortable working in this scale,
and then going back to the related scales.
You also need to become very comfortable with the
'Delta method' for calculating the (approx.) variance
of a transformed random variable.
The topic is usually taught, often without much motivation
or intuition,
under the heading 'Change of variable'.
JH thinks a better name would be
'Change of SCALE', since the
entity under study remains the same. He gives the example
of variability, over some time period, in Montreal temperatures.
There is really just one r.v., namely temperature. What scale
it is measured on is arbitrary and almost secondary.
See more on this topic in the
teaching article by Hanley and
Teltsch. And, when you come to 'Jacobians' in your other
classes, this article will make them more real and intuitive.
The reason JH wrote this piece is that he was tired of trying to remember
whether the Jacobian in the new density involved 'dy/dx' or 'dx/dy'.
Now he has no trouble doing so. He knows that
the SD (variance) of temperatures
on the Fahrenheit scale must be 1.8 (respectively 1.8^2) times the SD (variance)
of these same temperatures on the Celsius scale.
Conversely, the C scale is only 5/9ths as wide as the F scale.
The C(old) -> F(new) transformation is a linear 1.8 magnification of the C scale.
So if the temperatures are more spread out on the new scale,
then, in order to conserve the full probability mass (i.e., pdf integral = 1),
the y-axis for the pdf on this new scale only goes up
to 5/9ths of what it did on the C scale. So,
new.pdf(F) = old.pdf(C equivalent of F) x (5/9)
= old.pdf(C equivalent of F) x dC/dF
or, generically,
new.pdf(*) = old.pdf(equivalent of *) x ( d.Old/d.New )
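A small R check of that generic rule, using the C -> F example and taking the Celsius temperatures to be Normal with mean 10 and SD 5 (made-up values):

  muC <- 10; sdC <- 5
  f <- 50                                  # a temperature on the F scale
  c.equiv <- (f - 32) / 1.8                # its Celsius equivalent
  dnorm(c.equiv, muC, sdC) * (1 / 1.8)     # old.pdf(C equivalent of F) x dC/dF
  dnorm(f, 32 + 1.8 * muC, 1.8 * sdC)      # pdf of F = 32 + 1.8*C, obtained directly
  # the two agree: the F-scale density is only 1/1.8 (= 5/9) as tall as the C-scale one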
In many applications, such as with the logit, (and in just about all
of the examples in textbooks) the magnification is not the same
at different places on the scale. Moving from pi=0.5 to pi=0.6
moves the odds from 1:1 to 6:4 or 1.5:1, and thus its log from 0 to 0.405;
Moving from pi=0.8 to pi=0.9
moves the odds from 8:2 or 4:1 to 9:1, and thus its log from
1.386 to 2.197, a difference of 0.811, just about double the
0.405 closer to the centre.
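A small R sketch of that arithmetic:

  logit <- function(pi) log(pi / (1 - pi))
  round(logit(c(0.5, 0.6, 0.8, 0.9)), 3)
  # 0.000 0.405 1.386 2.197 : the move 0.5 -> 0.6 shifts the logit by 0.405,
  # while 0.8 -> 0.9 shifts it by 0.811, about twice as far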
[JH doesn't understand why textbooks don't begin with
linear transformations.]
0.4 (m-s) Greenwood's formula for the SE of an estimated
Survival Probability
When JH took mathematical statistics, one of the applications of the Delta
method was to work out the (approx.) variance for the surface area of a table
that was nominally W units wide and L units long, with
a 'manufacturing' error, or variation, of epsilon, on W;
and a similar (independent) one on L. He was told that sigma(epsilon)
was small relative to W and L (i.e., the coefficients of variation,
CV_L = 100 x sigma(epsilon_L) / L and
CV_W = 100 x sigma(epsilon_W) / W,
were low), so that even if we assumed Normal distributions for the two
epsilons, the probability of a table with a negative dimension was
negligible. One way to arrive at an approximate variance
was to expand the product (W+e_W)(L+e_L)
and then ignore the small e_W x e_L component.
Another was to use the Delta method for the log transformation to
derive the variance of the log of the product, and then
transform back (again using the Delta method, for the antilog transformation).
You can try that same exercise if you want, but I don't expect you
have that much free time right now!
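If you do want to try it, here is a small R sketch of the log-scale (Delta method) route, with made-up dimensions and errors, checked against simulation:

  W <- 100; L <- 200; sdW <- 2; sdL <- 3                 # hypothetical dimensions and errors
  var.logA <- (sdW / W)^2 + (sdL / L)^2                  # Var[log(Area)] ~ CV_W^2 + CV_L^2
  var.A    <- (W * L)^2 * var.logA                       # back-transform, Delta method again
  set.seed(601)
  A <- (W + rnorm(1e5, 0, sdW)) * (L + rnorm(1e5, 0, sdL))
  c(delta = var.A, simulation = var(A))                  # should be close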
(Recently, Amy Liu and I had to
deal with a variant of this problem, involving correlated errors,
when dealing not with a product
but with a quotient of two r.v.'s, in an analysis of the errors
in quotients caused by using input values extracted from digitized
images.)
Exercise 0.4 involves a classic formula that biostatisticians refer to as
"Greenwood's formula"
and that is central to epidemiologic and biostatistical data analysis.
I suppose you could represent each component in the product as
the 'true' amount plus an epsilon, expand the product, and ignore
lower-order terms, but it is probably easier to work with the
variance of the log of the product, and then go back to the original scale.
By the way, there seems to be a fixation on having the variance
on the original (0,1) scale, even though Gaussian-based confidence intervals
calculated on this (0,1) scale run the risk of going out of bounds.
Maybe we should obtain the variance in the (0,1) scale and then
move to the (unbounded) logit scale and calculate the CI there, THEN
take the anti-logit to return to the (0,1) scale.
There was a question on this in the 2012 Part A exam for PhD students
in biostatistics.
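Here is a small R sketch of that logit-scale route (made-up data: 3 positives out of 40), alongside the naive interval calculated directly on the (0,1) scale:

  y <- 3; n <- 40
  p <- y / n
  se.logit <- sqrt(1 / y + 1 / (n - y))              # Delta-method SE of log(p/(1-p))
  ci.logit <- log(p / (1 - p)) + c(-1.96, 1.96) * se.logit
  plogis(ci.logit)                                   # back-transformed: stays inside (0,1)
  p + c(-1.96, 1.96) * sqrt(p * (1 - p) / n)         # naive interval: lower limit is below 0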
0.5 (m-s) Link between exact tail areas of Binomial and
F distributions
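A small R numerical check of this link (the values of n, x and p are arbitrary):

  n <- 10; x <- 3; p <- 0.4
  1 - pbinom(x - 1, n, p)                                      # P(X >= x), X ~ Binomial(n, p)
  pf((n - x + 1) * p / (x * (1 - p)), 2 * x, 2 * (n - x + 1))  # the same tail area, via the F distribution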
0.6 Clusters of Miscarriages
Assume that -- even though, within a company,
the risk of miscarriage varies from woman to woman --
the pregnancy outcomes for different women
in the same company are independent. The main point of this
(real) story is what some would call the
law of large numbers -- if there are enough companies,
it will happen in 1 (or more) of them. And of course,
there is also the fact that we tend to notice extremes.
For more on this issue of co-incidences, and if you want
a break from the 'harder' stuff, you can look at
an article where JH has collected
several stories
involving the same law of large numbers, and the same
fascination with (benign) co-incidences. Of course, in
more serious situations, such as clusters of leukemias
and miscarriages and the like, it is not so easy
to convince people that it's all 'just' co-incidence.
And indeed, in any one instance, it is not easy
to distinguish a cluster that was caused by some
noxious agent from one that is a merely 'random' one.
The "Births Case 3" in that collection --
about numbers of twins in a school -- is probably
the closest in structure to the one on miscarriages.
JH was also struck by the role of
'filtering' that goes on in human-interest stories,
and the tendency of journalists to stretch the details even more
to make the odds even longer and the story all the
more remarkable! And the
fact that the same number in both states
(Lottery case 1) was more easily noticed because
the two states were beside each other in the alphabetical
ordering means that there might well have been other
days when two states that were not near each other in
the list had the same number -- but were not noticed.
0.7 "Prone-ness" to Miscarriages ?
Here we see the (absence of?) one of the
other requirements for a binomial. Part 4 of the question
deals with this.
One could carry out a formal chi-squared test of the goodness
of fit of the expected numbers under a (common) Binomial.
You are not asked to go that far; just do a visual test.
But think about how many degrees of freedom the statistic
should have: it's not 5-1 = 4, because, in addition to the constraint that
the 5 frequencies add to 70, there is also a further
constraint imposed by the fact that the expected (ie fitted)
frequencies must give an overall 30%
miscarriage rate.
If you were going to carry out a formal test,
one other issue would be how accurate the Chi-Sq distribution
is when some of the expected numbers are low, ie < 5.
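If you did want to carry it out, here is a small R sketch with made-up counts (and assuming 4 pregnancies per woman; the exercise's actual numbers would replace these). Note the extra degree of freedom lost to estimating the overall rate:

  obs   <- c(25, 22, 14, 6, 3)                     # hypothetical: women with 0,1,2,3,4 miscarriages (sums to 70)
  p.hat <- sum((0:4) * obs) / (4 * sum(obs))       # overall miscarriage rate, estimated from the data
  expct <- sum(obs) * dbinom(0:4, 4, p.hat)        # fitted frequencies under a common Binomial
  X2    <- sum((obs - expct)^2 / expct)
  df    <- 5 - 1 - 1                               # one more d.f. lost for estimating p.hat
  c(X2 = X2, p.value = pchisq(X2, df, lower.tail = FALSE))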
To think about, particularly in
light of Dr Moodie's presentation on random-effects models:
imagine (simplistically) every woman
as having been born with a different probability of having a miscarriage,
and that, given that probability, the outcomes
of successive pregnancies in that woman were all Bernoulli
with that same probability. The different probabilities
are called 'random effects'. Question: what would the shape of the resulting distribution be like?
If this example is over-simplistic, think of
how much of the year each person spends indoors,
and what responses you would get if
you selected all (or a sample) of them
and called each of them at 4 randomly selected times
(from all of the 60 mins x 24 hrs x 7 days x 52 weeks over the year). Some people
would have lower and some would have higher
probabilities of being indoors.
What do you think might be the shape of the distribution
of person-specific probabilities?
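A small R sketch of this (the Beta distribution for the woman-specific probabilities is an arbitrary choice, picked only so that its mean is 0.3):

  set.seed(601)
  p.i   <- rbeta(70, shape1 = 1.5, shape2 = 3.5)   # 70 woman-specific probabilities, mean 0.3
  y.mix <- rbinom(70, size = 4, prob = p.i)        # 4 pregnancies each, woman-specific probability
  y.com <- rbinom(70, size = 4, prob = 0.3)        # common pi = 0.3 for everyone
  table(factor(y.mix, levels = 0:4))               # typically more 0's and 3's/4's ...
  table(factor(y.com, levels = 0:4))               # ... than under a common pi
  hist(p.i)                                        # the shape of the 'random effects'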
0.8 Automated Chemistries
Can you see here the absence of one of the
requirements for a binomial?
In part 3, an informal 'eye fit' is sufficient.
BTW: by 'normal'
Ingelfinger means 'apparently healthy'.
BTW2: How do you think
hospitals, and companies who sell them equipment for
testing, establish their 'limits of normal' ?
0.9 Binomial or Opportunistic?
Capitalization on chance... multiple looks at data
This is very much in the same spirit
as the 'law of large numbers'
mentioned above.
JH recently came across
an amusing example of astronomers (all mathematicians) and the pope.
[from Wikipedia:] Pope Clement VI reigned during the period of the Black Death.
This pandemic swept through Europe (as well as Asia and
the Middle East) between 1347 and 1350 and is believed
to have killed between a third and two-thirds of Europe's
population. During the plague, Clement sought the insight
of astronomers for explanation. Johannes de Muris was
among the team "of three who drew up a treatise explaining
the plague of 1348 by the conjunction of Saturn, Jupiter,
and Mars in 1341"
Clement VI's physicians advised him
that surrounding himself with torches would block the plague.
However, he soon became skeptical of this recommendation
and stayed in Avignon supervising sick care, burials,
and the pastoral care of the dying. He never contracted
the disease.
How many candidate years, and how many candidate planets
(and how many other causes?) did these mathematicians
search before finding the 'co-incidence'?
This is a bit like not knowing about all of the discarded
(unpublished) instances of P > 0.05,
and only seeing the 1 (published!)
instance of P < 0.05! The p-value loses its interpretation
if there is selective reporting.
0.10 Can one influence the sex of a baby?
Besides its use of the Normal approximation to the binomial,
you could use the exact binomial tail.
It is a good example of the need to beware of
selective reporting.
Imagine that each of 100 researchers tried
a different way to toss 145 coins (or that
each of 100 biostatistics students used a different random seed to
generate 145 Bernoulli(0.5) realizations),
and only the ones who get
'statistically significant' deviations from 0.5 report their
findings. Many people are worried that that type of selective reporting
is going on in science, aided by the tendency of journals to only
publish 'statistically significant' deviations.
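A small R sketch of that scenario:

  set.seed(601)
  heads <- rbinom(100, size = 145, prob = 0.5)     # 100 'researchers', 145 fair tosses each
  p.val <- sapply(heads, function(y) binom.test(y, 145, p = 0.5)$p.value)
  sum(p.val < 0.05)                                # typically a handful 'find' an effect that isn't there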
If you have time, Google the name
John Ioannidis, who has been leading
a campaign for honest reporting, having found that many
so-called findings are not reproducible.
In meta-analysis,
this phenomenon has been called the
file-drawer problem.
The
funnel-plot is a useful way to see if the
p-values that do get reported are
representative.
0.11 It's the 2nd week of the course: it must be Binomial!
Fixed n? and i.i.d.? If you like, suggest some of your own!
0.11 Tests of intuition
You can see why we can get more 'extreme' results
in small samples.
0.13 Test of a proposed mosquito repellent
0.14 Triangle Taste test
This is a particularly good one to teach/learn
about sample size and power -- directly using the exact
binomial -- with no normal approximation to get in the way!
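For example, a small R sketch of an exact-binomial power calculation (the n of 30 and the alternative pi of 0.5 are just for illustration):

  n     <- 30
  crit  <- qbinom(0.95, n, 1/3) + 1            # smallest count with one-sided P-value <= 0.05 under guessing
  alpha <- 1 - pbinom(crit - 1, n, 1/3)        # the actual (exact) type I error
  power <- 1 - pbinom(crit - 1, n, 0.5)        # power if the taster is correct half the time
  c(critical.count = crit, alpha = round(alpha, 3), power = round(power, 3))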
0.15 Variability of, and trends in, proportions
0.16 A Close Look at Therapeutic Touch
These two are real applications of the binomial.
0.17 We shouldn't entrust statistical calculations to those
who can run a statistical or mathematical package
but do not have training in mathematical statistics and statistical inference! (This is a real case,
involving doping in sport.)