BIOS601 AGENDA: Week of September 09 to September 15, 2023
[updated September 11, 2023]
- Discussion of issues in JH's
Notes and assignment on C&H Ch01 [probability models] and Ch02 [conditional probability models]
INDIVIDUAL-LEVEL answers to be uploaded to MyCourses by the end of the 'business' week for these exercises:
'o' = optional
o 1.1 (C&H p. 7),
o 1.2 (C&H p. 8),
o 2.1 (C&H p. 11),
o 2.2 (C&H p. 13),
2.3 (C&H p. 13)
1.1 (jh p. 4),
1.2 (jh p. 5),
2.1 (jh p. 10),
2.2 (jh p. 10, PhD),
o 2.3 (jh p. 10),
2.4 (jh p. 10, PhD),
2.5 (jh p. 11),
2.15 (jh p. 24),
o 2.18 (jh p. 24),
2.19 (jh p. 25),
2.20 (jh p. 25),
2.21 (jh p. 25)
TEAM-LEVEL answers to be uploaded by team captains
for
2.11 (jh p. 19)
2.13 (jh p. 22)
Remarks:
Chapter 1 of C&H introduces some ways of looking at statistical
entities and concepts that you may not have met, as well as some
terminology that is used in a more specific way in epidemiology. You might want to
look at section 1 of JH's notes, from earlier years, on
Concepts involved in Occurrence Measures in Epidemiology.
JH has also included the first page of this section (mostly definitions) in
the notes that annotate the C and H chapters: he has placed it under the heading
'Important: Concepts and Terms in Epidemiology'
after his notes on 1.2 Binary data, and before 1.3 The binary probability model.
Supplementary Exercise 1.1 is designed to get you familiar with the
'other' scales
for measuring probabilities, and when the odds and probability measures are close, and when they diverge.
Other scales you will need to become very familiar with are the logit and the probit scales
We show all of these in one graph in our 'under construction'
online textbook for epidemiology students.
The online book has newer versions of some of the graphs JH uses in these notes, as well as additional commentaries.
JH's notes on Section 1.4 of C&H (and Supp. exercise 1.2)
are intended to 'shake you up a bit' and force you to think
outside the box as to how you used to estimate the parameters
of a simple linear regression. This model is usually
shown as a 2-parameter (slope, intercept) model, but JH has
deliberately reduced the model to a 1-parameter version,
with the "line" going through the origin [other examples
might be trying to estimate (from error-containing
measurements of the volumes of 2 spheres of different radii:
radii measured without error!)
the constant in the relation:
Volume of a sphere = "some constant" times the cube of its diameter.]
The fewer the elements involved, the more chance there is to really
master the fundamentals and 'join the dots.'
He has recently added a
shiny app that allows additional criteria for the 'fit'.
You can also try the other 1-parameter (elevator) example at the bottom of that webpage.
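The one-parameter idea can be sketched numerically. Assuming ordinary least squares and made-up measurements (this is not JH's shiny app, just an illustration), the estimate of the single constant has a closed form:

```python
import math

# One-parameter model through the origin: V = c * d**3, with the
# diameters d measured without error and the volumes V with error.
# Writing x = d**3, the least-squares estimate of c is
# sum(x * V) / sum(x * x).  (Numbers below are made up.)
diameters = [2.0, 4.0]
x = [d**3 for d in diameters]
true_c = math.pi / 6                         # exact constant in V = (pi/6) d^3
volumes = [true_c * xi * 1.02 for xi in x]   # simulate 2% "measurement error"

c_hat = sum(xi * vi for xi, vi in zip(x, volumes)) / sum(xi * xi for xi in x)
print(f"estimated constant: {c_hat:.4f}  (pi/6 = {true_c:.4f})")
```

With only one parameter, you can see the whole estimator at once and 'join the dots' between the data and the fit.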
Chapter 2 of C&H is -- to JH at least -- a very elegant and simple
and graphic way to introduce probabilities, and particularly
those that are linked to each other in time, or by
additional pieces of knowledge. And notice how many probabilities
of interest go from right to left, i.e., from after to before.
It is worthwhile to work through C&H's own exercises and then check
your answers against the solutions they provide at the end of Ch 2.
Fig 2 in JH's Notes on Ch 2 has several simple but educational
examples showing the different 'directionalities'. It also
emphasizes that products of probabilities are like 'fractions of fractions'
but that sometimes, the probabilities depend on what has gone before,
and sometimes do not. (the online book has newer diagrams)
The 2 stories accompanying the Notes on section 2.2 should serve as a stark
and frightening reminder that P(theta|data) is a very different 'animal'
than P(data|theta) and that the consequences of mixing them can be enormous.
If you want a topical example, think of the difference between
P(A|B) and P(B|A), where A = the hypothesis that Higgs Boson particles exist,
and B = the bump in the curve. Btw, JH likes
to label the elements in what appears to be the best 'logical' or
'chronological' or 'causal' order, i.e., A -> B, but notices
that many textbooks teach the concepts using arbitrary letters.
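The asymmetry between the two conditional probabilities can be made concrete with Bayes' rule. The numbers below are purely hypothetical, not from any real Higgs analysis:

```python
# Bayes' rule: P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|~A) P(~A)].
# All three input probabilities are hypothetical, for illustration only.
p_A = 0.001           # prior probability of hypothesis A
p_B_given_A = 0.95    # probability of the observed "bump" B if A is true
p_B_given_notA = 0.05 # probability of seeing such a bump anyway

p_A_given_B = (p_B_given_A * p_A) / (
    p_B_given_A * p_A + p_B_given_notA * (1 - p_A))
print(f"P(B|A) = {p_B_given_A}, but P(A|B) = {p_A_given_B:.4f}")
```

A high P(B|A) coexists here with a small P(A|B): the two 'animals' really are different.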
JH's notes on Section 2.3 have a genetics (haemophilia) example that is
still very relevant. But, since he first encountered it 40 years ago,
medical science has advanced, so one no longer needs to wait
until the woman has had one or more offspring before learning about her carrier status.
JH would be grateful for a different example where one would still
need to wait.
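The flavour of the carrier-status calculation can be sketched with the standard textbook numbers (a sketch, not taken from JH's notes): prior probability 1/2 of being a carrier, and each unaffected son has probability 1/2 under 'carrier' versus 1 under 'not a carrier':

```python
# Classic haemophilia carrier calculation (standard textbook numbers).
def posterior_carrier(n_unaffected_sons, prior=0.5):
    """Posterior probability of carrier status after n unaffected sons."""
    like_carrier = 0.5 ** n_unaffected_sons  # each son unaffected w.p. 1/2
    like_not = 1.0                           # always unaffected if not a carrier
    num = like_carrier * prior
    return num / (num + like_not * (1 - prior))

for n in range(4):
    print(f"{n} unaffected sons -> P(carrier) = {posterior_carrier(n):.3f}")
```

Each unaffected son halves the odds of being a carrier, which is the waiting-for-offspring logic the paragraph describes.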
At a debate a few years ago, JH came up with the challenge of
estimating/judging a person's age from various pieces of information.
You might like to take a quick look at
the example and the pieces of information provided.
Supplementary Exercise 2.1 ('Efron's twins story') can be tackled in many ways.
Efron uses the odds scale to go from 'pre-' to 'post'-test odds, and then switches back to the probability scale.
We do the same when teaching medical students about diagnostic tests. Fortunately, today, with
readily accessible apps, there is less emphasis on the calculation, and more on the probabilities themselves.
A few pages further on in the notes, you will see what (paper) 'apps' were like in 1975! Fagan's
nomogram is still a clever tool, and JH has used it as a starting point for a shiny app
cited in the coloured box on the right-hand side of page 8 of his Notes. This box gives you links to
the 'terminology' for the errors/performance of medical diagnostic tests (If JH had his way, we would
never have invented the terms sensitivity and specificity) and the correspondences with
statistical tests.
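The arithmetic behind Fagan's nomogram is just pre-test odds times likelihood ratio, then back to the probability scale. A minimal sketch, with illustrative sensitivity and specificity values (not from any particular test):

```python
# Fagan-nomogram arithmetic: post-test odds = pre-test odds * LR.
def post_test_prob(pre_test_prob, sensitivity, specificity, positive=True):
    """Post-test probability after a positive (or negative) test result."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    if positive:
        lr = sensitivity / (1 - specificity)        # LR+
    else:
        lr = (1 - sensitivity) / specificity        # LR-
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

p = post_test_prob(0.10, sensitivity=0.90, specificity=0.95)
print(f"post-test probability after a positive test: {p:.3f}")
```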
Supplementary Exercise 2.2 ('The Monty Hall Problem') can be very frustrating
and is easily misunderstood. JH has had to break up
fights between people who are over-confident but under-listening.
Key is the fact that Monty Hall KNOWS
which door contains which: sometimes (how often?)
he has a choice of 2 doors that he could open
to reveal nothing, and sometimes (how often?)
he has only 1 choice.
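The role of Monty's knowledge can be checked by simulation. A minimal sketch in which Monty only ever opens a door that hides nothing and is not the contestant's pick:

```python
import random

# Monte-Carlo check of the Monty Hall problem. Monty KNOWS where the
# prize is: he opens a door that is neither the prize nor the pick.
def play(switch, rng):
    prize = rng.randrange(3)
    pick = rng.randrange(3)
    # Doors Monty may open: not the pick, not the prize (1 or 2 choices).
    opened = rng.choice([d for d in range(3) if d != pick and d != prize])
    if switch:
        pick = next(d for d in range(3) if d != pick and d != opened)
    return pick == prize

rng = random.Random(2023)
n = 100_000
wins_switch = sum(play(True, rng) for _ in range(n)) / n
wins_stay = sum(play(False, rng) for _ in range(n)) / n
print(f"P(win | switch) ≈ {wins_switch:.3f}, P(win | stay) ≈ {wins_stay:.3f}")
```

Switching wins about 2/3 of the time, exactly because Monty's choice is constrained by what he knows.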
In Exercise 2.3, it is equally important to be
very precise as to the
information provided.
In Exercise 2.4, we have another good example of the difference between P(H|data)
and P(data|H). Notice here that we are not examining a range of possible
H's, just 2 specific H's. Notice further that in the Bayesian approach we do not consider
data values that have not been observed; in contrast, the p-value does consider data values
that have not been observed (we should not call such unobserved values 'data', but
rather, potential data values).
JH finds that diagrams, especially 'tree' diagrams, can be very
helpful in these types of problems, and again when we revisit the Binomial.
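The Bayesian-versus-p-value contrast can be made concrete with a small binomial example (hypothetical numbers): the likelihood ratio for two point hypotheses uses only the observed count, while the one-sided p-value also sums over counts that were never observed:

```python
from math import comb

# Two point hypotheses about a binomial proportion (illustrative):
# H0: p = 0.5 vs H1: p = 0.8, having observed k = 8 successes in n = 10.
n, k = 10, 8
lik = lambda p: comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood ratio: depends ONLY on the observed k = 8.
lr = lik(0.8) / lik(0.5)

# One-sided p-value under H0: also sums over the unobserved 9 and 10.
p_value = sum(comb(n, j) * 0.5**n for j in range(k, n + 1))

print(f"LR (H1 vs H0) = {lr:.2f}; one-sided p-value under H0 = {p_value:.4f}")
```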
Q2.5 was new in 2015,
so the wording hasn't had the same beta-testing as 2.1-2.4.
It's a pity that in the otherwise clever 'left brain' article,
the BMJ messed up on the 'teaser' introduction. JH
finds The Economist graphics clearer and simpler. What about you?
Q2.11 was new in 2019, having been prompted by a
(since withdrawn) tutorial article 'How to investigate an accused serial sexual harasser'
in Statistics in Medicine. If you Google it, you will see that
it generated considerable 'heat'. The tutorial referred indirectly to the data given in the exercise.
JH took a special interest
in the topic because of his involvement in reviewing the 2003 report.
Q2.12 is new in 2020, and was prompted by the coverage of the Santa
Clara study in Andrew Gelman's blog. The Santa Clara study was also the basis
for exercise 22 in the measurement material, and a question
in the Part A (bios700) PhD exam of August 4, 2020. We will come back to it again
when we address Likelihood-only methods in C&H chapter 3.
Q2.13 is new in 2021, and was prompted by the increasing numbers of
statements about which patients are being hospitalized for COVID-19.
It is also a great chance to learn a very common epidemiologic design,
one that goes by a very poorly chosen name -- the so-called
'CASE-CONTROL' study.
We explain what it involves, and why it is a simple and otherwise standard comparison of two rates,
but where the (relative) (or maybe even the absolute) sizes of the
person-time denominators are ESTIMATED (using a denominator series) rather than KNOWN.
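The estimation step can be sketched with made-up counts: the exposed-to-unexposed split of the person-time base is estimated from a sampled 'denominator series' rather than known:

```python
# Sketch of the case-control ('denominator series') logic, with
# made-up counts purely for illustration.
cases_exposed, cases_unexposed = 40, 60
series_exposed, series_unexposed = 25, 75  # a SAMPLE of the person-time base

# Estimated rate ratio = (ratio of case counts) divided by the
# ESTIMATED ratio of the exposed vs unexposed person-time.
rate_ratio = (cases_exposed / cases_unexposed) / (series_exposed / series_unexposed)
print(f"estimated rate ratio: {rate_ratio:.2f}")
```

The comparison itself is the standard two-rate one; only the denominators are estimated rather than enumerated.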
Q2.15 is new in 2021 (during Covid-19).
JH knows David Spiegelhalter and pays great attention
to everything he writes (and gives his books as
prizes for our end-of-term quiz shows). Moreover, it took JH by surprise.
(I knew some politicians were playing this second-test game when
the first one was positive, and they did not believe it.)
It is also a great chance to appreciate the direct
route from pre-test to post-test ODDS via the Likelihood Ratio.
And of course, being Bayesian, this was a natural way for
Spiegelhalter. I noticed that it was also the route that
Efron (whom I think of as a bit less Bayesian) took when revising
the post-ultrasound probability
that the twins are identical.
And it illustrates how one could use the Fagan Nomogram --
a tool that does not get enough respect, not just in medicine,
but also in Statistics.
Q2.16 Andrew Gelman had a number of very good COVID-based 'sermons,'
all very sharp and to the point. I put the 2020 Santa Clara study, which
he did a lot of work on, in the fast-breaking Notes in that glum
online-only Fall of 2020.
Q2.17 Gelman again -- stretching our brains!
Q2.18 This was prompted by the challenges of teaching 'post-test-probability' calculations to medical students, and the fact that many of the
teachers are not all that comfortable trying to show the derivation of this LR approach, especially the formula itself. I would love it if
one of you would join me in getting this manuscript cleaned up and submitted
somewhere.
Q2.19 I am very curious what you think of this article. I have
strong opinions about it. It's not so much that I don't agree with the
data (in fact I think I can explain the patterns seen in Figs 1 and 2)
as with the reasons why, and how a journal could be convinced into publishing them.
One silver lining is that it gives you a chance to practice moving seamlessly
(and effortlessly, without a calculator) between the (0,1) probability scale,
and the (-Infinity, +Infinity) logit [a.k.a. log-odds ] scale you will use a lot in your GLM
course next term -- and in your career as a biostatistician. Translating a logit
of -1 or -2 (or +1 or +2) without a calculator should become as automatic
to you as translating 10^(-2) or 10^(-1) or 10^(0) or 10^1 or 10^2 !!
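The translations the paragraph asks you to internalize can be checked with the expit (inverse-logit) function:

```python
import math

# logit -> probability via expit(x) = 1 / (1 + exp(-x)),
# the inverse of logit(p) = log(p / (1 - p)).
def expit(x):
    return 1 / (1 + math.exp(-x))

# The handful of values worth memorizing:
for x in [-2, -1, 0, 1, 2]:
    print(f"logit {x:+d}  ->  p = {expit(x):.2f}")
```

Note the symmetry: expit(-x) = 1 - expit(x), so memorizing the positive half gives you the negative half for free.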
Q2.20 This lay article does a nice job of avoiding the technical
terms (jargon). So, for you, this is a back-translation exercise!
But do check that there are no mistakes in their 'forward-translation'.
I think a 2 x 2 frequency table might help. It would also help to
set up the 'of every 100 (or maybe 1000) persons tested' table
the way the BMJ interactive tool (exercise 2.21) does.
And, of course, as Clayton and Hills say in their exercise 2.3 (C&H p. 13),
the setting in which it is applied matters a lot.
Q2.21 JH has not examined it in great detail,
but it seems to be quite good, especially the interactive aspect. Compare with the
(static) diagrams in the Economist and in the
'left brain - right brain'
pieces in exercise 2.5 above.