Agenda for Aug 25 to Sept 08, 2023
[updated August 24, 2023; please notify JH if you encounter any glitches]
Topic: The Quality of Measurements and the Effects of Measurement Error.
Preamble: This topic of measurement is probably new for you, as it was for JH
when he began in cancer clinical trials in 1973, and oncologists (cancer doctors)
were judging responses of advanced cancer to chemotherapy
by measuring tumours by 'palpation'.
Re Q1 and Q2: In early measurement courses JH gave, students measured readability manually
by counting the lengths of words and
sentences, and the number of syllables in words.
Early on in bios601, the task became much easier using online texts and online tools, or the tools in Microsoft Word.
From 3 measurements of readability, they calculated the standard error of measurement as the SD of the 3.
The CV is the SD divided by the mean of the 3, expressed as a percentage.
If the scale is in grade levels, the standard error of measurement is easy to judge, since everyone knows what 0.7 of a grade is.
But if the scale is arbitrary (running from say 0 to 70), a standard error of measurement of 9 'points' is more difficult to judge,
unless one knows well what a '40' or a '20' is. In this case the ICC is more useful,
but it requires that one has > 1 measurement on each of several texts of
different difficulty, so one can judge how much of the variation is genuine 'between-text'
variation and how much is 'within-text' variation.
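The quantities in this paragraph can be sketched in a few lines of Python. This is only an illustration: the grade-level readings are invented, the names (`readings`, `one_way_icc`) are hypothetical, and the ICC is computed from a balanced one-way ANOVA decomposition, which is one common choice rather than the only one.

```python
import math
import statistics

# Hypothetical grade-level readings: 3 repeat measurements on each of 4 texts.
# These numbers are invented for illustration only.
readings = {
    "text_A": [3.1, 3.9, 3.5],
    "text_B": [6.2, 5.6, 6.8],
    "text_C": [9.4, 10.0, 9.1],
    "text_D": [12.3, 11.5, 12.8],
}

# One text on its own: SD of the 3 measurements, and the CV as a percentage.
one_text = readings["text_A"]
sd_one = statistics.stdev(one_text)
cv_one = 100 * sd_one / statistics.mean(one_text)


def one_way_icc(groups):
    """Return (ICC, SEM) from a balanced one-way ANOVA decomposition."""
    k = len(groups[0])                       # measurements per text
    n = len(groups)                          # number of texts
    means = [statistics.mean(g) for g in groups]
    grand = statistics.mean(means)
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (n * (k - 1))
    # Between-text variance component (truncated at 0 if the estimate is negative).
    var_between = max((ms_between - ms_within) / k, 0.0)
    icc = var_between / (var_between + ms_within)
    return icc, math.sqrt(ms_within)


icc, sem = one_way_icc(list(readings.values()))
print(f"text_A alone: SD = {sd_one:.2f}, CV = {cv_one:.1f}%")
print(f"all 4 texts:  ICC = {icc:.3f}, SEM = {sem:.2f} grade levels")
```

With one text you get only the SD and CV; the ICC needs the several texts, because it compares the within-text variance with the between-text variance.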
For validity students went to the library (or looked around at home, or online) for books of
different difficulty so that we could see if the measurements agreed well with
what experts thought the difficulty of each book was. It was not like
the study in Q16 where 500 meant 500 or 1500 meant 1500 and all would agree on
this 'gold standard'.
Unlike in physical measurements, this issue of
an independent 'gold standard' is a challenging one in psychometric measurement.
It's not like you can order 'a grade 6' book from the US
National Institute of Standards and Technology (NIST)
the way you can order a substance with a known cholesterol concentration, or a 1 kg weight.
One strategy we used more recently was to look online for lists of
books recommended by
teachers for children in different grades, and
our job was made easier if we could find
the texts themselves online, and simply cut and paste
samples of them into MS Word or an online 'readability'
tool. Some years,
we used the full range from 'The Cat in the Hat' to
university texts,
and plotted the measurements against the grade or age level.
If you measure a text with different instruments or tools, and if they have a
common scale (e.g. grade level) then you could use
a linear model to estimate how much, if at all, they differ systematically from one another.
If you think of these tools as the only ones available (like iPhone vs Android)
then you should treat them as 'fixed' effects.
If, on the other hand, they are a sample of the many tools 'out there' then a
random
effects model might be more appropriate.
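As a rough sketch of the 'fixed effects' view, suppose every text is measured once by every tool, all on a grade-level scale. In that balanced layout the least-squares estimate of each tool's systematic offset (with offsets constrained to sum to zero) is just its column mean minus the grand mean. The readings and tool names below are invented for illustration.

```python
import statistics

# Hypothetical grade-level readings (invented for illustration):
# each row is one text, each column one tool, all on a common scale.
tools = ["tool_A", "tool_B", "tool_C"]
readings = [
    [3.0, 3.6, 2.7],
    [5.9, 6.7, 5.4],
    [9.1, 9.5, 8.4],
    [12.0, 12.6, 11.5],
]

all_values = [x for row in readings for x in row]
grand_mean = statistics.mean(all_values)

# Column (tool) means: zip(*readings) transposes rows into columns.
tool_means = [statistics.mean(col) for col in zip(*readings)]

# Fixed-effect estimate for each tool: how far it sits above or below
# the average of all tools, in grade levels.
tool_effects = {t: m - grand_mean for t, m in zip(tools, tool_means)}

for tool, effect in tool_effects.items():
    print(f"{tool}: {effect:+.2f} grade levels relative to the average tool")
```

Under the 'random effects' view one would instead summarize the spread of such offsets by a variance, treating these three tools as a sample from a larger population of tools.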
Even though we will use examples involving the measurement of
physical quantities such as activity and biochemical parameters,
JH kept the readability example in Q1 and Q2 as a reminder of the early psychometric
focus on the quality of measurements. The 2002 article, referred to in Q29, very much brings us into the computer era.
Week of Aug 26- Sept 1 (optional, but strongly recommended)
Upload (via MyCourses) your answers to however much/little
you are able to do of
Q3, Q4, Q5, Q6, Q7[PhD]
by Friday Sept 1.
JH will go over some of these in class on Wed Aug 30. You can also look/listen back to audio/videos
from past years, where some of this material has been covered.
{Qs 3, 4, 5 and 7 are 'just algebra',
while Q6 is asking you to think about how the 'numbers' came to be.}
[Even if you cannot get to Qs 3-5, please do look at page 16, and results 1 and 2
on page 22 of the Notes before the 1st class on Aug 31.]
Q3 [ 'm-s' is JH shorthand for 'math-stat' ].
The point of asking you to derive the link
is to emphasize that the
Standard Error of Measurement and R are not ENTIRELY
separate. Yes, the Standard Error of Measurement is more limited, because it does
not tell you how much variation there is from
person to person (or object to object). But you
can
think of R or the ICC as the proportion of the
OBSERVED variance that is 'real' (i.e. due to
genuine person to person variation),
and think of the remainder, 1 - R,
as the proportion that is due to the square of the Standard Error of Measurement.
(1) The square of the Standard Error of Measurement is ONE of the TWO
components in the observed variance. (2) The
genuine between-person variance (the square of the between-person SD)
is the other. The ICC is (2) as a fraction
of (1)+(2).
A good example of the 2 concepts
can be found in the blurb from the Educational
Testing Service called
INTERPRETING YOUR GRE SCORES, page 9 of the Notes.
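In symbols, the link between the two concepts can be written as below. The subscripts are labels chosen here for illustration, and the sketch assumes the usual additive model in which the observed variance is the sum of a between-person and a within-person (error) component.

```latex
\[
\sigma^2_{\text{observed}} = \sigma^2_{\text{between}} + \sigma^2_{\text{within}},
\qquad
R = \mathrm{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
\]
\[
1 - R = \frac{\sigma^2_{\text{within}}}{\sigma^2_{\text{observed}}}
\quad\Longrightarrow\quad
\mathrm{SEM} = \sigma_{\text{within}} = \sigma_{\text{observed}} \sqrt{1 - R}
\]
```

So knowing R and the observed SD gives the Standard Error of Measurement, and vice versa.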
Q4
Relationship between test-retest correlation
and ICC(X). The point here is to see the same
concept from two different perspectives.
Q5
Relationship between correlation(X,X')
and ICC(X): Some people like this explanation
of the ICC, since it echoes what was said above
about the ICC as the proportion of the variance we observe
in an imperfectly measured characteristic that is 'real'.
Think of a correlation as another way to measure
how strongly an imperfectly measured characteristic
correlates (agrees) with the perfectly measured version.
If you were trying to explain the ICC to a lay person, you
would probably have better success using 'correlation'
than 'variance'. To explain correlation you don't
have to get into as many details as you would have to if
you took the 'variance' route. If you are willing
to cheat a little bit (and tell people that
the SD is like the typical or average absolute deviation),
you might get away with using the concept of a SD,
but the concept of a typical or mean squared
deviation will for sure lose more people.
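A small simulation can make the two 'correlation' views in Q4 and Q5 concrete. It is a sketch under stated assumptions: X' is taken here to be the error-free value, the errors are independent and Normal, and the variance components (4 and 6, so ICC = 0.4) are arbitrary choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(601)                 # fixed seed so the run is reproducible
var_between, var_within = 4.0, 6.0       # assumed variance components
icc = var_between / (var_between + var_within)   # 0.4

n = 20000
truth = [rng.gauss(0, math.sqrt(var_between)) for _ in range(n)]
test = [t + rng.gauss(0, math.sqrt(var_within)) for t in truth]
retest = [t + rng.gauss(0, math.sqrt(var_within)) for t in truth]

r_test_retest = pearson(test, retest)    # two imperfect measurements
r_test_truth = pearson(test, truth)      # imperfect vs error-free version

print(f"ICC                            = {icc:.3f}")
print(f"corr(test, retest)             = {r_test_retest:.3f}")
print(f"corr(test, truth)              = {r_test_truth:.3f}")
print(f"square of corr(test, truth)    = {r_test_truth ** 2:.3f}")
```

Under these assumptions the test-retest correlation should land close to the ICC, and the correlation with the error-free version close to its square root.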
Q6
Galton's data:
Have a look at the Record of Family Faculties
(Family album)
that respondents used to report
the family heights.
Who, if anyone, did the measuring? Who
did the 'reporting'?
Do you know how tall your parents and grandparents
are (or were)?
Q7
'Increasing Reliability by averaging several
measurements'
This is a very topical and 'charged' issue at funding agencies, such as
the Canadian Institutes of Health Research, where each
grant application used to be reviewed by 2 primary reviewers, and then
an average was made of the scores of up to 20 panel members (incl. the 2)
who heard and discussed the
2 reviews, and had also looked through the application themselves.
The new system uses 5 reviewers who do not meet/communicate, and
their scores are averaged.
If, in the old system, the ICC was say 0.4, what would the ICC be
if we used the average of 2 raters? 3 raters? 4 raters?
You can manipulate the algebra as you wish, but you might also
think of it as follows:
if we average m raters, the true sigma-sq-between is not affected,
but the true sigma-sq-within now gets reduced from
sigma-sq-within / 1 if 1 random rater, to sigma-sq-within / 2 if we average 2,
... sigma-sq-within / m if we average the scores of m raters.
So with an average of m raters, the observed variance of these averages is now
sigma-sq-between + sigma-sq-within / m
The fraction of this that is signal is
sigma-sq-between / [ sigma-sq-between + sigma-sq-within / m ]
So what the question is asking is: what if we use N*m raters instead of m?
We then have the two fractions
sigma-sq-between / [ sigma-sq-between + sigma-sq-within / (N*m) ]
and
sigma-sq-between / [ sigma-sq-between + sigma-sq-within / (1*m) ]
The algebra is a matter of manipulating these fractions so as to remove the
'm' that is there to start with, and end with the basic ICC[1]
(i.e. what if m=1) and the scaling factor N.
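For reference, one way the manipulation can go, starting from a single rater (m = 1) and writing R for ICC[1], is sketched below. The symbols are labels chosen here, and the end result is the form usually called the Spearman-Brown formula.

```latex
\[
R = \mathrm{ICC}[1] = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
\quad\Longrightarrow\quad
\frac{\sigma^2_{\text{within}}}{\sigma^2_{\text{between}}} = \frac{1 - R}{R}
\]
\[
\mathrm{ICC}[N]
= \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}} / N}
= \frac{1}{1 + \dfrac{1 - R}{N R}}
= \frac{N R}{1 + (N - 1) R}
\]
```

The variance components themselves drop out: only the starting reliability R and the scaling factor N remain.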
Another example: if a 3-hour GRE exam, done with paper and pencil, has
a reliability of 0.9, what reliability would a 6-hour or 12-hour exam have?
Taking 3 hours as the unit of effort, it is
0.9/ (0.9 + 0.1 ) for 3 hours
0.9/ (0.9 + 0.1/2) for 6 hours
0.9/ (0.9 + 0.1/4) for 12 hours
etc.
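The arithmetic for both examples (the raters with ICC 0.4 and the GRE exam with reliability 0.9) can be checked with a short function. This is a minimal sketch; the function name `reliability_of_average` is made up here, and it assumes the extra raters or extra hours behave like independent replicates with the same error variance.

```python
def reliability_of_average(icc_single, n):
    """Reliability when n 'units' of measurement are averaged.

    icc_single is the reliability for one unit (one rater, or one 3-hour exam);
    n is the number of units averaged (Spearman-Brown form).
    """
    return n * icc_single / (1 + (n - 1) * icc_single)


# Grant review example: single-rater ICC assumed to be 0.4.
for raters in (1, 2, 3, 4):
    print(f"{raters} rater(s): ICC = {reliability_of_average(0.4, raters):.3f}")

# GRE example: 3 hours is the unit of effort, with reliability 0.9.
for units, hours in ((1, 3), (2, 6), (4, 12)):
    print(f"{hours:2d} hours: reliability = {reliability_of_average(0.9, units):.3f}")
```

The GRE lines reproduce the fractions 0.9/(0.9 + 0.1), 0.9/(0.9 + 0.1/2) and 0.9/(0.9 + 0.1/4) given above.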
Geoff Norman was part of a group who developed
McMaster's 'Multiple Mini Interview' system.
McMaster, and many other schools since then, have
abandoned the traditional interview
and use this instead;
see Eva KW, Rosenfeld J, Reiter HI, Norman GR.
An admissions OSCE: the multiple mini-interview.
Med Educ. 2004 Mar;38(3):314-26,
and subsequent publications
that evaluated its measurement properties.
Week of Sept 1-8
Upload your answers to however much/little you are able to do of
Q08,
Q09a,
Q17 parts a and b,
Q20,
Q26 (PhD),
Q27 [PhD optional],
Q28 [PhD optional],
Q29,
Q30 (3 teams, each led by a PhD student),
Q31
by Friday September 8.
Q8
Just because (random) measurement errors tend to cancel out in
averages doesn't mean that errors in measurement can be ignored. For example,
how comfortable would you be in measuring how much physical activity JH does
by having him wear a 'step-counter' for a randomly selected week of the
year, and using that 1-week
measurement as an 'x' in a multiple or logistic or Cox
regression? See slides 7 and 8 from part of JH's
"Scientific reasoning, statistical thinking, measurement issues, and use of
graphics: examples from research on children"
at Royal Children's Hospital in Melbourne, some years ago.
pdf
Some of the terminology will be new to you, and so (as you will discover
when you run the simulations in Q8 of how well you can estimate the conversion factors between
degrees F and degrees C) will some of the consequences of measurement error.
The "animation (in R) of effects of errors in X on slope of Y on X" might be of interest,
as might the java applet accompanying "Random measurement error
and regression dilution bias".
These consequences are rarely touched on, let alone emphasized, in theoretical courses on regression, where all
'x' values are assumed to be measured without error! Welcome to the REAL world.
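To see the flattening for yourself, here is a minimal simulation in the spirit of the F-to-C example. It is not the Q8 simulation itself: the sample size, the spread of true temperatures and the size of the errors in the recorded degrees F are all arbitrary assumptions made for this sketch.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2023)     # fixed seed so the run is reproducible
n = 20000
sd_true_f = 10.0              # assumed SD of the true temperatures (degrees F)
sd_error_f = 10.0             # assumed SD of the random error in recorded F

true_f = [rng.gauss(70, sd_true_f) for _ in range(n)]
celsius = [(f - 32) * 5 / 9 for f in true_f]              # Y, measured without error
observed_f = [f + rng.gauss(0, sd_error_f) for f in true_f]  # X, measured with error

slope_true = ols_slope(true_f, celsius)          # recovers 5/9
slope_observed = ols_slope(observed_f, celsius)  # flattened (diluted) slope

# Reliability (ICC) of the error-prone X: the expected attenuation factor.
icc_x = sd_true_f ** 2 / (sd_true_f ** 2 + sd_error_f ** 2)

print(f"slope using error-free F : {slope_true:.4f}  (5/9 = {5 / 9:.4f})")
print(f"slope using recorded F   : {slope_observed:.4f}")
print(f"ICC of X times 5/9       : {icc_x * 5 / 9:.4f}")
```

With these settings the ICC of the recorded X is 0.5, so the fitted slope should be roughly half of 5/9. Changing `sd_error_f` shows how the dilution grows with the amount of error.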
Q9
The point is to 'smooth' the decay curve.
But (as the hint says) its broad form should
not be a big surprise : it was the
subject of a question earlier on in the math-stat.
questions. But, as JH recently realized, and addresses in part [b], the functional form is not as simple or universal as he thought:
he had been 'over-selling' this formula in his teaching for the last 15 years.
Q17
* Measuring Heart Rate is more challenging
Link
Link
* Measuring Environmental Noise:
Link
* Measuring Physical Activity: interpreting the statistical REPORT
is challenging!
Link
Q20
!!
This is a good illustration of result 2 on page 22 {BUT, we might actually need further conditions
on the distribution of X !}.
Q21
A real example to convince you that
MEASUREMENT ERROR MATTERS.
By the way, section 4.7.4 Measurement error in regression,
of the Cox-Donnelly book cited in question 21 gives a nice visual
(rather than algebraic) explanation for the flattening of the slope.
It's the same visual explanation that JH's co-authors give
in their BMJ tutorial, here
J Hutcheon, A Chiolero and J Hanley Random measurement error and regression dilution bias
The AIDS example in Cox and Donnelly is a very nice illustration of how measurement errors
led to a delay in figuring out how HIV was transmitted. This reinforces the message
that measurement errors can cause quite big but
subtle distortions.
This Q, new in 2019, is meant to show that the
dilutions you have calculated/seen are not just in toy math examples,
but also present in just about all research involving regression -- since it
is near impossible to measure all X's perfectly.
Clearly, some authors like the ICC, while others make it a bit more complicated
by using a ratio of the 2 components in the ICC. In the end, it comes to the same thing.
JH prefers the ICC version, since it is more easily remembered.
Q23
Unlike the ICC or R, or the variance-ratio used in Q21, the magnitude of the errors in this Q,
new in 2020, is not naturally quantified by a
variance, or a fraction of the overall variance. But the effect
is actually easier to explain to a lay person than the variance-based attenuations we have focused on.
The 1856 publication is not well known, and the description of the effects of the errors in the addresses is even less well known.
Thus, a short note to a modern day Epidemiology journal, explaining
this early attention to measurement, and recounting Snow's clear description of the consequences of errors,
would make a
nice contribution to the teaching of epidemiology and its history!
JH is hoping that your answers to this Q can be the start of such a note... authored by the class!
Q24
Will be interested to read your suggestions.
Q25
A hand-drawn version of the diagram will be fine.
Q26
No need to go beyond a 2-point X, or 2-point errors. If you understand this case, you understand the more general case.
[see page 23 and the link on page 24 to p 6 of his notes for another course]
Q27
This Q is new in 2021. It might be easier to do the
Blood Pressure example, and the even age distribution,
since it should give more or less the same answer as the
log of the 28-day mortality proportions
as a linear function of age.
[The point of showing you that complex model was
to make it a bit more 'real' and to introduce you to Gompertz law;
he was dealing with all-cause mortality in a general population,
and he needed 3 straight lines to show the U-shaped 'force of mortality'
function. See here
for a slightly more complex version.]
We don't have enough really young people in the UK data, to test if
the same 3-part piece-wise log-linear model applies with
28-day mortality in persons who have tested
positive for COVID.
Even though the Blood Pressure example may look more like a 'toy' math example
where it is easier to 'see' what is going on, it is not THAT
far-fetched. For example, see
here
The '100 + your age' equation was commonly used when JH was young.
There was another rule, which his doctor gave him regarding weight-gain:
"after you get married (JH did at 26, and weighed 130 lbs or so)
you are allowed to put on a pound a year". Until COVID, he had more or less
stayed within this guideline!
Q28
This Q was also new in 2021, so the wording has not been extensively
beta-tested, unlike those from earlier years.
The difference between it and earlier Qs re. error is
that the real 'X' (here VOC+ or VOC-) is BINARY, and
(especially early on) was not perfectly measured.
And when Y (1=dead, 0=survived) is also binary,
things get a bit more complicated. This example
is easier than the typical one, since there are only false
positives, and no false negatives. This simplest error context may
allow you to come up with a simpler algebraic formula than
would apply when both types of error are present
(as in the John Snow example in Q23, where there are mistakes on both sides).
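A toy calculation may help in seeing why the false-positives-only case is simpler. Everything below is an assumption made for illustration: the prevalence, the false-positive fraction and the two risks are invented, the variable names are hypothetical, and the labelling errors are taken to be unrelated to the outcome.

```python
# All numbers are invented for illustration; they are not from the Q28 data.
prevalence = 0.30        # true proportion who are VOC+
false_pos = 0.20         # P(labelled VOC+ | truly VOC-); no false negatives
risk_pos = 0.004         # true risk of death if truly VOC+
risk_neg = 0.002         # true risk of death if truly VOC-

# Those labelled VOC+ are a mixture of true positives and false positives.
labelled_pos = prevalence + (1 - prevalence) * false_pos
ppv = prevalence / labelled_pos          # fraction of labelled VOC+ who truly are

risk_labelled_pos = ppv * risk_pos + (1 - ppv) * risk_neg
# With no false negatives, everyone labelled VOC- is truly VOC-.
risk_labelled_neg = risk_neg

rd_true = risk_pos - risk_neg
rd_observed = risk_labelled_pos - risk_labelled_neg

print(f"fraction of labelled VOC+ who are truly VOC+: {ppv:.3f}")
print(f"true risk difference     : {rd_true:.5f}")
print(f"observed risk difference : {rd_observed:.5f}")
print(f"ratio observed / true    : {rd_observed / rd_true:.3f}")
```

Under these assumptions the comparison group is 'clean', so only the labelled-positive group is diluted, and the observed risk difference is the true one multiplied by the fraction of labelled positives who are truly positive.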
Q29
This Q is new in 2022. JH finds it to be a nice modern intro to readability.
He is sad to find that the SMOG index (which we used to use for manual
'measurements' way back when) did not make it to the list.
Maybe when Microsoft put some of the other tools into MS Word, that was the end of the
SMOG index. Some of the SMOG links no longer work, but you could probably find new ones. JH
always liked the name!