BIOS601 AGENDA: Wednesday September 06, 2017
[updated August 31, 2017 --please notify JH if you encounter any glitches]
  
     
 -   Discussion of issues in the 
  Assignment on measurement
 
 Q1 and Q2 (measuring 'Readability'): answers need not be handed in; just think about the issues.
  If there is time, we might discuss and do some 'measuring' in class.
 
 Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q18: Answers to be handed in.
 
 Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17: answers need not be handed in. If there's time,
  we will think together about what the answers to them might look like.
 
 Remarks: this topic of measurement is probably new for you, as it was for JH
  when he began in cancer clinical trials in 1973, when oncologists (cancer doctors)
  were judging the responses of advanced cancers to chemotherapy
  by measuring tumours by 'palpation'.
 
 Q1
 
 'Back then' ('BC') students had to measure readability manually, by counting the lengths of words and
 sentences, and the number of syllables in words.
 Today that is made much easier by online tools and by those built into Microsoft Word.
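 For a sense of what those manual counts feed into: the Flesch-Kincaid grade level is one
 common readability formula, and a minimal R sketch of it (the counts below are made-up
 numbers, just for illustration) is:

    ## Flesch-Kincaid grade level, computed from counts of words,
    ## sentences, and syllables in a sample of text
    fk_grade <- function(words, sentences, syllables) {
      0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    }

    ## e.g. a 100-word sample with 8 sentences and 135 syllables (made-up counts)
    fk_grade(words = 100, sentences = 8, syllables = 135)   # about grade 5.2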
 
 From 3 measurements of  readability, you can calculate the standard error of measurement as the SD of the 3. 
The CV is the SD divided by the mean of the 3, expressed as a percentage.
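 In R (with three hypothetical grade-level readings):

    ## 3 readability measurements of the same text (hypothetical values)
    x <- c(8.4, 9.1, 8.7)

    sd(x)                    # standard error of measurement (the SD of the 3)
    100 * sd(x) / mean(x)    # CV, as a percentage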
 
 If the scale is a natural one, like a grade level, then the SEE makes sense,
since everyone knows what 0.7 of a grade is.
But if the scale is arbitrary (running from, say, 0 to 70), an SEE of 9 'points' is more difficult to judge,
unless one knows well what a '40' or a '20' is. In this case the ICC is more useful,
but it requires that you have more than 1 measurement on each of several texts of
different difficulty, so that you can judge how much is genuine 'between-text'
and how much is 'within-text' variation.
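 As a minimal sketch (with made-up data: 2 measurements on each of 5 texts), the ICC can be
 obtained from a one-way ANOVA:

    ## made-up data: 2 readability measurements on each of 5 texts
    texts <- data.frame(text  = factor(rep(1:5, each = 2)),
                        score = c(2.1, 2.5, 4.0, 3.6, 6.2, 6.8, 8.1, 7.7, 10.4, 10.9))

    ## one-way ANOVA; with k = 2 measurements per text,
    ##   E[MS_between] = sigma2_within + k * sigma2_between,  E[MS_within] = sigma2_within
    a   <- anova(lm(score ~ text, data = texts))
    MSB <- a["text", "Mean Sq"]
    MSW <- a["Residuals", "Mean Sq"]
    k   <- 2

    sigma2_between <- (MSB - MSW) / k
    sigma2_within  <- MSW
    sigma2_between / (sigma2_between + sigma2_within)   # the ICC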
 
 If you measure a text with different instruments or tools, and if they have a
common scale (e.g. grade level), then you could use
a linear model to estimate how systematically (if at all) they vary from one to another.
If you think of these tools as the only ones available (like iPhone vs Android)
then you should treat them as 'fixed' effects.
If on the other hand they are a sample of the many tools 'out there', then a
random effects model might be more appropriate.
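 A minimal sketch of the two approaches, using made-up data (4 texts, each scored on a
 grade-level scale by 3 tools) and the lme4 package for the random-effects version:

    ## made-up data: 4 texts, each scored (as a grade level) by 3 tools
    set.seed(1)
    dat <- expand.grid(text = factor(1:4), tool = factor(c("A", "B", "C")))
    dat$grade <- 3 + 2 * as.numeric(dat$text) +                        # texts of increasing difficulty
                 c(A = 0, B = 0.5, C = -0.3)[as.character(dat$tool)] + # systematic tool differences
                 rnorm(nrow(dat), sd = 0.4)                            # measurement 'noise'

    ## if these 3 tools are the ONLY ones of interest, treat 'tool' as fixed:
    summary(lm(grade ~ text + tool, data = dat))

    ## if they are a sample of the many tools 'out there', treat 'tool' as random
    ## (only 3 tools here, so this is just to illustrate the syntax):
    library(lme4)
    summary(lmer(grade ~ 1 + (1 | text) + (1 | tool), data = dat))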
 
 Q2
 
 Again, 'back then' we went to the library (or looked around at home) for books of
 different difficulty, so that we could see if the measurements agreed well with
 what experts thought the difficulty of each book was. It was not like
 the study in Q16, where 500 meant 500, or 1500 meant 1500, and all would agree on
 this 'gold standard'.
Unlike in physical measurements, this issue of
an independent 'gold standard' is a challenging one in psychometric measurement.
It's not like you can order 'a grade 6' book from the US
National Institute of Standards and Technology (NIST)
the way you can order a substance with a known cholesterol concentration, or a 1 kg weight.
 
 One strategy we used more recently was to look online for lists of 
 books recommended by
 teachers for children in different grades, and 
 our job was made easier if we could find 
 the texts themselves online,  and simply cut and paste 
 samples of them into MS Word or an  online 'readability' 
 tool. Some years, 
 we used the full range from 'the Cat in the Hat' to 
 university texts, 
 and plotted the measurements against the grade or age level.
 
 Q3
 
 [ 'm-s' is short for 'math-stat' ].
 
The point of asking you to derive the link
is to emphasize that the
SEE and R are not ENTIRELY
separate. Yes, the SEE is more limited, because it does
not tell you how much variation there is from
person to person (or object to object). But you
can think of R, or the ICC, as the proportion of the
OBSERVED variance that is 'real' (i.e. due to
genuine person-to-person variation),
and think of the remainder, 1 - R,
as the square of the SEE expressed as a fraction of that observed variance.
 
 (1) The square of the SEE is ONE of the TWO
components of the observed variance. (2) The
genuine between-person variance
is the other. The ICC is (2) as a fraction
of (1)+(2).
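 In symbols, with made-up numbers just to show the arithmetic:

    see            <- 3          # standard error of measurement
    sigma2_within  <- see^2      # (1) the error component of the observed variance
    sigma2_between <- 36         # (2) the genuine between-person variance

    ICC <- sigma2_between / (sigma2_between + sigma2_within)
    ICC        # 36 / 45 = 0.8
    1 - ICC    #  9 / 45 = 0.2  (the SEE^2, as a fraction of the observed variance)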
 
 A good example of the 2 concepts
 in the same
piece is the explanation from the Educational 
Testing Service called 
INTERPRETING YOUR GRE SCORES, contained on page 7
of JH's 'Introduction to Measurement Statistics' Notes
(available under Measurement -- Lecture Notes, etc in
the resources).
 
 Q4
 
 Relationship between test-retest correlation 
and ICC(X). The point here is to see the same
concept from two different perspectives.
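 A small simulation (with made-up variances) shows the two perspectives giving the same number:

    ## simulation: the test-retest correlation estimates the ICC
    set.seed(123)
    n              <- 100000
    sigma2_between <- 4          # genuine person-to-person variance
    sigma2_within  <- 1          # measurement-error variance, so ICC = 4/5 = 0.8

    true_value <- rnorm(n, mean = 100, sd = sqrt(sigma2_between))
    x       <- true_value + rnorm(n, sd = sqrt(sigma2_within))   # first measurement
    x_prime <- true_value + rnorm(n, sd = sqrt(sigma2_within))   # re-measurement

    cor(x, x_prime)              # close to the ICC of 0.8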
 
 Q5
 
 Relationship between correlation(X,X') 
and ICC(X): Some people like this explanation
of the ICC, since it echoes what was said above
about the ICC as the proportion of the variance we observe
in an imperfectly measured characteristic that is 'real'.
Think of a correlation as another way to measure
how strongly an imperfectly measured characteristic
correlates (agrees) with the perfectly measured version.
 
 If you were trying to explain the ICC to a lay person, you
would probably have better success using 'correlation'
than 'variance'. To explain correlation you don't
have to get into as many details as you would have to if
you take the 'variance' route. If you are willing
to cheat a little bit (and tell people that
the SD is like the typical or average absolute deviation),
you might get away with using the concept of an SD,
but the concept of a typical or mean squared
deviation will for sure lose more people.
 
 Q6
 
 Galton's data: 
 
 Have a look at the Family Album that
 respondents used to report
 the family heights.
 
 Who, if anyone, did the measuring? Who
 did the 'reporting'?
 
 Do you know how tall your parents and grandparents
 are (or were)?
 
 Q7
 
 'Increasing Reliability by averaging several 
 measurements'
 
 This is a very topical and 'charged' issue at funding agencies, such as
 the Canadian Institutes of Health Research, where each
 grant application used to be reviewed by 2 primary reviewers, and then
 an average was made of the scores of up to 20 panel members (incl. the 2)
 who had heard and discussed the
 2 reviews, and had also looked through the application themselves.
 
 The new system uses 5 reviewers who do not meet/communicate, and 
 their scores are averaged.
 
 In the old system, if the ICC (for a single rater) was, say, 0.4, what would the ICC be
if we used the average of 2 raters? 3 raters? 4 raters?
 
 You can manipulate the algebra as you wish, but you might also
think of it as follows:
 
 if we average m raters, the true sigma-sq-between is not affected,
but the true sigma-sq-within gets reduced: from
sigma-sq-within / 1 if we use 1 random rater, to sigma-sq-within / 2 if we average 2,
... to sigma-sq-within / m if we average the scores of m raters.
 
 So with an average of m raters, the observed variance of these averages is now
 sigma-sq-between + sigma-sq-within / m.
 The fraction of this that is signal is
 
 sigma-sq-between / [ sigma-sq-between +  sigma-sq-within / m]
 
 So what the question is asking is: what if we use N*m raters?
 
 so we have fractions
 
 sigma-sq-between / [ sigma-sq-between +  sigma-sq-within / N*m]
 
 and
 
 sigma-sq-between / [ sigma-sq-between +  sigma-sq-within / 1*m]
 
 The algebra is a matter of manipulating this ratio, so as to remove the
'm' that is there to start with, and to end up with the basic ICC[1]
(i.e. what if m = 1) and the scaling factor N.
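 Numerically, for the 2-, 3-, and 4-rater question above (a minimal sketch of the same arithmetic):

    ## ICC of the AVERAGE of m raters, starting from a single-rater ICC
    icc_of_mean <- function(icc1, m) icc1 / (icc1 + (1 - icc1) / m)

    icc_of_mean(0.4, m = 1:4)
    ## 0.400 0.571 0.667 0.727   (average of 1, 2, 3, 4 raters)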
 
 Another example: if a 3-hour GRE exam, done with paper and pencil, has
a reliability of 0.9, what reliability would a 6-hour or 12-hour exam have?
Taking 3 hours as the unit of effort, it is
 
 0.9/ (0.9 + 0.1  ) for  3 hours
 0.9/ (0.9 + 0.1/2) for  6 hours
 0.9/ (0.9 + 0.1/4) for 12 hours
 etc.
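 The same arithmetic in R:

    ## reliability of 3-, 6-, and 12-hour versions, when 3 hours gives 0.9
    m <- c(3, 6, 12) / 3         # multiples of the 3-hour 'unit of effort'
    0.9 / (0.9 + 0.1 / m)
    ## 0.900  0.947  0.973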
 
 Geoff Norman was part of a group who developed 
McMaster's 'Multiple Mini Interview' system.
McMaster, and many other schools since then, have
abandoned the traditional interview
and use this instead:
 
 see Med Educ. 2004 Mar;38(3):314-26.
An admissions OSCE: the multiple mini-interview.
Eva KW, Rosenfeld J, Reiter HI, Norman GR.
 
 and subsequent publications 
that evaluated its measurement properties.
 
 Q8
 
 Just because (random) measurement errors tend to cancel out in
 averages doesn't mean that errors in measurement can be ignored. For example,
 how comfortable would you be in measuring how much physical activity JH does  
 by  having him wear a 'step-counter' for a randomly selected week of the
 year, and using that 1-week
 measurement as an 'x' in a multiple or logistic or Cox
 regression? See slides 7 and 8 from JH's talk
 "Scientific reasoning, statistical thinking, measurement issues, and use of
 graphics: examples from research on children",
 given at the Royal Children's Hospital in Melbourne earlier this year.
 pdf
 
 Some of the terminology will be new to you, and so (as you will discover
when you run the Q8 simulations of how well you can estimate the conversion factors between
 degrees F and degrees C) will some of the consequences of measurement error.
 The "animation (in R) of effects of errors in X on slope of Y on X" might be of interest,
 as might the java applet accompanying "Random measurement error 
 and regression dilution bias".
 
 These consequences are rarely touched on, let alone emphasized, in theoretical courses on regression, where all
 'x' values are assumed to be measured without error! Welcome to the REAL world.
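 A quick simulation of the kind of thing Q8 asks you to do (the numbers and the error SD below
 are made up, and this is not the supplied animation or applet): error in 'x' attenuates, or
 'dilutes', the slope of y on x.

    ## regression dilution: error in x attenuates the slope of y on x
    set.seed(2017)
    n      <- 5000
    x_true <- runif(n, 0, 40)                        # true temperatures, degrees C
    y      <- 32 + 1.8 * x_true + rnorm(n, sd = 1)   # degrees F (true slope 1.8)

    x_obs  <- x_true + rnorm(n, sd = 5)              # C measured with error

    coef(lm(y ~ x_true))["x_true"]   # close to 1.8
    coef(lm(y ~ x_obs))["x_obs"]     # noticeably smaller: the slope is 'diluted'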
 
 For this exercise, and the topics it addresses, the most relevant portions of
 the 'surveys' resources are

              'Measurement: Reliability and Validity', and

              'Effects of Measurement Error'.
 
 [last year: Computing issues that may arise in Q14: dates are a pain, even in R. If you get stuck,
 use some of the R code supplied to compute week and day of week. Incidentally, whereas the exercise
 makes reference to 104 weeks, there are a few weeks with some missing data, so it is best to
 keep them out of the calculations for now (in practice JH would try to use all the data, but the
 imbalanced data have a messier EMS structure that -- for now -- distracts us from the main point).]
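 If you do get stuck before finding the supplied code, a minimal base-R sketch (with made-up
 dates) for extracting day of week and week number is:

    ## made-up example dates
    dates <- as.Date(c("2015-01-05", "2015-01-06", "2015-12-31"))

    data.frame(date          = dates,
               day_of_week   = weekdays(dates),
               week_of_year  = as.integer(format(dates, "%U")),          # Sunday-start week number
               week_in_study = as.integer(dates - min(dates)) %/% 7 + 1)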
 
 Q9
 
 The point is to 'smooth' the decay curve.
But (as the hint says) its form should
not be a big surprise: it was the
subject of one of the earlier math-stat questions.
 
 Q18
 
 JH did some pilot testing of the variability to expect
in subjects of your age. He has wondered why
100 per day were used in the study of the effects
of sleep deprivation.
 
 If you want to achieve shorter reaction times, JH's pilot testers
tell him that it's better to use a computer than a phone or
tablet, and also to use the space bar rather than the return
or enter key or the mouse. (The 'Hints' tell you that you can use any key, rather than click with the mouse
or trackpad.)