In medicine it is often flippantly remarked that eliciting clinical signs and interpreting biomarkers is more of an art than a science.

Imagine your last patient of the day comes for a routine check-up. His blood pressure is just above the cut-off to be termed high (hypertensive). But it is late in the day and the patient was clearly anxious about the appointment. And you used a cuff that was too small for his large arms. Would you formally diagnose him with hypertension?

In the wake of Covid-19 screening, we all know that not all measurements should be taken at face value. If we believed all wrist and forehead temperature measurements, half of us would be classified as corpses. It is inherently understood that a measurement does not necessarily reflect the true biological state.

But in a scientific study environment, a measurement is a measurement. And if measurements show a correlation, that correlation becomes a paper, the paper becomes a practice guideline, and the practice guideline determines the doctor’s decisions.

So then, how much can small measurement errors really contribute to wrong conclusions and poor medical practice?

In this article, as the first part of a series on data quality, we simulate a simple dataset to answer this question.

The blood pressure example

Blood pressure is one of the most routinely measured biomarkers in medicine. While important in acute medicine and critical care, most of its value in general practice lies in screening and monitoring for cardiovascular disease. Direct targeting of blood pressure control could avert over a hundred million deaths globally over the next 30 years (Pickersgill et al., 2022). A blood pressure measurement typically consists of two numbers: a systolic and a diastolic pressure. For simplicity's sake, this exercise uses only diastolic blood pressure (DBP) measurements.

Blood pressure measurement is a textbook example of how random measurement error may creep in. Because of its routine nature, less care is often taken to produce highly accurate readings. Also, many different factors can introduce error. Some are patient-related (e.g. recent food consumption, caffeine or other substance use, movement), some device-related (e.g. poor device calibration, non-validated or expired equipment) and others are procedure-related (e.g. patient position, cuff position, cuff size, rest period before taking the measurement, etc.).

Our assumptions and definitions

In order to generate our dataset of 10 000 hypothetical employees, we use the following assumptions and definitions:

[Image: the simulation assumptions and definitions, alongside the distributions of the true and measured baseline DBP]

For this exercise, we start from the true baseline DBP, derived from the body mass index (BMI) of a population with a normally distributed height and weight.

Working off the findings of several studies, a standard deviation of 5 mm Hg is used to add noise to this true DBP (Kallioinen et al., 2017). This means that 95% of the DBP measurements fall within 10 mm Hg (two standard deviations) of the actual DBP at the time of measurement. We also assume that this noise is random: unrelated to the true DBP values, and averaging 0 mm Hg across all measurements. The graph to the right, above, maps out the distributions of the true DBP readings and the measured DBP readings at baseline.
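The data-generating step above can be sketched in Python. The height and weight parameters and the linear BMI-to-DBP link below are illustrative assumptions, not values taken from the article; only the noise standard deviation of 5 mm Hg comes from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Illustrative population: height (m) and weight (kg) from normal distributions
height = rng.normal(1.70, 0.10, n)
weight = rng.normal(75.0, 12.0, n)
bmi = weight / height**2

# Hypothetical linear link from BMI to true baseline DBP (coefficients are assumptions)
true_dbp = 55.0 + 1.2 * bmi

# Measured DBP = true DBP + random noise with SD 5 mm Hg (Kallioinen et al., 2017)
noise_baseline = rng.normal(0.0, 5.0, n)
measured_dbp = true_dbp + noise_baseline
```

Because the noise is independent of the true values, the measured distribution is simply a widened version of the true one, which is exactly what the baseline plot shows.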

[Image: the intervention's reduction in true DBP as a function of baseline DBP]

Next we simulate an intervention that only targets individuals with a blood pressure above 90 mm Hg. A participant with a DBP ≤ 90 mm Hg is considered normotensive and a participant with a DBP > 90 mm Hg is considered hypertensive.

Finally, we assume that our blood pressure reduction intervention (e.g. exercise and diet modification) reduces the true DBP according to the function depicted to the left, above (Kelley et al., 2001).
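Selection on the measured value and the intervention's effect on the true value can be sketched as follows. The effect function is a hypothetical linear stand-in for the Kelley et al. (2001) curve, and the baseline distribution parameters are illustrative, not the article's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def intervention_effect(true_dbp):
    # Hypothetical stand-in for the Kelley et al. (2001) effect curve:
    # the true reduction grows with the true baseline DBP (illustrative slope)
    return -np.clip(0.3 * (true_dbp - 80.0), 0.0, None)

true_dbp = rng.normal(85.0, 8.0, n)
measured_dbp = true_dbp + rng.normal(0.0, 5.0, n)

# Inclusion is decided on the *measured* value, not the true one
selected = measured_dbp > 90.0

# The intervention changes the true DBP; a fresh noise draw is added on remeasurement
true_post = np.where(selected, true_dbp + intervention_effect(true_dbp), true_dbp)
measured_post = true_post + rng.normal(0.0, 5.0, n)
```

The key design point is that `selected` depends on `measured_dbp` while the effect applies to `true_dbp`; that mismatch drives all three consequences described below.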

The consequences

We describe three major consequences:

The first consequence

Some participants would be misclassified at baseline, which affects whether they are included in or excluded from the study. A truly normotensive patient (true DBP ≤ 90 mm Hg) could be classified as hypertensive (measured DBP > 90 mm Hg) and vice versa. In our example, 13% of normotensive participants would be incorrectly targeted as hypertensive, and 24% of hypertensive patients would be incorrectly excluded because their measurement labelled them as normotensive, as can be seen in the table below:

                          True DBP ≤ 90 mm Hg     True DBP > 90 mm Hg
Measured DBP ≤ 90 mm Hg   True negative: 86.9%    False negative: 23.7%
Measured DBP > 90 mm Hg   False positive: 13.1%   True positive: 76.3%

(Percentages are per column: each column sums to 100% of the truly normotensive or truly hypertensive participants.)
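Misclassification rates like these can be computed by cross-tabulating the true and measured classifications and conditioning on the true status. The simulated data below is an illustrative stand-in with assumed distribution parameters, not the article's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_dbp = rng.normal(85.0, 8.0, n)
measured_dbp = true_dbp + rng.normal(0.0, 5.0, n)

truly_hyper = true_dbp > 90.0        # true status
labelled_hyper = measured_dbp > 90.0  # status assigned by the measurement

# Rates conditioned on the true status, as in the table
true_positive_rate = (truly_hyper & labelled_hyper).sum() / truly_hyper.sum()
false_negative_rate = (truly_hyper & ~labelled_hyper).sum() / truly_hyper.sum()
false_positive_rate = (~truly_hyper & labelled_hyper).sum() / (~truly_hyper).sum()
true_negative_rate = (~truly_hyper & ~labelled_hyper).sum() / (~truly_hyper).sum()
```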

The second consequence

A scatterplot showing change in DBP on the y-axis and DBP at baseline on the x-axis. Coral-coloured points show measured DBP values, while teal-coloured points show true DBP values. Trend lines for both data sets show a negative correlation: as baseline DBP increases, the change from baseline to post-intervention becomes more negative, i.e. the intervention becomes more effective at higher baseline DBP values. The trend line for the measured values is steeper, and the points are more dispersed vertically: at a baseline DBP of 90 mm Hg, the measured changes span roughly -20 to +10 mm Hg, whereas the true values cluster tightly around their trend line, spanning roughly 0 to -5 mm Hg.

Due to the increased variance in measurements both at baseline and post-intervention caused by the added random noise, the distribution of the change from baseline to post-intervention has a wider spread.

It appears as if the intervention leads to widely different results from person to person (the coral-coloured dots on the plot above), while the true intervention effect (the teal-coloured dots) is fairly similar for people with the same baseline DBP.

The third consequence

Arguably the most important consequence is that there is a clear bias in the estimated average intervention effect (the solid line in the scatter plot above): the error in measurement has made the intervention seem more effective at decreasing blood pressure. This is despite the suboptimal matching of intervention exposure to the people who stand to benefit from it.

Even though the random noise (e) is balanced around zero, a very high baseline DBP measurement is likely the result of a highly positive noise term (e_baseline). On the scatter plot below, the values with a higher measured DBP (orange-red) are predominantly to the far right of the graph. Values which had a measured DBP below 90 are grey and transparent, as they would be excluded from the intervention.

[Image: scatter plot of change in DBP against measured baseline DBP, points coloured by measured baseline DBP; measurements ≤ 90 mm Hg are greyed out, as they would be excluded from the intervention]

When we measure the DBP again after the intervention, the phenomenon of regression to the mean comes into play: after randomly drawing a very high number, the next draw (e_post-intervention) is much more likely to be smaller than to be just as large or larger. This is because the error value is drawn randomly from a normal distribution, where values near the mean (i.e. near 0) are most likely.

What this means is that there is likely a net negative difference in post-intervention versus baseline error (e_post-intervention – e_baseline) for people with a high DBP measurement at baseline. Hence the red dots are skewed towards the bottom half of the scatter plot. For example, if someone had a true baseline DBP of 100 mm Hg, but had +15 mm Hg error, and then post-intervention had a true DBP of 95 mm Hg but negligible measurement error, the intervention would appear to have a -20 mm Hg effect, rather than the true effect of -5 mm Hg; the other -15 mm Hg is from the difference in error terms.
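The worked example decomposes directly into the true change plus the change in the error term, which a few lines of arithmetic make explicit (all numbers here are the ones from the example above):

```python
# Decomposing the worked example: measured change = true change + change in error
true_baseline, e_baseline = 100.0, 15.0   # true DBP and measurement error at baseline
true_post, e_post = 95.0, 0.0             # true DBP and error post-intervention

measured_change = (true_post + e_post) - (true_baseline + e_baseline)   # -20 mm Hg
true_change = true_post - true_baseline                                 # -5 mm Hg
error_change = e_post - e_baseline                                      # -15 mm Hg
```

Because selection keeps only people with large positive e_baseline, error_change is negative on average for the selected group, which is exactly the bias in the estimated effect.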

In contrast to this example, people with a very low measured baseline DBP, who are likely to have a very negative error term, are not included in the study (the greyed-out points on the graph), and therefore the positive change in error does not get factored into the estimation of the effect size when using the measured values.

What does the measurement error mean for the study conclusions?

In order to assess the impact of the intervention, we must understand how blood pressure relates to health outcomes. It turns out that there is a strong, non-linear link between DBP and the risk of dying from cardiovascular disease (CVD), as graphed below (Flint et al., 2019). Lowering a person’s DBP by 10 mm Hg cuts their 10-year CVD mortality risk in half!
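If the risk roughly halves for every 10 mm Hg of DBP reduction, the implied relative risk after a reduction of Δ mm Hg can be sketched as 0.5^(Δ/10). This one-liner is a simplification of the Flint et al. (2019) relationship, not their fitted model:

```python
def relative_risk(dbp_reduction_mm_hg):
    # Simplifying assumption: 10-year CVD mortality risk halves
    # per 10 mm Hg of DBP reduction (after Flint et al., 2019)
    return 0.5 ** (dbp_reduction_mm_hg / 10.0)
```

Under this assumption, a 10 mm Hg reduction gives a relative risk of 0.5, and a 20 mm Hg reduction gives 0.25, which illustrates how strongly an overestimated DBP change inflates the estimated health impact.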

[Image: graph of 10-year CVD mortality risk as a function of DBP, and a table comparing the Optimal, Actual and Estimated impact of the intervention]

The table above shows how the impact indicators of the intervention are affected by the data imprecision. The numbers in the “Optimal” column are what we could achieve if there were no measurement errors at all. The “Actual” column reflects the actual impact made by the intervention, using the true DBP values for the group that participated in the intervention (considering the flawed participant selection). The “Estimated” column shows what we would mistakenly estimate the impact to be, based on the imprecisely measured DBP values of the intervention participants.

The perceived relative risk reduction (31.3%) is just over 1.5 times greater than the actual risk reduction (20.4%), while the perceived number of deaths averted (39.6%) is just short of double the actual number of deaths averted (20.8%). Because these indicators of impact and efficiency apply to a 10-year risk period, the discrepancies between the model estimates and the true health gains would likely only be discovered after many years of continued effort and investment.

How to address measurement error

Considering the disproportionate impact that relatively small measurement errors can have on study outcomes, and subsequently on medical practice, it is key to prevent, identify and correct imprecise measurements.

Prevention includes adequate education and awareness among patients, health care practitioners, technicians, and investigators. Training, refreshing and monitoring measurement technique is vital for reducing random error. Additionally, repeated measures per visit, and the inclusion of multiple time points before, during and after the intervention, will not only provide estimates closer to the truth, but can also provide information about measurement accuracy.
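The value of repeated measures per visit follows from a standard result: averaging k independent readings shrinks the error standard deviation by a factor of √k. A quick simulation illustrates this; the fixed true DBP and the choice of k = 3 readings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
k = 3             # readings per visit (an illustrative choice)
n_visits = 100_000
true_dbp = 88.0   # a fixed true value, so all spread comes from measurement error

# k independent readings per visit, each with error SD 5 mm Hg
readings = true_dbp + rng.normal(0.0, 5.0, (n_visits, k))

single_reading_sd = readings[:, 0].std()      # ~5 mm Hg
visit_mean_sd = readings.mean(axis=1).std()   # ~5 / sqrt(3), about 2.9 mm Hg
```

Averaging three readings thus cuts the effective noise from 5 mm Hg to roughly 2.9 mm Hg, substantially reducing both the misclassification and the regression-to-the-mean effects described above.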

After the data is collected, identification and quantification of measurement errors should be attempted. The estimated within- and between-individual variation can be compared against benchmarks obtained under ideal conditions (perfect adherence to data collection protocols, use of recently calibrated equipment, etc.).

To correct for measurement error, we can fit special statistical models that take into account the overdispersion of DBP measurements due to measurement errors (Shepard & Finison, 1983).

Take home message

Most biostatisticians will spontaneously think of so-called regression dilution when talking about random measurement error. This phenomenon, also known as attenuation bias, usually leads to an underestimation of disease risk. Our example shows that random measurement error can also inflate effects and associations.

Measurement imprecision can be seen throughout medical science. Often there is no time to take off shoes before a patient’s height is measured against the tattered paper ruler on the door. Weight may be measured with a variable amount of clothing still on. And sometimes a patient may have had a cup of sugary tea before a blood glucose measurement, while another patient didn’t eat all day because they’ve been waiting in the clinic queue. Other examples of imprecise measures include self-reported diet, alcohol intake, cigarettes smoked per day, physical activity, stress levels, mood, pain, and hours of sleep.

While some measures are closer to a ground truth than others, there is no perfect measurement. By recognising the potential ramifications of measurement error and incorporating mitigation strategies into our methodology, the degree of imprecision can be estimated and taken into account during the statistical analysis.

A technical report for this work, including additional references to scientific sources and the computer code behind the simulation, is also available here.

References

Flint AC, Conell C, Ren X, et al. Effect of Systolic and Diastolic Blood Pressure on Cardiovascular Outcomes. N Engl J Med. 2019;381(3):243-251. doi:10.1056/NEJMoa1803180

Kallioinen N, Hill A, Horswill MS, Ward HE, Watson MO. Sources of inaccuracy in the measurement of adult patients’ resting blood pressure in clinical settings: a systematic review. J Hypertens. 2017;35(3):421-441. doi:10.1097/HJH.0000000000001197

Kelley GA, Kelley KA, Tran ZV. Aerobic exercise and resting blood pressure: a meta-analytic review of randomized, controlled trials. Prev Cardiol. 2001 Spring;4(2):73-80. doi: 10.1111/j.1520-037x.2001.00529.x. PMID: 11828203; PMCID: PMC2094526.

Lewington S, Clarke R, Qizilbash N, Peto R, Collins R; Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies [published correction appears in Lancet. 2003 Mar 22;361(9362):1060]. Lancet. 2002;360(9349):1903-1913. doi:10.1016/s0140-6736(02)11911-8

Pickersgill, SJ, Msemburi, WT, Cobb L. et al. Modeling global 80-80-80 blood pressure targets and cardiovascular outcomes. Nat Med (2022). doi:10.1038/s41591-022-01890-4

Shepard DS, Finison LJ. Blood pressure reductions: correcting for regression to the mean. Prev Med. 1983;12(2):304-317. doi:10.1016/0091-7435(83)90239-6