What are reliability and validity? Reliability and validity refer, respectively, to the dependability and the accuracy of a measurement.
For example, if your child has a fever, you want a reliable thermometer.
Suppose you purchased a new thermometer and took your child’s temperature three times at five-minute intervals, and the thermometer gave readings of 100, 105, and 102 degrees Fahrenheit. You would conclude that the thermometer was unreliable.
The C&P exam (Department of Veterans Affairs compensation and pension examination) and the VBA (Veterans Benefits Administration) rating (disability evaluation rating or “percentage”) are both measurements. Reliability refers to the dependability, and in particular the consistency, of a measurement procedure.
There are different types of reliability, as well as various mathematical procedures, such as a reliability coefficient, that quantify the extent to which a measurement method is reliable. The different types of reliability vary in importance depending on the nature and application of the measurement method. For example, when evaluating blood pressure measurement methods, test-retest reliability is very important.
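To make the idea of a reliability coefficient concrete, here is a small sketch in Python. A test-retest reliability coefficient is commonly computed as the Pearson correlation between two measurement occasions; the blood pressure numbers below are invented for illustration only, not data from any actual study.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Hypothetical systolic blood pressure for five patients,
# measured twice, one week apart.
time1 = [120, 135, 142, 118, 150]
time2 = [122, 133, 145, 117, 152]

r = pearson_r(time1, time2)
print(f"test-retest reliability coefficient: r = {r:.2f}")  # r = 0.99
```

A coefficient near 1.0 means the two occasions rank and space the patients almost identically, i.e., the measurement is highly consistent over time.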
Inter-rater reliability (interrater reliability) is “[t]he extent to which two independent parties, each using the same tool or examining the same data, arrive at matching conclusions.”
When discussing compensation and pension exams, inter-rater reliability refers to the consistency of C&P psychologists’ and psychiatrists’ exam conclusions across the United States.
For example, imagine a veteran who is moving from California to Kentucky. Imagine further that our hypothetical veteran has the option to have his Initial PTSD C&P exam with an MSLA contract psychologist in San Diego, or at the Louisville VA Medical Center. If the psychologists at both locations achieve highly consistent results, i.e., excellent interrater reliability, then, all other things being equal, it should not matter where this hypothetical veteran has his C&P exam.
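Inter-rater agreement of the kind just described is often quantified with Cohen’s kappa, which corrects raw agreement for agreement expected by chance. The sketch below uses invented diagnostic ratings for ten hypothetical veterans; the kappa formula is standard, but the data are purely illustrative.

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical ratings."""
    n = len(r1)
    categories = set(r1) | set(r2)
    # Proportion of cases on which the raters actually agree.
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Proportion of agreement expected by chance, from each rater's base rates.
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

# 1 = PTSD diagnosed, 0 = not diagnosed, for ten hypothetical veterans
rater_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # kappa = 0.78
```

Here the raters agree on 9 of 10 cases (90% raw agreement), but because chance alone would produce substantial agreement, kappa is lower (about 0.78), which is generally interpreted as substantial, though not perfect, inter-rater reliability.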
Validity is the extent to which a measurement instrument or procedure truly measures what it purports to measure.
Going back to the thermometer example, imagine you purchase a new forehead (temporal artery) thermometer for taking kids’ temperatures at home, and you read that it “has undergone thorough testing by an independent laboratory and was found to have a 0.99 test-retest reliability coefficient.” Sounds like you can rely on this thermometer to provide consistent readings! But then you see on the evening news that authorities have ordered a recall of the thermometers because subsequent research found that the thermometers actually measured forehead moisture, not temporal artery temperature.
Thus, even though the forehead thermometer was reliable, its measurements were not valid. The thermometer did not measure what its manufacturer said it measured.
Examples and related concepts
The reliability and validity of psychological testing are generally quite good.
Structured diagnostic interviews exhibit significantly better inter-rater reliability and construct validity (diagnostic accuracy) than unstructured clinical interviews.
Clinical utility is a related but distinct concept. A reliable and valid psychological test, assessment procedure, or diagnosis has clinical utility when it is relatively easy to use and provides information that helps clinicians better understand a person; communicate essential information to patients, families, and other clinicians; identify the etiology of a patient’s current difficulties; predict treatment response; and formulate treatment plans.
Multi-method assessment in clinical psychology refers to using a variety of assessment methods, such as:
- unstructured clinical interviews;
- structured diagnostic interviews;
- self-report instruments such as questionnaires and psychological tests;
- performance-based measures, e.g., remembering a list of words;
- collateral interviews;
- record reviews; and
- physiological measures, e.g., changes in heart rate when recounting a traumatic stressor.
 Oxғᴏʀᴅ Dɪᴄᴛɪᴏɴᴀʀɪᴇs (“reliability n. – 1.1 The degree to which the result of a measurement, calculation, or specification can be depended on to be accurate. Example: ‘To ensure inter-rater reliability, evaluators must agree on a student’s overall writing strength assessment.’”) https://en.oxforddictionaries.com/definition/reliability
 Aᴍᴇʀɪᴄᴀɴ Hᴇʀɪᴛᴀɢᴇ Dɪᴄᴛɪᴏɴᴀʀʏ ᴏғ ᴛʜᴇ Eɴɢʟɪsʜ Lᴀɴɢᴜᴀɢᴇ (5th ed., 2016) (“2. reliability n. – Yielding the same or compatible results in different clinical experiments or statistical trials.”) http://www.thefreedictionary.com/reliability
 Oxғᴏʀᴅ Eɴɢʟɪsʜ Dɪᴄᴛɪᴏɴᴀʀʏ (“reliability n. – 2. Statistics. The degree to which repeated measurements of the same subject under identical conditions yield consistent results.” ) http://www.oed.com/view/Entry/161904
 Brian E. Perron & David F. Gillespie, Kᴇʏ Cᴏɴᴄᴇᴘᴛs ɪɴ Mᴇᴀsᴜʀᴇᴍᴇɴᴛ 4 (2015). (“Reliability is the degree to which measurements are free from error, making reliability inversely related to error. The Standards [for Educational and Psychological Testing] defines reliability as ‘the consistency of the scores across instances of the testing procedure’. Reliability goes hand-in-hand with validity … reliability is a condition for validity.”) [citations omitted; emphasis in original]
 Mᴏsʙʏ’s Mᴇᴅɪᴄᴀʟ Dɪᴄᴛɪᴏɴᴀʀʏ (Kindle Edition), Kindle locations 140642-140645 (“reliability – the extent to which a test measurement or device produces the same results with different investigators, observers, or administration of the test over time. If repeated use of the same measurement tool on the same sample produces the same consistent results, the measurement is considered reliable.”)
 William M. Trochim, Types of Reliability, Tʜᴇ Rᴇsᴇᴀʀᴄʜ Mᴇᴛʜᴏᴅs Kɴᴏᴡʟᴇᴅɢᴇ Bᴀsᴇ, (2nd ed., 2006) http://www.socialresearchmethods.net/kb/measval.php
 Noreen M. Webb, Richard J. Shavelson, & Edward H. Haertel, Reliability Coefficients and Generalizability Theory, in Psʏᴄʜᴏᴍᴇᴛʀɪᴄs, at 81 (Hᴀɴᴅʙᴏᴏᴋ ᴏғ Sᴛᴀᴛɪsᴛɪᴄs, Vᴏʟ. 26, C. Radhakrishna Rao & S. Sinharay eds., 2006).
 Pablo E. Pérgola et al., Reliability and Validity of Blood Pressure Measurement in the Secondary Prevention of Small Subcortical Strokes Study, 12 Bʟᴏᴏᴅ Pʀᴇss. Mᴏɴɪᴛ. 1, 7 (2007) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3970705/
 Kevin A. Hallgren, Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial, 8 Tᴜᴛᴏʀ. Qᴜᴀɴᴛ. Mᴇᴛʜᴏᴅs Psʏᴄʜᴏʟ. 23, 23-24 (2012) (“The assessment of inter-rater reliability (IRR, also called inter-rater agreement) is often necessary for research designs where data are collected through ratings provided by trained or untrained coders. … measurement error may be introduced by imprecision, inaccuracy, or poor scaling of the items within an instrument (i.e., issues of internal consistency); instability of the measuring instrument in measuring the same subject over time (i.e., issues of test-retest reliability); and instability of the measuring instrument when measurements are made between coders (i.e., issues of IRR [inter-rater reliability]).”) [emphasis added]
 Andrew F. Hayes & Klaus Krippendorff, Answering the Call for a Standard Reliability Measure for Coding Data, 1 Cᴏᴍᴍᴜɴ. Mᴇᴛʜᴏᴅs Mᴇᴀs. 77 (2007).
 Medical Dictionary (2009) (“… In an analysis of anxiety, for example, a graduated scale may rate research subjects as ‘very anxious, ‘ ‘somewhat anxious,’ ‘mildly anxious,’ or ‘not at all anxious,’ … If the study is carried out and coded by more than one psychologist, the coders may not agree on the implementation of the graduated scale: some may interview a patient and find him or her ‘somewhat’ anxious; another might assess the patient as being ‘very anxious.’ The congruence in the application of the rating scale by more than one psychologist constitutes its interrater reliability.”) [HTML].
 Gregory J. Meyer et al., Psychological Testing and Psychological Assessment – A Review of Evidence and Issues, 56 Aᴍ. Psʏᴄʜᴏʟ. 128, 155 (2001). (“…the validity of psychological tests is comparable to the validity of medical tests … distinct assessment methods provide unique sources of data and … sole reliance on an [unstructured] clinical interview often leads to an incomplete understanding of patients. On the basis of a large array of evidence, we have argued that optimal knowledge in clinical practice (as in research) is obtained from the sophisticated integration of information derived from a multimethod assessment battery.”) [emphasis added]
 Kevin A. Hallgren, Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial, 8 Tᴜᴛᴏʀ. Qᴜᴀɴᴛ. Mᴇᴛʜᴏᴅs Psʏᴄʜᴏʟ. 23, 23 (2012); see also Andrew F. Hayes & Klaus Krippendorff, Answering the Call for a Standard Reliability Measure for Coding Data, 1 Cᴏᴍᴍᴜɴ. Mᴇᴛʜᴏᴅs Mᴇᴀs. 77 (2007); and William M. Trochim, Reliability & Validity, Tʜᴇ Rᴇsᴇᴀʀᴄʜ Mᴇᴛʜᴏᴅs Kɴᴏᴡʟᴇᴅɢᴇ Bᴀsᴇ (2nd ed.) (last updated 20 October 2006).
 William M. Trochim, Construct Validity, Tʜᴇ Rᴇsᴇᴀʀᴄʜ Mᴇᴛʜᴏᴅs Kɴᴏᴡʟᴇᴅɢᴇ Bᴀsᴇ (2nd ed.) (last updated 20 October 2006), http://www.socialresearchmethods.net/kb/constval.php
 Regan W. Stewart et al., A Decision-Tree Approach to the Assessment of Posttraumatic Stress Disorder: Engineering Empirically Rigorous and Ecologically Valid Assessment Measures, 13 Psʏᴄʜᴏʟ. Sᴇʀᴠɪᴄᴇs 1, 2 (2016) (“… clinicians often fail to probe important events during the course of unstructured clinical encounters. Interrater reliability has also been shown to be higher for structured interviews than for unstructured interviews.”) (citations omitted).
 Douglas B. Samuel et al., Convergent and Incremental Predictive Validity of Clinician, Self-Report, and Structured Interview Diagnoses for Personality Disorders Over 5 Years, 81 J. Cᴏɴsᴜʟᴛ. Cʟɪɴ. Psʏᴄʜᴏʟ. 650, 657 (2013) (“… the use of semistructured diagnostic interviews and/or self-report questionnaires would improve the validity of PD [personality disorder] diagnoses in clinical practice.”)
 Paul R. Miller et al., Inpatient diagnostic assessments: 1. Accuracy of structured vs. unstructured interviews, 105 Psʏᴄʜɪᴀᴛʀʏ Rᴇs. 255 (2001) (“Structured methods were significantly better than the unstructured traditional diagnostic interviews.”)
 Andrea Suppiger et al., Acceptance of Structured Diagnostic Interviews for Mental Disorders in Clinical Practice and Research Settings, 40 Bᴇʜᴀᴠ. Tʜᴇʀ. 272 (2009) (“… structured diagnostic interviews are highly accepted by interviewers and patients in a variety of settings.”)
 Richard Rogers, Hᴀɴᴅʙᴏᴏᴋ ᴏғ Dɪᴀɢɴᴏsᴛɪᴄ ᴀɴᴅ Sᴛʀᴜᴄᴛᴜʀᴇᴅ Iɴᴛᴇʀᴠɪᴇᴡɪɴɢ (2001).
 Richard Rogers, Standardizing DSM-IV Diagnoses: The Clinical Applications of Structured Interviews, 81 J. Pᴇʀs. Assᴇss. 220 (2003) (“Psychological assessments of Axis I and Axis II diagnoses are strongly enhanced by the use of structured interviews.”)
 Robert Kendell & Assen Jablensky, Distinguishing Between the Validity and Utility of Psychiatric Diagnoses, 160 Aᴍ. J. Psʏᴄʜɪᴀᴛʀʏ 4, 9 (2003).
 Stephanie N. Mullins-Sweatt & Thomas Widiger, Clinical Utility and DSM-V, 21 Psʏᴄʜᴏʟ. Assᴇss. 302 (2009).
 Robert F. Bornstein & Christopher J. Hopwood, Introduction to Multimethod Clinical Assessment in Christopher J. Hopwood & Robert F. Bornstein (eds.), Mᴜʟᴛɪᴍᴇᴛʜᴏᴅ Cʟɪɴɪᴄᴀʟ Assᴇssᴍᴇɴᴛ 2 (2014) (“Just as physicians cannot gain complete understanding of a patient’s problem unless they integrate evidence from multiple modalities (e.g., self-report, behavioral, physiological), psychologists cannot gain complete understanding of a patient’s difficulties without evidence from multiple modalities (e.g., self-report, behavioral, performance-based).”).
 Frank Castro, Jasmeet P. Hayes, & Terence M. Keane, Issues in Assessment of PTSD in Military Personnel in Treating PTSD in Mɪʟɪᴛᴀʀʏ Pᴇʀsᴏɴɴᴇʟ: A Cʟɪɴɪᴄᴀʟ Hᴀɴᴅʙᴏᴏᴋ 23, 27 (Bret A. Moore & Walter E. Penk, eds., 2011) (“Accurate assessment of PTSD for military personnel is a multifaceted process that includes orienting the client to the assessment process, creating a comfortable environment in which patients feel at ease to share their experiences, and engaging in a multimethod approach to the assessment of PTSD that includes structured diagnostic interviews, self-report psychological questionnaires, and medical record review…”) [citation omitted]
 The concept of multi-method assessment is analogous to the psychometric research method known as the “multitrait-multimethod matrix.” See: Donald T. Campbell & Donald W. Fiske, Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix, 56 Psʏᴄʜᴏʟ. Bᴜʟʟ. 81 (1959).