Interpreting Test Scores & Key Concepts with Standardized Tests
Carolyn (Carney) Sotto, PhD, CCC-SLP
May 15, 2018

Introduction and Overview

Welcome everybody. I would like to mention that if you have access to a book or handout with the normal curve (normal distribution), you may want to take it out to reference during this presentation, or use it later to make sense of some of the content. I will be referring to it about halfway through the presentation. I do have a mini-version of one in the slides handout.

Evidence-Based Practice

What I will cover in this course is related to evidence-based practice. As a review, ASHA did have a joint coordinating committee on evidence-based practice and does have a position statement about this. One part of the position statement, in terms of this course, is: “For clinical practice to be evidence-based, speech-language pathologists and audiologists must evaluate prevention, screening and diagnostic procedures, protocols and measures to identify maximally informative and cost-effective diagnostic and screening tools, using recognized appraisal criteria described in the evidence-based literature” (ASHA, 2005). What is this position statement really telling us? It is telling us that when we select diagnostic measurements, it is our responsibility to choose those with the most scientific support and psychometric adequacy. Remember that assessment is ongoing, and it is only through our continuous evaluation procedures that we can monitor treatment progress on the goals we have selected as targets.

Just to review, what is evidence-based practice? It is integrating the research with our clinical expertise and it also involves the perspective of the client as a person, not just as a communication disorder. The research evidence is critical in our selection of norm-referenced tests and non-standardized measurements when it comes to our clients. As clinicians, we need to be very careful to choose standardized tests for our clients that meet the psychometric criteria of, for example, reliability, validity and standardization -- which I will be talking more about later. We also have to make sure that we are identifying those individuals with communication disorders.

Diagnostic Process

With that, I will move on to an overview of the diagnostic process. Standardized testing should only be one piece of information used in the evaluation process. Using standardized tests, and the whole diagnostic process, can be very challenging because we are administering various tests and measurements that are sometimes very long. Sometimes our clients or patients do not have the attention span for it.

When I refer to standardized testing, I am mostly referring to norm-referenced tests, and also a bit about criterion-referenced tests toward the end of the course. Part of the diagnostic process, when it comes to standardized testing and norm-referenced testing particularly, is interpreting the test scores and making clinical decisions. Again, this should be based not only on standardized tests, but should also include things like the case history, and authentic assessments such as observation, interviews, language samples and schoolwork samples.

Creating Effective Assessments

When creating effective assessments - which include standardized testing - we want to be sure we are being thorough. We want to use a variety of assessments, such as standardized, authentic, and dynamic types of assessments. We want to make sure our evaluation is valid and reliable. 

We want to individualize the assessment. Just because we feel comfortable with a certain test does not mean that we should continually give that same test to everyone. We administer a particular test because it is going to answer the questions we are asking about our particular client. We do not want to do “cookbook assessments,” which, again, is in line with the previous point about individualizing your assessment.

All of this has been documented in plenty of references that are listed at the end of the course. Again, our purpose is to determine each client's communicative abilities and whether a communication disorder exists. So our job is to gather the relevant information regarding each individual client.

Test Selection

In terms of test selection, we want to make sure to select tests that meet the assessment's purpose. We should ask ourselves a number of different questions leading up to the testing that we are doing: Is the test appropriate for the age level and age range for which it was standardized? Was the test well standardized? Consider the representativeness of the samples; do they mirror the people that we are actually testing at that moment? Consider also the objectivity of administration and scoring, as well as the norms. One of the questions we definitely always want to ask is, “Is this test going to answer the questions I have about my client or patient?”

It is important to be familiar with the test. Sometimes a new version of a test will come out. We have to actually practice with it; we cannot assume that it is going to be exactly the same as the version we have been used to. The way that we administer the test can affect the reliability, so being very familiar with the test is extremely important. I do emphasize this to the students that I teach as well. I know I am speaking here to practicing SLPs, but these are still good things to do and keep in mind, especially when there are updates and revisions to tests.

Know the information you must have in order to make an informed decision. Be a good consumer of tests. Determine if you are qualified to administer the test. There are certain tests that overlap, for example, between the fields of speech pathology and psychology. Therefore, we need to make sure that we are the ones who are supposed to be administering those types of tests.

Psychometric Variables

There are some psychometric variables that are extremely important to consider in test selection. Those are reliability, validity, and standardization. 

Reliability

Reliability refers to the consistency with which a test gives results on repeated administrations, or with different examiners administering the same test measurements. In other words, reliability equals replicability.

The reliability of a clinical measure is often provided by those who have developed the assessment tool. Many of the tests that we use are commercial products that cost quite a bit of money. But part of the publisher's or manufacturer's justification for the price of the test is the cost associated with determining its reliability. As an educated consumer, you need to require that the supplier or developer of a tool or test provides information concerning its reliability. And you, as a speech pathologist, should be able to read the manual and understand and evaluate the information that is provided on reliability. Again, I just want to emphasize that these products cost what they do because it takes so much time for the statisticians, and all of the individuals working on the test, to make sure that it has high reliability, high validity, and sound standardization.

Reliability, or replicability, indicates how stable the results are. This focuses on the issue of stability or consistency of the test scores, and the precision of the test. It indicates the amount of confidence one can actually have in the scores that we get from our clients. Even when we use the same test and give the test under apparently ideal circumstances, we will usually obtain somewhat different results. If several measurements of the same individual's abilities are taken within a short period of time, each obtained score may vary slightly from the “true score” that defines that individual's true ability -- even though the underlying ability remains constant. So, you have to understand that the “true score” -- if you read about it in any of the manuals -- is a theoretical value that we can never observe directly. What we are trying to get is a score that is as close to the true score as possible. That comes into play later when I address standard error of measurement.
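To put that in notation: classical test theory (the standard framework behind these manuals, not anything specific to a single test) models an obtained score X as the true score T plus random error E, and a reliability coefficient can be read as the proportion of observed-score variance that is true-score variance:

```latex
X = T + E, \qquad r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```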

If a test is not very reliable, the clinician needs to be cautious about interpreting the results. Reliability involves a correlation, or reliability coefficient, that provides an estimate of the consistency of the test. As practicing speech pathologists, I know that it is hard to keep up on all these different statistics. But if it is a good test with a good manual, the manual will translate those coefficients into lay language so you can understand what they mean.

There are several different types of reliability, but for the sake of time, I will not be going over all of these in detail. Some of the different types of reliability you will probably see in your manuals are test/retest, alternative form, split-half, and also intra-rater and inter-rater reliability.

I will speak briefly about rater reliability, because I think it is extremely important, especially with regard to having multiple people give tests. We want to make sure that we are getting the same results over multiple test administrations, because sometimes we might question our own results, or we have a difficult case and we want another SLP to check that child or adult. We want to make sure that an individual's performance is not hindered by examiners' influences. You might see intra-rater, or intra-judge, reliability in the manual. That refers to the same person administering or scoring the same test, multiple times. Inter-rater or inter-judge reliability refers to consistency when different people administer or score the same test. Now, when the test is being standardized and developed, that is when they will be assessing all the different rater reliabilities. You should be able to see that coefficient in the manual.
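As a concrete illustration of where those coefficients come from -- this is a hypothetical sketch with made-up numbers, not anything from an actual manual -- most reliability coefficients boil down to a correlation between two sets of scores: the same children tested twice for test/retest reliability, or the same protocols scored by two examiners for inter-rater reliability.

```python
import numpy as np

# Hypothetical raw scores for eight children tested twice with the same instrument
# (test/retest), and the same eight protocols scored independently by two examiners
# (inter-rater). These numbers are invented for illustration only.
first_administration = np.array([42, 37, 55, 48, 61, 39, 50, 44])
second_administration = np.array([44, 35, 57, 47, 63, 40, 49, 46])

rater_a = np.array([42, 37, 55, 48, 61, 39, 50, 44])
rater_b = np.array([41, 38, 54, 49, 60, 40, 51, 43])

# A reliability coefficient is essentially a correlation between the two sets of
# scores: values near 1.0 mean the scores are stable and consistent; values near 0
# mean there is little consistency.
test_retest_r = np.corrcoef(first_administration, second_administration)[0, 1]
inter_rater_r = np.corrcoef(rater_a, rater_b)[0, 1]

print(f"Test/retest reliability (hypothetical data): {test_retest_r:.2f}")
print(f"Inter-rater reliability (hypothetical data): {inter_rater_r:.2f}")
```

(Manuals often report intraclass correlations or kappa statistics for rater agreement rather than a simple Pearson correlation; the sketch above just shows where a coefficient of this kind comes from.)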

I know for many of you this is a review of what you already know or have learned. But hopefully, hearing some of this information will help when you go back to the manuals of some of the tests that you are using all the time. You can go back and read more about this information for a particular test, and hopefully, it will make more sense.

Validity

Validity refers to the extent that a test measures what it claims to measure. In other words, how accurately does the assessment tool measure the characteristics in which we are interested, or how well does the test measure what it is supposed to measure? Validity provides us with the information about what the measurement means, and therefore, what conclusions can be made from the measure. In a sense, then, validity is concerned with the truthfulness of the measure; that is why I have “Validity = Truthfulness” on the slide.

An assessment measure can have strong validity for some purposes and poor validity for other purposes. So when you are looking at a test, you want to read about the validity.  What is the purpose, and what is it intended for? It may be valid for one purpose and invalid for another purpose.

As with reliability, there are several types of validity, and for the sake of time, I am not going to go over all of the different types of validity in detail. But I do want to mention construct validity. I do not know how many of you remember the original Peabody Picture Vocabulary Test (PPVT). I remember giving it when I was an undergraduate back in the 1980s. The original PPVT was designed to be used as an intelligence measure. Over the years, psychologists and other researchers found that it is not a particularly good measure of intelligence, because it depends on only one skill: receptive vocabulary. It was basically torn apart by several different people, because it is not regarded as a valid measure of intelligence. However, it is now viewed as a valid measure of receptive vocabulary. This demonstrates the fact that as clinicians, we are responsible for learning about the issues of validity associated with the measures that we use. I think that this is a really good example because the Peabody Picture Vocabulary Test can be used across the lifespan, and most of us have worked with several different versions of it. When a test is revised, it is always done for a good reason, especially with regard to the norms. But when the PPVT went from the first edition to the second edition, it no longer stated that its intent was to measure intelligence.

You might also read in the manual something about content validity, and criterion-related validity, which has subcategories of concurrent and predictive validity. You can read about those at a later time and learn more about them.

Standardization

Standardized tests are synonymous with “formal tests” or “norm-referenced tests.” Standardization is accomplished so that the test-giver's bias and other extraneous influences do not affect the client's performance, and so that the results from different people giving tests are comparable. It is important to practice giving a test and to be familiar with the instructions. This is where reliability and validity are tied in to standardization. I do know, for example, that on one of the Clinical Evaluation of Language Fundamentals (CELF) subtests, you have to read a sentence, and then the client is supposed to repeat the sentence back. If you are not familiar with the test stimuli and you are stumbling over the words, then you will be affecting the reliability and the standardization of the test.

The examiner's manual for any standardized test will include information about the test purpose, the test development and construction, procedures for administration and scoring, the normative sample group and statistical information resulting from that group, and reliability and validity. You can also read reviews of tests in peer-reviewed journals. I would say that we are not seeing as many of the tests being torn apart in the journals as we did a long time ago. These days, most publishers are not going to spend the money to come up with a new test or revise a test unless they know that the standardization is excellent, that they have a representative normative sample, and that they are doing everything they have to do so that when they market it, it will be really reliable and have a tremendous amount of validity in its construction. That is something to consider. But it is interesting when you can find peer-reviewed journal articles about some of the tests that we use.

I do actually have one final point about standardization. It is a process by which the test or measurement device is administered to a sample of a population with specified physical (e.g., age, gender, etc.) and non-physical (e.g., language exposure, socioeconomic status, geographical) characteristics according to specified explicit procedures. So, there are standard procedures for the administration and scoring of the test.

Data from the test administration can then be analyzed to determine the mean, standard deviation, and other statistical attributes of the obtained scores. In order to interpret the performance of a new subject by referencing these data, we must assume that the person in question is representative of the population, or at least very similar to the population from which the original sample was drawn when the test was standardized. In other words, the person we are testing must be similar to the individuals in the standardization sample. Again, I am making the point that you do not just pick out tests because you feel comfortable with them; you pick out tests because they are going to answer the questions you want to answer, and also because the person standing in front of you is representative of the sample that the test was standardized on. If you do deviate from the way the test was standardized, you must say so in writing in your report. That means that reliability, validity, and standardization go out the window, because you may be repeating directions or doing something else that the standardized procedure does not allow. Some of us just need to get scores for certain reasons, but may have difficulty doing so. So you do need to say in your report, if you are reporting scores, that you have changed your administration from the way the test was standardized.
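To make that arithmetic concrete, here is a small hypothetical sketch -- invented numbers, not from any actual test manual -- of how a normative sample's mean and standard deviation place a new client's raw score on a standard-score scale. The mean-100, SD-15 scale used here is a common convention; an actual test's norm tables and scale may differ.

```python
import statistics

# Hypothetical raw scores from a normative sample of same-age peers.
normative_sample = [31, 28, 35, 40, 33, 29, 37, 36, 32, 30, 34, 38]

mean = statistics.mean(normative_sample)   # center of the distribution
sd = statistics.stdev(normative_sample)    # spread (sample standard deviation)

# Raw score obtained by the client we are testing.
raw_score = 26

# z-score: how many standard deviations this raw score falls from the normative mean.
z = (raw_score - mean) / sd

# Convert to a standard score on a mean-100, SD-15 scale (a common convention).
standard_score = 100 + 15 * z

print(f"Normative mean = {mean:.1f}, SD = {sd:.1f}")
print(f"z-score = {z:.2f}, standard score = {standard_score:.0f}")
```

In everyday practice you would read the standard score directly from the test's norm tables; the point of the sketch is simply to show how those tables are built from the standardization sample's mean and standard deviation.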



Carolyn (Carney) Sotto, PhD, CCC-SLP

Dr. Carney Sotto is an Associate Professor and Undergraduate Director in the Department of Communication Sciences & Disorders at the University of Cincinnati. She teaches graduate and undergraduate students in the areas of assessment, speech sound disorders, phonetics, child language and literacy. She teaches these courses on campus and through online distance learning to students in New York, Israel, rural Ohio, Indiana, and Kentucky. She is a past President of the Ohio Speech-Language-Hearing Association (OSLHA), serving during 2014-15, and was awarded both Fellow (2008) and Honors of the Association (2017) by OSLHA. She is the faculty advisor for the UC NSSLHA student organization chapter, which was recently awarded Chapter of the Year at the 2017 ASHA convention. She is a frequent presenter to pediatricians, teachers, administrators and speech-language pathologists.



