Pick an item, any item

So we're going to play a little game in this post. But first, let me set the stage.

While lurking on a Twitter exchange about race, education, and schools, I saw a great reminder from Bill Fitzgerald scroll by. In effect [and apologies, Bill, if I've summarized incorrectly], it's worth engaging around important topics even when it's clear the discourse isn't going anywhere, because you never know who might be listening, seeking to learn. To revisit an earlier post, I decided not to worry about the manager of the nursery, and to consider instead the walkers just out for a Sunday stroll who may overhear the discussion about the daisies.

I am going to make one claim here in this post and one claim only: when adults look at multiple choice items, we see them differently than students do. Experience, background knowledge, expertise, confirmation bias, 20 years of living - a wide variety of things influence how we read an item. Any teacher who's seen students ace a "hard" item or tank on an "easy" one will know that it's not until students actually take the items that we get a real sense of the item's difficulty.

Item design is a science - and an art. Objectivity plays a large role. BUT:
One cannot determine which item is more difficult simply by reading the questions. One can recognize the name in the second question more readily than that in the first. But saying that the first question is more difficult than the second, simply because the name in the second question is easily recognized, would be to compute the difficulty of the item using an intrinsic characteristic. This method determines the difficulty of the item in a much more subjective manner than that of a p value. (Basic Concepts in Item Design)
This is why there's field testing - why we should field test classroom tests and why states have to field test items from their large scale tests. The test designers (teachers or Pearson writers) do their level best, but we need certain statistics (available only after students have actually taken and responded to an item) to reach conclusions about how high quality an item is. The most common statistic is what's known as a p-value. This value is the percent correct: the higher the p-value, the more students who got an item correct, and the easier the item was for the group of students who took the test. There are guidelines around p-values, but generally speaking, "p-values should range between 0.30 and 0.90." There's a lot more to unpack around item difficulty, but we'll just leave this here.
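
To see the arithmetic behind that statistic, here's a minimal sketch in Python of how p-values fall out of scored responses. The response data below is invented purely for illustration; real field-test files are far larger and come with student and item identifiers.

```python
# Minimal sketch of computing item p-values from scored student responses.
# The data here is made up for illustration only.

# Each inner list is one student's scored responses: 1 = correct, 0 = incorrect.
responses = [
    [1, 0, 1, 1],   # student 1
    [1, 1, 0, 1],   # student 2
    [0, 0, 1, 1],   # student 3
    [1, 0, 0, 1],   # student 4
]

num_students = len(responses)
num_items = len(responses[0])

for item in range(num_items):
    correct = sum(student[item] for student in responses)
    p_value = correct / num_students  # proportion of students who got the item right
    flag = "" if 0.30 <= p_value <= 0.90 else "  <- outside the 0.30-0.90 guideline"
    print(f"Item {item + 1}: p = {p_value:.2f}{flag}")
```

Nothing fancy: the p-value is just the proportion of test takers who answered the item correctly, which is exactly why it can't exist until students have actually sat down and taken the item.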

In the absence of these p-values, our observations about the difficulty of an item are just that - observations. Hambleton and Jirka (2006), two psychometricians/unicorns, reviewed the literature around estimating item difficulty and found studies where highly qualified, well-experienced teachers were inconsistent when it came to accurately estimating how students would do on an item. "No one in any studies we have read has been successful at getting judges to estimate item difficulty well." Pretty compelling evidence that we need to temper our opinions with supporting evidence from students who, you know, actually took the assessments.

So now onto the game. Let's pick an item, any item. How about Item 132040149_1 from the released Grade 4 Assessment items?


Now, in order for this game to work, you have to play along. Click the link above to read the Pecos Bill story and do your best to answer the question. You may look at it and conclude it requires skills "out of the reach for many young children" or that the steps needed to answer this question are too many and too complicated for 4th graders. Now consider the question below, also from the fourth grade test:


Which one would you expect to be harder for the students? The top one or the bottom one? What's informing your decision making? What evidence are you using? What percent of students do you think got the top item correct? How about the bottom one?

Hit me up here or on Twitter and share your thinking. I'll follow up with the answer in an upcoming blog post, provided I get through my rose-tending to-do list.


Chasing Down Pineapple Chasers


Imagine you're strolling with a loved one in a local park one bright Sunday morning. You and your companion pass a cluster of flowers and you overhear another stroller say, "Look at those gross weeds. They should be pulled out; they've ruined this entire garden." You look where he's pointing and see the happiest, cheeriest, sunniest, albeit ugliest, bunch of daisies you've ever seen. You look at his face and recognize the speaker as a highly regarded and respected nursery owner. What do you do? What would Emily Post say? What would Freud advise? You look over at your loved one, panic clear on your face. If your loved one is like mine, s/he smiles, squeezes your hand, and asks, "Is it worth it?"

You decide it's not. No damage done. Who cares that the nursery owner confused a rare variety of daisies with weeds? But then you look over and see a group you recognize from the local gardening club, nodding along. "Bad weeds," you hear one mutter. "Terrible things. They should be yanked." Another pulls out a garbage bag and covers up the bunch of daisies. "No one should have to see these weeds," she says, and you hear the passion and conviction in her voice. Her voice practically vibrates with anger.

The nursery owner is consulting a gardening book and reading aloud the problems he sees with the weeds, and your stomach drops as you recognize he's misreading some of the information. "They're going to strangle the whole garden," you hear him mutter, and you start to twitch, knowing that the daisies actually attract a particular strain of butterflies that help pollinate a different section of the park.

You know because you're a botanist. You spend your professional life studying plants. While your work actually involves roses, you had to study daisies in order to better understand the species you grow. There is the real possibility, you admit, that you're wrong. The longer you stand there and the louder the group gets, the more convinced you are that you must be off-base; daisies aren't THAT necessary, and it would be great to use the space for more of your roses.

So, you say nothing. The moment passes. The group is unified by their hate for those damn not-really-weeds. Not much you can say. So you walk on with your companion, working hard not to give your loved one yet another lecture on the importance of ugly daisies in a well-balanced ecosystem. On your way out of the park, you hear members of the gardening group telling incoming strollers how the owner of the nursery had, just that morning, published a piece in a national gardening newsletter, "setting the record straight" on those nasty weeds.

What's the role of expertise in conversations like this? Do you, with expertise in flowers, though you way prefer roses to daisies, speak up? Does your obligation to speak up change based on the size of the crowd? Is it changed by knowing the nursery owner isn't fond of you, and has even publicly called you "uninformed"? In truth, the last time you spoke up, you had a middle school flashback to being told "your opinion doesn't matter because you're not tall/short/athletic/musical/smart enough/right-handed enough" to comment. Even worse, the last time someone spoke up, it seemed some members of the gardening group became even more insistent and vocal about calling the much-maligned daisies "weeds."

Help me out here, gentle readers. If you were the botanist, what would you do? What if you were the nursery owner - would you want the botanist to speak up? Is there a right time and place to speak up? Is it worth it?

Chasing Pineapples – Part 1

In my column for NY ASCD, I considered the role of assessment literacy in education. Here on my blog, I want to poke at the idea in a bit more depth. And since this is my (“our,” actually - Theresa was a much more consistent writer than I was when we first started; her reflections and thinking can be found throughout the archive) blog, I’m going to draw a line between assessment literacy and the Common Core Learning Standards.

I have a favorite Common Core Learning Standard. I realize that’s a bit like saying I have a favorite letter in the alphabet, but there you go.

CCSS.ELA-LITERACY.W.11-12.1: Write arguments to support claims in an analysis of substantive topics or texts, using valid reasoning and relevant and sufficient evidence. (NYS added an additional sentence to this standard when the state adopted CCLS: Explore and inquire into areas of interest to formulate an argument.)

Making the promise to NY students that we will do everything in our power to help them develop their ability to understand arguments, logic, evidence, and claims, in my humble opinion, is long overdue. In truth, I’m jealous that my teachers weren't working towards this goal. I learned how to write a really solid 5 paragraph essay in HS English and it wasn't until I was paying for my education that I was introduced to the rules of arguments and logical fallacies. Since I missed the chance during my formative years to explore this approach to discourse and discussion, I try to practice it as much as I can as an adult.

It’s my hunch that assessment illiteracy is having a dramatically negative impact on how we talk about public education. More to the point, I suspect that the same quirk that makes us fear Ebola more than texting while driving is what leads us to discuss and debate the state assessments with more energy, passion, and time than the assessments students see on a regular, daily basis. My claim: when viewed as a data-collection tool mandated by politicians with a 10,000 foot perspective, the tests are benign. Their flaws and challenges are amplified when we connect them to other parts of the system, or when we view them through the same lens we view assessments designed by those with a 10 foot perspective on student learning. When we chase the flaws in a test that takes less than 1% of the year, we end up chasing pineapples.

In the tradition of well-supported arguments, I want to focus on patterns more than individuals and on a narrow, specific claim rather than a bigger narrative. (In other words, I’m not defending the tests, NYSED, Pearson, Bill Gates, APPR, or anything else.) The pattern across Twitter and in Facebook groups is a call for NYSED to release the NCLB-mandated tests so that the public (including parents) can judge their appropriateness, quality, length, use of copyright, or whichever variable the person asking for the release wants to investigate. I absolutely support the Spencerport teachers' desire to see the entire test, but a voice in the back of my head keeps asking, “Why? So what? What criteria are you using to determine if the test is any good?”

Last year, NYSED released 25% of the items, and a few bloggers shared their opinions about the quality of the items, but I haven't been able to find any analysis of the released items against specific criteria. This is not to say such analyses don't exist, just that they escaped my Google searches. This year, NYSED released 50% of the items and the response has been that NYSED should release ALL of the items. Which, I suspect, is what NYSED wants to be able to do, but funding issues are preventing it from happening. I've been watching Twitter, hoping to see critical reviews of the released 50%, but instead there’s been lots of opining. Lots of “I” statements, not a lot of evidence-based claims. This, I suspect, is a side effect of assessment illiteracy across the board. We just aren't any good as a field, much less as a collection of citizens, at assessing the quality of measurement tools.

So, what makes an assessment good? What makes it bad? Given the rules of quality test design as outlined in the APA Testing Standards, why are we willing to accept the strength of the speaker’s opinion as the determining factor of quality? Is the issue of quality in large scale assessments a matter of opinion? I suspect anyone who has taken any large scale test (from the driver's test to the SATs) hopes that’s not the case. I know that numerous organizations, including the National Council on Measurement in Education, work to establish explicit criteria. The USDOE is instituting a peer review process for state assessments to ensure quality. PARCC is being as transparent as possible, including bringing together 400 educators and teachers to provide feedback. All of these groups use specific criteria to assess the quality of large scale assessments. Yet, in the media - social and traditional - one person's opinion about "bad" or "hard" items is treated as if it's the truth.

So, my confusion remains: if members of groups who do assessment for a living spend years establishing and sharing measures of quality for large scale tests, what tools will the public use to assess their quality? How can the general public use “cardiac evaluation” (I know it because I feel it - not my phrase, I totally cribbed it from someone else) when the vast majority of classroom teachers receive little or no training during teacher prep in how to assess and evaluate assessments? When it comes to state assessments, is it more about chasing pineapples - making claims about the tests' quality - than actually catching them - supporting claims with evidence?

 And as I often do, I end up asking myself why it matters. If a parent says "I think this item is too hard/developmentally inappropriate/unfair" should that be enough to say that it is? How much of the science of the education profession actually belongs to members of the profession?