When an assessment is designed for repeated use over time, the answer choices themselves become a source of measurement error. Item banking is the mechanism that keeps retake scores honest.
An assessment administered once faces a familiar set of measurement challenges: response bias, aspirational self-reporting, question interpretation. An assessment designed for repeated use over time faces all of those, plus one more that is specific to repetition: respondents remember the answers.
This is not a minor problem. When a respondent completes an assessment and then retakes it six months later, they have access to something no first-time respondent has: memory of which options produced which scores the last time. If the same answer choices appear in the same form, a motivated respondent can select the highest-scoring options from memory, regardless of whether their behavior has actually changed. The score improves. Nothing else does.
In a platform designed around longitudinal tracking, where the explicit purpose is to measure behavioral change over time, this would be a fundamental failure. A score improvement that reflects memorization rather than development gives individuals false evidence of growth and gives organizations false evidence of program effectiveness. It is worse than no measurement, because it provides a misleading signal with the appearance of rigor.
Item banking is the structural response to this problem.
An item bank is a collection of multiple, interchangeable phrasings for the same behavioral anchor. For every question in every assessment, each of the five point values has a pool of answer phrasings associated with it. All phrasings in the pool for a given point value describe the same level of behavior and produce the same score. They differ only in how that behavioral level is expressed in language.
Each time a tool is opened, the system randomly selects one phrasing per point value per question from the available pool. The respondent sees five choices, one for each level, but the specific words used to describe each level vary from session to session. A respondent who remembers that a particular sentence described the highest behavioral level on their last attempt cannot count on seeing that sentence again. In most sessions they will see a different description of the same level, which they have to evaluate on its content.
Every option in the score-3 pool produces the same score, and so does every option in the score-4 pool. What changes between sessions is which phrasing represents each level, making memorization of specific text useless as a strategy for improving scores.
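The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the names `ITEM_BANK`, `Choice`, and `draw_choices`, the placeholder phrasings, the pool size of three, and the shuffling of display order are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical bank for ONE question: point value -> interchangeable phrasings.
# The strings are placeholders, not text from any real assessment.
ITEM_BANK = {
    points: [f"level-{points} phrasing {label}" for label in "ABC"]
    for points in range(1, 6)
}


@dataclass(frozen=True)
class Choice:
    points: int  # the score this choice is worth
    text: str    # the phrasing shown to the respondent this session


def draw_choices(bank, rng):
    """Pick one phrasing per point value, then shuffle the display order.

    Shuffling the order is an assumption of this sketch.
    """
    choices = [
        Choice(points, rng.choice(phrasings))
        for points, phrasings in bank.items()
    ]
    rng.shuffle(choices)
    return choices


if __name__ == "__main__":
    rng = random.Random()
    for session in (1, 2):
        print(f"Session {session}")
        for choice in draw_choices(ITEM_BANK, rng):
            # The score depends only on the pool, never on the wording drawn.
            print(f"  [{choice.points}] {choice.text}")
```

Because the score is attached to the pool rather than to the sentence, two sessions can show entirely different wording and still produce comparable results. Under uniform random selection, a pool of three phrasings would repeat a given sentence in about one session out of three, so larger pools make recall less useful.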
The core proposition of the Evans Learning Labs platform is that assessments should be retaken over time, and that score changes over time should be interpretable as evidence of behavioral change. Item banking is what makes that proposition credible.
Without item banking, longitudinal score comparisons are contaminated by a confound the platform cannot separate from genuine change: the respondent's familiarity with the specific answer choices. A score that increases from 2.8 to 3.6 between an initial assessment and a six-month retake might reflect real behavioral development. Or it might reflect that the respondent remembered which choices produced higher scores and selected them more strategically the second time. Item banking removes the second explanation.
In educational and psychological measurement, the problem of practice effects (improved performance on retesting due to familiarity rather than genuine change) is well documented (Lievens et al., 2007).¹ Item banking is the standard mechanism for managing this effect in high-stakes longitudinal assessment contexts (van der Linden & Hambleton, 1997).² The application here adapts the same logic to behavioral self-report instruments.
This matters most in the specific contexts where ELL tools are most often used. When an individual retakes a tool after a coaching engagement, the retake score is used as evidence of whether the engagement produced change. When an organization tracks aggregate scores across a leadership cohort before and after a development program, those aggregate comparisons are used to evaluate program effectiveness. In both cases, contamination of the score by answer memorization would produce systematically misleading evidence.
The bank is not static. Additional phrasings can be added to any question's pool at any time without changing the scoring, the structure of the tool, or the interpretability of historical scores.
The bank will be expanded based on what is learned about how different groups interpret and relate to different formulations of the same behavioral level. Phrasings that are consistently misunderstood or that produce unexpected response patterns can be replaced or supplemented. New phrasings that better capture certain behavioral nuances can be added.
This is one of the structural advantages of separating answer content from assessment structure. The questions, the scoring logic, and the interpretive framework can remain stable while the specific language used to present each behavioral level continues to be refined. Respondents benefit from an instrument that improves over time without losing comparability with their own historical results.
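A short Python sketch can make the separation concrete. It assumes, hypothetically, that a respondent's history stores only the point values earned, never the wording that was shown. The `bank` layout, the `add_phrasing` helper, and the sample history are inventions for this example.

```python
from statistics import mean

# Hypothetical bank for one question: point value -> pool of phrasings.
bank = {
    points: [f"level-{points} phrasing {label}" for label in "AB"]
    for points in range(1, 6)
}

# Hypothetical history: only the points earned are stored.
history = [3, 4, 2, 4]


def add_phrasing(bank, points, text):
    """Return a copy of the bank with one more phrasing in the given pool."""
    if points not in bank:
        raise KeyError(f"no pool for point value {points}")
    updated = {p: list(pool) for p, pool in bank.items()}
    updated[points].append(text)
    return updated


before = mean(history)
expanded = add_phrasing(bank, 4, "level-4 phrasing C")
after = mean(history)

# The pool grew, but the set of point values and the stored scores did not change.
print(len(bank[4]), len(expanded[4]), before, after)
```

Under that assumption, growing or pruning a pool changes only the wording a future session might show. Past scores remain comparable because they never referred to specific text.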
The free Organizational Performance Assessment demonstrates item banking, answer randomization, and the full ELL methodology with no purchase required.
1 Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92(6), 1672–1682. https://doi.org/10.1037/0021-9010.92.6.1672
2 van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of Modern Item Response Theory. Springer. https://doi.org/10.1007/978-1-4757-2691-6