Title: Free Text Responses – a comparison of evaluation methods in constructed response items
Background
Online assessment has gained importance in the pandemic context. Since McCarthy’s work in 1966, we have known that the cues provided in selected response items strongly influence candidate performance and item validity, a concern for anyone assembling cases, quizzes and problems. However, case authors avoid constructed responses because of the challenges in evaluating them.
Summary of work
Using the OLab education research platform, we compared various approaches to evaluating constructed responses. Based on author needs, the platform has expanded the variety of methods for handling free text, including keyword matching, regular expression parsing, peer-reviewed response assessment, chat-based hybrid conversational interpretation and AI-assisted natural language understanding. For each approach, we compared the complexity of case design it affords, its use cases and its axiology, or value.
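To make the difference between the two simplest methods concrete, the following is a minimal Python sketch of keyword matching versus regular expression parsing. It does not show how OLab implements either method; the function names, the keyword list and the aspirin dose pattern are hypothetical and chosen only for illustration.

```python
import re

def keyword_match(response: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears as a whole word, ignoring case."""
    words = set(re.findall(r"[a-z0-9']+", response.lower()))
    return any(k.lower() in words for k in keywords)

# Hypothetical pattern: aspirin (or ASA) followed by a dose of 300-325 mg,
# tolerant of case and of optional whitespace before the unit.
ASPIRIN_DOSE = re.compile(
    r"\b(?:aspirin|asa)\b.*?\b(3[01]\d|32[0-5])\s*mg\b",
    re.IGNORECASE,
)

def regex_match(response: str) -> bool:
    """Return True if the response names the drug with a dose in range."""
    return ASPIRIN_DOSE.search(response) is not None
```

Under these assumptions, keyword matching accepts any response that mentions the drug, including "I would not give aspirin", whereas the pattern also requires a plausible dose. Neither approach interprets negation or tone, which is one reason the method needs to be matched to the construct being assessed.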
Summary of results
Emulation of exam environments has been a popular use for the OLab platform. Authors are able to construct cases and patient management problems that closely simulate those seen in high-stakes exams. Handling free text inputs is now increasingly feasible and scalable, without a large increase in assessment effort. But not all response handling methods provide equal value in all contexts; it is important to match the method to the context and to be clear on what construct you are assessing. AI-assisted natural language understanding is very powerful and a useful addition to the authors’ quiver, but its value proposition does not extend to all constructs. There is still value in other approaches.
Discussion and Conclusion
Matching the evaluation method with the use case remains important: it is not cost-effective to assume that AI can do everything. Phrasing nuance can be crucial, especially in therapeutic language and critical conversations. These findings apply across a variety of assessment platforms.
Take home messages
- Consider what you are assessing: communication skills or decision-making
- One size does not fit all: variety in how free text inputs are processed is essential