
General Use AI as a Qualitative Coder: Testing Reliability in School-based Consultation Data


In April 2026, our student Noah Jones presented the results of his initial testing of ChatGPT-generated codes versus the conclusions of human coders when analyzing school consultation transcripts. This analysis used data from the study of ATHEMOS the Game, funded by the Institute of Education Sciences, U.S. Department of Education, through Grant R324A180219 to East Carolina University and Ohio University. (The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.)


Introduction


Qualitative data analysis is a key research technique for understanding clinical processes like those that unfold during school consultation. But qualitative coding is time-consuming, requiring extensive human training and data interpretation. With the growing accessibility of general-use artificial intelligence (AI) tools, researchers may have a new tool to support or enhance traditional coding practices (Bunt et al., 2025; Tai et al., 2024).


Generative AI built on large language models (LLMs), such as ChatGPT, appears well-suited to deductive qualitative coding. LLMs are designed to map text onto meaningful categories, but research on their reliability as qualitative coders is still in its early stages.


Present Study: We evaluated how closely LLM-generated qualitative coding aligns with human coding using a set of school consultation transcripts. We focused on scalar outcomes—counts of specific codes per transcript—as an initial step toward establishing the reliability of AI coding.


Method


• Six trained human coders double-coded 36 school consultation transcripts using a predefined codebook of nine code categories informed by…

• Human code counts were estimated using inclusive (“triangulation”) and consensus (agreement-only) approaches.

• AI was prompted to generate transcript-level code counts using the codebook (see Figure 1).

• All three sets of counts (two human, one AI) were then compared.


Analysis & Results


Mean absolute errors (MAE) were calculated to quantify the average difference in code counts between AI and human coding under both triangulation (inclusive coding) and consensus (agreement-only coding) approaches. Analyses were conducted at the transcript level (n = 36) across nine predefined coding categories.
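The MAE comparison described above can be sketched as follows. This is a minimal illustration, not the study's analysis code, and the per-transcript counts shown are hypothetical:

```python
# Sketch of the transcript-level MAE comparison: for each code category,
# average the absolute difference between AI and human code counts
# across transcripts. All numbers below are hypothetical illustrations.

def mean_absolute_error(ai_counts, human_counts):
    """Average absolute difference in code counts across transcripts."""
    assert len(ai_counts) == len(human_counts)
    return sum(abs(a - h) for a, h in zip(ai_counts, human_counts)) / len(ai_counts)

# Hypothetical per-transcript counts for one code category
ai = [5, 3, 8, 2]
triangulated = [5, 4, 7, 2]   # inclusive ("triangulation") human coding
consensus = [3, 2, 6, 1]      # agreement-only human coding

mae_triangulated = mean_absolute_error(ai, triangulated)  # 0.5
mae_consensus = mean_absolute_error(ai, consensus)        # 1.5
```

In this toy example, the AI counts sit closer to the inclusive (triangulated) totals than to the consensus totals, the same pattern reported below for most codes.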


AI-generated coding appeared closer to triangulated coding for 7 of 9 codes and closer to consensus for 2 codes (see Table 1). This pattern suggests that AI coding most closely resembles inclusive (“triangulated”) coding rather than a consensus-based approach. Error magnitudes were relatively small for most codes, but substantially larger discrepancies were noted for high-frequency codes, including Giving Information and Tech Talk, where AI clearly diverged from human coding.



Discussion


Our findings suggest that AI-generated qualitative coding can approximate human coding patterns, but alignment depends on how human coding is defined. AI appears to produce results approximating inclusive, “triangulated” efforts across multiple human coders, rather than coder consensus. Conceptually, this reflects a pattern of high sensitivity but lower specificity, which is consistent with expectations for LLMs.


For most codes, differences between AI and human coding were relatively small, suggesting that AI can provide a reasonable approximation of human-coded frequencies at the transcript level. But discrepancies in high-frequency categories (e.g., Giving Information, Tech Talk), where AI tended to overestimate code counts, highlight important limitations that require further research.


Readers should note that our results focus on outcome-level concordance, and future research will need to examine utterance-level agreement to inform specific strategies for combining AI and human coding.


References


Bunt, H.L., Goddard, A., Reader, T.W., & Gillespie, A. (2025). Validating the use of large language models for psychological text classification. Frontiers in Social Psychology, 3. https://doi.org/10.3389/frsps.2025.1460277


Tai, R.H., Bentley, L.R., Xia, X., Sitt, J.M., Frankhauser, S.C., Chicas-Mosier, A., & Monteith, B.G. (2024). An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods, 23. https://doi.org/10.1177/16094069241231168



 
 
 
