
General Use AI as a Qualitative Coder: Testing Reliability in School-based Consultation Data


In April 2026, our student Noah Jones presented the results of his initial testing of ChatGPT-generated codes versus the conclusions of human coders when analyzing school consultation transcripts. This analysis used data from the study of ATHEMOS the Game, funded by the Institute of Education Sciences, U.S. Department of Education, through Grant R324A180219 to East Carolina University and Ohio University. (The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.)


Introduction


Qualitative data analysis is a key research technique for understanding clinical processes like those that unfold during school consultation. But qualitative coding is time-consuming, requiring extensive human training and data interpretation. With the growing accessibility of general-use artificial intelligence (AI) tools, researchers may have a new tool to support or enhance traditional coding practices (Bunt et al., 2025; Tai et al., 2024).


Generative AI built on large language models (LLMs), such as ChatGPT, appears well-suited to deductive qualitative coding. LLMs are designed to map text onto meaningful categories, but research on their reliability as qualitative coders is still in its early stages.


Present Study: We evaluated how closely LLM-generated qualitative coding aligns with human coding using a set of school consultation transcripts. We focused on scalar outcomes—counts of specific codes per transcript—as an initial step toward establishing the reliability of AI coding.


Method


• Six trained human coders double-coded 36 school consultation transcripts using a predefined codebook of nine code categories informed by…

• Human code counts were estimated using inclusive (“triangulation”) and consensus (agreement-only) approaches.

• AI was prompted to generate transcript-level code counts using the codebook (see Figure 1).

• All three sets of counts (two human, one AI) were then compared.


Analysis & Results


Mean absolute errors (MAE) were calculated to quantify the average difference in code counts between AI and human coding under both triangulation (inclusive coding) and consensus (agreement-only coding) approaches. Analyses were conducted at the transcript level (n = 36) across nine predefined coding categories.
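The MAE comparison described above can be sketched as follows. This is a minimal illustration, not the study's analysis code, and the per-transcript counts shown are hypothetical:

```python
# Sketch of the transcript-level MAE comparison: for each code category,
# average the absolute difference between AI and human code counts
# across transcripts. All numbers below are hypothetical illustrations.

def mean_absolute_error(ai_counts, human_counts):
    """Average absolute difference in code counts across transcripts."""
    assert len(ai_counts) == len(human_counts)
    return sum(abs(a - h) for a, h in zip(ai_counts, human_counts)) / len(ai_counts)

# Hypothetical per-transcript counts for one code category
ai = [5, 3, 8, 2]
triangulated = [5, 4, 7, 2]   # inclusive ("triangulation") human coding
consensus = [3, 2, 6, 1]      # agreement-only human coding

mae_triangulated = mean_absolute_error(ai, triangulated)  # 0.5
mae_consensus = mean_absolute_error(ai, consensus)        # 1.5
```

In this toy example, the AI counts sit closer to the inclusive (triangulated) totals than to the consensus totals, the same pattern reported below for most codes.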


AI-generated coding appeared closer to triangulated coding for 7 of 9 codes and closer to consensus for 2 codes (see Table 1). This pattern suggests that AI coding most closely resembles inclusive (“triangulated”) coding rather than a consensus-based approach. Error magnitudes were relatively small for most codes, but substantially larger discrepancies were noted for high-frequency codes, including Giving Information and Tech Talk, where AI clearly diverged from human coding.



Discussion


Our findings suggest that AI-generated qualitative coding can approximate human coding patterns, but alignment depends on how human coding is defined. AI appears to produce results approximating inclusive, “triangulated” efforts across multiple human coders, rather than coder consensus. Conceptually, this reflects a pattern of high sensitivity but lower specificity, which is consistent with expectations for LLMs.


For most codes, differences between AI and human coding were relatively small, suggesting that AI can provide a reasonable approximation of human-coded frequencies at the transcript level. But discrepancies in high-frequency categories (e.g., Giving Information, Tech Talk), where AI tended to overestimate code counts, highlight important limitations that require further research.


Readers should note that our results focus on outcome-level concordance, and future research will need to examine utterance-level agreement to inform specific strategies for combining AI and human coding.


References


Bunt, H.L., Goddard, A., Reader, T.W., & Gillespie, A. (2025). Validating the use of large language models for psychological text classification. Frontiers in Social Psychology, 3. https://doi.org/10.3389/frsps.2025.1460277


Tai, R.H., Bentley, L.R., Xia, X., Sitt, J.M., Frankhauser, S.C., Chicas-Mosier, A., & Monteith, B.G. (2024). An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods, 23. https://doi.org/10.1177/16094069241231168



 
 
 
