Are California Teachers Actually Using AI Software to Grade Papers?

Until recently, K-12 teachers across America were thought to be using artificial intelligence tools mainly to save time and effort on administrative tasks and lesson plans. But a controversial June 2024 report suggests that automated AI software could actually be assigning grades for writing assignments completed by a growing segment of California’s elementary, middle, and high school students.

Published by CalMatters, the story chronicles the experiences of two teachers who grade papers using AI software that also provides students with automated feedback on their assignments’ writing mechanics, like grammar and syntax. One educator teaches fourth graders at a country day school in the Northern California college town of Chico; the other teaches at a public high school in San Diego.

Anecdotal Evidence Suggests More Teachers Grading with AI

As Wired and several other outlets have reported, surveys dating back to March 2023 show that K-12 teachers across America have been using generative AI software far more than their students, typically to reduce the threat of burnout by saving time and effort. However, no specific data yet supports the notion that AI-assisted grading amounts to a trend among teachers in the Golden State. Believe it or not, that’s because the California Department of Education doesn’t keep those records.

Alix Gallagher, the head of strategic partnerships at PACE—the Policy Analysis for California Education Center at Stanford University’s Graduate School of Education—told CalMatters that California’s state government does not track the curricula adopted by local and regional school districts, or the software the districts use. That means it would be “highly unusual” for the Education Department to track contracts between the districts and tech firms that might be selling access to artificial intelligence platforms for use in classrooms.

Moreover, the Department’s computer science coordinator Katherine Goyette told the outlet that the state also does not track AI usage by school districts. But in a significant disclosure, Goyette did confirm that increasing numbers of teachers across California are applying AI technology to help with grading students’ work.

“My Job Is Not to Spend Every Saturday Reading Essays”

In San Diego, veteran English teacher Jen Roberts at Point Loma High School is one of those teachers. She says the past school year was the best of her nearly 30-year teaching career. She credits several AI tools, like Writable, an artificial intelligence platform developed by a Palo Alto, California-based startup recently acquired by Houghton Mifflin Harcourt. Writable automatically grades her students’ writing assignments.

By using an application programming interface (API) that accesses one of OpenAI’s GPT-4 large language models, Writable also gives her students automated feedback on their writing and grades. Roberts says this automated system tells her students how to improve their writing far faster than she could deliver feedback herself. That speed lets her issue more assignments, and she asserts that the additional writing projects have made her students better writers.
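For readers curious what such an integration looks like under the hood, here is a rough Python sketch of how a grading tool might request automated mechanics feedback through OpenAI’s API. The prompt wording, the model name, and the mechanics_feedback function are illustrative assumptions, not Writable’s actual implementation.

```python
# Rough sketch of requesting automated writing-mechanics feedback via
# OpenAI's API. The prompt and model name are illustrative assumptions,
# not Writable's actual implementation.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def mechanics_feedback(essay_text: str) -> str:
    """Return brief, student-facing feedback on grammar and syntax."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed; Writable reportedly uses a GPT-4 model
        messages=[
            {
                "role": "system",
                "content": (
                    "You give students concise, encouraging feedback on "
                    "writing mechanics: grammar, syntax, and punctuation."
                ),
            },
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content

print(mechanics_feedback("Me and him goes to the library on tuesdays."))
```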

What’s important to understand about Roberts’ workflow is that when she talks about how long it took to give her students feedback in the pre-AI era, she’s not talking about a few days, but two or three weeks. In one of the story’s most surprising passages, here’s how CalMatters reporter Khari Johnson breaks down this lengthy, tedious process:

Roberts says the average high school English teacher in her district has roughly 180 students. Grading and feedback can take between five to 10 minutes per assignment, she says, so between teaching, meetings, and other duties, it can take two to three weeks to get feedback back into the hands of students unless a teacher decides to give up large chunks of their weekends. With AI, it takes Roberts a day or two…
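The arithmetic behind that backlog is easy to verify. A quick back-of-the-envelope check of the figures Roberts cites:

```python
# Back-of-the-envelope check of the grading workload Roberts describes.
students = 180
minutes_low, minutes_high = 5, 10  # grading/feedback time per paper

hours_low = students * minutes_low / 60    # 15.0 hours per assignment
hours_high = students * minutes_high / 60  # 30.0 hours per assignment

print(f"{hours_low:.0f} to {hours_high:.0f} hours of grading per assignment")
# Squeezed around teaching, meetings, and other duties, 15 to 30 hours of
# grading easily stretches into the two-to-three-week turnaround she cites.
```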

She says AI reduces her fatigue, freeing up more time to focus on struggling students and give them more detailed feedback.

“My job is to make sure you grow, and that you’re a healthy, happy, literate adult by the time you graduate from high school, and I will use any tool that helps me do that, and I’m not going to get hung up on the moral aspects of that,” she said. “My job is not to spend every Saturday reading essays. Way too many English teachers work way too many hours a week because they are grading students the old-fashioned way.”

“Not a Great Use of My Time”

Despite her enthusiasm, Roberts acknowledges that the grades Writable assigns can miss the mark in certain cases.

For example, Roberts believes Writable is “very accurate” when grading average students’ work. But the platform’s grades show a “regression toward the mean” effect: Writable sometimes assigns high-performing writers lower grades than she would, and conversely gives struggling writers higher ones.

Roberts says that she routinely monitors each student’s grade assigned by Writable’s AI system. But she doesn’t monitor the feedback students receive from the platform. “That’s just not a great use of my time,” she says.

Teachers vs. AI: Research Compares Grading and Feedback

Roberts may be right about not needing to review the automated feedback her students receive from an AI platform like Writable. An October 2023 study by a team of professors from Brigham Young University and Florida State University compared feedback on college students’ writing generated by the same automated OpenAI GPT-4 system used by Writable with feedback delivered by expert human tutors to a control group of students. In the report, published in the International Journal of Educational Technology in Higher Education, the researchers say they observed no significant differences in writing outcomes between the two groups of students.

Moreover, two later 2024 studies also compared artificial intelligence-generated assessments of middle and high school students’ writing with assessments delivered by expert evaluators. Surprisingly, the researchers from the University of California at Irvine and Arizona State University who conducted both studies did not find, as many had expected, that the teachers’ assessments were overwhelmingly better than the grades and feedback ChatGPT delivered.

In the team’s first study, published in June 2024, which compared written evaluations, the human evaluators performed significantly better than the bot. But in a second study comparing “holistic” numerical scores, ChatGPT performed with marginally better reliability, a remarkable result for two reasons.

First, ChatGPT wasn’t competing against average teachers. The human scorers in this study were veteran teachers, some with PhDs, who had all received three hours of training on how to assess essays for the study.

Second, the research team conducted their experiment as a zero-shot exercise, meaning they didn’t first show ChatGPT any example essays with grades already assigned. The professors merely gave the AI platform the same scoring guidelines, known as a “grading rubric,” that they gave the human graders.

Furthermore, the researchers didn’t perform any prompt engineering. They simply instructed ChatGPT to act as if it were a human secondary school teacher and then assign a grade on a 1-to-6 scale to each of the 1,800 essays.
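That low barrier to entry is easy to picture in practice. Below is a minimal Python sketch of such a zero-shot grading call; the rubric text, the system prompt, and the model name are illustrative stand-ins, not the study’s actual materials.

```python
# Minimal zero-shot grading sketch: the model sees only a rubric, never
# any example essays with grades already assigned. All prompt text and
# the model name are illustrative stand-ins for the study's materials.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

RUBRIC = (
    "Score the essay holistically from 1 (lowest) to 6 (highest).\n"
    "6: insightful thesis, strong organization, precise language.\n"
    "3: discernible thesis, uneven organization, frequent errors.\n"
    "1: no clear thesis, little organization, pervasive errors."
)

def grade_essay(essay_text: str) -> str:
    """Ask for a single 1-to-6 holistic score, zero-shot."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed GPT-4-class model
        temperature=0,  # favor repeatable scores
        messages=[
            {
                "role": "system",
                "content": "Act as a human secondary school teacher grading essays.",
            },
            {
                "role": "user",
                "content": f"{RUBRIC}\n\nEssay:\n{essay_text}\n\nReply with the score only.",
            },
        ],
    )
    return response.choices[0].message.content.strip()
```

Looping a function like grade_essay over a folder of student files would reproduce the study’s basic setup at classroom scale.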

In other words, the study showed that an average teacher wouldn’t need sophisticated programming know-how, experience with AI systems, or a substantial budget to quickly and easily set up ChatGPT to start grading essays. And that teacher could feel reasonably confident that the bot would return grades on average with reliability roughly equivalent to that of a highly trained human evaluator.

However, that doesn’t mean ChatGPT’s scores matched the human graders’ scores for every subset within the sample of 1,800 essays. For example, when ChatGPT scored a group of 493 history essays from “students in largely Latino schools in Southern California,” the bot returned scores within a single point of the human evaluators’ grades only about three-quarters of the time, a performance level far lower than with the other tested subsets.

Accordingly, the team’s lead researcher Dr. Tamara Tate of the University of California at Irvine has downplayed expectations for ChatGPT’s grading performance. Dr. Tate told the Hechinger Report’s Jill Barshay that results showed ChatGPT could function as well as an “average busy teacher,” and “certainly as good as an overburdened, below-average teacher.”

However, at this time, Dr. Tate does not recommend using ChatGPT to grade an essay that counts heavily toward a student’s final class grade, or any other high-stakes essay or examination. In other words, she recommends that teachers use ChatGPT’s grades only as preliminary grades on first drafts and for other low-stakes purposes.

Such preliminary first-draft grades are one type of formative assessment. As this article from Stanford University’s Teaching Commons describes, formative assessments are evaluations meant to measure learning and provide feedback in a spirit of growth and improvement.

Because formative feedback on early-phase writing is known to facilitate students’ development as writers, the UC Irvine team’s earlier June 2024 study also examined ChatGPT’s capacity to deliver effective formative feedback on rough drafts. “Well-trained evaluators provided higher quality feedback than ChatGPT,” the researchers wrote, concluding that overall, the human evaluators performed better.

Nevertheless, the researchers still argued in favor of a specific role that generative AI could play in providing automated formative feedback:

Even if ChatGPT’s feedback is not perfect, it can still facilitate writing instruction by engaging and motivating students and assisting teachers with managing large classes, thus providing them more time for individual feedback or differentiated writing instruction (Grimes & Warschauer, 2010).

Given our results, we see a plausible use case for generative AI: providing feedback in the early phases of writing, where students seek immediate feedback on rough drafts. This would precede, not replace, teacher-provided formative or summative evaluation that is often more accurate and more tailored to student-specific characteristics, albeit less timely.

“I Couldn’t Go Backwards Now”

Meanwhile, up north in Chico, Alex Rainey teaches fourth-grade English at Chico Country Day School—and she’s certainly using ChatGPT for much more than rough-draft formative feedback.

Rainey’s $20-per-month ChatGPT subscription, which runs on the same OpenAI GPT-4 system that Writable accesses, grades her students’ papers and writes their feedback for her.

But Rainey must first perform some prompt engineering: she uploads to ChatGPT her own prompts containing examples of her written feedback along with her grading rubric, a “one-shot” setup that shows the model what her assessments look like. She told Johnson that she usually relied on the AI system to analyze completed assignments’ sentence structure and grammar while she manually assessed her students’ creativity.
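Her setup differs from the zero-shot design in the UC Irvine study: Rainey seeds the model with a worked example of her own feedback. Below is a minimal one-shot sketch along those lines; the sample essay, the feedback, the rubric, and the model name are all hypothetical stand-ins, not Rainey’s actual prompts.

```python
# One-shot sketch: supply one prior essay and the teacher's own feedback
# on it, plus her rubric, before asking for feedback on a new paper.
# Every string below is a hypothetical stand-in, not Rainey's materials.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "4: clear ideas and varied sentences. 3: mostly clear writing. "
    "2: some confusing parts. 1: hard to follow."
)

EXAMPLE_ESSAY = "My dog Rex is brave. He barked at the thunder all night."
EXAMPLE_FEEDBACK = (
    "Great detail about Rex! Try joining your two short sentences with "
    "'because' to show why he is brave."
)

def fourth_grade_feedback(essay_text: str) -> str:
    """Return feedback styled after the teacher's one-shot example."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    f"You grade fourth-grade writing with this rubric:\n{RUBRIC}\n"
                    "Match the tone and style of the example feedback."
                ),
            },
            # The one-shot example: a prior essay and the teacher's feedback.
            {"role": "user", "content": EXAMPLE_ESSAY},
            {"role": "assistant", "content": EXAMPLE_FEEDBACK},
            # The new paper to assess.
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content
```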

Rainey said, “I feel like the feedback it gave was very similar to how I grade my kids, like my brain was tapped into it.” And she agreed with Roberts about the dramatic time savings: work that once consumed many protracted hours now takes her only a fraction of one.

However, in her one difference of opinion with Roberts, Rainey emphatically believes that teachers need to monitor both the AI platform’s automated feedback and the grades the system assigns. Still, she agreed that her students had become better writers because the much faster automated feedback made more writing assignments possible.

“I think it’s amazing. I couldn’t go backwards now,” she told Johnson.

Douglas Mark

While a partner in a San Francisco marketing and design firm, for over 20 years Douglas Mark wrote online and print content for the world’s biggest brands, including United Airlines, Union Bank, Ziff Davis, Sebastiani and AT&T.

Since his first magazine article appeared in MacUser in 1995, he’s also written on finance and graduate business education in addition to mobile online devices, apps, and technology. He graduated in the top 1 percent of his class with a business administration degree from the University of Illinois and studied computer science at Stanford University.