AI Hallucination Fix Now in Testing, Georgia Tech Researchers Say
Probably no challenge hinders the adoption of generative artificial intelligence in online education like the now well-known defect of “hallucinations.” This fault repeatedly leads AI platforms to assert falsehoods with unwavering confidence, as we’ve explained in several recent reports here on OnlineEducation.com.
If a teaching assistant deliberately lied to a student about the date, time, and location of a final exam, their university probably would never renew that TA’s contract. From this perspective, it’s apparent why AI hallucinations in systems like ChatGPT have invited so much intense scrutiny and broad concern throughout higher education.
Fortunately, a team of researchers from the Georgia Institute of Technology in Atlanta announced late in June 2023 that they may have devised a solution. They claim their approach could boost the factual accuracy of large language models like ChatGPT as high as 95 percent.
However, what’s interesting is that the team’s approach doesn’t involve rewriting the programming underlying such models. Instead, the professors discovered that adding a second, supervisory AI agent trained specifically to check ChatGPT’s output for errors could substantially raise factual validity scores.
AI Returns Patterns, Not Facts
Georgia Tech’s Sandeep Kakar told EdSurge’s Jeffrey Young that hallucinations are such a challenging problem because AI platforms aren’t designed to return facts; they’re designed to return answers that fit predictable patterns. That insight helps explain why hallucinations appear so frequently in the output of artificial intelligence platforms. Dr. Kakar explained:
ChatGPT doesn’t care about facts, it just cares about what’s the next most-probable word in a string of words. It’s like a conceited human who will present a detailed lie with a straight face, and so it’s hard to detect. I call it a brat that’s not afraid to lie to impress the parents. It has problems saying, “I don’t know.”
Everybody working with ChatGPT is trying to stop hallucinations, but it is literally in the DNA of large language models.
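Kakar’s description can be made concrete with a toy next-word predictor. The sketch below is purely illustrative and is not Georgia Tech’s code: it learns word-to-word statistics from a tiny made-up corpus and then generates text by always choosing the most probable continuation. Nothing in the loop ever checks a claim against a source of truth, which is exactly the property Kakar is describing.

```python
from collections import Counter, defaultdict

# Toy illustration (not Georgia Tech's code): a language model scores
# continuations by probability alone, with no notion of whether they are true.
corpus = "the exam is on friday the exam is in room 204 the exam is on friday".split()

# Count bigrams: how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev: str) -> str:
    """Return the most probable next word -- 'most likely,' not 'most true.'"""
    return bigrams[prev].most_common(1)[0][0]

# Generate by repeatedly taking the most probable continuation.
word, sentence = "the", ["the"]
for _ in range(4):
    word = next_word(word)
    sentence.append(word)
print(" ".join(sentence))  # prints "the exam is on friday"
```

Whether “friday” is actually the exam date never enters the calculation; the model simply reproduces the most frequent pattern in its training data.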
However, Georgia Tech had already developed and refined a closely related AI resource that faculty researchers suspected might help.
In 2014, the school’s computer science department had been searching for a way to help nine chronically overworked teaching assistants who were supporting students in an online artificial intelligence class. These TAs had been struggling to keep up with the roughly 10,000 questions asked each semester by the course’s growing enrollment, which had just surpassed 300 students spread across time zones all over the world.
The department had noticed that many of the questions directed to the TAs were repetitive or easily answerable from textbooks or readings, making this an ideal potential use case for automated artificial intelligence support. The professors didn’t want to replace the TAs with an AI system, but did want to ease their workloads by reducing some of the time and effort involved with student interactions.
To help, IBM gave Georgia Tech access to its famous Watson Engagement Manager system. Watson is the natural-language question-answering computer system originally developed to answer questions live on TV during the quiz show Jeopardy! After a four-year development effort costing IBM tens of millions of dollars, Watson competed live on the program in 2011 against the world’s best human champions and made history by winning the first-place prize of $1,000,000.
Researchers from the department then modified the Watson AI software used on TV so that it could quickly answer students’ questions through the course’s discussion board and within chat conversations. The professors went on to invest almost 1,500 hours in training the chatbot, not only on the course material but also on appropriate and effective ways to interact with students.
This team’s leader was Professor Ashok Goel, the director of the department’s Design & Intelligence Laboratory and an expert in both knowledge-based artificial intelligence and cognitive science. OnlineEducation.com readers might recall that we interviewed Dr. Goel for a report we published back in 2020 on new developments in AI-driven adaptive learning within online courses.
Meet Jill Watson
The professors’ modifications to the Watson software had essentially created an AI-assisted virtual teaching assistant that in 2016 they named “Jill Watson.” But before inviting students to interact with their new TA, Dr. Goel’s team sought approval from Georgia Tech’s institutional review board, the ethics committee that evaluates potential experiments on human subjects. Because the researchers wanted to know whether their online students could tell that Jill wasn’t human, they didn’t tell the students in advance that Jill was actually a bot.
After final exams at the end of Jill’s first semester, survey questionnaires showed that students couldn’t discern which of their TAs was actually an AI chatbot. That was partly because of Jill’s remarkable accuracy: she correctly answered 97 percent of the questions she received, and she only needed to refer the toughest three percent to human TAs.
However, while many students assumed that she was a friendly and bright grad student, some of their more observant classmates did notice that Jill seemed better in certain respects than typical graduate assistants.
For example, Jill’s consistently fast replies to questions seemed surprising. And Jill had no problem with conversations outside of office hours; she never seemed pressed for time, was available to chat around the clock, and was almost always online in the middle of the night. Several of her students wanted to nominate Jill for an award as an outstanding teaching assistant, and a few of the residential students on campus liked Jill so much that they even asked her to lunch.
When her students finally found out that Jill was a bot, they were amazed. In the course’s online forum, one shared that “I feel like I am part of history because of Jill and this class!” Another wrote “Just when I wanted to nominate Jill Watson as an outstanding TA in the CIOS survey!”
The department has further refined Jill’s performance over the past seven years, and Jill has also “scaled up.” She currently serves as a teaching assistant for 17 undergraduate and graduate courses at Georgia Tech, a workload no human could ever manage. Professor Goel explained to EdSurge:
The human TAs were just answering the more mundane questions again and again and again. By automating some of the very mundane things, we’re freeing up time. There are so many things to do as a teacher that students can take as much of your time as you have.
And now, Jill’s remarkable accuracy and performance might also make her an ideal candidate to help prevent the hallucinations returned by systems like ChatGPT.
Jill’s New Role: AI Verification Agent
During his keynote address at an April 2023 Duke University symposium on AI and emerging pedagogies, Dr. Goel suggested that it might be possible to mitigate some of the risks of ChatGPT by using it in conjunction with other technologies. He told his Durham, North Carolina audience that one way might be to deploy “a collection of AI agents working together—ChatGPT as an AI agent and some other AI agent which is going to check ChatGPT.”
This appears to be Jill Watson’s next role. Dr. Goel explained that his team’s new hallucination-killing platform sets Jill up to function as a verification intermediary between ChatGPT and the user. In this role, Jill monitors and rapidly fact-checks ChatGPT’s work before forwarding the results to students.
This arrangement is practical because training an optimized Jill Watson “clone” to check errors for a specific course now takes most instructors only about ten setup hours, down from the roughly 1,500 hours the first generation required. The training procedure involves first feeding Jill all the course materials, starting with the textbook, followed by transcripts of lecture videos along with their PowerPoint slides. Instructors can then train Jill on posts and comment threads from the course’s discussion forum during previous semesters, which also provides the bot with an extensive database of common questions, model answers, and solved problems.
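The team hasn’t published this pipeline, so the following is only a minimal sketch of what the ingestion step might look like. The course_materials dictionary and the find_sources helper are hypothetical stand-ins; the point is simply that the textbook, lecture transcripts, slides, and forum archives get indexed so that claims can later be checked against them.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the course materials the article lists.
course_materials = {
    "textbook": "The final exam covers chapters 1 through 9.",
    "lecture_transcripts": "Today we discuss knowledge-based AI methods.",
    "slides": "Reminder: assignment 3 is due November 14.",
    "forum_archive": "Q: When is assignment 3 due? A: November 14.",
}

# Build an inverted index mapping each word to the sources that mention it.
index = defaultdict(set)
for source, text in course_materials.items():
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(source)

def find_sources(claim: str) -> set:
    """Return the sources that mention every word in the claim."""
    hits = [index[w] for w in re.findall(r"[a-z0-9]+", claim.lower()) if w in index]
    return set.intersection(*hits) if hits else set()

print(find_sources("assignment 3 due november 14"))  # e.g. {'slides', 'forum_archive'}
```

A production system would use far richer retrieval than keyword matching, but even this toy index shows how an unsupported claim, one that matches no source, becomes detectable.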
Experts like Dr. Goel can build such Jill clones in far less time. He told the Duke audience that “If you give me a class. . .I’ll ask you for all of your educational materials, the content that you have, the course syllabus, the books that you use, the documents that you pass [out] and I’ll train a Jill Watson for you and build a Jill Watson for you in about five minutes.”
Once trained and installed as the verification agent, Jill can function in either of two ways. She can fact-check ChatGPT’s answers to student questions against the training materials, or she can direct ChatGPT to the specific locations within the textbook or lecture notes to review before returning an answer. As soon as Jill detects a hallucination, the bot can either block ChatGPT’s result or forward it to the student with a confidence warning that reports, “I have low confidence in this answer.”
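Jill’s actual verification logic isn’t public either, so this last sketch only illustrates the control flow of the two modes just described. The ask_chatgpt, check_claim, and find_sources callables are hypothetical placeholders that a real deployment would supply.

```python
LOW_CONFIDENCE_WARNING = "I have low confidence in this answer."

def verify_then_forward(question: str, ask_chatgpt, check_claim) -> str:
    """Mode 1: answer first, then fact-check before forwarding to the student."""
    answer = ask_chatgpt(question)
    if check_claim(answer):  # e.g., checked against the indexed course materials
        return answer
    # On a suspected hallucination, block the result or attach a warning.
    return f"{answer}\n\n{LOW_CONFIDENCE_WARNING}"

def retrieve_then_answer(question: str, ask_chatgpt, find_sources) -> str:
    """Mode 2: point ChatGPT at specific course passages before it answers."""
    passages = find_sources(question)
    prompt = f"Answer using only these course materials: {passages}\n\n{question}"
    return ask_chatgpt(prompt)
```

Either path puts a checking step between the language model and the student, which is the core of the multi-agent architecture Dr. Goel describes.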
Fighting for Points
The EdSurge article goes on to report that Georgia Tech is already testing this new system in three computer science courses during the summer 2023 term. Dr. Kakar emphasizes that initial testing has demonstrated encouraging results: under Jill’s supervision, ChatGPT returned accurate answers in more than 95 percent of cases. Given that current validity measurements show OpenAI’s latest GPT-4 model consistently returning accurate facts only about 60 percent of the time, that would amount to a relative improvement of roughly 58 percent, since rising from 60 to 95 percent accuracy is a gain of 35 points on a 60-point base.
Nevertheless, there still seems to be considerable controversy over whether a 95 percent accuracy rate would be good enough to consistently meet the demanding standards of college and graduate programs. Young writes:
“We are fighting for the last couple of percentage points,” says Kakar. “We want to make sure our accuracies are close to 99 percent.”
And Kakar admits the problem is so tough that he sometimes wakes up at 3 in the morning worrying that there’s some scenario he hasn’t planned for yet:
Imagine a student asking when is this assignment due, and ChatGPT makes up a date. That’s the kind of stuff we have to guard against, and what we’re trying to do is basically build those guardrails.