Gemini’s data analysis capabilities are not as good as Google claims

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that, thanks to their “long context,” the models can perform tasks that were previously impossible, such as summarizing multiple hundred-page documents or searching for scenes within film footage.

But new research suggests that the models are, in fact, not very good at those things.

Two separate studies investigated how well Google’s Gemini models and others make sense of enormous amounts of data (think works the length of “War and Peace”). Both found that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets correctly; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we’ve seen many cases indicating that the models don’t actually ‘get’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.

Gemini’s context window falls short

A model’s context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question—“Who won the 2020 U.S. presidential election?”—can serve as context, as can a movie script, show, or audio clip. And as context windows grow, so does the size of the documents included in them.

Newer versions of Gemini can accept more than 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s roughly equivalent to 1.4 million words, two hours of video, or 22 hours of audio — the largest amount of context of any commercially available model.
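
For a rough sense of scale, here is a back-of-the-envelope conversion of that token count into words and pages. The 0.7 words-per-token ratio and the 500 words-per-page figure are loose approximations for English prose, not Google’s own accounting.

```python
# Rough conversion of a 2-million-token context window into words and pages.
# The ratios below are approximations for English text and vary by tokenizer.

TOKENS = 2_000_000        # Gemini's advertised context window
WORDS_PER_TOKEN = 0.7     # rough average for English prose
WORDS_PER_PAGE = 500      # rough single-spaced page estimate

words = TOKENS * WORDS_PER_TOKEN
pages = words / WORDS_PER_PAGE

print(f"{TOKENS:,} tokens ≈ {words:,.0f} words ≈ {pages:,.0f} pages")
# 2,000,000 tokens ≈ 1,400,000 words ≈ 2,800 pages
```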

At a briefing earlier this year, Google showed off several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing broadcast (about 402 pages long) for quotes containing jokes, then find a scene in the broadcast that resembled a pencil sketch.

Oriol Vinyals, vice president of research at Google DeepMind, who led the briefing, described the model as “magical.”

“[1.5 Pro] can perform these kinds of reasoning tasks across every page, every word,” he said.

That may have been an exaggeration.

In one of the aforementioned studies comparing these capabilities, Karpinska, along with researchers at the Allen Institute for AI and Princeton, asked models to evaluate true or false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on prior knowledge, and peppered the claims with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement such as “Using her abilities as Apoth, Nusis can reverse engineer the type of portal opened by the reagent key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash—having ingested the relevant book—were required to say whether the statement was true or false and explain their reasoning.
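
As a rough illustration, a true/false probe like this could be posed through the google-generativeai Python SDK along the lines of the sketch below. The prompt wording, file name, and API usage here are illustrative assumptions, not the researchers’ actual protocol.

```python
# Illustrative sketch: asking a long-context model to verify a claim about a book.
# Assumes the google-generativeai SDK and an API key; the prompt, file name, and
# claim are placeholders, not the study's actual materials.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

book_text = open("novel.txt", encoding="utf-8").read()  # the full book as context
claim = "Using her abilities as Apoth, Nusis can reverse engineer the portal."

prompt = (
    "Read the following novel, then decide whether the claim is TRUE or FALSE "
    "and explain your reasoning with evidence from the text.\n\n"
    f"NOVEL:\n{book_text}\n\nCLAIM: {claim}"
)

response = model.generate_content(prompt)
print(response.text)
```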

Image credits: University of Massachusetts Amherst

Testing one book of roughly 260,000 words (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip would answer questions about the book significantly more accurately than Google’s latest machine learning model. Averaging across all the benchmark results, neither model managed to achieve better-than-chance accuracy at answering the questions.

“We have noticed that the models have a harder time verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models have difficulty verifying claims about implicit information that is clear to a human reader but is not explicitly expressed in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason about” videos — that is, to search through and answer questions about their content.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like sequences of images.

Flash didn’t perform well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits.
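
A rough sketch of how such a distractor sequence could be assembled is below; the file names and 25-image setup simply mirror the description above and are not the study’s actual test harness.

```python
# Illustrative sketch: hide one target image among distractors to form a
# slideshow-like sequence, as described in the study. File names are placeholders.
import random

distractors = [f"distractor_{i:02d}.jpg" for i in range(24)]  # filler images
target = "handwritten_digits.jpg"  # the image the question is actually about

# Insert the target at a random position to form a 25-image "slideshow".
position = random.randrange(len(distractors) + 1)
slideshow = distractors[:position] + [target] + distractors[position:]

question = "Transcribe the six handwritten digits shown in one of these images."
print(f"{len(slideshow)} images, target hidden at index {position}")
# The image sequence plus the question would then be sent to the model together.
```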

“On real question-answering tasks over images, it seems to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning — recognizing that a number is in a frame and reading it — could be what’s breaking the model.”

Google promises too much with Gemini

Neither study has been peer-reviewed, nor do they test versions of Gemini 1.5 Pro and 1.5 Flash with 2 million token contexts (both tested versions with 1 million token contexts). And Flash isn’t intended to be as capable as Pro in terms of performance; Google markets it as a low-cost alternative.

However, both add fuel to the fire that Google has been over-promising, and under-delivering, with Gemini from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider to give the context window top billing in its advertisements.

“There is nothing wrong with the simple statement: ‘Our model can take X amount of tokens’ based on objective technical details,” Saxon said. “But the question is: what useful thing can be done with it?”

Generative AI, broadly speaking, is coming under increased scrutiny as companies (and investors) become increasingly frustrated by the technology’s limitations.

In a pair of recent surveys from Boston Consulting Group, roughly half of respondents (all C-suite executives) said they don’t expect generative AI to deliver substantial productivity gains and that they’re concerned about the potential for errors and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, early-stage generative AI dealmaking has declined, plummeting 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, sometimes clumsily, to catch up with its generative AI rivals, was desperate to make Gemini’s context one of those differentiators.

But the bet, it seems, was premature.

“We haven’t settled on a way to actually demonstrate that ‘reasoning’ or ‘understanding’ is taking place over long documents, and basically every group releasing these models is cobbling together their own ad hoc evaluations to make these claims,” Karpinska said. “Without knowing how long-context processing is implemented (and companies do not share these details), it is hard to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to exaggerated claims about generative AI are better benchmarks and, along the same lines, a greater emphasis on third-party criticism. Saxon notes that one of the most common tests for long context, “needle in the haystack,” which Google cites liberally in its marketing materials, only measures a model’s ability to retrieve particular pieces of information, such as names and numbers, from data sets, not to answer complex questions about that information.
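
For context, a needle-in-a-haystack probe is typically built along the lines of the sketch below: a single planted fact must be retrieved from otherwise irrelevant text, which says little about reasoning over the document as a whole. The filler text and “needle” here are made up for illustration.

```python
# Illustrative sketch of a "needle in the haystack" probe: plant one fact in a
# long stretch of filler text and ask the model to retrieve it.
import random

# ~20,000 filler sentences standing in for a long, irrelevant document.
filler = ["The quick brown fox jumps over the lazy dog."] * 20_000
needle = "The secret access code is 7421."

# Plant the needle at a random depth in the haystack.
depth = random.randrange(len(filler) + 1)
haystack = " ".join(filler[:depth] + [needle] + filler[depth:])

prompt = f"{haystack}\n\nQuestion: What is the secret access code?"
# A model that answers "7421" passes this retrieval probe, but that alone says
# nothing about whether it can answer complex questions spanning the document.
```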

“All of the scientists and most of the engineers who use these models essentially agree that our current benchmark culture is broken,” Saxon said, “so it’s important for the public to understand that these giant reports containing numbers like ‘general intelligence across benchmarks’ should be taken with a huge grain of salt.”