Writing Team · Dec 15, 2025 · 4 min read
Google announced the FACTS Benchmark Suite this week—a comprehensive evaluation framework testing how accurately large language models answer factual questions across four domains: parametric knowledge (internal training data), search-assisted queries, grounded responses (using provided context), and multimodal image-based questions. They've partnered with Kaggle to manage the benchmark, keeping a private held-out test set to prevent overfitting.
The results: Gemini 3 Pro leads with a 68.8% overall accuracy score. That's the best performance among 15 evaluated models. It's also a 31.2% failure rate on questions designed to test whether AI can deliver factually accurate information—the exact use case where Google says models are "increasingly becoming a primary source for information delivery."
Let's talk about what these numbers actually mean.
The suite includes 3,513 public examples across four categories. The Parametric Benchmark tests whether models can answer trivia-style questions using only internal knowledge—no external tools. Questions like "Who played harmonica on 'The Rockford Files' theme song?" where Wikipedia contains the answer and models should theoretically know it from training data.
The Search Benchmark evaluates how well models use web search tools to answer complex queries requiring multiple sequential fact retrievals. Example: "What is the sum of the birth years of the British boxer who defeated Vazik Kazarian at the 1960 Summer Olympics, the Moroccan boxer who also competed in the men's light welterweight event at those same Olympics, and the Danish boxer who competed in both the 1960 and 1964 Summer Olympics?" That's not a realistic user query, but it tests whether models can chain information retrieval correctly.
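To make the chaining point concrete, here is a minimal sketch of what a multi-hop retrieval loop looks like. The web_search function is a hypothetical stand-in for whatever search tool a model is given, not Google's harness or any real API; the structure is what matters, since each lookup depends on the previous answer and a single wrong hop poisons the final sum.

```python
# Minimal sketch of multi-hop retrieval. web_search is a hypothetical stand-in
# for a real search tool, not Google's evaluation harness or API.

def web_search(query: str) -> str:
    """Hypothetical search tool that returns a short text answer."""
    raise NotImplementedError("replace with a real search/retrieval call")

def sum_of_birth_years() -> int:
    # Hop 1: resolve each boxer from a distinct sub-question.
    boxers = [
        web_search("British boxer who defeated Vazik Kazarian at the 1960 Summer Olympics"),
        web_search("Moroccan boxer in the men's light welterweight event at the 1960 Summer Olympics"),
        web_search("Danish boxer who competed in both the 1960 and 1964 Summer Olympics"),
    ]
    # Hop 2: retrieve each birth year, which only works if hop 1 returned the right names.
    birth_years = [int(web_search(f"birth year of {name}")) for name in boxers]
    # Final step: arithmetic over the retrieved facts.
    return sum(birth_years)
```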
The Multimodal Benchmark shows models images and asks factual questions about them—testing visual grounding combined with parametric knowledge. The Grounding Benchmark v2 (an update to Google's previous work) measures whether models can accurately answer questions when relevant context is provided in the prompt.
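For a sense of how a suite like this turns per-question judgments into headline numbers, here is a rough sketch of per-category accuracy rolling up into an overall score. The unweighted mean across categories is an assumption made for illustration; the post does not specify how Google weights the four benchmarks into the FACTS Score.

```python
# Illustrative scoring roll-up. The category names mirror the suite; the
# aggregation (unweighted mean) is an assumption, not Google's published method.

results = {
    # category -> per-question judgments (True = judged factually correct)
    "parametric": [True, False, True, True],
    "search":     [True, True, False, False],
    "multimodal": [False, False, True, False],
    "grounding":  [True, True, True, False],
}

per_category = {cat: sum(graded) / len(graded) for cat, graded in results.items()}

# Assumption: the overall score is a simple mean of the four category accuracies.
overall = sum(per_category.values()) / len(per_category)

for cat, acc in per_category.items():
    print(f"{cat:>10}: {acc:.1%}")
print(f"{'overall':>10}: {overall:.1%}")
```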
All four benchmarks matter because they represent different ways users actually interact with AI systems. The problem is that no model performs well across all four.
Gemini 3 Pro achieved the highest FACTS Score at 68.8%, showing particular strength in the Search and Parametric benchmarks compared to its predecessor, Gemini 2.5 Pro. Google reports they reduced error rates by 55% on Search and 35% on Parametric between versions. That's meaningful improvement within Google's own model family.
But the absolute numbers tell a different story. The Multimodal Benchmark—where models answer questions about images—saw the lowest scores across all evaluated models. Google notes that "all evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress." That's diplomatic phrasing for "none of these systems are reliable enough to trust without verification."
Google also cites SimpleQA Verified, another factuality benchmark, where Gemini 3 Pro scored 72.1% compared to Gemini 2.5 Pro's 54.5%. Again, substantial improvement within their model family. Still not accurate enough to use as a primary information source without human oversight.
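The 55% and 35% error-rate reductions cited above and this SimpleQA Verified jump are two views of the same arithmetic: treat the error rate as one minus accuracy and compare the relative change. A quick check using only the figures quoted in this post:

```python
# Relative error-rate reduction, computed from the SimpleQA Verified scores above.
# Error rate is taken as 1 - accuracy; "reduction" is relative, not percentage points.

old_acc, new_acc = 0.545, 0.721            # Gemini 2.5 Pro vs. Gemini 3 Pro
old_err, new_err = 1 - old_acc, 1 - new_acc

reduction = (old_err - new_err) / old_err
print(f"error rate: {old_err:.1%} -> {new_err:.1%}")   # 45.5% -> 27.9%
print(f"relative reduction: {reduction:.1%}")          # ~38.7%
```

A large relative drop in errors, in other words, can coexist with a failure rate that is still far too high to leave unverified.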
The FACTS suite focuses on questions with objectively verifiable answers—trivia, historical facts, visual identification tasks, and multi-step retrieval problems. These are the easiest cases for factuality evaluation because ground truth exists and can be confirmed. They're also a narrow subset of how people actually use AI systems.
Users ask models for analysis, interpretation, recommendations, and synthesis—tasks where "factuality" gets murky because there's no single correct answer. The FACTS benchmarks don't test whether models hallucinate statistics, misrepresent sources, or confidently assert incorrect causal relationships in domains where verification requires expertise. They test whether models can answer questions that Wikipedia or a web search could definitively resolve.
That's useful as far as it goes. It doesn't go very far.
Google positions this work as addressing the challenge of AI systems becoming primary information sources. But factuality benchmarks measure accuracy on discrete, verifiable questions—not whether models provide misleading context, omit relevant information, or present contested claims as settled fact.
A model can score 70% on FACTS and still generate plausible-sounding explanations that are subtly wrong in ways users can't detect without domain expertise. It can accurately retrieve facts while completely misunderstanding the question. It can answer correctly on the benchmark while failing on structurally similar queries that fall outside the test distribution.
Google acknowledges this indirectly by noting that "LLM factuality is still an area of ongoing research." They're measuring progress on a problem they haven't fully solved—and may not be able to solve with current architectures.
Every leading model evaluated on FACTS—including GPT-4, Claude, Llama, and others—scored below 70%. Some significantly below. This isn't a Google-specific issue or a competitive advantage for Gemini. It's an industry-wide limitation that all frontier labs are working on and none have overcome.
The benchmark exists because factuality remains unreliable across all models despite years of research on hallucination mitigation, retrieval-augmented generation, and post-training refinement. Google's work here is valuable for measuring incremental progress. It doesn't change the fundamental constraint: you cannot trust any current AI system to deliver consistently accurate information without verification.
That constraint matters enormously for real-world deployment. If models fail 30% of the time on carefully curated benchmark questions, they'll fail more often on messy, ambiguous, or adversarial queries users actually ask. And users won't know which 30% is wrong unless they already know the answer—at which point they don't need the model.
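That failure rate also compounds across a session. A back-of-the-envelope calculation, assuming for simplicity that errors are independent and occur at benchmark-level accuracy:

```python
# Chance that every answer in a short session is correct, assuming independent
# errors at a fixed per-answer accuracy. A simplification, but it shows the trend.

per_answer_accuracy = 0.70

for n in (1, 3, 5, 10):
    all_correct = per_answer_accuracy ** n
    print(f"{n:>2} answers: {all_correct:.1%} chance all are correct")
# Roughly 70%, 34%, 17%, and 3% respectively.
```

Under that simplification, even a two- or three-question session is more likely than not to contain at least one wrong answer.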
Google frames the FACTS Benchmark Suite as advancing the field toward "better and more accurate models." Fair enough. But practitioners need to understand that "better" still means "unreliable without verification." A 68.8% accuracy score on factual questions is not good enough to replace search, reference materials, or human judgment—which is why Google isn't positioning it that way.
The honest takeaway: factuality benchmarks help measure progress. They don't guarantee reliability. Every model will improve on these metrics over time. None will reach 100%, and even high scores won't eliminate the need for verification in contexts where accuracy matters.
If you're evaluating AI tools for your business and need help understanding what accuracy metrics actually mean in practice, Winsome's team can walk you through the gap between benchmarks and real-world performance.