The Test: One Simple Question, Three Popular AI Tools

Barry deliberately kept his prompt simple and conversational — the way most real users actually interact with AI. No fine-tuned instructions. No paid tiers. No custom system prompts. Just the free versions of ChatGPT, Gemini, and Claude, asked a question millions of car shoppers type into search bars every month.
This was the right methodological choice. It mirrors how the average person uses these tools: quickly, casually, and with high expectations. The results were instructive precisely because of that realism.
What the AI Tools Got Right

To their credit, all three models pointed users toward genuinely strong options. The Toyota Grand Highlander Hybrid appeared consistently across recommendations, and the explanations accompanying it were coherent and largely accurate. For a first-pass orientation to a complex product category, that is not nothing.
The tools also demonstrated a useful ability to compare vehicles side by side, generate readable summaries, and explain trade-offs in plain language. For a buyer who knows nothing about 3-row SUVs, this kind of structured overview can be a reasonable starting point — provided they treat it as exactly that: a starting point.
Where the AI Tools Failed — and Why It Matters

The failures were not minor. They were systematic, and in a high-stakes purchase context, they would be costly.
Hallucinated Vehicles and Discontinued Models

ChatGPT recommended a “2026 Cadillac XT6” — a model discontinued for the U.S. market in 2025 — and a “Lexus TX350h,” which does not exist as a hybrid configuration. Gemini generated options that had no real-world counterpart at all. These are not edge cases. They are the kind of errors that send a buyer to a dealership asking for a car that cannot be purchased.
Misattributed Reliability Scores

Gemini cited an Autoblog article to claim the Lexus GX received a “perfect 100/100 reliability score from Consumer Reports.” The actual Consumer Reports predicted reliability score for the GX was 46/100 — placing it among the least reliable Lexus models on the market. The source article existed, but Gemini had misread it, misapplied its findings, and presented the error with full confidence.
Low-Quality and Fabricated Sources

ChatGPT sourced vehicle specifications from a website that simultaneously advertised itself as a church, a clothing retailer, a cryptocurrency exchange, a pawn shop, and a zombie response team. This is not a fringe failure mode — it reflects a structural limitation. As Consumer Reports’ head of fact-checking Tracy Anderman put it:
“AI tools are not always able to distinguish between what’s credible information and what isn’t. They can just be flat out wrong. They’ll be really nice to you about it, but they can still be flat out wrong.”
Model Year Confusion

All three tools mixed up 2024, 2025, and 2026 model years in their comparison charts — a particularly serious error when a vehicle has recently been redesigned. Specifications, pricing, and reliability data can shift significantly between model years. A chart that looks authoritative but draws from the wrong year is worse than no chart at all.
The Core Problem: Language Machines Posing as Knowledge Machines

Understanding why these failures happen is essential for using AI tools intelligently.
Juan Ricafort, senior product manager for AskCR — Consumer Reports’ own AI-powered agent — frames it precisely:
“The key word with large language models is ‘language.’ LLMs are language machines, not knowledge machines.”
They are optimized to produce fluent, coherent text. They are not optimized to be correct about which trim level of a specific SUV comes with a hybrid powertrain in a given model year.
AI expert Dave Birss extends this point:
“It’s being rewarded for fluency, not for accuracy.”
When a model has to fill a gap in its training data, it typically does not flag uncertainty. Instead, it generates the statistically most plausible-sounding answer. In a domain as specification-dense as automotive purchasing, that behavior produces confident errors at scale.
The analogy Ricafort offers is sharp: LLMs are good at Wheel of Fortune — predicting which letter comes next to complete a pattern. They are not good at The Price Is Right — knowing what a car actually costs, or which version of it exists.
Prompt Quality Changes the Output — But Does Not Eliminate the Risk

A weak prompt produces weaker results. Asking for the “most reliable” SUV invites subjectivity. Asking for the “best” invites scope creep. Gemini’s occasional recommendation of a $90,000 BMW X7 — a vehicle outside most buyers’ budgets — was a direct consequence of the vague framing.
Better prompts produce better outputs. Birss recommends specificity over generality: define a price range, cite a safety testing standard like IIHS, specify the number of rows or cargo requirements, and explicitly request reliability data from named sources. Asking the model to rely only on primary sources — manufacturer websites, established automotive publications — measurably improved result quality in Barry’s testing.
But prompt engineering has limits. Even well-structured prompts cannot prevent a model from hallucinating a vehicle that does not exist, or from misreading a source article and inverting its findings. The responsibility for verification cannot be fully engineered away.
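The specificity Birss recommends can be made concrete with a small prompt template. The following is a minimal sketch, not a method taken from the test itself: the `CarSearch` fields, the price range, and the exact wording of each instruction are illustrative assumptions. It only shows how a price range, a safety standard, named reliability sources, and a primary-source restriction might be stated explicitly rather than left to the model to guess.

```python
from dataclasses import dataclass, field


@dataclass
class CarSearch:
    """Buyer constraints to state explicitly in the prompt (all fields are illustrative)."""
    vehicle_type: str
    max_price_usd: int
    min_price_usd: int = 0
    safety_standard: str = "IIHS"
    reliability_sources: list[str] = field(default_factory=lambda: ["Consumer Reports"])
    must_haves: list[str] = field(default_factory=list)


def build_prompt(search: CarSearch) -> str:
    """Turn the constraints into one specific, source-restricted prompt."""
    lines = [
        f"Recommend {search.vehicle_type} models priced between "
        f"${search.min_price_usd:,} and ${search.max_price_usd:,}.",
        f"Only include vehicles with published {search.safety_standard} safety ratings.",
        "Cite reliability data only from: " + ", ".join(search.reliability_sources) + ".",
        "Rely only on primary sources such as manufacturer websites and "
        "established automotive publications, and link each one.",
        "State the model year for every specification you give.",
        "If you are unsure whether a model or trim exists, say so.",
    ]
    if search.must_haves:
        lines.append("Requirements: " + "; ".join(search.must_haves) + ".")
    return "\n".join(lines)


# Hypothetical example values; substitute your own budget and requirements.
search = CarSearch(
    vehicle_type="3-row hybrid SUV",
    min_price_usd=40_000,
    max_price_usd=60_000,
    must_haves=["seats seven", "usable third row"],
)
print(build_prompt(search))
```

A prompt built this way narrows the question, but as noted above it does not guarantee that the answer is correct. Every model, trim, and score in the response still needs checking.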
Where AI Genuinely Adds Value in Car Buying

Despite the failures documented above, AI tools are not useless in this context. They are simply misapplied when treated as authoritative recommendation engines.
Clarifying Your Own Requirements

If you do not yet know what you want, AI tools can function as a structured interview partner. Birss’ business partner Helena McAleer used this approach — letting an AI model ask her questions about her needs, budget, and priorities — and ended up purchasing a vehicle she had not initially considered, one that proved to be a strong fit. The AI did not choose the car. It helped her articulate what she was actually looking for.
Negotiation Preparation

Once you have identified a specific make, model, and trim through verified sources, AI tools can generate negotiation scripts tailored to counter high-pressure dealership tactics. This is a task that plays to the genuine strengths of language models: producing fluent, persuasive text on demand. Some paid services, such as CarEdge, take this further by using AI to negotiate with salespeople over email on a buyer’s behalf.
Initial Orientation

For buyers who are entirely new to a vehicle category, AI-generated overviews can provide a useful first map of the landscape — provided every specific claim is verified against primary sources before acting on it.
A Practical Framework for AI-Assisted Car Research

Given the evidence, here is how to use AI tools responsibly in a car-buying workflow:
- Use AI to explore, not to conclude. Let it help you understand a category, generate questions to ask a dealer, or draft a negotiation script. Do not let it make the final recommendation.
- Write specific prompts. Include price range, intended use, safety priorities, and reliability requirements. Name the sources you want the model to draw from.
- Verify every specific claim. Model years, trim configurations, reliability scores, and pricing must be confirmed against manufacturer websites, established automotive publications, and independent testing organizations.
- Treat AI sources as leads, not facts. When a model cites an article or a score, click through and read the original. As the Lexus GX reliability score in the Consumer Reports test showed, the summary and the source can contradict each other entirely.
- Reserve judgment for yourself. As Consumer Reports’ chief data officer Suman Veeramalla states directly: “AI can inform your decision, but it cannot make it for you. Don’t let AI replace your judgment.”
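For readers who like to keep notes in a structured form, the "leads, not facts" habit can be sketched as a simple checklist. This is an illustrative sketch only: the `Claim` class and its status labels are hypothetical names, and the second entry is an invented placeholder. The Lexus GX figures are the ones reported in the test above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """One specific claim from an AI answer, plus what the original source says."""
    statement: str
    ai_value: str
    source_value: Optional[str] = None  # filled in only after reading the source yourself

    def status(self) -> str:
        if self.source_value is None:
            return "unverified lead"
        if self.source_value == self.ai_value:
            return "confirmed"
        return "contradicted"


def safe_to_act_on(claims: list[Claim]) -> bool:
    """True only when every claim has been checked and confirmed."""
    return all(c.status() == "confirmed" for c in claims)


claims = [
    # AI-reported value vs. the value found in the original source.
    Claim("Lexus GX predicted reliability (Consumer Reports)", "100/100", "46/100"),
    # Hypothetical claim that has not been checked yet.
    Claim("Hybrid powertrain available on chosen trim", "yes"),
]
for c in claims:
    print(f"{c.status():16} {c.statement}")
print("Safe to act on:", safe_to_act_on(claims))
```

The point of the design is that a claim starts as an unverified lead by default and only becomes usable once a person has read the original source and recorded what it actually says.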
The Verdict

ChatGPT, Gemini, and Claude are capable of providing a useful first orientation to the 3-row SUV market. They are not capable of reliably replacing the structured, source-verified research that a major purchase demands. The gap between what they appear to know and what they actually know is wide enough to send a buyer toward a discontinued model, a nonexistent trim, or a vehicle whose real reliability score is far below what the AI claimed.
The right mental model comes from Anderman’s analogy: these tools are the interesting but somewhat unhinged acquaintance at a party. You would not spend tens of thousands of dollars on their word alone. You would take the conversation as a starting point, then verify everything with people and sources you actually trust.
AI tools can map the landscape. The judgment, and the test drive, remain yours.