When AI Sounds Smart But Gets the Numbers Wrong

Why general-purpose AI tools fail the precision test for financial advisors

By DeepVest Research


The Problem Advisors Are Facing

Across the wealth management industry, a quiet trend is emerging: financial advisors are using AI tools like ChatGPT, Perplexity, and Gemini for portfolio analysis. The tools are fast, confident, and impressively articulate.

They’re also frequently wrong.

Consider this scenario (a composite based on conversations with multiple RIAs):

An advisor uses ChatGPT to calculate tax-loss harvesting opportunities for a $2M client portfolio. The AI estimates $12,400 in potential tax savings. Specific, detailed, confident. The advisor presents these numbers in a client meeting.

The accountant finds errors: miscalculated cost basis on three positions, wash sale rules completely ignored. Actual potential savings: $7,800.

The $4,600 difference isn’t enough to trigger a lawsuit. But it’s enough to damage the relationship. The client questions whether they can trust the advisor’s other recommendations. In wealth management, trust is the product.

This advisor isn’t alone. We decided to find out just how widespread the problem is.

We Put It to the Test

Between January and February 2026, we evaluated six AI platforms, the five general-purpose tools that financial advisors most commonly use or encounter plus DeepVest:

  • ChatGPT 5.2 (OpenAI): most widely used general-purpose AI
  • Perplexity Pro: AI marketed for research and analysis
  • Gemini 3.0 (Google): latest enterprise AI from Google
  • SuperGrok: X-based AI
  • Claude Opus 4.6 (Anthropic): latest frontier model from Anthropic
  • DeepVest: purpose-built portfolio intelligence platform for financial advisors

We tested them on ten real investment workflows, the kind portfolio managers and financial advisors handle every day, including:

  • Maximum drawdown calculations
  • Rolling correlations
  • Portfolio backtesting
  • Fundamental analysis
  • Stock screening
  • Mean-variance optimization
  • Macro stress testing
  • Event backtesting
  • Technical analysis

For each test, we compared the AI’s output against ground truth: either calculated independently using verified Python scripts or cross-referenced against Bloomberg data.
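A comparison of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the scoring code used in the study: the function names and the 1% tolerance are assumptions, and the sample figures are the 60/40 total-return values from the Test 2 table later in this article.

```python
def relative_error(reported: float, truth: float) -> float:
    """Absolute error as a fraction of the ground-truth value."""
    if truth == 0:
        raise ValueError("relative error is undefined when truth is 0")
    return abs(reported - truth) / abs(truth)


def passes(reported, truth, tolerance=0.01):
    """True when a reported value is within `tolerance` of ground truth.

    `None` stands for "no result" and always fails.
    The 1% default is an assumed threshold, not the study's.
    """
    if reported is None:
        return False
    return relative_error(reported, truth) <= tolerance


# 60/40 total return (%), taken from the Test 2 table.
truth = 120.39
print(passes(120.97, truth))  # DeepVest
print(passes(90.89, truth))   # SuperGrok
print(passes(None, truth))    # ChatGPT 5.2 (NA)
```

Treating a missing answer as a failure mirrors how the article counts "no results at all" alongside incorrect calculations.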

The results: General-purpose AI tools failed 85% of tasks, producing incorrect calculations, hallucinated data, or no results at all.

What We Found: Three Critical Examples

Let’s illustrate with three tests that every financial advisor should care about.

Test 1: Rolling Correlations (Diversification Analysis)

The Task: Calculate the rolling correlation between Bitcoin and the S&P 500 from January 1, 2018 to January 16, 2026 using a 252-day lookback window.

Understanding how assets move together is fundamental to portfolio construction and risk management. Rolling correlations reveal changing relationships over time, essential for diversification decisions.
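For readers who want to see the mechanics, here is a minimal standard-library sketch of a rolling correlation. It is an illustration under stated assumptions, not the script used for ground truth: the price series are synthetic random walks, the names `asset_a` and `asset_b` are hypothetical, and the correlation is taken on daily simple returns over dates both series share (one asset trades every calendar day, the other only on weekdays).

```python
import math
import random
from datetime import date, timedelta


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(vx * vy)


def rolling_correlation(prices_a, prices_b, window=252):
    """Rolling correlation of daily returns on dates both series share.

    prices_a, prices_b: dict mapping date -> closing price.
    Returns a list of (date, correlation), one entry per full window.
    """
    dates = sorted(set(prices_a) & set(prices_b))
    pairs = list(zip(dates, dates[1:]))
    ret_a = [prices_a[d1] / prices_a[d0] - 1 for d0, d1 in pairs]
    ret_b = [prices_b[d1] / prices_b[d0] - 1 for d0, d1 in pairs]
    out = []
    for end in range(window, len(ret_a) + 1):
        corr = pearson(ret_a[end - window:end], ret_b[end - window:end])
        out.append((dates[end], corr))  # date of the window's last return
    return out


# Synthetic data only: two random walks sharing a common shock.
random.seed(0)
start = date(2018, 1, 1)
asset_a, asset_b = {}, {}
price_a = price_b = 100.0
for i in range(600):
    day = start + timedelta(days=i)
    shock = random.gauss(0, 0.01)
    price_a *= 1 + shock + random.gauss(0, 0.03)
    asset_a[day] = price_a            # trades every calendar day
    if day.weekday() < 5:             # trades on weekdays only
        price_b *= 1 + shock + random.gauss(0, 0.01)
        asset_b[day] = price_b

series = rolling_correlation(asset_a, asset_b, window=252)
print(len(series), series[-1])
```

Aligning on shared dates before computing returns is the step that matters most here: skipping it silently pairs returns from different days and yields a number that looks plausible but is wrong.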

Rolling Correlations from Jan 1, 2018 to Jan 16, 2026. NA = Unable to compute, H = Hallucinations

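| Category | Ground Truth | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| Jan 16 2026 correl | 0.10 | 0.10 | NA | NA | NA | NA | H |
| Jan 16 2026 correl | 0.35 | 0.35 | NA | NA | NA | NA | H |
| Time | | 47s | 2m 46s | 5m 11s | 1m 8s | 20s | 2m 53s |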

ChatGPT 5.2 spent 2 minutes 46 seconds trying various approaches. Its final attempt? Pulling Bitcoin prices from FRED (Federal Reserve Economic Data), which doesn’t track cryptocurrency. This is like looking up a restaurant review in a hardware store catalog.

SuperGrok spent over 5 minutes iterating and debugging, ultimately outputting monthly correlations instead of the requested daily data.

Gemini 3.0 created an approximated chart with this disclaimer: “This chart is an approximation derived from historical financial reports and correlation studies, as real-time daily data for the entire range was not directly accessible.” In other words: the numbers were made up.

Perplexity Pro stated the task was impossible and provided non-functional code.

Claude Opus 4.6 fabricated data to produce approximate results—essentially hallucinations built on hallucinations.

DeepVest computed the correct correlation values in 47 seconds.

Why This Matters: Advisors rely on correlation analysis for portfolio diversification. Using approximated data or pulling from wrong sources can lead to portfolios that appear diversified but concentrate risk in ways that only become apparent during market stress.

Test 2: Portfolio Backtesting (The Numbers You Show Clients)

The Task: Backtest two portfolios from January 1, 2018 to January 16, 2026:

  • Portfolio A: 60% equities / 40% bonds (classic allocation)
  • Portfolio B: 60% equities / 35% bonds / 5% Nvidia

Portfolio backtesting is central to investment management. Advisors use historical performance to validate strategies before recommending them to clients.
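The core arithmetic of such a backtest is short. The sketch below is illustrative only and makes several assumptions the article does not state: weights are rebalanced every day, returns are annualized over 252 trading days, the risk-free rate defaults to zero, and the four-day return series are made-up numbers rather than market data.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization convention


def backtest(returns_by_asset, weights, risk_free_rate=0.0):
    """Backtest a fixed-weight portfolio rebalanced every day.

    returns_by_asset: dict asset -> list of daily simple returns.
    weights: dict asset -> portfolio weight; must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n = len(next(iter(returns_by_asset.values())))
    daily = [
        sum(w * returns_by_asset[asset][t] for asset, w in weights.items())
        for t in range(n)
    ]
    growth = math.prod(1 + r for r in daily)
    years = n / TRADING_DAYS
    excess = [r - risk_free_rate / TRADING_DAYS for r in daily]
    sharpe = (statistics.mean(excess) / statistics.stdev(excess)
              * math.sqrt(TRADING_DAYS))
    return {
        "total_return": growth - 1,
        "annualized_return": growth ** (1 / years) - 1,
        "sharpe_ratio": sharpe,
    }


# Made-up daily returns, for illustration only.
returns = {
    "equities": [0.010, -0.005, 0.007, 0.002],
    "bonds": [0.001, 0.002, -0.001, 0.000],
    "single_stock": [0.030, -0.020, 0.015, 0.005],
}
portfolio_a = backtest(returns, {"equities": 0.60, "bonds": 0.40})
portfolio_b = backtest(
    returns, {"equities": 0.60, "bonds": 0.35, "single_stock": 0.05}
)
print(portfolio_a)
print(portfolio_b)
```

Choices such as rebalancing frequency and the risk-free rate shift the reported figures, which is one reason a backtest should state them alongside its results.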

60/40 Model Portfolios from Jan 1, 2018 to Jan 16, 2026. NA = Unable to compute

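| Category | Ground Truth | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| 60/40 total return | 120.39% | 120.97% | NA | 90.89% | 112.1% | NA | NA |
| 60/40 avg return | 10.37% | 10.77% | NA | 8.38% | ~9.8% | NA | 4.7% |
| 60/40 Sharpe ratio | 0.81 | 0.81 | NA | 0.48–0.52 | ~0.70 | NA | 0.41 |
| 60/35/5 total return | 302.31% | 304.15% | NA | 273.17% | 259.6% | NA | NA |
| 60/35/5 avg return | 18.98% | 19.32% | NA | 17.81% | ~17.3% | NA | 11.6% |
| 60/35/5 Sharpe ratio | 0.99 | 0.99 | NA | 0.78–0.84 | ~0.80 | NA | 0.73 |
| Time | | 47s | 4m 2s | 55s | 1m 13s | 39s | 4m 59s |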

ChatGPT 5.2 spent over 4 minutes attempting to connect to various data sources before failing completely.

SuperGrok used a poor-quality data source, resulting in total-return figures roughly 30 percentage points below ground truth and Sharpe ratios that appeared hallucinated.

Gemini 3.0 returned most results with the "∼" approximation symbol, and even the approximations were materially wrong, with total returns off by 6–14% relative to ground truth.

Perplexity Pro randomly searched news websites and returned no calculations.

Claude Opus 4.6 attempted to scrape random data while writing code on the fly, producing a confidently wrong result.

Why This Matters: Showing a client historical performance that’s off by 30 percentage points isn’t just unprofessional. If you’re basing current recommendations on this flawed analysis, you’re making portfolio decisions that could materially harm client outcomes.

This is the kind of data advisors put in client proposals. Getting it wrong doesn’t just lose the deal. It damages your credibility.

Test 3: Fundamental Analysis (The Data You Trust)

The Task: Perform fundamental competitor analysis for Robinhood (HOOD) using P/E ratio, Price/Book, ROE, and Market Cap. Compare current valuation (February 6, 2026) to historical valuation (June 1, 2024).

Access to accurate fundamental data is the foundation of equity analysis.

Fundamental results for Robinhood: current (as run Feb 6, 2026) versus June 1, 2024

| Category | Ground Truth (Bloomberg) | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| PE Ratio (2/6/26) | 34.73 | 33.55 | 34.64 | 34.51 | 34.4x | ~34.4 | ~35–37x |
| PE Ratio (6/1/24) | 143.6 | 142.27 | ~27.05 | 68.82 | ~70.9x | NA | ~134x |
| P/B Ratio (2/6/26) | 8.62 | 8.59 | 8.69 | 8.69 | 9.9x | ~9–10 | ~8.5–9.0x |
| P/B Ratio (6/1/24) | 2.67 | 2.63 | ~2.47 | 10.30 | NA | NA | ~2.2–2.5x |
| Market Cap (2/6/26) | $74.4B | $74.4B | $74.4B | $74.47B | $72.5B | ~74–75B | ~$68B |
| Market Cap (6/1/24) | $18.76B | $18.74B | ~18.36B | ~$20.4B | ~$18.2B | NA | ~$15.5B |
| ROE (2/5/26) | 27.82% | 27.82% | -7.93% | 27.82% | ~10–12% | 17–27% | ~22% |
| ROE (6/1/24) | 1.82% | 1.82% | 27.82% | 4.25% | NA | NA | ~10.7% |
| Time | | 46s | 1m 23s | 3m 33s | 12s | 21s | 5m 22s |

ChatGPT 5.2 showed current ROE as negative 7.93% when it’s actually positive 27.82%. Even worse, it reported the current ROE (27.82%) for the historical date, which suggests it was not retrieving point-in-time data at all.

SuperGrok understated historical P/E by more than half (reported 68.82 vs. actual 143.6).

Gemini 3.0 approximated most metrics, with errors ranging from 13% to 60%.

Perplexity Pro had mixed results on current data but completely failed on historical comparisons.
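The historical failures above come down to point-in-time lookups: a query for June 1, 2024 must return what was known on that date, not today's figure. A minimal sketch of such an "as of" lookup follows. The function name is hypothetical, and the two observations are the ROE values from the ground-truth column of the table above.

```python
from bisect import bisect_right
from datetime import date


def value_as_of(observations, as_of):
    """Return the latest observation dated on or before `as_of`.

    observations: list of (date, value) pairs sorted by date.
    Returns None when nothing had been observed yet.
    """
    dates = [d for d, _ in observations]
    i = bisect_right(dates, as_of)
    return observations[i - 1][1] if i else None


# ROE (%) for HOOD on the two dates in the ground-truth column.
roe = [(date(2024, 6, 1), 1.82), (date(2026, 2, 5), 27.82)]

print(value_as_of(roe, date(2024, 6, 1)))  # 1.82
print(value_as_of(roe, date(2026, 2, 6)))  # 27.82
print(value_as_of(roe, date(2023, 1, 1)))  # None
```

Returning `None` for dates before the first observation, rather than falling back to the newest value, is the design choice that prevents the kind of mix-up seen in the ChatGPT row.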

Why This Matters: Presenting a client with a P/E ratio that’s off by more than half or an ROE that’s negative when it’s actually positive destroys trust. If you’re charging advisory fees based on this analysis, the question is: what are your clients actually paying for?

The Pattern: Why General AI Fails Portfolio Work

Across all ten use cases we tested, a consistent pattern emerged:

  • Wrong Data Sources: ChatGPT tried to pull Bitcoin prices from FRED (Federal Reserve Economic Data). The Fed doesn’t track cryptocurrency. This isn’t a small error. It’s using completely inappropriate data sources with absolute confidence.
  • Hallucinated Numbers: Gemini provided results with the "∼" approximation symbol, admitting uncertainty but still presenting numbers. When we checked them against ground truth, errors ranged from 13% to 60%. These aren’t rounding differences. They’re material misstatements.
  • Complete Failures: ChatGPT and Perplexity simply admitted they couldn’t compute many results. That’s honest, but it defeats the purpose of using AI in the first place.
  • Non-Repeatable Results: When we asked the same question multiple times, frontier models frequently gave different answers. This reflects their probabilistic nature, but it’s a fatal flaw for professional financial work where consistency and auditability are required.
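Repeatability in particular is easy to check mechanically. The following is a hedged sketch rather than the procedure used in the study: `is_repeatable` is a hypothetical helper, and `noisy` uses random jitter merely as a stand-in for a probabilistic model's varying answers.

```python
import random


def is_repeatable(compute, runs=5, tolerance=1e-9):
    """Call `compute` several times and check that every result agrees.

    Returns (ok, results), where ok is True when the spread between
    the largest and smallest result is within `tolerance`.
    """
    results = [compute() for _ in range(runs)]
    spread = max(results) - min(results)
    return spread <= tolerance, results


def deterministic():
    """Total return computed directly from a fixed price list."""
    prices = [100.0, 102.0, 101.0, 105.0]
    return prices[-1] / prices[0] - 1


def noisy():
    """Stand-in for a probabilistic tool: same question, varying answer."""
    return 0.05 + random.uniform(-0.01, 0.01)


print(is_repeatable(deterministic)[0])
print(is_repeatable(noisy)[0])
```

A check like this does not show that an answer is correct, only that it is stable, so it complements rather than replaces comparison against ground truth.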

What This Means for Financial Advisors

The implications are clear: general-purpose AI tools are not suitable for professional investment work where accuracy, repeatability, and regulatory compliance are required.

These tools were built for consumers, not fiduciaries. They:

  • Lack access to verified financial datasets
  • Can’t provide audit trails for compliance
  • Don’t guarantee repeatability
  • May use your client data to train their models
  • Have no accountability for errors

The gap isn’t just accuracy—it’s the entire infrastructure required for fiduciary-level work.

The Real Cost of AI Errors

It’s not just about wrong numbers. Consider what happens when:

  • A compliance audit fails because you can’t produce documentation of how AI arrived at recommendations. SEC examiners expect you to explain and defend every investment decision. “The AI said so” isn’t a defense.
  • A client’s accountant catches errors in your AI-generated analysis. Even if no money is lost, trust is damaged. In a referral-based business, reputation is everything.
  • You waste hours verifying AI output to ensure accuracy. At that point, you’re not saving time — you’re adding risk and operational overhead.
  • You make portfolio decisions on false premises. Wrong correlation data leads to portfolios that appear diversified but concentrate risk. Incorrect backtesting validates strategies that shouldn’t be recommended.

Want the Complete Analysis?

This article covers 3 of the 10 investment workflows we tested. The patterns we found extend across every type of portfolio task, from simple calculations to sophisticated optimization.

Exclusive Research Report

Download the Full Whitepaper

The complete research report includes:

  • All 10 use case comparisons with detailed data tables and results
  • Complete test methodology — How we validated results against Bloomberg and verified calculations
  • The repeatability problem — Why frontier models produce different answers to the same question
  • Detailed analysis of all failures — What went wrong with each platform across all tests
  • Full data tables for rolling correlations, stock screening, event backtesting, portfolio optimization, macro stress testing, and technical analysis
  • Time comparisons — How long each platform took (and when they failed completely)
  • Ground truth validation — How we ensured our baseline measurements were accurate

About This Research

This study was conducted by DeepVest in January and February 2026 to evaluate the suitability of AI tools for professional investment management workflows. All testing was performed using the latest available versions of each platform.

DeepVest is a purpose-built portfolio intelligence platform designed specifically for financial advisors. Unlike general-purpose AI tools, DeepVest provides institutional-grade data, complete audit trails, and repeatable results built for fiduciary-level work.

For questions about this research: contact [email protected].

