When AI Sounds Smart But Gets the Numbers Wrong
Why general-purpose AI tools fail the precision test for financial advisors
By DeepVest Research
The Problem Advisors Are Facing
Across the wealth management industry, a quiet trend is emerging: financial advisors are using AI tools like ChatGPT, Perplexity, and Gemini for portfolio analysis. The tools are fast, confident, and impressively articulate.
They’re also frequently wrong.
Consider this scenario (a composite based on conversations with multiple RIAs):
An advisor uses ChatGPT to calculate tax-loss harvesting opportunities for a $2M client portfolio. The AI estimates $12,400 in potential tax savings. Specific, detailed, confident. The advisor presents these numbers in a client meeting.
The accountant finds errors: miscalculated cost basis on three positions, wash sale rules completely ignored. Actual potential savings: $7,800.
The $4,600 difference isn’t enough to trigger a lawsuit. But it’s enough to damage the relationship. The client questions whether they can trust the advisor’s other recommendations. In wealth management, trust is the product.
This advisor isn’t alone. We decided to find out just how widespread the problem is.
We Put It to the Test
Between January and February 2026, we evaluated six AI platforms: five general-purpose tools that financial advisors commonly use or encounter, and DeepVest itself.
- ChatGPT 5.2 (OpenAI) — Most widely used general-purpose AI
- Perplexity Pro — AI marketed for research and analysis
- Gemini 3.0 (Google) — Latest enterprise AI from Google
- SuperGrok (xAI) — X-based AI
- Claude Opus 4.6 (Anthropic) — Latest frontier model from Anthropic
- DeepVest — Purpose-built portfolio intelligence platform for financial advisors
We tested them on ten real investment workflows, the kind portfolio managers and financial advisors handle every day, including:
- Maximum drawdown calculations
- Rolling correlations
- Portfolio backtesting
- Fundamental analysis
- Stock screening
- Mean-variance optimization
- Macro stress testing
- Event backtesting
- Technical analysis
For each test, we compared the AI’s output against ground truth: either calculated independently using verified Python scripts or cross-referenced against Bloomberg data.
The results: General-purpose AI tools failed 85% of tasks, producing incorrect calculations, hallucinated data, or no results at all.
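The comparison against ground truth described above can be sketched in a few lines of Python. This is an illustration only, not the scoring code used in the study: the 5% tolerance and the function names `relative_error` and `passes` are assumptions introduced here, and the sample figures are taken from the Robinhood P/E (6/1/24) row in Test 3.

```python
def relative_error(reported: float, truth: float) -> float:
    """Absolute error expressed as a fraction of the ground-truth value."""
    if truth == 0:
        raise ValueError("relative error is undefined when ground truth is zero")
    return abs(reported - truth) / abs(truth)


def passes(reported, truth, tolerance=0.05):
    """Hypothetical pass/fail rule: a missing result (None) fails outright,
    otherwise the result must land within `tolerance` of ground truth."""
    return reported is not None and relative_error(reported, truth) <= tolerance


# Ground truth and reported values from the P/E (6/1/24) row in Test 3.
truth = 143.6
for platform, reported in [("DeepVest", 142.27), ("SuperGrok", 68.82), ("Perplexity Pro", None)]:
    print(platform, "pass" if passes(reported, truth) else "fail")
```

Treating "no result" as a failure matches how the article counts tasks where a tool returned NA.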
What We Found: Three Critical Examples
Let’s illustrate with three tests that every financial advisor should care about.
Test 1: Rolling Correlations (Diversification Analysis)
The Task: Calculate rolling correlation between Bitcoin and S&P 500 from January 1, 2018 to January 16, 2026 using a 252-day lookback window.
Understanding how assets move together is fundamental to portfolio construction and risk management. Rolling correlations reveal changing relationships over time, essential for diversification decisions.
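For readers who want to see what the calculation involves, here is a minimal sketch of a rolling correlation in plain Python. It assumes the two price series have already been aligned to the same dates (Bitcoin trades on weekends and the S&P 500 does not, so alignment is a real preprocessing step), and the helper names are illustrative rather than taken from any platform tested. The toy prices are made up for demonstration.

```python
from math import sqrt


def daily_returns(prices):
    """Simple day-over-day returns from a list of prices."""
    return [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def rolling_correlation(returns_a, returns_b, window=252):
    """Correlation over each trailing window; None until a full window exists."""
    if len(returns_a) != len(returns_b):
        raise ValueError("return series must be aligned to the same dates")
    out = []
    for end in range(1, len(returns_a) + 1):
        if end < window:
            out.append(None)
        else:
            start = end - window
            out.append(pearson(returns_a[start:end], returns_b[start:end]))
    return out


# Toy, made-up prices; a real run would use 252 as the window.
asset_a = [100.0, 102.0, 101.0, 105.0, 107.0]
asset_b = [50.0, 51.0, 50.5, 52.5, 53.5]
demo = rolling_correlation(daily_returns(asset_a), daily_returns(asset_b), window=3)
print(demo)
```

The arithmetic is simple. What matters is feeding it real, date-aligned daily prices, which is the step where the general-purpose tools struggled.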
Rolling Correlations from Jan 1, 2018 to Jan 16, 2026. NA = Unable to compute, H = Hallucinations
| Category | Ground Truth | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| Jan 16 2026 correl | 0.10 | 0.10 | NA | NA | NA | NA | H |
| Jan 16 2026 correl | 0.35 | 0.35 | NA | NA | NA | NA | H |
| Time | — | 47s | 2m 46s | 5m 11s | 1m 8s | 20s | 2m 53s |
ChatGPT 5.2 spent 2 minutes 46 seconds trying various approaches. Its final attempt? Pulling Bitcoin prices from FRED (Federal Reserve Economic Data), a macroeconomic data service rather than a primary source of cryptocurrency market data. This is like looking up a restaurant review in a hardware store catalog.
SuperGrok spent over 5 minutes iterating and debugging, ultimately outputting monthly correlations instead of the requested daily data.
Gemini 3.0 created an approximated chart with this disclaimer: “This chart is an approximation derived from historical financial reports and correlation studies, as real-time daily data for the entire range was not directly accessible.” In other words: the numbers were made up.
Perplexity Pro stated the task was impossible and provided non-functional code.
Claude Opus 4.6 fabricated data to produce approximate results—essentially hallucinations built on hallucinations.
DeepVest computed the correct correlation values in 47 seconds.
Why This Matters: Advisors rely on correlation analysis for portfolio diversification. Using approximated data or pulling from wrong sources can lead to portfolios that appear diversified but concentrate risk in ways that only become apparent during market stress.
Test 2: Portfolio Backtesting (The Numbers You Show Clients)
The Task: Backtest two portfolios from January 1, 2018 to January 16, 2026:
- Portfolio A: 60% equities / 40% bonds (classic allocation)
- Portfolio B: 60% equities / 35% bonds / 5% Nvidia
Portfolio backtesting is central to investment management. Advisors use historical performance to validate strategies before recommending them to clients.
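As a rough sketch of what such a backtest computes, the Python below combines daily asset returns into portfolio returns and summarizes them. It rests on several assumptions that the article does not specify: weights are rebalanced back to target every day, the risk-free rate is zero, and the average return is annualized with 252 trading days. The return figures are toy values, not market data, and the asset labels are placeholders.

```python
from math import sqrt
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor


def portfolio_returns(weights, asset_returns):
    """Daily portfolio returns, assuming daily rebalancing to target weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_days = len(next(iter(asset_returns.values())))
    return [
        sum(w * asset_returns[name][t] for name, w in weights.items())
        for t in range(n_days)
    ]


def summarize(daily, risk_free_daily=0.0):
    """Total return, annualized average return, and annualized Sharpe ratio."""
    growth = 1.0
    for r in daily:
        growth *= 1 + r
    excess = [r - risk_free_daily for r in daily]
    vol = stdev(excess)
    sharpe = mean(excess) / vol * sqrt(TRADING_DAYS) if vol > 0 else float("nan")
    return {
        "total_return": growth - 1,
        "avg_annual_return": mean(daily) * TRADING_DAYS,
        "sharpe": sharpe,
    }


# Toy daily returns for illustration only.
asset_returns = {
    "equities": [0.010, -0.005, 0.020, 0.000],
    "bonds": [0.002, 0.001, -0.001, 0.003],
    "nvda": [0.030, -0.020, 0.050, 0.010],
}
portfolio_a = {"equities": 0.60, "bonds": 0.40}
portfolio_b = {"equities": 0.60, "bonds": 0.35, "nvda": 0.05}

daily_a = portfolio_returns(portfolio_a, asset_returns)
daily_b = portfolio_returns(portfolio_b, asset_returns)
print(summarize(daily_a))
print(summarize(daily_b))
```

Different rebalancing schedules, risk-free rates, or annualization conventions will shift these figures somewhat, so a careful comparison should state them explicitly.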
60/40 Model Portfolios from Jan 1, 2018 to Jan 16, 2026. NA = Unable to compute
| Category | Ground Truth | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| 60/40 Total Ret | 120.39% | 120.97% | NA | 90.89% | 112.1% | NA | NA |
| 60/40 Avg Ret | 10.37% | 10.77% | NA | 8.38% | ~9.8% | NA | 4.7% |
| 60/40 SR | 0.81 | 0.81 | NA | 0.48–0.52 | ~0.70 | NA | 0.41 |
| 60/35/5 Total Ret | 302.31% | 304.15% | NA | 273.17% | 259.6% | NA | NA |
| 60/35/5 Avg Ret | 18.98% | 19.32% | NA | 17.81% | ~17.3% | NA | 11.6% |
| 60/35/5 SR | 0.99 | 0.99 | NA | 0.78–0.84 | ~0.80 | NA | 0.73 |
| Time | — | 47s | 4m 2s | 55s | 1m 13s | 39s | 4m 59s |
ChatGPT 5.2 spent over 4 minutes attempting to connect to various data sources before failing completely.
SuperGrok used a poor-quality data source, resulting in return figures 30 percentage points below actual and Sharpe ratios that appeared hallucinated.
Gemini 3.0 marked most of its results with the "~" approximation symbol, and even the approximations were materially wrong: its figures came in roughly 5% to 19% below ground truth in relative terms.
Perplexity Pro randomly searched news websites and returned no calculations.
Claude Opus 4.6 attempted to scrape random data while writing code on the fly, producing a confidently wrong result.
Why This Matters: Showing a client historical performance that’s off by roughly 30 percentage points isn’t just unprofessional. If you’re basing current recommendations on this flawed analysis, you’re making portfolio decisions that could materially harm client outcomes.
This is the kind of data advisors put in client proposals. Getting it wrong doesn’t just lose the deal. It damages your credibility.
Test 3: Fundamental Analysis (The Data You Trust)
The Task: Perform fundamental competitor analysis for Robinhood (HOOD) using P/E ratio, Price/Book, ROE, and Market Cap. Compare current valuation (February 6, 2026) to historical valuation (June 1, 2024).
Access to accurate fundamental data is the foundation of equity analysis.
Fundamental results for Robinhood for current (as run Feb 6, 2026) versus June 1, 2024
| Category | Ground Truth (Bloomberg) | DeepVest | ChatGPT 5.2 | SuperGrok | Gemini 3.0 | Perplexity Pro | Claude Opus 4.6 |
|---|---|---|---|---|---|---|---|
| PE Ratio (2/6/26) | 34.73 | 33.55 | 34.64 | 34.51 | 34.4x | ~34.4 | ~35–37x |
| PE Ratio (6/1/24) | 143.6 | 142.27 | ~27.05 | 68.82 | ~70.9x | NA | ~134x |
| P/B Ratio (2/6/26) | 8.62 | 8.59 | 8.69 | 8.69 | 9.9x | ~9–10 | ~8.5–9.0x |
| P/B Ratio (6/1/24) | 2.67 | 2.63 | ~2.47 | 10.30 | NA | NA | ~2.2–2.5x |
| Market Cap (2/6/26) | $74.4B | $74.4B | $74.4B | $74.47B | $72.5B | ~$74–75B | ~$68B |
| Market Cap (6/1/24) | $18.76B | $18.74B | ~$18.36B | ~$20.4B | ~$18.2B | NA | ~$15.5B |
| ROE (2/6/26) | 27.82% | 27.82% | -7.93% | 27.82% | ~10–12% | 17–27% | ~22% |
| ROE (6/1/24) | 1.82% | 1.82% | 27.82% | 4.25% | NA | NA | ~10.7% |
| Time | — | 46s | 1m 23s | 3m 33s | 12s | 21s | 5m 22s |
ChatGPT 5.2 showed current ROE as negative 7.93% when it’s actually positive 27.82%. Even worse, it reported the current ROE (27.82%) for the historical date, which suggests it does not distinguish point-in-time values in time-series financial data.
SuperGrok reported a historical P/E of 68.82 against an actual 143.6, less than half the true value.
Gemini 3.0 approximated most metrics, with errors ranging from 13% to 60%.
Perplexity Pro had mixed results on current data but completely failed on historical comparisons.
Why This Matters: Presenting a client with a P/E ratio that is less than half the true figure, or an ROE that’s negative when it’s actually positive, destroys trust. If you’re charging advisory fees based on this analysis, the question is: what are your clients actually paying for?
The Pattern: Why General AI Fails Portfolio Work
Across all ten use cases we tested, a consistent pattern emerged:
- Wrong Data Sources: ChatGPT tried to pull Bitcoin prices from FRED (Federal Reserve Economic Data), a macroeconomic data service rather than a primary source of cryptocurrency market data. This isn’t a small error. It’s reaching for an inappropriate data source with absolute confidence.
- Hallucinated Numbers: Gemini provided results with the "~" approximation symbol, admitting uncertainty but still presenting numbers. When we checked them against ground truth, errors ranged from 13% to 60%. These aren’t rounding differences. They’re material misstatements.
- Complete Failures: ChatGPT and Perplexity simply admitted they couldn’t compute many results. That’s honest, but it defeats the purpose of using AI in the first place.
- Non-Repeatable Results: When we asked the same question multiple times, frontier models frequently gave different answers. This reflects their probabilistic nature, but it’s a fatal flaw for professional financial work where consistency and auditability are required.
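The repeatability concern in the last point can be checked mechanically. The sketch below is a hypothetical harness, not the procedure used in the study: `run` stands in for whatever call produces a number (a script, an API request, a model prompt), and the simulated noisy source simply mimics an answer that drifts between runs.

```python
import random


def is_repeatable(run, n_runs=5, tolerance=0.0):
    """Call `run` several times and report whether every numeric result
    falls within `tolerance` of the others, along with the raw results."""
    results = [run() for _ in range(n_runs)]
    spread = max(results) - min(results)
    return spread <= tolerance, results


def deterministic_calc():
    # Stand-in for a fixed calculation over fixed data.
    return 0.35


# Stand-in for a probabilistic system: same question, slightly different answer.
rng = random.Random(0)


def noisy_calc():
    return 0.35 + rng.uniform(-0.05, 0.05)


stable, _ = is_repeatable(deterministic_calc)
drifting, samples = is_repeatable(noisy_calc)
print("deterministic repeatable:", stable)
print("noisy repeatable:", drifting)
```

Logging the raw results alongside the verdict also gives a simple record of what was produced on each run, which is the kind of evidence an audit trail needs.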
What This Means for Financial Advisors
The implications are clear: general-purpose AI tools are not suitable for professional investment work where accuracy, repeatability, and regulatory compliance are required.
These tools were built for consumers, not fiduciaries. They:
- Lack access to verified financial datasets
- Can’t provide audit trails for compliance
- Don’t guarantee repeatability
- May use your client data to train their models, depending on plan and settings
- Have no accountability for errors
The gap isn’t just accuracy—it’s the entire infrastructure required for fiduciary-level work.
The Real Cost of AI Errors
It’s not just about wrong numbers. Consider what happens when:
- A compliance audit fails because you can’t produce documentation of how AI arrived at recommendations. SEC examiners expect you to explain and defend every investment decision. “The AI said so” isn’t a defense.
- A client’s accountant catches errors in your AI-generated analysis. Even if no money is lost, trust is damaged. In a referral-based business, reputation is everything.
- You waste hours verifying AI output to ensure accuracy. At that point, you’re not saving time — you’re adding risk and operational overhead.
- You make portfolio decisions on false premises. Wrong correlation data leads to portfolios that appear diversified but concentrate risk. Incorrect backtesting validates strategies that shouldn’t be recommended.
Want the Complete Analysis?
This article covers 3 of the 10 investment workflows we tested. The patterns we found extend across every type of portfolio task, from simple calculations to sophisticated optimization.
Download the Full Whitepaper
The complete research report includes:
- ✓ All 10 use case comparisons with detailed data tables and results
- ✓ Complete test methodology: how we validated results against Bloomberg and verified calculations
- ✓ The repeatability problem: why frontier models produce different answers to the same question
- ✓ Detailed analysis of all failures: what went wrong with each platform across all tests
- ✓ Full data tables for rolling correlations, stock screening, event backtesting, portfolio optimization, macro stress testing, and technical analysis
- ✓ Time comparisons: how long each platform took (and when they failed completely)
- ✓ Ground truth validation: how we ensured our baseline measurements were accurate
About This Research
This study was conducted by DeepVest between January and February 2026 to evaluate the suitability of AI tools for professional investment management workflows. All testing was performed using the latest available versions of each platform.
DeepVest is a purpose-built portfolio intelligence platform designed specifically for financial advisors. Unlike general-purpose AI tools, DeepVest provides institutional-grade data, complete audit trails, and repeatable results built for fiduciary-level work.
For questions about this research: contact [email protected].