What are Deep Research Tools: A Comprehensive Analysis
Mar 8, 2025

Key Points
Research suggests Deep Research tools, like those from OpenAI, Perplexity, Google, and xAI, vary in technical approaches, with OpenAI using the o3 model and Perplexity using DeepSeek R1 (DataCamp, 2025; ZDNET, 2025).
It seems likely that these tools evolved from early Directed Acyclic Graph (DAG)-based workflows to more dynamic Finite State Machine (FSM) and fully trained models (Siddhardha, 2024; Hopsworks, 2024).
The evidence leans toward using Humanity's Last Exam (HLE) scores, like OpenAI's 26.6%, to evaluate quality, with speed and report depth also considered (Center for AI Safety, 2025; Scale AI, 2025).
Training methods likely include reinforcement learning for OpenAI's o3 and fine-tuning for Perplexity, though details vary (The Decoder, 2024; US AI Institute, 2025).
Deep Research differs from Retrieval-Augmented Generation (RAG) by offering multi-step research, and from agentic systems by focusing specifically on research tasks, with ongoing debate on whether it represents genuine innovation or rebranding (Berkeley Artificial Intelligence Research, 2024; McKinsey, 2024).
Practical limitations include factual errors and source credibility issues, with varying autonomy levels affecting human oversight (ScienceAlert, 2025; Nature, 2025).
Iterative search cycles enhance research depth, with applications in literature reviews and complex topic analysis, supported by HLE benchmarks (arXiv, 2025; InfoQ, 2024).
Technical Distinctions
Deep Research tools from major AI labs show distinct technical approaches:
OpenAI's Deep Research uses the o3 model, handling text, images, and PDFs, with future visualization capabilities, scoring 26.6% on HLE (DataCamp, 2025).
Perplexity's Deep Research relies on a custom DeepSeek R1 with Test Time Compute (TTC) expansion, scoring 21.1% on HLE (ZDNET, 2025).
Google's Deep Research, part of Gemini 2.0 Pro, integrates with their AI assistant for comprehensive reports (Google, 2025).
xAI's DeepSearch, based on Grok 3, focuses on reasoning and research, with less specific performance data available (Business Insider, 2025).
Evolution and Metrics
These tools likely evolved from early DAG-based workflows, where tasks were predefined, to dynamic FSM and fully trained models that adapt during research (Siddhardha, 2024). Evaluation metrics include HLE scores, with OpenAI leading at 26.6%, and time to completion, where Perplexity is faster (under 3 minutes) than OpenAI (5-30 minutes) (Creator Economy, 2025; The Indian Express, 2025).
Comprehensive Analysis of Deep Research Implementations
This note provides a detailed examination of Deep Research across major AI labs, including OpenAI, Perplexity, Google, and xAI, addressing technical distinctions, evolutionary paths, evaluation metrics, training methodologies, differences from prior technologies, practical limitations, iterative search cycles, real-world applications, empirical evidence, and the balance between autonomy and human oversight. The analysis is grounded in recent findings as of March 7, 2025, and aims to offer a professional, thorough overview.
Technical Distinctions Between Implementations
Deep Research tools are AI agents designed for autonomous, in-depth research, with each lab adopting unique technical approaches:
OpenAI's Deep Research: Based on the o3 model, a reasoning-focused large language model (LLM) introduced in December 2024. It can interpret and analyze text, images, and PDFs, with plans to produce visualizations and embed images in reports. It scored 26.6% on Humanity's Last Exam (HLE), surpassing rivals like DeepSeek's R1 (9.4%) and GPT-4o (3.3%) (DataCamp, 2025). Limitations include factual hallucinations and difficulty distinguishing authoritative sources.
Perplexity's Deep Research: Utilizes a custom version of DeepSeek R1, an open-source model, with a proprietary framework called Test Time Compute (TTC) expansion. This enables systematic exploration by mimicking human cognitive processes through iterative analysis cycles, performing dozens of searches and reading hundreds of sources. It scored 21.1% on HLE, with a focus on speed, completing most tasks in under 3 minutes (ZDNET, 2025).
Google's Deep Research: Integrated into Gemini Advanced, using the Gemini 2.0 Pro model, announced in December 2024. It conducts research by creating multi-step plans, browsing hundreds of sites, and delivering comprehensive reports with linked sources, emphasizing integration with productivity ecosystems (Google, 2025).
xAI's DeepSearch: Part of Grok 3, launched in February 2025, with reasoning capabilities and a focus on multistep research. xAI trained the model using a tool that tracks its internet searches, teaching it natural search and reasoning skills; the feature is available to X Premium and Premium+ users. Specific HLE scores were not found, but it competes with offerings from OpenAI and Google (Business Insider, 2025).
These distinctions highlight differences in underlying models, data handling capabilities, and performance metrics, with OpenAI and Perplexity providing benchmark scores for comparison.
Evolution from Early DAG-Based Approaches to Sophisticated Models
The evolution of Deep Research likely progressed from early Directed Acyclic Graph (DAG)-based approaches, where research tasks were represented as nodes with dependencies (e.g., workflow orchestration in Apache Airflow), to more sophisticated Finite State Machine (FSM) and fully trained models. DAGs were used to define static sequences of research steps, limiting adaptability. Current implementations, such as those using FSM, allow dynamic state transitions based on research outcomes, while fully trained models (e.g., o3, Grok 3) learn to autonomously plan and refine research processes, enhancing flexibility and depth (Siddhardha, 2024; Hopsworks, 2024).
This shift reflects a move toward AI systems that can mimic human research processes, with iterative learning and adaptation, rather than rigid, predefined workflows.
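To make this contrast concrete, the following minimal Python sketch compares a static DAG-style pipeline, whose step order is fixed before research begins, with an FSM-style loop whose next state depends on intermediate findings. All names here (run_dag, run_fsm, the state labels) are illustrative assumptions, not any vendor's actual implementation.

```python
# DAG approach: the step sequence is fixed before research begins.
DAG_STEPS = ["formulate_query", "search", "summarize", "write_report"]

def run_dag(topic):
    results = {}
    for step in DAG_STEPS:  # order never changes at runtime
        results[step] = f"{step}({topic})"
    return results

# FSM approach: the next state is chosen based on intermediate findings.
def run_fsm(topic, max_iters=5):
    state, findings = "search", []
    for _ in range(max_iters):
        if state == "search":
            findings.append(f"sources for {topic}")
            # Transition depends on what was found, not on a fixed plan.
            state = "analyze" if findings else "search"
        elif state == "analyze":
            # Loop back to searching until enough evidence is gathered.
            state = "report" if len(findings) >= 2 else "search"
        elif state == "report":
            return f"report on {topic} from {len(findings)} findings"
    return f"partial report on {topic}"
```

The FSM version can revisit the search state as many times as its findings warrant, which is exactly the adaptability the static DAG lacks.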
Quantifiable Metrics and Comparisons
Evaluation metrics for Deep Research quality include:
Humanity's Last Exam (HLE): A benchmark with 3,000 expert-level questions across mathematics, humanities, and natural sciences, designed to test reasoning beyond simple retrieval. Scores include:
OpenAI Deep Research: 26.6%
Perplexity Deep Research: 21.1%
Google's Gemini and xAI's DeepSearch lack specific HLE scores in recent data (Wikipedia, 2025a).
Time to Completion: Perplexity completes tasks in under 3 minutes, while OpenAI takes 5-30 minutes, affecting user experience and efficiency (The Indian Express, 2025).
Comprehensiveness: Measured by report depth, citation quality, and ability to handle complex queries, with OpenAI noted for analytical depth and Perplexity for speed and accessibility.
Comparisons show OpenAI leading in HLE performance, but Perplexity offers faster, more affordable access, highlighting trade-offs between accuracy and efficiency.
Specific Training Methodologies
Training methodologies vary, tailored to enhance research capabilities:
OpenAI's o3: Uses reinforcement learning with simulated reasoning and private chain-of-thought techniques, allowing the model to pause and reflect, improving accuracy on complex tasks like coding and math (The Decoder, 2024).
Perplexity's Deep Research: Likely involves fine-tuning DeepSeek R1, an open-source model known for reasoning, with TTC expansion for iterative analysis, though specific details are proprietary (US AI Institute, 2025).
Google's Gemini 2.0 Pro: Trained on large datasets using supervised and reinforcement learning, focusing on complex tasks and reasoning, with integration into Gemini Advanced for research (Google Gemini, 2025).
xAI's Grok 3: Trained on extensive datasets with a focus on reasoning, using 200,000 Nvidia H100 GPUs, emphasizing multimodal capabilities and DeepSearch functionality (PCWorld, 2025).
These methodologies highlight a trend toward specialized training for research tasks, with reinforcement learning and fine-tuning being common.
Differences from RAG and Agentic Systems
Deep Research differs from previous technologies as follows:
Retrieval-Augmented Generation (RAG): RAG augments LLMs with a retrieval step so responses draw on up-to-date information, but it is fundamentally single-step: retrieve once, then generate. Deep Research extends this by planning, running multiple iterative search rounds, and synthesizing a full report, going well beyond one-shot retrieval (Berkeley Artificial Intelligence Research, 2024).
Agentic Systems: These are broader AI systems that act autonomously, while Deep Research is a specific subset focused on research tasks, with enhanced planning and reasoning capabilities. The innovation lies in depth and autonomy, though some argue it is a rebranding of advanced agentic systems, sparking debate over novelty versus marketing (McKinsey, 2024).
Practical Limitations
Current Deep Research implementations face several limitations:
Factual Errors: All systems can produce hallucinations, with OpenAI noting issues in distinguishing authoritative sources (ScienceAlert, 2025).
Source Credibility: Difficulty in identifying reliable sources, potentially including rumors, affecting report accuracy.
Uncertainty Conveyance: Reports may not accurately convey how uncertain the underlying findings are, which can undermine user trust.
Time and Cost: OpenAI's $200/month Pro plan limits access, while Perplexity offers free tiers but with query limits (Creator Economy, 2025).
Human Oversight: Requires intervention for complex tasks, highlighting the need for user guidance.
Implementation of Iterative Search Cycles
Iterative search cycles involve multiple rounds of searching, analyzing, and refining, impacting research depth:
OpenAI: Uses simulated reasoning, with o3 pausing to reflect, potentially performing multiple iterations, taking 5-30 minutes, enhancing depth but increasing latency.
Perplexity: Employs TTC expansion for iterative refinement, completing tasks quickly (under 3 minutes), balancing depth and speed.
Google: Creates multi-step plans for user approval, allowing iterative browsing and analysis, with reports reflecting comprehensive insights.
xAI: DeepSearch tracks internet searches, teaching reasoning skills, with iterative processes likely embedded in Grok 3's reasoning modes (Think, Big Brain), affecting depth based on mode selection.
This variability affects research depth, with longer cycles potentially yielding more comprehensive results but at higher computational cost.
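One way to picture this depth-versus-cost trade-off is an iterative loop with a hard compute budget and a coverage-based stopping criterion. This is a generic sketch with made-up search_round and coverage helpers, not any lab's actual logic.

```python
def search_round(topic, iteration):
    # Hypothetical stand-in for one web-search pass; returns new sources.
    return {f"{topic}-source-{iteration}"}

def coverage(sources, target=5):
    # Crude proxy for "how complete is the research so far".
    return min(1.0, len(sources) / target)

def iterative_research(topic, budget_rounds=10, threshold=0.8):
    sources = set()
    for i in range(budget_rounds):          # hard cap bounds latency and cost
        sources |= search_round(topic, i)   # search step
        if coverage(sources) >= threshold:  # stop once coverage suffices
            break
    return sources                          # synthesis would follow
```

Raising budget_rounds or threshold deepens the research at the cost of more search passes, mirroring the OpenAI (slower, deeper) versus Perplexity (faster, bounded) trade-off described above.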
Real-World Applications and Use Cases
Deep Research tools demonstrate significant benefits in:
Literature Reviews: OpenAI's tool produces cited, pages-long reports, useful for scientists (Nature, 2025).
Complex Topic Research: Perplexity excels in finance, marketing, and technology, delivering expert-level analysis in minutes (InfoQ, 2025).
Educational and Business Reports: Google's Deep Research aids in industry trends, competitive analysis, and customer research, enhancing productivity (Google Workspace Updates, 2025).
These applications highlight the transformative potential for knowledge workers and researchers.
Research Papers and Empirical Evidence
Empirical evidence includes:
HLE Performance: Provides scores for comparison, with OpenAI at 26.6% and Perplexity at 21.1%, indicating reasoning capabilities (arXiv, 2025).
Other Benchmarks: GPQA, Codeforces, and SWE-Bench Verified scores for models like o3, showing performance in coding and math, supporting research effectiveness (InfoQ, 2024).
These papers offer robust data for evaluating Deep Research tools.
Balancing Autonomous Research with Human Oversight
Different systems balance autonomy and oversight variably:
OpenAI: Allows user interaction for approving research plans, with transparency in reasoning steps, but requires Pro subscription for full access, limiting autonomy for free users.
Perplexity: Offers free access with limits, enabling user queries but with iterative refinement largely autonomous, balancing speed and depth.
Google: Users can revise multi-step plans, enhancing oversight, with integration into productivity tools facilitating human intervention.
xAI: DeepSearch operates within Grok 3, with modes like Think and Big Brain showing thought processes, allowing user oversight, but specifics on intervention are less clear.
This balance ensures users can guide research while leveraging AI autonomy, with varying levels of transparency and control.
Summary Table: HLE Performance and Key Metrics
| Implementation | HLE Score | Time to Completion | Data Handling |
|---|---|---|---|
| OpenAI Deep Research | 26.6% | 5-30 minutes | Text, Images, PDFs |
| Perplexity Deep Research | 21.1% | Under 3 minutes | Text (assumed) |
| Google's Deep Research | Not specified | Not specified | Text, Web Sources |
| xAI's DeepSearch | Not specified | Not specified | Text, Web, X |
This table summarizes key metrics, highlighting performance and operational differences.
In conclusion, Deep Research represents a significant advancement in AI-driven research. The major implementations offer distinct strengths and limitations, are supported by empirical benchmarks and real-world applications, and balance AI autonomy with necessary human oversight.
References
arXiv. (2025). Humanity's Last Exam. arXiv:2501.14249.
Berkeley Artificial Intelligence Research. (2024, February 18). The shift from models to compound AI systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
Business Insider. (2025, February). Elon Musk's xAI has been working on a 'DeepSearch' feature, employees say, and it could compete with Google and OpenAI. https://www.businessinsider.com/xai-deepsearch-google-gemini-openai-2025-2
Center for AI Safety. (2025). Humanity's Last Exam. GitHub. https://github.com/centerforaisafety/hle
Creator Economy. (2025). Deep Research: The best AI product from OpenAI since ChatGPT. https://creatoreconomy.so/p/deep-research-the-best-ai-agent-since-chatgpt-product
DataCamp. (2025). OpenAI's Deep Research: A guide with practical examples. https://www.datacamp.com/blog/deep-research-openai
Google. (2025). Try Deep Research and our new experimental model in Gemini, your AI assistant. https://blog.google/products/gemini/google-gemini-deep-research/
Google Gemini. (2025). Gemini Advanced - get access to Google's most capable AI models with Gemini 2.0. https://gemini.google/advanced/?hl=en
Google Workspace Updates. (2025, February). Gemini Deep Research and experimental models now available to Google Workspace users in Gemini Advanced. https://workspaceupdates.googleblog.com/2025/02/deep-research-available-for-google-workspace-in-gemini-advanced.html
Hopsworks. (2024). What is a DAG Processing Model? https://www.hopsworks.ai/dictionary/dag-processing-model
InfoQ. (2024, December). OpenAI announces 'o3' reasoning model. https://www.infoq.com/news/2024/12/openai-announces-o3/
InfoQ. (2025, February). Perplexity unveils Deep Research: AI-powered tool for advanced analysis. https://www.infoq.com/news/2025/02/perplexity-deep-research/
McKinsey. (2024). Why AI agents are the next frontier of generative AI. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/why-agents-are-the-next-frontier-of-generative-ai
Nature. (2025). OpenAI's 'deep research' tool: Is it useful for scientists? https://www.nature.com/articles/d41586-025-00377-9
PCWorld. (2025). xAI launches new Grok-3 AI model with DeepSearch reasoning. https://www.pcworld.com/article/2611838/xai-launches-new-grok-3-ai-model-with-deepsearch-researching.html
Scale AI. (2025). Humanity's Last Exam - Scale AI and CAIS unveil results. https://scale.com/blog/humanitys-last-exam-results
ScienceAlert. (2025). ChatGPT's Deep Research is here. But can it really replace a human expert? https://www.sciencealert.com/chatgpts-deep-research-is-here-but-can-it-really-replace-a-human-expert
Siddhardha. (2024). Agentic AI workflows in Directed Acyclic Graphs (DAGs) — Intro. Medium. https://medium.com/@siddhardha/agentic-ai-workflows-in-directed-acyclic-graphs-dags-intro-5d00444124dd
The Decoder. (2024). OpenAI's o3 model shows major gains through reinforcement learning scaling. https://the-decoder.com/openais-o3-model-shows-major-gains-through-reinforcement-learning-scaling/
The Indian Express. (2025). Perplexity AI's Deep Research tool is free to use: Here's how it works. https://indianexpress.com/article/technology/artificial-intelligence/perplexity-ais-deep-research-tool-is-free-to-use-heres-how-it-works-9837369/
US AI Institute. (2025). What is Perplexity Deep Research – A detailed overview. https://www.usaii.org/ai-insights/what-is-perplexity-deep-research-a-detailed-overview
Wikipedia. (2025a). Humanity's Last Exam. https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam
Wikipedia. (2025b). Deep Research. https://en.wikipedia.org/wiki/Deep_Research
Wikipedia. (2025c). ChatGPT Deep Research. https://en.wikipedia.org/wiki/ChatGPT_Deep_Research
ZDNET. (2025). What is Perplexity Deep Research, and how do you use it? https://www.zdnet.com/article/what-is-perplexity-deep-research-and-how-do-you-use-it/