
💡 Introduction

The rapid advancement of large language models (LLMs) has brought about a transformative shift in the field of artificial intelligence. As a notable innovation within the LLM landscape, large reasoning models—such as DeepSeek-R1 and QwQ—have demonstrated superior logical reasoning capabilities in complex problem-solving tasks by incorporating reinforcement learning and chain-of-thought techniques. However, these models are not specifically optimized for information retrieval, a task with its own intrinsic complexity. Consequently, directly applying large reasoning models to such scenarios often results in inefficient search strategies, excessive redundant reasoning, and low precision in the retrieved content.

Recently, researchers have begun to explore methods for enhancing LLMs' complex reasoning capabilities in information retrieval tasks. These approaches typically leverage reinforcement learning to stimulate autonomous search during the reasoning process. Notably, such methods require only the raw questions as input, without the need for high-quality response supervision. While effective at improving model performance, reinforcement learning incurs substantial training overhead. Moreover, many current approaches rely on local retrieval corpora, and switching to live web search during training further reduces training efficiency and drives up cost, hindering the broader adoption of complex reasoning-based search systems. This motivates the need for a solution that enables powerful reasoning at a more economical training cost.

To this end, we propose SimpleDeepSearcher, a framework designed to stimulate autonomous web search during complex reasoning via knowledge distillation and self-distillation. We aim to achieve high training efficiency and performance with limited supervision. Despite its conceptual simplicity, constructing a high-quality training set presents two key challenges. On the question side, existing open-source datasets often suffer from issues such as imbalanced domain distributions, repetitive structures, and insufficient complexity, limiting their utility in eliciting deep search behavior. On the response side, solving deep search tasks requires effectively decomposing complex questions while avoiding invalid reasoning steps and overthinking—objectives that are fundamentally distinct from those in traditional mathematical or logical reasoning tasks.

To address these challenges, we first perform fine-grained filtering of existing open-source datasets based on multiple dimensions including domain coverage, structural variety, and question complexity. This ensures that the selected questions exhibit diverse domains and structures, as well as a balanced difficulty distribution. Next, we perform rollout sampling using large reasoning models in a real-world web search environment. The resulting traces are then filtered again based on criteria such as format, sub-query quality, question difficulty, and reasoning path integrity to eliminate redundant reasoning. The curated data is subsequently used to train multiple LLMs, enabling us to explore the potential of distillation techniques in fostering autonomous web search capabilities.
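The two-stage curation above can be sketched as a pair of filters: one over the raw question pool and one over the rollout traces. The function names, data fields, and thresholds below are illustrative assumptions for exposition, not the actual implementation:

```python
# Hypothetical sketch of the two-stage data curation pipeline.
# Field names ("domain", "hops", "format_valid", ...) and thresholds
# are assumptions for illustration, not the authors' real schema.
from collections import Counter

def filter_questions(questions, max_per_domain=50, min_hops=2):
    """Stage 1: keep sufficiently complex questions while balancing domains."""
    kept, per_domain = [], Counter()
    for q in questions:
        if q["hops"] < min_hops:                       # drop overly simple questions
            continue
        if per_domain[q["domain"]] >= max_per_domain:  # cap over-represented domains
            continue
        per_domain[q["domain"]] += 1
        kept.append(q)
    return kept

def filter_traces(traces, max_search_calls=5):
    """Stage 2: keep well-formed rollout traces free of redundant reasoning."""
    def ok(t):
        return (
            t["format_valid"]                          # parses into think/search/answer turns
            and t["answer_correct"]                    # reasoning path reaches the gold answer
            and t["num_searches"] <= max_search_calls  # no excessive searching
            and not t["has_repeated_subquery"]         # sub-queries are non-redundant
        )
    return [t for t in traces if ok(t)]
```

The key design point is that both stages are cheap, rule-based passes, so the expensive step (rollout sampling in a live web environment) is only paid once per candidate question.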

We evaluate the proposed method on five challenging benchmarks—2WikiMultiHopQA, Bamboogle, Musique, FRAMES, and GAIA—and our results demonstrate that SimpleDeepSearcher consistently outperforms a range of recent state-of-the-art baselines.

We open-source all models and two efficient fine-tuning datasets (0.5k and 0.8k examples). The 0.5k dataset features more direct reasoning paths, while the 0.8k dataset includes richer reflection and rethinking processes. The code will be released progressively after cleanup, and a detailed technical report will follow—stay tuned to our project!

✨  Key Insights

  1. Data Synthesis Based on Real-World Web Environments: We design a large-scale data synthesis pipeline grounded in authentic open-web environments, enhancing the diversity and realism of training corpora. This significantly improves the model’s ability to retrieve and integrate information in complex search tasks.
  2. Rigorous Data Filtering Strategy: We introduce a task-specific QA pair filtering method tailored for search-oriented training, enabling fine-grained selection of high-quality training samples.
  3. Efficient Performance Boost with Limited Data: Using only 871 distilled examples, our 7B-scale model surpasses existing models trained via reinforcement learning. Notably, our distilled Qwen-32B-Instruct approaches the performance of QwQ-32B, which possesses built-in search capabilities, and our method also enables further performance gains for QwQ-32B itself.
  4. Generalization to OOD Evaluation Sets: Training on conventional multi-hop datasets leads to strong generalization capabilities on out-of-distribution (OOD) benchmarks, including FRAMES and GAIA.
  5. Analysis of Post-Distillation Reinforcement Learning: We further fine-tune the distilled 7B model with reinforcement learning and provide an in-depth analysis of the training dynamics and performance impact.

✨ Methodology and Technical Framework