Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees (2025)

¹ L3S Research Center, Leibniz University Hannover, Hannover, Germany; {gollam.rabby, mitra}@l3s.de
² TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany; {diyana.muhammed, auer}@tib.eu
Dataset · Project Page · Codebase
† Equal contribution.

Gollam Rabby (✉)¹† · Diyana Muhammed²† · Prasenjit Mitra² · Sören Auer¹,²

Abstract

Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. To address these limitations, we propose the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a novel framework that integrates Monte Carlo Tree Search (MCTS) with Nash Equilibrium strategies to iteratively refine and validate hypotheses. MC-NEST dynamically balances exploration and exploitation through adaptive sampling strategies, which prioritize high-potential hypotheses while maintaining diversity in the search space. We demonstrate the effectiveness of MC-NEST through comprehensive experiments across multiple domains, including biomedicine, social science, and computer science. MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale, averaged over novelty, clarity, significance, and verifiability) on the social science, computer science, and biomedicine datasets, respectively, outperforming state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets. These results underscore MC-NEST’s ability to generate high-quality, empirically grounded hypotheses across diverse domains. Furthermore, MC-NEST facilitates structured human-AI collaboration, ensuring that LLMs augment human creativity rather than replace it. By addressing key challenges such as iterative refinement and the exploration-exploitation balance, MC-NEST sets a new benchmark in automated hypothesis generation. The framework provides a robust and adaptable approach that advances the boundaries of scientific discovery. Additionally, MC-NEST’s ethical design enables responsible AI use, emphasizing transparency and human supervision in hypothesis generation.

Keywords: Scientific Hypothesis Generation · Monte Carlo Tree Search · Adaptive Sampling Strategies · Hypothesis Refinement

1 Introduction

Scientific hypothesis generation drives discovery and innovation but remains limited by the scale and complexity of modern challenges. While large language models (LLMs) show promise in automating this process [2], existing approaches struggle to generate hypotheses that are both novel and empirically grounded, owing to a lack of iterative refinement and a poor exploration-exploitation balance [5]. To address these challenges, we utilize the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a framework that integrates Monte Carlo Tree Search (MCTS) with Nash Equilibrium strategies to iteratively refine hypotheses [15]. MC-NEST frames hypothesis generation as a game whose players are competing strategies for exploring and refining hypotheses. This game-theoretic approach allows MC-NEST to balance the trade-off between exploring new ideas and exploiting known high-quality hypotheses. Each strategy aims to maximize the quality of the generated hypotheses, and Nash Equilibrium ensures a balance in which no player (strategy) can improve its outcome by unilaterally changing its approach. These strategies guide the exploration and refinement phases by dynamically adjusting the trade-off between exploring new hypotheses and exploiting known high-quality ones.

MC-NEST dynamically balances exploration and exploitation using adaptive sampling techniques, ensuring diverse and high-potential hypotheses. The framework operates in two phases: (1) an exploration phase, where MCTS navigates the hypothesis space guided by Nash Equilibrium, and (2) a refinement phase, where adaptive sampling and iterative self-reflection ensure hypotheses are innovative and empirically grounded. For instance, in peptide optimization, exploration might involve proposing a new substitution (e.g., replacing arginine with lysine) to test its effect on solubility, while exploitation would refine this idea by validating whether the substitution improves solubility without compromising the peptide’s nuclear localization function. Experiments across biomedicine, social science, and computer science demonstrate MC-NEST’s effectiveness in hypothesis generation. MC-NEST achieves higher novelty, clarity, significance, and verifiability than existing methods [23], demonstrating its effectiveness in generating scientifically impactful hypotheses. Specifically, MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale, averaged over novelty, clarity, significance, and verifiability) on the social science, computer science, and biomedicine datasets, respectively. These results outperform state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets, showing that MC-NEST generates hypotheses that are not only innovative but also empirically grounded and scientifically impactful.

A key innovation is MC-NEST’s ability to incorporate emerging scientific literature, addressing the limitations of automatic refinement and exploration-exploitation balance. The framework supports structured human-AI collaboration, where LLMs augment human expertise rather than replace it. This approach balances AI and human judgment, mitigating over-reliance on AI. While AI excels at generating novel hypotheses and exploring chemical spaces, human expertise is critical for interpreting results, identifying biases, and ensuring ethical decisions. For example, in peptide optimization, MC-NEST proposes substitutions (e.g., lysine-for-arginine) to improve solubility, while humans validate whether these changes maintain nuclear localization and align with biochemical principles. This iterative collaboration combines AI’s exploratory capabilities with human expertise, ensuring scientifically robust and ethically sound outcomes.


For research problems, impact is as critical as novelty. While novelty ensures that hypotheses are original, impact ensures they address meaningful scientific challenges. MC-NEST achieves this balance by generating hypotheses that are not only novel but also grounded in domain-specific knowledge and validated for real-world applicability. Unlike purely exploratory methods, MC-NEST incorporates iterative refinement and validation, ensuring that hypotheses are both innovative and empirically grounded. For example, in complex scientific domains such as protein engineering, MC-NEST’s proposed modifications (e.g., lysine-for-arginine substitutions) are designed to enhance solubility while maintaining critical functional properties—a dual focus that directly addresses high-priority scientific and therapeutic needs. By combining exploration with rigorous validation, MC-NEST ensures that its hypotheses are not only novel but also impactful, contributing to solving real-world problems with significant scientific and practical implications. Our contributions include:

  • MC-NEST, a framework integrating MCTS and Nash Equilibrium for hypothesis generation, enhanced by adaptive sampling techniques.

  • A comprehensive performance analysis across multiple domains, with detailed studies highlighting the impact of each component.

  • A human-AI collaboration approach that improves hypothesis quality through expert refinement.

To ensure reproducibility, we will release all source code, datasets, and evaluation protocols used in this work.

To illustrate MC-NEST’s capabilities, we present an example of hypothesis generation and refinement for optimizing a synthetic peptide sequence (MARTKQTARKSTGGKAPRKQLASKAARKSAARAAAAGGGGGGG) for nuclear localization and solubility. MC-NEST generates an initial hypothesis: Substituting lysine for arginine in the nuclear localization signal (NLS) preserves the positive charge required for nuclear import while enhancing solubility due to lysine’s less bulky structure. Validation against biochemical principles reveals potential trade-offs, such as reduced binding affinity to nuclear import receptors [7]. MC-NEST refines the hypothesis by incorporating additional modifications: Replacing some glycine residues with alanine in the glycine-rich linker to maintain flexibility without introducing phosphorylation sites. Experimental validation confirms that the modified peptide outperforms the original sequence, retaining nuclear localization efficiency while improving solubility and functionality. The updated sequence generated by MC-NEST is: MAKTQTGRPKSTGGPAPRKQLASPPARKSVAARAAAASGGGSGG. A visual comparison (generated with AlphaFold [8]) of the original and updated peptide sequences is shown in Figure 1.

[Figure 1: AlphaFold structural comparison of the original and updated peptide sequences.]

2 Methodology

MC-NEST is a computational framework designed to enhance the problem-solving capabilities of LLMs for scientific hypothesis generation [15]. As illustrated in Figure 2, MC-NEST integrates MCTS, a decision-making algorithm for exploring large search spaces [3], with Nash Equilibrium strategies to iteratively refine hypotheses and solutions. By dynamically balancing exploration and exploitation, MC-NEST ensures that generated hypotheses are both innovative and empirically grounded.
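To make the search procedure concrete before detailing each phase, the following is a minimal Python sketch of the MC-NEST loop, assuming an `llm` callable that maps a prompt string to a completion string. The helper functions (`init_root`, `expandable_nodes`, `select`, `expand`, `evaluate`, `backpropagate`) are sketched in the corresponding subsections below; the node structure, rollout count, and default policy are illustrative assumptions, not the authors’ exact implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One hypothesis state in the MC-NEST search tree."""
    hypothesis: str
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)
    q: float = 0.0   # quality score Q
    visits: int = 0  # visit count N

def mc_nest(problem: str, llm, rollouts: int = 8, policy: str = "greedy") -> str:
    root = init_root(problem, llm)                # ZSCoT initialization (Sec. 2.2.1)
    for _ in range(rollouts):
        candidates = expandable_nodes(root)       # BFS candidate search (Sec. 2.2.2)
        if not candidates:                        # dead end: restart from a fresh root
            root = init_root(problem, llm)
            continue
        node = select(candidates, policy)         # UCT + Nash selection (Sec. 2.2.3)
        child = expand(node, problem, llm)        # critique and refine (Sec. 2.2.5)
        child.q = evaluate(child, problem, llm)   # LLM-scored reward (Sec. 2.2.7)
        child.visits = 1
        backpropagate(child)                      # update ancestors (Sec. 2.2.6)
    return best_hypothesis(root)

def best_hypothesis(root: Node) -> str:
    # Return the highest-Q hypothesis found anywhere in the tree.
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if node.q > best.q:
            best = node
        stack.extend(node.children)
    return best.hypothesis
```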

2.0.1 Problem Setting for Hypothesis Generation.

MC-NEST is designed for a structured search over combinatorial hypothesis spaces, particularly in domains requiring rigorous reasoning and insight. The framework addresses the challenge of efficiently navigating vast search spaces while ensuring quality, efficiency, and novelty. Specifically, MC-NEST targets problems where:

  • The hypothesis space is combinatorial, with solutions constructed from smaller reasoning steps or building blocks. For example, in protein engineering, a hypothesis might propose amino acid substitutions to optimize functions like nuclear localization or solubility [19]. A specific hypothesis could suggest substituting lysine for arginine in a nuclear localization signal (NLS), preserving the positive charge required for nuclear import while enhancing solubility due to lysine’s less bulky structure. Such hypotheses are built from testable steps (e.g., charge preservation, solubility enhancement) that can be experimentally validated.

  • The search space is too large for exhaustive exploration, necessitating intelligent traversal strategies [9]. For example, the space of possible amino acid substitutions is intractable without a guided search. Traditional methods often focus on well-known substitutions (e.g., arginine-to-lysine in the NLS), while MC-NEST explores less-studied modifications, such as introducing alanine into glycine-rich linkers to enhance flexibility without adding phosphorylation sites. By prioritizing high-potential but underexplored changes, MC-NEST uncovers novel solutions missed by traditional approaches.

  • Solutions must satisfy strict correctness criteria, including clarity, testability, relevance, and novelty [23]. For instance, a hypothesis must clearly describe relationships (e.g., “substituting lysine for arginine enhances nuclear import efficiency”), be testable (e.g., via fluorescence microscopy or solubility assays), be relevant (e.g., optimizing synthetic peptides for mammalian cell expression), and be novel (e.g., identifying alanine’s role in linker flexibility). MC-NEST ensures that hypotheses meet these criteria by iteratively refining and validating them against biochemical principles and experimental data.

2.0.2 Search Space and Traversal Strategy.

The search space in MC-NEST is represented as a tree, where nodes correspond to solutions (e.g., hypotheses or amino acid substitutions) and edges represent logical transitions. The traversal strategy combines exploration and exploitation: 1) Upper Confidence Bound for Trees (UCT) balances exploration and exploitation by estimating branch potential using confidence intervals, favoring high-uncertainty or high-performance paths [15] (Section 2.2). For example, UCT explores less-studied substitutions (e.g., alanine in glycine-rich linkers) while leveraging known modifications (e.g., lysine-for-arginine in the NLS). 2) Exploration prioritizes underexplored branches, balancing novelty and promise, as seen in game-playing AI like AlphaGo [18] (Section 2.2). 3) Exploitation refines promising branches using probabilistic node selection, focusing on high-quality regions while maintaining diversity (Section 2.2). For instance, MC-NEST exploits beneficial substitutions (e.g., lysine-for-arginine) while exploring novel combinations (e.g., alanine in glycine-rich linkers) to optimize functionality.

2.1 Benefits of the MC-NEST Framework in Hypothesis Generation

Scientific discovery has traditionally relied on structured methodologies but often faces limitations due to their lack of refinement and difficulty in balancing exploration and exploitation [5]. Existing frameworks struggle to adapt to emerging scientific literature or integrate new discoveries, leading to hypotheses that are either theoretically sound but empirically unsupported or computationally generated but lacking empirical grounding. For example, traditional methods might focus on well-known substitutions (e.g., arginine-to-lysine in the NLS) but overlook novel modifications (e.g., alanine in a glycine-rich linker) that enhance functionality [19]. The exponential growth of scientific publications further complicates the process, as researchers must sift through vast amounts of literature to identify meaningful insights [14]. While LLMs offer potential, they often fail to generate hypotheses that are both novel and empirically validated [2].

Limitations of Existing Approaches.

Previous works have attempted to address these gaps through approaches like zero-shot hypothesis generation but suffer from critical limitations: 1) Lack of Iterative Refinement: Hypotheses may be theoretically sound but lack iterative refinement [20]. 2) Imbalanced Exploration-Exploitation: Conventional approaches struggle to balance novel hypothesis exploration with established patterns, leading to biased or suboptimal results [9].

Addressing Challenges with MC-NEST.

MC-NEST integrates Nash Equilibrium strategies with LLM-based self-refinement to address these limitations: 1) Dynamic Adaptation: MC-NEST balances exploration and exploitation using Nash Equilibrium, enabling adaptability to emerging scientific contexts. For example, in protein engineering, it explores less-studied modifications (e.g., alanine in a glycine-rich linker) while leveraging well-known substitutions (e.g., lysine-for-arginine in the NLS). 2) Iterative Self-Refinement: MC-NEST employs MCTS with iterative self-critique, refining hypotheses against known principles. For instance, it identifies trade-offs (e.g., reduced binding affinity) and incorporates additional modifications (e.g., alanine in the glycine-rich linker). 3) Strategic Exploration: MC-NEST uses sampling approaches to prioritize high-potential hypotheses while maintaining diversity, ensuring robust hypothesis generation.

2.2 Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST)

The objective of MC-NEST is to generate a research hypothesis $h^*$ for a given problem instance $p$. Formally, let $\mathcal{H}$ denote the hypothesis space, where each hypothesis $h \in \mathcal{H}$ represents a candidate research statement. The goal is to identify the $h^*$ that optimizes a quality function $Q(h)$ capturing validity, novelty, and coherence: $h^* = \arg\max_{h \in \mathcal{H}} Q(h)$.

2.2.1 Initialization.

In MC-NEST, the root node represents the initial hypothesis state, with edges denoting potential transformations or refinements through iterative self-critique and exploration strategies. To initialize the root node, we use a pre-trained LLM with a Zero-Shot Chain-of-Thought (ZSCoT) strategy [10]. Specifically, the LLM is prompted with the input instance $p$ to generate an initial hypothesis without relying on task-specific fine-tuning or prior search history. This approach leverages the LLM’s broad, pre-trained knowledge to establish a well-reasoned starting point, enhancing adaptability and promoting a wide, unbiased exploration of the hypothesis space. The initialization is represented as: $\text{root} = \text{Node}(\text{hypothesis} = \text{ZSCoT\_LLM}(p))$.
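A minimal sketch of this initialization, reusing the `Node` class and `llm` callable from the loop sketch above; the prompt wording is our assumption (the exact templates used in the experiments appear in Appendix A.2).

```python
ZSCOT_TRIGGER = "Let's think step by step."  # zero-shot CoT phrase from [10]

def init_root(problem: str, llm) -> Node:
    # One ZSCoT call produces the root hypothesis: no fine-tuning, no search
    # history, only the LLM's pre-trained knowledge.
    prompt = (f"Research problem: {problem}\n{ZSCOT_TRIGGER} "
              "Then state one clear, testable research hypothesis.")
    return Node(hypothesis=llm(prompt))
```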

2.2.2 Candidate Node Generation.

Child nodes are generated by applying a structured process of self-refinement and self-evaluation to the parent node’s hypothesis. Self-refinement improves the hypothesis itself by prompting the LLM with the current hypothesis and customized instructions, such as increasing specificity, enhancing novelty, or aligning better with empirical data. The LLM determines what to refine using predefined heuristics that guide the refinement, such as logical coherence, relevance to the research goal, and consistency with known information. Following refinement, self-evaluation scores the hypothesis against these metrics to ensure each child node represents an improvement over its parent. Nodes are visited using a breadth-first search (BFS) strategy [4], where a node is expanded only if it has not reached its maximum allowed number of children and none of its children have a higher quality score $Q$ than the node itself. If no candidate nodes meet these criteria, the method refines the root node, reinitializing the search by generating a new hypothesis using the ZSCoT strategy. This approach balances exploration (generating new hypotheses) and exploitation (refining existing ones) by dynamically adjusting based on the quality scores of hypotheses. While global optimality is not guaranteed, the iterative refinement process aims to converge toward high-quality hypotheses, with higher $Q$-scores indicating better solutions.
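A sketch of the BFS traversal under these expansion rules; the `MAX_CHILDREN` cap is an assumed constant, since the paper does not report the exact limit.

```python
from collections import deque

MAX_CHILDREN = 2  # assumed cap on children per node

def expandable_nodes(root: Node) -> "list[Node]":
    # Breadth-first scan [4]: a node remains expandable while it has room for
    # more children and no child has matched or beaten its quality score Q.
    result, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if len(node.children) < MAX_CHILDREN and all(c.q < node.q for c in node.children):
            result.append(node)
        queue.extend(node.children)
    return result  # an empty list triggers re-initialization of the root
```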

2.2.3 Nash Equilibrium Strategy for Node Selection.

The hypothesis generation process in MC-NEST begins with an initial hypothesis generated by a pre-trained LLM at the root node of a search tree. Each node represents a unique hypothesis state, and edges signify possible refinements through iterative self-critique. Child nodes are created by refining the parent node’s hypothesis using structured prompts, employing self-refinement and self-evaluation techniques to iteratively enhance the hypothesis.

During node selection, MC-NEST uses UCT, where each node is assigned a quality score $Q$ derived from evaluation metrics such as logical coherence, novelty, and empirical alignment. The UCT score balances the exploration of under-explored nodes and the exploitation of high-quality hypotheses, guiding the search toward optimal solutions. A node is considered fully expanded if it reaches the maximum allowed number of children or if any child exhibits a reward $Q$ greater than or equal to that of the current node. For a set of candidate nodes $\{h_1, h_2, \dots, h_n\}$, the Nash Equilibrium strategy assigns a uniform probability distribution over the possible actions: $\pi(h_i) = \frac{1}{n}, \ \forall i = 1, 2, \dots, n$, where $n$ is the number of candidate nodes. This uniform probability ensures fair exploration of the hypothesis space, preventing premature convergence to suboptimal solutions. The MC-NEST framework employs three selection policies to balance exploration and exploitation:

  • Greedy Policy selects the node with the highest combined score of UCT and Nash equilibrium probability: $i^* = \arg\max_i \left[ \text{UCT}(i) + \pi(h_i) \right]$.

  • Importance Sampling Policy assigns selection weights based on the product of UCT scores and Nash equilibrium probabilities, then samples accordingly: $\text{Weight}(i) = \text{UCT}(i) \times \pi(h_i)$, $i^* = \text{random\_choice}(C, \text{weights} = \{\text{Weight}(i)\})$, where $C$ is the candidate set.

  • Pairwise Importance Sampling Policy evaluates pairs of nodes $(i, j)$ based on their UCT scores and weights, selecting the node of the pair with the higher combined score: $i^* = \arg\max\left(\text{UCT}(i) + \pi(h_i),\ \text{UCT}(j) + \pi(h_j)\right)$.

These policies systematically balance exploration and exploitation, ensuring that the search process prioritizes high-reward nodes while maintaining a broad exploration of the hypothesis space.
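A sketch of the three policies, assuming the `uct` helper defined in the next subsection; treating the pairwise policy as a comparison of one randomly drawn pair is our reading of the description, not a confirmed implementation detail.

```python
import random

def select(candidates: "list[Node]", policy: str = "greedy") -> Node:
    if len(candidates) == 1:
        return candidates[0]
    pi = 1.0 / len(candidates)                    # uniform Nash probability pi(h_i)
    combined = [uct(n) + pi for n in candidates]  # UCT(i) + pi(h_i)
    if policy == "greedy":
        # i* = argmax_i [UCT(i) + pi(h_i)]
        return candidates[max(range(len(candidates)), key=combined.__getitem__)]
    if policy == "importance":
        # Weight(i) = UCT(i) * pi(h_i); sample proportionally to the weights.
        weights = [max(uct(n) * pi, 1e-9) for n in candidates]  # keep weights positive
        return random.choices(candidates, weights=weights, k=1)[0]
    if policy == "pairwise":
        # Draw a pair (i, j) and keep the member with the higher combined score.
        i, j = random.sample(range(len(candidates)), 2)
        return candidates[i] if combined[i] >= combined[j] else candidates[j]
    raise ValueError(f"unknown policy: {policy}")
```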

2.2.4 Upper Confidence Bound (UCT) Update.

The UCT update guides node refinement by computing $\text{UCT}(i) = Q(i) + C\sqrt{\frac{\ln(N_{\text{parent}})}{N(i) + \epsilon}}$, where $Q(i)$ is the hypothesis reward, $C$ controls exploration, $N_{\text{parent}}$ is the parent’s visit count, $N(i)$ is the node’s visit count, and $\epsilon$ avoids division by zero. The score is then adjusted with the uniform Nash equilibrium probability: $\text{UCT}(i) = Q(i) + C\sqrt{\frac{\ln(N_{\text{parent}})}{N(i) + \epsilon}} + \frac{1}{n}$. The node with the highest score, $i^* = \arg\max_i[\text{Score}(i)]$, is selected for refinement or as the final hypothesis, ensuring robust exploration and exploitation of the hypothesis space.
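A direct transcription of the base formula, with an assumed exploration constant; the uniform Nash term $\frac{1}{n}$ is added at selection time in the `select` sketch above.

```python
import math

C_EXPLORE = 1.41  # exploration constant C (assumed value)
EPS = 1e-6        # epsilon guarding division by zero on unvisited nodes

def uct(node: Node) -> float:
    # UCT(i) = Q(i) + C * sqrt(ln(N_parent) / (N(i) + eps)).
    n_parent = node.parent.visits if node.parent else node.visits
    return node.q + C_EXPLORE * math.sqrt(math.log(max(n_parent, 1)) / (node.visits + EPS))
```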

2.2.5 Expansion.

Following node selection, MC-NEST expands the search tree by generating a refined child node. Given a selected node $n_s$, a new child $n_c$ is created via self-refinement, $n_c = \text{SelfRefine}(n_s)$, which critiques and improves the solution at $n_s$ and stores the refined version in $n_c$: $n_s.\text{children} \leftarrow n_s.\text{children} \cup \{n_c\}$. The critique is formulated as $\text{Critique}(a_s) = \text{LLMCritique}(p, a_s)$, where $p$ is the problem instance and $a_s$ is the answer at $n_s$. The refined answer $a_c = \text{RefineAnswer}(p, a_s, \text{Critique}(a_s))$ is assigned to $n_c$. This structured expansion enables MC-NEST to enhance solutions iteratively, driving systematic search improvement.
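A sketch of the expansion step as two LLM calls, critique then refinement; both prompt texts are illustrative assumptions.

```python
def expand(node: Node, problem: str, llm) -> Node:
    # Critique(a_s) = LLMCritique(p, a_s)
    critique = llm(f"Problem: {problem}\nHypothesis: {node.hypothesis}\n"
                   "Critique this hypothesis: list concrete weaknesses and missing evidence.")
    # a_c = RefineAnswer(p, a_s, Critique(a_s))
    refined = llm(f"Problem: {problem}\nHypothesis: {node.hypothesis}\n"
                  f"Critique: {critique}\nRewrite the hypothesis to address the critique.")
    child = Node(hypothesis=refined, parent=node)
    node.children.append(child)  # n_s.children <- n_s.children U {n_c}
    return child
```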

2.2.6 Backpropagation.

MC-NEST updates node quality scores $Q$ and visit counts from the newly expanded node $n_c$ up to the root, propagating insights from deeper exploration into higher-level decisions. Given a child node $n_c$ and its parent $n_p$, backpropagation updates $Q(n_p)$ using $Q(n_p) = \frac{Q(n_p) + \max(Q(n_c))}{2}$, which balances the exploitation of known values with exploration. The visit count is incremented: $\text{Visit}(n_p) = \text{Visit}(n_p) + 1$. The process recurses from $n_c$ to the root, ensuring informed node selection in MC-NEST.
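A sketch of this recursion over the `Node` fields from the loop sketch; storing $Q$ directly rather than as a running total is a simplification on our part.

```python
def backpropagate(leaf: Node) -> None:
    # Walk from the new child toward the root, averaging the parent's Q with
    # its best child's Q and incrementing the visit count at each level.
    node = leaf.parent
    while node is not None:
        node.q = (node.q + max(c.q for c in node.children)) / 2
        node.visits += 1
        node = node.parent
```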

2.2.7 Self-Refine.

MC-NEST evaluates candidate answers by assigning a reward $R_n$ based on answer quality. Given a node $n$ with answer $A_n$, the reward is computed as $R_n = \text{LLM}(\text{EvaluatePrompt}(P, A_n))$. If $R_n$ exceeds a predefined limit, a penalty is applied:

$$\tilde{R}_n = \begin{cases} R_n, & R_n \leq R_{n\_\text{limit}} \\ R_n - \text{penalty}, & R_n > R_{n\_\text{limit}}. \end{cases}$$

Node statistics are then updated: $\text{TotalReward}_n \mathrel{+}= \tilde{R}_n$ and $\text{VisitCount}_n \mathrel{+}= 1$. This ensures balanced reward scaling, refining MC-NEST’s decision-making.
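A sketch of the reward step; the reward limit, the penalty value, and the assumption that the judge call returns a parseable number in [0, 1] are ours.

```python
R_LIMIT = 0.95   # assumed cap on raw rewards
PENALTY = 0.10   # assumed penalty for exceeding the cap

def evaluate(node: Node, problem: str, llm) -> float:
    # R_n = LLM(EvaluatePrompt(P, A_n)), then the piecewise penalty gives R~_n.
    raw = float(llm(f"Score the following hypothesis for validity, novelty, and "
                    f"coherence as a single number in [0, 1].\n"
                    f"Problem: {problem}\nHypothesis: {node.hypothesis}"))
    return raw - PENALTY if raw > R_LIMIT else raw
```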

2.2.8 Self-Evaluation.

MC-NEST iteratively improves candidate solutions via LLM-based critique and refinement. Given a node $n$ with answer $A_n$, a critique is generated as $C_n = \text{LLM}(P + A_n)$, and the answer is then refined using that critique: $A_{n+1} = \text{LLM}(P + A_n + C_n)$. The refined answer $A_{n+1}$ is stored in a new child node, iteratively enhancing solutions in MC-NEST.

2.2.9 Human-AI Collaboration.

MC-NEST is designed to facilitate iterative human-AI collaboration, enabling researchers to refine and validate hypotheses dynamically. Upon generating a final hypothesis, MC-NEST enables human experts to evaluate its novelty, clarity, significance, and verifiability, with the option to iteratively refine the process as needed based on researcher input. This iterative loop ensures that the generated hypotheses align with domain-specific knowledge and scientific rigor while also incorporating human intuition and expertise. By integrating human judgment at critical stages, MC-NEST not only enhances the reliability of its outputs but also fosters a collaborative environment where AI augments human creativity rather than replacing it.

3 Experiments

In our experiments, we used ZSCoT prompting as our base prompting style with the GPT-4o [1], DeepSeek-R1-Distill-Qwen-32B [6], and DeepSeek-R1-Distill-Qwen-7B [6] LLMs.

3.1 Evaluation Setup

We evaluated MC-NEST using GPT-4o, DeepSeek-R1-Distill-Qwen-32B, and DeepSeek-R1-Distill-Qwen-7B, with GPT-4o serving as a strong general-purpose baseline due to its proficiency in hypothesis generation [16]. The DeepSeek models (32B and 7B parameters) provide insights into the scalability and efficiency of MC-NEST across different distilled LLM sizes. To ensure consistent and systematic evaluation, we employed three prompting styles: zero-shot (ZS) [12], zero-shot chain-of-thought (ZSCoT) [10], and few-shot (FS) [11], using 2-shot, 3-shot, and 5-shot configurations with both closed-source and open-source LLMs to assess the impact of prompting.
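For illustration, the three styles can be built as below; these templates are assumptions for exposition, and the exact prompts used in the experiments are listed in Appendix A.2.

```python
def build_prompt(problem: str, style: str, shots=()) -> str:
    if style == "ZS":
        return f"Generate a research hypothesis for: {problem}"
    if style == "ZSCoT":
        return f"Generate a research hypothesis for: {problem}\nLet's think step by step."
    if style == "FS":  # 2-, 3-, or 5-shot depending on how many demos are passed
        demos = "\n\n".join(f"Background: {b}\nHypothesis: {h}" for b, h in shots)
        return f"{demos}\n\nBackground: {problem}\nHypothesis:"
    raise ValueError(f"unknown style: {style}")
```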

Table 1: Overview of the datasets used in our experiments.

Dataset | Source | Domain | Annotation | Count
LLM4BioHypoGen [13] | Text | Biomedicine | Manual | 200
MOOSE [21] | Text | Social Science | Manual | 50
LLM4CSHypoGen (Ours) | Text | Computer Science | Manual | 150

3.2 Datasets

We evaluated MC-NEST on three datasets spanning social science, biomedicine, and computer science. Each dataset was carefully curated to ensure high-quality annotations and relevance to hypothesis generation tasks. Table 1 provides an overview of the datasets used in our experiments. 1) Social Science Dataset: The MOOSE dataset [21] consists of 50 social science research papers paired with raw web corpora (e.g., news articles, Wikipedia). This dataset challenges systems to generate novel hypotheses without relying on pre-existing scientific knowledge, emphasizing the open-domain nature of hypothesis generation. 2) Biomedicine Dataset: The LLM4BioHypoGen dataset [13] contains 200 background-hypothesis pairs extracted from biomedical research papers. It is divided into training, seen, and unseen test sets based on publication dates to prevent data contamination, ensuring robust evaluation of hypothesis generation capabilities. 3) Computer Science Dataset: Our LLM4CSHypoGen dataset comprises 150 research papers (2024–2025) with structured content, including hypotheses, methods, and results. Each entry was cross-checked by domain experts to ensure accuracy and reliability, providing a robust foundation for evaluating hypothesis generation in computer science.

Table 2: Prompting-strategy results on the Social Science dataset (BertScore columns in %; quality metrics on a 1-3 scale).

LLM | Size | Prompt | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | ZS | 85.97 | 85.47 | 85.71 | 1.90 | 1.86 | 2.08 | 2.62 | 2.12
GPT-4o | - | ZSCoT | 79.86 | 84.86 | 82.27 | 2.16 | 2.46 | 2.72 | 2.50 | 2.46
GPT-4o | - | 2FS | 83.37 | 86.83 | 85.06 | 2.00 | 2.22 | 2.22 | 2.62 | 2.27
GPT-4o | - | 3FS | 83.30 | 86.81 | 85.01 | 2.06 | 2.08 | 2.18 | 2.48 | 2.20
GPT-4o | - | 5FS | 83.27 | 86.74 | 84.96 | 2.02 | 2.08 | 2.22 | 2.52 | 2.21
DeepSeek | 32B | ZS | 83.25 | 86.16 | 84.67 | 2.10 | 2.40 | 2.60 | 2.65 | 2.44
DeepSeek | 32B | ZSCoT | 78.76 | 84.99 | 81.74 | 2.35 | 2.75 | 2.75 | 2.75 | 2.65
DeepSeek | 32B | 2FS | 82.71 | 86.16 | 84.39 | 2.10 | 2.70 | 2.65 | 2.65 | 2.52
DeepSeek | 32B | 3FS | 82.58 | 86.03 | 84.27 | 2.25 | 2.65 | 2.45 | 2.70 | 2.51
DeepSeek | 32B | 5FS | 82.04 | 86.00 | 83.96 | 2.30 | 2.45 | 2.50 | 2.75 | 2.50
DeepSeek | 7B | ZS | 82.74 | 85.51 | 84.09 | 2.10 | 2.35 | 2.35 | 2.55 | 2.34
DeepSeek | 7B | ZSCoT | 78.10 | 84.16 | 81.00 | 2.20 | 2.70 | 2.70 | 2.65 | 2.56
DeepSeek | 7B | 2FS | 84.56 | 86.60 | 85.56 | 2.10 | 2.40 | 2.60 | 2.70 | 2.45
DeepSeek | 7B | 3FS | 82.85 | 85.61 | 84.17 | 2.10 | 2.25 | 2.65 | 2.60 | 2.40
DeepSeek | 7B | 5FS | 83.59 | 86.06 | 84.80 | 2.20 | 2.25 | 2.50 | 2.40 | 2.34

Table 3: MC-NEST results on the Social Science dataset (BertScore columns in %; quality metrics on a 1-3 scale).

LLM | Size | Rollout | Sampling | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | 4 | Greedy | 80.71 | 85.44 | 82.99 | 2.58 | 2.84 | 2.70 | 2.88 | 2.75
GPT-4o | - | 4 | IS | 80.72 | 85.43 | 83.00 | 2.60 | 2.80 | 2.78 | 2.94 | 2.78
GPT-4o | - | 4 | PIS | 80.65 | 85.42 | 82.95 | 2.74 | 2.76 | 2.70 | 2.92 | 2.78
GPT-4o | - | 8 | Greedy | 80.50 | 85.14 | 82.74 | 2.70 | 2.80 | 2.80 | 2.94 | 2.81 ↑
GPT-4o | - | 8 | IS | 80.33 | 85.13 | 82.65 | 2.64 | 2.82 | 2.64 | 2.90 | 2.75
GPT-4o | - | 8 | PIS | 80.55 | 85.16 | 82.78 | 2.74 | 2.82 | 2.80 | 2.84 | 2.80
DeepSeek | 32B | 4 | Greedy | 80.87 | 85.25 | 82.99 | 2.55 | 3.00 | 2.80 | 2.95 | 2.83
DeepSeek | 32B | 4 | IS | 80.38 | 85.36 | 82.79 | 2.70 | 2.85 | 2.85 | 2.90 | 2.83
DeepSeek | 32B | 4 | PIS | 80.88 | 85.34 | 83.04 | 2.55 | 2.85 | 2.75 | 2.90 | 2.76
DeepSeek | 32B | 8 | Greedy | 80.53 | 85.24 | 82.81 | 2.70 | 2.95 | 3.00 | 3.00 | 2.91 ↑
DeepSeek | 32B | 8 | IS | 80.54 | 85.38 | 82.89 | 2.65 | 2.85 | 2.85 | 2.95 | 2.83
DeepSeek | 32B | 8 | PIS | 80.15 | 84.98 | 82.49 | 2.75 | 2.95 | 2.95 | 2.95 | 2.90
DeepSeek | 7B | 4 | Greedy | 80.61 | 85.16 | 82.81 | 2.55 | 2.60 | 2.90 | 2.95 | 2.75
DeepSeek | 7B | 4 | IS | 80.08 | 84.66 | 82.31 | 2.65 | 2.85 | 2.75 | 2.85 | 2.78 ↑
DeepSeek | 7B | 4 | PIS | 80.45 | 85.10 | 82.70 | 2.50 | 2.80 | 2.65 | 2.90 | 2.71
DeepSeek | 7B | 8 | Greedy | 80.78 | 85.05 | 82.85 | 2.45 | 2.75 | 2.60 | 2.85 | 2.66
DeepSeek | 7B | 8 | IS | 80.60 | 85.05 | 82.76 | 2.65 | 2.65 | 2.65 | 2.80 | 2.69
DeepSeek | 7B | 8 | PIS | 80.54 | 84.92 | 82.67 | 2.55 | 2.85 | 2.80 | 2.85 | 2.76

3.3 Evaluation Metrics

We evaluate generated hypotheses using both automatic and human assessments. For automatic evaluation, GPT-3.5 scores hypotheses on four key aspects: novelty, clarity, significance, and verifiability [17]. Novelty and verifiability are prioritized as they align with the philosophical foundations of hypothetical induction, while clarity and significance reflect the practical utility of hypotheses for researchers. Conventional metrics like BERTScore [22] are reported for completeness but are not the focus, as they do not capture these task-specific goals. For human evaluation, three domain experts (professors, postdocs, and PhD students) blindly assess 100 randomly selected hypotheses from the baseline and proposed methods, using a standardized 3-point scale. Novelty is emphasized over verifiability, as even imperfect hypotheses can inspire scientific exploration [23], whereas non-novel hypotheses offer limited utility. We also analyze the correlation between GPT-3.5 and expert evaluations, suggesting GPT-3.5’s potential as a reliable evaluator for machine-generated hypotheses [2].
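A sketch of the automatic scoring step, assuming a `judge_llm` callable and a single-digit reply format; the judging prompt wording here is our assumption.

```python
ASPECTS = ("novelty", "clarity", "significance", "verifiability")

def auto_score(hypothesis: str, judge_llm) -> "dict[str, int]":
    # One 1-3 rating per aspect, mirroring the 3-point scale used by the experts.
    scores = {}
    for aspect in ASPECTS:
        reply = judge_llm(f"Rate the {aspect} of the following hypothesis on a "
                          f"1-3 scale. Answer with a single digit.\nHypothesis: {hypothesis}")
        scores[aspect] = int(reply.strip()[0])
    return scores
```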

Table 4: Prompting-strategy results on the Computer Science dataset (BertScore columns in %; quality metrics on a 1-3 scale; "-" marks a missing value).

LLM | Size | Prompt | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | ZS | 88.05 | 88.35 | 88.19 | 2.10 | 2.12 | 2.42 | 2.86 | 2.38
GPT-4o | - | ZSCoT | 81.58 | 87.42 | 84.39 | 2.31 | 2.29 | 2.28 | 2.94 | 2.60
GPT-4o | - | 2FS | 84.84 | 88.79 | 86.76 | 2.19 | 2.05 | 2.53 | 2.91 | 2.42
GPT-4o | - | 3FS | 84.83 | 88.88 | 86.80 | 2.16 | 2.05 | 2.45 | 2.88 | 2.38
GPT-4o | - | 5FS | 85.06 | 88.88 | 86.93 | 2.17 | 2.03 | 2.53 | 2.88 | 2.40
DeepSeek | 32B | ZS | 87.81 | 89.61 | 88.69 | 2.25 | 2.15 | 2.60 | 2.90 | 2.48
DeepSeek | 32B | ZSCoT | 80.75 | 88.30 | 84.34 | 2.50 | 2.55 | 2.90 | 3.00 | 2.74
DeepSeek | 32B | 2FS | 84.46 | 89.46 | 86.87 | 2.40 | 2.40 | 2.75 | 2.85 | 2.60
DeepSeek | 32B | 3FS | 84.67 | 89.57 | 87.04 | 2.40 | 2.40 | 2.80 | 2.85 | 2.61
DeepSeek | 32B | 5FS | 84.15 | 89.39 | 86.68 | 2.55 | 2.50 | 2.75 | 2.65 | 2.61
DeepSeek | 7B | ZS | 86.22 | 89.17 | 87.66 | 2.20 | 2.20 | 2.75 | 2.95 | 2.53
DeepSeek | 7B | ZSCoT | 79.47 | 87.54 | - | 2.35 | 2.70 | 2.80 | 3.00 | 2.71
DeepSeek | 7B | 2FS | 85.63 | 88.77 | 87.15 | 2.00 | 1.95 | 2.45 | 2.75 | 2.29
DeepSeek | 7B | 3FS | 86.68 | 89.60 | 88.11 | 2.15 | 2.05 | 2.80 | 2.95 | 2.49
DeepSeek | 7B | 5FS | 85.61 | 89.09 | 87.29 | 2.20 | 2.25 | 2.45 | 2.85 | 2.44

Table 5: MC-NEST results on the Computer Science dataset (BertScore columns in %; quality metrics on a 1-3 scale).

LLM | Size | Rollout | Sampling | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | 4 | Greedy | 82.64 | 88.24 | 85.35 | 2.68 | 2.67 | 2.85 | 3.00 | 2.80
GPT-4o | - | 4 | IS | 82.81 | 88.11 | 85.37 | 2.71 | 2.58 | 2.88 | 3.00 | 2.79
GPT-4o | - | 4 | PIS | 82.65 | 88.22 | 85.34 | 2.71 | 2.62 | 2.83 | 2.99 | 2.76
GPT-4o | - | 8 | Greedy | 82.72 | 88.11 | 85.32 | 2.72 | 2.59 | 2.85 | 2.99 | 2.79
GPT-4o | - | 8 | IS | 82.60 | 88.11 | 85.26 | 2.73 | 2.57 | 2.84 | 2.97 | 2.78
GPT-4o | - | 8 | PIS | 82.54 | 88.14 | 85.25 | 2.77 | 2.65 | 2.85 | 2.99 | 2.82 ↑
DeepSeek | 32B | 4 | Greedy | 83.19 | 88.39 | 85.66 | 2.55 | 2.65 | 2.85 | 3.00 | 2.76
DeepSeek | 32B | 4 | IS | 82.99 | 88.49 | 85.64 | 2.65 | 2.60 | 2.95 | 3.00 | 2.80
DeepSeek | 32B | 4 | PIS | 83.07 | 88.63 | 85.75 | 2.55 | 2.35 | 2.85 | 3.00 | 2.69
DeepSeek | 32B | 8 | Greedy | 82.46 | 88.24 | 85.25 | 2.60 | 2.65 | 2.90 | 3.00 | 2.79
DeepSeek | 32B | 8 | IS | 83.02 | 88.50 | 85.66 | 2.65 | 2.60 | 2.90 | 3.00 | 2.79
DeepSeek | 32B | 8 | PIS | 82.81 | 88.42 | 85.51 | 2.65 | 2.75 | 3.00 | 3.00 | 2.85 ↑
DeepSeek | 7B | 4 | Greedy | 83.46 | 88.59 | 85.94 | 2.60 | 2.60 | 2.90 | 3.00 | 2.78
DeepSeek | 7B | 4 | IS | 83.40 | 88.41 | 85.87 | 2.55 | 2.45 | 2.75 | 3.00 | 2.69
DeepSeek | 7B | 4 | PIS | 83.35 | 88.63 | 85.90 | 2.65 | 2.50 | 2.75 | 3.00 | 2.73
DeepSeek | 7B | 8 | Greedy | 82.88 | 88.55 | 85.61 | 2.75 | 2.70 | 2.80 | 3.00 | 2.81 ↑
DeepSeek | 7B | 8 | IS | 83.13 | 88.49 | 85.72 | 2.65 | 2.75 | 2.85 | 3.00 | 2.81 ↑
DeepSeek | 7B | 8 | PIS | 82.03 | 87.90 | 84.86 | 2.65 | 2.65 | 2.80 | 2.95 | 2.76

Table 6: Prompting-strategy results on the Biomedicine dataset (BertScore columns in %; quality metrics on a 1-3 scale).

LLM | Size | Prompt | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | ZS | 87.53 | 85.59 | 86.54 | 1.88 | 2.17 | 2.32 | 2.59 | 2.24
GPT-4o | - | ZSCoT | 81.74 | 86.20 | 83.91 | 2.31 | 2.49 | 2.87 | 2.83 | 2.62
GPT-4o | - | 2FS | 87.11 | 88.39 | 87.74 | 1.98 | 2.17 | 2.35 | 2.72 | 2.31
GPT-4o | - | 3FS | 87.07 | 88.50 | 87.75 | 2.02 | 2.21 | 2.31 | 2.68 | 2.30
GPT-4o | - | 5FS | 87.08 | 88.49 | 87.77 | 2.04 | 2.20 | 2.35 | 2.69 | 2.32
DeepSeek | 32B | ZS | 85.76 | 85.14 | 85.43 | 1.95 | 2.35 | 2.65 | 2.65 | 2.40
DeepSeek | 32B | ZSCoT | 80.07 | 85.68 | 82.77 | 2.55 | 2.80 | 2.90 | 2.95 | 2.80
DeepSeek | 32B | 2FS | 86.13 | 88.06 | 87.08 | 2.10 | 2.40 | 2.65 | 2.75 | 2.48
DeepSeek | 32B | 3FS | 86.51 | 88.24 | 87.36 | 2.15 | 2.40 | 2.40 | 2.75 | 2.43
DeepSeek | 32B | 5FS | 85.76 | 87.98 | 86.85 | 2.15 | 2.55 | 2.45 | 2.50 | 2.41
DeepSeek | 7B | ZS | 83.19 | 85.80 | 84.44 | 1.90 | 1.86 | 2.08 | 2.62 | 2.12
DeepSeek | 7B | ZSCoT | 80.62 | 85.47 | 82.96 | 2.06 | 2.08 | 2.18 | 2.48 | 2.20
DeepSeek | 7B | 2FS | 85.12 | 86.39 | 85.74 | 2.02 | 2.08 | 2.22 | 2.52 | 2.21
DeepSeek | 7B | 3FS | 86.20 | 87.22 | 86.70 | 2.16 | 2.46 | 2.72 | 2.50 | 2.46
DeepSeek | 7B | 5FS | 85.03 | 86.53 | 85.75 | 2.00 | 2.22 | 2.22 | 2.62 | 2.27

Table 7: MC-NEST results on the Biomedicine dataset (BertScore columns in %; quality metrics on a 1-3 scale).

LLM | Size | Rollout | Sampling | Precision | Recall | F1 | Novelty | Clarity | Significance | Verifiability | Avg
GPT-4o | - | 4 | Greedy | 82.63 | 86.16 | 84.35 | 2.70 | 2.79 | 2.86 | 2.93 | 2.82
GPT-4o | - | 4 | IS | 82.52 | 86.24 | 84.29 | 2.64 | 2.79 | 2.81 | 2.92 | 2.79
GPT-4o | - | 4 | PIS | 82.55 | 86.16 | 84.32 | 2.70 | 2.76 | 2.87 | 2.93 | 2.82
GPT-4o | - | 8 | Greedy | 82.53 | 86.11 | 84.29 | 2.67 | 2.81 | 2.83 | 2.95 | 2.82
GPT-4o | - | 8 | IS | 82.08 | 86.04 | 84.00 | 2.77 | 2.76 | 2.86 | 2.97 | 2.84 ↑
GPT-4o | - | 8 | PIS | 82.17 | 86.05 | 84.06 | 2.80 | 2.73 | 2.89 | 2.95 | 2.84 ↑
DeepSeek | 32B | 4 | Greedy | 82.85 | 86.04 | 84.41 | 2.65 | 2.90 | 2.85 | 2.95 | 2.84
DeepSeek | 32B | 4 | IS | 82.25 | 85.91 | 84.04 | 2.70 | 2.95 | 3.00 | 2.85 | 2.87 ↑
DeepSeek | 32B | 4 | PIS | 82.19 | 85.88 | 83.99 | 2.75 | 2.75 | 2.80 | 2.95 | 2.81
DeepSeek | 32B | 8 | Greedy | 82.47 | 85.99 | 84.19 | 2.55 | 2.80 | 2.95 | 2.95 | 2.81
DeepSeek | 32B | 8 | IS | 82.15 | 86.17 | 84.11 | 2.75 | 2.90 | 2.85 | 2.90 | 2.85
DeepSeek | 32B | 8 | PIS | 82.49 | 85.75 | 84.08 | 2.60 | 2.60 | 2.85 | 2.95 | 2.75
DeepSeek | 7B | 4 | Greedy | 82.59 | 85.87 | 84.19 | 2.60 | 2.75 | 2.75 | 2.80 | 2.73
DeepSeek | 7B | 4 | IS | 82.71 | 85.72 | 84.18 | 2.60 | 2.80 | 2.80 | 2.85 | 2.76
DeepSeek | 7B | 4 | PIS | 82.66 | 85.71 | 84.15 | 2.50 | 2.75 | 2.75 | 2.85 | 2.71
DeepSeek | 7B | 8 | Greedy | 82.25 | 85.68 | 83.92 | 2.60 | 2.75 | 2.80 | 2.90 | 2.76
DeepSeek | 7B | 8 | IS | 82.01 | 85.33 | 83.63 | 2.65 | 2.75 | 2.85 | 2.95 | 2.80
DeepSeek | 7B | 8 | PIS | 82.39 | 85.88 | 84.09 | 2.80 | 3.00 | 2.85 | 2.85 | 2.88 ↑

4 Results and Analyses

In this section, we present the results of our experiments evaluating the performance of prompting strategies and MC-NEST across three datasets: Social Science, Computer Science, and Biomedicine. We analyze the impact of different prompting methods (Zero-Shot, Few-Shot, and Zero-Shot Chain-of-Thought) and MC-NEST sampling strategies (Greedy, Importance Sampling, and Pairwise Importance Sampling) on hypothesis generation quality, as measured by BERTScore and qualitative metrics such as novelty, clarity, significance, and verifiability.

4.1 Social Science Dataset

Prompting Strategies.

Table 2 summarizes the performance of different prompting strategies on the Social Science dataset. ZSCoT consistently outperforms the ZS and FS approaches across all evaluated LLMs. For DeepSeek-32B, ZSCoT achieves an average score of 2.65, compared to 2.44 for ZS and 2.52 for 2-FS. Similarly, DeepSeek-7B with ZSCoT attains an average score of 2.56, outperforming ZS with 2.34 and 2-FS with 2.45. GPT-4o also shows significant improvements with ZSCoT, achieving an average score of 2.46 compared to 2.12 for ZS.

MC-NEST Sampling Strategies.

Table 3 presents the results of MC-NEST evaluations using Greedy, Importance Sampling, and Pairwise Importance Sampling. For GPT-4o, Greedy sampling with an eight-step rollout achieves the highest overall score of 2.81. Pairwise Importance Sampling, however, excels in novelty with 2.74 while maintaining competitive clarity and significance scores. DeepSeek-32B shows similar trends, with Greedy sampling achieving the best overall results with 2.91 at an eight-step rollout. For DeepSeek-7B, Importance Sampling performs best at a four-step rollout with 2.78, while Pairwise Importance Sampling achieves balanced performance at eight steps with 2.76. These results highlight the effectiveness of MC-NEST in enhancing the quality of social science hypothesis generation.

4.2 Computer Science Dataset

Prompting Strategies.

As shown in Table 4, ZSCoT again demonstrates the strongest performance on the Computer Science dataset. DeepSeek-32B achieves an average score of 2.74 with ZSCoT, compared to 2.48 with ZS and 2.60 with 2-FS. Similarly, DeepSeek-7B with ZSCoT attains an average score of 2.71, outperforming ZS with 2.53 and 2-FS with 2.29. Notably, verifiability scores improve significantly with ZSCoT, reaching 3.00 for DeepSeek-32B, highlighting the usefulness of structured reasoning for factual consistency.

MC-NEST Sampling Strategies.

Table 5 presents the results of MC-NEST evaluations on the Computer Science dataset. For GPT-4o, Pairwise Importance Sampling outperforms other strategies at an eight-step rollout, achieving an average score of 2.82. DeepSeek-32B achieves its highest score with Pairwise Importance Sampling at an eight-step rollout with 2.85, while DeepSeek-7B performs best with Greedy sampling at a four-step rollout with 2.78. These results suggest that longer rollouts and adaptive sampling strategies enhance computer science hypothesis generation quality.

4.3 Biomedicine Dataset

Prompting Strategies.

Table 6 summarizes the performance of prompting strategies on the Biomedicine dataset. ZSCoT consistently improves performance across LLMs, with DeepSeek-32B achieving an average score of 2.80, compared to 2.40 with ZS and 2.48 with 2-FS. Qualitative metrics, such as novelty and significance, also show substantial improvements with ZSCoT. For instance, DeepSeek-32B with ZSCoT achieves a novelty score of 2.55 and a significance score of 2.90, compared to 1.95 and 2.65 with ZS, respectively.

MC-NEST Sampling Strategies.

Table 7 presents the results of MC-NEST evaluations on the Biomedicine dataset. For GPT-4o, Importance Sampling and Pairwise Importance Sampling perform best at an eight-step rollout, each achieving an average score of 2.84. DeepSeek-32B achieves its highest score with Importance Sampling at a four-step rollout with 2.87, while DeepSeek-7B performs best with Pairwise Importance Sampling at eight steps with 2.88. These results demonstrate the importance of adaptive sampling strategies in MC-NEST for optimizing hypothesis generation in the biomedicine domain.

Our experiments demonstrate that structured reasoning and adaptive sampling strategies with MC-NEST significantly enhance hypothesis generation quality across domains. Increasing rollout lengths generally improves performance, with Pairwise Importance Sampling offering a competitive balance between novelty and verifiability. These findings underscore the importance of MC-NEST with sampling strategies for optimizing LLM performance in scientific hypothesis generation.

Table 8: Human evaluation results on a 1-3 scale across datasets and LLMs.

Dataset | LLM | Size | Prompt | Novelty | Clarity | Significance | Verifiability | Avg
Social Science | GPT-4o | - | ZSCoT | 2.33 | 3.00 | 2.66 | 2.49 | 2.62
Social Science | GPT-4o | - | Greedy | 2.16 | 1.66 | 2.16 | 2.50 | 2.12 ↓
Biomedicine | GPT-4o | - | ZSCoT | 1.66 | 2.33 | 2.50 | 1.66 | 2.03
Biomedicine | GPT-4o | - | IS | 1.83 | 2.50 | 2.83 | 2.33 | 2.37 ↑
Computer Science | GPT-4o | - | ZSCoT | 1.66 | 2.50 | 2.66 | 2.66 | 2.37
Computer Science | GPT-4o | - | PIS | 1.85 | 2.50 | 2.66 | 2.50 | 2.38 ↑
Social Science | DeepSeek | 32B | ZSCoT | 2.16 | 1.83 | 2.66 | 2.16 | 2.20
Social Science | DeepSeek | 32B | Greedy | 2.16 | 1.83 | 2.49 | 2.50 | 2.25 ↑
Biomedicine | DeepSeek | 32B | ZSCoT | 2.66 | 1.66 | 2.66 | 2.33 | 2.32
Biomedicine | DeepSeek | 32B | IS | 2.41 | 2.17 | 2.66 | 2.50 | 2.44 ↑
Computer Science | DeepSeek | 32B | ZSCoT | 2.33 | 2.17 | 2.66 | 2.50 | 2.42
Computer Science | DeepSeek | 32B | PIS | 2.50 | 2.16 | 2.66 | 2.17 | 2.37 ↓
Social Science | DeepSeek | 7B | ZSCoT | 2.33 | 1.83 | 2.66 | 2.33 | 2.29
Social Science | DeepSeek | 7B | IS | 1.83 | 2.33 | 2.66 | 2.83 | 2.41 ↑
Biomedicine | DeepSeek | 7B | 3FS | 2.16 | 2.50 | 2.83 | 2.66 | 2.54
Biomedicine | DeepSeek | 7B | PIS | 1.67 | 2.83 | 2.83 | 2.33 | 2.42 ↓
Computer Science | DeepSeek | 7B | ZSCoT | 1.66 | 2.50 | 2.50 | 2.50 | 2.29
Computer Science | DeepSeek | 7B | IS | 1.83 | 2.50 | 2.66 | 2.50 | 2.37 ↑

4.3.1 Human Evaluation and Case Study.

The human evaluation results in Table 8 compare MC-NEST’s Greedy, Importance Sampling, and Pairwise Importance Sampling strategies against the strongest prompting baselines. For GPT-4o, the ZSCoT baseline achieved the highest average score in Social Science with 2.62, while MC-NEST achieved the best performance in Biomedicine with Importance Sampling (2.37) and in Computer Science with Pairwise Importance Sampling (2.38). MC-NEST with Importance Sampling also outperformed the baseline in Biomedicine with DeepSeek-32B, scoring 2.44, and in Social Science with DeepSeek-7B, scoring 2.41. Pairwise Importance Sampling slightly underperformed the ZSCoT baseline for DeepSeek-32B in Computer Science, scoring 2.37 against 2.42. Overall, the human judgments favor MC-NEST’s sampling strategies in most, though not all, dataset-LLM combinations.

4.3.2 Usefulness of the MC-NEST Framework.

MC-NEST is a powerful framework for hypothesis generation, combining MCTS with Nash Equilibrium strategies to dynamically balance exploration and exploitation. It iteratively refines hypotheses through self-critique and validation, ensuring novelty and empirical grounding. In experiments, MC-NEST outperformed baselines across multiple domains, achieving higher BertScore and qualitative metrics (novelty, clarity, significance, and verifiability). For example, in optimizing synthetic peptide sequences for nuclear localization and solubility, MC-NEST proposed experimentally validated modifications. Its ability to incorporate emerging scientific literature and adapt to new discoveries distinguishes it from frameworks lacking iterative refinement. These features make MC-NEST a versatile and effective tool for advancing scientific discovery through automated hypothesis generation.

Rollout Strategy for MC-NEST Hypothesis Generation.

Our experiments demonstrate that longer rollouts consistently enhance the performance of MC-NEST across datasets and sampling strategies. Increasing the rollout length from four to eight steps improves both the BERTScore and qualitative metrics, such as novelty and verifiability. Pairwise Importance Sampling, in particular, benefits from extended rollouts, achieving the highest scores in novelty and significance while maintaining competitive performance in other metrics. These results indicate that longer rollouts enable a more comprehensive exploration of the hypothesis space, leading to higher-quality and more innovative solutions.

4.4 Ethical Considerations

Integrating LLMs into hypothesis generation introduces sociotechnical and intellectual challenges, as over-reliance on LLMs risks stifling human creativity and expertise. We advocate for structured human-AI collaboration, where LLMs augment human creativity, supported by findings that human refinement of LLM-generated hypotheses yields superior outcomes. Transparent documentation of LLM usage—including model details, training data, and frameworks—is essential for fair credit attribution and fostering trust in AI-assisted research. Ethical concerns include misuse, low-quality outputs, and unoriginal hypotheses that could overwhelm academic venues, necessitating rigorous scrutiny to ensure novelty, testability, and grounding in sound principles. In high-stakes domains, proactive measures like Reinforcement Learning from Human Feedback (RLHF) and adversarial robustness are critical to mitigate risks of unethical or harmful research. Additionally, LLMs’ tendency to produce hypotheses clustered around common training data patterns risks reducing diversity and novelty, highlighting the need for future work to enhance output diversity through model refinement or frameworks that explicitly encourage unconventional ideas.

5 Conclusion and Limitations

We introduced MC-NEST, a novel framework integrating Monte Carlo Tree Search with Nash Equilibrium strategies to enhance hypothesis generation. MC-NEST outperforms baselines across domains, excelling in quantitative metrics (e.g., BERTScore) and qualitative measures (e.g., novelty, clarity, significance, and verifiability). Adaptive sampling and iterative self-refinement enable MC-NEST to balance exploration and exploitation, generating innovative and empirically grounded hypotheses. Our findings emphasize the value of structured human-AI collaboration, where LLMs augment human creativity rather than replace it. Future work should focus on enhancing diversity and addressing socio-technical challenges. Limitations include the dataset’s focus on computer science papers, though each is curated and annotated by domain experts, ensuring academic rigor. MC-NEST’s applicability across diverse domains is a challenge, but it is the first framework to integrate MCTS with LLMs for hypothesis generation in fields like biomedicine, social science, and computer science. While the framework automates hypothesis generation with human-AI collaboration, future work will adapt it to controlled settings by incorporating researcher-defined inputs, ensuring versatility.

Author Contributions

Gollam Rabby developed the initial idea, designed the experiments, and contributed to the manuscript writing. Diyana Muhammed conducted the experiments. Prasenjit Mitra provided feedback on the initial idea and supported the manuscript writing. Sören Auer contributed to the initial idea and provided support in the manuscript writing.

Acknowledgements

We acknowledge the support of the KISSKI project (funding no. 01IS22093C) for providing computational resources, which will enable us to extend this research in the future.

References

  • [1]Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • [2]Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [3]Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games 4(1), 1–43 (2012)
  • [4]Dijkstra, E.W.: A note on two problems in connexion with graphs. In: Edsger Wybe Dijkstra: his life, work, and legacy, pp. 287–290 (2022)
  • [5]Flaspohler, G.E.: Balancing Exploration and Exploitation: Task-Targeted Exploration for Scientific Decision-Making. Ph.D. thesis, Massachusetts Institute of Technology (2022)
  • [6]Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  • [7]Hahn, S., Schlenstedt, G.: Importin β-type nuclear transport receptors have distinct binding affinities for ran–gtp. Biochemical and Biophysical Research Communications 406(3), 383–388 (2011)
  • [8]Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al.: Highly accurate protein structure prediction with alphafold. nature 596(7873), 583–589 (2021)
  • [9]Kocsis, L., Szepesvári, C.: Bandit based monte-carlo planning. In: European conference on machine learning. pp. 282–293. Springer (2006)
  • [10]Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)
  • [11]Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
  • [12]Larochelle, H., Erhan, D., Bengio, Y.: Zero-data learning of new tasks. In: Proceedings of the AAAI Conference on Artificial Intelligence (2008)
  • [13]Qi, B., Zhang, K., Tian, K., Li, H., Chen, Z.R., Zeng, S., Hua, E., Jinfang, H., Zhou, B.: Large language models as biomedical hypothesis generators: a comprehensive evaluation. arXiv preprint arXiv:2407.08940 (2024)
  • [14]Rabby, G., Auer, S., D’Souza, J., Oelen, A.: Fine-tuning and prompt engineering with cognitive knowledge graphs for scholarly knowledge organization. arXiv preprint arXiv:2409.06433 (2024)
  • [15]Rabby, G., Keya, F., Zamil, P., Auer, S.: Mc-nest–enhancing mathematical reasoning in large language models with a monte carlo nash equilibrium self-refine tree. arXiv preprint arXiv:2411.15645 (2024)
  • [16]Rosoł, M., Gąsior, J.S., Łaba, J., Korzeniewski, K., Młyńczak, M.: Evaluation of the performance of gpt-3.5 and gpt-4 on the medical final examination. MedRxiv pp. 2023–06 (2023)
  • [17]Si, C., Yang, D., Hashimoto, T.: Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers. arXiv preprint arXiv:2409.04109 (2024)
  • [18]Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. nature 529(7587), 484–489 (2016)
  • [19]Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., et al.: Scientific discovery in the age of artificial intelligence. Nature 620(7972), 47–60 (2023)
  • [20]Xiong, G., Xie, E., Shariatmadari, A.H., Guo, S., Bekiranov, S., Zhang, A.: Improving scientific hypothesis generation with knowledge grounded large language models. arXiv preprint arXiv:2411.02382 (2024)
  • [21]Yang, Z., Du, X., Li, J., Zheng, J., Poria, S., Cambria, E.: Large language models for automated open-domain scientific hypotheses discovery (2023)
  • [22]Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019)
  • [23]Zhou, Y., Liu, H., Srivastava, T., Mei, H., Tan, C.: Hypothesis generation with large language models. In: Peled-Cohen, L., Calderon, N., Lissak, S., Reichart, R. (eds.) Proceedings of the 1st Workshop on NLP for Science (NLP4Science). pp. 117–139. Association for Computational Linguistics, Miami, FL, USA (Nov 2024). https://doi.org/10.18653/v1/2024.nlp4science-1.10, https://aclanthology.org/2024.nlp4science-1.10/

A. Appendix

In the following sections, we report additional details on the following topics:

  1. All Unique Keys Found in LLM4CSHypoGen Dataset (Section A.1)

  2. Prompts in Experiment (Section A.2)

A.1 All Unique Keys Found in LLM4CSHypoGen Dataset

Column Name | Description
DOI | The digital object identifier for the paper.
Title | The title of the research paper.
Authors_names | Names of the authors of the paper.
Authors_orcid | ORCID identifiers of the authors.
Paper_domain | The domain or field of research the paper belongs to.
Research_Idea | The central idea or motivation behind the research.
Problem_Statement | The specific research problem being addressed.
Hypothesis | The hypothesis formulated in the research.
Literature_Review | Summary of previous research relevant to the study.
Abstract | A concise summary of the research paper.
Method | The methodology used in the research.
Summarized_Method | A concise summary of the methodology.
Results | The findings of the research study.
Summarized_Results | A brief summary of the results.
Conclusion | The final conclusions drawn from the research.
Summarized_Conclusion | A concise summary of the conclusion.

A.2 Prompts in Experiment
