Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li
Abstract
Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (Rpc). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that Rpc not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that Rpc can significantly improve reasoning performance, sample efficiency, and confidence reliability.
Large Language Model, Self Consistency, Math Reasoning, Code Generation
1 Introduction
Recently, large language models (LLMs) have shown significant progress in various applications such as problem solving (Lewkowycz et al., 2022; Li et al., 2024a), planning (Valmeekam et al., 2023; Deng et al., 2024), and decision making (Ouyang & Li, 2023; Sblendorio et al., 2024), demonstrating their reasoning capabilities. Since single-shot inference is not always reliable, especially in complex reasoning tasks, researchers often sample multiple reasoning paths from the LLM to improve its reasoning performance.
When multiple reasoning paths for a given problem are available, the reasoning performance is determined by the confidence estimation for each result. To this end, perplexity methods (Chen et al., 1998; Murugadoss et al., 2025) apply the LLM's internal probability to estimate the confidence of each reasoning path. Although the internal probability is quite accurate, path-level confidence is insufficient to distinguish between answers, which greatly limits the effectiveness of perplexity methods (Chen et al., 2024). In contrast, self-consistency methods (Wang et al., 2022; Chen et al., 2023b) instead establish answer-level confidence using a pre-defined consistency function. However, the answer confidence cannot be derived directly from the internal probabilities of LLMs, necessitating Monte-Carlo estimation, which significantly degrades the convergence rate (Aggarwal et al., 2023; Wan et al., 2024; Wang et al., 2024).
To better understand the limitations of current methods and to guide the development of an effective and efficient LLM reasoning approach, we formulate the LLM reasoning problem and present a theoretical analysis that decomposes the reasoning error into two components: estimation error and model error. Self-consistency methods, which rely on Monte-Carlo estimation, achieve only a linear estimation error reduction rate with respect to the sample size. This linear convergence rate requires a large sampling budget; for instance, running self-consistency with 64 samples on the MATH dataset using the GPT-4 API costs approximately $2000 (Li et al., 2024c), which is extremely expensive for both researchers and organizations. As for perplexity methods, their estimation error converges exponentially because they use the internal probability of LLMs. The exponential convergence rate allows perplexity methods to work well even with a very limited sample budget, but their final converged result is far from satisfactory due to the high model error. A comparison between self-consistency methods and perplexity methods is shown in Figure 1. This complementarity between estimation error and model error presents an opportunity to further improve LLM reasoning performance: Can we design a method that achieves both a fast estimation error convergence rate and a low model error? To the best of our knowledge, efforts in this direction remain limited.
In this paper, we explore how to effectively and efficiently integrate the internal LLM probability into the self-consistency framework, allowing us to utilize an accurate probability for rapid estimation error reduction while maintaining a low model error. We name this confidence estimation approach Perplexity Consistency. Our theoretical study shows that perplexity consistency provides a tight integration and can indeed achieve this goal. However, its estimation error reduction rate undesirably degenerates to a linear rate when the magnitude of the LLM's internal probability is low. To tackle this issue, we further introduce Reasoning Pruning, which automatically models the probability distribution of reasoning paths for each problem and removes low-probability reasoning paths. Combining perplexity consistency and reasoning pruning, we propose Reasoning-Pruning Perplexity Consistency (Rpc).
Our theoretical and experimental results confirm the efficiency and effectiveness of Rpc. Specifically, on four mathematical reasoning datasets, Rpc reduces the sampling budget by at least 50% while achieving the same reasoning performance as self-consistency. Moreover, with an equal sampling budget, Rpc outperforms existing methods by 1.29% on average. Additionally, Rpc provides confidence estimates that align better with the ground truth than existing methods.
To summarize, the main contributions of the paper are:
(1) We formulate the LLM reasoning problem and offer a theoretical analysis that decomposes LLM reasoning performance into estimation error and model error. This analysis emphasizes the benefits of self-consistency while revealing its limitations when working with limited sampling budgets.
(2) Building on our theoretical framework, we introduce Rpc, which integrates Perplexity Consistency and Reasoning Pruning. This approach utilizes precise LLM probabilities and eliminates low-probability reasoning paths to enhance reasoning performance.
(3) Our theoretical analysis shows that Perplexity Consistency achieves an exponential error reduction rate in most cases, and Reasoning Pruning effectively compensates for the remaining degeneration issues.
(4) Through extensive experiments conducted on four mathematical reasoning and three code generation tasks, our proposed Rpc delivers promising results, improving both accuracy and confidence reliability.
2 Problem and Analysis
In this section, we start by outlining the problem formulation of LLM reasoning through sampling multiple reasoning paths. Then, we provide a theoretical analysis that decomposes LLM reasoning performance into estimation error and model error. Finally, we present experimental results verifying our theoretical analysis. Our theoretical and empirical analysis motivates our follow-up method design.
2.1 Problem Formulation
Given a reasoning problem $(x, y^{*})$, where $x$ represents the input query and $y^{*}$ represents the ground-truth answer, the LLM generates a reasoning path $\hat{r} = (t_1, \dots, t_L)$ by sequentially sampling tokens according to the conditional probability distribution $p_{\mathrm{LLM}}(t_i \mid x, t_{<i})$, where $L$ denotes the length of the reasoning path. The probability of generating the reasoning path is defined as $p(\hat{r} \mid x) = \prod_{i=1}^{L} p_{\mathrm{LLM}}(t_i \mid x, t_{<i})$, a.k.a. the confidence of the reasoning path $\hat{r}$. An answer extraction function $\mathcal{A}$ maps the reasoning path to the final answer $\hat{y} = \mathcal{A}(\hat{r})$, and the reasoning correctness is evaluated by the indicator function $\mathbb{I}[\hat{y} = y^{*}]$. We can extend the probability to the answer $\hat{y}$, i.e., the answer confidence, denoted as $p(\hat{y} \mid x) = \sum_{\hat{r}:\, \mathcal{A}(\hat{r}) = \hat{y}} p(\hat{r} \mid x)$.
The confidence essentially represents the probability that the reasoning path or answer is correct, which enables LLMs to select the most reliable solution among multiple candidates. Nevertheless, enumerating all reasoning paths or answers is infeasible; we have to estimate the LLM confidence based on a finite set of sampled reasoning paths instead. Furthermore, to measure the reasoning performance of LLMs, we use the squared error of a confidence estimate $\hat{C}(\hat{r})$ with respect to the correctness of the reasoning path $\hat{r}$:
$\big(\hat{C}(\hat{r}) - \mathbb{I}[\mathcal{A}(\hat{r}) = y^{*}]\big)^{2}.$
If we extend the confidence estimation to the answer $\hat{y}$, the squared error can be reformulated as
$\big(\hat{C}(\hat{y}) - \mathbb{I}[\hat{y} = y^{*}]\big)^{2}.$
Below, we analyze two confidence estimation methods, i.e., the self-consistency method (Wang et al., 2022) and the perplexity method (Huang et al., 2023). Specifically, the self-consistency method computes the answer confidence using Monte-Carlo estimation based on a consistency function, while the perplexity method directly computes the confidence of reasoning paths using internal LLM probabilities.
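To make this formulation concrete, the minimal sketch below computes the reasoning-path confidence as the product of token probabilities and sums it into answer confidence. The toy log-probabilities, the `extract_answer` callback, and the last-token extractor are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import defaultdict

def path_confidence(token_logprobs):
    """Confidence of a reasoning path: the product of its token
    probabilities, i.e. exp(sum of token log-probabilities)."""
    return math.exp(sum(token_logprobs))

def answer_confidence(paths, extract_answer):
    """Aggregate path confidence into answer confidence by summing the
    probabilities of all paths that map to the same extracted answer."""
    conf = defaultdict(float)
    for token_logprobs, text in paths:
        conf[extract_answer(text)] += path_confidence(token_logprobs)
    return dict(conf)

# Toy usage with made-up log-probabilities and a trivial last-token extractor.
paths = [
    ([-0.1, -0.2, -0.05], "... the answer is 42"),
    ([-0.3, -0.4, -0.10], "... the answer is 42"),
    ([-1.2, -0.9, -0.80], "... the answer is 7"),
]
extract = lambda text: text.split()[-1]
print(answer_confidence(paths, extract))
```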
2.2 Theoretical Analysis
To maximize the reasoning performance of LLMs, self-consistency methods (denoted as Sc) (Xiong et al., 2024; Abbasi-Yadkori et al., 2024; Becker & Soatto, 2024) sample $n$ reasoning paths $\hat{r}_1, \dots, \hat{r}_n$ and then estimate the probability of each answer $\hat{y}$ by Monte-Carlo estimation,
$\hat{p}_{\mathrm{SC}}(\hat{y} \mid x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big[\mathcal{A}(\hat{r}_i) = \hat{y}\big],$
i.e., the fraction of sampled reasoning paths whose extracted answer agrees with $\hat{y}$ under the consistency function.
Then, the reasoning error of the Sc method for a given problem is the squared error of this estimate for the ground-truth answer, $\mathbb{E}\big[(\hat{p}_{\mathrm{SC}}(y^{*} \mid x) - 1)^{2}\big]$, where the expectation is taken over the sampling randomness.
To illustrate the key factors affecting the reasoning error, we provide an error decomposition in the following proposition.
Proposition 1 (Sc Reasoning Error Decomposition).
For any input $x$ with ground-truth answer $y^{*}$, let $\hat{p}_{\mathrm{SC}}(y^{*} \mid x)$ denote the estimated probability of $y^{*}$ by Sc. Then, the reasoning error can be divided into two components:
$\mathbb{E}\big[(\hat{p}_{\mathrm{SC}}(y^{*} \mid x) - 1)^{2}\big] \;=\; \underbrace{\mathbb{E}\big[(\hat{p}_{\mathrm{SC}}(y^{*} \mid x) - p(y^{*} \mid x))^{2}\big]}_{\text{estimation error}} \;+\; \underbrace{\big(1 - p(y^{*} \mid x)\big)^{2}}_{\text{model error}}.$
Remark 1.
The detailed proof is provided in Appendix A.1. The estimation error refers to the error caused by finite sampling from the LLM probability, while the model error reflects the LLM's limited reasoning capability. Note that the estimation error of Sc reduces to only the variance, as the sampling is unbiased. This proposition demonstrates that, aside from the model error, which is determined by the LLM's inherent reasoning capabilities, the reasoning error is bounded by the estimation error. Moreover, the estimation error decreases only linearly in the sample size, resulting in a large error margin when the sampling budget is insufficient.
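As a concrete reference point for the Sc estimator discussed above, here is a minimal sketch of the Monte-Carlo answer-confidence estimate; the sampled answer strings are hypothetical, and the consistency function is instantiated as exact answer matching.

```python
from collections import Counter

def self_consistency_confidence(sampled_answers):
    """Monte-Carlo (Sc) estimate: the confidence of each answer is the
    fraction of sampled reasoning paths that produce it.  The variance of
    this estimate shrinks only linearly in the number of samples n."""
    n = len(sampled_answers)
    counts = Counter(sampled_answers)
    return {ans: c / n for ans, c in counts.items()}

# Example: 8 sampled answers; "42" gets confidence 5/8.
print(self_consistency_confidence(["42", "42", "7", "42", "42", "13", "42", "7"]))
```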
To effectively offset the estimation error, we turn to the reasoning error of perplexity methods (denoted as Ppl). In contrast to the Sc method, which estimates the answer probability via Monte-Carlo estimation, Ppl directly utilizes the internal probability of the LLM for each sampled reasoning path. Therefore, given the unique set of sampled reasoning paths, the estimated probability of a sampled reasoning path $\hat{r}$ is its internal probability $p(\hat{r} \mid x)$ (and zero for any path that was not sampled).
Similarly, we use the mean squared error over reasoning paths to measure the reasoning performance of Ppl.
Now, we can obtain the following proposition.
Proposition 2 (Ppl Reasoning Error Decomposition).
For any given input $x$ with ground-truth answer $y^{*}$, the reasoning error of the Ppl method can likewise be divided into an estimation error component and a model error component, where the estimation error decreases exponentially in the sample size $n$ (the exact form is derived in Appendix A.1).
Remark 2.
The detailed proof is provided in Appendix A.1. Compared with Sc, the estimation error of Ppl decreases exponentially, which is much faster. However, the model error of Ppl is usually larger than that of Sc in practice. In Appendix A.2, we provide Proposition 3 to demonstrate that Sc achieves a smaller model error than Ppl in the ideal case, due to the advantages of the consistency function.
2.3 Empirical Observations
To confirm our theoretical results, we conduct initial experiments on the GSM8K dataset using the InternLM2-Math-Plus 7B model. We vary the sample size and plot the accuracy curves and the estimation error in Figure 2. Additionally, we include an ablative version called Naïve-Sc. Naïve-Sc applies Monte-Carlo estimation, as in Sc, but its consistency function is degraded to a naïve version that matches reasoning paths rather than answers, as in Ppl. In other words, the reasoning error of Naïve-Sc combines the Monte-Carlo estimation error of Sc with the path-level model error of Ppl.
The derived results highlight the following two observations:
(I) Estimation Error. The estimation errors of both Sc and Ppl decrease as the sample size increases. However, the accuracy curves and estimation error show that Ppl converges much faster than Sc: Ppl reaches a stable result with only a small number of samples, while Sc does not converge even at the largest sample size tested. Naïve-Sc confirms this result, showing a similarly slow convergence rate, since it uses the same Monte-Carlo estimation as Sc.
(II) Model Error.Sc and Ppl ultimately converge to different results. This is because their model errors are intrinsically different. Sc groups reasoning paths that yield the same answer through its consistency function, ensuring a higher accuracy of Sc. In contrast, Ppl only estimates the probability of individual reasoning paths without considering answer-level consistency. Naïve-Sc also supports this conclusion, converging to the worst results due to its lack of a proper consistency function.
Key Insights. Our theoretical and empirical analyses point to the deficiencies and potential synergies of Sc and Ppl. Although Sc achieves a lower model error due to the advantages of the consistency function, its estimation error is hindered by a slower convergence rate. In contrast, Ppl exhibits a much faster convergence rate of the estimation error thanks to LLM internal probabilities, but this comes at the cost of a higher model error. This naturally raises a fundamental research question: Can we design a method that fuses the strengths of Sc and Ppl, achieving both a rapid estimation error convergence rate and a low model error?
3 Methodology
Based on our theoretical and empirical analysis, we propose a new method called Reasoning-Pruning Perplexity Consistency (Rpc). Specifically, we first integrate the internal LLM probability into the self-consistency framework, yielding a confidence estimation function called Perplexity Consistency (Pc). This function utilizes LLM probabilities to reduce estimation error more efficiently while maintaining a low model error. Our further analysis guides the design of a new Reasoning Pruning (Rp) module that addresses the limitations of Pc by filtering out reasoning paths with low probabilities. Figure 3 shows the overall framework.
3.1 Perplexity Consistency
To improve the efficiency of estimation error reduction, we propose Pc, which directly leverages the LLM's prediction probability, like Ppl, to obtain the benefit of an exponential convergence rate, and also applies the consistency function of Sc to minimize the model error. Formally, for the unique set $\mathcal{R}$ of sampled reasoning paths, the estimated probability of an answer $\hat{y}$ is
$\hat{p}_{\mathrm{PC}}(\hat{y} \mid x) \;=\; \sum_{\hat{r} \in \mathcal{R},\, \mathcal{A}(\hat{r}) = \hat{y}} p(\hat{r} \mid x),$
which calculates the cumulative probability of all unique reasoning paths whose answer is $\hat{y}$. The mean squared error of Pc is then defined analogously to that of Sc.
Now, we present the following theorem, which explores the reasoning error decomposition of Pc.
Theorem 3 (Pc Reasoning Error Decomposition).
For any input $x$ with ground-truth answer $y^{*}$, under mild assumptions on the sampled reasoning paths, the reasoning error of Pc can be divided into an estimation error component and a model error component, where the model error matches that of Sc and the estimation error converges exponentially in the sample size $n$ for most reasoning paths (the precise statement and assumptions are given in Appendix A.3).
Remark 4.
The proof is presented in Appendix A.3. The theorem states that Pc successfully fuses the strengths of Ppl and Sc: it achieves the same level of model error as Sc while ensuring the same convergence rate as Ppl for the estimation error. In particular, the convergence rate admits a closed-form expression (see Appendix A.3).
The convergence rate is primarily governed by the magnitude of the LLM's internal probability of the sampled reasoning paths. In most scenarios, it remains exponential, facilitating rapid estimation error reduction. However, when the internal probability is very small, the exponential bound is no tighter than a linear one (Kozma, 2021), and the convergence rate unexpectedly degenerates to a linear rate.
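The following sketch illustrates the Pc estimator under the same assumptions as the earlier snippets: each sampled path carries its token log-probabilities, and a hypothetical `extract_answer` maps a path to its answer. It de-duplicates paths and sums their internal probabilities per answer.

```python
import math
from collections import defaultdict

def perplexity_consistency(paths, extract_answer):
    """Pc estimate: de-duplicate sampled reasoning paths, then sum the
    LLM-internal probability of each unique path into the confidence of
    its extracted answer.  Unsampled paths contribute nothing, which is
    where the (usually exponentially small) estimation error comes from."""
    seen = {}
    for text, token_logprobs in paths:
        # keep each unique reasoning path exactly once
        seen.setdefault(text, math.exp(sum(token_logprobs)))
    conf = defaultdict(float)
    for text, prob in seen.items():
        conf[extract_answer(text)] += prob
    return dict(conf)
```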
3.2 Reasoning Pruning
Our analysis of the convergence rate of the estimation error suggests room for further improving the estimation error reduction efficiency. Specifically, the rate degenerates when the internal probability of a reasoning path is too small. Therefore, we propose directly pruning these low-probability answers instead of sampling them, a module we call Reasoning Pruning (Rp).
Reasoning Pruning essentially sets the estimated probability of an answer to zero when the cumulative probability of all its corresponding reasoning paths is very low, making the estimation error for that answer vanish. Although pruning low-probability answers boosts the efficiency of estimation error reduction, it also induces a degree of model error: it may discard correct answers that happen to have low probability, implicitly degrading the reasoning performance of the LLM.
Ideally, if we knew the probability $p(y^{*} \mid x)$ that the LLM generates the correct answer, the optimal threshold for pruning would be exactly this probability. In this case, one obtains not only an exponential convergence rate but also a significant reduction in model error, as all incorrect answers whose probability falls below the threshold are eliminated. However, to obtain this optimal error reduction, two challenges need to be resolved: (1) we only have an estimate of each answer's probability, since the accurate value is unknown; and (2) we cannot know the ground-truth probability $p(y^{*} \mid x)$, making the threshold difficult to determine.
For the first problem, we propose directly pruning reasoning paths instead of answers in Rp, with the following theorem.
Theorem 5 (Effectiveness of Reasoning Path Pruning).
Assume reasoning paths are pruned at the optimal threshold discussed above, and let $m$ denote the number of sampled reasoning paths whose answer is the one under consideration. Then, Rp achieves the optimal error reduction with a probability that is exponentially close to one in $m$ (the exact bound, derived via Hoeffding's inequality, is given in Appendix A.4).
Remark 6.
The proof is included in Appendix A.4. The theorem guarantees that Rp achieves the optimal error reduction for each given problem with high probability. Note that the optimal error reduction not only boosts the efficiency of estimation error reduction but also effectively reduces the model error, thus improving the final reasoning capability of LLMs.
The remaining problem is determining the optimal threshold for reasoning path removal. To achieve this goal, we develop an automated strategy. Inspired by open-set recognition (Bendale & Boult, 2016), we model the distribution of the sampled reasoning paths' internal probabilities as a mixture of two Weibull distributions, representing the high- and low-probability regions.
Elaborately, we define the PDF of the mixture distribution as
$f_{\mathrm{mix}}(x) \;=\; w_{1}\, f_{\mathrm{W}}(x; k_{1}, \lambda_{1}) \;+\; w_{2}\, f_{\mathrm{W}}(x; k_{2}, \lambda_{2}),$
where the Weibull PDF (Weibull, 1951) is defined as $f_{\mathrm{W}}(x; k, \lambda) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}}$ for $x \ge 0$. We use maximum likelihood estimation to estimate the parameters, i.e., $k_{1}$, $\lambda_{1}$, $k_{2}$, and $\lambda_{2}$, on the probability distribution of all sampled reasoning paths for each reasoning problem. We assume that $f_{\mathrm{W}}(x; k_{1}, \lambda_{1})$ models the high-probability region and $f_{\mathrm{W}}(x; k_{2}, \lambda_{2})$ models the low-probability region. Then, the probability of a reasoning path $\hat{r}$ belonging to the high-probability distribution is derived by
$P_{\mathrm{high}}(\hat{r}) \;=\; \frac{w_{1}\, f_{\mathrm{W}}(p; k_{1}, \lambda_{1})}{w_{1}\, f_{\mathrm{W}}(p; k_{1}, \lambda_{1}) + w_{2}\, f_{\mathrm{W}}(p; k_{2}, \lambda_{2})},$
where $p = p(\hat{r} \mid x)$ is the value of the LLM internal probability.
Then, we remove sampled reasoning paths with $P_{\mathrm{high}}(\hat{r}) < 0.5$, i.e., paths that are more likely to belong to the low-probability distribution. Moreover, to ensure the algorithm's stability when the number of samples is limited, we employ the Truncated Mean method (Marazzi & Ruffieux, 1999), retaining outputs with a probability greater than the overall mean. This prevents the removal of too many reasoning paths due to a potentially inaccurate estimate of the mixture distribution.
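A simplified sketch of Reasoning Pruning is given below. Instead of a joint maximum-likelihood fit of the two-component mixture, it splits the path probabilities at the overall mean, fits each Weibull component separately with scipy's `weibull_min`, and then combines the mixture-membership rule with the truncated-mean safeguard. The mean-based split and the 0.5 membership threshold are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import weibull_min

def prune_reasoning_paths(path_probs):
    """Split path probabilities into tentative high/low groups at the mean,
    fit a Weibull to each group, and keep the paths more likely to come
    from the high-probability component.  The truncated-mean safeguard
    always retains paths above the overall mean."""
    probs = np.asarray(path_probs, dtype=float)
    mean = probs.mean()
    high, low = probs[probs >= mean], probs[probs < mean]
    if len(high) < 2 or len(low) < 2:      # too few samples to fit a mixture
        return probs >= mean               # fall back to the truncated mean
    # Fit shape/scale of each component (location fixed at 0).
    c_h, _, s_h = weibull_min.fit(high, floc=0)
    c_l, _, s_l = weibull_min.fit(low, floc=0)
    w_h, w_l = len(high) / len(probs), len(low) / len(probs)
    pdf_h = w_h * weibull_min.pdf(probs, c_h, loc=0, scale=s_h)
    pdf_l = w_l * weibull_min.pdf(probs, c_l, loc=0, scale=s_l)
    p_high = pdf_h / (pdf_h + pdf_l + 1e-12)
    # Keep paths that are more likely high-probability, or above the mean.
    return (p_high > 0.5) | (probs >= mean)

keep = prune_reasoning_paths([0.30, 0.25, 0.28, 0.02, 0.01, 0.03, 0.27])
print(keep)    # boolean mask over the sampled reasoning paths
```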
Overall, we apply Reasoning Pruning to all sampled reasoning paths to remove low-probability reasoning paths and then compute the confidence based on Perplexity Consistency, forming our proposed Reasoning-Pruning Perplexity Consistency (Rpc). The pseudo-code is shown in Algorithm 1 in Appendix B.
4 Experiments
In this section, we first conduct experiments to answer the following research questions:
RQ1: Efficiency. How does Rpc reduce the number of samples required to achieve comparable performance through faster convergence?
RQ2: Efficacy. How does Rpc improve reasoning performance compared to existing methods?
RQ3: Reliability. How does Rpc enhance the reliability of confidence estimation compared to existing methods?
Additional discussions are devoted to further demonstrating the effectiveness of Rpc. Due to space limitations, supplementary experimental results are included in Appendix D.
Table 1. Accuracy and the minimum number of samplings needed to match the best Sc performance (InternLM2-Math-Plus 7B).

Method     | MATH: Accuracy / #Samplings | MathOdyssey: Accuracy / #Samplings | OlympiadBench: Accuracy / #Samplings | AIME: Accuracy / #Samplings
Best of Sc | 50.57 / 64                  | 28.32 / 112                        | 11.07 / 128                          | 9.40 / 128
Pc         | 50.63 / 32                  | 28.51 / 112                        | 11.07 / 128                          | 9.00 / 64
  (vs. Sc) | +0.06 / -50.0%              | +0.19 / -0.0%                      | 0.00 / -0.0%                         | 0.00 / -50.0%
Rpc        | 51.16 / 32                  | 29.31 / 32                         | 11.07 / 64                           | 9.50 / 48
  (vs. Sc) | +0.59 / -50.0%              | +0.99 / -71.4%                     | 0.00 / -50.0%                        | +0.10 / -62.5%
4.1 Experimental Setting
In this section, we briefly introduce the comparison methods, datasets, and implementation details. Full experimental settings can be found in Appendix C.
Comparison Methods. We compare three types of LLM confidence: perplexity confidence (Chen et al., 1998) (Ppl), self-consistency confidence (Wang et al., 2022) (Sc), and verbalized confidence (Tian et al., 2023) (Verb). The verbalized confidence is computed based on the probability that the LLM outputs “True” versus “False” when asked an “Is-True” question. For code generation tasks, we extract verbalized confidence scores from the model's numerical likelihood expressions by prompting the LLM.
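A small sketch of how such a verbalized (Is-True) confidence can be computed from token likelihoods, following the normalization over the “True”/“False” options described in Appendix C.2; the log-probability values below are hypothetical.

```python
import math

def verbalized_confidence(logprob_true, logprob_false):
    """Confidence that the model's answer is correct, from the likelihood
    of answering an "Is the answer correct?" question with "True" rather
    than "False" (normalized over the two options)."""
    p_true, p_false = math.exp(logprob_true), math.exp(logprob_false)
    return p_true / (p_true + p_false)

# Hypothetical log-probabilities of the first "True" / "False" token.
print(verbalized_confidence(-0.4, -1.6))   # ~0.77
```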
Datasets. We introduce four popular benchmarks for math reasoning: MATH (Hendrycks et al., 2021b), MathOdyssey (Fang et al., 2024), OlympiadBench (He et al., 2024), and AIME (Zamil & Rabby, 2024). As to code generation tasks, we evaluate each method on three benchmarks, i.e., HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and the introductory-level problems of APPS (Hendrycks et al., 2021a).
Implementation Details. For math reasoning tasks, we evaluate the InternLM2-Math-Plus models with 1.8B and 7B parameters (Ying et al., 2024), as well as the DeepSeekMath-RL 7B model (Shao et al., 2024); the consistency function is answer comparison. For code generation tasks, we evaluate the DeepSeek-Coder 33B model; the consistency function is constructed based on semantic equivalence (Malík & Vojnar, 2021) by clustering code according to given test cases. We set the sample size to 128 for the MathOdyssey, OlympiadBench, and AIME datasets and to 64 for the MATH dataset by default. Each experiment is repeated 10 times with different random seeds, and the average performance is reported. All experiments were conducted on Linux servers with A800 and H800 GPUs.
Table 2. Accuracy and ECE of each method on four mathematical reasoning datasets.

Method | MATH: Accuracy (↑) / ECE (↓) | MathOdyssey: Accuracy (↑) / ECE (↓) | OlympiadBench: Accuracy (↑) / ECE (↓) | AIME: Accuracy (↑) / ECE (↓) | Average: Acc. (↑) / ECE (↓)
Ppl    | 46.99±0.20 / 48.99±0.19      | 27.35±1.22 / 67.70±1.22             | 7.27±0.36 / 86.90±0.35                | 5.96±0.48 / 88.98±0.49       | 21.90 / 73.14
Verb   | 26.14±0.25 / 47.46±0.07      | 10.06±0.61 / 69.92±0.88             | 3.68±0.16 / 84.68±0.25                | 3.17±0.17 / 86.29±0.20       | 10.76 / 72.09
Sc     | 50.57±0.17 / 6.71±0.18       | 28.25±0.60 / 12.23±0.54             | 11.07±0.15 / 20.20±0.16               | 9.40±0.21 / 14.35±0.23       | 24.82 / 13.37
Rpc    | 51.95±0.15 / 6.41±0.18       | 31.62±0.75 / 9.87±0.73              | 11.14±0.15 / 18.86±0.18               | 9.74±0.23 / 14.32±0.21       | 26.11 / 12.37
Table 3. Accuracy of each method across different model scales and architectures.

InternLM2-Math-Plus 1.8B
Method | MATH       | MathOdyssey | OlympiadBench | AIME
Ppl    | 33.24±0.24 | 16.56±0.88  | 3.08±0.20     | 1.66±0.15
Verb   | 7.21±0.17  | 2.81±0.26   | 0.77±0.06     | 0.26±0.05
Sc     | 36.48±0.15 | 14.52±0.46  | 5.99±0.17     | 2.66±0.20
Rpc    | 37.88±0.16 | 16.35±0.61  | 6.52±0.24     | 3.26±0.24

DeepSeekMath-RL 7B
Method | MATH       | MathOdyssey | OlympiadBench | AIME
Ppl    | 42.51±0.23 | 22.34±1.00  | 5.90±0.31     | 3.37±0.46
Verb   | 14.29±0.23 | 2.55±0.24   | 2.36±0.15     | 1.91±0.12
Sc     | 53.33±0.09 | 36.68±0.65  | 11.29±0.17    | 9.42±0.23
Rpc    | 53.37±0.11 | 37.25±0.69  | 11.30±0.11    | 9.52±0.31
4.2 Empirical Results
RQ1: Efficiency. How does Rpc reduce the number of samples required to achieve comparable performance through faster convergence?
We evaluate our proposed Rpc against the standard self-consistency method on four mathematical benchmark datasets with the InternLM2-Math-Plus 7B model. For the MATH dataset, we set the reasoning path budget of Sc to 64, while we set it to 128 for the other datasets. We then record the best performance and minimum sampling requirements for Sc. For both Rpc and our Perplexity Consistency module (denoted as Pc), we report in Table 1 the minimum number of samples needed to match or exceed the performance of Sc.
The results for Pc indicate improved convergence rates compared to Sc in several cases, while maintaining similar rates in others. These findings support both the rapid convergence and the degeneration issue of Pc established in Theorem 3. Rpc shows significant efficiency improvements, requiring fewer samples to achieve performance comparable to Sc. Moreover, the degeneration issues observed in Pc are effectively addressed by Rpc, validating both the effectiveness of the Reasoning Pruning module and Theorem 5.
RQ2: Efficacy. How does Rpc improve reasoning performance compared to existing methods?
We evaluate the performance of Pc and Rpc in Figure 4 across various sample budgets. The results demonstrate that Rpc achieves better performance than both Ppl (which relies on internal LLM probabilities) and Sc (which employs Monte-Carlo sampling). The detailed accuracy results in Table 2, including mean and standard deviation, support these findings.
We also analyze the performance of Pc separately. The results indicate that Pc converges faster than Sc, which aligns with Theorem 3. The significant performance gains from Pc to Rpc, as shown in Figures 9(a) and 9(b), validate the effectiveness of the Reasoning Pruning module. This suggests that Reasoning Pruning helps reduce model error when the LLM is well aligned, by eliminating incorrect reasoning paths that carry low LLM probability scores.
RQ3: Reliability. How does Rpc enhance the reliability of confidence estimation compared to existing methods?
To evaluate the reliability of confidence estimation, we report the ECE of Rpc and the comparison methods in Table 2. ECE measures the difference between predicted probabilities and empirical accuracy; we use it because directly computing the estimation error requires ground-truth probabilities, which is impractical. The results demonstrate that our Rpc approach achieves lower ECE values and higher accuracy than Ppl and Sc, indicating more reliable confidence estimation. We visualize this improvement through reliability diagrams comparing Sc and Rpc on MathOdyssey in Figure 5, which clearly show the reduced calibration error of Rpc.
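For reference, a standard equal-width-bin ECE computation is sketched below; the number of bins and the binning scheme are assumptions, since the paper does not spell them out here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted average over bins of the gap between
    mean confidence and empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: per-problem confidence of the selected answer and its correctness.
print(expected_calibration_error([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0]))
```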
4.3 Further Discussion
Results on Code Generation Tasks. To investigate whether our proposed approaches generalize to other reasoning tasks, such as code generation, we evaluate Rpc and the comparison methods on three code generation benchmarks, as illustrated in Figure 6. The results show that Rpc achieves the highest accuracy across all datasets, demonstrating its effectiveness in reasoning tasks beyond mathematics.
Results Across Model Scales and Architectures. To evaluate the generalization ability of our approaches across different model scales and architectures, we conduct additional experiments using the InternLM2-Math-Plus 1.8B and DeepSeekMath-RL 7B models. The results in Table 3 show that Rpc consistently outperforms existing methods, consistent with the results in Table 2.
5 Related Work
This paper is related to two research topics: LLM Reasoning Boosting and LLM Confidence Estimation.
LLM Reasoning Boosting.
Recent research has developed various methods to improve LLM reasoning capabilities. CoT (Kojima et al., 2022) proposes the “Let's think step by step” prompt to guide LLMs in generating structured solutions. For complex problems, Least-to-Most (Zhou et al., 2023) introduces a decomposition strategy that breaks down challenges into manageable sub-problems. Few-shot methods (Wei et al., 2022; Fu et al., 2023; Zhang et al., 2023) leverage carefully selected examples to improve reasoning performance. To enable more comprehensive reasoning, search-based methods (Guan et al., 2025) integrate Monte Carlo Tree Search (MCTS). Recent advancements have further enhanced MCTS by incorporating reward models (Zhang et al., 2024; Park et al., 2024). To directly optimize reasoning abilities, researchers have explored fine-tuning approaches (Yu et al., 2024; Li et al., 2024b, d; Ying et al., 2024) using specialized datasets and reinforcement learning techniques (Shao et al., 2024; Luo et al., 2025).
While these methods focus on generating accurate reasoning paths, our Rpc can build upon them by utilizing multiple sampling strategies, enabling better reasoning performance.
LLM Confidence Estimation. Confidence estimation for LLMs can be categorized into three types: (1) perplexity confidence, (2) verbalized confidence, and (3) self-consistency confidence. Perplexity confidence (Huang et al., 2023; Duan et al., 2024) utilizes the geometric mean of LLM prediction probabilities (i.e., perplexity (Chen et al., 1998; Blei et al., 2003)) to evaluate model adherence (Murugadoss et al., 2025) and prompt quality (YAO et al., 2024). Verbalized confidence (Kadavath et al., 2022; Xiong et al., 2024; Tian et al., 2023) directly asks LLMs to express their confidence through various approaches, such as multi-agent deliberation (Yang et al., 2024), multi-step evaluation (Xiong et al., 2024), top-k ranking (Tian et al., 2023), few-shot prompting (Liu et al., 2024), and reflection (Dhuliawala et al., 2024; Zhao et al., 2024). Self-consistency confidence (Wang et al., 2022; Chen et al., 2023b; Cheng et al., 2024) measures the agreement among multiple generated answers to improve reasoning performance, with recent work (Xiong et al., 2024; Abbasi-Yadkori et al., 2024; Becker & Soatto, 2024) further developing this approach as a confidence metric. Recent studies recognize its computational issues and propose early stopping (Li et al., 2024c) and dynamic sampling methods (Wang et al., 2024; Wan et al., 2024; Aggarwal et al., 2023) to improve efficiency.
Our proposed Rpc integrates LLM probability with the self-consistency framework, allowing perplexity and verbalized confidence to be used. Rpc achieves enhanced confidence estimation with fixed reasoning paths, complementing existing self-consistency methods.
6 Conclusion
In this paper, we address a foundational challenge in LLM reasoning: determining the most reliable answer from multiple reasoning paths by measuring LLM confidence. We present a theoretical framework that decomposes the reasoning error into estimation error and model error, revealing that perplexity methods suffer from substantial model error due to the lack of a consistency function, while self-consistency suffers from high estimation error because of a slow error convergence rate. To tackle these limitations, we introduce Reasoning-Pruning Perplexity Consistency (Rpc), a confidence estimation method with two key components: Perplexity Consistency utilizes internal LLM probabilities to achieve faster estimation error convergence in most cases, and Reasoning Pruning prunes low-probability reasoning paths to prevent the remaining degeneration cases. Our theoretical analysis and extensive experiments demonstrate that Rpc achieves superior error convergence rates and reasoning performance compared to existing methods.
Limitations and Future Work. One limitation of this work is that we have only taken an initial step toward improving confidence estimation for self-consistency. The two components of Rpc can be further enhanced: Perplexity Consistency could incorporate additional probability metrics from LLM outputs, while Reasoning Pruning could be extended with more sophisticated pruning strategies, such as analyzing the probability distribution of each candidate answer. We believe our initial approach and theoretical analysis can guide future research in this promising direction.
Impact Statement
This work advances the efficiency and effectiveness of LLM reasoning with multiple reasoning paths. Our method can benefit various applications requiring reliable artificial intelligence reasoning, such as mathematical problem-solving and code generation. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
Abbasi-Yadkori etal. (2024)Abbasi-Yadkori, Y., Kuzborskij, I., György, A., and Szepesvári, C.To believe or not to believe your LLM.CoRR, 2024.
Aggarwal etal. (2023)Aggarwal, A. M.P., Yang, Y., and Mausam.Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with llms.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, 2023.
Austin etal. (2021)Austin, J., Odena, A., Nye, M.I., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C.J., Terry, M., Le, Q.V., and Sutton, C.Program synthesis with large language models.CoRR, abs/2108.07732, 2021.
Becker & Soatto (2024)Becker, E. and Soatto, S.Cycles of thought: Measuring LLM confidence through stable explanations.CoRR, abs/2406.03441, 2024.
Bendale & Boult (2016)Bendale, A. and Boult, T.E.Towards open set deep networks.In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572, 2016.
Blei etal. (2003)Blei, D.M., Ng, A.Y., and Jordan, M.I.Latent dirichlet allocation.Journal of Machine Learning Research, 3(1):993–1022, 2003.
Chen etal. (2023a)Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J., and Chen, W.Codet: Code generation with generated tests.In Proceedings of the 11th International Conference on Learning Representations, 2023a.
Chen etal. (2021)Chen, M., Tworek, J., Jun, H., Yuan, Q., deOliveiraPinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W.H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A.N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W.Evaluating large language models trained on code.CoRR, abs/2107.03374, 2021.
Chen etal. (1998)Chen, S.F., Beeferman, D., and Rosenfeld, R.Evaluation metrics for language models.1998.
Chen etal. (2023b)Chen, X., Aksitov, R., Alon, U., Ren, J., Xiao, K., Yin, P., Prakash, S., Sutton, C., Wang, X., and Zhou, D.Universal self-consistency for large language model generation.arXiv preprint arXiv:2311.17311, 2023b.
Chen etal. (2024)Chen, Y., Jhamtani, H., Sharma, S., Fan, C., and Wang, C.Steering large language models between code execution and textual reasoning.CoRR, abs/2410.03524, 2024.
Cheng etal. (2024)Cheng, F., Zouhar, V., Arora, S., Sachan, M., Strobelt, H., and El-Assady, M.Relic: Investigating large language model responses using self-consistency.In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–18, 2024.
Deng etal. (2024)Deng, Y., Zhang, W., Lam, W., Ng, S., and Chua, T.Plug-and-play policy planner for large language model powered dialogue agents.In Proceedings of the 12th International Conference on Learning Representations, 2024.
Dhuliawala etal. (2024)Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J.Chain-of-verification reduces hallucination in large language models.In Findings of the Association for Computational Linguistics, pp. 3563–3578, 2024.
Duan etal. (2024)Duan, J., Cheng, H., Wang, S., Zavalny, A., Wang, C., Xu, R., Kailkhura, B., and Xu, K.Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 5050–5063, 2024.
Fang etal. (2024)Fang, M., Wan, X., Lu, F., Xing, F., and Zou, K.Mathodyssey: Benchmarking mathematical problem-solving skills in large language models using odyssey math data.CoRR, abs/2406.18321, 2024.
Fu etal. (2023)Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T.Complexity-based prompting for multi-step reasoning.In Proceedings of the 11th International Conference on Learning Representations, 2023.
Guan etal. (2025)Guan, X., Zhang, L.L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M.rstar-math: Small llms can master math reasoning with self-evolved deep thinking, 2025.
He etal. (2024)He, C., Luo, R., Bai, Y., Hu, S., Thai, Z.L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M.Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 3828–3850, 2024.
Hendrycks etal. (2021a)Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J.Measuring coding challenge competence with APPS.In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021a.
Hendrycks etal. (2021b)Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J.Measuring mathematical problem solving with the math dataset.In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2021b.
Huang etal. (2023)Huang, Y., Song, J., Wang, Z., Zhao, S., Chen, H., Juefei-Xu, F., and Ma, L.Look before you leap: An exploratory study of uncertainty measurement for large language models.arXiv preprint arXiv:2307.10236, 2023.
Kadavath etal. (2022)Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., Showk, S.E., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., and Kaplan, J.Language models (mostly) know what they know.CoRR, abs/2207.05221, 2022.
Kojima etal. (2022)Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., and Iwasawa, Y.Large language models are zero-shot reasoners.In Advances in Neural Information Processing Systems, pp. 22199–22213, 2022.
Kozma (2021)Kozma, L.Useful inequalities, 2021.
Lewkowycz etal. (2022)Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V.V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V.Solving quantitative reasoning problems with language models.In Advances in Neural Information Processing Systems, pp. 3843–3857, 2022.
Li etal. (2024a)Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., and Ichter, B.Chain of code: Reasoning with a language model-augmented code emulator.In Proceedings of the 41st International Conference on Machine Learning, 2024a.
Li etal. (2024b)Li, C., Yuan, Z., Yuan, H., Dong, G., Lu, K., Wu, J., Tan, C., Wang, X., and Zhou, C.MuggleMath: Assessing the impact of query and response augmentation on math reasoning.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 10230–10258, 2024b.
Li etal. (2024c)Li, Y., Yuan, P., Feng, S., Pan, B., Wang, X., Sun, B., Wang, H., and Li, K.Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning.In Proceddings of the 12th International Conference on Learning Representations, 2024c.
Li etal. (2024d)Li, Z., Zhou, Z., Yao, Y., Zhang, X., Li, Y.-F., Cao, C., Yang, F., and Ma, X.Neuro-symbolic data generation for math reasoning.In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024d.
Liu etal. (2024)Liu, Y., Yang, T., Huang, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., and Zhang, Q.Calibrating llm-based evaluator.In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pp. 2638–2656, 2024.
Luo etal. (2025)Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D.Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.In Proceddings of the 13th International Conference on Learning Representations, 2025.
Malík & Vojnar (2021)Malík, V. and Vojnar, T.Automatically checking semantic equivalence between versions of large-scale c projects.In Proceedings of the14th IEEE Conference on Software Testing, Verification and Validation, pp. 329–339, 2021.
Marazzi & Ruffieux (1999)Marazzi, A. and Ruffieux, C.The truncated mean of an asymmetric distribution.Computational Statistics & Data Analysis, 32(1):79–100, 1999.
Murugadoss etal. (2025)Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., and Sarkar, A.Evaluating the evaluator: Measuring LLMs’ adherence to task evaluation instructions.In Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025.
Ouyang & Li (2023)Ouyang, S. and Li, L.Autoplan: Automatic planning of interactive decision-making tasks with large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 3114–3128, 2023.
Park etal. (2024)Park, S., Liu, X., Gong, Y., and Choi, E.Ensembling large language models with process reward-guided tree search for better complex reasoning.CoRR, abs/2412.15797, 2024.
Sblendorio etal. (2024)Sblendorio, E., Dentamaro, V., Cascio, A.L., Germini, F., Piredda, M., and Cicolini, G.Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models’ feasibility in clinical decision-making.International Journal of Medical Informatics, 188:105501, 2024.
Shao etal. (2024)Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., and Guo, D.Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300, 2024.
Tian etal. (2023)Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C.D.Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5433–5442. Association for Computational Linguistics, 2023.
Valmeekam etal. (2023)Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S.On the planning abilities of large language models - A critical investigation.In Advances in Neural Information Processing Systems, pp. 75993–76005, 2023.
Wan etal. (2024)Wan, G., Wu, Y., Chen, J., and Li, S.Dynamic self-consistency: Leveraging reasoning paths for efficient LLM sampling.CoRR, abs/2408.17017, 2024.
Wang etal. (2022)Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., Chowdhery, A., and Zhou, D.Self-consistency improves chain of thought reasoning in language models.In Proceedings of the 11th International Conference on Learning Representations, 2022.
Wang etal. (2024)Wang, X., Feng, S., Li, Y., Yuan, P., Zhang, Y., Pan, B., Wang, H., Hu, Y., and Li, K.Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning.CoRR, abs/2408.13457, 2024.
Wei etal. (2022)Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., and Zhou, D.Chain-of-thought prompting elicits reasoning in large language models.In Advances in Neural Information Processing Systems, pp. 24824–24837, 2022.
Weibull (1951)Weibull, W.A statistical distribution function of wide applicability.Journal of applied mechanics, 1951.
Xiong etal. (2024)Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B.Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.In Proceedings of the 12th International Conference on Learning Representations, 2024.
Yang etal. (2024)Yang, R., Rajagopal, D., Hayati, S.A., Hu, B., and Kang, D.Confidence calibration and rationalization for LLMs via multi-agent deliberation.In International Conference on Learning Representations Workshop on Reliable and Responsible Foundation Models, 2024.
YAO etal. (2024)YAO, Y., Wu, H., Guo, Z., Biyan, Z., Gao, J., Luo, S., Hou, H., Fu, X., and Song, L.Learning from correctness without prompting makes LLM efficient reasoner.In Proceedings of the 1st Conference on Language Modeling, 2024.
Ying etal. (2024)Ying, H., Zhang, S., Li, L., Zhou, Z., Shao, Y., Fei, Z., Ma, Y., Hong, J., Liu, K., Wang, Z., Wang, Y., Wu, Z., Li, S., Zhou, F., Liu, H., Zhang, S., Zhang, W., Yan, H., Qiu, X., Wang, J., Chen, K., and Lin, D.Internlm-math: Open math large language models toward verifiable reasoning, 2024.
Yu etal. (2024)Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T., Li, Z., Weller, A., and Liu, W.Metamath: Bootstrap your own mathematical questions for large language models.In Proceddings of the 12th International Conference on Learning Representations, 2024.
Zamil & Rabby (2024)Zamil, P. and Rabby, G.Aime problems 1983 to 2024, 2024.
Zhang etal. (2024)Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J.ReST-MCTS*: LLM self-training via process reward guided tree search.In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024.
Zhang etal. (2023)Zhang, Z., Zhang, A., Li, M., and Smola, A.Automatic chain of thought prompting in large language models.In Proceedings of the 11th International Conference on Learning Representations, 2023.
Zhao etal. (2024)Zhao, X., Zhang, H., Pan, X., Yao, W., Yu, D., Wu, T., and Chen, J.Fact-and-reflection (far) improves confidence calibration of large language models.In Findings of the Association for Computational Linguistics, pp. 8702–8718, 2024.
Zhou etal. (2023)Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q.V., and Chi, E.H.Least-to-most prompting enables complex reasoning in large language models.In Proceedings of the 11th International Conference on Learning Representations, 2023.
Appendix A Theoretical Results
A.1 Proof of Proposition 1 and Proposition 2
Proof.
(Sc) First, we denote the sampling probability distribution of the LLM as ,and the confidence function as , where are sampled on the distribution .Apply the error decomposition, we have
(Ppl) Another way is to use the confidence function to build the sampling probability distribution,i.e.,
Now, we have
Hence,
Hence, we complete the proof.∎
A.2 Model Error Comparison in Ideal Case
Proposition 3(Comparison of Model Errors).
Consider a setting with infinite sampling of reasoning paths () where each incorrect reasoning path leads to a unique answer, that is, for any where and . The model error of self-consistency () and perplexity () satisfy:
(1)
The inequality is strict when the consistency function identifies equivalent correct reasoning paths more than once.
Remark 7.
Although the assumptions in Proposition3 are idealized, this special case demonstrates that the consistency function in self-consistency achieves better model error than the perplexity method.In practice, the perplexity method always gives larger model error compared to the self-consistency method, as it does not leverage the consistency function to analyze the structural properties of specific reasoning problems.The proof is presented as follows.
Proof.
We first recall the definitions of the model error for self-consistency and perplexity:
(2)
where is the set of reasoning paths sampled from the LLM.We can compute the difference between the model error of Sc and Ppl as follows:
(3)
Assuming infinite samplings, the unbiasedness of Sc gives us:
(4)
For any , consider two cases:
(i)
If , then is the correct answer. We have
(5)
Let and , then
(6)
Equality holds if . This indicates that if the consistency function is effective at least once, making .
(ii)
If , then is incorrect. Assuming distinct answers for wrong reasoning paths, we have , thus
(7)
since only one satisfying that equals the incorrect answer .
Therefore, , indicating that the model error of self-consistency is always less than or equal to the model error of perplexity under our assumptions. Moreover, if the consistency function is effective at least once, the model error of self-consistency is strictly less than the model error of perplexity.∎
A.3 Proof of Theorem 3
Proof.
For given answer , the estimated probability of Pc is , allowing the reasoning error of Pc can be computed as follows.
We define , which means that how many reasoning paths whose answers are can be covered given limited sampling budget.Note that we further have , thus
where .Again, we have
Hence,
which completes the proof.∎
A.4 Proof of Theorem 5
Proof.
Let us set the pruning threshold by . Then, we have
However, we only have an estimation of , i.e., , where are sampled reasoning paths whose answer is .Therefore, the reasoning error of our approximate version can be computed by
Hence, we only need to consider the probability .Using Hoeffding’s inequality, we can obtain that
We set , then
Hence, we complete the proof.∎
Appendix B Pseudo Code of Rpc Method
In this section, we provide the pseudo-code of Rpc. The output of Algorithm 1 is the set of reasoning paths with the highest confidence. The answer extraction function can then be used to transform these reasoning paths into answers.
Algorithm 1: Reasoning-Pruning Perplexity Consistency (Rpc)
Input: sampled reasoning paths, their LLM internal probabilities, and the consistency function
Output: most-confident reasoning paths with probabilities
{Phase 1: Reasoning Pruning}
  Fit the mixture-distribution parameters to the internal probabilities of all sampled reasoning paths (subsection 3.2)
  Compute, for each reasoning path, the probability that it belongs to the high-probability component (subsection 3.2)
  Remove reasoning paths that are more likely to belong to the low-probability component, subject to the truncated-mean safeguard
{Phase 2: Perplexity Consistency}
  Initialize the confidence of every candidate answer to zero
  for each remaining reasoning path do
      add its internal probability to the confidence of its extracted answer
  end for
  return the reasoning paths (and confidence) of the most confident answer
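For readers who prefer executable form, the snippet below is a compact Python rendering of Algorithm 1 under the same simplifications as the earlier sketches: the `prune` callback and `extract_answer` are placeholders (the mean-threshold example pruner merely stands in for the Weibull-mixture pruning of Section 3.2), and de-duplication of identical reasoning paths is omitted for brevity.

```python
def rpc(paths, path_probs, extract_answer, prune):
    """Algorithm 1 sketch: prune low-probability reasoning paths (Phase 1),
    then accumulate the internal probability of each surviving path into
    the confidence of its extracted answer (Phase 2), and return the
    reasoning paths of the most confident answer."""
    keep = prune(path_probs)                      # Phase 1 (Section 3.2)
    conf, members = {}, {}
    for text, prob, kept in zip(paths, path_probs, keep):
        if not kept:
            continue
        ans = extract_answer(text)                # consistency via answers
        conf[ans] = conf.get(ans, 0.0) + prob     # Phase 2 (Section 3.1)
        members.setdefault(ans, []).append(text)
    best = max(conf, key=conf.get)
    return members[best], conf[best]

# Example with a trivial mean-threshold pruner standing in for the
# Weibull-mixture pruning of Section 3.2.
paths = ["... answer is 42", "... answer is 42", "... answer is 7"]
probs = [0.30, 0.25, 0.02]
mean_prune = lambda ps: [p >= sum(ps) / len(ps) for p in ps]
print(rpc(paths, probs, lambda t: t.split()[-1], mean_prune))
```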
Appendix C Detailed Experimental Settings
C.1 Datasets
For mathematical reasoning tasks, we evaluate our proposed methods and the comparison methods on four mathematical datasets: MATH, MathOdyssey, OlympiadBench, and AIME.
• MATH dataset (Hendrycks et al., 2021b) comprises challenging competition math problems; we use its 5,000 test problems for evaluation.
• OlympiadBench dataset (He et al., 2024) contains 8,476 Olympiad-level mathematics and physics problems. We select the English problems without images, resulting in a test set of 1,284 problems.
• AIME dataset (Zamil & Rabby, 2024) contains 993 test problems collected from the American Invitational Mathematics Examination, spanning from 1983 to 2024.
For code generation tasks, we conduct experiments on three common benchmark datasets. HumanEval (Chen et al., 2021) contains 164 hand-written Python programming problems. MBPP (Austin et al., 2021) (sanitized version) consists of 427 entry-level programming problems. We also include the introductory-level problems of APPS (Hendrycks et al., 2021a), which contains 1,000 problems.
C.2 Details of the Mathematical Reasoning Task
For all the experiments in the main paper, we use a sampling temperature of 1.0 and set the top-p parameter to 0.95.
Prompt for Math Reasoning Tasks.
The InternLM2-MATH-Plus 1.8B and 7B models are chat models that facilitate conversations between two roles: “user” and “assistant”.The prompt for the “user” role is provided in PromptC.2. Similarly, the prompt for the DeepSeek-Math 7B model is shown in PromptC.2.
Prompt for Math Verbalized Method.
We observed that the tuned math models are challenging to prompt for generating confidence. Therefore, we adopted the methods from Tian etal. (2023) to calculate the probability based on the likelihood of the first generated “True” token and the first generated “False” token. The corresponding prompt is provided in PromptC.2.
C.3 Details of the Code Generation Task
Code Generation.
On the code generation task, we let the LLM generate a code snippet to solve a given programming problem and then evaluate its functional correctness based on the ground-truth test cases provided by the dataset. In detail, we set top-p to 0.95 and the maximum generation length to 1024. For code snippet post-processing, we first extract the code text from the code block surrounded by triple backticks (```), and then we follow Chen et al. (2021) to truncate the generated code snippet before the following stop sequences: “\nclass”, “\ndef”, “\n#”, “\nif”, “\nprint”. At the same time, we also obtain the log-probability of each token from the LLM response. For the verbalized setting, the verbalized confidence is extracted from the text generated by the LLM along with the code snippet.
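A minimal sketch of this post-processing step is shown below; the regular expression and the handling of responses without a fenced block are assumptions, while the stop sequences are the ones listed above.

```python
import re

STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def postprocess_completion(response_text):
    """Extract the code inside the first triple-backtick block (if any),
    then truncate at the earliest stop sequence, following the
    HumanEval-style post-processing described above."""
    match = re.search(r"```(?:python)?\n(.*?)```", response_text, re.DOTALL)
    code = match.group(1) if match else response_text
    cut = len(code)
    for stop in STOP_SEQUENCES:
        idx = code.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return code[:cut]
```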
Self-consistency on Code.
We follow Chen et al. (2023a) to sample 100 test cases for each programming problem from the same model. Then, we measure self-consistency for code at the semantic-equivalence level, based on the execution behavior of any two code snippets on this set of test cases. More formally, we implement the consistency function as an indicator of whether two code snippets are semantically equivalent, i.e., it equals one if and only if the two snippets produce the same results on this set of test cases.
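The sketch below illustrates semantic-equivalence clustering by execution signature; it uses a plain `exec` with no sandboxing and assumes the test inputs are already available as argument tuples, both simplifications of the actual pipeline.

```python
from collections import defaultdict

def execution_signature(program_src, entry_point, test_inputs):
    """Run one candidate program on every generated test input and record
    the results; exceptions become part of the signature too."""
    namespace = {}
    exec(program_src, namespace)          # no sandboxing -- sketch only
    func = namespace[entry_point]
    results = []
    for args in test_inputs:
        try:
            results.append(repr(func(*args)))
        except Exception as exc:
            results.append(f"error:{type(exc).__name__}")
    return tuple(results)

def cluster_by_semantics(programs, entry_point, test_inputs):
    """Two programs are treated as consistent iff they produce identical
    results on the shared set of generated test cases."""
    clusters = defaultdict(list)
    for src in programs:
        clusters[execution_signature(src, entry_point, test_inputs)].append(src)
    return list(clusters.values())
```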
Prompt for Generating Code.
The prompt for generating code consists of a header, a function signature, and a docstring, and the LLM needs to implement the body of this function. An illustration is shown in Prompt C.3.
Prompt for Code Verbalized Method.
For generating code with verbalized confidence, we added instructions for producing verbalized confidence, as well as format requirements to facilitate the extraction of the code and the confidence score. We also gave a simple example at the end of the prompt to help the LLM understand the format requirements. An illustration is shown in Prompt C.3.
Appendix D Detailed Experimental Results
D.1 Performance on GSM8k dataset.
In Section 2, we analyze the GSM8K dataset, which is relatively easy and therefore allows accurate estimation of ground-truth probabilities with 100 samples. Due to this characteristic, we exclude the GSM8K dataset from our main experiments, as a limited number of samples is sufficient for accurate confidence evaluation. Table 4 shows the performance of the InternLM2-Math-Plus 7B model on GSM8K with various sampling temperatures, where the number of samples is set to match Figure 2. The results show that the Rpc method outperforms the comparison methods.
Table 4. Accuracy on GSM8K with various sampling temperatures (InternLM2-Math-Plus 7B).

Temperature | Ppl        | Verb       | Sc         | Rpc
T = 1.0     | 86.97±0.29 | 63.67±0.98 | 89.32±0.26 | 89.45±0.38
T = 1.1     | 86.78±0.43 | 62.25±1.00 | 89.44±0.35 | 89.51±0.36
T = 1.3     | 86.65±0.57 | 61.29±0.84 | 88.92±0.32 | 89.08±0.57
D.2 Results with High Sampling Temperature.
Using a high sampling temperature enables language models to produce more diverse outputs, potentially enhancing reasoning performance. However, it also leads to an increase in estimation error. To investigate the effectiveness of our approaches in addressing the estimation error issue, we conducted experiments with higher sampling temperatures (T = 1.1 and T = 1.3) using the InternLM2-Math-Plus 7B model. The results in Table 5 indicate that our Rpc approach consistently surpasses the baseline methods. Notably, a significant performance gap persists between Rpc and Sc, indicating that our methods effectively tackle the estimation error issue even under high-temperature sampling conditions.
Table 5. Accuracy with higher sampling temperatures (InternLM2-Math-Plus 7B).

Temperature = 1.1
Method | MATH       | MathOdyssey | OlympiadBench | AIME
Ppl    | 47.35±0.16 | 28.59±1.30  | 7.27±0.23     | 6.02±0.34
Verb   | 25.51±0.23 | 9.41±0.44   | 3.66±0.16     | 3.07±0.15
Sc     | 50.66±0.22 | 27.89±0.43  | 10.74±0.15    | 8.73±0.24
Rpc    | 52.58±0.14 | 32.98±0.69  | 11.00±0.24    | 9.30±0.29

Temperature = 1.3
Method | MATH       | MathOdyssey | OlympiadBench | AIME
Ppl    | 47.58±0.31 | 26.38±1.41  | 7.76±0.46     | 6.50±0.41
Verb   | 24.62±0.33 | 8.60±0.26   | 3.11±0.17     | 2.29±0.12
Sc     | 50.65±0.14 | 27.61±0.67  | 10.49±0.18    | 8.02±0.20
Rpc    | 53.12±0.14 | 33.19±0.56  | 10.91±0.18    | 8.83±0.23
D.3 Performance with Diverse Number of Samplings
In Figure 4, we plot the accuracy of the InternLM2-Math-Plus 7B model on four mathematical reasoning datasets with different sample sizes. Here, we give detailed results for different models and different sampling temperatures.
Different Models Scales.
The performance of the relatively small model, InternLM2-Math-Plus 1.8B, is presented in Figure 7. Similar conclusions can be drawn from these results. For the MathOdyssey dataset, the Ppl method shows superior performance compared to other methods, which can be attributed to the relatively low model error of Ppl on this dataset, allowing the perplexity-based approach to function effectively. Furthermore, the Rpc method consistently outperforms the Sc method, demonstrating its ability to enhance the convergence properties of the Sc method.
Different Sampling Temperatures.
We also evaluate the InternLM2-Math-Plus 7B model with sampling temperatures T = 1.1 and T = 1.3. The results are presented in Figure 8 and Figure 9. They demonstrate that the Rpc method effectively improves the model's reasoning performance at higher temperatures, as it leverages the increased diversity in sampling to enhance self-consistency. In contrast, the Sc method's performance deteriorates due to increased estimation errors at higher temperatures.