LLM Theory and Prompt Utilization Strategy Through the Lens of Probability - 2
Written by Devops Traveler Tales, 14 July 2025
Taking Prompts to a New Level: CoT to ToT, and Inference Models
The prompt strategies we've examined so far have primarily focused on narrowing the model's response scope to enhance consistency and stability. Now, we take it a step further by exploring techniques that guide the model to engage in deeper reasoning and more accurate inference.
This chapter's prompts are based on the Chain-of-Thought (CoT) and Tree-of-Thought (ToT) sections from the Prompt Engineering Guide website.
Chain-of-Thought (CoT): Guiding Step-by-Step Reasoning
Chain-of-Thought (CoT) prompting doesn't just ask the model for a final answer—it instructs the model to articulate the reasoning process step by step. By doing so, CoT reduces the likelihood of incorrect outputs at each stage of reasoning, ultimately leading to more accurate conclusions.
Let's examine a practical example to gain a better understanding of how CoT works in action.
"If there are four boxes, and each box contains 2, 3, 5, and 7 apples respectively, what is the total number of apples?"
Now, let's think step-by-step.
Model's Response Example:
1. Identifying the Given Information:
"There are a total of four boxes, containing 2, 3, 5, and 7 apples, respectively."
2. Step-by-Step Calculation:
"First, 2 + 3 = 5. Adding 5 gives 10, and finally, adding 7 gives 17."
3. Final Answer:
"The total number of apples is 17."
By using Chain-of-Thought (CoT) prompting, the model is guided to think through intermediate steps first, helping it control potential branching paths that could lead to incorrect answers before arriving at a final conclusion.
This method is especially effective for tasks involving complex logic problems, questions requiring deep reasoning, and debugging-related problem-solving. In such cases, adding a clear instruction at the end of the prompt, such as "Let's think step by step" or "Think first, then act" helps the model proceed in stages, suppressing incorrect branches at each step of the reasoning process.
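To make this concrete, here is a minimal sketch of CoT prompting in code, assuming the OpenAI Python SDK; the model name and the exact phrasing of the cue are illustrative, not prescriptive:

```python
# Minimal Chain-of-Thought sketch using the OpenAI Python SDK.
# The model name below is illustrative; any capable chat model works.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "If there are four boxes, and each box contains 2, 3, 5, and 7 "
    "apples respectively, what is the total number of apples?"
)

# Appending the cue nudges the model to spell out intermediate steps
# instead of jumping straight to a final (and possibly wrong) answer.
cot_prompt = f"{question}\n\nLet's think step by step."

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```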
Tree-of-Thought (ToT): Controlling Probabilistic Branching Through Parallel Reasoning
While Chain-of-Thought (CoT) prompting helps reduce the chance of incorrect answers by guiding the model through step-by-step reasoning, Tree-of-Thought (ToT) takes this a step further, allowing the model to explore and evaluate multiple reasoning paths in parallel. By considering several possible paths simultaneously, the model can more efficiently explore a wider range of solutions than if it were constrained to follow a single line of thought from the beginning.
At the core of ToT prompting is the instruction for the model to generate multiple potential solutions, assess them independently, and choose the most plausible one. The model processes these lines of reasoning in parallel, analyzing each path individually. This approach enhances the ability to suppress probabilistic branches that lead to incorrect conclusions, ultimately yielding more accurate and trustworthy outputs.
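Before turning to a worked example, here is a deliberately simplified sketch of the generate-evaluate-select loop behind ToT, again assuming the OpenAI Python SDK; real ToT implementations search over trees of partial thoughts, while this sketch compresses the idea into one sampling round followed by one judging round:

```python
# Simplified Tree-of-Thought-style sketch: sample several independent
# reasoning paths, then ask the model to evaluate them and pick one.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative

def ask(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "What pricing strategy should a new product adopt, and why?"

# 1. Generate several reasoning paths; a higher temperature encourages
#    the paths to branch in different directions.
paths = [
    ask(f"{question}\nReason step by step, then state a conclusion.", 0.9)
    for _ in range(3)
]

# 2. Have the model evaluate the candidates and select the most sound
#    one; a low temperature keeps the judgment more deterministic.
numbered = "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(paths))
print(ask(
    f"{question}\n\nCandidate answers:\n\n{numbered}\n\n"
    "Evaluate each candidate's reasoning and state which is most sound.",
    0.1,
))
```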
Let's take a deeper look at how this works with a real-world example.
Prompting Example:
"A company is deciding on the pricing strategy for a new product. Have three experts each propose possible strategies from the following perspectives:
[Expert A – Market analyst's perspective]
[Expert B – Accounting and financial perspective]
[Expert C – Marketing perspective]
Then, evaluate the pros and cons of each approach and recommend the most effective one."
Model Response Example (Summary):
[Expert A – Market analyst's perspective]
- Low pricing strategy compared to competitors (quick market share acquisition, but risk of profitability decline)
- Premium pricing strategy (high brand value, but limited initial demand)
[Expert B – Accounting and financial perspective]
- Low pricing strategy (possible increase in short-term sales, but risk of sales below costs)
- Market penetration first, then increase price (risk of customer churn due to price increase after market entry)
[Expert C – Marketing perspective]
- Premium strategy (enhanced brand image, but high initial cost)
- Market penetration first, then increase price (high post-marketing effectiveness, but requires management for long-term trust in price)
Final evaluation and recommendation:
Option 1 (low pricing strategy): Market share can be quickly acquired, but high risk in profitability exists
Option 2 (premium strategy): Brand image can be highly enhanced, but comes with limited initial demand
Option 3 (Penetration first, then gradual increase): Lowers initial entry barriers and allows profitability management after securing customers
Final recommendation: Option 3 (Penetration first, then gradual increase) is deemed to provide the most balanced profitability in the long term.
As shown in the example above, the Tree-of-Thought (ToT) method allows the model to explore multiple perspectives in parallel, minimizing probabilistic branches that could lead individual thought paths toward incorrect conclusions. It also enables a clear comparison of each path's strengths and weaknesses, helping the model arrive at a more accurate and reliable final answer.
Hulbert (2023) developed Tree-of-Thought into a concrete prompting technique by proposing a scenario in which multiple expert agents share and evaluate their reasoning step by step within a single prompt. An illustration of this idea is shown below:
Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...
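For convenience, the same template can be assembled in code; the sample question below is just an illustrative placeholder slotted into the "The question is..." stub:

```python
# Hulbert's (2023) single-prompt ToT template; the sample question is
# an illustrative placeholder.
TOT_TEMPLATE = """\
Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is: {question}"""

prompt = TOT_TEMPLATE.format(
    question="What pricing strategy should a new product adopt?"
)
print(prompt)  # send this string to any chat model as a user message
```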
The Emergence of Inference Models and Practical Application Strategies
The latest reasoning models, such as OpenAI's o-series, are evolving to autonomously perform complex reasoning processes like CoT and ToT, which previously had to be guided through prompts. Their ability to think through a problem and find the optimal answer has been significantly enhanced, no longer requiring explicit cues like "let's think step by step."
This is because modern models excel at selecting the most appropriate and consistent answers by comparing and evaluating various answer candidates internally. In short, the latest GPT models can perform inference and reasoning naturally to some degree, meeting the user's desired level of performance without additional prompting.
However, in practice, we cannot rely solely on this automatic inference capability. Even with their advanced reasoning skills, these models may still produce unexpected answers or logical leaps, especially when handling complex problems or tasks that require careful pre-planning. For this reason, it's best to actively leverage the model's built-in reasoning ability while also applying structured prompting strategies, such as CoT and ToT, when needed.
By combining the model's autonomous reasoning with well-designed prompting techniques, you can build a balanced and efficient approach, achieving more accurate and reliable results for even the most complex challenges.
Structural Vulnerabilities of the Probability-Based Approach
Vulnerability in High-Context Responses
Large language models (LLMs) inherently generate text using a probabilistic approach. That is, they are optimized to produce responses by selecting the words and expressions that are statistically most plausible, i.e., those used most frequently across the data they were trained on. This means LLMs tend to regress to the mean of their training data.

However, real-world work environments often operate in the opposite direction. Organizations and individuals typically act based on specific contexts and a history of accumulated decisions. Over time, the generalized, averaged responses of LLMs increasingly diverge from these realities.
While generic answers from LLMs may be helpful in the early stages of a task, as work progresses, prior decisions, revisions, and exceptional cases accumulate, forming a high-context environment that deviates from the general patterns in the training data. As a result, "standardized" solutions offered by LLMs may prove inadequate or even misleading, requiring ongoing human intervention and correction.
This discrepancy arises because most real-world problems do not involve arriving at a single correct answer from the outset. Instead, they involve continuously making choices among many possibilities, based on each organization's or individual's unique values and judgment. Over time, these accumulated decisions create a context that is distinct and increasingly difficult for an LLM, trained on averaged data, to reflect accurately.
Therefore, it's essential to continuously evaluate whether an LLM's output is directly applicable to real-world tasks or whether direct intervention by human insight is needed to fill the gap between generalized responses and specific, high-context realities.
Between Plausibility and Truth: Avoiding the Illusion of a Correct Answer
The sentences generated by large language models (LLMs) are often remarkably fluent and natural. But it's important to remember this: these sentences aren't plausible because they're true—they're generated because they sound credible.
Most LLMs, including GPT, generate responses to the given situation by selecting the most statistically likely combinations of words based on their training data. In other words, they don't evaluate whether something is actually true or false—they reproduce how people have typically expressed similar ideas in the past. The model's logic is not "What is correct?" but rather "What is most likely to be said here?"
Because of this mode of operation, LLM-generated sentences appear very natural and convincing on the surface; in reality, however, they can persuasively include incorrect facts or even non-existent information.
Humans are more likely to accept a statement as true when it is fluently expressed, especially when the model's response aligns with information they already know. To prevent this illusion of a correct answer, always treat the sentences generated by the model as a "plausible draft" and apply the following checks:
- Is the source clearly identifiable and verifiable?
- Can the presented information be objectively confirmed?
- Is there any missing or omitted information?
- Has the model generated information that is plausible, yet non-existent?
By critically reviewing LLM-generated content through this lens, we can move beyond superficial plausibility and arrive at accurate, trustworthy information.
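One way to put this checklist to work is to run it as a second, verification pass over the model's draft. Below is a minimal sketch, assuming the OpenAI Python SDK with an illustrative model name; keep in mind that a model can miss its own fabrications, so this supplements rather than replaces human review:

```python
# Sketch: run the review checklist as a second verification pass.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

CHECKLIST = """\
Review the draft below against these checks:
1. Is the source clearly identifiable and verifiable?
2. Can the presented information be objectively confirmed?
3. Is any information missing or omitted?
4. Is anything plausible-sounding but possibly non-existent?
Flag every claim that fails a check."""

def verify(draft: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        temperature=0.0,  # keep the review as deterministic as possible
        messages=[{"role": "user",
                   "content": f"{CHECKLIST}\n\nDraft:\n{draft}"}],
    )
    return resp.choices[0].message.content
```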
Reinforced Conformity, Disappearing Dissent: The Challenge of Ensuring Diversity
LLMs generate highly logical and persuasive sentences, but that doesn't mean they always offer a balanced perspective. In fact, they tend to respond sensitively to the user's phrasing and intent, often constructing answers that reinforce the viewpoint the user seems to favor.
This phenomenon is no coincidence. LLMs learn through statistical associations formed over massive datasets—for example, "questions like this are usually followed by answers like that." So, when a user asks about a topic in a positive tone, the model retrieves and amplifies similarly positive responses found in its training data.
The issue is that opposing viewpoints or alternative perspectives don't automatically emerge in this process. It's not that the model is unaware of them; instead, unless the user explicitly requests such views, the model typically won't surface them. As a result, users may see answers that echo their own opinions and fall more easily into confirmation bias, thinking, "See? I was right all along."
To avoid this kind of bias, it's essential to deliberately introduce diversity into the question itself. Consider adding instructions like:
- "What are some opposing views on this argument?"
- "How might this issue be seen from a different perspective?"
- "Please compare and explain the logic of both sides."
Only by explicitly requesting diverse perspectives will the model surface hidden counterarguments or alternative viewpoints.
Ultimately, while LLMs are capable of presenting multiple perspectives, it remains the user's responsibility to evaluate those perspectives and reach a well-balanced conclusion.
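As a small illustration, such a diversity instruction can simply be appended to the original prompt; the wording below is one possible phrasing, not a fixed formula:

```python
# Sketch: append a counterargument request so the model surfaces views
# it would not volunteer on its own. The suffix wording is illustrative.
COUNTERVIEW_SUFFIX = (
    "\n\nAlso list the strongest opposing views on this argument, "
    "and compare the logic of both sides."
)

def with_counterviews(prompt: str) -> str:
    return prompt + COUNTERVIEW_SUFFIX

print(with_counterviews("Remote work increases productivity. Do you agree?"))
```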
A Question Too Big to Capture at Once: The Practical Limits of Large-Scale Task Descriptions
Many users try to automate complex tasks with LLMs, only to quickly encounter what might be called the "wall of large-scale task descriptions." A common assumption behind this effort is:
"If I just input everything from start to finish, won't the model figure it out on its own?"
In practice, however, that's rarely the case. When a prompt becomes too lengthy, three major problems tend to emerge:
- The prompt may be cut off in the middle if the token limit is exceeded.
- With too much information mixed together, the model may struggle to identify the core question properly.
- There is an increased risk of hallucination, where the model fabricates information out of confusion.
It becomes especially problematic when examples, conditions, instructions, and other elements are combined into a single, lengthy prompt. In such cases, the model may become fixated on incidental details or examples, rather than the actual task, leading to outputs that are irrelevant or misleading.
But oversimplifying the prompt creates a different set of issues. When a question lacks sufficient context, users are often forced to take a multi-turn approach, breaking the task into multiple rounds of interaction with the model.
Yet, multi-turn conversations come with their own set of challenges. As the dialogue gets longer, the model may lose track of earlier content or misjudge the priority of previously given information. Over time, this can cause its responses to drift away from the original context.
In the end, you're stuck in a dilemma:
- Provide extensive information at once, and the model becomes overwhelmed, generating confused responses.
- Provide the information in parts, and you risk losing context or experiencing a dip in quality.
Currently, there is no perfect solution to this dilemma, but the most effective approach in practice is this: write prompts clearly and concisely, while providing the necessary context gradually and incrementally.
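Here is a minimal sketch of this incremental approach, assuming the OpenAI Python SDK; the model name, the chunking strategy, and the sample context are all illustrative:

```python
# Sketch: feed context in small chunks across turns instead of one
# oversized prompt. Assumes the OpenAI Python SDK; model name, chunking
# strategy, and sample context are illustrative.
from openai import OpenAI

client = OpenAI()
history = [{
    "role": "system",
    "content": "You are assisting with a pricing decision. "
               "Acknowledge each context chunk briefly.",
}]

def turn(content: str) -> str:
    history.append({"role": "user", "content": content})
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

context_chunks = [
    "Decision log: last quarter we rejected a premium-only strategy.",
    "Constraint: unit cost must stay below 60% of the list price.",
]
for chunk in context_chunks:  # provide context gradually, turn by turn
    turn(f"Context:\n{chunk}")

print(turn("Given all the context above, recommend a pricing strategy."))
```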
Conclusion - Probability Control Strategy for Better AI Use
In this article, we began with the fundamental insight that LLMs generate text based on probability. We explored a range of strategies and examples that demonstrate why controlling that probability is so important.
Initially, we examined how LLMs are essentially probabilistic systems designed to predict the next word. That led to a better understanding of why prompts matter—not just as a way to ask questions, but as a tool to narrow and guide the model's probabilistic response path.
In particular, Chain-of-Thought (CoT) is a strategy that helps prevent incorrect answers by prompting the model to lay out its reasoning step-by-step explicitly. Tree-of-Thought (ToT) further developed this approach by exploring multiple paths in parallel, effectively managing incorrect probabilistic branches to lead to superior inference results.
On the other hand, we also examined the limitations of probability-driven approaches. We found that LLMs are prone to providing inappropriate responses in high-context environments, that it is dangerous to trust generated results merely because they seem plausible, and that models may not spontaneously present diverse perspectives. We also identified the issues that arise when prompts are too long or complex.
That brings us to a more recent development: the Mixture-of-Experts (MoE) architecture. MoE models are composed of multiple specialized "expert" modules, with the system dynamically selecting the most relevant one for each input. This design enables finer-grained control over probabilistic choices within the model itself, effectively internalizing the types of strategies we've discussed.
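To make the routing idea concrete, here is a toy numpy sketch of top-1 expert gating; real MoE layers route per token inside a transformer with learned weights, whereas this toy uses random matrices purely to show the control flow:

```python
# Toy Mixture-of-Experts routing: a gating function scores the experts
# for each input and only the top-scoring expert runs. Weights here are
# random placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))                       # gating weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # expert weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                            # one score per expert
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over experts
    top = int(np.argmax(probs))                    # top-1 routing
    return probs[top] * (x @ experts[top])         # run only that expert

print(moe_forward(rng.normal(size=d)))
```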
Ultimately, achieving strong results with LLMs requires a clear understanding of their probabilistic nature and the ability to control their output through effective prompting techniques. Sometimes that means leveraging the model's built-in capabilities—such as CoT, ToT, or MoE—and other times it means carefully crafting prompts to guide the model manually. The most practical and efficient approach is a balanced combination of both.
By applying the theoretical insights and practical strategies discussed here, you'll be better equipped to use LLMs in a more effective and trustworthy manner.

© 2025 AFI, INC. All Rights Reserved. Pictures may not be copied without permission.