Introduction to Reasoning Systems
The rapid development of LLMs has transformed how humans, organizations, and systems approach knowledge work. From reflection to task planning, symbolic systems, and chain of thought, the defining property of future agentic systems will be not just their benchmark scores, but how reliably they can apply reasoning to complex workloads at human-level performance. When a query is simple, a reasoning agent should answer it as such, relying on heuristics and shortcuts, since reasoning and intelligence are largely functions of pattern recognition. In the absence of explicit algorithmic design, can LLMs infer the latent properties of a problem and then solve it the way humans do?
This process of deduction and induction requires an LLM to be logical in problem-solving, fast and efficient enough not to overcomplicate simple queries, and able to produce feasible, interpretable logic. Dual process theory explains how humans switch efficiently between two evolved modes of reasoning: an implicit process that is automatic and unconscious, and an explicit process that is controlled and conscious. These are termed System 1 and System 2 respectively, where System 1 is fast and efficient but prone to biases and systematic errors because the thinking process is not under voluntary control.
Fast vs. Slow Reasoning Systems in LLMs
Fast LLMs (System 1) produce probable output solutions, but their next-token prediction does not expose an explicit reasoning trace. Slow LLMs (System 2) use consequential thinking, breaking a problem down into a concrete plan and then executing that plan until they reach a highly likely answer. A reasoning trace is the step-by-step process a language model uses to articulate its internal thought pattern while solving a task. For instance, in the context of A* planning, reasoning traces are generated by autoregressive models with a custom tokenization scheme that outlines the next position in the path or a backtracking step. Well-known agentic frameworks like ReAct combine reasoning traces with the model's actions and observations to generate new reasoning traces that feed back into the model.
Most models we think of today, like GPT-4o and Claude 3.5, are fundamentally System 1 LLMs. Through attention, they are trained across countless domains of knowledge and embed broad world knowledge, serving as general predictors for questions passed within their context window. Meanwhile, System 2 chain-of-thought models like o1 are growing in popularity, but they have several disadvantages. For simple tasks, o1-preview takes long periods of time and generates unnecessary search traces to answer a query. To manage these tradeoffs, companies use routers that select a model based on the prompt and inject different prompts depending on the nature of the user query. But incorporating this System 1 and System 2 behavior dynamically into a single model, one intelligent enough to shape its own response, is far more useful for reducing costs and improving real-world usability without extra prompting techniques or additional meta-learners.
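To make the router idea concrete, here is a minimal sketch of prompt-based model selection. The complexity heuristic, threshold, and model names are all illustrative assumptions, not from any production system or the Dualformer paper:

```python
# Hypothetical router: picks a fast (System 1) or slow (System 2) model
# based on a crude complexity estimate of the incoming prompt.

def estimate_complexity(prompt: str) -> float:
    """Crude proxy: longer prompts with planning keywords score higher."""
    keywords = ("plan", "prove", "step", "optimize", "derive")
    keyword_hits = sum(word in prompt.lower() for word in keywords)
    return len(prompt.split()) / 50.0 + keyword_hits

def route(prompt: str, threshold: float = 1.0) -> str:
    """Return which model tier should handle the prompt."""
    if estimate_complexity(prompt) >= threshold:
        return "system2-cot-model"
    return "system1-fast-model"

print(route("What is 2 + 2?"))  # short, no planning keywords -> fast model
print(route("Plan a step-by-step route through the maze, optimizing cost."))
```

A real router would typically use a learned classifier rather than keyword matching, but the control flow is the same: one extra component sits in front of two separate models, which is exactly the overhead Dualformer aims to remove.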
To achieve System 2 performance at System 1 speed, existing approaches fine-tune these large chain-of-thought System 2 models and make them smaller and more efficient via techniques like pruning and distillation. The most popular approach in the field today is Searchformer, an encoder-decoder transformer that predicts A* search dynamics and optimal plans. It integrates rotary embeddings, a popular positional encoding scheme in LLMs that gives the model global positional awareness. Training first imitated A* search dynamics by predicting full search traces, then fine-tuned the model via search-dynamics bootstrapping to generate shorter execution traces. However, these execution traces can balloon in size, so training often requires handling very long sequences, making it computationally expensive because the token sequences are significantly longer than those typically used in LLM training.
To address these challenges across training, inference, and model switching, researchers at Meta designed Dualformer, an approach to building a coherent System 1 and System 2 model for reasoning tasks without computationally expensive fine-tuning or a meta-controller. They achieved this by changing how the model learns to take in tasks, generate reasoning traces, and output final solutions: instead of providing complete reasoning traces in the training data, they supplied partial or incomplete traces. This data recipe achieved on-the-fly system configuration, where the model generates System 1 solutions at System 2 accuracy for A* maze and Sokoban search. More specifically, Dualformer's innovation was dropping parts of the reasoning trace during training to resemble the human-like shortcuts we take when solving logical problems (i.e., given a reasoning trace at t=1, can it predict the trace at t=5 without generating traces 2-4?).
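The core of the data recipe can be sketched in a few lines: the task prompt and final plan are kept intact, while intermediate trace steps are thinned at random. The step names and uniform per-step drop probability below are simplifications for illustration (Dualformer's actual dropping is structured by clause type, as described later):

```python
import random

# Sketch of randomized trace dropping for training data construction:
# each intermediate reasoning step survives with probability (1 - drop_prob),
# so the model must learn to jump across missing steps.

def drop_trace_steps(trace: list[str], drop_prob: float, rng: random.Random) -> list[str]:
    """Keep each intermediate step independently with probability (1 - drop_prob)."""
    return [step for step in trace if rng.random() >= drop_prob]

rng = random.Random(0)
full_trace = [f"step_{t}" for t in range(1, 6)]   # steps t = 1..5
partial = drop_trace_steps(full_trace, drop_prob=0.5, rng=rng)
print(partial)  # ['step_1', 'step_2', 'step_5'] with this seed
```

A training example built this way still ends in the correct final plan, so the supervision signal rewards reaching the answer without spelling out every intermediate step.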
What is Dualformer?
Dualformer’s primary task involved generating solutions for 15x15 and 30x30 maze navigation problems with random walls and random start/goal positions. The input used a tokenization schema with tokens for cell class and coordinates, and the prompt structure relied on special tokens (BOS, EOS, Start, Goal, Wall, Plan) to facilitate the search process. The A* algorithm generated a search trace containing node attributes (coordinates, cost, heuristic) and employed create/close tokens for node management.
The training process used drop-out methods, first popularized for MLPs, to randomly remove tokens from the reasoning trace so that the model is forced to generate an accurate plan with limited information. The training framework is designed around these randomized reasoning traces, leveraging the fact that humans often use shortcuts and patterns in decision-making, and it builds on the neural-network dropout technique, which omits certain units during training to improve generalization. By dropping structured elements of the A* search process, the framework introduces variability into the training data and forces the model to adapt to incomplete or abstracted information.
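To see what such a search trace looks like, here is a toy A* implementation that emits create/close clauses with coordinates, cost-so-far, and heuristic for each frontier insertion and node expansion. The clause format is illustrative, not the paper's exact token layout:

```python
import heapq

# Toy A* on a small grid that records a search trace: every frontier
# insertion emits a ("create", x, y, cost, heuristic) clause and every
# expansion a ("close", x, y, cost, heuristic) clause.

def astar_trace(size, start, goal, walls):
    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    trace = [("create", *start, 0, h(start))]
    frontier = [(h(start), 0, start)]          # (f = cost + h, cost, node)
    best_cost = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if cost > best_cost.get(node, float("inf")):
            continue                           # stale heap entry
        trace.append(("close", *node, cost, h(node)))
        if node == goal:
            return trace
        x, y = node
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in walls:
                if cost + 1 < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = cost + 1
                    trace.append(("create", *nxt, cost + 1, h(nxt)))
                    heapq.heappush(frontier, (cost + 1 + h(nxt), cost + 1, nxt))
    return trace

for clause in astar_trace(3, (0, 0), (2, 2), walls={(1, 1)}):
    print(clause)
```

Serialized, each clause becomes a short run of tokens, which is why full traces grow so quickly and why dropping clauses shrinks sequence length so effectively.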
Tokenization System and Preprocessing
The structured dropping strategies operate at several levels. Level 1 removes all Close clauses, pushing the model to predict whether further computation is needed and reducing redundancy. Level 2 builds on Level 1 by also removing cost tokens, requiring the model to infer costs independently. Level 3 adds randomness by dropping 30% of Create clauses, challenging the model to function with incomplete search-space expansions. Finally, a fourth level drops the search trace entirely, yielding solution-only examples.
A categorical distribution (p0, p1, p2, p3, p4) determines the likelihood of applying each dropping level. This approach generates more diverse solutions than Searchformer and produces shorter reasoning traces requiring fewer tokens. The framework differs from traditional masking by targeting the search trace rather than the inputs: instead of BERT-style masked language modeling, it uses next-token generation over reasoning traces. The model learns from both complete traces and traces with randomly dropped tokens, and it has two modes controlled by tokens at the start of the sequence: a BOS token followed by a control token that determines the mode (slow or fast).
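The sampling and sequence assembly described above can be sketched as follows. The probabilities, control-token names, and the simplification that only the highest level drops clauses (the real intermediate levels drop specific clause types) are all illustrative assumptions:

```python
import random

# Sketch of training-example construction: a categorical distribution over
# dropping levels chooses how aggressively the trace is thinned, and a
# control token after <bos> marks the resulting mode.

LEVEL_PROBS = [0.4, 0.2, 0.2, 0.1, 0.1]  # p0..p4; p0 = keep the full trace

def sample_dropping_level(rng: random.Random) -> int:
    return rng.choices(range(len(LEVEL_PROBS)), weights=LEVEL_PROBS, k=1)[0]

def build_sequence(prompt, trace, plan, level):
    # Highest level drops the trace entirely -> fast, solution-only example.
    # (Intermediate levels would drop specific clause types; simplified here.)
    mode = "<fast>" if level == len(LEVEL_PROBS) - 1 else "<slow>"
    kept_trace = [] if mode == "<fast>" else trace
    return ["<bos>", mode, *prompt, *kept_trace, *plan, "<eos>"]

rng = random.Random(0)
level = sample_dropping_level(rng)
seq = build_sequence(["start", "0", "0"], ["create", "close"],
                     ["plan", "0", "0"], level)
print(seq)
```

At inference time the same control token, supplied by the user or sampled by the model itself, selects fast, slow, or auto behavior.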
For baseline comparisons, the team implemented two different models. The fast mode baseline used a Solution-Only model that shared Dualformer’s architecture but trained exclusively on final solutions without reasoning traces. For slow mode comparisons, they used a Complete-Trace model trained on full A* search traces — essentially the base Searchformer model without search dynamics bootstrapping.
Model Card and Results for Benchmarks
Both baselines maintained consistent parameter counts with Dualformer: 15M for maze problems and 46M for Sokoban puzzles. The researchers developed several metrics to rigorously evaluate performance. The primary measures, 1-Solved-64 and 1-Optimal-64, involved sampling 64 responses for each task and checking if at least one response was correct or optimal. To assess robustness, they extended these to 3-Solved-64 and 3-Optimal-64 metrics. They also introduced a Success Weighted by Cost (SWC) metric to measure solution quality. It is worth noting that Searchformer, with its longer traces, did beat Dualformer on some metrics, but not by much.
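A simplified sketch of these metrics, with made-up data, is shown below. The SWC formula here (optimal cost divided by achieved cost, zero when unsolved) is one common way to weight success by cost and may differ in detail from the paper's exact definition:

```python
# Sketch of the evaluation metrics: k-Solved-64 asks whether at least k of
# 64 sampled responses solve the task; SWC rewards solutions whose cost is
# close to optimal.

def k_solved_n(solved_flags, k):
    """True if at least k of the sampled responses are valid solutions."""
    return sum(solved_flags) >= k

def success_weighted_by_cost(results):
    """results: list of (solved, optimal_cost, achieved_cost) per task."""
    total = 0.0
    for solved, optimal, achieved in results:
        if solved:
            total += optimal / achieved  # contributes 1.0 when optimal
    return total / len(results)

samples = [True] * 5 + [False] * 59          # 5 of 64 rollouts solved the task
print(k_solved_n(samples, k=1))              # True
print(k_solved_n(samples, k=3))              # True
print(success_weighted_by_cost([(True, 8, 8), (True, 8, 10), (False, 8, 0)]))
```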
The results proved remarkably impressive across all operating modes. In slow mode, Dualformer achieved a 97.6% success rate on challenging 30x30 mazes while using 45.5% fewer reasoning steps than Searchformer. Fast mode performance was equally compelling, with an 80% optimal rate that far surpassed the Solution-Only model’s 30%. When running in auto mode — where Dualformer independently chooses its reasoning approach — it maintained a 96.6% optimal rate while reducing reasoning steps by 59.9% compared to Searchformer.
A key advantage of this approach was its scalability: as the trace-dropping probability increased, the number of reasoning steps decreased across both evaluation schemes while maintaining high accuracy. This suggests that the model successfully learned to take intelligent shortcuts while preserving solution quality.
What’s particularly notable is that Dualformer achieved these results without requiring computationally expensive bootstrapping processes. While Searchformer needed multiple training stages with millions of rollouts (3.2 million responses per bootstrapping step plus additional fine-tuning iterations), Dualformer accomplished everything in a single training stage of 800,000 iterations. To demonstrate the versatility of this approach, the researchers also applied the trace-dropping technique to mathematical reasoning, successfully fine-tuning Llama-3-8B and Mistral-7B models to handle math problems with improved efficiency.
Conclusion
The success of Dualformer’s training raises intriguing questions about how applying selective pressures to model training, much like natural learning, can provide the most useful feedback for learning and reasoning. Traditional approaches often focus on making models either faster or more accurate by feeding them the most accurate data for a single, specific objective. Dualformer suggests that by better understanding how humans learn to reason efficiently, we can create AI systems that don’t just replicate human-level performance, but actually learn to think in human-like ways by navigating ambiguity.
While Dualformer has proven effective for pathfinding and mathematical reasoning, exploring how this approach could be applied to more complex, open-ended problems could yield valuable insights. This might include tasks like strategic planning, creative problem-solving, or scientific discovery. Additionally, structured trace dropping appears broadly applicable to countless reasoning problems, as the math application shows. By making these approaches more interpretable and conducting a deeper analysis of how these models develop shortcuts and heuristics, the technique can be further applied in practical settings, particularly in high-stakes domains where both speed and accuracy are crucial.
For instance, reasoning workflows for patient diagnosis pose a unique challenge: doctors can quickly rule out certain diseases with only a handful of symptoms and measurements. Efficient shortcutting matters for cost savings, so that hospitals and insurers do not waste time and resources on unnecessary medical procedures and tests. On the flip side, improved reasoning-trace shortcuts could help determine earlier in the treatment process whether a patient needs a screening, before conditions progress to expensive surgery and medication. From law to finance to coding, these trace drop-out techniques could effectively teach a model to think n steps ahead from the first token and predict likely future directions with high accuracy.
The development of Dualformer points toward a future where AI systems can more naturally mirror human cognitive processes, adapting their thinking strategies to match the demands of different situations. Dualformer’s trace-dropping strategy will likely not be the defining property of future agents, but it stands to be well integrated into training pipelines across various datasets. The challenge now lies in expanding these capabilities to handle increasingly complex real-world scenarios while maintaining the delicate balance between fast and slow thinking that makes the approach so promising.
Works Cited:
- “Dropout: A Simple Way to Prevent Neural Networks From …”, www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf. Accessed 13 Dec. 2024.
- Ferber, Aaron, et al. “SurCo: Learning Linear Surrogates for Combinatorial Nonlinear Optimization Problems.” arXiv, 19 July 2023, arxiv.org/abs/2210.12547.
- Hellstrom, Erich. “O1 Preview vs O1 Mini: Comparing OpenAI’s Advanced AI Models.” PromptLayer, 22 Nov. 2024, blog.promptlayer.com/an-analysis-of-openai-models-o1-preview-vs-o1-mini/.
- Lehnert, Lucas, et al. “Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping.” arXiv, 26 Apr. 2024, arxiv.org/abs/2402.14083.
- “ReAct: Synergizing Reasoning and Acting in Language Models.” Google Research, research.google/blog/react-synergizing-reasoning-and-acting-in-language-models/. Accessed 12 Dec. 2024.
- Su, DiJia, et al. “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces.” arXiv, 13 Oct. 2024, arxiv.org/abs/2410.09918.
- Synced. “Meta’s Dualformer: Bridging Fast and Slow Thinking in Transformers for Superior AI Reasoning.” Synced, 19 Nov. 2024, syncedreview.com/2024/11/19/self-evolving-prompts-redefining-ai-alignment-with-deepmind-chicago-us-eva-framework-5/.