Studying and Evaluating Reasoning Differences
Artificial Intelligence Reasoning Papers
How to Study Reasoning Differences
When comparing models' reasoning, researchers typically consider:
Prompting style (direct vs. chain-of-thought (CoT) vs. tree-of-thought)
Model architecture (encoder-only, decoder-only, encoder-decoder)
Training regime (fine-tuning, pre-training, symbolic augmentation)
Task type (arithmetic, logical, commonsense, relational, abductive reasoning)
Evaluation methods (answers vs intermediate reasoning correctness, error tracing)
Note that these dimensions cover text-based models; image generation, video motion reasoning, and other vision techniques are outside the scope of this document. A minimal sketch of an evaluation harness along these lines follows below.
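To make the first and last dimensions concrete, here is a minimal sketch of a harness that compares direct and CoT prompting by final-answer accuracy. The `query_model` function is a hypothetical placeholder for any LLM API call, and the answer-extraction convention is an assumption, not a fixed standard.

```python
# Minimal harness comparing prompting styles by final-answer accuracy.
# NOTE: query_model is a hypothetical stand-in for a real LLM API call.

DIRECT_TEMPLATE = "Q: {question}\nA: The answer is"
COT_TEMPLATE = "Q: {question}\nA: Let's think step by step."

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def extract_final_answer(completion: str) -> str:
    # Assumed convention: the last token of the completion is the answer.
    return completion.strip().split()[-1].rstrip(".")

def accuracy(dataset: list[tuple[str, str]], template: str) -> float:
    hits = 0
    for question, gold in dataset:
        completion = query_model(template.format(question=question))
        hits += extract_final_answer(completion) == gold
    return hits / len(dataset)

# Usage: compare accuracy(dataset, DIRECT_TEMPLATE) against
# accuracy(dataset, COT_TEMPLATE) on the same (question, answer) pairs.
```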
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning (Helwe et al., 2021) – Surveys the strengths and limits of reasoning in transformer models.
https://openreview.net/forum?id=Ozp1WrgtF5_
IBM – What is Chain of Thought (CoT) Prompting?
https://www.ibm.com/think/topics/chain-of-thoughts
Key Papers & Ideas
This document lists key research papers and articles that examine how different AI models exhibit different forms of 'reasoning'. Each entry summarizes the paper's contribution and the type of reasoning it studies.
Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
How providing examples with reasoning steps (chain of thought) boosts model reasoning abilities.
Key Contribution: Chain-of-thought prompting becomes effective only at sufficient model scale; it improves arithmetic, commonsense, and symbolic reasoning.
https://arxiv.org/abs/2201.11903
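To make the technique concrete, below is a minimal sketch of the few-shot prompt format, with the exemplar adapted from the paper's running example; how the prompt is sent to a model is left open.

```python
# A few-shot chain-of-thought prompt in the style of Wei et al. (2022):
# the exemplar demonstrates intermediate steps before the final answer,
# and the model is expected to imitate that format on the new question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. \
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 \
tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and \
bought 6 more, how many apples do they have?
A:"""
```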
Relational Reasoning and Inductive Bias in Transformers (Geerts et al., 2025)
How transformers handle relational reasoning (transitive inference, relational structures).
Key Contribution: Shows how inductive biases, training data, and size affect relational inference abilities.
https://www.arxiv.org/abs/2506.04289
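As an illustration of the task family, here is a sketch of how a transitive-inference probe can be constructed: training pairs show only adjacent items in a hidden order, and test pairs ask about unseen, non-adjacent items that can only be answered by composing relations. The item names and prompt wording are illustrative, not the paper's exact setup.

```python
from itertools import combinations

# Transitive-inference probe: adjacent pairs are given, non-adjacent
# pairs must be inferred (e.g., from A > B and B > C, infer A > C).
ORDER = ["A", "B", "C", "D", "E"]

train_pairs = list(zip(ORDER, ORDER[1:]))              # adjacent, given
test_pairs = [(a, b) for a, b in combinations(ORDER, 2)
              if (a, b) not in train_pairs]            # must be inferred

def label(pair: tuple[str, str]) -> str:
    # The earlier item in ORDER is "greater" by construction.
    return f"{pair[0]} > {pair[1]}"

prompts = [f"Given: {', '.join(label(p) for p in train_pairs)}. "
           f"Is {a} > {b}?" for a, b in test_pairs]
```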
Is Chain‑of‑Thought Reasoning of LLMs a Mirage? (Zhao et al., 2025)
Whether CoT reasoning paths truly reflect internal reasoning or just mimic data patterns.
Key Contribution: Finds CoT performance is fragile under distribution shift, suggesting the generated chains often mimic training patterns rather than constitute 'true reasoning'.
https://arxiv.org/abs/2508.01191
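A sketch of a distribution-shift probe in this spirit: evaluate CoT on arithmetic whose operand length matches the few-shot exemplars (in-distribution) versus longer operands (out-of-distribution). The split construction below is an assumption about how such a probe could be built, not the paper's exact protocol; a real study would feed both splits to a model and compare accuracy.

```python
import random

# Build matched in-distribution and shifted arithmetic splits: same
# task, different operand length, to test whether CoT generalizes.
def make_split(n_digits: int, n: int = 50, seed: int = 0):
    rng = random.Random(seed + n_digits)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return [(f"{a} + {b} =", str(a + b))
            for a, b in ((rng.randint(lo, hi), rng.randint(lo, hi))
                         for _ in range(n))]

in_dist = make_split(n_digits=2)   # matches the few-shot exemplars
shifted = make_split(n_digits=5)   # longer than anything demonstrated
```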
Understanding Transformer Reasoning Capabilities via Graph Algorithms (Sanford & Fatemi, 2024)
Compares transformers with graph neural networks on structured graph algorithm tasks.
Key Contribution: Shows transformers can match or outperform GNNs on a range of graph algorithm tasks, giving a finer-grained picture of their structured-reasoning abilities.
https://research.google/blog/understanding-transformer-reasoning-capabilities-via-graph-algorithms/
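The contrast in the paper rests on input representation: a transformer sees a graph as a token sequence, while a GNN passes messages over the adjacency structure. The edge-list serialization below is one common convention, assumed here for illustration rather than taken from the paper; the union-find routine supplies ground-truth labels for grading model outputs on a connectivity task.

```python
# Pose a graph-connectivity question as a token sequence for a
# transformer, with a classical algorithm providing the gold label.
edges = [(0, 1), (1, 2), (3, 4)]

def serialize(edges, query):
    edge_str = " ".join(f"{u}-{v}" for u, v in edges)
    return f"edges: {edge_str} query: connected({query[0]},{query[1]})?"

def connected(edges, query):
    """Ground-truth label via union-find, for grading model outputs."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return find(query[0]) == find(query[1])

print(serialize(edges, (0, 2)), connected(edges, (0, 2)))  # ... True
```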
Assessing Logical Reasoning Capabilities of Encoder‑Only Transformer LMs
Tests logical reasoning in encoder-only transformers, without CoT prompts.
Key Contribution: Finds architectural and layer-level limitations in logical reasoning abilities.
https://arxiv.org/html/2312.11720v1
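Since encoder-only models produce no generated chain, such probing typically scores premise/hypothesis pairs with a classification head. A sketch, assuming the public roberta-large-mnli checkpoint and the Hugging Face transformers library; the toy pairs stand in for the paper's benchmarks.

```python
from transformers import pipeline

# Probe an encoder-only model for logical inference without CoT:
# score each premise/hypothesis pair with an NLI classification head.
nli = pipeline("text-classification", model="roberta-large-mnli")

pairs = [
    ("All birds can fly. Penguins are birds.", "Penguins can fly."),
    ("If it rains, the ground gets wet. It rained.", "The ground is wet."),
]
for premise, hypothesis in pairs:
    result = nli({"text": premise, "text_pair": hypothesis})[0]
    print(f"{result['label']:>13}  {result['score']:.2f}  {hypothesis}")
```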
Improving Chain‑of‑Thought Reasoning in LLMs (Zhang et al., 2024)
Techniques to make CoT reasoning more reliable.
Key Contribution: Introduces chain of preference optimization, which uses tree-of-thought search to generate preference data that refines CoT reasoning.
https://proceedings.neurips.cc/paper_files/paper/2024/file/00d80722b756de0166523a87805dd00f-Paper-Conference.pdf
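The tree-of-thought search this builds on is, in outline, a pruned breadth-first search over partial reasoning chains. A minimal sketch: `propose` and `score` are hypothetical stand-ins for model calls, and the paper's contribution additionally learns preferences over the explored branches, which this sketch does not show.

```python
# Tree-of-thought style search: expand candidate next "thoughts" per
# state, score them, keep the best `beam` states, repeat to `depth`.
def propose(state: str, n: int = 3) -> list[str]:
    """Hypothetical: sample n candidate next reasoning steps."""
    raise NotImplementedError

def score(state: str) -> float:
    """Hypothetical: model-estimated promise of a partial solution."""
    raise NotImplementedError

def tree_of_thought(question: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [question]
    for _ in range(depth):
        candidates = [s + "\n" + t for s in frontier for t in propose(s)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]  # highest-scoring reasoning chain
```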
Premise‑Augmented Reasoning Chains Improve Error Identification (Mukherjee et al., 2025)
Structures mathematical reasoning as chains whose steps are explicitly linked to the premises they depend on.
Key Contribution: Adds premise links to improve interpretability and error tracing in reasoning chains.
https://arxiv.org/abs/2502.02362
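A sketch of the underlying data structure: each step records which prior steps (or given facts) it depends on, so an error at one step can be traced to exactly the steps it contaminates. The structure follows the paper's idea; the class and field names are ours.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    text: str
    premises: list[int] = field(default_factory=list)  # indices of deps

chain = [
    Step("A train travels 60 km in 1 hour."),                     # 0: given
    Step("Its speed is 60 km/h.", premises=[0]),                  # 1
    Step("In 3 hours it covers 60 * 3 = 180 km.", premises=[1]),  # 2
]

def downstream_of(chain: list[Step], bad: int) -> set[int]:
    """All steps transitively depending on an erroneous step `bad`."""
    tainted = {bad}
    for i, step in enumerate(chain):  # premises only point backward
        if any(p in tainted for p in step.premises):
            tainted.add(i)
    return tainted

print(downstream_of(chain, bad=1))  # {1, 2}
```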
Theorem‑of‑Thought: Multi‑Agent Framework for Abductive, Deductive, and Inductive Reasoning (Abdaljalil et al., 2025)
Combines multiple reasoning styles (deductive, inductive, abductive) in a multi-agent framework.
Key Contribution: Uses reasoning graphs to enforce structure and consistency, outperforming simpler CoT.
https://arxiv.org/abs/2506.07106
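A sketch of the agent-aggregation shape: separate "agents" (here, hypothetical model calls with different instructions) each produce an answer, and a consistency check picks the output. The paper's actual framework builds full reasoning graphs and verifies their structure; the majority vote below is a crude stand-in for that check.

```python
from collections import Counter

STYLES = {
    "deductive": "Derive the answer strictly from the given rules.",
    "inductive": "Generalize a pattern from the examples, then answer.",
    "abductive": "Propose the best explanation consistent with the facts.",
}

def run_agent(style_instruction: str, question: str) -> str:
    """Hypothetical model call returning a final answer string."""
    raise NotImplementedError

def theorem_of_thought(question: str) -> str:
    answers = [run_agent(instr, question) for instr in STYLES.values()]
    # Majority vote as a stand-in for graph-consistency checking.
    return Counter(answers).most_common(1)[0][0]
```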