[Definitive Guide] What is the Hierarchical Reasoning Model (HRM)? — How the Brain-Inspired “Step-by-Step Reasoning” Works, How It Differs from Traditional LLMs, and How to Run It Locally (August 2025 Edition)
Key Takeaways (Inverted Pyramid Format):
- The Hierarchical Reasoning Model (HRM) is a recurrent architecture inspired by the brain’s hierarchical structure and multiple timescales. It features two nested modules: a slow-thinking planner (H) and a fast-processing worker (L). They interact internally, enabling deep computation (e.g., search and backtracking) without externalizing reasoning into language.
- HRM can be trained without full BPTT, using a 1-step gradient approximation for constant memory and training stability. With Adaptive Computation Time (ACT) and halt signals, it enables dynamic compute — long reasoning when needed, instant response when not.
- Benchmark results show that with just 27M parameters and ~1,000 examples, HRM can nearly solve Sudoku-Extreme, compute shortest paths in 30×30 mazes, and even outperform some LLMs on ARC-AGI (e.g., ~40% reported in the paper). However, ARC Prize re-evaluation shows reproducibility around 32%, prompting critical analysis of what’s actually driving performance. The model is highly discussed, but must be assessed with care.
- Local execution is well-supported via the official GitHub repository (Apache-2.0). With CUDA 12.6 + PyTorch + FlashAttention, you can train and run inference on Sudoku, Maze, and ARC tasks. On a laptop RTX 4070, training on Sudoku takes roughly 10 hours. Pretrained checkpoints are also available on Hugging Face.
- Important name distinction: A different model, Hierarchical Reward Model (for reward assignment), was also released in 2025. It is not the same as HRM = Reasoning Model—do not confuse the two.
1|Why HRM? — Motivation for Introducing “Hierarchy × Timescale”
Traditional LLMs stack Transformer layers to a fixed depth, which limits their ability to perform deep reasoning tasks such as multi-step search or backtracking. Chain-of-Thought (CoT) methods externalize reasoning as language, but they suffer from brittle task decomposition, long outputs, and latency/cost issues.
HRM flips this approach: rather than outputting each thought in language, it reasons internally in a latent space, enabling more compact and robust deep reasoning.
This design is inspired by how the human brain operates — different cortical layers process at different speeds, with slow, high-level integrative regions guiding faster, reactive circuits. HRM mirrors this by combining a high-level module (H) with a low-level module (L) in a nested loop, forming a hierarchical convergence process that deepens computation without increasing layer depth.
2|Core Architecture: H (High-Level) × L (Low-Level) × Halt Signal
2-1. Two Recurrent Modules
- H Module: Updates slowly. Acts as a planner, integrating abstract goals and hypotheses.
- L Module: Updates quickly. Acts as an executor, exploring, verifying, and refining details.
While L converges locally, H updates at a slower pace to provide new high-level guidance, forming a nested reasoning loop.
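To make the nesting concrete, here is a minimal illustrative sketch of the H×L update schedule. It is not the official implementation: the module names, cell types, and loop counts are assumptions chosen for readability. The key point is that L takes several fast steps per slow H update.
import torch
import torch.nn as nn

class NestedHL(nn.Module):
    # Illustrative sketch: a fast low-level cell nested inside a slow high-level cell.
    def __init__(self, dim):
        super().__init__()
        self.l_net = nn.GRUCell(dim, dim)  # fast worker: explores and refines details
        self.h_net = nn.GRUCell(dim, dim)  # slow planner: integrates results, re-plans
        self.head = nn.Linear(dim, dim)    # output head applied to the final H state

    def forward(self, x, z_h, z_l, n_cycles=4, t_steps=8):
        for _ in range(n_cycles):               # slow H cycles
            for _ in range(t_steps):            # fast L steps within one H cycle
                z_l = self.l_net(x + z_h, z_l)  # L converges locally under H's guidance
            z_h = self.h_net(z_l, z_h)          # H absorbs L's result and sets new context
        return self.head(z_h), z_h, z_l
Because depth comes from n_cycles × t_steps rather than from stacked layers, the same small network can compute far deeper than its parameter count suggests.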
2-2. Training Innovation: One-Step Approximate Gradient (No BPTT)
Deep recurrence typically requires BPTT, which can be unstable and memory-intensive. HRM sidesteps this by approximating gradients from only the final state of each stage, enabling constant memory use and stable training, even with deep internal computation.
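A rough sketch of that idea, reusing the NestedHL sketch above (the exact placement of the differentiated steps in the official code may differ): all but the final L and H updates run under torch.no_grad(), so activation memory stays constant no matter how deep the recurrence is.
def one_step_grad_forward(model, x, z_h, z_l, n_cycles=4, t_steps=8):
    # Run almost the whole nested recurrence without building a backward graph.
    with torch.no_grad():
        for _ in range(n_cycles - 1):
            for _ in range(t_steps):
                z_l = model.l_net(x + z_h, z_l)
            z_h = model.h_net(z_l, z_h)
        for _ in range(t_steps - 1):            # last cycle, all but its final L step
            z_l = model.l_net(x + z_h, z_l)
    # Gradients flow only through the final L and H updates (the 1-step approximation).
    z_l = model.l_net(x + z_h, z_l)
    z_h = model.h_net(z_l, z_h)
    return model.head(z_h), z_h.detach(), z_l.detach()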
2-3. Adaptive Computation Time (ACT) and Halting
HRM learns a halt signal, optimizing computation cycles per task. For harder tasks, it can reason longer; for easier ones, it halts early. The model supports compute scaling during inference, backed by research on Q-learning-based stabilization of ACT.
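The sketch below shows how a learned halt signal can gate compute at inference time, again building on the sketches above. The halt_head attribute, the threshold, and the segment loop are assumptions for illustration, not the repository's actual API.
def run_with_act(model, x, z_h, z_l, max_segments=8, halt_threshold=0.5):
    # model.halt_head is assumed here to be a small learned head on H's state.
    y = None
    segments_used = 0
    for _ in range(max_segments):
        y, z_h, z_l = model(x, z_h, z_l)              # one full H×L reasoning segment
        segments_used += 1
        p_halt = torch.sigmoid(model.halt_head(z_h))  # halting probability from H's state
        if p_halt.mean().item() > halt_threshold:     # confident enough: stop early
            break
    return y, segments_used                           # answer plus segments actually spent
Harder instances simply consume more segments before the halt signal fires, which is the "dynamic compute" behavior described above.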
Summary: The H×L×halt trio builds a circuit for internal deep reasoning, with BPTT-free approximation making training feasible — this is the core of HRM.
3|How Strong Is It? — ARC, Sudoku, Maze, and Critical Evaluations
With ~27M parameters and ~1,000 training examples, HRM has nearly solved Sudoku-Extreme and 30×30 mazes, and achieved ~40% on ARC-AGI-1, reportedly outperforming larger LLMs like o3-mini-high or Claude 3.7 8K on some tasks.
However, the ARC Prize team re-evaluated the model using semi-secret data, reporting 32% on ARC-AGI-1 and only 2% on ARC-AGI-2. Ablation studies suggest that hierarchy itself may not be the main contributor, but rather external loops, ACT, and task augmentation. Use of puzzle ID embeddings also limits generalization.
Editorial Note: HRM is a paradigm shift, but not a silver bullet. On abstract, high-difficulty tasks like ARC-AGI-2, strengths and weaknesses are clearly observed. Always pair paper claims with third-party validation.
4|Compared to LLMs: From CoT to “Latent Reasoning”
- CoT (Chain-of-Thought) externalizes reasoning in text, which is human-readable, but introduces language errors and latency.
- HRM, on the other hand, performs multi-step reasoning internally, making it suitable for algorithmic tasks like search/backtracking, with shorter final outputs.
- Training is stabilized via 1-step approximation, and ACT allows for adaptive compute at inference time.
Still, for tasks requiring broad common-sense knowledge and language fluency, large LLMs remain dominant. A hybrid division of labor by task type is a practical approach.
5|Don’t Confuse with “Hierarchical Reward Model”
In 2025, another concept called Hierarchical Reward Model (for reward signal shaping) was also introduced. Its focus is on evaluating reasoning, not executing it. Despite sharing the HRM acronym, it’s a different research direction—make sure to distinguish them by context.
6|Running Locally: Setup, Training, Inference
The official GitHub repository (Apache-2.0) provides scripts and dataset builders. Here is a quick guide to running HRM locally (assuming a Linux + CUDA environment):
6-1. Requirements (Minimal Working Setup)
- OS/Driver: Linux, NVIDIA drivers
- CUDA: Version 12.6 (e.g., 12.6.3)
- PyTorch: Built for CUDA 12.6
- Extras: FlashAttention (v3 for Hopper, v2 for Ampere or older), packaging, ninja, wheel, and optionally Weights & Biases for experiment tracking
6-2. Install Dependencies (Example)
# CUDA 12.6 Toolkit
wget -O cuda_installer.run \
https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
sudo sh cuda_installer.run --silent --toolkit --override
export CUDA_HOME=/usr/local/cuda-12.6
# PyTorch (CUDA 12.6 version)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# FlashAttention (for Ampere)
pip3 install flash-attn
# Clone repo
git clone https://github.com/sapientinc/HRM.git
cd HRM
pip3 install -r requirements.txt
6-3. Run “Sudoku-Extreme” Example (Training + Inference)
# Build dataset
python dataset/build_sudoku_dataset.py \
--output-dir data/sudoku-extreme-1k-aug-1000 \
--subsample-size 1000 --num-aug 1000
# Train model (on single GPU)
OMP_NUM_THREADS=8 python pretrain.py \
data_path=data/sudoku-extreme-1k-aug-1000 \
epochs=20000 eval_interval=2000 \
global_batch_size=384 \
lr=7e-5 puzzle_emb_lr=7e-5 \
weight_decay=1.0 puzzle_emb_weight_decay=1.0
6-4. Evaluate Using Pretrained Checkpoint
# Evaluate (example with 8 GPUs)
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>
6-5. Use Pretrained Checkpoints (Recommended for Quick Testing)
The README includes Hugging Face links to checkpoints for ARC-AGI-2, Sudoku, and Maze, enabling evaluation without training. Start with reproduction, then proceed to training.
Editorial Tip: For your PoC, try this order: Sudoku (1K samples) → Maze → ARC. Even a single Ampere GPU can complete Sudoku. Get a win early.
7|Deployment Scenarios: HRM vs LLM vs Hybrid
- Well-Structured Tasks (e.g., Sudoku, mazes, formal puzzles) → Best for HRM, especially where search/backtracking is required.
- Knowledge + Language Tasks (e.g., FAQs, summarization, RAG) → Best for LLMs, due to their language fluency and factual recall.
- Hybrid Setup: Use LLMs as planners and HRM for search, controlling reasoning depth via the halt signal, a key strength of HRM (a sketch follows this list).
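A minimal sketch of what that division of labor could look like as glue code. Every function here (llm_plan, encode_for_hrm, hrm_solve, llm_explain) is a hypothetical placeholder standing in for your LLM API and HRM wrapper, not an existing interface.
# Hypothetical hybrid pipeline: the LLM decomposes and routes, HRM does the deep search.
def llm_plan(task_description):          # stub: LLM classifies / decomposes the problem
    return {"kind": "grid_csp"}

def encode_for_hrm(instance, plan):      # stub: map to HRM's structured input format
    return instance

def hrm_solve(x, max_segments=8):        # stub: HRM search; ACT caps the reasoning depth
    return x, 3                          # (solution, segments used)

def llm_explain(solution, plan):         # stub: LLM verbalizes the result for the user
    return f"Solved as {plan['kind']}: {solution}"

def solve_task(task_description, instance):
    plan = llm_plan(task_description)
    x = encode_for_hrm(instance, plan)
    solution, segments = hrm_solve(x)
    return llm_explain(solution, plan), segments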
8|FAQs
Q1: Is HRM “better than” LLMs?
A: Depends on the task. For puzzle-like, structured reasoning, HRM can perform well with fewer parameters. But for broad natural language tasks, LLMs still dominate.
Q2: Why does HRM perform well with small models?
A: Because it can reason internally, without needing long-form CoT or massive pretraining. But beware: external loops and augmentation contribute a lot.
Q3: Can it run locally? What GPU do I need?
A: Yes. It runs with CUDA 12.6 + PyTorch + FlashAttention. A laptop RTX 4070 can train Sudoku in ~10 hours. Pretrained models are also available.
Q4: Is it the same as “Hierarchical Reward Model”?
A: No. That’s for evaluating reasoning steps. HRM (Reasoning Model) is about performing the reasoning.
9|Cautions from the Community: Fair Evaluation Matters
ARC Prize has shown that HRM’s hierarchy alone doesn’t explain its performance. External loops, ACT, and augmentation play key roles. Reproducibility scores (ARC-AGI-1 ≈32%, AGI-2 ≈2%) underscore the importance of third-party evaluations. Treat paper claims as reference points, not definitive answers.
10|Minimal Inference Example (Using Pretrained Sudoku Model)
import torch
from models.hrm import HRM  # adjust the import path to match the repo structure

ckpt = torch.load("checkpoints/sudoku_extreme_1k.pt", map_location="cuda")
model = HRM(**ckpt["config"]).cuda()
model.load_state_dict(ckpt["state_dict"])
model.eval()

# Input: a 9x9 grid encoded as a token tensor, preprocessed as in the official dataset builder.
x = build_sudoku_input()  # hypothetical helper; see dataset/build_sudoku_dataset.py for the real format

with torch.no_grad():
    y = model(x, max_halt_steps=8)  # cap on ACT halting steps (argument name may vary by version)
print(y)
Note: Follow the official codebase for tensor formatting. Use puzzle_visualizer.html for debugging or visualization.
11|Target Users and Application Areas
- Researchers / Algorithm Engineers: Study latent reasoning and hierarchical convergence. The 1-step gradient approximation offers a new training paradigm; reading the ablations is essential.
- Product Developers (Puzzles, Optimization, Verification): Solve structured tasks with small data, leveraging ACT for latency control, even on edge devices.
- IT / MLOps Teams: On-prem deployable (Apache-2.0). Track halt count and latency as operational KPIs and optimize for the cost of "deep thinking" (see the logging sketch after this list).
- Education / Public Sector: A model that "thinks without verbalizing" is a powerful teaching tool for STEM. Use intermediate-state visualizations for explainability.
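For the halt-count and latency KPIs mentioned in the IT / MLOps item, here is a minimal logging sketch; the field names and the wrapped call are illustrative.
import json
import time

def log_inference(model_fn, x, logger=print):
    # Wraps one inference call and records latency plus the number of halting
    # segments actually used, e.g. model_fn = lambda x: run_with_act(model, x, z_h, z_l).
    start = time.perf_counter()
    y, segments_used = model_fn(x)
    latency_ms = (time.perf_counter() - start) * 1000
    logger(json.dumps({"halt_segments": segments_used, "latency_ms": round(latency_ms, 1)}))
    return y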
12|30-Day Evaluation Plan
- Week 1: Minimal Reproduction
  - Train and evaluate on Sudoku-Extreme (1K examples). Log execution time, halt counts, and accuracy.
- Week 2: Maze + ARC Trials
  - Add 30×30 mazes. Evaluate ARC using pretrained checkpoints. A/B test with and without ACT.
- Week 3: Hybrid Experimentation
  - Try an LLM planner + HRM search setup. Tune halt thresholds for cost-performance balance.
- Week 4: Operational Prep
  - Document logging format, explanation templates, and reproducibility steps. Run one external benchmark.
13|Editorial Summary: HRM Is a New Direction — Evaluate Soberly, Implement Carefully
- What's New? A model that reasons step-by-step without externalizing language, excelling at deep computation tasks like search/backtracking, plus BPTT-free training via a 1-step approximation.
- Where Does It Shine? In structured, formal reasoning tasks. For broad natural language use, large LLMs still reign; task matching is key.
- What Should You Do Now? Start with GitHub reproduction, use pretrained checkpoints, and run controlled experiments. Evaluate the contributions of outer loops, ACT, and augmentation on your own data. Avoid overgeneralization.
Key Sources (Primary & Trusted)
- Paper: Hierarchical Reasoning Model (Sapient Intelligence, 2025) — Two-layer recurrence (H×L), 1-step gradient approximation, ACT/halt, ARC/Sudoku/Maze performance
- Critical Review: ARC Prize Blog — Repro scores (ARC-AGI-1 ≈32%, AGI-2 ≈2%), contributions from loops/ACT/augmentation, generalization caveats
- Implementation: Official GitHub (Apache-2.0) — CUDA 12.6, PyTorch, FlashAttention, data builders, training/inference scripts, pretrained checkpoints
- Name Clarification: Hierarchical Reward Model — Separate research, not to be confused with HRM (Reasoning Model)