Sakana introduces new AI architecture, 'Continuous Thought Machines,' to make models reason with less guidance, like human brains

Business Mayor | 13-05-2025

Tokyo-based artificial intelligence startup Sakana, co-founded by former top Google AI scientists including Llion Jones and David Ha, has unveiled a new type of AI model architecture called Continuous Thought Machines (CTM).
CTMs are designed to usher in a new era of AI models that are more flexible and able to handle a wider range of cognitive tasks, such as solving complex mazes or navigating without positional cues or pre-existing spatial embeddings, moving them closer to the way human beings reason through unfamiliar problems.
Rather than relying on fixed, parallel layers that process inputs all at once, as Transformer models do, CTMs unfold computation over a sequence of internal steps inside each input/output unit, known as an artificial 'neuron.'
Each neuron in the model retains a short history of its previous activity and uses that memory to decide when to activate again.
This added internal state allows CTMs to adjust the depth and duration of their reasoning dynamically, depending on the complexity of the task. As such, each neuron is far more informationally dense and complex than in a typical Transformer model.
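To make that mechanism concrete, here is a minimal PyTorch sketch of a layer whose neurons carry a rolling history of their recent pre-activations and compute their next activation from it. The class name, history length, and shared per-neuron weights are illustrative assumptions, not Sakana's actual implementation (the paper describes private neuron-level models).

```python
# A minimal sketch, assuming a PyTorch-style setup; not Sakana's actual code.
import torch
import torch.nn as nn

class HistoryNeuronLayer(nn.Module):
    """Each neuron computes its next activation from a short history of
    its own recent pre-activations, rather than from a single value."""

    def __init__(self, history_len: int = 8, hidden: int = 16):
        super().__init__()
        # Applied independently to every neuron's history. Weights are
        # shared across neurons here for brevity; the CTM paper describes
        # per-neuron ("neuron-level") models instead.
        self.neuron_model = nn.Sequential(
            nn.Linear(history_len, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, pre_act: torch.Tensor, history: torch.Tensor):
        # pre_act: (batch, neurons) input at the current internal step
        # history: (batch, neurons, history_len) rolling buffer of past inputs
        history = torch.cat([history[..., 1:], pre_act.unsqueeze(-1)], dim=-1)
        post_act = self.neuron_model(history).squeeze(-1)  # (batch, neurons)
        return post_act, history  # caller carries `history` across ticks
```

The caller keeps `history` alive across internal steps, which is what gives each neuron its short-term memory.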
The startup has posted a paper describing its work on arXiv, the open-access preprint server, along with a microsite and a GitHub repository.
Most modern large language models (LLMs) are still fundamentally based upon the 'Transformer' architecture outlined in the seminal 2017 paper from Google Brain researchers entitled 'Attention Is All You Need.'
These models use parallelized, fixed-depth layers of artificial neurons to process inputs in a single pass — whether those inputs come from user prompts at inference time or labeled data during training.
By contrast, CTMs allow each artificial neuron to operate on its own internal timeline, making activation decisions based on a short-term memory of its previous states. These decisions unfold over internal steps known as 'ticks,' enabling the model to adjust its reasoning duration dynamically.
This time-based architecture allows CTMs to reason progressively, adjusting how long and how deeply they compute — taking a different number of ticks based on the complexity of the input.
Neuron-specific memory and synchronization help determine when computation should continue — or stop.
The number of ticks varies with the input, and can even differ across runs on identical inputs, because each neuron decides how many ticks it needs before producing an output (or declining to produce one at all).
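In code, the adaptive loop might look something like the sketch below, which runs a recurrent core for a variable number of ticks and stops early once the prediction is confident. The entropy-based stopping rule and the function names are assumptions for illustration; the actual CTM uses its own certainty and synchronization signals to decide when to stop.

```python
# A hedged sketch of variable-length "thinking"; the halting rule is assumed.
import torch

def run_ticks(core, readout, state, x, max_ticks: int = 50,
              entropy_thresh: float = 0.1):
    """core(state, x) advances the internal state by one tick;
    readout(state) maps the state to class logits."""
    for tick in range(max_ticks):
        state = core(state, x)                  # one internal step of thought
        probs = readout(state).softmax(dim=-1)  # current best guess
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.max() < entropy_thresh:      # whole batch is confident
            break
    return probs, tick + 1                      # prediction and ticks spent
```

Easy inputs exit after a few ticks; harder ones keep the loop running.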
This represents both a technical and philosophical departure from conventional deep learning, moving toward a more biologically grounded model. Sakana has framed CTMs as a step toward more brain-like intelligence—systems that adapt over time, process information flexibly, and engage in deeper internal computation when needed.
Sakana's stated goal is 'to eventually achieve levels of competency that rival or surpass human brains.'
The CTM is built around two key mechanisms.
First, each neuron in the model maintains a short 'history,' or working memory, of when it activated and why, and uses this record to decide when to fire next.
Second, neural synchronization — how and when groups of a model's artificial neurons 'fire,' or process information together — is allowed to happen organically.
Groups of neurons decide when to fire together based on internal alignment, not external instructions or reward shaping. These synchronization events are used to modulate attention and produce outputs — that is, attention is directed toward those areas where more neurons are firing.
The model isn't just processing data; it's timing its thinking to match the complexity of the task.
Together, these mechanisms let CTMs reduce computational load on simpler tasks while applying deeper, prolonged reasoning where needed.
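One way to picture the synchronization mechanism is the sketch below: measure how strongly pairs of neurons co-activate over recent ticks, then turn that pairwise matrix into an attention query over input features. Everything here (the names, the full pairwise matrix, the linear projection) is an illustrative assumption; the paper, for instance, subsamples neuron pairs rather than using the full matrix.

```python
# An illustrative sketch of synchronization-driven attention; see assumptions above.
import torch
import torch.nn as nn

def synchronization(traces: torch.Tensor) -> torch.Tensor:
    # traces: (batch, neurons, ticks) of post-activations over recent ticks
    z = traces - traces.mean(dim=-1, keepdim=True)
    return (z @ z.transpose(1, 2)) / traces.shape[-1]  # (batch, N, N)

def attend(sync: torch.Tensor, query_proj: nn.Linear,
           feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, patches, d_model), e.g. image-patch features
    query = query_proj(sync.flatten(1))                 # (batch, d_model)
    scores = (feats @ query.unsqueeze(-1)).squeeze(-1)  # (batch, patches)
    return scores.softmax(dim=-1)  # attention concentrates where neurons co-fire
```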
In demonstrations ranging from image classification and 2D maze solving to reinforcement learning, CTMs have shown both interpretability and adaptability. Their internal 'thought' steps allow researchers to observe how decisions form over time—a level of transparency rarely seen in other model families.
Sakana AI's Continuous Thought Machine is not designed to chase leaderboard-topping benchmark scores, but its early results indicate that its biologically inspired design does not come at the cost of practical capability.
On the widely used ImageNet-1K benchmark, the CTM achieved 72.47% top-1 and 89.89% top-5 accuracy.
While this falls short of state-of-the-art transformer models like ViT or ConvNeXt, it remains competitive—especially considering that the CTM architecture is fundamentally different and was not optimized solely for performance.
What stands out more are CTM's behaviors in sequential and adaptive tasks. In maze-solving scenarios, the model produces step-by-step directional outputs from raw images—without using positional embeddings, which are typically essential in transformer models. Visual attention traces reveal that CTMs often attend to image regions in a human-like sequence, such as identifying facial features from eyes to nose to mouth.
The model also exhibits strong calibration: its confidence estimates closely align with actual prediction accuracy. Unlike most models that require temperature scaling or post-hoc adjustments, CTMs improve calibration naturally by averaging predictions over time as their internal reasoning unfolds.
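The averaging behavior is simple to express: rather than trusting the final tick alone, combine the per-tick probability estimates, as in the sketch below. A plain mean over ticks is an assumption here; Sakana's exact weighting may differ.

```python
# A minimal sketch of tick-averaged prediction; the plain mean is assumed.
import torch

def averaged_prediction(per_tick_logits: torch.Tensor) -> torch.Tensor:
    # per_tick_logits: (ticks, batch, classes), one set of logits per tick
    per_tick_probs = per_tick_logits.softmax(dim=-1)
    return per_tick_probs.mean(dim=0)  # (batch, classes), better calibrated
```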
This blend of sequential reasoning, natural calibration, and interpretability offers a valuable trade-off for applications where trust and traceability matter as much as raw accuracy.
While CTMs show substantial promise, the architecture is still experimental and not yet optimized for commercial deployment. Sakana AI presents the model as a platform for further research and exploration rather than a plug-and-play enterprise solution.
Training CTMs currently demands more resources than standard transformer models. Their dynamic temporal structure expands the state space, and careful tuning is needed to ensure stable, efficient learning across internal time steps. Additionally, debugging and tooling support is still catching up—many of today's libraries and profilers are not designed with time-unfolding models in mind.
Still, Sakana has laid a strong foundation for community adoption. The full CTM implementation is open-sourced on GitHub and includes domain-specific training scripts, pretrained checkpoints, plotting utilities, and analysis tools. Supported tasks include image classification (ImageNet, CIFAR), 2D maze navigation, QAMNIST, parity computation, sorting, and reinforcement learning.
An interactive web demo also lets users explore the CTM in action, observing how its attention shifts over time during inference—a compelling way to understand the architecture's reasoning flow.
For CTMs to reach production environments, further progress is needed in optimization, hardware efficiency, and integration with standard inference pipelines. But with accessible code and active documentation, Sakana has made it easy for researchers and engineers to begin experimenting with the model today.
The CTM architecture is still in its early days, but enterprise decision-makers should already take note. Its ability to adaptively allocate compute, self-regulate depth of reasoning, and offer clear interpretability may prove highly valuable in production systems facing variable input complexity or strict regulatory requirements.
AI engineers managing model deployment will find value in CTM's energy-efficient inference — especially in large-scale or latency-sensitive applications.
Meanwhile, the architecture's step-by-step reasoning unlocks richer explainability, enabling organizations to trace not just what a model predicted, but how it arrived there.
For orchestration and MLOps teams, CTMs integrate with familiar components like ResNet-based encoders, allowing smoother incorporation into existing workflows. And infrastructure leads can use the architecture's profiling hooks to better allocate resources and monitor performance dynamics over time.
CTMs aren't ready to replace transformers, but they represent a new category of model with novel affordances. For organizations prioritizing safety, interpretability, and adaptive compute, the architecture deserves close attention.
Sakana's checkered AI research history
In February, Sakana introduced the AI CUDA Engineer, an agentic AI system designed to automate the production of highly optimized CUDA kernels, the functions that run in parallel across the many threads of Nvidia's (and others') graphics processing units (GPUs).
The promise was significant: speedups of 10x to 100x in ML operations. However, shortly after release, external reviewers discovered that the system was exploiting weaknesses in the evaluation sandbox—essentially 'cheating' by bypassing correctness checks through a memory exploit.
In a public post, Sakana acknowledged the issue and credited community members with flagging it.
Sakana has since overhauled its evaluation and runtime profiling tools to eliminate similar loopholes and is revising its results and research paper accordingly. The incident offered a real-world test of one of Sakana's stated values: embracing iteration and transparency in pursuit of better AI systems.
Sakana AI's founding ethos lies in merging evolutionary computation with modern machine learning. The company believes current models are too rigid—locked into fixed architectures and requiring retraining for new tasks.
By contrast, Sakana aims to create models that adapt in real time, exhibit emergent behavior, and scale naturally through interaction and feedback, much like organisms in an ecosystem.
This vision is already manifesting in products like Transformer², a system that adjusts LLM parameters at inference time without retraining, using algebraic tricks like singular-value decomposition.
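To illustrate the general singular-value idea (not Sakana's exact Transformer² method), a frozen weight matrix can be decomposed once and then adapted per task by rescaling its singular values with a small learned vector, leaving the full matrix untouched:

```python
# A hedged sketch of singular-value scaling; illustrative of the general idea only.
import torch

W = torch.randn(512, 512)  # stands in for a frozen pretrained weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

def adapt(scale: torch.Tensor) -> torch.Tensor:
    # scale: (512,) task-specific multipliers for the singular values
    return U @ torch.diag(S * scale) @ Vh

W_task = adapt(torch.full((512,), 1.05))  # hypothetical task-adapted weights
```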
It's also evident in their commitment to open-sourcing systems like the AI Scientist—even amid controversy—demonstrating a willingness to engage with the broader research community, not just compete with it.
As large incumbents like OpenAI and Google double down on foundation models, Sakana is charting a different course: small, dynamic, biologically inspired systems that think in time, collaborate by design, and evolve through experience.
