Red Hat leads launch of llm-d to scale generative AI in clouds

Techday NZ, 21-05-2025

Red Hat has introduced llm-d, an open source project aimed at enabling large-scale distributed generative AI inference across hybrid cloud environments.
The llm-d initiative is the result of collaboration between Red Hat and a group of founding contributors comprising CoreWeave, Google Cloud, IBM Research and NVIDIA, with additional support from AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, and academic partners from the University of California, Berkeley, and the University of Chicago.
The new project utilises vLLM-based distributed inference, a native Kubernetes architecture, and AI-aware network routing to facilitate robust and scalable AI inference clouds that can meet demanding production service-level objectives. Red Hat asserts that this will support any AI model, on any hardware accelerator, in any cloud environment.
Brian Stevens, Senior Vice President and AI CTO at Red Hat, stated, "The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential."
Addressing the scaling needs of generative AI, Red Hat points to a Gartner forecast that, by 2028, more than 80% of data centre workload accelerators will be deployed principally for inference rather than model training. This projected shift highlights the necessity for efficient and scalable inference solutions as AI models become larger and more complex.
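The llm-d project's architecture is designed to overcome the practical limitations of centralised AI inference, such as prohibitive costs and latency. Its main features include vLLM for rapid model support; Prefill and Decode Disaggregation, which separates the prompt-processing and token-generation phases of inference so that they can be distributed across multiple servers; KV Cache Offloading, based on LMCache, which shifts the memory burden of the key-value (KV) cache from GPU memory onto more cost-efficient standard storage such as CPU memory or network storage; and AI-Aware Network Routing, which schedules incoming requests to the servers and accelerators most likely to hold relevant cached data. Further, the project supports Google Cloud's Tensor Processing Units and the NVIDIA Inference Xfer Library (NIXL) for high-performance data transfer.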
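As a rough illustration of the idea behind cache-aware request routing, the sketch below sends each prompt to the replica most likely to hold a reusable cached prefix, falling back to the least-loaded replica otherwise. It is a toy model under stated assumptions, not llm-d's scheduler: the `Replica` class, the `route` function and the character-based `prefix_chars` parameter are all hypothetical names invented for this example, and real systems track the cache per token block rather than per character.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical model-server replica tracked by the toy router."""
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def cached_prefix_len(replica: Replica, prompt: str) -> int:
    """Length of the longest cached prefix that this prompt starts with."""
    return max(
        (len(p) for p in replica.cached_prefixes if prompt.startswith(p)),
        default=0,
    )


def pick_replica(replicas: list, prompt: str) -> Replica:
    """Prefer the longest reusable prefix; break ties by lowest load."""
    return max(
        replicas,
        key=lambda r: (cached_prefix_len(r, prompt), -r.active_requests),
    )


def route(replicas: list, prompt: str, prefix_chars: int = 32) -> str:
    """Route one prompt, then record the load and the newly cached prefix."""
    chosen = pick_replica(replicas, prompt)
    chosen.active_requests += 1
    # Assumption: serving a prompt leaves its leading characters "cached".
    chosen.cached_prefixes.add(prompt[:prefix_chars])
    return chosen.name
```

With two replicas, two prompts that share a long system prompt end up on the same replica even though it becomes the busier one, while an unrelated prompt goes to the less-loaded replica. The trade-off between cache reuse and load balance is a design choice; this sketch always favours cache reuse.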
The community formed around llm-d comprises both technology vendors and academic institutions. Each wants to address efficiency, cost, and performance at scale for AI-powered applications. Several of these partners provided statements regarding their involvement and the intended impact of the project.
Ramine Roane, Corporate Vice President, AI Product Management at AMD, said, "AMD is proud to be a founding member of the llm-d community, contributing our expertise in high-performance GPUs to advance AI inference for evolving enterprise AI needs. As organisations navigate the increasing complexity of generative AI to achieve greater scale and efficiency, AMD looks forward to meeting this industry demand through the llm-d project."
Shannon McFarland, Vice President, Cisco Open Source Program Office & Head of Cisco DevNet, remarked, "The llm-d project is an exciting step forward for practical generative AI. llm-d empowers developers to programmatically integrate and scale generative AI inference, unlocking new levels of innovation and efficiency in the modern AI landscape. Cisco is proud to be part of the llm-d community, where we're working together to explore real-world use cases that help organisations apply AI more effectively and efficiently."
Chen Goldberg, Senior Vice President, Engineering, CoreWeave, commented, "CoreWeave is proud to be a founding contributor to the llm-d project and to deepen our long-standing commitment to open source AI. From our early partnership with EleutherAI to our ongoing work advancing inference at scale, we've consistently invested in making powerful AI infrastructure more accessible. We're excited to collaborate with an incredible group of partners and the broader developer community to build a flexible, high-performance inference engine that accelerates innovation and lays the groundwork for open, interoperable AI."
Mark Lohmeyer, Vice President and General Manager, AI & Computing Infrastructure, Google Cloud, stated, "Efficient AI inference is paramount as organisations move to deploying AI at scale and deliver value for their users. As we enter this new age of inference, Google Cloud is proud to build upon our legacy of open source contributions as a founding contributor to the llm-d project. This new community will serve as a critical catalyst for distributed AI inference at scale, helping users realise enhanced workload efficiency with increased optionality for their infrastructure resources."
Jeff Boudier, Head of Product, Hugging Face, said, "We believe every company should be able to build and run their own models. With vLLM leveraging the Hugging Face transformers library as the source of truth for model definitions; a wide diversity of models large and small is available to power text, audio, image and video AI applications. Eight million AI Builders use Hugging Face to collaborate on over two million AI models and datasets openly shared with the global community. We are excited to support the llm-d project to enable developers to take these applications to scale."
Priya Nagpurkar, Vice President, Hybrid Cloud and AI Platform, IBM Research, commented, "At IBM, we believe the next phase of AI is about efficiency and scale. We're focused on unlocking value for enterprises through AI solutions they can deploy effectively. As a founding contributor to llm-d, IBM is proud to be a key part of building a differentiated hardware agnostic distributed AI inference platform. We're looking forward to continued contributions towards the growth and success of this community to transform the future of AI inference."
Bill Pearson, Vice President, Data Center & AI Software Solutions and Ecosystem, Intel, said, "The launch of llm-d will serve as a key inflection point for the industry in driving AI transformation at scale, and Intel is excited to participate as a founding supporter. Intel's involvement with llm-d is the latest milestone in our decades-long collaboration with Red Hat to empower enterprises with open source solutions that they can deploy anywhere, on their platform of choice. We look forward to further extending and building AI innovation through the llm-d community."
Eve Callicoat, Senior Staff Engineer, ML Platform, Lambda, commented, "Inference is where the real-world value of AI is delivered, and llm-d represents a major leap forward. Lambda is proud to support a project that makes state-of-the-art inference accessible, efficient, and open."
Ujval Kapasi, Vice President, Engineering AI Frameworks, NVIDIA, stated, "The llm-d project is an important addition to the open source AI ecosystem and reflects NVIDIA's support for collaboration to drive innovation in generative AI. Scalable, highly performant inference is key to the next wave of generative and agentic AI. We're working with Red Hat and other supporting partners to foster llm-d community engagement and industry adoption, helping accelerate llm-d with innovations from NVIDIA Dynamo such as NIXL."
Ion Stoica, Professor and Director of Sky Computing Lab, University of California, Berkeley, remarked, "We are pleased to see Red Hat build upon the established success of vLLM, which originated in our lab to help address the speed and memory challenges that come with running large AI models. Open source projects like vLLM, and now llm-d anchored in vLLM, are at the frontier of AI innovation tackling the most demanding AI inference requirements and moving the needle for the industry at large."
Junchen Jiang, Professor at the LMCache Lab, University of Chicago, added, "Distributed KV cache optimisations, such as offloading, compression, and blending, have been a key focus of our lab, and we are excited to see llm-d leveraging LMCache as a core component to reduce time to first token as well as improve throughput, particularly in long-context inference."

Google search changes turning web into 'wild west'

RNZ News

15 hours ago

Google is transforming online search, and businesses wanting to get their websites in front of customers must change with it, according to a leading digital marketer here. The vast majority of searches online are done on Google, and the tech company began incorporating AI into its searches a little over a year ago. Last month its CEO announced a further step where the typical experience of getting links to websites would be gone entirely, replaced with an AI-generated article answering the search question. Auckland digital marketer Richard Conway says he has had to overhaul his business, moving from a focus on search engine optimisation to 'generative engine optimisation'. He says the ongoing changes to Google search are turning the web into something of a 'wild west' for those who operate businesses online.

Otago Daily Times

4 days ago

'Nanogirl' informs South on AI's use

Even though "Nanogirl", Dr Michelle Dickinson, has worked with world-leading tech giants, she prefers to inspire the next generation. About 60 Great South guests were glued to their Kelvin Hotel seats on Thursday evening as the United Kingdom-born New Zealand nanotechnologist shared her knowledge of AI and its future impact. Businesses needed to stay informed about technology so they could future-proof, she said. The days were gone when the traditional five-year business plan would be enough to future-proof a business, given the breakneck speed at which technology was advancing. Owners also needed to understand the importance of maintaining a customer-centric business or risk quickly becoming irrelevant. "I care about that we have empty stores." The number of legacy institutions closing was evidence of their models not moving with the customer. "Not being customer-centric is the biggest threat to business." Schools were another sector which needed to adapt to the changing world, as they were predominantly geared to produce an "average" student. "Nobody wants their kids to be average." Were AI technology to be implemented, it could be used to develop personalised learning models while removing stress-inducing and labour-intensive tasks from teachers' workloads. "Now you can be the best teacher you can be and stay in the field you love. "I don't want our teachers to be burnt out, I want them to be excited to be teaching." In 30 seconds, new technology could now produce individualised 12-week teaching plans aligned to the curriculum, in both Māori and English, she said. Agriculture was another sector to benefit from the developing technology. Better crop yields and cost savings could now be achieved through localised soil and crop tracking information which pinpointed the fertiliser or moisture levels required in specific sections of a paddock.
While AI was a problem-solving tool which provided outcomes based on the information available to it, it still needed the creative ideas to come from humans to work well, she said. "People are the fundamentals of the future . . . and human side of why we do things should be at the forefront. "We, as humans, make some pretty cool decisions that aren't always based on logic." Personal and commercial security had also become imperative now that the ability to produce realistic "deep-fake" video and audio was about to hit us. She urged families and organisations to have "safe words" that would not be present in deep-fake recordings and would allow family members or staff to tell fake from genuine cries for help. "This is the stuff we need to be talking about with our kids right now." Great South chief executive Chami Abeysinghe said Dr Dickinson's presentation raised some "thought-provoking" questions for Southland's business leaders. She believed there needed to be discussions about how Southland could position itself to be at the forefront of tech-driven innovation. "I think some of the points that she really raised was a good indication that we probably need to get a bit quicker at adopting and adapting. "By the time we get around to thinking about it, it has already changed again." AI was able to process information and data in a fraction of the time humans did, but the technology did not come without risks and it was critical that businesses protected their operations. "If we are going to use it, we need to be able to know that it's secure." Information entered into ChatGPT passed into the public realm, where everyone could have access to it, and business policies had not kept up. "You absolutely have to have a [AI security] policy."

Mirantis unveils architecture to speed & secure AI deployment

Techday NZ

4 days ago

Mirantis has released a comprehensive reference architecture to support IT infrastructure for AI workloads, aiming to assist enterprises in deploying AI systems quickly and securely. The Mirantis AI Factory Reference Architecture is based on the company's k0rdent AI platform and designed to offer a composable, scalable, and secure environment for artificial intelligence and machine learning (ML) workloads. According to Mirantis, the solution provides criteria for building, operating, and optimising AI and ML infrastructure at scale, and can be operational within days of hardware installation. The architecture leverages templated and declarative approaches provided by k0rdent AI, which Mirantis claims enables rapid provisioning of required resources. This, the company states, leads to accelerated prototyping, model iteration, and deployment—thereby shortening the overall AI development cycle. The platform features curated integrations, accessible via the k0rdent Catalog, for various AI and ML tools, observability frameworks, continuous integration and delivery, and security, all while adhering to open standards. Mirantis is positioning the reference architecture as a response to rising demand for specialised compute resources, such as GPUs and CPUs, crucial for the execution of complex AI models. "We've built and shared the reference architecture to help enterprises and service providers efficiently deploy and manage large-scale multi-tenant sovereign infrastructure solutions for AI and ML workloads," said Shaun O'Meara, chief technology officer, Mirantis. "This is in response to the significant increase in the need for specialized resources (GPU and CPU) to run AI models while providing a good user experience for developers and data scientists who don't want to learn infrastructure." 
The architecture addresses several high-performance computing challenges, including Remote Direct Memory Access (RDMA) networking, GPU allocation and slicing, advanced scheduling, performance tuning, and Kubernetes scaling. Additionally, it supports integration with multiple AI platform services, such as Gcore Everywhere Inference and the NVIDIA AI Enterprise software ecosystem. In contrast to typical cloud-native workloads, which are optimised for scale-out and multi-core environments, AI tasks often require the aggregation of multiple GPU servers into a single high-performance computing instance. This shift demands RDMA and ultra-high-performance networking, areas which the Mirantis reference architecture is designed to accommodate. The reference architecture uses Kubernetes and is adaptable to various AI workload types, including training, fine-tuning, and inference, across a range of environments. These include dedicated or shared servers, virtualised settings using KubeVirt or OpenStack, public cloud, hybrid or multi-cloud configurations, and edge locations. The solution addresses the specific needs of AI workloads, such as high-performance storage and high-speed networking technologies, including Ethernet, Infiniband, NVLink, NVSwitch, and CXL, to manage the movement of large data sets inherent to AI applications. 
Mirantis has identified and aimed to resolve several challenges in AI infrastructure, such as: time-intensive fine-tuning and configuration compared to traditional compute systems; support for hard multi-tenancy to ensure security, isolation, resource allocation, and contention management; maintaining data sovereignty for data-driven AI and ML workloads, particularly where models contain proprietary information; ensuring compliance with varied regional and regulatory standards; managing distributed, large-scale infrastructure, which is common in edge deployments; effective resource sharing, particularly of high-demand compute components such as GPUs; and enabling accessibility for users such as data scientists and developers who may not have specific IT infrastructure expertise.

The composable nature of the Mirantis AI Factory Reference Architecture allows users to assemble infrastructure using reusable templates across compute, storage, GPU, and networking components, which can then be tailored to specific AI use cases. The architecture includes support for a variety of hardware accelerators, including products from NVIDIA, AMD, and Intel. Mirantis reports that its AI Factory Reference Architecture has been developed with the goal of supporting the unique operational requirements of enterprises seeking scalable, sovereign AI infrastructures, especially where control over data and regulatory compliance are paramount. The framework is intended as a guideline to streamline the deployment and ongoing management of these environments, offering modularity and integration with open standard tools and platforms.
