Mirantis unveils architecture to speed & secure AI deployment

Techday NZ · 2 days ago

Mirantis has released a comprehensive reference architecture for IT infrastructure supporting AI workloads, aiming to help enterprises deploy AI systems quickly and securely.
The Mirantis AI Factory Reference Architecture is based on the company's k0rdent AI platform and designed to offer a composable, scalable, and secure environment for artificial intelligence and machine learning (ML) workloads. According to Mirantis, the solution provides criteria for building, operating, and optimising AI and ML infrastructure at scale, and can be operational within days of hardware installation.
The architecture leverages the templated and declarative approaches provided by k0rdent AI, which Mirantis claims enable rapid provisioning of required resources. This, the company states, accelerates prototyping, model iteration, and deployment, shortening the overall AI development cycle. The platform features curated integrations, accessible via the k0rdent Catalog, for various AI and ML tools, observability frameworks, continuous integration and delivery, and security, all while adhering to open standards.
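In practice, a declarative approach means submitting a desired-state object to the Kubernetes API and letting a controller reconcile real infrastructure toward it. The sketch below illustrates the general pattern with the Python Kubernetes client; the resource group, kind, and fields are hypothetical stand-ins, not k0rdent's published schema.

    # A controller would watch objects like this and reconcile actual
    # infrastructure toward the declared state. Group, kind, and fields
    # are hypothetical stand-ins, not k0rdent's published schema.
    from kubernetes import client, config

    config.load_kube_config()  # assumes a reachable management cluster

    cluster_request = {
        "apiVersion": "example.k0rdent.io/v1alpha1",  # hypothetical group/version
        "kind": "ClusterDeployment",                  # hypothetical kind
        "metadata": {"name": "ai-training-pool", "namespace": "default"},
        "spec": {
            "template": "gpu-cluster-template",       # reusable template reference
            "config": {"workerCount": 4, "gpusPerWorker": 8},
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="example.k0rdent.io", version="v1alpha1",
        namespace="default", plural="clusterdeployments",
        body=cluster_request,
    )

The appeal of the pattern is that the same template can be stamped out repeatedly for new teams or environments, which is what enables the rapid provisioning Mirantis describes.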
Mirantis is positioning the reference architecture as a response to rising demand for specialised compute resources, such as GPUs and CPUs, that are crucial for executing complex AI models. "We've built and shared the reference architecture to help enterprises and service providers efficiently deploy and manage large-scale multi-tenant sovereign infrastructure solutions for AI and ML workloads," said Shaun O'Meara, chief technology officer, Mirantis. "This is in response to the significant increase in the need for specialized resources (GPU and CPU) to run AI models while providing a good user experience for developers and data scientists who don't want to learn infrastructure."
The architecture addresses several high-performance computing challenges, including Remote Direct Memory Access (RDMA) networking, GPU allocation and slicing, advanced scheduling, performance tuning, and Kubernetes scaling. Additionally, it supports integration with multiple AI platform services, such as Gcore Everywhere Inference and the NVIDIA AI Enterprise software ecosystem.
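On Kubernetes, GPU allocation is typically expressed as an extended-resource request that the scheduler matches against capacity advertised by a device plugin, while GPU slicing (for example, NVIDIA MIG) surfaces as differently named resources. A minimal sketch using the Python Kubernetes client, with an illustrative container image and an assumed NVIDIA device plugin:

    # Sketch: a pod requesting two whole GPUs as Kubernetes extended
    # resources. Assumes the NVIDIA device plugin is installed; with MIG
    # slicing the resource name changes (e.g. nvidia.com/mig-1g.5gb).
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="train-job"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="trainer",
                    image="nvcr.io/nvidia/pytorch:24.05-py3",  # illustrative image
                    command=["python", "train.py"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "2"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)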
In contrast to typical cloud-native workloads, which are optimised for scale-out and multi-core environments, AI tasks often require the aggregation of multiple GPU servers into a single high-performance computing instance. This shift demands RDMA and ultra-high-performance networking, areas which the Mirantis reference architecture is designed to accommodate.
The reference architecture uses Kubernetes and is adaptable to various AI workload types, including training, fine-tuning, and inference, across a range of environments. These include dedicated or shared servers, virtualised settings using KubeVirt or OpenStack, public cloud, hybrid or multi-cloud configurations, and edge locations. The solution addresses the specific needs of AI workloads, such as high-performance storage and high-speed networking technologies, including Ethernet, InfiniBand, NVLink, NVSwitch, and CXL, to manage the movement of large data sets inherent to AI applications.
Mirantis has identified, and aims to resolve, several challenges in AI infrastructure, including:
- Time-intensive fine-tuning and configuration compared with traditional compute systems;
- Support for hard multi-tenancy to ensure security, isolation, resource allocation, and contention management;
- Maintaining data sovereignty for data-driven AI and ML workloads, particularly where models contain proprietary information;
- Ensuring compliance with varied regional and regulatory standards;
- Managing distributed, large-scale infrastructure, which is common in edge deployments;
- Effective resource sharing of high-demand compute components such as GPUs (see the sketch after this list);
- Enabling accessibility for data scientists and developers who may not have IT infrastructure expertise.
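The article does not detail how k0rdent AI implements multi-tenancy or GPU sharing, but one common Kubernetes building block is a per-tenant namespace with a ResourceQuota capping GPU requests. A minimal sketch, with illustrative tenant and quota values:

    # Sketch: a per-tenant namespace plus a ResourceQuota that caps the
    # GPUs, CPU, and memory the tenant may request. Names and limits are
    # illustrative; extended resources are quota'd via requests.<name>.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    tenant = "team-nlp"  # illustrative tenant namespace
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=tenant))
    )

    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-quota", namespace=tenant),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.nvidia.com/gpu": "8",
                "requests.cpu": "64",
                "requests.memory": "512Gi",
            }
        ),
    )
    core.create_namespaced_resource_quota(namespace=tenant, body=quota)

Quotas address contention; full hard multi-tenancy would layer network policy and node isolation on top.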
The composable nature of the Mirantis AI Factory Reference Architecture allows users to assemble infrastructure using reusable templates across compute, storage, GPU, and networking components, which can then be tailored to specific AI use cases. The architecture includes support for a variety of hardware accelerators, including products from NVIDIA, AMD, and Intel.
Mirantis reports that its AI Factory Reference Architecture has been developed with the goal of supporting the unique operational requirements of enterprises seeking scalable, sovereign AI infrastructures, especially where control over data and regulatory compliance are paramount. The framework is intended as a guideline to streamline the deployment and ongoing management of these environments, offering modularity and integration with open standard tools and platforms.

Related Articles

Oracle & NVIDIA expand OCI partnership with 160 AI tools

Techday NZ · 12-06-2025

Oracle and NVIDIA have expanded their partnership to enable customers to access more than 160 AI tools and agents while leveraging the necessary computing resources for AI development and deployment. The collaboration brings NVIDIA AI Enterprise, a cloud-native software platform, natively to the Oracle Cloud Infrastructure (OCI) Console. Oracle customers can now use this platform across OCI's distributed cloud, including public regions, Government Clouds, and sovereign cloud solutions.

Platform access and capabilities

By integrating NVIDIA AI Enterprise directly through the OCI Console rather than a marketplace, Oracle allows customers to utilise their existing Universal Credits, streamlining transactions and support. This approach is designed to speed up deployment and help customers meet security, regulatory, and compliance requirements in enterprise AI processes. Customers can now access over 160 AI tools focused on training and inference, including NVIDIA NIM microservices (a minimal call sketch appears at the end of this article). These services aim to simplify the deployment of generative AI models and support a broad set of application-building and data management needs across various deployment scenarios.

"Oracle has become the platform of choice for AI training and inferencing, and our work with NVIDIA boosts our ability to support customers running some of the world's most demanding AI workloads," said Karan Batta, Senior Vice President, Oracle Cloud Infrastructure. "Combining NVIDIA's full-stack AI computing platform with OCI's performance, security, and deployment flexibility enables us to deliver AI capabilities at scale to help advance AI efforts globally."

The partnership includes making NVIDIA GB200 NVL72 systems available on the OCI Supercluster, supporting up to 131,072 NVIDIA Blackwell GPUs. The new architecture provides a liquid-cooled infrastructure that targets large-scale AI training and inference requirements. Governments and enterprises can take advantage of the so-called AI factories, using platforms like NVIDIA's GB200 NVL72 for agentic AI tasks reliant on advanced reasoning models and efficiency enhancements.

Developer access to advanced resources

Oracle has become one of the first major cloud providers to integrate with NVIDIA DGX Cloud Lepton, which links developers to a global marketplace of GPU compute. This integration offers developers access to OCI's high-performance GPU clusters for a range of needs, including AI training, inference, digital twin implementations, and parallel HPC applications.

Ian Buck, Vice President of Hyperscale and HPC at NVIDIA, said: "Developers need the latest AI infrastructure and software to rapidly build and launch innovative solutions. With OCI and NVIDIA, they get the performance and tools to bring ideas to life, wherever their work happens."

With this integration, developers are also able to select compute resources in precise regions to help achieve both strategic and sovereign AI aims and satisfy long-term and on-demand requirements.

Customer projects using joint capabilities

Enterprises in Europe and internationally are making use of the enhanced partnership between Oracle and NVIDIA. For example, Almawave, based in Italy, utilises OCI AI infrastructure and NVIDIA Hopper GPUs to run generative AI model training and inference for its Velvet family, which supports Italian alongside other European languages and is being deployed within Almawave's AIWave platform.
"Our commitment is to accelerate innovation by building a high-performing, transparent, and fully integrated Italian foundational AI in a European context—and we are only just getting started," said Valeria Sandei, Chief Executive Officer, Almawave. "Oracle and NVIDIA are valued partners for us in this effort, given our common vision around AI and the powerful infrastructure capabilities they bring to the development and operation of Velvet." Danish health technology company Cerebriu is using OCI along with NVIDIA Hopper GPUs to build an AI-driven tool for clinical brain MRI analysis. Cerebriu's deep learning models, trained on thousands of multi-modal MRI images, aim to reduce the time required to interpret scans, potentially benefiting the clinical diagnosis of time-sensitive neurological conditions. "AI plays an increasingly critical role in how we design and differentiate our products," said Marko Bauer, Machine Learning Researcher, Cerebriu. "OCI and NVIDIA offer AI capabilities that are critical to helping us advance our product strategy, giving us the computing resources we need to discover and develop new AI use cases quickly, cost-effectively, and at scale. Finding the optimal way of training our models has been a key focus for us. While we've experimented with other cloud platforms for AI training, OCI and NVIDIA have provided us the best cloud infrastructure availability and price performance." By expanding the Oracle-NVIDIA partnership, customers are now able to choose from a wide variety of AI tools and infrastructure options within OCI, supporting both research and production environments for AI solution development.

iFLYTEK wins CNCF award for AI model training with Volcano

Techday NZ · 10-06-2025

iFLYTEK has been named the winner of the Cloud Native Computing Foundation's End User Case Study Contest for advancements in scalable artificial intelligence infrastructure using the Volcano project. The selection recognises iFLYTEK's deployment of Volcano to address operational inefficiencies and resource management issues that arose as the company expanded its AI workloads.

iFLYTEK, which specialises in speech and language artificial intelligence, reported experiencing underutilised GPUs, increasingly complex workflows, and competition among teams for resources as its computing demands expanded. These problems resulted in slower development progress and placed additional strain on infrastructure assets.

With the implementation of Volcano, iFLYTEK introduced elastic scheduling, directed acyclic graph (DAG)-based workflows, and multi-tenant isolation into its AI model training operations. This transition allowed the business to improve the efficiency of its infrastructure and simplify the management of large-scale training projects. Key operational improvements cited include a significant increase in resource utilisation and reductions in system disruptions.

DongJiang, Senior Platform Architect at iFLYTEK, said, "Before Volcano, coordinating training under large-scale GPU clusters across teams meant constant firefighting, from resource bottlenecks and job failures to debugging tangled training pipelines. Volcano gave us the flexibility and control to scale AI training reliably and efficiently. We're honoured to have our work recognized by CNCF, and we're excited to share our journey with the broader community at KubeCon + CloudNativeCon China."

Volcano is a cloud native batch system built on Kubernetes and is designed to support performance-focused workloads such as artificial intelligence and machine learning training, big data processing, and scientific computing. The platform's features include job orchestration, resource fairness, and queue management, intended to maximise the efficient management of distributed workloads. Volcano was first accepted into the CNCF Sandbox in 2020 and achieved Incubating maturity level by 2022, reflecting increasing adoption for compute-intensive operations.

iFLYTEK's engineering team cited the need for an infrastructure that could adapt to the rising scale and complexity of AI model training. Their objectives were to improve allocation of computing resources, manage multi-stage workflows efficiently, and limit disruptions to jobs while ensuring equitable resource access among multiple internal teams.

The adoption of Volcano yielded several measurable outcomes for iFLYTEK's AI infrastructure. The company reported a 40% increase in GPU utilisation, contributing to lower infrastructure costs and reduced idle periods. Additionally, the company experienced a 70% faster recovery rate from training job failures, which contributed to more consistent and uninterrupted AI development. The speed of hyperparameter searches, a process integral to AI model optimisation, was accelerated by 50%, allowing the company's teams to test and refine models more swiftly.

Chris Aniszczyk, Chief Technology Officer at CNCF, said, "iFLYTEK's case study shows how open source can solve complex, high-stakes challenges at scale. By using Volcano to boost GPU efficiency and streamline training workflows, they've cut costs, sped up development, and built a more reliable AI platform on top of Kubernetes, which is essential for any organization striving to lead in AI."
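Volcano jobs are declared as Kubernetes custom resources, and gang scheduling via minAvailable ensures a distributed training job starts only when all of its pods can be placed. A minimal sketch using the Python Kubernetes client; the queue, image, and sizing values are illustrative, not iFLYTEK's configuration.

    # Sketch: a gang-scheduled Volcano job. minAvailable makes the
    # scheduler start the job only when all four workers fit, avoiding
    # partial allocations that strand GPUs. Queue/image are illustrative.
    from kubernetes import client, config

    config.load_kube_config()

    volcano_job = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": "dist-train", "namespace": "default"},
        "spec": {
            "schedulerName": "volcano",
            "minAvailable": 4,       # gang scheduling: all-or-nothing start
            "queue": "research",     # illustrative tenant queue
            "tasks": [{
                "name": "worker",
                "replicas": 4,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [{
                            "name": "trainer",
                            "image": "pytorch-train:latest",  # illustrative
                            "resources": {"limits": {"nvidia.com/gpu": "1"}},
                        }],
                    }
                },
            }],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.volcano.sh", version="v1alpha1",
        namespace="default", plural="jobs", body=volcano_job,
    )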
As artificial intelligence workloads become increasingly complex and reliant on large-scale compute resources, the use of tools like Volcano has expanded among organisations seeking more effective operational strategies. iFLYTEK's experience with the platform will be the subject of a presentation at KubeCon + CloudNativeCon China, where company representatives will outline approaches to managing distributed model training within Kubernetes-based environments. iFLYTEK will present its case study, titled "Scaling Large Model Training in Kubernetes Clusters with Volcano," sharing technical and practical insights with participants seeking to optimise large-scale artificial intelligence training infrastructure.
