
iFLYTEK wins CNCF award for AI model training with Volcano
iFLYTEK has been named the winner of the Cloud Native Computing Foundation's End User Case Study Contest for advancements in scalable artificial intelligence infrastructure using the Volcano project.
The selection recognises iFLYTEK's deployment of Volcano to address the operational inefficiencies and resource management issues that arose as the company expanded its AI workloads. iFLYTEK, which specialises in speech and language artificial intelligence, reported underutilised GPUs, increasingly complex workflows, and competition among teams for resources as its computing demands grew. These problems slowed development and placed additional strain on its infrastructure.
With the implementation of Volcano, iFLYTEK introduced elastic scheduling, directed acyclic graph (DAG)-based workflows, and multi-tenant isolation into its AI model training operations. This transition allowed the business to improve the efficiency of its infrastructure and simplify the management of large-scale training projects. Key operational improvements cited include a significant increase in resource utilisation and reductions in system disruptions.
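In Volcano, those concepts are expressed as Kubernetes custom resources. The following sketch shows roughly what a gang-scheduled training job submitted to a tenant queue looks like; it is a minimal, hypothetical example rather than iFLYTEK's configuration, and the queue name, container image, and resource figures are invented. It assumes the official Kubernetes Python client and a cluster with Volcano installed.

# Illustrative sketch, not iFLYTEK's configuration: submits a gang-scheduled
# Volcano Job to a tenant queue. Queue name, image, and sizes are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

training_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "demo-train", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        "queue": "team-a",   # multi-tenant isolation: the job draws from this queue
        "minAvailable": 4,   # gang scheduling: start all four pods or none
        "tasks": [{
            "name": "worker",
            "replicas": 4,
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # hypothetical image
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }],
            }},
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh", version="v1alpha1",
    namespace="default", plural="jobs", body=training_job,
)

Because minAvailable covers the whole gang, a job either gets all of its workers or waits, avoiding the partially scheduled, deadlock-prone state that pod-by-pod scheduling can produce for distributed training.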
DongJiang, Senior Platform Architect at iFLYTEK, said, "Before Volcano, coordinating training under large-scale GPU clusters across teams meant constant firefighting, from resource bottlenecks and job failures to debugging tangled training pipelines. Volcano gave us the flexibility and control to scale AI training reliably and efficiently. We're honoured to have our work recognized by CNCF, and we're excited to share our journey with the broader community at KubeCon + CloudNativeCon China."
Volcano is a cloud native batch system built on Kubernetes, designed to support performance-focused workloads such as artificial intelligence and machine learning training, big data processing, and scientific computing. Its features include job orchestration, resource fairness, and queue management, all aimed at running distributed workloads efficiently. Volcano was first accepted into the CNCF Sandbox in 2020 and reached the Incubating maturity level in 2022, reflecting growing adoption for compute-intensive operations.
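Resource fairness in Volcano is organised around queues. As a hedged illustration of the idea, assuming Volcano's cluster-scoped Queue resource, two teams could be given weighted shares of the cluster like this; the team names and weights are hypothetical:

# Illustrative sketch: two tenant queues whose weights set the proportional
# share each team receives when the cluster is contended. Names are invented.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

for name, weight in [("team-a", 2), ("team-b", 1)]:
    queue = {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {"weight": weight},  # team-a gets roughly twice team-b's share
    }
    api.create_cluster_custom_object(
        group="scheduling.volcano.sh", version="v1beta1",
        plural="queues", body=queue,
    )

In Volcano's design, idle capacity can be lent to whichever queue has pending work and reclaimed later, which is how weighted queues can raise overall utilisation without starving any tenant.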
iFLYTEK's engineering team cited the need for an infrastructure that could adapt to the rising scale and complexity of AI model training. Their objectives were to improve allocation of computing resources, manage multi-stage workflows efficiently, and limit disruptions to jobs while ensuring equitable resource access among multiple internal teams.
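The multi-stage requirement is where DAG-style workflows come in: later stages should start only when the stages they depend on have finished. As a rough, hypothetical sketch of how such a dependency can be declared, assuming Volcano's task-level dependsOn field, a preprocessing step could gate a training step like this (names, images, and replica counts are all invented):

# Illustrative sketch: a two-stage pipeline as Volcano tasks, where training
# waits for preprocessing. Assumes the task-level "dependsOn" field.
def pod_template(image):
    # Minimal pod template around a single container.
    return {"spec": {
        "restartPolicy": "Never",
        "containers": [{"name": "main", "image": image}],
    }}

pipeline_tasks = [
    {"name": "preprocess", "replicas": 1,
     "template": pod_template("example.com/prep:latest")},
    {"name": "train", "replicas": 4,
     "dependsOn": {"name": ["preprocess"]},  # DAG edge: train after preprocess
     "template": pod_template("example.com/trainer:latest")},
]
# "pipeline_tasks" would slot into the "tasks" list of a Volcano Job spec,
# as in the earlier job sketch.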
The adoption of Volcano yielded several measurable outcomes for iFLYTEK's AI infrastructure. The company reported a 40% increase in GPU utilisation, contributing to lower infrastructure costs and reduced idle periods. Additionally, the company experienced a 70% faster recovery rate from training job failures, which contributed to more consistent and uninterrupted AI development. The speed of hyperparameter searches—a process integral to AI model optimisation—was accelerated by 50%, allowing the company's teams to test and refine models more swiftly.
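Faster recovery of this kind is the behaviour Volcano's job lifecycle policies are built for: rather than an operator restarting a broken run by hand, the controller reacts to pod-level events automatically. A hedged sketch of such a policy, again not iFLYTEK's actual settings, might look like this:

# Illustrative sketch: job-level lifecycle policies that restart the whole
# gang-scheduled job when a pod fails or is evicted, with a bounded retry
# budget. All values are hypothetical.
recovery_spec_fields = {
    "maxRetry": 3,  # stop retrying after three automatic restarts
    "policies": [
        {"event": "PodFailed", "action": "RestartJob"},
        {"event": "PodEvicted", "action": "RestartJob"},
    ],
}
# Merged into the "spec" of a Volcano Job manifest, these fields let the
# controller recreate a failed run without manual intervention.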
Chris Aniszczyk, Chief Technology Officer at CNCF, said, "iFLYTEK's case study shows how open source can solve complex, high-stakes challenges at scale. By using Volcano to boost GPU efficiency and streamline training workflows, they've cut costs, sped up development, and built a more reliable AI platform on top of Kubernetes, which is essential for any organization striving to lead in AI."
As artificial intelligence workloads become increasingly complex and reliant on large-scale compute resources, the use of tools like Volcano has expanded among organisations seeking more effective operational strategies. iFLYTEK's experience with the platform will be the subject of a presentation at KubeCon + CloudNativeCon China, where company representatives will outline approaches to managing distributed model training within Kubernetes-based environments.
iFLYTEK will present its case study, titled "Scaling Large Model Training in Kubernetes Clusters with Volcano," sharing technical and practical insights with participants seeking to optimise large-scale artificial intelligence training infrastructure.