Operational Excellence In MLOps


Forbes, 04-04-2025

Neel Sendas is a Principal Technical Account Manager at Amazon Web Services (AWS).
MLOps (machine learning operations) represents the integration of DevOps principles into machine learning systems, emerging as a critical discipline as organizations increasingly embed AI/ML into their products. This engineering approach bridges the gap between ML development and deployment, creating a standardized framework for delivering high-performing models in production. By combining machine learning, DevOps and data engineering, MLOps enables organizations to automate and streamline the entire ML lifecycle. It ensures consistent quality and reproducibility in production environments through continuous integration, deployment and testing of both code and models while maintaining robust data engineering practices throughout the process.
It also provides automated solutions for monitoring and managing ML systems, a necessity in today's data-intensive AI landscape.
Some of the best practices for implementing operational excellence in MLOps are:
CI/CD in MLOps adapts DevOps principles to streamline machine learning workflows. Continuous integration ensures that every change to code, data or models triggers automated testing and validation through the ML pipeline, maintaining version control and quality standards. Continuous deployment extends this automation to production releases, enabling seamless model updates in live environments.
This integrated approach creates a robust framework where changes are systematically tested, validated and deployed, minimizing manual errors and accelerating development cycles. The result is a reliable, automated system that maintains high standards while enabling rapid iteration and deployment of ML models in production environments.
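As a minimal sketch of the automated validation step described above, a CI job could compare a candidate model against the current production baseline and block the release if it regresses. The `EvalResult` structure, the `passes_gate` function and the threshold values are all hypothetical names and numbers chosen for illustration; a real pipeline would define its own metrics and budgets.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Metrics for one model on a shared hold-out set (hypothetical structure)."""
    accuracy: float
    latency_ms: float


def passes_gate(candidate: EvalResult, baseline: EvalResult,
                min_gain: float = 0.0, max_latency_ms: float = 200.0) -> bool:
    """Return True only if the candidate is at least as accurate as the
    baseline (plus min_gain) and stays within the latency budget."""
    accurate_enough = candidate.accuracy >= baseline.accuracy + min_gain
    fast_enough = candidate.latency_ms <= max_latency_ms
    return accurate_enough and fast_enough


# Assumed example values, not measurements from any real system.
baseline = EvalResult(accuracy=0.91, latency_ms=120.0)
candidate = EvalResult(accuracy=0.93, latency_ms=140.0)
print("promote" if passes_gate(candidate, baseline) else "block")
```

A CI system would typically run a check like this after the automated tests and fail the build when it returns False, so that only validated models reach the deployment stage.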
Infrastructure as code (IaC) is fundamental to modern MLOps, providing automated, scalable and reproducible practices for managing the complex infrastructure required for machine learning operations. By implementing IaC through version control systems, organizations can accelerate ML model development and deployment while reducing errors and operational costs.
The market offers various IaC tools suited to ML environments, including the Databricks Terraform provider, AWS CloudFormation, Kubernetes and Google Cloud Deployment Manager. These tools support two critical features of MLOps infrastructure automation:
• Automated Version Control: Version control tracks changes across data, code, configurations and models. Using tools like Git LFS, MLflow and Pachyderm, teams can efficiently monitor changes, troubleshoot issues and restore previous versions when needed. This systematic approach enhances collaboration and maintains reliability across large MLOps teams.
• Automated ML Pipeline Triggering: Pipeline triggering streamlines production processes through scheduled or event-driven executions. Pipelines can be triggered based on:
  • Predetermined schedules (daily, weekly or monthly).
  • Availability of new training data.
  • Model performance degradation.
  • Significant data drift.
This automation is particularly valuable given the resource-intensive nature of model retraining. By implementing thoughtful triggering strategies, organizations can optimize resource utilization while ensuring models remain accurate and effective.
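The four trigger conditions listed above can be sketched as a single decision function. Everything here is an assumption made for illustration: the `PipelineState` fields, the `retrain_reasons` name and the default thresholds are hypothetical, and the drift score is taken as an input computed elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PipelineState:
    """Snapshot of what the scheduler knows (hypothetical fields)."""
    last_run: datetime
    new_rows: int            # training rows that arrived since the last run
    current_accuracy: float  # accuracy measured on recent labelled data
    drift_score: float       # any drift statistic, computed elsewhere


def retrain_reasons(state: PipelineState, now: datetime,
                    max_age: timedelta = timedelta(days=7),
                    min_new_rows: int = 10_000,
                    min_accuracy: float = 0.90,
                    max_drift: float = 0.2) -> List[str]:
    """Return the list of trigger conditions that currently hold.
    An empty list means no retraining is needed."""
    reasons = []
    if now - state.last_run >= max_age:
        reasons.append("schedule")
    if state.new_rows >= min_new_rows:
        reasons.append("new_data")
    if state.current_accuracy < min_accuracy:
        reasons.append("performance")
    if state.drift_score > max_drift:
        reasons.append("drift")
    return reasons


# Assumed example: three days since the last run, little new data,
# but accuracy has slipped below the threshold.
state = PipelineState(last_run=datetime(2025, 4, 1), new_rows=2_500,
                      current_accuracy=0.87, drift_score=0.05)
print(retrain_reasons(state, now=datetime(2025, 4, 4)))  # ['performance']
```

Returning the reasons rather than a bare boolean keeps a record of why an expensive retraining run was started, which helps when reviewing resource usage later.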
Through these automated infrastructure practices, MLOps teams can maintain consistent quality, reduce manual intervention and focus on delivering value rather than managing infrastructure complexities.
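To illustrate the idea behind versioning data, code and configuration together, the sketch below derives one fingerprint from all three, so that a change to any of them yields a new version identifier. This is a standalone illustration using only the Python standard library; it does not reflect how Git LFS, MLflow or Pachyderm work internally, and the `fingerprint` function and its parameters are hypothetical.

```python
import hashlib
import json


def fingerprint(data: bytes, code_version: str, config: dict) -> str:
    """Combine a data hash, a code version string (for example a commit id)
    and a canonical JSON dump of the config into one SHA-256 hex digest."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(data).digest())
    h.update(code_version.encode("utf-8"))
    # sort_keys makes the result independent of dict insertion order
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


# Assumed example inputs.
v1 = fingerprint(b"a,b\n1,2\n", "abc123", {"lr": 0.01, "epochs": 5})
v2 = fingerprint(b"a,b\n1,3\n", "abc123", {"lr": 0.01, "epochs": 5})
print(v1 != v2)  # True: changing the data changes the fingerprint
```

Storing such an identifier alongside each trained model makes it possible to tell which combination of inputs produced it and to detect when a supposedly identical rerun used different data.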
Monitoring and observability are cornerstone elements of successful MLOps implementations, focusing primarily on maintaining model performance in production environments. As models face various challenges post-deployment, including data drift and environmental changes, comprehensive monitoring systems become essential for maintaining operational excellence.
Modern MLOps monitoring encompasses several critical areas, implemented through tools like OpenShift, DataRobot and Amazon SageMaker. These tools create robust monitoring pipelines that track key performance indicators and trigger alerts when necessary. The monitoring framework typically covers these essential aspects:
• Model Performance Monitoring: In production environments, continuous performance evaluation is crucial. This involves tracking metrics related to incoming data, labels, model bias and environmental factors. Real-time visualization dashboards enable teams to monitor model health and respond quickly to performance issues.
• Data Quality Monitoring: Given the dynamic nature of production data, which often comes from multiple sources and undergoes various transformations, monitoring incoming data quality is vital. This helps identify inconsistencies, drift patterns and potential issues that could impact model performance over time.
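One common way to quantify the drift mentioned above is the Population Stability Index (PSI), which compares how a feature is distributed in training data versus live data. The sketch below is a plain-Python version under stated assumptions: equal-width bins derived from the training sample, a small `eps` to avoid division by zero, and a cutoff of 0.2 that is a widely quoted rule of thumb rather than a universal standard.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample (expected)
    and a live sample (actual), using equal-width bins over the
    reference range. Live values outside that range fall into the
    first or last bin."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Assumed synthetic data: the live feature is shifted upward by 0.5.
training = [i / 100 for i in range(100)]
live = [v + 0.5 for v in training]
score = psi(training, live)
print("drift detected" if score > 0.2 else "stable")
```

A monitoring job could compute this per feature on a schedule and feed the result into the alerting and retraining logic.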
There are several advanced monitoring components:
• Outlier Detection: Flags anomalous predictions that may be unreliable for production use. This is particularly important given the noisy nature of real-world data.
• Platform Monitoring: Oversees the entire MLOps infrastructure to ensure smooth operation.
• Cluster Monitoring: Ensures optimal resource utilization and system performance.
• Warehouse Monitoring: Tracks data storage efficiency and resource usage patterns.
• Stream Monitoring: Manages real-time data processing and analysis.
• Security Monitoring: Maintains system integrity and compliance with security protocols.
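The outlier detection component in the list above can be illustrated with a simple z-score check, which flags predictions that sit far from a reference distribution. This is one basic technique among many and is not tied to any of the tools named in this article; the `flag_outliers` function, the threshold of 3 standard deviations and the sample values are assumptions for the sketch.

```python
import statistics
from typing import List, Sequence


def flag_outliers(reference: Sequence[float], predictions: Sequence[float],
                  z_threshold: float = 3.0) -> List[int]:
    """Return the indices of predictions whose distance from the reference
    mean exceeds z_threshold reference standard deviations."""
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)
    if std == 0:
        # Constant reference: anything different counts as anomalous
        return [i for i, p in enumerate(predictions) if p != mean]
    return [i for i, p in enumerate(predictions)
            if abs(p - mean) / std > z_threshold]


# Assumed example values: validation-time predictions as the reference,
# then a small batch of production predictions.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
batch = [10.1, 9.7, 14.0, 10.4]
print(flag_outliers(reference, batch))  # [2]
```

Flagged predictions could be routed to a fallback or a human review queue instead of being served directly.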
These monitoring systems work together to create a comprehensive observability framework that:
• Detects performance degradation early.
• Identifies data drift and quality issues.
• Maintains system reliability.
• Ensures resource optimization.
• Protects against security vulnerabilities.
When issues are detected, automated alerts notify relevant stakeholders, enabling prompt intervention. This proactive approach helps maintain model accuracy and system efficiency while minimizing downtime and performance issues.
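The alerting step described above can be sketched as a small set of threshold rules evaluated against the latest metrics. The `AlertRule` structure, the metric names, the thresholds and the recipient labels are all hypothetical; a production setup would normally rely on the alerting features of its monitoring platform rather than hand-written code like this.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    """One threshold rule (hypothetical structure)."""
    name: str
    metric: str
    threshold: float
    breached: Callable[[float, float], bool]  # e.g. operator.lt or operator.gt
    notify: str                               # who should be told


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return one message per rule whose metric breaches its threshold.
    Rules whose metric is missing from the snapshot are skipped."""
    messages = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue
        if rule.breached(value, rule.threshold):
            messages.append(f"[{rule.notify}] {rule.name}: "
                            f"{rule.metric}={value} (threshold {rule.threshold})")
    return messages


# Assumed rules and metric snapshot for illustration only.
rules = [
    AlertRule("Accuracy drop", "accuracy", 0.90, operator.lt, "ml-team"),
    AlertRule("Slow responses", "p95_latency_ms", 250.0, operator.gt, "platform-team"),
]
alerts = evaluate(rules, {"accuracy": 0.86, "p95_latency_ms": 180.0})
for line in alerts:
    print(line)
```

Keeping the rules as data rather than hard-coded conditions makes it straightforward to add checks for drift, resource usage or security events without changing the evaluation logic.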
The integration of these monitoring components creates a robust MLOps environment capable of handling the complexities of production ML systems while maintaining high performance and reliability standards. Regular monitoring and quick response to alerts ensure that ML models continue to deliver value in production environments while operating within expected parameters.
MLOps emerges from applying DevOps principles to machine learning systems, enabling a smooth transition from development to production environments. While traditionally there has been a gap between model creation and deployment, operational excellence in MLOps is helping bridge this divide.
Modern MLOps practices effectively address the complexities of data management, model construction and system monitoring. The goal is to achieve seamless production deployment of ML models, maximizing the benefits of artificial intelligence technology. Success in this area requires implementing operational excellence best practices throughout the MLOps lifecycle.
By following established frameworks and learning from real-world use cases, organizations can build robust MLOps pipelines that ensure consistent performance and reliability in production environments.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
