Machine learning is attracting attention for its practicality, and its possible applications are expanding across many fields. At the same time, MLOps is drawing more and more interest as a way to develop and operate machine learning models stably over the long term across a wide range of fields, rather than as one-off, pinpoint efforts.
MLOps aims to build development platforms that make machine learning systems easier to operate, eliminating the many difficulties involved in developing and running them. On the other hand, MLOps has no clear definition, and many different technologies answering many different demands go by the name of MLOps.
In this article, we will introduce five points that you should keep in mind when considering the introduction of MLOps.
Table of Contents
- Point 1: Understanding MLOps | Part 1: DevOps and MLOps
- Point 2: Understanding MLOps | Part 2: Machine Learning Experiments and MLOps
- Point 3: Understanding MLOps | Part 3: Data Engineering and MLOps
- Point 4: Application to existing cases
- Point 5: Recruitment and support of MLOps personnel
Point 1: Understanding MLOps | Part 1: DevOps and MLOps
MLOps is an expression based on DevOps, and many of the techniques in MLOps can be viewed as extensions of DevOps techniques and practices for conventional software to the challenges unique to machine learning systems.
Definitions vary, but DevOps refers to a set of technical and organizational practices designed to cope well with development factors that change with the environment, such as infrastructure, development requirements, and development personnel.
The importance of MLOps grows as the software development environment becomes more complex: the environment surrounding software development keeps increasing in complexity, and on top of that the complicated elements of data science have been piled on.
Understanding MLOps essentially requires both the development and operations expertise built up in the DevOps context and an understanding of the issues that arise when developing machine learning systems.
Core technologies in DevOps include version control (VCS), continuous integration (CI), continuous delivery (CD), and system monitoring. In the context of MLOps, pipelines (ML pipelines) are often built and provided that extend the DevOps development pipeline with features such as:
- Model version control (an extension of VCS)
- Data validation and model performance tests (an extension of CI)
- Model serving and phased deployment platforms (an extension of CD)
- Continuous retraining of the model (sometimes called CT)
- Model performance monitoring
These elements are provided to standardize and automate the series of tasks that make up the model lifecycle.
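As an illustration of extending CI to models, here is a minimal sketch of an automated model performance gate. This is not from any specific MLOps product; the function name and threshold are illustrative assumptions.

```python
# A minimal sketch of a CI-style model performance gate.
# The name `passes_performance_gate` and the 0.9 threshold are
# illustrative assumptions, not part of any particular tool.

def passes_performance_gate(y_true, y_pred, accuracy_threshold=0.9):
    """Return True if the candidate model's accuracy meets the threshold.

    In an ML pipeline this check runs automatically (an extension of CI):
    a model that fails the gate is never promoted toward deployment.
    """
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return accuracy >= accuracy_threshold

# A candidate that gets 3 of 4 held-out predictions right fails a 0.9 gate
print(passes_performance_gate([0, 1, 1, 0], [0, 1, 1, 1]))  # False (75% accuracy)
```

In a real pipeline the same idea is expressed as a test step that blocks the deployment stage, rather than a bare function call.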
Such ML pipelines pay off in business categories where machine learning models are updated frequently and data environments change often, such as fast-moving fields like finance and advertising, and fields tied to the physical environment such as manufacturing. In these areas the demand for high performance and high reliability is large.
MLOps also matters for reducing the cost of developing and operating even a simple machine learning system. But just as DevOps connected the expertise of application engineers and infrastructure engineers to some extent, MLOps is also important as a bridge between the expertise of software engineers and machine learning engineers. For example, if model behavior verification is automated, that work can be folded into the regular operational process.
In addition, for products involving machine learning, it is important that not only engineers but also business analysts and business domain experts can access data and systems. This aspect is better supported by the data engineering topics discussed below than by the DevOps context.
DevOps has a history of nearly 15 years, and as it has spread, its impact on development and organizations, and the challenges it brings, have become apparent. Differences in how management and developers perceive DevOps have also emerged. Many of these impacts and challenges will likely appear in the same way when introducing and operating MLOps. DevOps is therefore the first theme to study in order to understand MLOps.
Point 2: Understanding MLOps | Part 2: Machine Learning Experiments and MLOps
DevOps focused largely on streamlining the flow that bridges development work and the production environment. In machine learning development, however, there is an additional demand for efficiency and scalability in the experimental development process, which is an essential part of the work.
One example of this demand is that machine learning development requires far more resources for training models than for serving predictions in the actual service. Models using deep learning almost always require GPU resources, and some are at a scale that requires distributed training. This resource demand is roughly the product of the computational resources needed for one experiment (determined by data volume, model size, and so on) and the number of experiment settings to be explored. Depending on how ambitious the development goals are, these resource demands can easily outgrow the capacity available.
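The multiplication described above can be sketched as a back-of-the-envelope estimate. All numbers here are illustrative assumptions, not benchmarks.

```python
# Back-of-the-envelope sketch of the resource-demand estimate:
# total demand = (resources per experiment) x (settings explored).
# Every number below is an illustrative assumption.

gpu_hours_per_run = 8            # one training run on the assumed data/model size
settings_to_explore = 3 * 4 * 5  # e.g. 3 learning rates x 4 batch sizes x 5 seeds

total_gpu_hours = gpu_hours_per_run * settings_to_explore
print(total_gpu_hours)  # 480 GPU-hours for even a modest hyperparameter sweep
```

Even a small sweep multiplies quickly, which is why experiment infrastructure and scheduling become MLOps concerns.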
For this reason, MLOps sometimes covers technologies that make model experimentation more efficient and provide a foundation that lets experiments scale. Some of the techniques used for the "continuous retraining of models" mentioned in the DevOps section appear here as well, but the main motive in this section is to use large amounts of computing power, explore many settings, and manage experiment results so that trial and error proceeds efficiently. This demand is of particular interest to model development teams, machine learning researchers, and others who need to run large-scale experiments at high volume and speed.
The following are representative technical elements.
- Automatic securing of necessary machine resources and distributed learning (experimental infrastructure)
- Experiment management (mechanism for managing experiment settings and used data)
- Management of experiment workflow (automation of data processing and experiment execution itself)
- Hyperparameter tuning (automation of experimental setup search)
- AutoML (automation of the task of selecting machine learning models)
- Feature Serving (management and sharing of developed features)
- Model explanation/visualization (efficiency of confirming experimental results)
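To make "experiment management" from the list above concrete, here is a minimal sketch of recording each run's settings, data identifier, and metrics so results stay comparable. Real tools such as MLflow provide this; the dict-based store and all names here are illustrative assumptions.

```python
# Minimal sketch of experiment management: log each run's settings,
# dataset identifier, and metrics so results can be compared later.
# The dict-based store is illustrative; real teams use tracking servers.
import hashlib
import json

def log_experiment(store, params, dataset_id, metrics):
    """Record one run; the run id is derived from settings + data."""
    key = hashlib.sha256(
        json.dumps({"p": params, "d": dataset_id}, sort_keys=True).encode()
    ).hexdigest()[:12]
    store[key] = {"params": params, "dataset": dataset_id, "metrics": metrics}
    return key

runs = {}
log_experiment(runs, {"lr": 0.01, "depth": 6}, "train-v3", {"auc": 0.87})
log_experiment(runs, {"lr": 0.10, "depth": 6}, "train-v3", {"auc": 0.84})

# Pick the best run by AUC, as an experiment dashboard would
best = max(runs.values(), key=lambda r: r["metrics"]["auc"])
print(best["params"])  # {'lr': 0.01, 'depth': 6}
```

Deriving the run id from the settings and data identifier means re-running an identical configuration overwrites rather than duplicates, a common deduplication choice.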
Looking at technologies such as AutoML that automate the building of machine learning models, one might conclude that machine learning engineers will no longer be needed. In practice, AutoML is useful not only for tasks and data already known to be solvable, but also for a first attempt at machine learning, to get a feel for unknown data and tasks.
In the future, the work of machine learning engineers may shift toward formulating specifications (such as standardizing problems), reducing costs through systematization, debugging machine learning systems (still a highly difficult task), and semi-custom task design. Since MLOps technology is still in exploratory development, one important role of machine learning engineers is to determine where in your company's machine learning model lifecycle efficiency can be improved. Resolving the data-related issues described later is also likely to remain an important task.
The advantage of introducing machine learning is that, by paying the high cost of experimental development, many decisions can be automated and the complexity of the system can be greatly reduced compared to conventional software that achieves equivalent functions. It is a good idea to automate as much as possible of what experiments show can be realized easily, and to actively seek economies of scale on that investment.
Point 3: Understanding MLOps | Part 3: Data Engineering and MLOps
Data in a machine learning system is so important that it can be compared to the system's power source. Whether data can be properly organized is a prerequisite for developing a machine learning system, and difficult questions must be overcome, such as whether we can create data ourselves and whether we can aggregate data scattered across various locations. Many technical challenges arise even after the data is secured, and dealing well with this complexity of data is sometimes treated as part of MLOps.
Although data is this important for machine learning systems, my impression is that topics around data-centered system development and maintenance are still low in maturity. Conventional system development, including DevOps, has focused on developing and maintaining code, while recent MLOps has focused on developing and maintaining machine learning models.
There are several reasons why data-based quality assurance technology is immature.
- Data means something different in every business area, so it was difficult to frame general technical issues (knowledge of data stays buried in each organization)
- Important data is far harder to publish than code or models
- Until machine learning made it necessary, there was little industrial demand for data management involving human labeling and quality control (most data was accumulated automatically)
- Combined with the above, such work was undervalued as research effort, and the research community's interest stayed relatively low
Due to the above circumstances, data management, which is important for machine learning systems, remains an immature area. On the other hand, issues around speeding up and scaling data processing, and general data management, have long existed apart from machine learning systems, mainly in the fields of search systems, databases, and large-scale data processing. In recent years, the group of technologies born from this data infrastructure knowledge has come to be collectively called data engineering.
Topics such as data management for applications and data management for analytics have traditionally advanced in the context of data engineering. For example, RDBMSs serve as databases for conventional core systems, and systems such as DWHs serve large-scale data analysis. As machine learning spread and the demands for larger and more frequently changing data permeated, products were developed that loosen the schema structure (data lakes) and introduce distributed processing (products such as Hadoop/Spark).
As previously mentioned, it is important for business analysts and business domain experts to be able to access data. Data management systems suited to analysis have traditionally been offered as BI products, and data engineers are usually in charge of building the infrastructure that feeds data into such BI systems.
As mentioned above, data engineering has already solved many problems around data analysis, data volume, and high-speed processing. On the other hand, there are still areas with little MLOps support, and future development is expected in areas such as making data creation more efficient, responding to data diversification, and assuring data quality with respect to the meaning of the data.
On the data engineering side, MLOps includes technical topics such as:
- Data aggregation for analytics such as DWH and data lakes
- Maintenance of ETL and data pipelines (processing daily generated data and managing the wiring between systems)
- Data validation and data cleansing (data quality assurance)
- Data version management/experiment data management
- Creation, management, and efficiency of annotations (Human-in-the-loop data infrastructure)
- Model performance monitoring (requires data tracking, not just metrics)
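As a concrete example of the "data validation" item above, here is a minimal sketch of checking incoming records against an expected schema and value ranges before they feed training or monitoring. The schema, field names, and records are all illustrative assumptions.

```python
# Minimal sketch of data validation: verify that each incoming record
# matches an expected schema and sane value ranges before it is used.
# The schema and field names below are illustrative assumptions.

EXPECTED_SCHEMA = {"user_id": str, "age": int, "clicked": int}

def validate_record(record):
    """Return a list of problems found in one record (empty list = valid)."""
    problems = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"wrong type for {field}")
    # Range checks only make sense once the schema checks pass
    if not problems and not (0 <= record["age"] <= 120):
        problems.append("age out of range")
    return problems

print(validate_record({"user_id": "u1", "age": 34, "clicked": 1}))  # []
print(validate_record({"user_id": "u2", "age": -5, "clicked": 0}))  # ['age out of range']
```

In production, dedicated tools run checks like these automatically on every batch and raise alerts, rather than returning lists to a caller.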
Point 4: Application to existing cases
Most of the companies leading MLOps technology development are tech giants operating globally, and their solutions often do not match the business scale and technical demands of other companies. However, as MLOps adoption cases grow in number and the range of adopting companies widens, it is gradually becoming easier to find reference cases close to your own business.
If your company's case fits well, you may be able to reduce development and operation costs by using a service provided and managed by a cloud vendor. Having many precedents is attractive, and it is also easier to map to hiring requirements. Another advantage is that the cost of using the service and the effect it yields are known up front, making it easy to estimate return on investment, which is always an issue with machine learning systems. It is rare to recover an investment in building the platform in-house unless you have functions or business hypotheses that scale its value considerably. It is important first to map your situation to existing cases and determine how far to rely on cloud services and how much to build uniquely in-house.
The following viewpoints make it easier to map your situation to existing cases.
- Are the tasks and datasets similar to what is already known?
- You want to build a classification system in a situation where training labels are easy to collect
- Build an automatic learning system into an existing managed service
- Can be used in cloud environment
- Most managed services are cloud-delivered
- Depending on the vendor, it may also be provided in an on-premises environment.
- Which team/side is in need
- I want to reduce the load on the development and operation team → DevOps
- The productivity of the R&D team does not increase → ML experiments
- The complexity of data is too high to manage/insufficient computing power → data engineering
However, there are areas that are difficult to fit into existing cases.
- Efficiency of data collection and creation, maintenance of data quality
- Fields with high data creation costs, such as medical data and legal data
- Small to medium sized organizations with increasing complexity of data and models
- Building the platform in an on-premises environment
- There are examples, but they usually involve large technical hurdles and management costs.
Point 5: Recruitment and support of MLOps personnel
Introducing and operating MLOps requires hiring people familiar with cloud and infrastructure technology as well as machine learning technology, regardless of whether the platform is developed in-house. MLOps and cloud infrastructure technology are often inseparable, as cloud service vendors lead much of the MLOps space. In terms of impact on the work of implementing and operating MLOps technology, the ability to design and operate infrastructure arguably matters more. In recent years such roles have been called SRE, and my impression is that their scope is expanding beyond traditional DevOps-style infrastructure management.
In addition, since the demand for MLOps beyond the traditional DevOps framework stems from the difficulty of operating machine-learning-based services, there is also a need for machine learning experts who can evaluate the return on investment. If there is little demand for in-house development of models and data, the relative importance of machine learning engineers may be lower, but in that case someone else with sufficient machine learning knowledge must take responsibility for adoption decisions.
People with both skill sets are extremely rare, so you need an organization that can recruit both SREs and machine learning engineers and have them cooperate. Also, just like introducing DevOps, introducing MLOps is a decision that affects all developers, so it requires proper empowerment and organizational support. If you leave platform development to a specific team without convincing the others that the adoption will succeed, you are unlikely to win the cooperation of all developers.
So far, I have tried to explain what MLOps is and what is needed for its introduction.
As with conventional software development methodology, the purpose of MLOps can be broken down into improving productivity (through maintainability and extensibility) and improving quality both before and after release. However, because the new technology of machine learning is involved, the cost of dealing with new elements such as data, experiments, and behavioral uncertainty is rising on top of the code and systems already being dealt with. Organizational and management costs are also rising as the skill sets involved in development diversify across data science and domain knowledge. It is this rise in development costs that is driving attention to MLOps.