Data Management Challenges In Production Machine Learning - Phdassistance

Data Management Challenges in Production Machine Learning: An Academic presentation by Dr. Nancy Agnes, Head, Technical Operations, Phdassistance Group, www.phdassistance.com


Machine learning's importance in modern computing is hard to overstate. It is becoming increasingly popular as a method for extracting knowledge from data and for tackling a wide range of computationally difficult tasks, including machine perception, language understanding, health care, genetics, and even the conservation of endangered species [1]. The term machine learning often describes the one-time application of a learning algorithm to a given dataset. The user of machine learning in these instances is usually a data scientist or analyst who wants to experiment with it or use it to extract knowledge from data.

Learn More: https://bit.ly/3zk9UkG Contact Us: Website: https://www.phdassistance.com/ UK: +44 7537144372 India: +91-9176966446

Transcript of Data Management Challenges In Production Machine Learning - Phdassistance

Page 1: Data Management Challenges In Production Machine Learning - Phdassistance

Data Management Challenges in Production Machine Learning

An Academic presentation by Dr. Nancy Agnes, Head, Technical Operations, Phdassistance Group, www.phdassistance.com

Page 2: Data Management Challenges In Production Machine Learning - Phdassistance

Today's Discussion

Production Machine Learning: Overview and Assumptions

Data Issues in Production Machine Learning

Enrichment

Future Scope

Page 4: Data Management Challenges In Production Machine Learning - Phdassistance

The user of machine learning in these instances is usually a data scientist or analyst who wants to experiment with it or use it to extract knowledge from data.

Our focus here is different: we consider machine learning as deployed in production.

This entails building a pipeline that reliably ingests training datasets as input and produces a model as output, in most cases continuously, while gracefully handling various forms of failure.

This scenario usually involves a group of engineers who devote a substantial amount of their time to the less glamorous parts of machine learning, such as maintaining and monitoring machine learning pipelines.

Page 5: Data Management Challenges In Production Machine Learning - Phdassistance

PRODUCTION MACHINE LEARNING: OVERVIEW AND ASSUMPTIONS

A high-level representation of a production machine learning pipeline is shown in Figure 1.

The training datasets that will be provided to the machine learning algorithm are the system's input.

The result is a machine-learned model, which is picked up by serving infrastructure and combined with serving data to provide predictions.
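The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the architecture of any particular system: the function names `train` and `serve` and the midpoint-threshold "model" are hypothetical choices made for this example.

```python
from statistics import mean

def train(training_data):
    """Learn a toy 'model' (a single threshold) from (value, label) pairs."""
    positives = [x for x, y in training_data if y == 1]
    negatives = [x for x, y in training_data if y == 0]
    # Hypothetical model: the midpoint between the two class means.
    return {"threshold": (mean(positives) + mean(negatives)) / 2}

def serve(model, serving_data):
    """Combine the trained model with serving data to produce predictions."""
    return [1 if x > model["threshold"] else 0 for x in serving_data]

# Training data is the input, the model is the output of training,
# and the serving step turns model plus serving data into predictions.
model = train([(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)])
predictions = serve(model, [1.5, 7.0])
```

In a production setting the two steps would run as separate, continuously operating systems; here they are plain functions so the data flow stays visible.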


Page 6: Data Management Challenges In Production Machine Learning - Phdassistance
Page 7: Data Management Challenges In Production Machine Learning - Phdassistance

Many of the issues we'll discuss below are also applicable in a pure streaming system, as well as for one-time data processing on a single batch.


Page 8: Data Management Challenges In Production Machine Learning - Phdassistance

DATA ISSUES IN PRODUCTION MACHINE LEARNING

The primary issues in handling data for production machine learning pipelines are discussed in this section.

UNDERSTANDING

Engineers who are first setting up a machine learning pipeline spend a large amount of time evaluating their raw data.

Page 9: Data Management Challenges In Production Machine Learning - Phdassistance
Page 10: Data Management Challenges In Production Machine Learning - Phdassistance

This procedure entails computing and visualising key aspects of the data, as well as recognising any anomalies or outliers.

It can be difficult to scale this technique to enormous amounts of training data.

Techniques established for online analytical processing [3], data-driven visualisation recommendation [4], and approximate query processing [5] can all be used to create tools that help people comprehend their own data.
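One simple way to keep such an analysis tractable on large data is to summarise a uniform random sample instead of the full dataset. The sketch below uses reservoir sampling (Algorithm R) and a three-standard-deviation outlier rule. Both are illustrative choices rather than techniques prescribed by the presentation, and the function names, the sample size and the threshold `z` are assumptions.

```python
import random
from statistics import mean, stdev

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample

def summarize(values, z=3.0):
    """Summarise a numeric feature and list values far from the mean."""
    mu, sigma = mean(values), stdev(values)
    outliers = [v for v in values if sigma > 0 and abs(v - mu) > z * sigma]
    return {"count": len(values), "mean": mu, "stdev": sigma, "outliers": outliers}

# Summarise 1,000 sampled values instead of all 100,000.
sample = reservoir_sample(range(100_000), k=1000)
summary = summarize(sample)
```

The statistics computed on the sample are approximations of the full-data statistics, and rare outliers can be missed by sampling, so a sketch like this suits exploration better than validation.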

Another important step for engineers is to figure out how to encode their data into features that the trainer can understand.
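As a small illustration of such an encoding, a categorical value can be turned into a one-hot vector over a vocabulary built from the training data. This is a minimal sketch; `build_vocabulary` and `one_hot` are hypothetical helper names, and mapping unseen values to an all-zero vector is just one possible convention.

```python
def build_vocabulary(values):
    """Map each distinct categorical value to a stable integer index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, vocabulary):
    """Encode a value as a one-hot vector; unseen values map to all zeros."""
    vector = [0] * len(vocabulary)
    index = vocabulary.get(value)
    if index is not None:
        vector[index] = 1
    return vector

vocab = build_vocabulary(["US", "IN", "UK", "IN"])
encoded = one_hot("UK", vocab)      # "UK" is the second entry in sorted order
unseen = one_hot("FR", vocab)       # not in the vocabulary
```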


Page 12: Data Management Challenges In Production Machine Learning - Phdassistance

In order to design a maintainable machine learning pipeline, it is critical to clearly identify explicit and implicit data dependencies, as described in [2].

Many of the tools developed for data-provenance management may be used to track some of these dependencies, allowing us to better understand how data travels through these complicated pipelines.
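A very small sketch of explicit dependency tracking: each artifact records its direct inputs, and a traversal recovers everything it depends on transitively. The artifact names and the `upstream` helper are hypothetical, and real provenance tools record far more (versions, timestamps, transformations).

```python
def upstream(dependencies, node):
    """Return every source that `node` depends on, directly or transitively."""
    seen = set()
    stack = list(dependencies.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies.get(current, []))
    return seen

# Hypothetical pipeline: artifact -> list of direct inputs.
dependencies = {
    "model": ["features"],
    "features": ["clicks_table", "user_profile"],
    "user_profile": ["signup_logs"],
}
sources = upstream(dependencies, "model")
```

A query like this answers "which models are affected if `signup_logs` changes?" once it is run for every model. Implicit dependencies, which are never declared, remain invisible to it.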

Page 13: Data Management Challenges In Production Machine Learning - Phdassistance

VALIDATION

The validity of the data has a significant impact on the quality of the resulting model.

Validation entails ensuring that training data contains the expected features, that these features have the expected values, that features are associated with one another as expected, and that serving data does not diverge from training data.

Some of these issues can be solved by using well-known database system technologies.

Page 14: Data Management Challenges In Production Machine Learning - Phdassistance

For example, the expected features and the characteristics of their values can be encoded using something akin to a schema over the training data.
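A schema of this kind can be sketched as a plain dictionary that is checked against each example. The feature names, the bounds and the `validate` helper are assumptions made for illustration and do not correspond to any specific validation library.

```python
# Hypothetical schema: expected features and constraints on their values.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "allowed": {"IN", "UK", "US"}},
}

def validate(example, schema):
    """Return a list of human-readable violations for one example."""
    errors = []
    for name, spec in schema.items():
        if name not in example:
            errors.append(f"missing feature: {name}")
            continue
        value = example[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            errors.append(f"{name}: below minimum")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{name}: above maximum")
        if "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{name}: unexpected value")
    # Features that the schema does not know about are reported as well.
    for name in example:
        if name not in schema:
            errors.append(f"unexpected feature: {name}")
    return errors

ok = validate({"age": 30, "country": "UK"}, SCHEMA)
bad = validate({"age": -1}, SCHEMA)
```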


Furthermore, machine learning introduces new constraints that must be verified, such as bounds on the drift in the statistical distribution of feature values in the training data, or the use of an embedding for some input feature if and only if other features are normalised in a specified way.
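A drift bound of this kind could be checked as follows for a categorical feature. The distance measure (the largest change in probability of any single value) and the bound of 0.1 are assumptions chosen for the example; the presentation does not prescribe a particular measure.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a categorical feature."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def linf_distance(p, q):
    """Largest absolute change in probability over all observed values."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_within_bound(previous, current, bound=0.1):
    """True if the feature's distribution moved by at most `bound`."""
    return linf_distance(distribution(previous), distribution(current)) <= bound

yesterday = ["a"] * 50 + ["b"] * 50
today_small = ["a"] * 55 + ["b"] * 45    # shift of 0.05 per value
today_large = ["a"] * 80 + ["b"] * 20    # shift of 0.30 per value
small_ok = drift_within_bound(yesterday, today_small)
large_ok = drift_within_bound(yesterday, today_large)
```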


Page 15: Data Management Challenges In Production Machine Learning - Phdassistance

Furthermore, unlike in a traditional DBMS, any schema over training data must be flexible enough to allow changes in training data features as they reflect real-world occurrences.

In production machine learning pipelines, the difference between serving and training data is a primary source of issues.

The underlying problem is that the data used to build the model differs from the data the model is applied to at serving time, which almost always means that the predictions provided are inaccurate.
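Two simple symptoms of such a mismatch can be detected mechanically: features that were present in training but never appear at serving time, and numeric serving values outside the range seen in training. The sketch below checks only these two symptoms, and the `skew_report` helper and its report format are hypothetical.

```python
def skew_report(training, serving):
    """Compare two lists of example dicts and report simple mismatches."""
    train_features = set().union(*(ex.keys() for ex in training))
    serve_features = set().union(*(ex.keys() for ex in serving))
    report = {
        "missing_in_serving": sorted(train_features - serve_features),
        "out_of_range": [],
    }
    for name in sorted(train_features & serve_features):
        train_values = [ex[name] for ex in training if name in ex]
        # Range checks only make sense for numeric features.
        if not all(isinstance(v, (int, float)) for v in train_values):
            continue
        low, high = min(train_values), max(train_values)
        if any(
            isinstance(ex.get(name), (int, float)) and not low <= ex[name] <= high
            for ex in serving
        ):
            report["out_of_range"].append(name)
    return report

training = [{"age": 20, "clicks": 3}, {"age": 40, "clicks": 5}]
serving = [{"age": 95}]
report = skew_report(training, serving)
```

A value outside the training range is not necessarily an error, so a report like this is better treated as a prompt for investigation than as an automatic rejection.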


Page 16: Data Management Challenges In Production Machine Learning - Phdassistance

The final stage is to clean the data in order to correct the problem.

Cleaning can be accomplished by addressing the source of the problem.

Another option is to patch the data within the machine learning pipeline as a temporary workaround until the fundamental problem is properly fixed.
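A temporary patch inside the pipeline might look like the sketch below, which fills in missing values and clamps out-of-range values. The `patch_example` helper, the defaults and the bounds are hypothetical, and whether filling or clamping is appropriate depends on the feature and the model.

```python
def patch_example(example, defaults, bounds):
    """Return a patched copy of an example; the original is left untouched."""
    patched = dict(example)
    # Fill features that are absent or explicitly None.
    for name, default in defaults.items():
        if patched.get(name) is None:
            patched[name] = default
    # Clamp numeric features into their expected range.
    for name, (low, high) in bounds.items():
        if name in patched:
            patched[name] = min(max(patched[name], low), high)
    return patched

raw = {"age": 150, "country": None}
fixed = patch_example(raw, defaults={"country": "unknown"}, bounds={"age": (0, 120)})
```

Because the patch hides the symptom rather than the cause, it is worth logging how often it fires so that the upstream fix is not forgotten.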

This method is based on a large body of research on database repair for specific sorts of constraints [4].

A recent study [6] looked at how similar strategies could be applied to a specific class of machine learning algorithms.

Page 17: Data Management Challenges In Production Machine Learning - Phdassistance


ENRICHMENT

Enrichment is the addition of new features to the training and serving data in order to increase the quality of the created model.

Joining in a new data source to augment current features with new signals is a common form of enrichment.

Discovering which extra signals or changes can meaningfully enrich the data is a major difficulty in this situation.
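The join-based form of enrichment mentioned above can be sketched as a left join of a new source onto the existing examples. The key name `user_id`, the `region` signal and the `enrich` helper are hypothetical and serve only to illustrate the idea.

```python
def enrich(examples, new_source, key, signals, default=None):
    """Left-join `new_source` (key value -> extra signals) onto each example."""
    enriched = []
    for example in examples:
        extra = new_source.get(example.get(key), {})
        row = dict(example)
        for name in signals:
            # Examples without a match keep the feature, filled with `default`.
            row[name] = extra.get(name, default)
        enriched.append(row)
    return enriched

examples = [{"user_id": 1, "clicks": 3}, {"user_id": 2, "clicks": 0}]
new_source = {1: {"region": "UK"}}
enriched = enrich(examples, new_source, key="user_id", signals=["region"])
```

The same join has to be applied to serving data as well, otherwise the enrichment itself becomes a source of difference between training and serving data.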

Page 18: Data Management Challenges In Production Machine Learning - Phdassistance

A catalogue of sources and signals can serve as a starting point for discovery, and recent research has looked into the difficulty of data cataloguing in many contexts [7] as well as the discovery of links between sources and signals.

Another significant issue is helping the team understand the increase in model quality achieved by adding a specific set of features to the data.

This information will help the team decide whether to devote resources to applying the enrichment in production.
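One straightforward, if expensive, way to estimate such a gain is to train the same kind of model with and without the candidate features and compare accuracy on held-out data. The sketch below does this with a nearest-centroid classifier on toy data. The classifier, the accuracy metric, the data and all function names are assumptions made for illustration, not the method of the cited study.

```python
from statistics import mean

def fit_centroids(rows, labels, features):
    """Compute one mean vector per label over the chosen features."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(r[f] for r in members) for f in features]
    return centroids

def predict(centroids, row, features):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(centroid):
        return sum((row[f] - centroid[i]) ** 2 for i, f in enumerate(features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(train, test, features):
    """Train on `train` and return the fraction of correct labels on `test`."""
    (train_rows, train_labels), (test_rows, test_labels) = train, test
    centroids = fit_centroids(train_rows, train_labels, features)
    hits = sum(predict(centroids, r, features) == y
               for r, y in zip(test_rows, test_labels))
    return hits / len(test_rows)

def quality_gain(train, test, base_features, new_features):
    """Held-out accuracy with the new features minus accuracy without them."""
    return (accuracy(train, test, base_features + new_features)
            - accuracy(train, test, base_features))

# Toy data: feature "a" alone is misleading, feature "b" separates the classes.
train = ([{"a": 1, "b": 0}, {"a": 3, "b": 0}, {"a": 2, "b": 10}, {"a": 5, "b": 10}],
         [0, 0, 1, 1])
test = ([{"a": 3.2, "b": 0}, {"a": 2.5, "b": 10}], [0, 1])
gain = quality_gain(train, test, base_features=["a"], new_features=["b"])
```

On realistic data the estimate would need a larger held-out set or cross-validation before it could support a decision about production resources.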


Page 19: Data Management Challenges In Production Machine Learning - Phdassistance

This topic was investigated in a recent study [3] for the situation of joining with new data sources and a certain class of methods, and it would be interesting to consider extensions to additional cases.

Another wrinkle is that data sources may contain sensitive information and consequently may not be accessible unless the team goes through an access review.

Page 20: Data Management Challenges In Production Machine Learning - Phdassistance

However, going through a review and gaining access to sensitive data can incur operational costs.

As a result, it's worth considering whether the enrichment effect can be approximated in a privacy-preserving manner without access to sensitive data, in order to assist the team in deciding whether to apply for access.

One option is to use techniques from privacy-preserving learning [7], although past research has focused on learning a privacy-preserving model rather than on simulating the influence of new features on model quality.

Page 21: Data Management Challenges In Production Machine Learning - Phdassistance
Page 23: Data Management Challenges In Production Machine Learning - Phdassistance

Achieving the vast potential of big data demands a thoughtful, holistic approach to data management, analysis and information intelligence.

Across industries, organizations that get ahead of big data will create new operational efficiencies, new revenue streams, differentiated competitive advantage and entirely new business models.

Business leaders should start thinking strategically about how to prepare their organizations for big data.
