Data Science

Having worked on data science projects since the early 2000s, I have built up various things I can say about the data science process. A few of the more important/useful items are collected here. The too-long-didn't-read summary of the most important items is below, followed by comments and considerations on the end-to-end data science process in more detail.

There are other existing data mining processes out there, such as CRISP-DM, but they do not usually consider the extra modern requirements of dealing with very large datasets and using machine learning (ML) as much as we do here. The focus here on machine learning being part of the process is because data complex enough to require data science (as opposed to more traditional analytics or statistics) usually also requires ML to be able to model it.

Most of the below is not sourced, and as such constitutes (hopefully) informed opinion as opposed to iron-clad and referenced facts. As a former actual scientist, I am always open to having anything expressed here debated and improved upon. This is one reason for writing it all here rather than as a series of blog entries on, say, Medium: here it can be more easily updated.

Why write all this at all? Partly, it helps me clarify my own thinking on all the steps, and if I get feedback maybe I’ll learn something, but also because I haven’t seen many practical guides to all the steps of the process. There’s more to it than building models …


Summary: The Most Important Things

These are from my experience. Others’ opinions may vary.

  • Business is an equal component with data science and engineering (IT, architecture).
  • The problem should be stated in business terms. “Build a better classifier” is not a business goal.
  • Buy-in from executive level is needed so that a successful model gets implemented: how is it actionable? Mock up the expected result.
  • Data preparation must be done properly, otherwise the rest will fail.
  • Production must be considered throughout: purpose, scale, existing business logic, integration with existing company systems. This has been described as “run a pilot, not a POC”.
  • The data science process is iterative between data prep ↔ models ↔ production, and cannot be done one-way from one step to the next.
  • Good models are important but there is no substitute for domain knowledge or better data.
  • Present the results in a comprehensible way to the intended audience (business, data science, engineering, other).
  • At least one technical person with authority should oversee the whole process to ensure coherency.
  • Requiring even a single line of code to be written within a given stage of the process means that only technical users can complete that stage.
  • If you want to sell your product, the most convincing item to have is successful, reference-able customers who are already using it.

Details: The Data Science Process

What is Data Science?

Everyone has their own favorite definition and they are all a paragraph long to try to capture its value and distinguish it from other areas.

My Take

Data science is the discipline of solving a problem end-to-end from problem statement, via data, to actionable results. When the skills required to deal with some aspect of the data exceed the norm expected for some other field (the “data” in data science), and the practitioner is required to formulate the questions to be asked (the “science” in data science), then what is being done becomes data science. Data science is fundamentally different from engineering, as it is concerned with going from data to knowledge, not construction of tools or products.

While not being the same as any of them, this overlaps with many other domains, including statistics, computer science, engineering, analysis, IT, and the subject domain the person is working within.

Examples of data science include translating a desire stated by a nontechnical person into a technical analysis that can be done with the available data; dealing with data that are large or complex in various ways; extracting information or predictions from nonlinear patterns in data only findable by models that allow arbitrarily complex mappings, e.g., machine learning; and making sure the whole process makes sense. None of these are exclusive to data science, but all can be part of it.

Don’t data scientists just build machine learning models?

Some people would say the data scientist just builds the machine learning models, and perhaps does the feature engineering. Then ETL engineers, MLOps people, and so on, deal with getting the output of the data scientist’s Jupyter notebooks to work properly with the rest of the system. This may be so in some cases, but it is too narrow a view of the subject. The entire end-to-end process, especially the business problem, is relevant to what constitutes a good solution, because the model is not the solution arrived at by data science; the final outcome is.

Isn’t it just software engineering?

This is another misconception I have seen: that because building a machine learning model is easier than it used to be, data science is now just a branch of software engineering. This is something I do not agree with, because most data scientists are not engineers. They are experimenters (hence data science as opposed to data engineering) who iteratively solve the business problem. Engineers’ primary responsibility is then that the result works correctly in a production setting. Each role is equally important; they are just different, and not branches of each other.

The Data Science Process

Having been involved in all steps of the data science process, this is a collection of my most significant considerations and learnings at each step.

General Considerations

  • The process is iterative, e.g., data prep ↔ model and model ↔ production (this, along with the ability to see data and visualizations inline, is why most data scientists prefer a notebook interface to a GUI or CLI)
  • If the company has more than a small data science team, a coherent platform is needed, otherwise IT will prevent or be unable to support the software needed. It also helps avoid teams being siloed and work being duplicated.
  • Data preparation is key: a model with improperly prepared data will always underperform
  • Because the process is fundamentally iterative from end-to-end, the agile/sprint model used in software engineering does not work as well for data science teams.

Examples of the need for coherence/iteration:

  • Data prep ↔ modeling: A supplied data feature may turn out to be a cheat variable that leaks information about the target into the training data, necessitating altering the data prep and/or the understanding of the data
  • Modeling ↔ production: Deployed ML models generally degrade in performance over time due to data drift and have to be retrained
  • Data prep ↔ production: Featurization in prep needs to be deployable with sufficiently low latency in production, which may necessitate simpler featurization than appears optimal from the purely modeling point of view, but still needs to perform well enough

In each case it is the data scientist or other similar technical person who understands what is happening on the two sides of the iteration and is thus able to correct it. This is why the end-to-end data science process cannot be completely siloed or done in a one-way manner of data prep then modeling then production.

Business Problem

This is always the place to start to have the best chance of a successful AI/ML project. If it is not clear what problem, written in business language, is being solved, it is less likely that even a successful model will be put into production to generate value. The points in this section apply equally to an internal project, or an engagement with a customer.

The business problem is of equal importance to the data science and engineering, and should be treated as such. This means appropriate time spent and level of understanding on the part of the data scientist.

(I am using “business problem”, but the same idea applies to non-business uses. E.g., in science, the “business problem” might be the hypothesis that the research is testing. Solving it, or obtaining a null result, is still the aim.)

Define success: How do we know when the problem is solved? The key is to have all parties agree on what success actually is. What result do all parties agree would constitute a successful outcome of the project?

A useful formulation to help define what problem is being solved is CoNVO, or Context, Needs, Vision, and Outcome. This sounds buzzword-y, but it helps to clarify that statements like “build a better classifier” are not describing a business problem. Why are we building a better classifier? If you say “to make more money”, then why will this better classifier make more money? Bringing these steps out is helpful in creating something that everyone can agree to.

Another useful early step is to mock up what the expected result of a project will look like before you start. Then get buy-in on this from the executive side. It could be used as an aid to the conversation that results in a written agreement on what constitutes success. If there is a customer involved, then this includes the customer. Include provisions not to change the project scope once it has been agreed.

Be sure that the project doesn’t just end with a successful model. Even if a model has good performance and can be deployed, many never are because no one took the actions implied by the model’s predictions. It may be that the data scientists (and even the engineers) see such actions as trivially obvious and not in need of explicit documentation. But if they don’t document them and the business side hasn’t sufficiently understood the meaning of the work, the model may simply be left unused because no one was responsible or motivated to take the next step. The lack of tangible actions or revenue from the project at a later date means that the executive side then deems it to have been a failure. Along with the difficulty of deploying models into production, this is one of the reasons for the well-known statistic that 85% of AI projects fail. If the business problem is well formulated, then solving it will incorporate this step of putting the model into action.

Once a formulation of the problem has been agreed upon by all parties, it should be signed by all as a written agreement. This gives all sides the accountability to carry the project through to success. Everyone on the team, not just the data scientists and businesspeople, should be aware of the statement of the problem being solved. This helps avoid some step in the process being nonsensical.

A final common problem that causes projects to fail is IT. IT is not a problem per se, indeed, it is vital to the successful function of any business. But especially in big companies there can be a large number of policies, procedures, and approvals that apply to the use of any new software, and data scientists always have a large stack of things they want to use.

If it is possible, this can be addressed by building a simple version of the project from end to end to be sure it can be done: make sure the obvious needed software can be imported, models built, and deployed. Then go back and iteratively improve the end-to-end analysis. It is better this way than, say, spending a lot of effort on one step only to find the next step is impossible (e.g., the library used for the great model does not pass security policies and so cannot be put into production). This has been referred to as “run a pilot, not a PoC”, but the idea applies regardless of nomenclature.

The pilot approach is an argument for the use of a data science or machine learning platform, where the environment and platform software have been approved. This approval needs to include the execution of arbitrary code so that the data scientist can write their analysis, which in turn argues for the environment to be containerized. It may even be the case that installation of any new software is not allowed by general users and must also be approved. This can in principle be solved by a set of libraries and versions being part of whatever container, but it is difficult to include everything that might be needed in advance, so approval of a secure and contained environment where the users can then install things is preferable.

Beyond all the above, there are various other technical details that often come up and are worth considering:

  • Although the problem should be well formulated, it is a bad idea to commit to a specific quantitative model performance: it may be impossible to reach because any data only has so much intrinsic information or signal. It is better to state something like “significant gain w.r.t. existing business value”.
  • It is easier to formulate success using supervised learning than unsupervised, because, as mentioned, you don’t want to commit to specific model performance. The performance can be more directly measured versus the training set ground truth than it can for methods such as clustering. When has the clustering become successful? Of course, unsupervised, etc., learning is not precluded because the aim is to solve the business problem, but if a project needs to be successful and it can be steered toward a supervised learning formulation this is generally better.
  • Avoid “science projects”, where a lot of time is spent investigating things just because they seem interesting. Data science and machine learning are such broad approaches that there are always unlimited tempting rabbit holes one can go down with the data, particularly if it is large. Experimentation is not precluded of course, but the eye should be kept on the problem. This is true even if the project is research, or even if it actually is a science project – even there the team has to know when to stop and publish the paper – but is particularly important when the driver is a business problem.

Management

Many people know a lot more than me about management. I have, however, seen the interaction of management with data science, so as with the other sections the data science process and its surrounding relevant ideas are the focus.

On bureaucracy: Most of the people in the data science process will be technical, and thus an instinctive reaction against any bureaucracy perceived to be unnecessary will be common. It should therefore be minimized, and trust should be placed in the people that they are motivated to do a good job. That is, they have intrinsic motivation and don’t need extra extrinsic motivators like too many rules. People generally don’t pursue these kinds of difficult technical jobs if they are not motivated. However, there are some areas where forcing a few things to be formalized is beneficial even though it can be irritating. An example already seen above is requiring a written problem statement and successful outcome agreed to by all parties. Others are mentioned where appropriate below. Often, the person leading the team, who has a managerial component in their job description already, can be responsible for most of the documenting, but they don’t always know every last detail of what each team member is doing, so others may need to help. Like many things it is a balance to be struck.

Someone technical should be included in the overseeing of the end-to-end process for coherence: Examples would be a principal data scientist, principal engineer, VP of data science or engineering, chief data scientist, CIO, or CTO. They do not have to be expert in every step, just fluent enough that they can understand each step when they see it. “End-to-end” includes everything from problem definition and context, through collection of raw data, to model deployment in production and the business actions taken. The overseeing may be in collaboration with a businessperson at the same level. The technical person should have the authority to direct the team members if needed (keep them all pointing in roughly the right direction), but micromanaging is generally not needed.

It should be clear who is responsible for executing each step: The RACI method (responsible, accountable, consulted, informed) is useful for this. This is worth the effort because it helps ensure people are not blocked waiting for someone else to complete some step, and it avoids the “bystander effect” of everyone waiting for someone else to complete some step because it is not clear who should be doing it.

Depending upon the scale of the problem and the team, there should be various roles filled besides data scientist: A small team may be generalists who pitch in with everything. As it grows, there should be an ETL specialist (because of the importance of data preparation), a data or machine learning engineer for production implementation, a technical person with business domain knowledge if others don’t have it, and a solution architect if multiple systems will be interacting. The people with the data scientist title become more focused on the featurization and machine learning stages, but the point about having a technical overseer above remains.

Don’t use sprints for data science: As detailed elsewhere here, the process is constantly iterative between data preparation, modeling, and production. It is not amenable to “sprints”, where a specific outcome is reached and then everyone moves on – all parts of the process may need to be revisited.

Daily “standup” meetings are helpful: This is something data science has in common with engineering. Like that field, few people in it like having lots of meetings each day, but given the complexity of end-to-end data science a short standup each day helps a lot in ensuring what everyone is doing remains coherent. In particular, having everyone state if they have anything blocking their work can address problems before they become longstanding.

Regulatory requirements, privacy, ethics, etc.: If the business problem is well-stated, then adhering to any necessary requirements such as these comes as part of it. What regulations apply, and what aspects of ethics are likely to be encountered, is domain dependent.

Rhetoric: The website for CoNVO also talks about rhetoric. This doesn’t mean manipulating people (well, not overtly), but refers to the 2,000-year-old subject of types of argumentation for or against some proposition. It is useful in data science, especially for management of a project, because typically many different kinds of people are communicating with each other. Considering the basics of rhetoric helps you be clearer about the various reasons people may be arguing for or against some aspect of the project. Knowing that data science’s claims to provide business value are often inductive rather than deductive, or that political considerations can be just as legitimate as technical considerations depending upon the context, helps everyone to think more clearly about what is going on day-to-day. It helps to smooth interactions with customers too. I won’t duplicate its content here, so take a look at its website for more.

IT

This was mentioned above, but is called out separately here too. The stereotype for data scientists, especially in large companies, is that they get stymied by IT. Security and other policies were often not written with the needs of our new field in mind: a large and constantly evolving stack of tools, and iteratively evolving projects whose requirements change. When it takes months for new tools (or even new datasets) to be approved, efficiently executing on a new AI project becomes impossible.

A complaint is not a useful end state, though, and enterprise IT has legitimate reasons to be cautious of new tools, especially ones that can execute arbitrary code, when the security of the company, its customers, and the company’s reputation may be at stake.

Some ways to smooth the process include:

  • Leverage the fact that machine learning models are very general. Algorithms that instantiate nonlinear ML models like decision trees and neural networks are highly flexible, and a good library that supports them such as TensorFlow or H2O can solve most business use cases, at least at the modeling stage. So it may not be necessary to support more than a few of them.
  • Use containers. These both allow the effects of executed code to be limited to within the container, and the libraries available along with their exact versions to be specified.
  • Alternatively, allow the data scientists to, in fact, use whatever tools are needed, and keep the project in a contained environment. If a project shows business value and has buy-in from the right decision makers, it should then be possible for whatever was used to be approved for production.
  • One such environment might be an ML platform for data scientists that uses containers. If such a platform is properly general then most data science work can be done within it.
  • Agree on a company-wide policy for what happens when data scientists do need a new library or dataset. Formulate the policy for data science specifically, as the field’s requirements do not match those of software engineering, analytics, etc. Any approval time longer than zero is too long from the user point of view, so minimize it.
  • Use the fact that most data scientists, while they are curious and like to move fast, don’t have the “let’s try to hack stuff and break it” mentality. They are unlikely to try to exploit your system just to see if they can. This may not actually change any policy, but it’s useful to be aware of.

Data

It is often the role of the data scientist to work with data that has already been gathered by someone else, and they do not have input into its collection. However, it is well-known that no matter how good an algorithm is, there is no substitute for better data. Data provides information, and it is not possible to generate new information that was not present somewhere in the data already. Every dataset contains a certain amount of information that can be extracted and when this limit is reached the only way to improve is to add better data.

The user should also be aware of whether the data are biased. This may be for legal or ethical reasons, but also simply from the point of view of ensuring the project gives correct results. If the data being supplied by someone else have already been altered somehow (perhaps downsampled in a non-random way, or with particular business rules or logic applied), then the data scientist (and the people supplying the data) needs to be aware of this. It may be that the bias simply needs to be stated as part of describing what data are being worked with, but when the model is deployed the same sampling, rules, etc., need to be applied to the new data coming in.

What data are discovered or gathered is obviously domain-specific and problem-dependent, but besides gathering the needed data for a project, another useful exercise can be data augmentation. Many facts about the world are generically true, and remain true independent of whatever problem is being solved. An example would be the correspondence between neighborhoods or areas of a city and their zipcodes. Since as mentioned there is no substitute for better data, adding information such as this that is relevant to the problem may result in improved model performance.
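
As a minimal sketch of this kind of augmentation, the following Python (pandas) snippet joins an invented customer table to an invented zipcode table; all of the column names and values are hypothetical, chosen only to illustrate the merge.

    import pandas as pd

    # Hypothetical customer data for a project.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "zipcode": ["10001", "94103", "60614"],
        "spend": [120.0, 340.5, 89.9],
    })

    # Generic, always-true facts about the world keyed on zipcode.
    zip_info = pd.DataFrame({
        "zipcode": ["10001", "94103", "60614"],
        "neighborhood": ["Chelsea", "SoMa", "Lincoln Park"],
        "median_income": [90000, 110000, 85000],
    })

    # A left join keeps every customer row and adds the zip-level facts
    # as extra features for modeling.
    augmented = customers.merge(zip_info, on="zipcode", how="left")
    print(augmented)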

Data Integration

The ideal situation from a user point of view is to have all of the data “in one place”. That is, in a location where any part of it can be easily accessed by the right people, without the need to use more than one computer system or access method.

In practice, however, it is not practical for most companies to achieve this situation anytime soon. Their data is siloed and will remain so, because the investment of resources required to consolidate data into one place outweighs the extra effort needed to work around the silos. For large data, simply moving it around may be prohibitive in itself.

However, since all the data that gets used in a project must have been accessible, it is possible to generate a coherent overview: the metadata can be consolidated, even if the data cannot be. Since as mentioned there is no substitute for better data, it is worth investing significant time in understanding what data are available to the company, and using a cataloging tool to do so. Such a tool needs to cope with arbitrary data types (flat files, databases, images, unstructured text, etc.), otherwise you will end up with just another division of the data, into data inside the catalog and data outside it.

For most real projects, a specialist such as an ETL engineer should be hired to do the job of data integration. You want someone who combines the engineering knowledge to deal with company systems, and the interest and motivation to do it properly. To most data scientists, data integration and preparation is “boring”, and they won’t do the best job if the cliche “80% of your time is data preparation” is forced upon them. Nevertheless, the technical lead of a project who has an overview of it end-to-end (CTO, lead data scientist, etc.) should be involved so that the integration that is done makes sense.

Data Formatting

It is useful to distinguish between data formatting and data preparation:

  • Data formatting is making the data such that it can be read in at all by the tools the analyzers will use, and can be passed through the whole analysis process
  • Data preparation is then making the data such that what is passed in makes sense

An example would be a numeric table that is formatted as strings and has corrupted rows with the wrong number of columns. If a tool cannot even read the file due to the corruption then correcting that is formatting. If the resulting file can then be read but it makes no sense for the values to be strings and not numbers then that is preparation. Obviously the dividing line depends upon the particular use case.
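
As a minimal Python (pandas) sketch of that dividing line, assuming a hypothetical file “measurements.csv” with a hypothetical “value” column (the on_bad_lines option requires a reasonably recent pandas version):

    import pandas as pd

    # Formatting: get the file to read at all. Corrupted rows with the
    # wrong number of columns are skipped rather than failing the read.
    df = pd.read_csv("measurements.csv", dtype=str, on_bad_lines="skip")

    # Preparation: make the values make sense. The numeric column arrived
    # as strings; coerce them, turning anything unparseable into NaN for
    # later missing-value handling rather than silently keeping bad strings.
    df["value"] = pd.to_numeric(df["value"], errors="coerce")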

If the data are large, the possibility of incorrect formatting is one argument for running a project as a pilot rather than a PoC, i.e., putting something through at full scale end-to-end even if the result is not optimal. Say there is an old tool in the pipeline that has a 2 GB file size limit: you want to find that out early and choose another tool, rather than make a dataflow with a data subsample only to hit the limit later when you scale up and have to rewrite the whole pipeline.

Data should also be sanity checked, especially if they are large. For example, if the data are a billion rows, then a one-in-a-million event will happen on average 1,000 times. A good data scientist will do many of their own sanity checks, but these will tend to be more oriented toward the data content than engineering or ETL aspects. Particularly important to check for are errors that might cause an incorrect result, but do so silently. For example, if two datasets are to be cross-matched and one is UTF-8 encoded while the other is Latin-1 encoded, data rows that should have matched may not do so, and vice versa. Such an incorrect match might be subtle if the result looks largely the same as a correct match would have, but machine learning models are nonlinear and so any part of the data might get magnified to great importance. If a part that is incorrect gets magnified, then you could end up with a completely spurious result without realizing it.
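
A minimal Python (pandas) sketch of guarding against that particular silent failure, assuming two hypothetical customer files and a hypothetical “name” column to match on:

    import unicodedata
    import pandas as pd

    # Read each file with its actual encoding rather than the default.
    a = pd.read_csv("customers_utf8.csv", encoding="utf-8")
    b = pd.read_csv("customers_latin1.csv", encoding="latin-1")

    def normalize(s):
        # Unicode-normalize and strip whitespace so equivalent strings
        # (e.g., accented names) compare equal across the two sources.
        return unicodedata.normalize("NFC", str(s)).strip()

    a["name_key"] = a["name"].map(normalize)
    b["name_key"] = b["name"].map(normalize)

    matched = a.merge(b, on="name_key", how="inner")

    # Sanity check: a surprisingly low match rate is a red flag for
    # exactly this kind of silent encoding or formatting problem.
    print(len(a), len(b), len(matched))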

Other obvious checks include that the data are not corrupted (e.g., the wrong number of columns will cause a file read to fail in many tools), and that they are of a type that can be used (string, float precision, and so on). Many checks will be industry-, domain-, or company-specific, so it is not possible to list them all.

Data Preparation

Once the data is in a state where it can be read in by the user to their tools, it needs to go from something that is readable to something that makes sense to pass to a machine learning (or other) model. Data preparation can include feature engineering, which is described below, and exploratory data analysis (EDA), which is included in this section. Although as mentioned above, data preparation is often regarded as unglamorous or boring, it is crucial to do it well because a model can never be better than the data that is fed to it. It may be tempting to skimp on data prep if the result is empirically good enough, but because of the nonlinearity of ML models, less than rigorous data prep risks unexpected and bad results from your models.

Some key points regarding data preparation are:

  • It’s iterative: exploration and understanding of the data, and preparing it, are iterative. For example you may plot it, discover a bunch of values of 9999.9, remove them, and plot it again. This iterative nature continues to the modeling and even production stages. So data preparation can never be regarded as having been finalized.
  • You want to visualize all of the data: If you don’t visualize 100% of the dataset, you don’t know what is in it. The taking of subsamples, and machine learning algorithms, make various assumptions about the data that may not be valid: subsamples will miss outliers, and your ML may assume that the rows are independent and identically distributed when in fact they are not. Ideally, if your data is say 1 billion rows, then you want the number of points in your visualization limited by the number of pixels on your screen and not anything else. Then it should be zoomable so if there are more points than pixels you can see the points. (Or group them, or whatever.) In practice of course, such an ideal situation may not be possible, but in most projects trying to get as close as one can to it will be worth the effort.
  • Have someone to help: The data scientist understands the importance of data preparation but does not want to spend 80% of their time cleaning and preparing the data. Have someone whose specialist role is ETL engineer to help. This is not because the data scientist is some superior being and data prep is beneath them, but because a specialist will do a better job than someone who has to also run the machine learning, think of features to engineer, present the project, and so on. As above (and below), have the lead data scientist or equivalent involved to ensure coherency.
  • Understand the data: Surprisingly often, the data science team in a business is asked to work with some data that another team or a customer has, and no one knows what it actually represents. The documentation may be inadequate or nonexistent, or even if it is good, may require domain knowledge of some part of the business or industry that the data scientist doesn’t have. The best solution in these cases is to talk to the people supplying the data. An hour or two in a meeting going through the data columns with their technical people is almost always worth the investment. Then document what was concluded. When no one knows for sure what the data are, and assumptions do have to be made about what something is representing, then document those too.
  • Missing values: Real data essentially always has missing values. People don’t fully fill out surveys, images have bad pixels, sensors fail, and even machine logs may output values like NaN rather than stopping logging. Missing values can occur in data in various ways, but their appearance is usually not random, so simply removing rows that contain them will bias your result. The resolution depends on the domain and the problem, but in general the data flow needs to be robust to at least some form of missing values. Things like “sum()” need to actually be, say, “nansum()”: the sum of (1, 2, 3, NaN) is NaN, and the NaN then propagates, whereas the nansum of the same values is 6 (see the sketch after this list). The other common method of dealing with missing values is to estimate what they should have been. Assuming that they should have been anything – sometimes values are legitimately empty – a common method to use is some form of interpolation. More sophisticated, and usually needing more resources, is to use ML itself to predict the missing values, with the rest of the data used for training.
  • Curse of dimensionality: This is when the data is hard to deal with because of the number of dimensions that it has. This manifests in 2 major ways – hard to prepare, and hard to run in ML. ML is discussed below, but real projects often get given data that may have, for example, hundreds of columns of which only a few are relevant (but you don’t know which), too many columns to usefully visualize, more columns than your ML algorithms can deal with, or even more columns than rows, making it unlikely that a training set is representative of the underlying population. There are various ways to mitigate the issue, but at some point, if “big data = big preparation”, there may have to be a tradeoff. If there are so many columns that it is not feasible to prepare each one fully and correctly, then some of them need to be removed. The solution is often some form of dimension reduction. This can be regarded as part of feature engineering, so see below. The impact of dimensionality on the data prep is of course problem-dependent.
  • Outlier/anomaly detection: Some people distinguish outliers as rare events that can sometimes happen (the temperature was 120F), and anomalies as events that should never happen (the temperature was -100 Kelvin). The distinction and its importance is problem−dependent, but knowledge of outliers or anomalies in your data is often crucial because they can mess up the results. Anomalies may be outright wrong values that should be removed, they may make an otherwise nice visualization routine fail or not show anything useful, and they can badly affect some ML algorithms that are not robust to them.
  • Watch for information leakage: When using supervised learning, the fact that you don’t want to allow information from your training set to be present in your testing set is well−known. However, in real projects with complex sets of datasets, dataflows, and intermingled information, it is easy to inadvertently let information leakage occur. At worst this can lead to spurious results that then cause the resulting models to fail in production. An example of potential leakage I have seen is a database of sales where a record with information available about the customer and the transaction prior to the purchase (customer demographics, previous interactions, etc.) was then updated with information available only after the purchase (what did they buy, when, how much). What they bought of course could be the ground truth for a model, but the other information is also only available after an event that someone might want to predict has occurred, and therefore also needs to be excluded. This means in this case that every database field has to be considered as to whether it is valid to use in training, and then the information disentangled to produce a valid training set. Avoiding information leakage has to be done as part of careful consideration of the business problem being solved.
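
To make the missing-values point above concrete, here is a minimal Python sketch using numpy and pandas on invented values; whether ignoring or estimating the missing values is appropriate is, as noted, domain-dependent.

    import numpy as np
    import pandas as pd

    values = np.array([1.0, 2.0, 3.0, np.nan])

    print(np.sum(values))     # nan -- the NaN propagates through the sum
    print(np.nansum(values))  # 6.0 -- the missing value is ignored

    # Estimating missing values instead of ignoring them: simple linear
    # interpolation on a hypothetical sensor series.
    sensor = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])
    print(sensor.interpolate())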

Data Featurization

Feature engineering of data is a whole field in itself, and one that remains unsolved, as much art as science. Not because there are not legitimate techniques to use, but because it is hard to know which will work well for any given problem. Engineering features well, however, is undoubtedly the key to getting the best model performance. Although the information in the features is in principle already there in the data, and one might expect a nonlinear model that can approximate arbitrary mappings to “just find” the answer, in practice engineering features from the data helps the model see important patterns more easily.

A classic and simple example of featurization is predicting someone’s financial wellbeing: their credit limit might be available, along with their current balance, but adding the ratio of their credit balance to their credit limit (how much of the credit they have used) usually does better than the two raw features alone.
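
A minimal Python (pandas) sketch of that engineered ratio, on invented numbers; the guard against a zero credit limit anticipates the robustness point made further below.

    import pandas as pd

    # Hypothetical raw features.
    df = pd.DataFrame({
        "credit_limit": [1000.0, 5000.0, 0.0],
        "credit_balance": [900.0, 500.0, 0.0],
    })

    # Engineered feature: utilization = balance / limit, left missing
    # where the ratio is undefined (limit of zero).
    df["utilization"] = (df["credit_balance"] / df["credit_limit"]).where(
        df["credit_limit"] > 0
    )
    print(df)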

Some people will respond “just use deep learning, it doesn’t need feature engineering”. Well this may be true in some cases, but deep learning is only suitable for solving a subset of business problems: those with sufficient data to be able to train a network, and in whose context the available interpretability of the resulting network is acceptable. Also, not all deep learning architectures can produce all kinds of features, because, e.g., they might cascade data from one layer to another, which doesn’t necessarily catch every statistical dependency. For business cases outside of these situations, feature engineering is still required.

Featurization can be divided up in various ways. One way that I have found useful is:

  • Feature filtering
  • Feature selection
  • Feature engineering

Feature filtering refers to the removal of dataset columns before the ML model is run. Thus it uses information only from the data and not from the output of the model.

Feature selection is choosing dataset columns iteratively with the model, using the empirical performance of the model on the data to guide the selection of which columns to ultimately use. The final model of course should be tested on unseen testing data, and that result not used to further iterate, otherwise the result will be biased.

Feature engineering in this context then is the creation of new columns from existing ones. This is perhaps the most difficult of the three because it is the most unconstrained. How do you know, for example, to try the ratio of the two items of credit information in the example above, rather than any of the other possible ratios between columns (and there may be hundreds)?

It is also useful to consider differentiating between featurizations that are generic (ratios of columns, polynomial expansions, correlation with labels, etc.) and those that are domain-specific. This is because a featurization that can be formulated generically can be automated, reducing the need to duplicate work over different analyses or projects. While automation is becoming more useful, it still needs to be applied carefully because there are far more possible features that can be generated from any data than there are ones that will improve the model performance. Depending on the algorithm, creation of too many non-useful features can actually degrade model performance.

If auto-featurization can be applied usefully, it is worth doing, because it has the same advantages as auto-ml (saving user time, finding combinations they might not), and particularly if auto-featurization and auto-ml can be combined, i.e., feature filtering, selection, and engineering all being iteratively improved at once along with the model hyperparameters, this is potentially very powerful. Doing this well, however, remains an unsolved problem.

Another important subtlety in feature engineering, and another example of why you want to run an end-to-end pilot to prototype your project, is that whatever feature engineering is done in training must also be done in production. This is not so much a question of whether it can be done, but whether it can be done fast enough. In production setups where the data are coming in in real time and the deployed model needs to provide a response with low latency (e.g., it’s part of an interactive website), the raw data must be featurized and passed to the model quickly enough that the model can output its prediction in the required timescale. The need for low latency in deployment may place constraints on the computational complexity of the featurizations that can be considered, even at the expense of ultimate model performance. If your model works great in training but is too slow in production due to high latency, it has not produced a useful solution to the business problem.
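
One way to catch this early is simply to time the featurization against whatever latency budget the business problem implies. A rough Python sketch, where featurize() is a hypothetical stand-in for the real transformation:

    import time

    def featurize(raw_record):
        # Stand-in for whatever featurization was used in training.
        return [raw_record["balance"] / max(raw_record["limit"], 1.0)]

    record = {"balance": 900.0, "limit": 1000.0}

    # Average the cost over many calls to get a stable per-record estimate.
    start = time.perf_counter()
    for _ in range(10_000):
        featurize(record)
    elapsed_ms = (time.perf_counter() - start) / 10_000 * 1000

    # Compare against the latency the deployment actually allows.
    print(f"mean featurization latency: {elapsed_ms:.4f} ms")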

Interpretability is likewise another featurization consideration. Many settings require ML models to have interpretable output, such as reason codes for making a given decision like denying someone credit. Such outputs are usually expressed in terms of the input data, which means that the data columns being input into the model need to be human-interpretable. This is fine with methods like feature filtering and selection, but may be less so if something convoluted is created with feature engineering, or some method that replaces the original columns like principal component analysis is used.

Feature filtering and selection are ways of selecting a subset of the features in the data that performs better than using all of the columns (either performance is higher, or performance is similar but it runs quicker, or some other benefit). If a subset of features has been selected, then this is dimension reduction. Dimension reduction might be done by filtering purely based upon the data (which columns correlate with the ground truth labels, for example), or by selection in an iterative manner with the performance of the ML model. A simple example of the latter is backward elimination: run the model, remove some columns and see which subset performs best, and iterate. Forward selection is the same but starting with no columns and adding them, or the two can be combined. A more sophisticated variation is backward elimination by variable importances: iteratively remove the columns of lowest variable importance. Importances have their own issues, but this can be useful with large real datasets because most of the columns will have zero importance and can be immediately eliminated.
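
A minimal sketch of backward elimination by importances, using scikit-learn’s recursive feature elimination driven by a random forest’s variable importances on synthetic data (most of the generated columns are noise); the number of features to keep and the step size here are arbitrary choices, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    # Synthetic data: 50 columns, only 5 of which carry signal.
    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=5, random_state=0)

    selector = RFE(
        estimator=RandomForestClassifier(n_estimators=100, random_state=0),
        n_features_to_select=10,  # how far to eliminate is problem-dependent
        step=5,                   # drop the 5 least important columns per round
    )
    selector.fit(X, y)

    print("kept columns:",
          [i for i, keep in enumerate(selector.support_) if keep])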

The other obvious method of dimension reduction is finding the most important components of the data and using those instead of the original columns, i.e., principal component analysis (PCA), or the more general version of it, singular value decomposition. These can be useful because you don’t have to know which columns to choose, but they come with the caveats that the resulting columns are less (or not at all) interpretable, they work most easily with numerical data, and they make assumptions about the nature of the data, such as that it comes from a Gaussian distribution. Methods like PCA can be generalized to nonlinear versions, but these usually become more computationally intensive and do not get any more interpretable.
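
A minimal PCA sketch in Python (scikit-learn) on a standard numeric dataset, keeping enough components to explain 95% of the variance; scaling first matters because PCA is variance-based.

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_wine(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # A float n_components keeps as many components as needed to reach
    # that fraction of explained variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X.shape, "->", X_reduced.shape)
    print(pca.explained_variance_ratio_)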

As with data preparation above, the curse of dimensionality appears in feature engineering. If there are many columns, which ones to use? If we want to cluster the data, how do we do it? In high dimensional space essentially all data points are in the corners and all the same distance apart. How do we know which columns are good to select if we don’t have the resources to prepare them all first? While there are methods to do high-dimensional clustering, such as self-organizing maps (SOM), some of these questions may not have good answers for a given problem. It is then up to the data scientist (or feature engineering or data prep specialist) to use their judgement to find the best solution.

Similarly, feature engineering has to be robust to other real-world aspects of your data such as missing values and outliers. Furthermore, values that may not have been bad on their own may become bad when put into some combination when being feature engineered, e.g., values of zero may be fine in their own columns, but their ratio cannot be taken because division by zero is undefined.

Which featurizations can be used on which columns obviously depends upon what type of data they are: numerical columns allow different things to be done than do categorical columns. Or the data might be unstructured, sparse, images, time series, spatial, and so on.

In a situation where there are lots of data types, one thing that might be borne in mind is that most data types can be converted from their original form into a graphical representation. Then all the information from different columns can be combined and more signal available for training the model. It does, however, mean that the user needs to have both the knowledge and tools to use graphical models effectively, which is not usually the first thing most data scientists learn.

A final idea that is not new but has caught on recently is an obvious extension to starting from a blank sheet of featurizations: the feature store. This can be a combination of generic featurizations and those built up by a team or within a domain that can be reused on future data.

Modeling

A full discussion of machine learning modeling is not my intention here (this page is long enough already!). I will highlight some main points from my experience.

These days ML modeling is regarded as the “easy” part: just take your neural network, decision tree or whatever from your favorite library, put in a few parameters and let it run. Or even just drag & drop a few icons, link them together and click Go, without writing any code. While it is true that the tools have improved, making building ML models easier than it was, when it comes to building good models there is a lot more to it than this.

A few considerations when building ML models:

  • Auto-ml: Auto-ml is useful. Unfortunately, the phrase has now acquired more than one meaning. Originally it meant automatically searching the ML hyperparameter space as part of your code (as in H2O AutoML), a tool that can both save data scientists’ time and find better results. But now it has become conflated with the idea from above that anyone can be a data scientist by just clicking the right button. This is not true, but it is in vendors’ interests to sell their products, so it has this more hyped connotation.
  • Unsupervised learning: This has many uses, but it is harder than supervised learning to measure in a quantitative fashion whether it has been successful. If there is a choice in formulation of the analysis for a project, the supervised one is often the one where it is easier to prove quantitatively that a good result was achieved, and to provide a finishing point where the project can be declared done. Unsupervised could, of course, be used as part of the process, e.g., to cluster the data in order to explore or simplify it.
  • Time series: In a time series, each data row is correlated with the rows around it. This means that the rows are not independent and identically distributed (IID), which is an assumption of some ML algorithms. The data prep done for a time series is likely to already be different from that done for a static dataset, but in the modeling stage it also means one must be careful about how the data are split. You don’t want to randomly split the data into two samples because the rows are not IID; instead, use one time range for training and a different time range for testing.
  • Overfitting: This is a classic mistake that someone new to ML might make – you train the model on the training set and then use some (or all) of that same data in the validation set while tuning the model. Because ML models are arbitrarily complex and nonlinear, they are able to fit whatever elaborate description is needed to maximize performance on those exact data rows, i.e., fit the noise. The model looks good in training but then fails on new data. Overfitting can put a model arbitrarily close to spurious perfect accuracy depending on how complex the model is. While data scientists will know to avoid overfitting, it remains a danger because it can occur in other ways without people realizing: e.g., the information leakage example from the customer database in the data prep section above; complacency (“obviously I wouldn’t overfit, that’s dumb”, and then you do); using the testing set to tune the model (see below); etc. In a situation where there is pressure to reach a given performance number for a model because of a badly written business problem statement, a dishonest actor could deliberately leak training data into the testing set (who would notice?) to get the model’s performance up to what is needed. This is one of many reasons why it is better to have your team intrinsically motivated and then trust them to do their best, rather than to put external pressures on them like performance targets. (As a sidenote, as well as arbitrarily accurate complex models, it is in fact trivially possible to score 100% accuracy on any training set: simply use a k nearest neighbors (kNN) model with the number of neighbors k set to 1. Since in kNN the data is the model, if you then test on the training set, the model will give you back the same data points, which are all at zero distance from themselves, hence 100% accuracy.)
  • Baselining: If a problem can be solved by a parametric model without the need for machine learning, then it will usually be better: if your solution is just, say, y = x² then it is better to use that formula than a nonlinear ML approximation to it. But if your problem does need ML, and a simple model works as well as a complex one, then the simple one is preferred. A way to establish if the complex model is in fact needed for best performance is baselining: run a simple model on your data as a baseline, then compare the performance of the complex model to it. In this way the good performance of a complex model is seen to be coming from the model and not just because the problem was easy to solve anyway with a simple model.
  • Covariate shift: This happens when the data in the testing set is different from the training set, either obviously or subtly. It means that the model does not perform as well on the test data as expected, and hence won’t perform as well in production as expected. Such shift is very easy to let into real projects because the data are coming from whatever source is available, usually supplied by someone else. An obvious example would be using a few months of customer data to train but then validating and testing on different months. Even if the data are not a time series, there is probably covariate shift.
  • Treat models as primary objects: Models should be treated in a dataflow as primary objects, the same way that datasets are. One way to do this is a model registry, which also allows a company to have an overview of what has been created.
  • Reproducibility: The idea is for any model to be exactly reproducible: run the same dataflow again and the output should be identical. Unfortunately, this is not always possible because, for example, if a model is run on a distributed system and the data is distributed, the file line ordering is not necessarily preserved. Depending on the ML algorithm, changing the order of the data rows input can change the exact values of the model parameters learned. If exact reproducibility is not possible, the results can still be measured to see if they are statistically reproducible.
  • Use the test data only at the end: Model training should be done on training data and validation data. Once the parameters have been tuned and finalized, only then should the model be run on the unseen testing data for which the ground truth is also available, to get an unbiased result. If the result from the testing data is used to feed back and do more training, and then the same testing data is used again, the result will now be biased. This is because you are favoring models that happen to perform well on the sample of the underlying population represented by that particular testing set, and not the true underlying population. If the datasets are so big that any subsample is basically the same, this may not matter so much, but feeding information back from the testing set should be avoided when possible.
  • Robustness: Because ML models are nonlinear and can be arbitrarily complex, this means that tiny changes in some part of the data or parameter space could have outsized effects on the result. Mostly this won’t matter, but if your model is unstable around some critical threshold, it could be unreliable when deployed. A way to counter this and add assurance that the model is stable is sensitivity analysis: add some small variations to the input data and see if the model output is dramatically affected. If not, then you can have more confidence that your model is robust. This adding data variation is also the basis of one method of assigning variable importances to dataset features, and some methods of model interpretability.
  • Business logic: It is common for a data science and ML project to have to integrate into a business’s existing workflows. This means that some rules of business logic have to be part of the pre- or post-processing of a model. Since in most cases data preparation is already executing arbitrary code, necessary rules can be incorporated as part of it, along with sanity checks that the logic has been correctly followed. If the data supplied at the start of the preparation has had rules already applied to it, then it is important to be aware of this, in particular to be sure that data sent to the model in production has been subject to the same rules.
  • Curse of dimensionality: This was discussed in the data preparation section above, but it can manifest directly in the ML model stage too. A high dimensional space is hard to search because, in the hypercube representing the space, essentially every point is in a corner and they are all roughly the same distance apart; in addition, most columns probably do not contribute to model performance but you don’t necessarily know which ones. The best solution is to try to reduce the number of dimensions in the data prep stage, or, if many dimensions must still be present, to use ML methods better suited to high dimensions, such as support vector machines, if your problem is amenable to it.
  • Ensembling and mixture of experts: In many settings, the best performing ML models are not a single model but the combined output of an ensemble. This might be multiple instances of the same algorithm (bagging, gradient boosted decision trees, etc.), or a weighted or voted combination of completely different algorithms applied to the same data (mixture of experts). If a further model is used to combine the predictions of the first models, then this is stacking. Ensembling works because it reduces the likelihood of selecting a poor model. While ensembling often gives better performance, its downsides are that it is more computationally intensive, the results are less interpretable, and it is harder to tune, because the combination of tuning hyperparameters in different algorithms and combining their results likely has more manual steps than tuning a single algorithm. So whether to use it, and to what extent, is problem-dependent.
  • Cross-validation: Cross-validation sounds easy but is notoriously tricky to always do perfectly. For K-fold cross-validation, the data are split into K parts, one part is held out as the validation set and the others are used for training. This is then repeated with each fold held out in turn, and the results combined. K folds takes K times more compute time, so people usually trade off the robustness of a result against the resources needed and use an arbitrary number like K=5 or K=10. But what happens when you have a dataflow? Say your model is part of the featurization idea from above where you are iteratively selecting features by running the ML. If you are selecting a set of features, then running a cross-validated model, then selecting the next set of features, you will be biased because you are only exploring one route through the sets of features. The key is to cross-validate the pipeline, and not just the model: select your folds, and then do the whole dataflow for each one (see the sketch after this list).
  • Train the final model on all the data: When a model has been tuned, some of the data was held out as the validation set, say 20%. Unless the data are large enough that it doesn’t matter, once the best hyperparameter values have been found, you then want to train that model on all the training data, and not just the 80% from tuning. This means your model performance value from validation may not be exactly right, but it’s usually OK because your model trained on 100% of the data will do better than one trained on 80%. The performance of this trained-on-100% model then of course can’t be measured on any of that training data because you will overfit, so its result is quoted on the testing set. And then you want to avoid feeding back that value into more training, because you will be biased, as mentioned above.
  • Transfer learning: Especially with deep learning, starting a model from scratch may require a lot of training data and computing power to train a model with sufficient performance to solve a business problem. In many cases this can be intractable because the resources or data are insufficient. One solution to this is transfer learning: use an already available model that is pre-trained on some basic version of what you want, and then (in a neural network at least) train only the final layers of your model. An example would be detecting shapes in images – a model can be pretrained on common components like straight lines or corners, then that forms the basis for your particular model to detect your particular objects. Even if you do have the resources to train from scratch, sometimes it is the case that a model trained on top of an existing model can still do better. So transfer learning is potentially applicable in many situations.
  • Privacy & ethics: Like any powerful tool, data science and machine learning can be used for good and bad purposes. Say you can predict whether someone will get sick − you could use the result to warn them and recommend a doctor visit, or you could use the same result to deny them health insurance (at least in America). These issues are best navigated in a few ways − (1) The legal route. What are the laws and requirements in the country or area where you are working? Can you prove that you are following them? In the US, for example, certain characteristics such as gender and race are designated as protected, and it is illegal to discriminate based upon them. What laws might apply to a place that you are not in physically but is relevant to your project? An example here is the European GDPR data protection – it applies to the data of anyone resident in the European Union, even if it is being analyzed somewhere else. Many other countries and industries have further laws. (2) The data route: Is your data biased? This is not just an ethical question (see section above on data gathering), but it can be. A well-known example has been that facial recognition algorithms perform less well on people with darker skin, and this can be attributed at least in part to the training sets containing fewer examples. Most likely no one intended the results to be biased in such a way, but it happened anyway, and in this case on a characteristic that it is illegal to discriminate on in most countries. (3) The moral route. Things can be legal but still immoral, so what are you comfortable with? It’s easy to opt for the path of last resistance in any situation, especially when it is your job and you are being paid, but it’s not worth compromising your values over some data science project. Without an additional hypothesis such as morality should maximize wellbeing, it is not possible for the correct morality to be arrived at objectively, so any such set of values, and hence what you are comfortable using data science to do, must be arrived at for oneself.

Evaluation

Once a model is built, is it performing well enough? While models can be evaluated with various quantitative metrics, the people on a project should always return to whether or not the business problem has been solved. Aside from evaluating the trained model itself, part of the evaluation covers the model once it is in production; that part is described below.

The most obvious evaluation is: did it work? If it did, then we can see how well. Supervised learning models are trained against the training set and then evaluated against the validation set using various metrics. Most good ML software will let you select from various standard measures: accuracy, Gini index, F−score, mean absolute error for regression, and others. The model maximizes (or minimizes) the metric and reports the number on the validation set. Final results should then be quoted on the testing set (see the Modeling section, above).

As with all the other steps in the data science process, there are some classic gotchas and mistakes to avoid. One here is selecting the wrong metric for a problem. Say you want to detect whether people have a disease but only 1 person in 1000 has it. Then your training set will have (unless you use some sampling method and correct for it) 999 negatives and 1 positive. If you use accuracy as a metric (the fraction of correct classifications), your model can classify all 1000 people as negative and still be 99.9% accurate. It will look like it is good but in fact it has collapsed to the trivial case of predicting that no one has the disease. So metrics like recall and precision on the positive class should be used instead.
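
As a quick numerical illustration of this gotcha, the trivial “predict negative for everyone” model scores 99.9% on accuracy and 0 on recall and precision; a minimal sketch assuming scikit-learn’s metric functions:

    # Accuracy vs. recall/precision for a 1-in-1000 positive rate.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [0] * 999 + [1]   # 999 healthy people, 1 sick
    y_pred = [0] * 1000        # model predicts "healthy" for everyone

    print("accuracy :", accuracy_score(y_true, y_pred))                     # 0.999
    print("recall   :", recall_score(y_true, y_pred, zero_division=0))      # 0.0
    print("precision:", precision_score(y_true, y_pred, zero_division=0))   # 0.0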

Standard metrics are nice but in real business problems people often want some metric that is close to standard but is in fact customized. “I want to predict what price someone will pay for a house, but not if it’s more than $2 million − we’ll treat those people specially.” (or whatever; that one is made up). But it means that off-the-shelf metrics from models may not be sufficient. In these cases, there is a tradeoff in getting the metric needed versus time and resources used to enable it. Maybe the software allows you to define a custom metric but this takes time. A well−formulated business problem and plan can help determine what metric is really needed.

Another aspect that people don’t always consider is that ML metrics like the Gini index are almost always indirect. The business problem is not to get the highest Gini index but (say) to maximize the number of dollars earned. So why not tune directly for dollars? If a training set is available with a dollar amount associated with each data row, that amount can be used as a direct weight telling the model how much it is worth to get that row right. Then your output can be expressed as how many dollars you expect to earn by applying the model to a dataset with the same number of rows. It may in fact be more complicated than this: a false positive, such as marketing to someone who didn’t buy, may lose less money than a false negative of not marketing to someone who would have bought. That gives the four possibilities of true/false positive/negative four different values instead of just right/wrong, but the principle of directly maximizing business value instead of an abstract ML metric still applies.
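
A sketch of tuning for dollars rather than an abstract metric, using a custom scorer in which the four confusion-matrix outcomes are given made-up dollar values (assuming scikit-learn; the values, model, and data are placeholders that would come from the real business problem):

    # Tune for expected business value instead of accuracy. The dollar amounts are
    # illustrative: a correctly targeted buyer is worth $50, a wasted mailing costs $2,
    # and a missed would-be buyer costs $50 of lost revenue.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, make_scorer
    from sklearn.model_selection import GridSearchCV

    def dollar_value(y_true, y_pred):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return 50 * tp - 2 * fp - 50 * fn  # true negatives earn/cost nothing here

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        scoring=make_scorer(dollar_value, greater_is_better=True),
        cv=5,
    )
    search.fit(X, y)
    print("best C:", search.best_params_, "mean dollars per fold:", search.best_score_)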

A final consideration for evaluation is: what is a good result? We have our numbers, so is 90% accuracy (or whatever) good? Unfortunately, the answer is completely problem dependent. In some domains any accuracy slightly above chance is considered good, say 60% accuracy in sending out email marketing, but in others that are mission-critical, 99.9% may be unacceptably bad. The same applies to any other evaluation metric. The one measure that always applies is: was the business problem solved? Note that a null result, where the problem has been shown to be unsolvable, is also valid.

Interpretation

Your model is working and it can give the desired outputs, such as predictions. In many situations, people will want to know why the model made those predictions. A classic example is if someone is denied a credit card: in the US and various other countries it is required that the denial reasons are given, and these can be in the form of reason codes from the model.

As with data prep, featurization, etc., model interpretability is a whole field in itself, so I am not attempting to cover it all, just the main points.

Interpretability can be thought about and divided up in some of these ways:

  • Decide its importance: How interpretable does the model need to be? In some cases the model can be a black box and only performance matters. In others, there may need to be an explanation of the output such as a reason code, the data features used for a given output may need to be given, the features may need to be human-readable (so original columns or combinations, not PCA components), and whatever data was used may need to be kept so that decisions can be audited later.
  • Are there regulatory requirements? A model may need to comply with these even if performance is compromised. This is one reason why many deployments still use simpler models such as logistic regression rather than deep learning or decision tree ensembles, because a simple explanation for a decision can be given.
  • Model-dependent or independent: Model-dependent measures of interpretability depend upon details of the particular model. For example, intrinsic variable importances in a decision tree are calculated from the cumulative improvement at each node that splits on a given feature, making them model-dependent. Model-independent (or model-agnostic) measures use only the inputs and the outputs of the model, and so can be used with any ML algorithm. Random permutation variable importances work this way, as do other well-known schemes such as LIME or SHAP (see the permutation importance sketch after this list).
  • Local versus global interpretability: Interpretation values such as importances may be derived for the whole dataset, then each data point is given the same explanation. This is global interpretability. Or they can be derived for each individual point (or groups of points), which is local interpretability. Deriving importances or other reasons for each individual data point is of course more resource-intensive, but since the significance of data features may vary widely between different data points, and the decisions of most interest are often the unusual ones, it is often worth doing.
  • Reason codes: The significant data features used in a given model output and their relative weighting can be said to constitute the reason why a model gave a given output for a given data row, hence a reason code for a decision. This sounds straightforward, but there may be additional requirements, for example, not just that (say) someone’s income was significant, but that the decision was made (say) because the income was above or below some value. This means the directionality of the effect of a feature is important. You probably want it to be saying things like “if income is below $50,000 then don’t offer them the expensive car, but if it is above then do”. But if the model jumps around and says things like if income is below $32,950 don’t, between $32,951 and $38,857 do, then $38,858 to $43,784 don’t, and so on, it is hard to give a reason that makes sense. One way to mitigate this is to enforce monotonicity, i.e., as the feature value increases, the model output either never decreases or never increases. Then in the case here the income threshold would be much clearer. This sort of constraint can be enforced for some ML models, for example gradient boosted decision trees (see the monotonicity sketch after this list).
  • Provenance: Finally, if a decision is to be interpretable, the user needs to know where the data and the model came from, that is, their provenance. The best way to do this is version control, including for the data, the model, all of the software used, the analysis code, and the particular run of that code. Then make all this information easily accessible as an audit trail. If all of this is easily available and starts with the location and some guarantee of content like the MD5 checksum of the source dataset, then the result from the model is reproducible, and interpretable.
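
To make the model-agnostic idea concrete, a minimal sketch of random permutation importances with scikit-learn (an assumption; any fitted estimator and dataset could be substituted):

    # Model-agnostic importances: shuffle one feature at a time on held-out data and
    # measure how much the score drops. Works for any model, not just trees.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

    for i, importance in enumerate(result.importances_mean):
        print(f"feature {i}: {importance:.3f}")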
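
And for reason codes that need clear directionality, a sketch of enforcing monotonicity on an income feature using scikit-learn’s histogram gradient boosting (assumes a recent scikit-learn; the features and data are invented for illustration):

    # Enforce that the predicted probability can only increase with income, so
    # reason codes based on income thresholds stay coherent. Features: [income, age].
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    income = rng.uniform(20_000, 150_000, size=2000)
    age = rng.uniform(18, 80, size=2000)
    X = np.column_stack([income, age])
    y = (income + rng.normal(0, 20_000, size=2000) > 60_000).astype(int)

    # 1 = prediction must be non-decreasing in this feature, 0 = unconstrained.
    model = HistGradientBoostingClassifier(monotonic_cst=[1, 0]).fit(X, y)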

Production and Monitoring

The penultimate part of any end-to-end data science project should be to show that a model works in production. That is, it can be deployed on new incoming data (inference data) and produce correct and useful results. Putting machine learning models into production is currently much too difficult compared to what it should be: engineers can do it but they don’t know data science, and data scientists know what is needed but cannot do it because they don’t know the engineering. In addition to getting the model running in production, it also needs to be monitored to ensure that it is working correctly.

The response to this difficulty of putting models into production, increasingly recognized across the community of businesses using AI and ML, is MLOps, or DevOps for machine learning. By analogy to the improvement in general software delivery enabled over the last decade by DevOps, MLOps is enabling ML models to get into production in settings beyond companies that have a room full of engineers to figure it out.

(Note that DevOps for ML refers to using DevOps processes updated for the needs of ML to deploy any ML model, and not to the idea of using ML to improve the process of DevOps itself. That is ML for DevOps!)

  • Recoding: A common situation is the following: the data science team has coded a great model using Python in their Jupyter notebooks on their Mac laptops, and it is ready to be deployed. The production engineering team uses IDEs, not notebooks, needs the code to be in Java, and deploys onto a high-performance distributed Linux system accessed via APIs. You can see the problem. The entire pipeline has to be recoded to be deployable. The engineers can recode the pipeline but they don’t understand what the analysis is doing. The data scientists understand what the analysis is doing but they don’t know Java, let alone all the other architectural intricacies of the company’s production setup. The best way to mitigate this is to reduce the differences between the two parts of the process, through some combination of a technical leader who has reach into both teams, enough crossover knowledge that each side can at least sanity check the other, improved tooling such as an ML platform whose outputs are closer to production ready, and more modern MLOps techniques such as deploying models as microservices in containers.
  • Real time streaming: Some deployed models will be on batches of data coming in at certain times, but many will be in real time on new (inference) data constantly coming in. This means the inference data is in effect a time series, even if it does not actually represent a time series. And of course the deployment infrastructure needs to be able to deal with data flowing through it in real time.
  • Data volume: A lot of data is now arriving at such volume that it can only be treated in real time. The internet of things (IoT) is an obvious example, but it is occurring in other fields too. The volume of data can cause a problem for the accountability of a deployed model and its decisions because the data may be too large to keep. One solution might be to log data only for decisions that were unusual, measured for example by the reason codes discussed above. This probably works better than keeping all the data for only a limited time, because the likely timescale of any required audit may be months or years, by which time there is too much data to keep.
  • Pre− and post−processing: Almost all deployed models will have a pre−processing step, because incoming raw data needs to be converted into the format the model expects. This means it needs to be cleaned, transformed as it was in the data prep stage for training, featurized, and perhaps subjected to problem-specific business logic, all in a manner identical to the experimentation and training phase. Remembering the point above about recoding, this means the recoded version of the pipeline has to be doing the same thing as the training version. The correct pre-processing also has to be associated with the correct model, making the actual deployment a combination of a dataflow and the model, not just the ML model itself (a sketch of packaging pre-processing with the model follows this list). The same applies to post-processing, where the outputs of the model may be further processed (e.g., prediction labels 0s and 1s converted into something human-readable) and transformed in other ways to make them actionable (e.g., probability(customer buys) > 0.9 = send them the promotion).
  • Latency: Depending on the situation, whether the model is real-time or batch, service level agreements, user experience, and so on, there may be a maximum length of time that it is acceptable for a model to take to produce an output for any given data row. While throughput of the whole system can easily be increased because the model’s output on any given inference data point is independent of other rows and so can be done in parallel, the sequence of pre-processing, model, post-processing has to be done in order within the acceptable latency time, which can often be measured in milliseconds.
  • Data drift, concept drift, and model drift: In general when a model is deployed, the inference data coming in will change over time. This might be due to seasonal effects, economic changes, or just simply the fact that the world changes over time. Because most ML models are trained specifically for their input training data, it is very likely that the effect of such change will be to degrade the performance of the model. Such changes come in three types and are referred to as drift. Data drift is when the statistical distribution of any of the incoming data changes. An example would be people are spending more money in shops because there are some sales. Concept drift is when the underlying ground truth distribution is changing, for example a higher percentage of people are earning more money and you are trying to predict their income. Model drift is the resulting change in the performance of the model. Model drift is the most important one to monitor for, and ideally one would monitor data drift as well to help see the cause of model drift if it is the data. Concept drift can be monitored too if new ground truth is available.
  • Personnel: As with data preparation, it may be that the data scientist does not have the bandwidth to both build models and deploy them. In the deployment case, with the present state of tools available, they are also quite likely to not have the engineering knowledge to do the deployment properly. So it makes sense to have people on the team in the role of ML engineer to ensure a deployment is correct. As everywhere else in the end−to−end process, a technical leader should oversee to make sure what is being done makes sense. Another aspect to personnel is being on-call: if a model is running for a customer 24/7 then someone needs to be available for support 24/7 (or at least many hours) in case something goes wrong. Most data scientists likely didn’t sign up to be on-call in such a manner when they accepted their job, so the engineering or support role should be filled by someone who did.
  • Sanity check the inputs: When it comes to monitoring, it is obvious that the model outputs should be monitored. However, the input data should also be monitored, and in particular it should be constantly sanity checked. This is not only for the basic reason of making sure the data is what the user thinks it is, but also because ML models, being complex and nonlinear, can start to output garbage that still looks like real results. Say, for example, the data is being given in batches, two files are being combined, and one of them starts to have a header row. Then the rows become offset. The model will still output results, but they are now meaningless because two rows (two people, maybe) have been mixed together. Sanity checks should include things like: is any data coming in at all; how many columns there are and what their data types are; the column names and their ordering; the data volume and resource usage; whether there are missing or bad values and what they are; and some measures of the statistical distribution of the data itself (a sketch follows this list).
  • Upstream changes: The inference data coming to a deployed model is usually coming from somewhere else, and is not necessarily in the data science or production team’s control. This means that aside from the mentioned sanity checks, having direct information on changes in the upstream sources of the inference data is wise. Maybe there was a schema change which will mess up your model, or new information is being recorded which will enable better performance. This all becomes an example of doing good communication, good leadership, working well with other teams, etc.
  • Ground truth: Usually when inference data is coming in, there is no ground truth. If there were, we wouldn’t need the model. One ramification of this is that the inference data has one fewer column than when the model was being trained, which needs to be accounted for, but a more major one is that the model performance cannot be measured directly like it was with the training and testing sets. This means that more indirect measures of performance need to be used, such as the distribution of the model outputs, and detection of drift. In some cases, the ground truth may become available at a later time, from seconds (did the user click on the recommended link) to years (did the user default on their loan). In these cases, it might be possible to set up the system so that the model’s performance on what turned out to be the ground truth can be assessed. But given the difficulty we have already seen with getting a model deployed and monitored, and the varying usefulness of knowing the performance sometime after−the−fact, these setups are less common.
  • Alerts: If a model is being monitored, then the user needs to know when something significant has happened. The user is unlikely to be watching the screen 24/7, so this means some kind of procedure for generating alerts. Which alerts to generate is problem-dependent, but various general conditions could generate them: the model performance has drifted, the model has obviously failed (e.g., all the same outputs, or no output), the input data is wrong, resource or memory usage is too high, model latency is too high, and others. A thing to be careful with when designing alerts, however, is “alarm fatigue”. If there are too many alerts, then the important ones might get ignored or missed. One way to avoid too many alerts, suggested on the Prometheus website, is to focus them on events which affect what the user is seeing, rather than every intermediate internal step. What to do about an alert should of course also have a procedure. An ML model failing for a customer, with no one noticing or acting until the customer has lost a lot of results and perhaps a lot of money, has happened more than once.
  • It’s a time series: As mentioned above, the incoming inference data is often in real time, and always in a sequence, so it constitutes a time series. The question then becomes how to monitor the distributions of these inputs, and the similarly in−sequence model outputs, if there is only one point at a time. The answer of course is to group them into time periods, and then look at the data distributions within those. Monitoring tools such as the Prometheus−Grafana combination do this automatically, with the time periods, number of points and other settings being adjustable.
  • Detecting drift: The importance of detecting drift has been mentioned, so how is it done? Since the input data and outputs are usually changing from one moment to the next, the question to answer is whether a change over some period of time is significant. There is no single way to measure this, but one notion that can be used is a measure of distance between, say, the distribution of feature values in the training set and their distribution now (over the last whatever period of time). Expressing the distributions as (say) percentages across N bins, one such measure is the Euclidean distance sqrt[(I1−T1)² + (I2−T2)² + … + (IN−TN)²] between the training distribution T and the inference distribution I. Whether this distance has become too large can then be judged by something like a Z-score, the number of standard deviations away from the typical distance. This is the same idea used in outlier detection to determine whether the outlier scores of data points show them to be outliers. The same notion can be applied to the distributions of the model outputs, or even to model performance versus ground truth when available (see the drift sketch after this list).
  • Replacing models: If a deployed model has drifted, or for other reasons, it will need to be replaced by a new one. Normally this should be done carefully, because deploying a new model that is not working correctly could seriously impact a business. There are thus various common ways of replacing models. The champion-challenger method is a form of A/B testing where the performance of a new model is compared to the current one and, if it is performing significantly better, it becomes the candidate replacement. Performance might be measured at the training stage, where ground truth is available, but a new candidate model should always be checked in production as well. Ways to replace a model include blue/green, where the entire deployment setup is replicated and which one is live is switched over; shadow, where the new model is deployed (say) on another API endpoint and outputs predictions, but the predictions are not used until traffic is switched from the current model to the shadow one; and canary, where traffic is split between the two models and gradually shifted from the old one to the new one. This implies various basic or more complex setups, and few tools yet exist to perform these methods of replacing one deployment with another, so the role of ML engineers and similar team members remains vital.
  • Triggering retraining: A combination of monitoring and alerting can be used to automatically trigger retraining of a model if it is starting to drift. This is useful when the arrival of new training data and the conditions requiring retraining are well defined, because it saves doing such a process manually.
  • Monitoring tools: While monitoring machine learning models has overlap with monitoring other deployed applications (uptime, throughput, latency, etc.), the presence of incoming data that must be correct, and the tendency of even correctly coded and deployed models to eventually drift, means that extra aspects like data distributions must be monitored. There are not many tools yet to cater to all the requirements of monitoring ML models specifically, although their presence is increasing as the MLOps field becomes larger and more established. One combination of tools carried over from the regular DevOps world is the Prometheus time series database and the Grafana dashboard tool. This allows the stream of outputs from a model to be displayed, and, crucially, allows arbitrary queries to be made via its query language PromQL. So things like distributions of model predictions can be derived. This means a new language to learn but it is no harder than something like SQL, which many data scientists know already at some level. Unfortunately, this setup still has limitations, because it queries the metrics from the data and not the raw data itself, and also, it uses regular expressions for matching, which are designed for strings and not numerical ranges. So there is room for improvement in model monitoring tools.
  • Dashboards: Grafana provides a nice view of the outputs of a deployed model, but it is a technical view and doesn’t differentiate between items of interest mainly to the engineers, and those to the data scientist. There is room for more tools that provide views of the same information for different interested people, for example, a data science view based around model inputs, outputs, drift, etc., an engineering or IT view based on application health and the architecture of the system, a support view highlighting deployments with any issues and their state of resolution, and a business view, showing revenue, sales volume, or some other metric. Such a tool would allow the usefulness and health of deployed ML models to be seen by all.
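
Returning to the pre− and post−processing point above, one way to keep the training-time dataflow and the deployed dataflow identical is to package the pre-processing and the model as a single serialized artifact. A minimal sketch assuming scikit-learn and joblib, with invented column names and toy data:

    # Package pre-processing and the model as one deployable object so the inference
    # dataflow cannot silently diverge from the training dataflow.
    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = ["income", "age"]
    categorical = ["region"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

    train_df = pd.DataFrame({
        "income": [30_000, 80_000, None, 120_000],
        "age": [25, 40, 35, 52],
        "region": ["north", "south", "south", "west"],
        "bought": [0, 1, 0, 1],
    })
    pipeline.fit(train_df[numeric + categorical], train_df["bought"])
    joblib.dump(pipeline, "model_with_preprocessing.joblib")  # ship this one object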
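
For the input sanity checks described above, a sketch of the sort of checks that could run on every incoming batch before it reaches the model (the expected schema and thresholds are placeholders for a real contract with the upstream team):

    # Basic sanity checks on an incoming inference batch.
    import pandas as pd

    EXPECTED_COLUMNS = {"income": "float64", "age": "int64", "region": "object"}
    MAX_MISSING_FRACTION = 0.05

    def sanity_check(batch: pd.DataFrame) -> list:
        problems = []
        if batch.empty:
            problems.append("no rows received")
            return problems
        missing = set(EXPECTED_COLUMNS) - set(batch.columns)
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        for col, expected_dtype in EXPECTED_COLUMNS.items():
            if col in batch.columns and str(batch[col].dtype) != expected_dtype:
                problems.append(f"{col}: dtype {batch[col].dtype}, expected {expected_dtype}")
        if batch.isna().mean().max() > MAX_MISSING_FRACTION:
            problems.append("too many missing values")
        return problems  # an empty list means the batch looks sane

A batch where a stray header row has offset the data would typically fail the dtype check here, catching the kind of silent garbage-in problem described above.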
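
Finally, a sketch of the drift measure described above: the Euclidean distance between the training feature distribution and the latest window of inference data, flagged with a Z-score against the distances seen so far (the binning and the alert threshold are placeholders):

    # Drift check: distance between training and recent inference distributions,
    # flagged when it is an outlier among the distances seen previously.
    import numpy as np

    def distribution(values, bins):
        counts, _ = np.histogram(values, bins=bins)
        return counts / max(counts.sum(), 1)  # fraction of rows per bin

    def drift_score(train_values, inference_values, past_distances, n_bins=10):
        bins = np.histogram_bin_edges(train_values, bins=n_bins)
        t = distribution(train_values, bins)
        i = distribution(inference_values, bins)
        distance = np.sqrt(np.sum((i - t) ** 2))  # Euclidean distance between distributions
        if len(past_distances) < 2:
            return distance, 0.0
        z = (distance - np.mean(past_distances)) / (np.std(past_distances) + 1e-9)
        return distance, z  # alert if z exceeds some threshold, e.g. 3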

Action

Model production and monitoring are the penultimate part of a project. The final part is action. This means taking the outputs from your model and using them to provide the business value agreed upon when the project began.

Sometimes the part about making outputs actionable can seem trivial, or like annoying management hassle − of course we are going to act upon the outputs, otherwise we wouldn’t have made the model! But as mentioned back in the business problem section above, how the model outputs will be acted upon should be explicitly considered, because it closes the loop and links the results of the work back to the successful outcome of solving the business problem. It also makes it easy to show that your project has done something: even if the nontechnical people don’t understand your model, they will understand that, because actions are being taken as a result of its outputs, it is being used for something useful.

Another reason to explicitly consider how the model outputs are actionable and what will be done with them is that usually the actions are taken in some other system within the company. Say your model outputs whether a person is likely to buy a product. Then the action might be to email that person a promotion, and that emailing will be done from some system for sending out bulk email. How this happens should be laid out, because otherwise it is too easy for the model to be putting out useful predictions but then nothing happens. If that is the case then the project has failed.

Automation?

If you read all the sections above, it should be clear that automating the data science process for real projects is not yet feasible. Indeed, it is likely that there will be little difference between software that can eventually automate data science and an artificial general intelligence that can reproduce all of the capabilities of humans.

Nevertheless, various vendors and articles are starting to claim that automated data science is taking over, that their product solves it, and that all you need to do is click a button. “The death of the data scientist” is not far away. The polite response to this is: I don’t see it yet.

So is automation useful? Definitely. Being able to search your model’s hyperparameter space in a smart way that the computer does instead of doing it manually, or being able to automatically try various generic featurizations on your data to see if they improve the model, can both save a lot of time and get better results than the user might achieve themselves. Combine the two – feature engineering and modeling in an iterative way – and it becomes even more powerful. Similarly, automated generation of EDA visualizations and statistics relevant to ML is hugely useful and can save a lot of time.

The part that is not yet automated, and as mentioned likely won’t be until we have an AGI, is linking all of the analysis back to the most important thing – solving the business problem, presenting the analysis to an audience other than fellow data scientists, and simply making sure everything done at every step makes sense.

The closest to this is the automation of taking action as a result of the outputs of a model, which can be an aspect of robotic process automation. But it is likely that, while automated generic parts of the process may increase, they will be linked together for the foreseeable future by problem-specific code or steps defined by human practitioners.

Conclusions

This has been my take on the end-to-end data science process. The main takeaway is that data science is not about building machine learning models but about solving the business problem (or whatever other problem, if it’s not business). When the data are large, the modeling is machine learning, and the user is formulating the questions and designing the analysis to answer them, that is what distinguishes the process as data science rather than other, equally valid, subjects.

Comments / Feedback

This page covers a lot of ground and like any long piece of writing contains parts that can no doubt be improved. I’m happy to take any comments or feedback via the Contact Me page.