
Inside the Mind and Methodology of a Data Scientist

By Jose Rivero | Country Manager, Infor - Thu, 10/06/2022 - 15:00

When you hear about data science, big data, analytics, artificial intelligence, machine learning, or deep learning, you may end up feeling a bit confused about what these terms mean. And it doesn’t help reduce the confusion when every tech vendor rebrands their products as AI.

So, what do these terms really mean? Where are the overlaps and what are the differences? And most importantly, what can this do for your business? The simplest answer is that these terms refer to some of the many analytical methods available to data scientists. Artificial intelligence is simply an umbrella term for this collection of analytic methods. To solve practical decision problems, the data scientist typically uses combinations of these methods. In this article, we provide a high-level overview of the most important analytical methods, relate them to one another, and show that successful solutions are not crafted with just one tool.

Using analytical methods is not new. During World War II, Britain engaged a thousand people in operational research: “a scientific method of providing executive departments with a quantitative basis for decisions regarding the operations under their control.” Since then, and certainly since the advent of the computer, the set of analytical methods has grown enormously. In such a fast-moving field, different research communities coin their own names, which partially explains today’s terminology chaos.

Data scientists start a project by talking to business users to understand the question at hand. They then explore the data that is available for the project, and this usually generates follow-up questions to be discussed with business users. After a few iterations, this results in a well-defined business question with identifiable supporting data. Business intelligence tools support this iterative process: the data may already reside in, or be loaded into, a data warehouse, and analytics tools such as charts, reports, and dashboards provide visual support for the business discussions.

A foundational data analysis tool is statistics, and everyone intuitively applies it daily. When you make an observation, whether it is the level of traffic you encounter on the way to work or the total charge when you pick up your coffee and pastry, you will automatically take notice if what you observe is out of the ordinary. Statistics provides the mathematical foundation to determine how data behaves and when you have an outlier. Outliers may point to data entry or software integration errors, but also may point to threats or opportunities. A solid data science solution must detect outliers and handle them appropriately.
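To make this concrete, here is a minimal sketch of the kind of outlier check a data scientist might run. The z-score rule, the threshold, and the daily coffee-and-pastry totals below are all illustrative assumptions, not a prescribed method.

```python
import numpy as np

def flag_outliers(values, z_threshold=3.0):
    """Return a boolean mask marking observations that deviate strongly from the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros(values.shape, dtype=bool)  # no variation, nothing to flag
    z_scores = np.abs(values - mean) / std
    return z_scores > z_threshold

# Invented daily totals; the 25.00 entry is likely a data entry error.
daily_totals = [6.50, 7.00, 6.75, 7.25, 6.80, 25.00, 7.10]
print(flag_outliers(daily_totals, z_threshold=2.0))  # only the 25.00 entry is flagged
```

In practice, whether a flagged value is discarded, corrected, or investigated as a threat or opportunity is a business decision, not an automatic one.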

There are many different approaches to assist humans in decision-making. For example, when you ask your navigation software to find the best route to your destination, you ask it to solve a mathematical optimization problem: given the network of roads, find the fastest (or shortest) path through the network. This is not just looking it up in a database: the software runs an algorithm that evaluates the possible routes and, thus, finds the optimal route with respect to the specified goal (fastest or shortest route).
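A toy version of this route-finding problem can be set up with the open-source networkx library. The road network, the place names, and the travel times below are invented purely for illustration.

```python
import networkx as nx

# Build a small road network; edge weights are travel times in minutes (invented values).
roads = nx.Graph()
roads.add_weighted_edges_from([
    ("Home", "A", 5), ("Home", "B", 9),
    ("A", "B", 2), ("A", "Office", 12),
    ("B", "Office", 6),
])

# A shortest-path algorithm evaluates the alternatives and returns the fastest route.
route = nx.shortest_path(roads, "Home", "Office", weight="weight")
minutes = nx.shortest_path_length(roads, "Home", "Office", weight="weight")
print(route, minutes)  # ['Home', 'A', 'B', 'Office'] taking 13 minutes
```

The point is that the answer is computed by searching the network with respect to a stated goal, not looked up in a table.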

Because the possible speed per road segment is an important input for finding the fastest route, GPS information from smartphones is used to estimate the current speed per segment. Statistical methods identify the unusual data: a broken car at the side of a road with fluid traffic is an exception and should be ignored when calculating the current speed at the road segment.

Once the navigation software has found the optimal route, it calculates the estimated time of arrival (ETA), which is usually fairly reliable, except when unanticipated disruptions occur, such as a new accident.

But not all processes can be planned so reliably. Think about a hospital storage location, where nurses take materials needed to treat a patient. A stock-out of an item may put a patient’s health at risk, but keeping huge amounts of inventory is very costly. The hospital wants to balance this tradeoff and understand how it can reduce inventory levels without jeopardizing patient health.

In this environment, the daily demand of each item varies substantially, and data scientists use statistics to understand this variability. Then they use mathematical optimization to calculate which inventory level minimizes inventory costs but still guarantees a minimum stock-out risk. They would typically also use simulation to evaluate how inventory and stock-outs behave under different replenishment scenarios. Finally, business intelligence tools visualize the results for end-users.
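The sketch below shows what such a simulation might look like in miniature. The Poisson demand distribution, the reorder point, and the order-up-to level are invented assumptions; a real study would fit the demand distribution to the hospital's own data.

```python
import numpy as np

def simulate_inventory(reorder_point, order_up_to, days=10_000, mean_demand=4.0, seed=0):
    """Simulate daily demand and replenishment; return average stock and stock-out rate."""
    rng = np.random.default_rng(seed)
    stock = order_up_to
    stock_levels, stockouts = [], 0
    for _ in range(days):
        demand = rng.poisson(mean_demand)       # uncertain daily usage by the nurses
        if demand > stock:
            stockouts += 1                      # demand could not be fully met today
        stock = max(stock - demand, 0)
        if stock <= reorder_point:
            stock = order_up_to                 # replenish back up to the target level
        stock_levels.append(stock)
    return np.mean(stock_levels), stockouts / days

# What-if comparison of two replenishment policies (invented parameters).
for policy in [(5, 20), (10, 30)]:
    avg_stock, stockout_rate = simulate_inventory(*policy)
    print(policy, round(avg_stock, 1), round(stockout_rate, 3))
```

Comparing the average stock held against the stock-out rate for each policy is exactly the tradeoff the hospital wants to see visualized.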

Many logistical and financial processes have been designed by humans and, hence, are well understood. For such processes, the data scientist can define and solve a mathematical model to optimize the defined goals.

As another example, a factory needs a production schedule that respects the capacity of resources and optimizes the on-time delivery of customer orders. Typically, there are millions of theoretical options (production schedules, routes to a destination), and mathematical optimization algorithms evaluate these options in a structured manner to find an optimal or near-optimal solution.

The data scientist’s toolbox contains many such algorithms, some of which have been inspired by nature. For example, evolutionary algorithms create a set (population) of reasonable solutions, combine these (breeding) to create new solutions (offspring), and then eliminate the worst solutions (survival of the fittest). After repeating this process many times, the population evolves and contains better solutions.
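The snippet below sketches that evolutionary loop on a deliberately simple problem (maximizing the number of ones in a bit string). The fitness function, population size, and mutation rate are toy assumptions chosen to keep the example short.

```python
import random

def fitness(candidate):
    return sum(candidate)  # toy objective: count the ones in the bit string

def evolve(bits=20, population_size=30, generations=50, mutation_rate=0.05):
    random.seed(42)
    population = [[random.randint(0, 1) for _ in range(bits)] for _ in range(population_size)]
    for _ in range(generations):
        # Breeding: combine pairs of the better solutions into offspring.
        parents = sorted(population, key=fitness, reverse=True)[: population_size // 2]
        offspring = []
        while len(offspring) < population_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, bits)
            child = a[:cut] + b[cut:]
            # Mutation: occasionally flip a bit to keep exploring new solutions.
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            offspring.append(child)
        # Survival of the fittest: the new generation replaces the old one.
        population = offspring
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Real scheduling problems use far richer representations and fitness functions, but the evolve-and-select loop is the same.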

Optimization and simulation methods allow for what-if analysis by virtually changing the world, i.e. by changing the input data, and rerunning the algorithms: the factory can virtually add or remove a machine and evaluate the impact on the resulting production schedule; the hospital can evaluate cost and patient risk if storage is replenished weekly instead of daily; and traffic authorities can virtually close a road segment and study the impact on traffic flows. Because these methods are driven by models that describe the business, they can analyze imaginary scenarios for which no historical data is available.

Where optimization and simulation are driven by models, machine learning is driven by data. Because it doesn’t require a human to build a mathematical model of the business process, machine learning can be used even when a business process is not well understood. Machine learning algorithms analyze data to learn structures and patterns about your business. This process is called training. The learned patterns then support future decisions and predictions. This also identifies two major weaknesses: machine learning cannot support new processes or imaginary what-if scenarios, because there is no historical data to learn from, and it cannot support a rapidly changing process, because patterns in the historical data are no longer representative of the future.

As an example, suppose that you have recorded a large set of historical sales opportunities with many attributes, such as customer, sales team, dates of opportunity creation, customer meeting dates, product(s) offered, price, and outcome (win/lose). Because we don’t really understand the process by which customers decide to buy or not, we cannot formulate this as a mathematical optimization problem. However, machine learning can find patterns in the historical opportunities data and predict whether a new opportunity will be won or lost.

The data scientist starts with statistical methods to detect and remove erroneous or anomalous historical records. This cleansing process is critically important because bad data obscures the real patterns and greatly reduces the usability of machine learning. The next step is to prepare the data through feature engineering.

Returning to the example of sales opportunities, the data scientist realizes that the raw creation and closing dates are unlikely to be meaningful for future decision-making, while the age of an opportunity (the time between its creation and closing dates) is very relevant. This feature engineering process is essential for obtaining high-quality solutions. When a vendor shows you how easy it is to “drag and drop” a data set into a machine learning tool, they generally forget to mention that a data scientist has spent countless hours on feature engineering to prepare that data.
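A small sketch of that step, using the pandas library, might look as follows. The column names and records are fabricated to mirror the sales-opportunity example; the idea is simply to replace raw dates with a derived age feature.

```python
import pandas as pd

# Hypothetical raw opportunity records; column names are invented for illustration.
opportunities = pd.DataFrame({
    "created": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-03-01"]),
    "closed":  pd.to_datetime(["2021-02-20", "2021-02-25", "2021-05-15"]),
    "amount":  [12000, 5000, 30000],
    "won":     [1, 0, 1],
})

# Feature engineering: the raw dates are unlikely to generalize, but the
# age of the opportunity (days between creation and closing) is meaningful.
opportunities["age_days"] = (opportunities["closed"] - opportunities["created"]).dt.days
features = opportunities.drop(columns=["created", "closed"])
print(features)
```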

There exist many different machine learning algorithms, and one of the simplest is the decision tree. After the data set has been prepared, training such a tree is an automatic process: an algorithm identifies the set of decision rules that best captures the patterns in the data while remaining generic enough to apply to future data. To predict the outcome of a new sales opportunity, the algorithm follows the rules in the previously trained decision tree. It ends up in a node (a segment of similar historical opportunities), and if 80 percent of those historical opportunities were won, then it predicts an 80 percent probability of winning the new opportunity.
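Here is a minimal scikit-learn sketch of that training-and-prediction cycle. The tiny feature table is fabricated; a real project would use thousands of cleansed, feature-engineered records.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated training data: [age_days, amount, meetings_held] and win/loss outcomes.
X_train = [[30, 12000, 3], [90, 5000, 1], [45, 30000, 4], [120, 8000, 1], [20, 15000, 5]]
y_train = [1, 0, 1, 0, 1]  # 1 = won, 0 = lost

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=["age_days", "amount", "meetings_held"]))

# Predict the probability of winning a new opportunity.
new_opportunity = [[60, 10000, 2]]
print(tree.predict_proba(new_opportunity))  # columns are [P(lose), P(win)]
```

The printed rules are exactly what makes the method transparent: a business user can read them directly.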

Decision trees are easy to understand and visualize. Business users could interpret the decision tree and, hence, understand why a certain prediction has been made. However, decision trees are often too simple to obtain accurate predictions. More advanced machine learning algorithms can capture more patterns in the data and, hence, can provide more accurate decisions.

For example, some methods use ensembles of decision trees, creatively named decision forests. But increased complexity reduces transparency and the prediction process becomes a black box. This is particularly problematic if the resulting decisions have legal consequences, or when (government) agencies need to be transparent in their decision processes. Each use case dictates how to balance transparency versus accuracy.
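In libraries such as scikit-learn, moving from a single tree to an ensemble is often a one-line change, which is partly why the accuracy-versus-transparency question comes up so often. The data below is the same fabricated example used above.

```python
from sklearn.ensemble import RandomForestClassifier

# Same fabricated training data as in the decision tree sketch.
X_train = [[30, 12000, 3], [90, 5000, 1], [45, 30000, 4], [120, 8000, 1], [20, 15000, 5]]
y_train = [1, 0, 1, 0, 1]

# An ensemble of 100 trees usually predicts better than a single tree,
# but its combined reasoning is much harder to show to a business user.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.predict_proba([[60, 10000, 2]]))
```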

The current state-of-the-art in (black box) machine learning is called deep learning, a technique that mimics the human brain by training an artificial neural net to produce desired outputs from given inputs. In our example, during the training process, we feed the attributes of the historical sales opportunities to the input neurons of the network and then adjust the artificial neurons to make the neural net produce the correct output (win or loss). A new sales opportunity can now be fed to the trained neural net to obtain a win or loss prediction. These predictions can be much more accurate than through a decision tree or forest, but the process is not transparent.
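As a rough sketch of that idea, the snippet below trains a small multi-layer perceptron on the same fabricated data. It stands in for the much larger networks and specialized frameworks used in real deep learning projects.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Same fabricated training data as in the earlier sketches.
X_train = [[30, 12000, 3], [90, 5000, 1], [45, 30000, 4], [120, 8000, 1], [20, 15000, 5]]
y_train = [1, 0, 1, 0, 1]

# Neural networks train more reliably on scaled inputs.
scaler = StandardScaler().fit(X_train)
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
net.fit(scaler.transform(X_train), y_train)

print(net.predict_proba(scaler.transform([[60, 10000, 2]])))
```

Unlike the decision tree, there is no rule set to print and hand to a business user: the learned weights are the black box the article describes.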

Deep learning is currently providing exciting results, especially for image and speech recognition. Although data scientists are still learning how to optimize neural networks for specific business problems, deep learning is the newest tool in a toolbox that is already filled with many other analytical methods.

To summarize, data scientists use a rich collection of analytical methods, including business intelligence, statistics, simulation, mathematical optimization, and machine learning. Every method excels at some specific task, but none of them solve complete business problems on their own. To build real-life decision solutions, the data scientist works with business users to investigate the problem and the available data, and then selects and implements the combination of methods that best fits the need and budget. 
