There’s more to machine learning and predictive modelling than playing around with a bunch of tools such as R and Hadoop. To help get beginners started, Beat Schwegler, director of cloud evangelism at Microsoft, discussed the basic concepts and common algorithms at the recent YOW! Developer Conference in Sydney.
Machine learning and predictive modelling is where a computer is programmed to pick up on patterns in sample or past data to predict future outcomes or train itself on how to best respond in certain situations.
“I can’t tell you that this is [absolute] truth, but I can give you an indicator or a probability of whether I think this is going to happen and how much I believe this is going to happen,” Schwegler said.
“For example, we have stored ticket prices for flights, and over time we make connections and say ‘you should buy this flight now because my model tells me that there’s an 80 per cent chance that the ticket price will increase by at least $50’.”
Machine learning and predictive modelling also allow companies to pre-emptively take action, Schwegler said. For example, a model could predict whether a car’s brakes are likely to fail in the next three months, so the driver can get the car serviced before a potentially dangerous failure on the road.
Predictions can also be used to get personal with customers, offering companies a better idea of their preferences, likes, dislikes and needs, he added.
The fundamental aspects to machine learning and predictive modelling are data features and labels. The features, or attributes, describe the domain. For example, a potential credit customer’s income and current debt situation.
A label is the known outcome attached to each example — for instance, that a customer is known to have a bad credit history. Labels are used to train the model so that, when it comes to making an actual prediction, it is able to assess whether a new customer is credit-worthy or not.
When it comes to the data itself, there are two main types: numerical and categorical. Numerical data is measurable or numbers-based, such as the height, length, width and depth of an object. Categorical data consists of groups or categories, such as gender, the town or city a person resides in, marital status and so on.
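A minimal sketch of how features and a label might look for the credit example in the article; the field names and values here are made up for illustration. Categorical features such as marital status are not numbers, so a common trick is one-hot encoding: one 0/1 column per category, while numerical features such as income can be used as-is.

```python
# Possible categories for a categorical feature (illustrative only).
MARITAL_STATUSES = ["single", "married", "divorced"]

def encode(customer):
    """Turn a customer record into a numeric feature vector."""
    # One-hot encode the categorical feature: one 0/1 entry per category.
    one_hot = [1 if customer["marital_status"] == s else 0
               for s in MARITAL_STATUSES]
    # Numerical features go in directly.
    return [customer["income"], customer["debt"]] + one_hot

# One training example: the features describe the customer, and the label
# is the known outcome the model learns from.
customer = {"income": 52000, "debt": 18000, "marital_status": "married"}
label = "bad_credit_history"

print(encode(customer))  # [52000, 18000, 0, 1, 0]
```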
Before a model is even trained and tested, a question needs to be formed that can provide a useful answer to the problem that the business is trying to solve, Schwegler said.
“Asking the right question is absolutely crucial. There is a huge difference between when will the clutch fail versus what’s the probability that the clutch fails within the next three months?
“There’s a huge difference in asking what is the financial risk in dollars amount with your credit versus are you most likely going to be a credit burden or not? They are really different type of questions that require different type of data to find an answer.”
Using the whole dataset to train the model is not the way to ensure a probability or prediction is accurate. The model needs to be tested, so the dataset is split into a training portion, an optimisation portion and, ideally, a further test portion, Schwegler said.
“We can split our training and test data so we can evaluate how well it performs against data we never saw before. You use one portion of the data to train your model, you use the other portion to optimise the algorithm and then you use the third portion to score it against data it never saw before,” he said.
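The three-way split Schwegler describes can be sketched in a few lines; the 60/20/20 proportions below are an assumption for illustration, not figures from the talk.

```python
import random

def split_dataset(rows, train=0.6, optimise=0.2, seed=42):
    """Shuffle, then split into train / optimise / final-test portions.

    The final portion is held back so the model is scored against data
    it has never seen before.
    """
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # fixed seed for reproducibility
    n_train = int(len(rows) * train)
    n_opt = int(len(rows) * optimise)
    return (rows[:n_train],
            rows[n_train:n_train + n_opt],
            rows[n_train + n_opt:])

data = list(range(100))
train_set, optimise_set, test_set = split_dataset(data)
print(len(train_set), len(optimise_set), len(test_set))  # 60 20 20
```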
Usually more than one suitable algorithm is trained and tested in parallel to see which one performs the best at solving the problem or question, Schwegler said.
Linear regression might be used when trying to predict the average age of a customer that buys a particular product, for example, whereas a two-class decision tree might be used to decide if a potential customer is credit-worthy or not. Multi-class decision trees can be used when trying to predict more than two possible outcomes such as ‘bad’, ‘good’ and ‘great’.
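To make the two-class decision tree concrete, here is a tiny hand-written one for the credit example: each `if` is an internal node, each `return` a leaf. In practice the tree would be learned from labelled data; the thresholds here are invented purely for illustration.

```python
def credit_decision(income, debt):
    """A toy two-class decision tree (illustrative thresholds only)."""
    if income < 30000:                 # internal node: income check
        return "not credit-worthy"     # leaf
    if debt > income * 0.5:            # internal node: debt-to-income check
        return "not credit-worthy"     # leaf
    return "credit-worthy"             # leaf

print(credit_decision(52000, 18000))   # credit-worthy
print(credit_decision(52000, 40000))   # not credit-worthy
```

A multi-class version would simply have more than two distinct leaf values, such as 'bad', 'good' and 'great'.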
One of the challenges in machine learning and predictive modelling is overfitting, where noise in the training data or too many features make the model overly complex, Schwegler said. For example, a very detailed decision tree makes fewer errors on the training data, but as soon as it is given data it has not seen before, the new examples do not fit neatly into its nodes and leaves.
“You build a model that fits near 100 per cent your training data, but completely fails if it sees data it never saw before,” Schwegler said. “So we have to build models that are generalised and usable.”
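An exaggerated sketch of that failure mode, using made-up data: a "model" that simply memorises its training examples is perfect on them but useless on anything new, while a cruder, generalised rule still gives usable answers on unseen inputs.

```python
# Training data: (feature pair) -> label. Values are invented.
train = {(1, 1): "yes", (2, 5): "no", (3, 2): "yes"}

def memoriser(x):
    """Overfitted extreme: a lookup table that fits the training data
    exactly but fails on anything it never saw."""
    return train.get(x, "unknown")

def general_rule(x):
    """Generalised model: a crude rule ('yes' when the second value is
    small), imperfect on training data but usable on new inputs."""
    return "yes" if x[1] <= 2 else "no"

print(memoriser((1, 1)))      # yes      (training data: perfect)
print(memoriser((4, 1)))      # unknown  (unseen data: fails completely)
print(general_rule((4, 1)))   # yes      (unseen data: a usable prediction)
```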
When it comes to scoring and evaluating a model, performance is usually calculated from true positives (TPs), false positives (FPs), true negatives (TNs) and false negatives (FNs). TPs are positives correctly labelled as positives, FPs are negatives incorrectly labelled as positives, TNs are negatives correctly labelled as negatives, and FNs are positives incorrectly labelled as negatives.
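Those four counts can be tallied directly by comparing actual labels against predictions; the small example data below is invented for illustration.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count TP, FP, TN, FN for paired actual vs predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

actual    = [True, True, False, False, True]
predicted = [True, False, False, True, True]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```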
The question or problem determines how best to calculate the model, Schwegler said.
"When you have to predict the likelihood of breast cancer, for example, it's better to make a prediction that there might be breast cancer so you can do a second screening. So in this case you would ensure that you have no false negatives; you really want to avoid any false negatives.
"If you want to make a prediction on someone being credit worthy, you would say 'no' more often than you would say 'yes'. That's the complete opposite to the breast cancer prediction where there's a second screening and then you eliminate the false negatives."
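One common way to express that trade-off, assuming the model outputs a score or probability for the positive class, is to move the decision threshold: a low threshold flags more cases (fewer false negatives, as in the screening example), while a high threshold says "no" more often (fewer false positives, as in the credit example). The scores below are made up for illustration.

```python
def classify(scores, threshold):
    """Turn model scores (probability of the positive class) into yes/no
    decisions; the threshold encodes which error we'd rather make."""
    return [s >= threshold for s in scores]

scores = [0.9, 0.6, 0.4, 0.2]

# Screening case: low threshold flags more cases, avoiding false negatives.
print(classify(scores, 0.3))  # [True, True, True, False]

# Credit case: high threshold says "no" more often, avoiding false positives.
print(classify(scores, 0.7))  # [True, False, False, False]
```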