Machine Learning


Machine Learning (ML) is now close to everyone! Look at our inseparable best friend, the smartphone. This little device is filled with software and apps that use machine learning to quietly monitor our usage patterns, optimising the phone’s performance and improving our experience with it.

Do a search online and you will find articles predicting the future job market, where many roles will be replaced by intelligent systems capable of performing knowledge work and automation. Bright or gloomy? I am not sure…

Many companies have started data mining projects for predictive insights or set up machine learning departments to find hidden gems in their trove of accumulated data.

Machine learning is too big a topic to cover in one post. Here you will get short answers to three questions.

1) Why is it so popular?
2) How to go about using machine learning?
3) What does it mean to us?

Why is it so popular?

So why the hype now, when people have been using machine learning since the 1950s? Its applications remained mostly of academic interest until the recent decade. Before its rapid adoption by businesses at large, the financial industry was probably the pioneer in commercialising machine learning, using it in quantitative trading and profiting from it. Other than creating algorithms that trade profitably, its uses include assessing the risk of loan defaults and the profitability of investments.

During the early days of machine learning, when a floppy disk from 1986 held only 1.44 MB, there wasn’t enough data to feed the systems. There were few cases where a machine learning problem could yield a highly optimised algorithm. Even when data existed, the hair-wrenching computation speeds back then made processing it an arduous process. Add the fact that, after investing so much time and resources, you might still fail to produce a good optimised algorithm, and it is not difficult to see why its use was unpopular outside of the academic world.

Then came the big data explosion fuelled by the internet, which, coupled with advances in computer processing speed, cleared those two big barriers. We got better tools and richer data amassed by the interconnected world, allowing us to mine valuable nuggets of insight. Interest in developing machine learning grew as insightful patterns and predictive trends were found in these data.

It turns out machine learning built on statistical learning is an effective way to achieve predictive models with high accuracy. A machine learning system can learn a suitable algorithm for the task, without needing an expert to write fixed algorithm code. As conditions change, the algorithm keeps adapting to the new information.

 

There is no need to learn to sort numbers. We already have algorithms for that, but there are many applications for which we do not have an algorithm but have lots of data.

– Ethem Alpaydin, Machine Learning: The New AI

 

For companies hoping to gain an extra edge through insights from the vast data they have collected, adopting machine learning becomes an important move.

 

How to go about using Machine Learning?

Machine learning is not the solution for all types of data analysis! Depending on the purpose of analysing your data, most data sorting and summarising tasks are better suited to common analytics tools (e.g. histograms, waterfall charts and pie charts).

If you think artificial intelligence and machine learning are the same, their definitions are actually different. Artificial intelligence (AI) is the capability of a machine to imitate intelligent human behaviour, while machine learning gives computers the ability to learn without being explicitly programmed. Use any means to create a system that imitates intelligent human behaviour and that system has artificial intelligence. A product or service using artificial intelligence may have one or more machine learning modules in its operating system.

For those keen to explore learning machine learning, I recently completed a popular course from Coursera and wrote a short post on what I learnt here.

Broadly, you will be working with math, data analysis, programming and a good amount of computing power.

My layman explanation:

The maths commonly needed in machine learning problems are linear algebra, probability theory and statistics, and calculus. Working with calculus always reminds me of drawing lines on graph paper in elementary math. Don’t be misled though: for complex or innovative tasks, higher-level math is commonly needed.

Data analysis is largely about selecting, sorting, reviewing and managing data and its structure. This part is important to make sure you’re using the most relevant data for optimal learning – for both algorithm accuracy and computational time.

Programming is how you engineer the way the system learns: from processing the dataset, to monitoring and troubleshooting during training, to visualising the optimisation results.

Computing time for your algorithm to learn meaningfully varies greatly with two main factors: the size of the data, which can sometimes get astronomical, and the learning method, since some need more computing time than others to optimise an algorithm.

Size of data. Understanding social behaviours and genomics are areas where huge amounts of data, from millions up to billions of examples, are often needed to achieve high accuracy.

Learning methods. Computing time varies with the method used; complex problems using deep learning often need more computing time.

Speech and face recognition, self-driving cars and DNA sequencing are some tasks that require both massive data and many hours of computer learning to reach reliable accuracy.

As a beginner looking at the math topics, the programming and the complexity it can reach, I felt incapable after finishing my machine learning course. This very consoling blog from Sharp Sight Labs shone some light for me on the prerequisites of machine learning; I felt less crippled after reading it. Sharing an infographic I like from the blog.

 

Source: sharpsightlabs.com

 

Machine learning is very suitable (in my simple view) for automating tasks that:

  • Involve predicting an outcome based on a list of attributes – e.g. predicting property prices or the selling price of new products.
  • Make decisions by using probability to choose from a specific range of outcomes – e.g. speech and image recognition, classifying emails into categories, stock trading.
  • Find consistent or inconsistent patterns in a situation that is repetitive but possibly has some degree of variation – e.g. anomaly detection in cyber security and in monitoring manufacturing line performance, categorising news articles, customer buying behaviour.
  • Mix the above methods together to create some cool stuff with artificial intelligence.
  • And of course those I have not thought of…

 

What does this mean to us?

In recent years, harvesting big data using machine learning has given rise to many systems with artificial narrow intelligence. Nowadays machine learning is used in almost any system harnessing artificial intelligence, with only more sophisticated artificial intelligence products and services to come.

On the future advancement of artificial intelligence, experts in this field hold two polarised views. Optimists believe we will eventually create a superintelligence, which might even accidentally cause human extinction (this part is not encouraging). Pessimists feel it will be difficult to even build a system with artificial general intelligence (human-level intelligence) and an understanding of emotional intelligence, hence creating a superintelligence is impossible. You can read this blog at leisure for some good explanations and interesting possibilities on artificial intelligence.

I do agree with what Andrew Ng said during an interview.

 

People often ask me, “Andrew, what industries do you think AI will transform?”

I usually answer that it might be easier to think about what industries AI will not transform. To be honest, I struggled to think of one.

– Andrew Ng, Co-founder, Coursera; Adjunct Professor, Stanford University; formerly head of Baidu AI Group/Google Brain

 

Our use of devices to connect socially and speed up work in the information age is evolving. We are already expanding toward connecting different devices to enrich personal experiences, manage lifestyle and health, and improve our quality of life. In this Internet of Things era, ever more devices will get connected and communicate with one another, creating an invisible orchestration that immerses us in them. Making sense of, and managing, the vast amount of data from end to end will rely greatly on machine learning.

I do think machine learning is going to affect everyone’s life in a big way, and hopefully more for the better. Many jobs will probably be replaced. Perhaps these jobs, and some more, are unlikely to be replaced in any near future.

Learning it


Having taken the popular Stanford Machine Learning course, here is what I have learnt. The flow follows the course outline and is suitable for a layman to get the gist of it.

Let us first try to understand what an algorithm is. An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions.

A system programmed to learn an algorithm on its own to solve a problem – that’s machine learning! Technically, you do not need to hard-code the logic for the task; the system will learn the logic it needs from the data you provide.

That’s cool! You do not need an expert (or many) to come up with a solution. The programme will learn from the data.

Quite a few common and useful algorithms are taught throughout the course, and in between the modules you are introduced to techniques for tackling common mistakes and issues. Broadly, machine learning comes in two forms: supervised learning and unsupervised learning.

Supervised learning methods train an algorithm to predict an outcome after learning from the data. Each example in the training dataset consists of the result you want the system to learn and all its standard features (factors). Supervised learning can also perform classification by calculating a probability score that a data point belongs to a specific outcome and assigning a category to it.

Unsupervised learning methods train an algorithm to find patterns or meaning in a set of data. Before training, you need to decide what kind of patterns to look for and how many possible clusters the data will be sorted into by similarity.

I like this Quora answer, which uses a simple analogy to explain the difference. From here on I will walk through what I learnt from the course.

 

Supervised Learning:

 

Linear Regression (1 variable, multiple variables)

The most basic form of algorithm, using a linear equation, and the basis for most predictive analytics problems. When you want a reliable prediction determined by one or several variables or conditions, this might be suitable.

The example in the course lecture is estimating the price of a house, considering conditions like the number of bedrooms, number of floors, square feet and age of the house. All these factors contribute to what the price might be.

Predicting house prices from different features (size, location, number of rooms, age etc.) can also be applied to car prices, commodity prices and anything whose estimated value you would like to understand from present conditions.
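To make this concrete, here is a minimal sketch in Python with scikit-learn (the course itself uses Octave, and all the feature values and prices below are invented for illustration):

```python
# Minimal linear regression sketch: predict house prices from features.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [square feet, bedrooms, floors, age of house] - invented data
X = np.array([
    [2104, 3, 2, 10],
    [1600, 3, 2, 25],
    [2400, 4, 2, 8],
    [1416, 2, 1, 40],
    [3000, 4, 3, 5],
])
y = np.array([400_000, 330_000, 369_000, 232_000, 540_000])  # prices

model = LinearRegression()
model.fit(X, y)  # learns one weight per feature plus an intercept

# Estimate the price of an unseen house
new_house = np.array([[1800, 3, 2, 15]])
print(model.predict(new_house))
```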

I know what is going through your mind when you see ‘predicting prices’ – you will not be able to predict stock trends with this method. It is also not taught in this course (but is taught in this course).

Some other relatable tasks are product profitability, test or exam scores, crop yield, sales campaign costs, return on investment, construction costs, energy consumption, product degradation, network latency and insurance premiums.

 

Logistic Regression – P(A|X)

After training with data, this algorithm can calculate a probability score for each new data point, indicating where it might be classified. A decision boundary then classifies the data into different outcomes based on its probability score.

The example used in the course is checking a list of emails for possible spam. Each email is given a probability score of 0–100% on whether it is spam, and a decision boundary then decides at what probability an email should be classified as spam.
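Here is a toy sketch of that spam example, assuming Python and scikit-learn; the email features and labels are invented, and 0.5 is just the conventional default boundary:

```python
# Toy logistic regression spam filter.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [links, exclamation marks, mentions of "free"] - invented
X = np.array([
    [8, 5, 3],  # spam
    [7, 4, 2],  # spam
    [1, 0, 0],  # not spam
    [0, 1, 0],  # not spam
    [9, 6, 4],  # spam
    [2, 0, 1],  # not spam
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)

new_email = np.array([[5, 3, 2]])
prob_spam = clf.predict_proba(new_email)[0, 1]  # probability score
print(prob_spam)
print("spam" if prob_spam >= 0.5 else "not spam")  # decision boundary
```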

Other examples used in the course are determining whether you have a cold or the flu, and classifying weather conditions.

The course goes on to show how logistic regression is applied when multiple classes are needed, and explains how a self-driving car makes decisions using it. In multi-class problems, data is commonly assigned to the outcome with the highest probability (defining vehicle type, labelling emails etc.).

Some other relatable tasks are identifying vehicle types, labelling email types, medical diagnosis, word translation, bank loan approval, toxicity of complex chemicals, grading crop quality, risk profiling and music genres.

 

Neural Network

Although deep learning is the popular word in the machine learning world, under the hood it uses a neural network to train the algorithm. This method learns an algorithm by modelling the kind of neural network that constitutes animal brains. Similar to logistic regression, it is a classification method.

Why use a neural network to classify data when we already have logistic regression? It turns out that the neural network is a state-of-the-art method for learning complex non-linear classification problems. Complex here means data with a lot of features is needed to classify the training data accurately. Its accuracy is often superior to logistic regression, and sometimes as good as or better than a human expert on the task. Logistic regression is a linear equation, and for complex classification problems such an equation is not very robust; a neural network with multiple layers to refine the learning is more suitable.
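A small sketch of that point, assuming Python and scikit-learn: on a synthetic non-linear dataset (two interleaving moons), a modest neural network typically beats a plain logistic regression:

```python
# Neural network vs logistic regression on a non-linear problem.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

print("logistic regression:", linear.score(X_test, y_test))
print("neural network:     ", net.score(X_test, y_test))
# The hidden layers let the network learn a curved decision boundary
# that a single linear equation cannot represent.
```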

The term deep learning is used when the neural network passes the data through many hidden layers before the final output. With the complex computation occurring in these hidden layers, it is still not well understood how they work out their accurate hypotheses. People trying to figure this out are called AI detectives in a ScienceMag article.

This method achieves better accuracy on classifications with high complexity and nuance. Some tasks are computer vision, grading high school essays, DNA/genomics, automatic game playing, automatic machine translation and image colourisation.

The development of voice and image recognition into widespread use has also been propelled by deep learning. Here is an interesting news story about two chatbots learning their own negotiation language when trained against one another.

 

Advice for Applying Machine Learning

Machine learning problems are time consuming, and you can sometimes spend weeks or months without getting the right result. This part of the course is full of tips for working on machine learning problems: designing, troubleshooting and avoiding common mistakes.

A list of what will be covered:

  1. Advice on allocating data into Training/Validation/Test sets (see the sketch after this list)
  2. Troubleshoot Overfitting (High Variance)
  3. Troubleshoot Underfitting (High Bias)
  4. Troubleshoot Training set error per degree of polynomial
  5. Troubleshoot Lambda/Regularisation
  6. Troubleshoot Learning curves
  7. Machine Learning System design:
    • Prioritising what to work on
    • Error Analysis
    • Error metrics for skewed classes
    • Trading off precision and recall
    • Data for machine learning
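
As a taste of the first item, here is a minimal sketch, assuming Python and scikit-learn, of the roughly 60/20/20 train/validation/test split the course recommends (the data is a random placeholder):

```python
# Split a dataset 60/20/20 into training, validation and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)  # placeholder features
y = np.random.rand(1000)     # placeholder targets

# First carve out 60% for training...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
# ...then split the remainder evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit on the training set, tune the model (and lambda) on the
# validation set, and report accuracy once on the untouched test set.
```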

 

Support Vector Machines (Optimisation objective)

It is all about keeping the computational load low. Besides logistic regression, a sometimes more powerful method for learning algorithms is the Support Vector Machine.

There are also libraries like LIBSVM to use in your program, which optimise the computation and greatly reduce training time. You might think this makes it a fast method. Yes and no: learning a good algorithm still needs loads of time, even after the SVM library saves some computing time for us.

Not all problems are suitable for the Support Vector Machine method. It is powerful when the number of features is small (1–1000) and the size of the dataset is intermediate (10–1000 examples).
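A minimal sketch of the fit-and-predict flow, assuming Python and scikit-learn (whose SVC classifier wraps LIBSVM), on a synthetic dataset:

```python
# Support Vector Machine with a Gaussian (RBF) kernel.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel handles non-linear boundaries; C trades off margin
# width against training errors.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```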

 

 

Unsupervised Learning:

 

K-Means

A method for finding correlation patterns and clustering the data. You can choose the number of clusters you would like the data to be sorted into.

The example provided in the course is finding shirt sizes for a community of people. You may have the height and weight data of these people; by setting 3 clusters, you will get a distribution of the group into three sizes: small, medium and large.
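A sketch of that shirt-size example, assuming Python and scikit-learn; the height and weight measurements below are invented:

```python
# Cluster people into 3 shirt sizes by height and weight.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [height in cm, weight in kg] - invented measurements
people = np.array([
    [155, 50], [160, 55], [158, 52],   # likely "small"
    [170, 68], [172, 70], [168, 65],   # likely "medium"
    [185, 90], [188, 95], [182, 88],   # likely "large"
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)
print(kmeans.labels_)           # cluster index for each person
print(kmeans.cluster_centers_)  # average height/weight per cluster
```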

Other relatable tasks for clustering information are market segmentation, and finding patterns in genomics and astronomy.

 

Dimensionality Reduction

When clustering a complex dataset using K-Means, it is often time consuming to compute and hard to visualise the clustered data in many dimensions – I think anything from 4D onward is impossible to visualise. One technique is to apply Principal Component Analysis (PCA) to map the dataset from a high dimension to a lower one (e.g. 5D to 3D) before computing, saving time and making the result easier to visualise.
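A sketch of that 5D-to-3D idea, assuming Python and scikit-learn, with random placeholder data:

```python
# Reduce 5 dimensions to 3 with PCA before clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(200, 5)            # 200 examples, 5 features

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)      # now 200 x 3

print(pca.explained_variance_ratio_)  # variance kept per component
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_reduced)
```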

 

Anomaly Detection

When a pattern is found in a set of data, anomaly detection is a method to statistically identify data points that deviate greatly from the norm.
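A minimal sketch of the Gaussian approach taught in the course, assuming Python with NumPy and SciPy, on invented data: fit a normal distribution to each feature, then flag points whose density falls below a threshold epsilon:

```python
# Gaussian density anomaly detection.
import numpy as np
from scipy.stats import norm

X_train = np.random.normal(loc=50, scale=5, size=(1000, 2))  # normal data
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

def probability(x):
    # Product of the per-feature Gaussian densities
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

epsilon = 1e-6  # threshold, usually tuned on a labelled validation set
print(probability(np.array([51, 49])) < epsilon)  # typical point: False
print(probability(np.array([80, 20])) < epsilon)  # outlier: True
```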

Some relatable examples are in cyber security, monitoring networks or activities, and in factories, detecting abnormal system behaviour early to prevent lost time or quality issues.

 

Recommender Systems

It’s great to be understood. That’s what Netflix and Amazon are good at when subtly suggesting products to you.

A recommender algorithm is able to suggest relevant or similar content to you based on what you have selected, used or liked.

The example used in the course is how customers are classified when choosing movies. Once a customer provides some preferences or starts making choices, the system can assign estimated ratings to him/her and provide recommendations. Accuracy of the ratings improves as the amount of real data used for learning increases.
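Here is a bare-bones sketch of one way to do this, collaborative filtering by matrix factorisation (similar in spirit to what the course teaches), assuming Python with NumPy and an invented ratings matrix where 0 marks a missing rating to predict:

```python
# Learn user and movie feature vectors so their dot products
# reproduce the known ratings; missing cells get predicted.
import numpy as np

R = np.array([[5, 4, 0, 1],    # each row: one user's movie ratings
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                   # which ratings are known

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2)) * 0.1   # user features
M = rng.normal(size=(4, 2)) * 0.1   # movie features

lr, lam = 0.01, 0.02           # learning rate, regularisation
for _ in range(20_000):
    err = (U @ M.T - R) * mask        # error only on known ratings
    U_grad = err @ M + lam * U
    M_grad = err.T @ U + lam * M
    U -= lr * U_grad
    M -= lr * M_grad

print(np.round(U @ M.T, 1))    # zeros are now filled-in predictions
```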

 

Large Scale Machine Learning

The two most common barriers in machine learning problems are access to the right data and the computing time spent. Having a huge amount of data means more computing time, and with multiple tests needed to validate your algorithm’s accuracy, it means more time again.

This module is all tips for working on large-scale data.

1. Stochastic gradient descent. A quicker way for linear regression, logistic regression, neural networks or any learning method that uses gradient descent to train the algorithm. This method of optimisation will not reach the global optimum but only lingers close to it (see the sketch after this list).

2. Map Reduce. Using multiple computers to process a huge dataset, splitting the time needed on one computer across several running in parallel.

3. Online Learning. With live information from online platforms, patterns of users or traffic may change over time, and such live streams may also accumulate massive data. One way is to continuously learn only from the new data, which reduces the computing time needed to relearn historical information that is sometimes no longer relevant.
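To illustrate point 1, here is a quick sketch of stochastic gradient descent for linear regression, assuming Python with NumPy and synthetic data: the weights are updated after every single example, so each step stays cheap even on huge datasets:

```python
# Stochastic gradient descent on a synthetic linear regression task.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(3)
lr = 0.01
for epoch in range(3):
    for i in rng.permutation(len(X)):    # shuffle each pass
        grad = (X[i] @ w - y[i]) * X[i]  # gradient from one example
        w -= lr * grad                   # a tiny, noisy step

print(w)  # hovers near true_w rather than converging exactly
```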

 

Application Example (Pipeline)

This module shows the steps for applying machine learning to a problem and gives more working tips. The illustration used is how Photo OCR classifies pictures using a pipeline consisting of different machine learning modules.

A complex machine learning problem usually needs multiple modules, and each module may use a different machine learning algorithm.

Another creative tip is to create more data when you lack it, by synthesising data and using it to train the algorithm. The examples used are synthesising text and sound data for the system to learn from and improve.

The last tip in the course is to use ceiling analysis to save troubleshooting time. It is a simple and effective way to identify which modules have the highest potential to improve the overall system – the best place to start improving first.

 

Lastly…

I feel the most critical part of machine learning is gathering the right data and preparing it for training. A lot revolves around deciding what data to use in the training dataset, how to error-check it, how to sort it, whether you have enough of it, and whatever else your task at hand critically needs to consider.

A challenge with big data today is how to get the value we want from all the data collected. A more perplexing question is what kind of unknown value we can get from these data.

With the level of complexity and possibility these data hold, no wonder we call those who work with them data scientists/engineers/analysts.