Learning it


Having taken the popular Stanford Machine Learning course, here is what I have learnt. My write-up follows the course outline and should be suitable for a layman who wants to get the gist of it.

Let us first try to understand what an algorithm is. An algorithm is a procedure or formula for solving a problem, based on carrying out a sequence of specified actions.

A system programmed to learn an algorithm on its own to solve a problem: that's machine learning! Technically, you do not need to hard-code the logic for the task; the system will learn the logic it needs from the data you provide.

That’s cool! You do not need an expert (or many) to come up with a solution. The programme will learn from the data.

There are quite a few common and useful algorithms taught throughout the course. In between the modules, you will be introduced to useful techniques for tackling common mistakes and issues. Broadly, machine learning comes in two forms: supervised learning and unsupervised learning.

Supervised learning methods train an algorithm to predict an outcome after learning from the data. Each example in the training dataset consists of the result you want the system to learn and all of its features (factors). Supervised learning can also perform classification tasks by calculating a probability score for each data point against a specific outcome and assigning a category to it.

Unsupervised learning methods train an algorithm to learn patterns or meaning in a set of data. Before training, you will need to decide what kind of patterns to look for and how many clusters the data should be sorted into according to similarity.

I like this Quora answer, which uses a simple analogy to explain the differences. From here on, I will walk through what I learnt from the course.

 

Supervised Learning:

 

Linear Regression (1 variable, multiple variables)

The most basic form of algorithm, built on a linear equation, and the basis for many predictive analytics problems. When you want a prediction that is determined by one or several variables or conditions, this might be suitable.

The example in the course lecture is estimating the price of a house from conditions like the number of bedrooms, number of floors, square footage and age of the house. All these factors contribute to what the price might be.

Predicting the price of houses with different features (size, location, number of rooms, age, etc.) can also be extended to car prices, commodity prices and anything whose estimated value you would like to understand from its present conditions.

I know what is going through your mind when you see 'predicting prices' – you will not be able to predict stock trends with this method. It is also not taught in this course (though it is covered in this other course).

Some other relatable tasks are product profitability, test or exam scores, crop yield, sales campaign costs, return on investment, cost to build, energy consumption, degradation of products, network latency and insurance premiums.
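
To make this concrete, here is a minimal sketch of a linear regression in Python using scikit-learn (my own choice of tooling; the course itself works in Octave/MATLAB). The house features and prices are invented for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Each row: [square feet, bedrooms, floors, age of house] - made-up values
    X = np.array([
        [2104, 3, 2, 10],
        [1600, 3, 1, 25],
        [2400, 4, 2, 5],
        [1416, 2, 1, 40],
        [3000, 4, 2, 2],
    ])
    y = np.array([400000, 330000, 470000, 230000, 540000])  # sale prices

    model = LinearRegression().fit(X, y)      # learn the linear equation from the data
    print(model.predict([[1800, 3, 2, 15]]))  # estimated price for an unseen house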

 

Logistic Regression – P(A|X)

After training with data, this algorithm will be capable of calculating a probability score for each new data point against the outcomes it could be classified under. It then uses a decision boundary to assign the data point to an outcome based on its probability score.

The example used in the course is checking a list of emails for possible spam. Each email is given a probability score of 0-100% of being spam. A decision boundary then decides at what probability the email should be classified as spam.

Other examples used in the course are determining whether you have a cold or the flu, and classifying weather conditions.

The course goes on to show how logistic regression is applied when multiple classes are needed, and explains how a self-driving car can make decisions using it. With multiple classes, data is commonly assigned to the outcome with the highest probability (defining vehicle type, labelling emails, etc.).

Some other relatable tasks are identifying vehicle types, labelling email types, medical diagnosis, word translation, bank loan approval, toxicity of complex chemicals, grading crop quality, risk profiling and music genre classification.
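
As a rough illustration of the spam example, here is a minimal sketch in Python with scikit-learn (not the course's own code); the word-count features and labels are invented.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [count of "free", count of "winner", number of links] - invented features
    X = np.array([[3, 2, 7], [0, 0, 1], [5, 1, 9], [0, 1, 0], [2, 3, 4], [1, 0, 2]])
    y = np.array([1, 0, 1, 0, 1, 0])  # 1 = spam, 0 = not spam

    clf = LogisticRegression().fit(X, y)
    new_email = np.array([[1, 1, 3]])
    p_spam = clf.predict_proba(new_email)[0, 1]  # probability score between 0 and 1
    print(p_spam, p_spam >= 0.5)                 # 0.5 here plays the role of the decision boundary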

 

Neural Network

With deep learning being such a popular term in the machine learning world, it is worth knowing that it is built on neural networks. This method learns an algorithm by modelling a network loosely based on the neural networks in animal brains. Similar to logistic regression, this is a classification method.

Why use a neural network to classify data when we already have logistic regression? It turns out that neural networks are a state-of-the-art method for learning complex non-linear classification problems. Complex here means data with many features needed to classify the training data accurately. Their accuracy is often superior to logistic regression and sometimes as good as or better than a human expert on the task. Logistic regression draws a linear decision boundary, which is not very robust for complex classification problems. A neural network, with multiple layers that refine what is learnt, is more suitable.

The term deep learning is used when the neural network passes the data through many hidden layers before producing the final output. With the complex computation happening inside those hidden layers, it is still not well understood how they work out their accurate hypotheses. People trying to figure this out are called 'AI detectives' in a ScienceMag article.

This method achieves better accuracy on classification problems with high complexity and nuance. Some tasks are computer vision, grading high school essays, DNA/genomics analysis, automatic game playing, automatic machine translation and image colourisation.
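
For a feel of what a small network looks like in code, here is a minimal sketch using scikit-learn's MLPClassifier on its built-in handwritten-digits dataset (my own example; the course exercises build the network by hand in Octave).

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)  # 8x8 images of handwritten digits
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # One hidden layer of 64 units; stacking more hidden layers is what makes it "deep"
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))     # accuracy on unseen digits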

The spread of voice and image recognition into widespread everyday use has also been driven by deep learning. Here is an interesting news story about two chatbots that developed their own negotiation language when trained against one another.

 

Advice for Applying Machine Learning

Machine learning problems are time-consuming, and you can sometimes spend weeks or months without getting the right result. This part of the course is full of tips to help you when working on machine learning problems, from designing to troubleshooting and avoiding common mistakes.

A list of what will be covered:

  1. Advice on allocating data into Training/Validation/Test sets (a small split sketch follows this list)
  2. Troubleshooting Overfitting (High Variance)
  3. Troubleshooting Underfitting (High Bias)
  4. Troubleshooting training/validation error against the degree of polynomial
  5. Troubleshooting Lambda/Regularisation
  6. Troubleshooting with Learning Curves
  7. Machine Learning System design:
    • Prioritising what to work on
    • Error Analysis
    • Error metrics for skewed classes
    • Trading off precision and recall
    • Data for machine learning
 

Support Vector Machines (Optimisation objective)

It is all about keeping computational cost low. As an alternative to logistic regression, a sometimes more powerful method is to learn the algorithm with a Support Vector Machine.

There are also libraries such as LIBSVM that can be used when programming to optimise the computation and greatly reduce training time. You might think this makes it a fast method. Yes and no: learning a good algorithm still needs plenty of time, even after the SVM saves some computing for us.

Not all problems are suited to the Support Vector Machine method. It is powerful when the number of features is small (1-1,000) and the dataset is of intermediate size (10-1,000 examples).
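
Here is a minimal sketch of an SVM with a Gaussian (RBF) kernel in Python; scikit-learn's SVC wraps LIBSVM under the hood. The iris dataset is just a convenient small example, not one from the course.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # C controls regularisation, gamma the width of the Gaussian kernel
    svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
    print(svm.score(X_test, y_test))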

 

 

Unsupervised Learning:

 

K-Means

A method to find patterns in the data and group it into clusters. You get to decide the number of clusters you would like the data to be sorted into.

The example provided in the course is finding shirt sizes for a community of people. You may have the height and weight data of these people; by setting 3 clusters, you will get the group sorted into three sizes: small, medium and large.
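
A minimal sketch of that shirt-size example in Python, with invented height/weight measurements:

    import numpy as np
    from sklearn.cluster import KMeans

    # Columns: height in cm, weight in kg - made-up measurements
    people = np.array([
        [158, 55], [160, 58], [163, 61],
        [170, 68], [172, 70], [175, 74],
        [182, 85], [185, 90], [188, 95],
    ])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)
    print(kmeans.labels_)           # which of the 3 clusters each person falls into
    print(kmeans.cluster_centers_)  # the "average" small/medium/large person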

Other relatable tasks for clustering are market segmentation and finding patterns in genomics or astronomy data.

 

Dimensionality Reduction

When clustering a complex dataset using K-Means, it is often time-consuming to compute and hard to visualise the clustered data in many dimensions. I think anything from 4D onwards is impossible to visualise. One technique is to apply Principal Component Analysis (PCA) to map the dataset from a high dimension to a lower one (e.g. 5D to 3D) before computing, saving time and making the result easier to visualise.
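
A minimal sketch of the idea in Python: compress a 64-dimensional dataset down to 3 dimensions with PCA before running K-Means (my own toy example, not the course's).

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)               # 64 features per example
    X_reduced = PCA(n_components=3).fit_transform(X)  # mapped down to 3 dimensions

    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
    print(X_reduced.shape, labels[:10])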

 

Anomaly Detection

Once a pattern is established in a set of data, anomaly detection is a method to statistically identify data points that deviate greatly from the norm.
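
A minimal sketch of the Gaussian approach taught in the course: fit a mean and variance to each feature of the "normal" data, then flag any point whose probability falls below a threshold epsilon. The data and epsilon here are invented.

    import numpy as np
    from scipy.stats import norm

    # Pretend these are measurements of a system behaving normally
    X_train = np.random.default_rng(0).normal(loc=50.0, scale=5.0, size=(500, 2))
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

    def probability(x):
        # Product of the per-feature Gaussian densities
        return np.prod(norm.pdf(x, loc=mu, scale=sigma), axis=-1)

    epsilon = 1e-6
    new_points = np.array([[51.0, 49.0], [90.0, 10.0]])
    print(probability(new_points) < epsilon)  # [False  True]: the second point is flagged as an anomaly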

Some relatable examples of use are in cyber security, to monitor networks or user activity, and in factories, to detect abnormal system behaviour early and prevent lost time or quality issues.

 

Recommender Systems

It’s great to be understood. That’s what Netflix and Amazon are good at when subtly suggesting products to you.

A recommender algorithm is able to suggest relevant or similar content to you based on what you have selected, used or liked.

The example used in this course is how customers are classified when choosing movies. Once a customer starts to provide some preferences or make choices, the system can assign estimated ratings to him/her and provide recommendations. The accuracy of these ratings improves as the amount of real data used for learning increases.
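
A minimal sketch of collaborative filtering in the spirit of the course exercise: factorise a small ratings matrix (0 meaning "not rated") into user and movie features with plain gradient descent, then read off predicted ratings for the unrated cells. The numbers are invented.

    import numpy as np

    ratings = np.array([
        [5, 4, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [0, 1, 5, 4],
    ], dtype=float)      # rows = users, columns = movies, 0 = not rated
    rated = ratings > 0  # mask of the known ratings

    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(4, 2))  # user feature vectors
    M = rng.normal(scale=0.1, size=(4, 2))  # movie feature vectors

    lr, reg = 0.01, 0.02
    for _ in range(20000):  # gradient descent on the regularised squared error of known ratings
        error = (U @ M.T - ratings) * rated
        U, M = U - lr * (error @ M + reg * U), M - lr * (error.T @ U + reg * M)

    print(np.round(U @ M.T, 1))  # predicted ratings, including the previously unrated cells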

 

Large Scale Machine Learning

The two most common barriers in machine learning problems are access to the right data and the computing time spent. Having a huge amount of data means more computing time, and with the multiple tests needed to validate your algorithm's accuracy, it means more time again.

This module is all about tips for working on large-scale data.

1. Stochastic gradient descent. A quicker way to train linear regression, logistic regression, neural networks or any learning method that uses gradient descent. This form of optimisation will not settle exactly at the global optimum but only lingers close to it (see the sketch after this list).

2. Map Reduce. Using multiple computers to process a huge dataset, splitting the work that would run on one computer across several machines running in parallel.

3. Online Learning. With live information from online platforms, patterns of users or traffic may change over time, and such live streams can accumulate massive amounts of data. One approach is to continuously learn from only the new data. This reduces the computing time needed to relearn historical information that is sometimes no longer relevant.
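
Here is a minimal sketch of stochastic gradient descent for linear regression (my own numpy example with synthetic data): the parameters are updated one shuffled example at a time instead of scanning the whole dataset per step.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 3))
    true_theta = np.array([2.0, -1.0, 0.5])
    y = X @ true_theta + rng.normal(scale=0.1, size=10000)

    theta = np.zeros(3)
    lr = 0.01
    for epoch in range(3):
        for i in rng.permutation(len(X)):  # shuffle, then update on one example at a time
            gradient = (X[i] @ theta - y[i]) * X[i]
            theta -= lr * gradient

    print(theta)  # hovers near [2.0, -1.0, 0.5] rather than settling exactly on it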

 

Application Example (Pipeline)

This module shows the steps for applying machine learning to a problem and then gives more working tips. The illustration used is how Photo OCR reads text in pictures using a pipeline made up of different machine learning modules.

A complex machine learning problem usually needs multiple modules, and each module may use a different machine learning algorithm.

Another creative tip is to create more data when you lack it, by synthesising data and using it to train the algorithm. The examples used are synthesising text and audio data for the system to learn from and improve.

The last tip in the course is to use ceiling analysis to save troubleshooting time. It is a simple and effective way to identify which modules have the highest potential to improve the overall system, so it is best to start improving those first.
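
A minimal sketch of how a ceiling analysis reads: feed each pipeline stage perfect (ground-truth) input in turn, record the overall accuracy, and look at which stage gives the biggest jump. The stages and numbers below are illustrative, not measured results.

    # Overall pipeline accuracy as each stage, in turn, is given perfect input
    overall_accuracy = {
        "baseline (nothing perfect)":      0.72,
        "perfect text detection":          0.89,
        "perfect character segmentation":  0.90,
        "perfect character recognition":   1.00,
    }

    previous = None
    for stage, accuracy in overall_accuracy.items():
        gain = 0.0 if previous is None else accuracy - previous
        print(f"{stage:33s} {accuracy:.2f}  (+{gain:.2f})")  # the biggest gain marks the stage worth improving first
        previous = accuracy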

 

Lastly…

I feel the most critical part of machine learning is gathering the right data and preparing it for training. A lot revolves around deciding what data to use for the training dataset, how to error-check it, how to sort it, whether you have enough of it, and whatever else your task at hand critically needs.

A challenge with big data today is how to get the value we want from all the data collected. A more perplexing question is what kind of unknown value we can get from this data.

With the level of complexity and possibility in this data, it is no wonder the people working with it are called data scientists/engineers/analysts.