percent of the time on modelling and tuning. So it is critical to understand the data sets prior to building models. Data usually suffers from quality issues, incomplete sets, imbalanced labels, or skewed distributions of data attributes. Rectifying these issues requires cleansing the data, imputing missing values, and normalizing the skews. This is a critical step, and managing it well is key to getting value from machine learning.

If the data set being used is small, then feature engineering is required. Feature engineering is the art/science of teasing out latent information in the data so that the machine learning algorithm can use it to learn. Let's take an example: consider a data set of retail transactions from a store. This data set may have a date time stamp for the transaction, a set of SKUs (stock keeping units) for the products purchased, a store ID, a customer ID, and the dollar value of the transaction, amongst other fields. Each of these fields holds a lot of latent information. Consider the date time stamp: it contains information such as the day of week (weekday or weekend in a simplified scenario), the time of the transaction (or day parts like morning, afternoon, evening, night), and whether it was a public holiday. A customer ID may be present if the customer is part of a loyalty program. Using the store ID to get the location of the store, together with the customer's address, a feature can be engineered that calculates the distance the customer travelled to make this particular purchase. Such features can be really important for machine learning, and they can be extracted by data scientists who understand the business domain as well as the problem being solved. The success of a machine learning initiative often depends on feature engineering and an understanding of the data.
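To make this concrete, here is a minimal sketch of these cleansing and feature engineering steps using pandas. The column names, the holiday calendar, and the store and customer coordinate tables are all hypothetical stand-ins for fields that would come from a real transaction system.

```python
import numpy as np
import pandas as pd

# Hypothetical retail transactions; real data would come from the store's systems.
df = pd.DataFrame({
    "timestamp":   pd.to_datetime(["2018-05-05 09:15", "2018-05-07 19:40",
                                   "2018-07-04 13:05"]),
    "store_id":    [1, 1, 2],
    "customer_id": [101, 102, 101],
    "amount":      [25.50, np.nan, 310.00],   # one missing value to impute
})

# --- Cleansing: impute missing values and normalize a skewed attribute ---
df["amount"] = df["amount"].fillna(df["amount"].median())
df["log_amount"] = np.log1p(df["amount"])    # tame right-skewed dollar values

# --- Latent information in the date time stamp ---
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5    # Saturday = 5, Sunday = 6
df["day_part"] = pd.cut(df["timestamp"].dt.hour,
                        bins=[0, 6, 12, 17, 21, 24], right=False,
                        labels=["night", "morning", "afternoon",
                                "evening", "late night"])
holidays = pd.to_datetime(["2018-07-04"])    # hypothetical holiday calendar
df["is_holiday"] = df["timestamp"].dt.normalize().isin(holidays)

# --- Distance travelled: store location (from store ID) vs. customer address ---
stores = pd.DataFrame({"store_id": [1, 2],
                       "store_lat": [40.71, 34.05],
                       "store_lon": [-74.01, -118.24]})
customers = pd.DataFrame({"customer_id": [101, 102],
                          "cust_lat": [40.80, 34.10],
                          "cust_lon": [-73.95, -118.30]})

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df = df.merge(stores, on="store_id").merge(customers, on="customer_id")
df["distance_km"] = haversine_km(df["cust_lat"], df["cust_lon"],
                                 df["store_lat"], df["store_lon"])
print(df[["day_of_week", "day_part", "is_holiday", "distance_km"]])
```

Each engineered column here is exactly the kind of latent signal described above, pulled out of raw fields so that an algorithm can use it directly.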
Deep Learning and Deep Thinking

Neural networks and deep learning are dominating the airwaves at the moment. One might feel that by not using these techniques they are missing out. Let's put this in perspective using the Kaggle 2017 State of Data Science and Machine Learning survey, which shows that the top three methods in use are not deep learning methods. State-of-the-art deep learning methods require a lot of data to train today. Goodfellow et al. in their Deep Learning book propose:

"As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples."

The key phrase to note here is per category. Casting a task as a supervised learning problem, and making sure there are enough examples to train a deep learning algorithm, can be very challenging.

Machine Thinking

The key to success in applying the latest algorithms is to cast problems as supervised learning problems. Supervised learning is a class of machine learning problems where the desired output is known along with the input data. For example, the input could be an image and the desired output the label 'cat'. Both of these would need to be fed into a deep learning network, in fact many thousands of such pairs, for the algorithm to learn and work effectively. Another way to phrase this is to consider pairs of the form (A -> B), where B represents one of the many labels the machine is learning to discriminate between. In the case of the cat detector, B could take one of two possible values (cat, not cat). During training, multiple (A, B) pairs are provided. Once the algorithm is trained, input of the form (A, ?) is passed in and the machine guesses the correct label.

To be successful, start by casting your problem as a supervised learning problem. This is often the hard part of Machine Thinking. Once you are able to do that, there is very little to stop you from building amazing machine learning solutions and products.
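As a minimal sketch of this workflow, the toy example below casts a cat detector as (A, B) pairs using scikit-learn. The feature vectors are fabricated stand-ins for real image features, and the per-category count check simply operationalizes the rule of thumb quoted above; it is an illustration, not a production pipeline.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy (A, B) pairs: A is a feature vector standing in for an image,
# B is one of two categories: 1 = cat, 0 = not cat.
n_per_class = 500
A_cat = rng.normal(loc=1.0, size=(n_per_class, 8))
A_not_cat = rng.normal(loc=-1.0, size=(n_per_class, 8))
A = np.vstack([A_cat, A_not_cat])
B = np.array([1] * n_per_class + [0] * n_per_class)

# Sanity check against the rule of thumb quoted above: are there enough
# labeled examples *per category* for deep learning? (Here there are not,
# which is one reason a simple baseline model is a sound starting point.)
for label, count in Counter(B).items():
    status = "below" if count < 5000 else "at or above"
    print(f"category {label}: {count} examples ({status} the ~5,000 rule of thumb)")

# Training: fit the model on many (A, B) pairs.
model = LogisticRegression().fit(A, B)

# Prediction: pass in (A, ?) and let the machine guess the label.
A_new = rng.normal(loc=1.0, size=(1, 8))
print("guessed label:", "cat" if model.predict(A_new)[0] == 1 else "not cat")
```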
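On a real problem, the inputs to such a model would come from the data preparation and feature engineering steps described earlier, and starting with a simple supervised baseline of this sort is consistent with the survey finding that the most widely used methods are not deep learning methods.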