Thursday, March 29, 2018

10 Machine Learning Algorithms every Data Scientist should know



An analytical model is a statistical model that is designed to perform a specific task or to predict the probability of a specific event.
In layman's terms, a model is simply a mathematical representation of a business problem. A simple equation y = a + bx can be termed a model, with a set of predefined data inputs and a desired output. Yet, as business problems evolve, the models grow in complexity as well. Modeling is the most complex part of the lifecycle of a successful analytics implementation.
Scalable and efficient modeling is critical if organizations are to apply these techniques to ever larger data sets while reducing the time taken to perform the analyses. Models are therefore built that implement key algorithms to find the solution to a business problem.
Supervised vs Unsupervised learning models
Supervised learning models are models where there is a clear distinction between explanatory and dependent variables. The models are trained to explain the dependent variables using the explanatory variables; in other words, the model's output attributes are known beforehand. Examples:
  • Prediction (e.g., linear regression)
  • Classification (e.g., decision trees, k-nearest neighbors)
  • Time-series forecasting (e.g., regression-based)
In unsupervised learning, the model outputs are unknown, or there are no target attributes: there is no distinction between explanatory and dependent variables. The models are created to discover the intrinsic structure of the data. Examples:
  • Association rules
  • Cluster analysis
Here we briefly discuss the following 10 basic machine learning algorithms/techniques that any data scientist should have in his/her arsenal. There are many more powerful techniques, such as discriminant analysis and factor analysis, but we wanted to focus on these 10 most basic and important ones.
Machine Learning Algorithms
1. Hypothesis Testing
2. Linear Regression
3. Logistic Regression
4. Clustering
5. ANOVA
6. Principal Component Analysis
7. Conjoint Analysis
8. Neural Networks
9. Decision Trees
10. Ensemble Methods
1. Hypothesis Testing
Hypothesis testing is not exactly an algorithm, but it's a must-know for any data scientist. Do not move ahead before you completely master this technique.
Hypothesis testing is the process of using statistical tests to check whether a hypothesis is supported by the data. Based on the test, we choose to accept or reject the hypothesis. When an event occurs, it can be part of a trend or happen by chance; hypothesis testing is needed to check whether the event is a significant occurrence or just chance.
There are many tests for hypothesis testing, but the following two are the most popular:
t-test: The t-test is a popular statistical test for making inferences about a single mean, or about two means or variances, to check whether two groups' means are statistically different from each other, typically when n < 30 and the population standard deviation is unknown.
Chi-square test: A chi-square (χ²) test is used to examine whether two distributions of categorical variables are significantly different from each other.
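A minimal sketch of both tests in Python using SciPy; all of the sample numbers below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Two-sample t-test: are the means of two small groups (n < 30) different?
group_a = np.array([12.1, 11.8, 12.5, 13.0, 12.2, 11.9])
group_b = np.array([13.4, 13.1, 12.9, 13.8, 13.3, 13.6])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # small p suggests the means differ

# Chi-square test of independence on a 2x2 contingency table
# (e.g., clicked vs. not clicked for two ad variants -- hypothetical counts)
observed = np.array([[30, 70],
                     [45, 55]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```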
2. Linear Regression
Linear regression is a statistical modelling technique which attempts to model the relationship between an explanatory variable and a dependent variable by fitting a linear equation to the observed data points. For example: modelling the BMI of individuals using their weight.
Linear regression is used when there is a relationship or significant association between the variables, which can be checked with scatterplots. If no association appears between the variables, fitting a linear regression model to the data will not provide a useful model.
A linear regression line has equation in the following form:
Y = a + bX,
where X = explanatory variable,
Y = dependent variable,
b = slope of the line, and
a = intercept (the value of Y when X = 0).
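As a quick illustration, here is a minimal Python sketch that fits Y = a + bX with scikit-learn; the weight/BMI numbers are invented for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

weight = np.array([[55], [62], [70], [78], [85], [93]])  # X, explanatory (kg)
bmi = np.array([20.1, 21.5, 23.2, 24.8, 26.3, 28.0])     # Y, dependent

model = LinearRegression().fit(weight, bmi)
print(f"intercept a = {model.intercept_:.3f}, slope b = {model.coef_[0]:.3f}")
print("predicted BMI at 75 kg:", model.predict([[75]])[0])
```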
3. Logistic Regression
Logistic regression is a technique for finding the relationship between a set of input variables and an output variable (just like any regression), but here the output variable is a binary outcome (think 0/1 or yes/no).
For example, whether there will be a traffic jam at a certain location in Bangalore is a binary variable: the output is a categorical yes or no.
The probability of a traffic jam can depend on attributes like weather conditions, day of the week, month, time of day, number of vehicles, etc. Using logistic regression, we can find the best-fitting model that explains the relationship between the independent attributes and the occurrence of traffic jams, and predicts the probability of a jam.
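A minimal sketch with scikit-learn; the features (rain, hour of day, vehicle count) and all data points are invented stand-ins for the attributes described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: rain (0/1), hour of day, vehicles per minute
X = np.array([[1, 18, 95], [0, 11, 40], [1, 9, 80], [0, 15, 55],
              [1, 19, 100], [0, 7, 30], [0, 18, 85], [1, 13, 60]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = traffic jam, 0 = no jam

clf = LogisticRegression().fit(X, y)
# Estimated probability of a jam when it is raining at 6 pm with 90 vehicles/min
print(clf.predict_proba([[1, 18, 90]])[0][1])
```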
4. Clustering Techniques
Clustering (or segmentation) is a kind of unsupervised learning algorithm where a dataset is grouped into unique, differentiated clusters.
Let's say we have customer data spanning 1,000 rows. Using clustering, we can group the customers into differentiated clusters or segments based on the variables. In the case of customer data, the variables can be demographic information or purchasing behavior.
Clustering is an unsupervised learning algorithm because the output is unknown to the analyst: we do not train the algorithm on any past input-output pairs, but let the algorithm define the output for us. Therefore (just like any other modeling exercise), there is no single right solution to a clustering problem; rather, the best solution is the one that is most usable for the business. Some people also call clustering unsupervised classification.
There are 2 basic types of clustering techniques:
  • Hierarchical clustering
  • Partitional clustering (e.g., k-means; see the sketch below)
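As a quick illustration of partitional clustering, here is a minimal k-means sketch in Python with scikit-learn; the two-column customer data (age, annual spend) is invented:

```python
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[25, 300], [27, 350], [45, 1200], [48, 1100],
                      [30, 400], [50, 1300], [26, 320], [47, 1150]])

# In practice you would scale the variables first; skipped here for brevity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment assignment for each customer
print(kmeans.cluster_centers_)  # centroid of each segment
```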
5. ANOVA
The one-way analysis of variance (ANOVA) test is used to determine whether the means of more than two groups are significantly different from each other.
For example, a BOGO (buy one, get one) campaign is executed on 5 groups of 100 customers each, where each group differs in its demographic attributes. We would like to determine whether these 5 groups respond differently to the campaign. This would help us target the right campaign at the right demographic group, increase the response rate and reduce the cost of the campaign.
The "analysis of variance" works by comparing the variance between the groups to that of within group variance. The core of this technique lies in the assessing whether all the groups are infact part of one larger population or completely different population with different characteristics.
6. Principal Component Analysis
Dimension (variable) reduction techniques aim to reduce a higher-dimensional data set to a lower-dimensional one without losing the information the dataset conveys. The dimension here can be thought of as the number of variables the data set contains.
Two commonly used variable reduction techniques are:
  1. Principal Component Analysis (PCA)
  2. Factor Analysis
The crux of PCA lies in looking at the data from the perspective of its principal components. A principal component of a data set is the direction with the largest variance. PCA involves rotating the axes to align with the eigenvectors of largest eigenvalue and defining the principal components, i.e. the highest-variance axes, or in other words the directions that best describe the data. Principal components are uncorrelated and orthogonal.
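A minimal sketch of PCA with scikit-learn, reducing the 4 variables of the well-known iris dataset to 2 principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 rows x 4 variables
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)          # 150 rows x 2 components

print(pca.explained_variance_ratio_)  # share of total variance per component
print(X_reduced.shape)
```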
7. Conjoint Analysis
Conjoint analysis is widely used in market research to identify customers' preferences for the various attributes that make up a product. The attributes can be features like size, color, usability, price, etc.
Using conjoint (trade-off) analysis, brand managers can identify which features customers would trade off at certain price points. It is thus a widely used technique in new product design and pricing strategy.
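One simple, ratings-based flavour of conjoint analysis can be sketched as a regression of preference ratings on dummy-coded attribute levels; the product profiles and ratings below are entirely hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

profiles = pd.DataFrame({
    "size":   ["small", "small", "large", "large", "small", "large"],
    "price":  ["low", "high", "low", "high", "high", "low"],
    "rating": [6, 3, 9, 5, 2, 8],  # hypothetical customer preference ratings
})

# Dummy-code the attribute levels and estimate part-worth utilities
X = pd.get_dummies(profiles[["size", "price"]], drop_first=True)
model = LinearRegression().fit(X, profiles["rating"])
print(dict(zip(X.columns, model.coef_.round(2))))  # utility shift per level
```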
8. Neural Networks
A neural network (also known as an artificial neural network) is inspired by the human nervous system and how it absorbs and processes complex information. Just like humans, neural networks learn by example and are configured for a specific application.
Neural networks are used to find patterns in complex data, and thus to forecast and classify data points. Neural networks are normally organized in layers, each made up of a number of interconnected 'nodes'. Patterns are presented to the network via the 'input layer', which communicates with one or more 'hidden layers', where the actual processing is done. The hidden layers then link to an 'output layer', where the answer is produced.
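A minimal sketch of a small feed-forward network (one hidden layer) with scikit-learn, learning the classic XOR pattern that a single linear model cannot capture:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR: not linearly separable

# One hidden layer of 8 nodes between the input and output layers
net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=1)
net.fit(X, y)
print(net.predict(X))  # ideally [0 1 1 0]
```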
9. Decision Trees
Decision trees, as the name suggests, are a tree-shaped visual representation of how one can reach a particular decision by laying out all the options and their probabilities of occurrence. Decision trees are extremely easy to understand and interpret: at each node of the tree, one can read off the consequence of selecting that node or option.
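A minimal sketch with scikit-learn: a depth-2 decision tree trained on the iris dataset, with the learned rules printed so each split can be read directly off the tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```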
10. Ensemble Methods
Ensemble methods work on the philosophy that many weak learners can come together to give a strong prediction. Random forest is among the most accurate classification techniques available, and it is an ensemble method: the weak learner is a simple decision tree, and the random forest is the strong learner.
A random forest combines the output of many decision trees, each built on a sample of the same dataset, to arrive at a more accurate classification model.
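A minimal sketch comparing a single decision tree with a random forest on the same dataset, illustrating how many weak learners combine into a stronger one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # 200 trees

print("single tree:", cross_val_score(tree, X, y, cv=5).mean().round(3))
print("forest     :", cross_val_score(forest, X, y, cv=5).mean().round(3))
```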
Source: AIM
Published By: Nand Kishor

Tuesday, March 27, 2018

11 Javascript Machine Learning Libraries To Use In Your App


Orca whales trained through neural networks, that’s the future.

“Wait, what?? That’s a horrible idea!”
These were the exact words of our leading NLP researcher when I first talked to her about this concept. Maybe she’s right, but it’s also definitely a very interesting concept, which has been getting more attention in the Javascript community lately.
Over the past year our team has been building Bit, which makes it simpler to build software using components. As part of our work, we develop ML and NLP algorithms to better understand how code is written, organized and used.
While most of this work is naturally done in languages like Python, Bit lives in the Javascript ecosystem, with its great front-end and back-end communities.
This interesting intersection led us to explore and experiment with the odd possibilities of using Javascript and Machine Learning together. Sharing from our research, here are some neat libraries which bring Javascript, Machine Learning, DNN and even NLP together. Take a look.

1. Brain.js

Brain.js is a Javascript library for neural networks that replaces the (now deprecated) “brain” library. It can be used with Node.js or in the browser, and provides different types of networks for different tasks. Here is a demo of training the network to recognize color contrast.

Training Brain.js color contrast recognition

2. Synaptic

Synaptic is a Javascript neural network library for Node.js and the browser that lets you train first- and even second-order neural network architectures. The project includes a few built-in architectures like multilayer perceptrons, multilayer long short-term memory (LSTM) networks and liquid state machines, plus a trainer capable of training a variety of networks.

Training Synaptic image-filter perceptron

3. Neataptic

This library provides fast neuro-evolution and backpropagation for the browser and Node.js, with a few built-in networks including perceptron, LSTM, GRU, NARX and more. Here is a rookie tutorial for simple training.

Neataptic target-seeking AI demo

4. ConvNetJS

Developed by a Stanford PhD, this popular library hasn’t been maintained for the past 4 years, but it is definitely one of the most interesting projects on the list. It’s a Javascript implementation of neural networks that supports common modules, classification, regression and an experimental reinforcement learning module, and it can even train convolutional networks that process images.

ConvNetJS demo of toy 2D classification with a 2-layer neural network

5. Webdnn

This Japanese-made library is built to run pre-trained deep neural network (DNN) models in the browser, fast. Since executing a DNN in a browser consumes a lot of computational resources, this framework compresses the model data and accelerates execution through JavaScript APIs such as WebAssembly and WebGPU.

Neural style transfer example

6. Deeplearnjs

This popular library allows you to train neural networks in the browser or run pre-trained models in inference mode, and even claims it can be used as a NumPy for the web. With an easy-to-pick-up API, this library can be used for a variety of useful applications and is actively maintained.


Deeplearnjs teachable machine web-demo

7. Tensorflow Deep Playground

Deep playground is an interactive visualization of neural networks, written in TypeScript using d3.js. Although this project is basically a simple playground for TensorFlow, it can be repurposed for other ends or used as a very impressive educational tool.

Tensorflow web playground

8. Compromise

This very popular library provides “modest natural-language processing in javascript”. It’s pretty basic and straightforward, and even compiles down to a single small file. Its modest, “good enough” approach makes it a prime candidate for almost any app in need of basic NLP.

9. Neuro.js

This beautiful project is a deep learning and reinforcement learning Javascript framework for the browser. It implements a full-stack neural-network-based machine learning framework with extended reinforcement learning support; some consider this project the successor of ConvNetJS.

Self-driving cars with Neuro.js

10. mljs

A group of repositories developed by the mljs organization, providing machine learning tools for Javascript, including supervised and unsupervised learning, artificial neural networks, regression algorithms and supporting libraries for statistics, math, etc. Here’s a short walkthrough.

mljs projects on GitHub

11. Mind

A flexible neural network library for Node.js and the browser which learns to make predictions, using a matrix implementation to process training data and allowing a configurable network topology. You can also plug in “minds” that have already been trained, which can be useful for your apps.

Really? 0/5? Way to predict, mind!

Honorable mentions:

Natural

An actively maintained library for Node.js which provides tokenizing, stemming (reducing a word to a not-necessarily morphological root), classification, phonetics, tf-idf, WordNet, string similarity, and more.

Incubator-mxnet

Apache MXNet is a deep learning framework that allows you to mix symbolic and imperative programming on the fly, with a graph optimization layer for performance. MXNet.js brings a deep learning inference API to the browser.

Keras JS

This library runs Keras models in the browser, with GPU support using WebGL. Since Keras uses a number of frameworks as backends, the models can be trained in TensorFlow, CNTK and other frameworks as well.

Deepforge

A development environment for deep learning that enables you to quickly design neural network architectures and machine learning pipelines with built-in version control for experiment reproduction. Worth checking out.

Land Lines

Not so much a library as a very cool demo / web game based on a Chrome Experiment by Google. Although I’m not sure what to do with it, it’s guaranteed to become the most enjoyable 15 minutes of your day.

Land lines by Google