Dealing with Unbalanced Classes, SVMs, Random Forests and Decision Trees in Python
So far I have talked about decision trees and ensembles, and I hope I have conveyed the logic behind these concepts without getting too deep into the mathematical details. In this post, let’s get into action: I will implement the concepts that we learned in those two blog posts. The only concept I haven’t discussed is the SVM; I suggest you watch Professor Andrew Ng’s week 7 videos on Coursera.
Can a winemaker predict how a wine will be received based on the chemical properties of the wine? Are there chemical indicators that correlate more strongly with the perceived “quality” of a wine? In this problem we’ll examine the wine quality dataset hosted on the UCI website. This data records 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. In this problem, we will only look at the data for red wine.
Let me first import the libraries.
I import only the data for red wine, then build a pandas dataframe and print the head.
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
I have the feature data, usually labeled as X, and the target data, labeled Y. Every row in the matrix X is a datapoint (i.e. a wine) and every column in X is a feature of the data (e.g. pH). For a classification problem, Y is a column vector containing the class of every datapoint.
I will use the quality column as my target variable. I am going to save the quality column as a separate numpy array (labeled Y) and remove (drop) the quality column from the dataframe.
Also, I will simplify the problem as a binary one in which wines are either “bad” (score<7) or “good” (score≥7). This means that I am going to change the Y array accordingly such that it only contains zeros (“bad” wines) and ones (“good” wines). For example, if originally Y=[1,3,8,4,7], the new Y should be [0,0,1,0,1].
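I then use the as_matrix function in pandas to save the feature information in my dataframe as a numpy array. This is my X matrix. (as_matrix has since been deprecated and was removed in pandas 1.0; to_numpy() or the .values attribute does the same job in current versions.)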
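Here is a minimal sketch of how X and Y could be built. The helper name make_xy and the toy dataframe are my own stand-ins (so that the snippet runs without downloading anything); with the real data you would pass wine_df instead.

```python
import numpy as np
import pandas as pd


def make_xy(df, target="quality", threshold=7):
    """Split a wine dataframe into a feature matrix X and binary labels Y."""
    # 1 = "good" (score >= threshold), 0 = "bad" (score < threshold)
    Y = (df[target].to_numpy() >= threshold).astype(int)
    # to_numpy() is the current replacement for the old as_matrix()
    X = df.drop(columns=[target]).to_numpy()
    return X, Y


# Made-up rows standing in for wine_df; the quality values match the example in the text.
toy_df = pd.DataFrame({
    "pH": [3.5, 3.2, 3.3, 3.1, 3.4],
    "alcohol": [9.4, 9.8, 12.5, 10.0, 11.2],
    "quality": [1, 3, 8, 4, 7],
})

X, Y = make_xy(toy_df)
print(Y)        # [0 0 1 0 1]
print(X.shape)  # (5, 2)
```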
My goal is to predict the target Y (the quality of the wine) as a function of the features X. Because I defined Y in the previous section as a binary variable (bad as 0 and good as 1), this is a classification problem. First I will use random forests to classify the quality of the wine; later on I will apply an SVM and decision trees to this dataset.
As you know, a random forest aggregates a group of decision trees. It adds randomness in two ways: first, it samples with replacement (bootstrap sampling) from the training data and fits a tree to each of these samples; second, when choosing a split inside a tree, it considers only a random subset of the features.
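To make the two sources of randomness concrete, here is a hand-rolled sketch of a tiny forest. It is a simplification, not how scikit-learn implements RandomForestClassifier (which averages class probabilities rather than counting votes), and the synthetic data from make_classification is only an assumption that lets the snippet run offline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data with 11 features, like the wine dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

trees = []
for _ in range(15):
    # Randomness 1: bootstrap sample (draw rows with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Randomness 2: each split only looks at a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate by majority vote (15 trees, so there are no ties).
votes = np.mean([t.predict(X) for t in trees], axis=0)
forest_pred = (votes > 0.5).astype(int)
train_accuracy = (forest_pred == y).mean()
print("training accuracy of the hand-rolled forest:", train_accuracy)
```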
There are many ways to construct a random forest — these differences in the method of construction are described as tuning parameters. One of the most important tuning parameters in building a random forest is the number of trees to construct.
Here, I am going to apply the random forest classifier to the wine data and use cross-validation to explore how the score of the classifier changes when varying the number of trees in the forest. I am going to use the random forest classifier function in the scikit-learn library and the cross_val_score function (using the default scoring method) to plot the scores of the random forests as a function of the number of trees in the random forest, ranging from 1 (simple decision tree) to 40. I am going to use 10-fold cross-validation.
If you don’t know what is meant by parameter selection and cross-validation, please watch the week 6 videos of Coursera’s machine learning course.
First I import RandomForestClassifier and cross_val_score from the scikit-learn library. n_estimators is the parameter specifying the number of trees in RandomForestClassifier. Then I create a list of all the classifier scores for forests with 1 to 40 trees (range(1, 41) in Python).
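The loop might look roughly like the sketch below. To keep it quick and runnable offline I make two assumptions: the data is a synthetic, imbalanced stand-in from make_classification (the 85/15 split is arbitrary), and I only try a handful of tree counts instead of every value from 1 to 40.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the wine features X and binary labels Y.
X, Y = make_classification(
    n_samples=300, n_features=11, weights=[0.85], random_state=0
)

tree_counts = [1, 2, 5, 10, 20, 40]  # use range(1, 41) for the full sweep
scores = []
for n_trees in tree_counts:
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # cv=10 gives 10-fold cross-validation; the default score for a classifier is accuracy.
    scores.append(cross_val_score(clf, X, Y, cv=10))

scores = np.array(scores)  # one row per tree count, one column per fold
for n_trees, fold_scores in zip(tree_counts, scores):
    print(n_trees, "trees -> mean accuracy", round(fold_scores.mean(), 3))
```

Each row of scores holds the ten fold scores for one forest size, which is the shape a boxplot per forest size needs.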
Let me show you what is going on with a random forest classifier that has 2 trees.
I get one classification score for each cross-validation fold. Here I have fixed the number of trees at 2 and the number of folds at 10.
You can see that I have used seaborn to plot my scores.
Notice that accuracy seems to improve with additional trees. However, you should weigh the computational cost of fitting additional trees against the small accuracy benefit.
Evaluating The Unbalanced Classes
In a binary classification problem, accuracy can be misleading if one class (say, bad wine) is much more common than the other (say, good wine). In that case the classes are said to be unbalanced.
I print the percentage of wines that are labeled as “bad” in the dataset and draw the boxplot again, but this time I add a line across the plot denoting the accuracy of always guessing zero (“bad wine”).
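The baseline is easy to compute: the accuracy of always guessing zero equals the share of zeros in Y. The sketch below uses a made-up label vector (17 bad, 3 good) purely for illustration; with the real data you would use the Y array built earlier.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration only: 17 "bad" wines (0) and 3 "good" wines (1).
Y = np.array([0] * 17 + [1] * 3)

frac_bad = np.mean(Y == 0)
print(f"Share of 'bad' wines: {100 * frac_bad:.1f}%")  # Share of 'bad' wines: 85.0%

# A "classifier" that always predicts the majority class.
always_bad = np.zeros_like(Y)
baseline_accuracy = accuracy_score(Y, always_bad)
print("Accuracy of always guessing 0:", baseline_accuracy)
```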
When there are unbalanced classes in a dataset, guessing the more common class will often yield very high accuracy. For this reason, you usually want to use different metrics that are less sensitive to imbalance when evaluating the predictive performance of classifiers.
The goal is to successfully identify the members of the positive (rare) class; these could be the good wines, or patients presenting with a rare disease. For this you have to use precision and recall.
I am not going to discuss precision and recall here, so please watch Professor Andrew Ng’s week 6 system design videos. Because precision and recall both provide valuable information about the quality of a classifier, you often want to combine them into a single general-purpose score. The F1 score is defined as the harmonic mean of recall and precision:
F1 = (2 x recall x precision) / (recall + precision)
The F1 score thus tends to favor classifiers that are strong in both precision and recall, rather than classifiers that emphasize one at the cost of the other.
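As a quick worked example, the sketch below computes precision, recall and F1 by hand for a small made-up pair of label vectors and checks the result against sklearn.metrics.f1_score. The vectors are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: 4 positives in total; the classifier finds 2 and raises 1 false alarm.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives  = 2
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives = 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives = 2

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 1/2
f1_manual = 2 * recall * precision / (recall + precision)  # 4/7

print(round(f1_manual, 3))                 # 0.571
print(round(f1_score(y_true, y_pred), 3))  # 0.571
```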
This may all seem very complicated, but computing F1 scores with scikit-learn is very easy. You just have to change the scoring parameter of the cross_val_score function, and that’s it.
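In code, the change is a single keyword argument. The sketch below compares accuracy and F1 for a 15-tree forest; the synthetic imbalanced data is again just a stand-in that I am assuming so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the wine data (about 85% class 0).
X, Y = make_classification(
    n_samples=300, n_features=11, weights=[0.85], random_state=0
)

clf = RandomForestClassifier(n_estimators=15, random_state=0)

accuracy_scores = cross_val_score(clf, X, Y, cv=10)          # default scoring: accuracy
f1_scores = cross_val_score(clf, X, Y, cv=10, scoring="f1")  # F1 on the positive class

print("mean accuracy:", round(accuracy_scores.mean(), 3))
print("mean F1:      ", round(f1_scores.mean(), 3))
```

On imbalanced data the F1 numbers are usually much less flattering than the accuracy numbers, which is the point of switching metrics.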
You can see that the scores are clustered around the 40% mark. There is now very little gain from increasing the number of trees.
Setting the cutoff value for prediction
Many classifiers (including random forests) can return prediction probabilities. For example, given a point X, there might be a 70% probability that it belongs to class 1 and a 30% probability that it belongs to class 0. However, when the classes in the training data are unbalanced, the probabilities calculated by the classifier can be inaccurate, because many classifiers do not adjust for the imbalance. This problem can be addressed with calibration.
If a classifier’s prediction probabilities are accurate, the appropriate way to convert its probabilities into predictions is to simply choose the class with probability > 0.5. This is the default behavior of classifiers when you call their predict method. When the probabilities are inaccurate, this does not work well, but you can still get good predictions by choosing a more appropriate cutoff. In this section, I will choose a cutoff by cross validation.
First, you have to understand how the predict_proba method works in scikit-learn. I will illustrate this with an example.
I am going to fit a random forest classifier to the wine data using 15 trees. Then I compute the predicted probabilities that the classifier assigns to each of the training examples; this can be done using the predict_proba method. As a test case, I am going to construct a prediction from these probabilities that labels every wine whose predicted probability of being in class 1 is greater than 0.5 with a 1, and with a 0 otherwise. For example, if the probabilities are [0.1,0.4,0.5,0.6,0.7], the predictions should be [0,0,0,1,1].
predict_proba returns two columns: the first holds the probability of class 0 and the second the probability of class 1.
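A minimal sketch of this test case follows. The first part reproduces the small example from the text; the second part does the same thing with a 15-tree forest fitted on synthetic stand-in data (an assumption, so the snippet runs without the wine file).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# The toy example from the text: strictly greater than 0.5 maps to class 1.
probabilities = np.array([0.1, 0.4, 0.5, 0.6, 0.7])
toy_predictions = (probabilities > 0.5).astype(int)
print(toy_predictions)  # [0 0 0 1 1]

# The same idea with a real classifier on synthetic stand-in data.
X, Y = make_classification(
    n_samples=300, n_features=11, weights=[0.85], random_state=0
)
clf = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, Y)

proba = clf.predict_proba(X)   # shape (n_samples, 2): column 0 = class 0, column 1 = class 1
prob_good = proba[:, 1]        # probability of class 1 ("good")
manual_predictions = (prob_good > 0.5).astype(int)

# With a 0.5 cutoff this reproduces what clf.predict returns.
print(np.array_equal(manual_predictions, clf.predict(X)))
```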
This was just an example to show you how things work.
Using 10-fold cross validation, I am going to find a cutoff in np.arange(0.1,0.9,0.1) that gives the best average F1 score when converting prediction probabilities from a 15-tree random forest classifier into predictions.
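custom_f1(cutoff) takes a cutoff value and produces the F1 score obtained when the predicted probabilities are converted into predictions at that cutoff; the candidate cutoffs are the values in np.arange(0.1,0.9,0.1). sklearn.metrics.f1_score accepts the real y and the predicted y as parameters and returns the F1 score.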
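One way to write such a helper is as a function that returns a scorer, which can then be passed to cross_val_score through its scoring parameter. This is a sketch of how custom_f1 might look, not necessarily the exact code behind the plot; the inner function name f1_cutoff and the synthetic data are my assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score


def custom_f1(cutoff):
    """Build a scorer that thresholds predict_proba at `cutoff` and returns F1."""
    def f1_cutoff(clf, X, y):
        y_pred = (clf.predict_proba(X)[:, 1] > cutoff).astype(int)
        return f1_score(y, y_pred)
    return f1_cutoff


# Synthetic, imbalanced stand-in for the wine data.
X, Y = make_classification(
    n_samples=300, n_features=11, weights=[0.85], random_state=0
)

cutoffs = np.arange(0.1, 0.9, 0.1)
clf = RandomForestClassifier(n_estimators=15, random_state=0)

# One array of 10 fold scores per cutoff.
scores_by_cutoff = {
    float(np.round(c, 1)): cross_val_score(clf, X, Y, cv=10, scoring=custom_f1(c))
    for c in cutoffs
}

for cutoff, fold_scores in scores_by_cutoff.items():
    print(cutoff, round(fold_scores.mean(), 3))
```

scikit-learn may warn when a fold ends up with no positive predictions at a high cutoff; in that case f1_score reports 0 for that fold.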
Then I use a box plot to show the scores.
A cutoff of about 0.3-0.5 appears to give the best predictive performance. It is intuitive that the best cutoff is at or below 0.5: the training data contains far fewer examples of “good” wines, so you need to adjust the classifier’s cutoff to reflect the fact that good wines are, in general, rarer.
Visualizing The Decision Boundary
A trained classifier takes in X and tries to predict the target variable Y. You can visualize how the classifier translates different inputs X into a guess for Y by plotting the classifier’s prediction probability (that is, for a given class c, the assigned probability that Y=c) as a function of the features X. Most classifiers in scikit-learn have a method called predict_proba that computes this quantity for new examples after the classifier has been trained. One common visual summary of a classifier built from it is its decision boundary.
Decision surface visualizations are really only meaningful if they are plotted against inputs X that are one- or two-dimensional. So before I plot these surfaces, I will first find two “important” dimensions of X to focus on. In my previous blog posts I discussed truncated SVD and PCA as ways to perform dimensionality reduction. Here, I will use a different dimension reduction method based on random forests.
Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. This heuristic measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the datapoints in the training set. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is. Importance can be a useful way to select a small number of features for visualization. This is called variable importance, and I talked about it in my previous post.
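Now I am going to train a random forest classifier on the wine data using 15 trees and use its feature_importances_ attribute to obtain the relative importance of the features, which are the columns of the dataframe. Note that in scikit-learn this attribute is impurity-based (the mean decrease in impurity across the trees) rather than the permutation heuristic described above; the permutation version is available separately as sklearn.inspection.permutation_importance. Then I plot a simple bar plot to show the relative importance of the named features.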
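Here is a small sketch of that step. The feature names feature_0 to feature_10 and the synthetic data are placeholders I am assuming; with the real data you would use the dataframe's column names and the X and Y arrays from earlier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names standing in for the 11 wine feature columns.
feature_names = [f"feature_{i}" for i in range(11)]
X, Y = make_classification(
    n_samples=300, n_features=11, weights=[0.85], random_state=0
)

clf = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, Y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# The two most important features are the ones to keep for 2D visualization.
top_two = importances.index[:2].tolist()
print("top two features:", top_two)

# With matplotlib installed, importances.sort_values().plot.barh()
# would draw a horizontal bar chart of these values.
```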
It is always nice to visualize the features, and the matplotlib documentation has very good recipes for plots. I used one such recipe to construct the horizontal bar chart above.
The decision surfaces for the decision tree and random forest are very complex. The decision tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The random forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.
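The surfaces being compared can be computed by evaluating predict_proba on a grid over the two chosen features. The sketch below shows the idea under several assumptions of mine: two-dimensional synthetic data instead of the two selected wine features, a 15-tree forest, and an SVC with C=1.0, gamma=1.0 and probability=True so that it exposes predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Two-dimensional synthetic stand-in for the two most important wine features.
X2, Y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.85], random_state=0,
)

# A 50 x 50 grid covering the range of both features.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 50),
)
grid = np.c_[xx.ravel(), yy.ravel()]

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=15, random_state=0),
    "SVM": SVC(C=1.0, gamma=1.0, probability=True, random_state=0),
}

surfaces = {}
for name, model in models.items():
    model.fit(X2, Y)
    # Probability of class 1 at every grid point, reshaped back to the grid.
    surfaces[name] = model.predict_proba(grid)[:, 1].reshape(xx.shape)
    levels = len(np.unique(surfaces[name]))
    print(f"{name}: {levels} distinct probability levels on the grid")

# Each surface can be drawn with matplotlib, e.g. plt.contourf(xx, yy, surfaces["SVM"]).
```

A fully grown decision tree only outputs probabilities of 0 or 1 here, which is why its surface looks so extreme compared with the other two.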
The SVM implementation in sklearn has an optional parameter class_weight. This parameter is set to None by default, but it also provides an auto mode, which uses the values of the labels Y to automatically adjust weights inversely proportional to class frequencies. (In current scikit-learn releases this mode is called ‘balanced’; the ‘auto’ option was deprecated and later removed.) I am going to draw the decision boundaries for two SVM classifiers. I am going to use C=1.0 and gamma=1.0 for both models, but for the first SVM I set class_weight to None, and for the second SVM I set class_weight to ‘auto’.
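A sketch of the comparison, written for current scikit-learn where the weighting mode is spelled 'balanced' instead of 'auto'. The two-dimensional synthetic data is an assumption standing in for the two selected wine features, and precision and recall are computed on the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import SVC

# Two-dimensional, imbalanced synthetic stand-in data.
X2, Y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.85], random_state=0,
)

results = {}
for label, class_weight in [("None", None), ("balanced", "balanced")]:
    # Same C and gamma for both models; only the class weighting differs.
    svm = SVC(C=1.0, gamma=1.0, class_weight=class_weight).fit(X2, Y)
    pred = svm.predict(X2)
    results[label] = {
        "precision": precision_score(Y, pred),
        "recall": recall_score(Y, pred),
        "predicted_positives": int(pred.sum()),
    }

for label, metrics in results.items():
    print(label, metrics)
```

Weighting the rare class more heavily typically pushes the model to predict it more often, trading some precision for recall.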
The first SVM, with equal class weights, classifies only a small subset of the positive training points correctly, but it also produces very few false positive predictions on the training set. Thus, it has higher precision but lower recall than the second SVM with the auto weighting option. The overall performance of the SVMs seems to be quite poor, with a lot of misclassified data points for both models. To improve the performance you would have to tune the parameters (C and class_weight).
What other things can you do to improve the performance of the classifiers?