Getting Started with Random Forests
Published: 2019-06-19



Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn, or to predict disease risk and susceptibility in patients.

Random forest is capable of regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.

This is a post about random forests using Python.

What is a Random Forest?

Random forest is a solid choice for nearly any prediction problem (even non-linear ones). It's a relatively new machine learning strategy (it came out of Bell Labs in the 90s) and it can be used for just about anything. It belongs to a larger class of machine learning algorithms called ensemble methods.

Ensemble Learning

Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier.

Random forest is a brand of ensemble learning, as it relies on an ensemble of decision trees.

Randomized Decision Trees

So we know that random forest is an aggregation of other models, but what types of models is it aggregating? As you might have guessed from its name, random forest aggregates decision trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.

Random Forest

The algorithm to induce a random forest will create a bunch of random decision trees automatically. Since the trees are generated at random, most won't be all that meaningful to learning your classification/regression problem (maybe 99.9% of trees).

[Figure: an example decision tree]

If an observation has a length of 45, blue eyes, and 2 legs, it's going to be classified as red.
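
As a rough illustration (not from the original post), here is a single decision tree fit on a tiny made-up dataset with the same three features as the figure; the column names and values are invented purely to mirror the example:

# A single decision tree on made-up data shaped like the figure above
# (length, blue eyes, number of legs -> color).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy = pd.DataFrame({
    'length':    [45, 12, 50, 30, 47, 8],
    'blue_eyes': [1, 0, 1, 0, 1, 0],
    'legs':      [2, 4, 2, 4, 2, 4],
    'color':     ['red', 'blue', 'red', 'blue', 'red', 'blue'],
})

tree = DecisionTreeClassifier()
tree.fit(toy[['length', 'blue_eyes', 'legs']], toy['color'])

# an observation with length 45, blue eyes, and 2 legs
new_obs = pd.DataFrame([[45, 1, 2]], columns=['length', 'blue_eyes', 'legs'])
print(tree.predict(new_obs))   # expected: ['red']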

Arboreal Voting

So what good are 10,000 (probably) bad models? Well, it turns out that they really aren't that helpful. But what is helpful are the few really good decision trees that you also generated along with the bad ones.

When you make a prediction, the new observation gets pushed down each decision tree and assigned a predicted value/label. Once each of the trees in the forest has reported its predicted value/label, the predictions are tallied up and the mode vote of all trees is returned as the final prediction.

Put simply, the 99.9% of trees that are irrelevant make predictions that are all over the map and cancel each other out. The predictions of the minority of trees that are good rise above that noise and yield a good prediction.
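
To make the voting concrete, here is a minimal sketch, using the iris data as a stand-in, that grows a handful of decision trees on bootstrap samples and takes the majority vote by hand. In practice, scikit-learn's RandomForestClassifier does all of this for you:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

# grow a handful of trees, each on its own bootstrap sample of the rows
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(max_features='sqrt').fit(X[idx], y[idx]))

# every tree votes; the most common label across trees is the forest's answer
votes = np.array([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
forest_pred = np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])
print((forest_pred == y).mean())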

[Figure: a random forest]

Why should I use it?

It's Easy

Random forest is the Leatherman of learning methods. You can throw pretty much anything at it and it'll do a serviceable job. It does a particularly good job of estimating inferred transformations, and, as a result, doesn't need much tuning the way an SVM does (i.e. it's good for folks with tight deadlines).

An Example Transformation

Random forest is capable of learning without carefully crafted data transformations. Take the f(x) = log(x) function, for example.

Create some fake data and add a little noise.

import numpy as np

# 1,000 uniformly distributed x values, with a little Gaussian noise added to log(x)
x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)

[Figure: log(x) with added noise]

If we try to build a basic linear model to predict y using x, we wind up with a straight line that sort of bisects the log(x) function. Whereas if we use a random forest, it does a much better job of approximating the log(x) curve and we get something that looks much more like the true function.

[Figures: linear model vs. random forest fit to the log(x) data]
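
Continuing from the snippet above (so x, y, and np are already defined), a hedged sketch of that comparison might look like this; the plotting code from the post is omitted:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# x and y come from the earlier snippet
X = x.reshape(-1, 1)              # scikit-learn expects a 2-D feature matrix

lm = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, n_jobs=2).fit(X, y)

grid = np.linspace(1, 100, 500).reshape(-1, 1)
true_curve = np.log(grid.ravel())

# mean absolute error of each model against the true log curve
print(np.abs(lm.predict(grid) - true_curve).mean())
print(np.abs(rf.predict(grid) - true_curve).mean())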

You could argue that the random forest overfits the log(x) function a little bit. Either way, I think this does a nice job of illustrating how the random forest isn't bound by linear constraints.

Uses

Variable Selection

One of the best use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.

In the following examples, we're trying to figure out which variables are most important for classifying a wine as being red or white.

[Figures: wine feature importances; feature count vs. F1 score]
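
The red/white wine data used in the post isn't bundled with scikit-learn, so as a stand-in here is the same idea sketched on scikit-learn's built-in wine dataset; the feature_importances_ mechanics are identical:

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
clf = RandomForestClassifier(n_estimators=100, n_jobs=2)
clf.fit(wine.data, wine.target)

# rank the features by how much each one contributed across the trees
importances = pd.Series(clf.feature_importances_, index=wine.feature_names)
print(importances.sort_values(ascending=False))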

Classification

Random forest is also great for classification. It can be used to make predictions for categories with multiple possible values, and it can be calibrated to output probabilities as well. One thing you do need to watch out for is overfitting. Random forest can be prone to overfitting, especially when working with relatively small datasets. You should be suspicious if your model is making "too good" predictions on your test set.

One way to combat overfitting is to only use really relevant features in your model. While this isn't always cut and dried, using a feature selection technique (like the one mentioned previously) can make it a lot easier.
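
One quick sanity check (my addition, not from the original post) is to compare the training score with a cross-validated score; a big gap between the two is a warning sign:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
clf = RandomForestClassifier(n_estimators=100, n_jobs=2)
clf.fit(wine.data, wine.target)

# near-perfect training accuracy paired with a much lower cross-validated
# score suggests the forest has memorized the training set
print("train accuracy:", clf.score(wine.data, wine.target))
print("cv accuracy:   ", cross_val_score(clf, wine.data, wine.target, cv=5).mean())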

[Figure: predicting wine type]

Regression

Yep. It does regression too.

I've found that random forest, unlike other algorithms, does really well learning on categorical variables or a mixture of categorical and real variables. Categorical variables with high cardinality (number of possible values) can be tricky, so having something like this in your back pocket can come in quite useful.
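
As a sketch of what that can look like in practice, with entirely made-up data and column names, one simple route is to one-hot encode the categorical columns and hand everything to RandomForestRegressor:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# fabricated data: two categorical columns and one numeric column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'region':  rng.choice(['north', 'south', 'east', 'west'], 500),
    'channel': rng.choice(['web', 'phone', 'store'], 500),
    'spend':   rng.uniform(0, 1000, 500),
})
df['revenue'] = df['spend'] * 1.3 + (df['region'] == 'north') * 200 + rng.normal(0, 50, 500)

# one-hot encode the categorical columns; numeric columns pass through unchanged
X = pd.get_dummies(df[['region', 'channel', 'spend']])
rf = RandomForestRegressor(n_estimators=100, n_jobs=2).fit(X, df['revenue'])
print(rf.score(X, df['revenue']))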

A Short Python Example

Scikit-Learn is a great way to get started with random forest. The scikit-learn API is extremely consistent across algorithms, so you can horse-race and switch between models very easily. A lot of times I start with something simple and then move to random forest.

One of the best features of the random forest implementation in scikit-learn is the n_jobs parameter. This will automatically parallelize fitting your random forest based on the number of cores you want to use. There's even a talk by scikit-learn contributor Olivier Grisel where he discusses training a random forest on a 20 node EC2 cluster.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# load the iris data into a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# random ~75/25 train/test split and a readable species column
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train'] == True], df[df['is_train'] == False]
features = df.columns[:4]

# fit the forest on two cores and predict the held-out rows
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

Looks pretty good!

[Figure: confusion matrix for the iris predictions]

Final Thoughts

Random forests are remarkably easy to use given how advanced they are. As with any modeling, be wary of overfitting. If you're interested in getting started with random forest in R, check out the randomForest package.

Reposted from: https://my.oschina.net/u/1450520/blog/686644
