
PREDICTING LOAN CREDIT RISK USING APACHE SPARK MACHINE LEARNING RANDOM FORESTS

source: https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/

July 12, 2016 | By Carol McDonald
In this blog post, I’ll help you get started using Apache Spark’s spark.ml Random Forests for classification of bank loan credit risk. The goal of Spark’s spark.ml library is to provide a set of APIs on top of DataFrames that helps users create and tune machine learning workflows or pipelines. Using spark.ml with DataFrames improves performance through intelligent optimizations.

CLASSIFICATION

Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (for example, whether a transaction is fraudulent or not), based on labeled examples of known items (for example, transactions known to be fraudulent or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the “if questions” that you ask. The label is the answer to those questions. For example, if it walks, swims, and quacks like a duck, then the label is "duck".
Let’s go through an example of Credit Risk for Bank Loans:
  • What are we trying to predict?
    • Whether a person will pay back a loan or not.
    • This is the Label: the creditability of a person.
  • What are the “if questions” or properties that you can use to predict?
    • An applicant’s demographic and socio-economic profile: occupation, age, savings, marital status...
    • These are the Features. To build a classifier model, you extract the features of interest that most contribute to the classification.

DECISION TREES

Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. A possible decision path for predicting Credit Risk is shown below; the feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes. (A small code sketch of this decision logic follows the example.)
  • Q1: Is checking account balance > 200 DM?
    • No
      • Q2: Is length of current employment > 1 year?
        • No
          • Not Creditable
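To make the evaluation order concrete, here is a minimal Scala sketch of that decision path written as nested if/else checks. This is purely illustrative: the checkingBalanceDM and employmentYears inputs and the outcomes on the “yes” branches are assumptions, not part of the model the classifier will learn later.
// toy decision tree mirroring the example above (illustrative only; not the learned model)
def toyCreditDecision(checkingBalanceDM: Double, employmentYears: Double): String = {
  if (checkingBalanceDM > 200) {
    "Creditable"                 // assumed outcome for the "yes" branch of Q1
  } else if (employmentYears > 1) {
    "Creditable"                 // assumed outcome for the "yes" branch of Q2
  } else {
    "Not Creditable"             // the "no"/"no" path shown in the example
  }
}
toyCreditDecision(150, 0.5)      // returns "Not Creditable"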

RANDOM FORESTS

Ensemble learning algorithms combine multiple machine learning algorithms to obtain a better model. Random Forest is a popular ensemble learning method for classification and regression. The algorithm builds a model consisting of multiple decision trees, based on different subsets of data at the training stage. Predictions are made by combining the output from all of the trees, which reduces the variance and improves the predictive accuracy. For Random Forest classification, each tree’s prediction is counted as a vote for one class; the label is predicted to be the class which receives the most votes. (A toy illustration of the voting step follows.)
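As a rough, non-Spark illustration of the voting step, the sketch below takes a hypothetical set of per-tree class votes (0 = not creditable, 1 = creditable) and picks the class predicted most often; Spark’s RandomForestClassifier performs this aggregation internally.
// hypothetical per-tree votes; Spark's random forest does this aggregation for you
val treePredictions: Seq[Double] = Seq(1.0, 0.0, 1.0, 1.0, 0.0)
val predictedLabel = treePredictions
  .groupBy(identity)      // group the votes by class
  .mapValues(_.size)      // count the votes per class
  .maxBy(_._2)._1         // take the class with the most votes
// predictedLabel is 1.0 here, since class 1 received 3 of the 5 votes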

ANALYZE CREDIT RISK WITH SPARK MACHINE LEARNING SCENARIO

Our data is from the German Credit Data Set, which classifies people, described by a set of attributes, as good or bad credit risks. For each bank loan application the attributes are encoded as numeric values, one application per line. The germancredit.csv file has the following format:
1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1
1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,2,1,1
1,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,1,1,1
In this scenario, we will build a random forest of decision trees to predict the label/classification of creditable or not creditable, based on the following features:
  • Label → Creditable or Not Creditable (1 or 0)
  • Features → {balance, history, purpose…}

SOFTWARE

This tutorial will run on Spark 1.6.1
Log into the MapR Sandbox, as explained in Getting Started with Spark on MapR Sandbox, using userid user01 and password mapr. Copy the sample data file to your sandbox home directory /user/user01 using scp. (Note: you may have to update the Spark version on your Sandbox.) Start the Spark shell with:
$ spark-shell --master local[1]

LOAD AND PARSE THE DATA FROM A CSV FILE

First, we will import the machine learning packages.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }
import org.apache.spark.ml.{ Pipeline, PipelineStage }
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import sqlContext.implicits._
import sqlContext._
We use a Scala case class to define the Credit schema corresponding to a line in the csv data file.
// define the Credit Schema
case class Credit(
    creditability: Double,
    balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
    savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
    residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
    credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
  )
The functions below parse a line from the data file into the Credit class. A 1 is subtracted from some categorical values so that they all consistently start with 0.
// function to create a Credit class from an Array of Double
def parseCredit(line: Array[Double]): Credit = {
    Credit(
      line(0),
      line(1) - 1, line(2), line(3), line(4) , line(5),
      line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
      line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
      line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1
    )
  }
// function to transform an RDD of Strings into an RDD of Array[Double]
  def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
    rdd.map(_.split(",")).map(_.map(_.toDouble))
  }
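As a quick sanity check (not part of the original walkthrough), you can run the two parse functions on one of the sample CSV lines shown earlier, assuming the Spark shell and the imports above:
// parse a single sample line into a Credit object (sanity check only)
val sampleLine = "1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1"
val sampleCredit = parseRDD(sc.parallelize(Seq(sampleLine))).map(parseCredit).first()
// sampleCredit.balance is 0.0 because parseCredit subtracts 1 from the raw value of 1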
Below we load the data from the germancredit.csv file into an RDD of Strings. Then we use the map transformation on the RDD, which applies the parseRDD function to transform each String element into an Array of Double. Then we use another map transformation, which applies the parseCredit function to transform each Array of Double into a Credit object. The toDF() method transforms the RDD of Credit objects into a DataFrame with the Credit class schema.
// load the data into an RDD
val creditDF= parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
creditDF.registerTempTable("credit")
The DataFrame printSchema() method prints the schema to the console in tree format.
// Return the schema of this DataFrame
creditDF.printSchema
 root
 |-- creditability: double (nullable = false)
 |-- balance: double (nullable = false)
 |-- duration: double (nullable = false)
 |-- history: double (nullable = false)
 |-- purpose: double (nullable = false)
 |-- amount: double (nullable = false)
 |-- savings: double (nullable = false)
 |-- employment: double (nullable = false)
 |-- instPercent: double (nullable = false)
 |-- sexMarried: double (nullable = false)
 |-- guarantors: double (nullable = false)
 |-- residenceDuration: double (nullable = false)
 |-- assets: double (nullable = false)
 |-- age: double (nullable = false)
 |-- concCredit: double (nullable = false)
 |-- apartment: double (nullable = false)
 |-- credits: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- dependents: double (nullable = false)
 |-- hasPhone: double (nullable = false)
 |-- foreign: double (nullable = false)
// Display the top 20 rows of the DataFrame
creditDF.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     9.0|    4.0|    0.0|2799.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    1.0|    12.0|    2.0|    9.0| 841.0|    1.0|       3.0|        2.0|       1.0|       0.0|              3.0|   0.0|23.0|       2.0|      0.0|    0.0|       1.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|    12.0|    4.0|    0.0|2122.0|    0.0|       2.0|        3.0|       2.0|       0.0|              1.0|   0.0|39.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    0.0|    12.0|    4.0|    0.0|2171.0|    0.0|       2.0|        4.0|       2.0|       0.0|              3.0|   1.0|38.0|       0.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
|          1.0|    0.0|    10.0|    4.0|    0.0|2241.0|    0.0|       1.0|        1.0|       2.0|       0.0|              2.0|   0.0|48.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    0.0|     8.0|    4.0|    0.0|3398.0|    0.0|       3.0|        1.0|       2.0|       0.0|              3.0|   0.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
|          1.0|    0.0|     6.0|    4.0|    0.0|1361.0|    0.0|       1.0|        2.0|       2.0|       0.0|              3.0|   0.0|40.0|       2.0|      1.0|    0.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    3.0|    18.0|    4.0|    3.0|1098.0|    0.0|       0.0|        4.0|       1.0|       0.0|              3.0|   2.0|65.0|       2.0|      1.0|    1.0|       0.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    24.0|    2.0|    3.0|3758.0|    2.0|       0.0|        1.0|       1.0|       0.0|              3.0|   3.0|23.0|       2.0|      0.0|    0.0|       0.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|    11.0|    4.0|    0.0|3905.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    0.0|    30.0|    4.0|    1.0|6187.0|    1.0|       3.0|        1.0|       3.0|       0.0|              3.0|   2.0|24.0|       2.0|      0.0|    1.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     6.0|    4.0|    3.0|1957.0|    0.0|       3.0|        1.0|       1.0|       0.0|              3.0|   2.0|31.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    48.0|    3.0|   10.0|7582.0|    1.0|       0.0|        2.0|       2.0|       0.0|              3.0|   3.0|31.0|       2.0|      1.0|    0.0|       3.0|       0.0|     1.0|    0.0|
|          1.0|    0.0|    18.0|    2.0|    3.0|1936.0|    4.0|       3.0|        2.0|       3.0|       0.0|              3.0|   2.0|23.0|       2.0|      0.0|    1.0|       1.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     6.0|    2.0|    3.0|2647.0|    2.0|       2.0|        2.0|       2.0|       0.0|              2.0|   0.0|44.0|       2.0|      0.0|    0.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    0.0|    11.0|    4.0|    0.0|3939.0|    0.0|       2.0|        1.0|       2.0|       0.0|              1.0|   0.0|40.0|       2.0|      1.0|    1.0|       1.0|       1.0|     0.0|    0.0|
|          1.0|    1.0|    18.0|    2.0|    3.0|3213.0|    2.0|       1.0|        1.0|       3.0|       0.0|              2.0|   0.0|25.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    36.0|    4.0|    3.0|2337.0|    0.0|       4.0|        4.0|       2.0|       0.0|              3.0|   0.0|36.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    3.0|    11.0|    4.0|    0.0|7228.0|    0.0|       2.0|        1.0|       2.0|       0.0|              3.0|   1.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
After a DataFrame is instantiated, you can query it using SQL or the DataFrame API. Here are some example queries using the Scala DataFrame API:
describe computes statistics for numeric columns, including count, mean, stddev, min, and max.
// compute statistics for balance
  creditDF.describe("balance").show
 +-------+-----------------+
|summary|          balance|
+-------+-----------------+
|  count|             1000|
|   mean|            1.577|
| stddev|1.257637727110893|
|    min|              0.0|
|    max|              3.0|
+-------+-----------------+

// compute the avg balance by creditability (the label)
 creditDF.groupBy("creditability").avg("balance").show
 +-------------+------------------+
|creditability|      avg(balance)|
+-------------+------------------+
|          1.0|1.8657142857142857|
|          0.0|0.9033333333333333|
+-------------+------------------+
You can register a DataFrame as a temporary table using a given name, and then run SQL statements using the sql methods provided by sqlContext. Here are some example queries using sqlContext:
// Compute the average balance, amount, duration grouped by creditability
 sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, avg(duration) as avgdur  FROM credit GROUP BY creditability ").show
 +-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|1.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|0.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+

EXTRACT FEATURES

To build a classifier model, you first extract the features that most contribute to the classification. In the German Credit data set, the data is labeled with two classes: 1 (creditable) and 0 (not creditable).
The features for each item consist of the fields shown below:
  • Label → creditable: 0 or 1
  • Features → {"balance", "duration", "history", "purpose", "amount", "savings", "employment", "instPercent", "sexMarried", "guarantors", "residenceDuration", "assets", "age", "concCredit", "apartment", "credits", "occupation", "dependents", "hasPhone", "foreign"}

DEFINE FEATURES ARRAY

In order for the features to be used by a machine learning algorithm, the features are transformed and put into Feature Vectors, which are vectors of numbers representing the value for each feature.
Below, a VectorAssembler is used to transform and return a new DataFrame with all of the feature columns in a vector column.
// define the feature columns to put in the feature vector
val featureCols = Array("balance", "duration", "history", "purpose", "amount",
    "savings", "employment", "instPercent", "sexMarried",  "guarantors",
    "residenceDuration", "assets",  "age", "concCredit", "apartment",
    "credits",  "occupation", "dependents",  "hasPhone", "foreign" )
// set the input and output column names
  val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
// return a DataFrame with all of the feature columns in a vector column
val df2 = assembler.transform(creditDF)
// the transform method produced a new column: features.
df2.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|
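The features column holds a Spark ML Vector; rows with many zero values are printed in sparse form as (size, [indices], [values]), which is why the output above appears truncated. To see the full vectors you can display them without truncation, for example:
// show the assembled feature vectors without truncating the output
df2.select("features").show(3, false)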
Next, we use a StringIndexer to return a DataFrame with the creditability column added as the label.
// Create a label column with the StringIndexer
val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
// the transform method produced a new column: label.
df3.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|label|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|  0.0|
Below, the data is split into a training data set and a test data set: 70% of the data is used to train the model, and 30% will be used for testing.
// split the dataframe into training and test data
val splitSeed = 5043
val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
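Because randomSplit works on proportions, the exact row counts can vary slightly around 700/300; a quick sanity check (not in the original tutorial) is to count each split:
// check how many of the 1000 rows landed in each split (roughly 700 / 300)
trainingData.count()
testData.count()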

TRAIN THE MODEL

Next, we train a RandomForest Classifier with the parameters:
  • maxDepth: Maximum depth of a tree. Increasing the depth makes the model more powerful, but deep trees take longer to train.
  • maxBins: Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.
  • impurity: Criterion used for the information gain calculation.
  • featureSubsetStrategy ("auto"): Automatically select the number of features to consider for splits at each tree node.
  • seed: Use a fixed random seed number, allowing the results to be repeated.
The model is trained by making associations between the input features and the labeled output associated with those features.
// create the classifier, set parameters for training
val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
// use the random forest classifier to train (fit) the model
val model = classifier.fit(trainingData)

// print out the random forest trees
model.toDebugString
res20: String =
"RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
  Tree 0 (weight 1.0):
    If (feature 0 <= 1.0) ...
    Else (feature 0 > 1.0) ...
  Tree 1 (weight 1.0):
    If (feature 2 <= 1.0) ...
    Else (feature 2 > 1.0) ...
  Tree 2 (weight 1.0):
    If (feature 8 <= 1.0) ...
    Else (feature 8 > 1.0) ...
  Tree 3 ...
  [the nested split rules for each tree are truncated here]
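Besides printing the trees, the fitted RandomForestClassificationModel exposes a featureImportances vector (available in Spark 1.5 and later) that can be matched back to the featureCols array defined earlier. A small sketch, assuming the model and featureCols above:
// pair each feature name with its importance and print the most influential ones
featureCols.zip(model.featureImportances.toArray)
  .sortBy(-_._2)
  .take(5)
  .foreach { case (name, importance) => println(f"$name%-20s $importance%.4f") }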

TEST THE MODEL

Next we use the test data to get predictions.
// run the model on test features to get predictions
val predictions = model.transform(testData)
// As you can see, the model transform produced new columns: rawPrediction, probability, and prediction.
predictions.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+--------------------+--------------------+----------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|label|       rawPrediction|         probability|prediction|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+--------------------+--------------------+----------+
|          0.0|    0.0|    12.0|    0.0|    5.0|1108.0|    0.0|       3.0|        4.0|       2.0|       0.0|              2.0|   0.0|28.0|       2.0|      1.0|    1.0|       2.0|       0.0|     0.0|    0.0|(20,[1,3,4,6,7,8,...|  1.0|[14.1964586927573...|[0.70982293463786...|       0.0|
Below we evaluate the predictions. We use a BinaryClassificationEvaluator, which returns the area under the ROC curve as its metric, comparing the test label column with the rawPrediction column. In this case the evaluation returns an area under the ROC curve of roughly 78%.
// create an Evaluator for binary classification, which expects two input columns: rawPrediction and label.
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
// Evaluate predictions; returns a scalar metric, areaUnderROC (larger is better).
val accuracy = evaluator.evaluate(predictions)
accuracy: Double = 0.7824906081835722
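By default the BinaryClassificationEvaluator reports areaUnderROC; you can also ask it for the area under the precision-recall curve via setMetricName, for example:
// explicitly choose which metric the evaluator computes
val rocEvaluator = new BinaryClassificationEvaluator().setLabelCol("label").setMetricName("areaUnderROC")
val prEvaluator = new BinaryClassificationEvaluator().setLabelCol("label").setMetricName("areaUnderPR")
rocEvaluator.evaluate(predictions)
prEvaluator.evaluate(predictions)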

USING AN ML PIPELINE

We will next train the model using a pipeline, which can give better results. A pipeline provides a simple way to try out different combinations of parameters using a process called grid search, where you set up the parameters to test, and Spark ML will test all the combinations. Pipelines make it easy to tune an entire model-building workflow at once, rather than tuning each element in the pipeline separately.
Below we use the ParamGridBuilder utility to construct the parameter grid.
// We use a ParamGridBuilder to construct a grid of parameters to search over
val paramGrid = new ParamGridBuilder()
  .addGrid(classifier.maxBins, Array(25, 28, 31))
  .addGrid(classifier.maxDepth, Array(4, 6, 8))
  .addGrid(classifier.impurity, Array("entropy", "gini"))
  .build()
Create and set up a pipeline. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
val steps: Array[PipelineStage] = Array(classifier)
val pipeline = new Pipeline().setStages(steps)
We use the CrossValidator class for model selection. The CrossValidator uses an Estimator, a set of ParamMaps, and an Evaluator. Note that using a CrossValidator can be very expensive (a cheaper alternative is sketched after the CrossValidator setup below).
// Evaluate model on test instances and compute test error
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)
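If 10-fold cross-validation is too slow on the Sandbox, Spark also provides TrainValidationSplit (in org.apache.spark.ml.tuning), which evaluates each parameter combination only once on a single random split. A sketch that reuses the pipeline, evaluator, and paramGrid above:
import org.apache.spark.ml.tuning.TrainValidationSplit
// cheaper alternative to CrossValidator: a single train/validation split instead of k folds
val tvs = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)    // 75% of the data for training, 25% for validation
// val tvsModel = tvs.fit(trainingData)   // fit in the same way as the CrossValidator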
The pipeline automatically optimizes by exploring the parameter grid: for each ParamMap, the CrossValidator trains the given Estimator and evaluates it using the given Evaluator, then it fits the best Estimator using the best ParamMap and the entire dataset.
// When fit is called, the stages are executed in order.
// Fit will run cross-validation and choose the best set of parameters.
// The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers.
val pipelineFittedModel = cv.fit(trainingData)
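You can also inspect which parameter combination won the grid search by looking at the best model inside the fitted CrossValidatorModel, for example:
// look at the random forest stage of the winning pipeline and its parameters
val bestPipelineModel = pipelineFittedModel.bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel]
println(bestPipelineModel.stages(0))
bestPipelineModel.stages(0).extractParamMap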
Now we can evaluate the pipeline’s best-fitted model by comparing test predictions with test labels. The evaluator now returns an area under the ROC curve of about 82%, compared to roughly 78% before.
// call transform to make predictions on test data; the fitted model will use the best model found
val predictions = pipelineFittedModel.transform(testData)
val accuracy = evaluator.evaluate(predictions)  
Double = 0.8204386232104784
// Calculate binary classification metrics
val predictionAndLabels = predictions.select("prediction", "label").rdd.map(x =>
  (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
// A Precision-Recall curve plots (precision, recall) points for different threshold values, while a receiver operating characteristic (ROC) curve plots (false positive rate, recall) points.
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
 area under the precision-recall curve: 0.6482521795731916
area under the receiver operating characteristic (ROC) curve : 0.6332876434155752
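Beyond the area-under-curve metrics, a simple way to see where the model errs is to count the label/prediction combinations directly on the predictions DataFrame, which is essentially a confusion matrix:
// count each (label, prediction) combination; the rows where label != prediction are the errors
predictions.groupBy("label", "prediction").count().show()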

WANT TO LEARN MORE?

In this blog post, we showed you how to get started using Apache Spark’s machine learning Random Forests and ml pipelines for classification. If you have any further questions about this tutorial, please ask them in the comments section below.
