Thursday, August 3, 2017

The 8.2 Real Estate Measures



[Graphic News] A summary of when the 8.2 real estate measures take effect

[8·2 Real Estate Measures] Speculation-overheated districts, speculation zones, adjustment areas... what exactly is the difference?
Taking direct aim at owners of multiple homes, with a focus on shutting out speculators




8·2 Real Estate Measures: strengthened capital gains tax


<Key Terms>
1. Speculation-overheated district (투기과열지구)
2. Speculation zone (투기지역)
3. Adjustment target area (조정 대상 지역)
4. Loan-to-value ratio (LTV)
5. Debt-to-income ratio (DTI)
6. Price cap on new apartment sales (분양가 상한제)
7. Reconstruction excess-profit recovery system (재건축초과이익 환수제)
8. Capital gains tax (양도소득세)


Tuesday, May 30, 2017

References: Core Spark Concepts and Use Cases


<References: Core Spark Concepts and Use Cases>

http://www.comworld.co.kr/news/articleView.html?idxno=47869
http://www.slideshare.net/rxin/stanford-cs347-guest-lecture-apache-spark
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://spark.apache.org



https://spark-summit.org/east-2017/events/what-to-expect-for-big-data-and-apache-spark-in-2017
https://www.youtube.com/watch?v=kmrWkU0PCCs

<Spark Architecture>

http://spark.apache.org/docs/latest/cluster-overview.html


http://www-bcf.usc.edu/~minlanyu/teach/csci599-fall12/papers/nsdi_spark.pdf
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
courses.csail.mit.edu/18.337/2015/docs/6338.pptx

<Spark Summit Surveys>
http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf

<Use Cases>
ING : https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51013
Goldman Sachs: http://www.slideshare.net/SparkSummit/how-spark-is-making-an-impact-at-goldman-sachs-by-vincent-saulys
Baidu: http://www.slideshare.net/SparkSummit/how-spark-fits-into-baidus-scale-james-peng
Toyota: http://www.slideshare.net/SparkSummit/brian-kursar
Netflix : https://cdn.oreillystatic.com/en/assets/1/event/132/Netflix_%20Integrating%20Spark%20at%20petabyte%20scale%20Presentation.pdf
Telefonica: https://spark-summit.org/2014/spark-use-case-at-telefonica-cbs/

Tuesday, April 4, 2017

Example Source: Advanced Analytics with Spark

source: https://github.com/sryza/aas

Advanced Analytics with Spark
Table of Contents:
1. Recommending Music with the Audioscrobbler Data Set
2. Predicting Forest Cover with Decision Trees
3. Anomaly Detection in Network Traffic with K-means Clustering
4. Understanding Wikipedia with Latent Semantic Analysis
5. Analyzing Co-occurrence Networks with GraphX
6. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
7. Estimating Financial Risk through Monte Carlo Simulation
8. Analyzing Genomics Data and the BDG Project
9. Analyzing Neuroimaging Data with PySpark and Thunder












Code to accompany Advanced Analytics with Spark from O'Reilly Media





Advanced Analytics with Spark Source Code

Advanced Analytics with Spark

1st Edition (current)

The source to accompany the 1st edition may be found in the 1st-edition branch.

2nd Edition (coming H1 2017)

The source to accompany the 2nd edition is found in this, the default master branch.

Build

Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project, run mvn package to compile artifacts into target/ subdirectories beneath each chapter's directory.

Data Sets


Monday, April 3, 2017

Applied Data Mining and Statistical Learning-Analysis of German Credit Data

source: https://onlinecourses.science.psu.edu/stat857/node/215

Analysis of German Credit Data

Data mining is a critical step in knowledge discovery, involving theories, methodologies and tools for revealing patterns in data. It is important to understand the rationale behind the methods so that the tools and methods chosen fit the data and the objective of pattern recognition. There may be several tool options available for a data set.
When a bank receives a loan application, based on the applicant’s profile the bank has to make a decision regarding whether to go ahead with the loan approval or not. Two types of risks are associated with the bank’s decision –
  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank
  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank

Objective of Analysis:

Minimization of risk and maximization of profit on behalf of the bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application.
The German Credit Data contains 20 variables and the classification of each of 1000 loan applicants as a Good or Bad credit risk. A predictive model developed on this data is expected to give a bank manager guidance for deciding whether to approve a loan to a prospective applicant based on his/her profile.
The following analytical approaches are taken:
  • Logistic regression: the response is binary (Good or Bad credit risk) and several predictors are available (a minimal Spark sketch of this approach follows this list)
  • Discriminant analysis
  • Tree-based methods and random forests
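The PSU page carries out these analyses in its own environment; purely as a hypothetical Scala sketch of the first approach (assuming a Spark 1.6 spark-shell with the spark-csv package, and the 21-column schema used in the random forest post below), logistic regression on this data might look like:

// Hypothetical sketch only: logistic regression on the German credit data with spark.ml.
// Assumes a Spark 1.6 spark-shell (sc, sqlContext available), the spark-csv package,
// and the 21-column schema used in the random forest post later in these notes.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val colNames = Array("creditability", "balance", "duration", "history", "purpose", "amount",
  "savings", "employment", "instPercent", "sexMarried", "guarantors",
  "residenceDuration", "assets", "age", "concCredit", "apartment",
  "credits", "occupation", "dependents", "hasPhone", "foreign")

val germanDF = sqlContext.read
  .format("com.databricks.spark.csv")     // spark-csv package (assumption)
  .option("inferSchema", "true")
  .load("germancredit.csv")
  .toDF(colNames: _*)
  .withColumn("creditability", col("creditability").cast("double"))

// Put all predictors into a single feature vector, then fit the model
val assembler = new VectorAssembler()
  .setInputCols(colNames.filter(_ != "creditability"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("creditability")
  .setFeaturesCol("features")

val lrModel = lr.fit(assembler.transform(germanDF))
println(lrModel.coefficients)              // one weight per predictor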

Thursday, March 30, 2017

PREDICTING LOAN CREDIT RISK USING APACHE SPARK MACHINE LEARNING RANDOM FORESTS

source: https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/

July 12, 2016 | BY Carol McDonald
In this blog post, I'll help you get started using Apache Spark's spark.ml Random Forests for classification of bank loan credit risk. The goal of Spark's spark.ml library is to provide a set of APIs on top of DataFrames that help users create and tune machine learning workflows or pipelines. Using spark.ml with DataFrames improves performance through intelligent optimizations.

CLASSIFICATION

Classification is a family of supervised machine learning algorithms that identify which category an item belongs to (for example whether a transaction is fraud or not fraud), based on labeled examples of known items (for example transactions known to be fraud or not). Classification takes a set of data with known labels and pre-determined features and learns how to label new records based on that information. Features are the “if questions” that you ask. The label is the answer to those questions. In the example below, if it walks, swims, and quacks like a duck, then the label is "duck".
Let’s go through an example of Credit Risk for Bank Loans:
  • What are we trying to predict?
  • Whether a person will pay back a loan or not.
  • This is the Label: The Creditability of a person.
  • What are the “if questions” or properties that you can use to predict ?
  • An applicant's demographic and socio-economic profile: occupation, age, savings, marital status...
  • These are the Features. To build a classifier model, you extract the features of interest that most contribute to the classification (a small sketch of one such labeled example follows this list).
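As a tiny illustrative sketch (not from the original post), one labeled training example for this problem could be represented as a record holding the label we want to predict plus a few of the "if question" features; the field names here are hypothetical stand-ins for the real data set's attributes.

// Illustrative only: one labeled example for the credit-risk problem.
case class Applicant(
  creditable: Boolean,       // the Label: will the person pay back the loan?
  occupation: String,        // the Features: properties used to predict the label
  age: Int,
  savings: Double,
  maritalStatus: String
)

val example = Applicant(creditable = true, occupation = "engineer",
  age = 34, savings = 2500.0, maritalStatus = "single")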

DECISION TREES

Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. A possible decision tree for predicting Credit Risk is shown below. The feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes.
  • Q1: Is checking account balance > 200DM ?
    • no
    • Q2: Is Length of current employment > 1 year?
      • No
      • Not Creditable
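The outline above only shows the "no" branches (the full tree appeared as a figure in the original post). Assuming, purely for illustration, that the "yes" branches lead to "Creditable", the same rule can be written as nested if/else checks:

// Sketch of the outlined decision rule; the "yes" branches are assumed, not taken from the post.
def creditRisk(checkingBalanceDM: Double, employmentYears: Double): String = {
  if (checkingBalanceDM > 200) {
    "Creditable"               // Q1: balance > 200 DM -> "yes" branch (assumed)
  } else if (employmentYears > 1) {
    "Creditable"               // Q2: employment > 1 year -> "yes" branch (assumed)
  } else {
    "Not Creditable"           // Q1 "no" and Q2 "no" -> Not Creditable (from the outline)
  }
}

creditRisk(checkingBalanceDM = 150.0, employmentYears = 0.5)   // "Not Creditable"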

RANDOM FORESTS

Ensemble learning algorithms combine multiple machine learning algorithms to obtain a better model. Random Forest is a popular ensemble learning method for classification and regression. The algorithm builds a model consisting of multiple decision trees, based on different subsets of the data at the training stage. Predictions are made by combining the output from all of the trees, which reduces the variance and improves the predictive accuracy. For Random Forest classification, each tree's prediction is counted as a vote for one class; the label is predicted to be the class that receives the most votes.
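As a small sketch of the voting step (illustrative only, not Spark's internal implementation), each tree can be viewed as a function from a feature vector to a predicted class, and the forest predicts the class with the most votes:

// Illustrative majority-vote combiner; `trees` stands in for the ensemble's decision trees.
def forestPredict(trees: Seq[Array[Double] => Double])(features: Array[Double]): Double = {
  val votes = trees.map(tree => tree(features))    // one predicted class per tree
  votes.groupBy(identity).maxBy(_._2.size)._1      // class with the most votes wins
}

// e.g. three stub "trees" that each predict a fixed class:
val stubTrees: Seq[Array[Double] => Double] = Seq(_ => 1.0, _ => 1.0, _ => 0.0)
forestPredict(stubTrees)(Array(0.0, 18.0, 4.0))    // 1.0 (two votes against one)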

ANALYZE CREDIT RISK WITH SPARK MACHINE LEARNING SCENARIO

Our data is from the German Credit Data Set, which classifies people described by a set of attributes as good or bad credit risks. For each bank loan application we have the applicant's attributes (balance, duration, history, purpose, amount, and so on, as listed in the schema below) together with the creditability classification.
The German credit CSV file has the following format:
1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1
1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,2,1,1
1,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,1,1,1
In this scenario, we will build a random forest of decision trees to predict the label (Creditable or Not Creditable) based on the following features (a quick sketch of this split follows the list):
  • Label → Creditable or Not Creditable (1 or 0)
  • Features → {balance, history, purpose…}
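As a quick sketch, splitting the first sample CSV row shown above into its label (the first field, creditability) and the remaining fields used as features:

// The first sample row from germancredit.csv shown above
val row = "1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1"
val fields = row.split(",").map(_.toDouble)

val label    = fields.head     // 1.0 -> Creditable
val features = fields.tail     // balance, duration, history, purpose, amount, ...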

SOFTWARE

This tutorial will run on Spark 1.6.1
Log into the MapR Sandbox, as explained in Getting Started with Spark on MapR Sandbox, using userid user01, password mapr. Copy the sample data file to your sandbox home directory /user/user01 using scp. (Note: you may have to update the Spark version on your Sandbox.) Start the Spark shell with:
$ spark-shell --master local[1]

LOAD AND PARSE THE DATA FROM A CSV FILE

First, we will import the machine learning packages.
(In the code boxes, comments are in Green and output is in Blue)
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import sqlContext.implicits._
import sqlContext._
import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }
import org.apache.spark.ml.{ Pipeline, PipelineStage }
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
We use a Scala case class to define the Credit schema corresponding to a line in the csv data file.
**// define the Credit Schema**
case class Credit(
    creditability: Double,
    balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
    savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
    residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
    credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
  )
The functions below parse a line from the data file into the Credit class. A 1 is subtracted from some categorical values so that they all consistently start with 0.
**// function to create a  Credit class from an Array of Double**
def parseCredit(line: Array[Double]): Credit = {
    Credit(
      line(0),
      line(1) - 1, line(2), line(3), line(4) , line(5),
      line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
      line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
      line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1
    )
  }
**// function to transform an RDD of Strings into an RDD of Double**
  def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
    rdd.map(_.split(",")).map(_.map(_.toDouble))
  }
Below we load the data from the germancredit.csv file into an RDD of Strings. Then we use the map transformation on the RDD, which applies the parseRDD function to transform each String element into an Array of Double. Then we use another map transformation, which applies the parseCredit function to transform each Array of Double into a Credit object. The toDF() method transforms the RDD of Credit objects into a DataFrame with the Credit class schema.
**// load the data into a  RDD**
val creditDF= parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
creditDF.registerTempTable("credit")
The DataFrame printSchema() method prints the schema to the console in a tree format:
**// Return the schema of this DataFrame**
creditDF.printSchema
 root
 |-- creditability: double (nullable = false)
 |-- balance: double (nullable = false)
 |-- duration: double (nullable = false)
 |-- history: double (nullable = false)
 |-- purpose: double (nullable = false)
 |-- amount: double (nullable = false)
 |-- savings: double (nullable = false)
 |-- employment: double (nullable = false)
 |-- instPercent: double (nullable = false)
 |-- sexMarried: double (nullable = false)
 |-- guarantors: double (nullable = false)
 |-- residenceDuration: double (nullable = false)
 |-- assets: double (nullable = false)
 |-- age: double (nullable = false)
 |-- concCredit: double (nullable = false)
 |-- apartment: double (nullable = false)
 |-- credits: double (nullable = false)
 |-- occupation: double (nullable = false)
 |-- dependents: double (nullable = false)
 |-- hasPhone: double (nullable = false)
 |-- foreign: double (nullable = false)
**// Display the top 20 rows of DataFrame**
creditDF.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     9.0|    4.0|    0.0|2799.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    1.0|    12.0|    2.0|    9.0| 841.0|    1.0|       3.0|        2.0|       1.0|       0.0|              3.0|   0.0|23.0|       2.0|      0.0|    0.0|       1.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|    12.0|    4.0|    0.0|2122.0|    0.0|       2.0|        3.0|       2.0|       0.0|              1.0|   0.0|39.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    0.0|    12.0|    4.0|    0.0|2171.0|    0.0|       2.0|        4.0|       2.0|       0.0|              3.0|   1.0|38.0|       0.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
|          1.0|    0.0|    10.0|    4.0|    0.0|2241.0|    0.0|       1.0|        1.0|       2.0|       0.0|              2.0|   0.0|48.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    0.0|     8.0|    4.0|    0.0|3398.0|    0.0|       3.0|        1.0|       2.0|       0.0|              3.0|   0.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
|          1.0|    0.0|     6.0|    4.0|    0.0|1361.0|    0.0|       1.0|        2.0|       2.0|       0.0|              3.0|   0.0|40.0|       2.0|      1.0|    0.0|       1.0|       1.0|     0.0|    1.0|
|          1.0|    3.0|    18.0|    4.0|    3.0|1098.0|    0.0|       0.0|        4.0|       1.0|       0.0|              3.0|   2.0|65.0|       2.0|      1.0|    1.0|       0.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    24.0|    2.0|    3.0|3758.0|    2.0|       0.0|        1.0|       1.0|       0.0|              3.0|   3.0|23.0|       2.0|      0.0|    0.0|       0.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|    11.0|    4.0|    0.0|3905.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    0.0|    30.0|    4.0|    1.0|6187.0|    1.0|       3.0|        1.0|       3.0|       0.0|              3.0|   2.0|24.0|       2.0|      0.0|    1.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     6.0|    4.0|    3.0|1957.0|    0.0|       3.0|        1.0|       1.0|       0.0|              3.0|   2.0|31.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    48.0|    3.0|   10.0|7582.0|    1.0|       0.0|        2.0|       2.0|       0.0|              3.0|   3.0|31.0|       2.0|      1.0|    0.0|       3.0|       0.0|     1.0|    0.0|
|          1.0|    0.0|    18.0|    2.0|    3.0|1936.0|    4.0|       3.0|        2.0|       3.0|       0.0|              3.0|   2.0|23.0|       2.0|      0.0|    1.0|       1.0|       0.0|     0.0|    0.0|
|          1.0|    0.0|     6.0|    2.0|    3.0|2647.0|    2.0|       2.0|        2.0|       2.0|       0.0|              2.0|   0.0|44.0|       2.0|      0.0|    0.0|       2.0|       1.0|     0.0|    0.0|
|          1.0|    0.0|    11.0|    4.0|    0.0|3939.0|    0.0|       2.0|        1.0|       2.0|       0.0|              1.0|   0.0|40.0|       2.0|      1.0|    1.0|       1.0|       1.0|     0.0|    0.0|
|          1.0|    1.0|    18.0|    2.0|    3.0|3213.0|    2.0|       1.0|        1.0|       3.0|       0.0|              2.0|   0.0|25.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    1.0|    36.0|    4.0|    3.0|2337.0|    0.0|       4.0|        4.0|       2.0|       0.0|              3.0|   0.0|36.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
|          1.0|    3.0|    11.0|    4.0|    0.0|7228.0|    0.0|       2.0|        1.0|       2.0|       0.0|              3.0|   1.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    0.0|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
After a DataFrame is instantiated, you can query it. Here are some example queries using the Scala DataFrame API:
describe() computes statistics for numeric columns, including count, mean, stddev, min, and max.
**//  computes statistics for balance**
  creditDF.describe("balance").show
 +-------+-----------------+
|summary|          balance|
+-------+-----------------+
|  count|             1000|
|   mean|            1.577|
| stddev|1.257637727110893|
|    min|              0.0|
|    max|              3.0|
+-------+-----------------+

**// compute the avg balance by creditability (the label)**
 creditDF.groupBy("creditability").avg("balance").show
 +-------------+------------------+
|creditability|      avg(balance)|
+-------------+------------------+
|          1.0|1.8657142857142857|
|          0.0|0.9033333333333333|
+-------------+------------------+
You can register a DataFrame as a temporary table using a given name, and then run SQL statements using the sql methods provided by sqlContext. Here are some example queries using sqlContext:
**// Compute the average balance, amount, duration grouped by creditability**
 sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, avg(duration) as avgdur  FROM credit GROUP BY creditability ").show
 +-------------+------------------+------------------+------------------+
|creditability|        avgbalance|            avgamt|            avgdur|
+-------------+------------------+------------------+------------------+
|          1.0|1.8657142857142857| 2985.442857142857|19.207142857142856|
|          0.0|0.9033333333333333|3938.1266666666666|             24.86|
+-------------+------------------+------------------+------------------+

EXTRACT FEATURES

To build a classifier model, you first extract the features that most contribute to the classification. In the German credit data set the data is labeled with two classes – 1 (creditable) and 0 (not creditable).
The features for each item consist of the fields shown below:
  • Label → creditable: 0 or 1
  • Features → {"balance", "duration", "history", "purpose", "amount", "savings", "employment", "instPercent", "sexMarried", "guarantors", "residenceDuration", "assets", "age", "concCredit", "apartment", "credits", "occupation", "dependents", "hasPhone", "foreign"}

DEFINE FEATURES ARRAY

In order for the features to be used by a machine learning algorithm, the features are transformed and put into Feature Vectors, which are vectors of numbers representing the value for each feature.
Below, a VectorAssembler is used to transform and return a new DataFrame with all of the feature columns in a vector column.
**//define the feature columns to put in the feature vector**
val featureCols = Array("balance", "duration", "history", "purpose", "amount",
    "savings", "employment", "instPercent", "sexMarried",  "guarantors",
    "residenceDuration", "assets",  "age", "concCredit", "apartment",
    "credits",  "occupation", "dependents",  "hasPhone", "foreign" )
**//set the input and output column names**
  val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
**//return a dataframe with all of the  feature columns in  a vector column**
val df2 = assembler.transform( creditDF)
**// the transform method produced a new column: features.**
df2.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|
Next, we use a StringIndexer to return a DataFrame with the creditability column added as a label.
**//  Create a label column with the StringIndexer**
val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
**// the  transform method produced a new column: label.**
df3.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|label|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
|          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|  0.0|
Below, the data is split into a training data set and a test data set: 70% of the data is used to train the model, and 30% will be used for testing.
**//  split the dataframe into training and test data**
val splitSeed = 5043
val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)

TRAIN THE MODEL

Next, we train a RandomForest Classifier with the parameters:
  • maxDepth: Maximum depth of a tree. Increasing the depth makes the model more powerful, but deep trees take longer to train.
  • maxBins: Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.
  • impurity: criterion used for the information gain calculation
  • featureSubsetStrategy ("auto"): automatically select the number of features to consider for splits at each tree node
  • seed: use a fixed random seed number, allowing the results to be repeated
The model is trained by making associations between the input features and the labeled output associated with those features.
**// create the classifier,  set parameters for training**
val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
**//  use the random forest classifier  to train (fit) the model**
val model = classifier.fit(trainingData)

**// print out the random forest trees**
model.toDebugString
res5: String =
"RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
  Tree 0 (weight 1.0):
    If (feature 0 <= 1.0)
      ...
    Else (feature 0 > 1.0)
      ...
  Tree 1 (weight 1.0):
    ...
  Tree 2 (weight 1.0):
    ...
  Tree 3 ...
(each tree is printed as nested If/Else splits on feature thresholds ending in Predict: 0.0 or Predict: 1.0; the full output is truncated here)

TEST THE MODEL

Next we use the test data to get predictions.
**// run the  model on test features to get predictions**
val predictions = model.transform(testData)
**// As you can see, the model transform produced new columns: rawPrediction, probability and prediction.**
predictions.show
 +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+--------------------+--------------------+----------+
|creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|label|       rawPrediction|         probability|prediction|
+-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+--------------------+--------------------+----------+
|          0.0|    0.0|    12.0|    0.0|    5.0|1108.0|    0.0|       3.0|        4.0|       2.0|       0.0|              2.0|   0.0|28.0|       2.0|      1.0|    1.0|       2.0|       0.0|     0.0|    0.0|(20,[1,3,4,6,7,8,...|  1.0|[14.1964586927573...|[0.70982293463786...|       0.0|
Below we evaluate the predictions. We use a BinaryClassificationEvaluator, which returns the area under the ROC curve as its metric, comparing the test label column with the test prediction column. In this case the evaluation returns an area under the ROC curve of about 0.78.
**// create an Evaluator for binary classification, which expects two input columns: rawPrediction and label.**
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
**// Evaluates predictions and returns a scalar metric areaUnderROC(larger is better).**
val accuracy = evaluator.evaluate(predictions)
accuracy: Double = 0.7824906081835722

USING AN ML PIPELINE

We will next train the model using a pipeline, which can give better results. A pipeline provides a simple way to try out different combinations of parameters using a process called grid search, where you set up the parameters to test and MLlib tests all the combinations. Pipelines make it easy to tune an entire model building workflow at once, rather than tuning each element in the pipeline separately.
Below we use the ParamGridBuilder utility to construct the parameter grid.
_**// We use a ParamGridBuilder to construct a grid of parameters to search over**_
val paramGrid = new ParamGridBuilder()
  .addGrid(classifier.maxBins, Array(25, 28, 31))
  .addGrid(classifier.maxDepth, Array(4, 6, 8))
  .addGrid(classifier.impurity, Array("entropy", "gini"))
  .build()
Create and set up a pipeline. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
val steps: Array[PipelineStage] = Array(classifier)
val pipeline = new Pipeline().setStages(steps)
We use the CrossValidator class for model selection. The CrossValidator uses an Estimator, a set of ParamMaps, and an Evaluator. Note that using a CrossValidator can be very expensive.
**// Evaluate model on test instances and compute test error**
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)
The pipeline automatically optimizes by exploring the parameter grid: for each ParamMap, the CrossValidator trains the given Estimator and evaluates it using the given Evaluator, then it fits the best Estimator using the best ParamMap and the entire dataset.
**// When fit is called, the stages are executed in order.
// Fit will run cross-validation,  and choose the best set of parameters
//The fitted model from a Pipeline is an PipelineModel, which consists of fitted models and transformers**
val pipelineFittedModel = cv.fit(trainingData)
Now we can evaluate the pipeline's best-fitted model by comparing test predictions with test labels. The evaluator now returns an area under the ROC curve of about 0.82, compared to 0.78 before.
**// call transform to make predictions on the test data; the fitted model will use the best model found**
val predictions = pipelineFittedModel.transform(testData)
val accuracy = evaluator.evaluate(predictions)  
Double = 0.8204386232104784
**// Calculate Binary Classification Metrics**
val predictionAndLabels =predictions.select("prediction", "label").rdd.map(x =>
  (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
**// A Precision-Recall curve plots (precision, recall) points for different threshold values, while a receiver operating characteristic, or ROC, curve plots (recall, false positive rate) points.**
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
 area under the precision-recall curve: 0.6482521795731916
area under the receiver operating characteristic (ROC) curve : 0.6332876434155752

WANT TO LEARN MORE?

In this blog post, we showed you how to get started using Apache Spark’s machine learning Random Forests and ml pipelines for classification. If you have any further questions about this tutorial, please ask them in the comments section below.