Tuesday, April 4, 2017

Example Source: Advanced Analytics with Spark

source: https://github.com/sryza/aas

Advanced Analytics with Spark
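Table of contents:
1. Recommending Music with the Audioscrobbler Data Set
2. Predicting Forest Cover with Decision Trees
3. Anomaly Detection in Network Traffic with K-means Clustering
4. Understanding Wikipedia with Latent Semantic Analysis
5. Analyzing Co-occurrence Networks with GraphX
6. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
7. Estimating Financial Risk through Monte Carlo Simulation
8. Analyzing Genomics Data and the BDG Project
9. Analyzing Neuroimaging Data with PySpark and Thunder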












Code to accompany Advanced Analytics with Spark from O'Reilly Media




Latest commit e8754e0, committed by @sryza 2 days ago: Fix LSA issues and harmonize with the text (#104)



Advanced Analytics with Spark Source Code


1st Edition (current)

The source to accompany the 1st edition may be found in the 1st-edition branch.

2nd Edition (coming H1 2017)

The source to accompany the 2nd edition is found in this, the default master branch.

Build

Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project, run mvn package to compile artifacts into target/ subdirectories beneath each chapter's directory.

Data Sets


Monday, April 3, 2017

Applied Data Mining and Statistical Learning-Analysis of German Credit Data

source: https://onlinecourses.science.psu.edu/stat857/node/215

Analysis of German Credit Data

Data mining is a critical step in knowledge discovery; it involves theories, methodologies, and tools for revealing patterns in data. It is important to understand the rationale behind the methods so that the tools chosen fit both the data and the objective of pattern recognition. Several tools may be suitable for a given data set.
When a bank receives a loan application, it must decide, based on the applicant’s profile, whether to approve the loan. Two types of risk are associated with the bank’s decision:
  • If the applicant is a good credit risk, i.e. is likely to repay the loan, then rejecting the application results in a loss of business for the bank.
  • If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the application results in a financial loss for the bank.

Objective of Analysis:

Minimization of risk and maximization of profit on behalf of the bank.
To minimize its losses, the bank needs a decision rule for which applicants to approve and which to reject. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on a loan application.
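The two risks described above can be turned into a simple decision rule once each mistake is given a cost. The following is a minimal sketch, not the course's method: the function names and the cost values (a default costing five times as much as a lost good customer) are assumptions chosen purely for illustration, and `p_bad` stands for whatever estimated probability of default a model supplies.

```python
def expected_loss(p_bad, approve, loss_if_default=5.0, lost_profit=1.0):
    """Expected loss of one decision, given the probability the applicant is a bad risk.

    The cost values are hypothetical placeholders, not figures from the source.
    """
    if approve:
        # Approving only costs the bank money if the applicant defaults.
        return p_bad * loss_if_default
    # Rejecting only costs the bank business if the applicant was a good risk.
    return (1.0 - p_bad) * lost_profit


def approve_loan(p_bad, loss_if_default=5.0, lost_profit=1.0):
    """Approve when approving has no higher expected loss than rejecting."""
    return (expected_loss(p_bad, True, loss_if_default, lost_profit)
            <= expected_loss(p_bad, False, loss_if_default, lost_profit))


if __name__ == "__main__":
    for p in (0.05, 0.10, 0.20, 0.50):
        print(p, approve_loan(p))
```

With these assumed costs the break-even point is p_bad = lost_profit / (lost_profit + loss_if_default) = 1/6, so the rule is stricter than a plain 50% cutoff because a default is taken to be the more expensive mistake.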
The German Credit Data contains 20 variables for each of 1000 loan applicants, along with a classification of each applicant as a Good or a Bad credit risk. A predictive model developed on this data is expected to give a bank manager guidance on whether to approve a loan to a prospective applicant based on his or her profile.
The following analytical approaches are taken:
  • Logistic regression: the response is binary (Good or Bad credit risk) and several predictors are available
  • Discriminant analysis
  • Tree-based methods and Random Forest
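As an illustration of the first approach, here is a minimal sketch of logistic regression with a binary response, fitted by plain gradient descent using only the Python standard library. It is not the course's analysis: the eight applicants, the two predictors (loan duration in years and a has-savings flag), and all function names are made up for this example and do not come from the German Credit Data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Estimated probability that an applicant with predictors x is a bad risk."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: [loan duration in years, has savings (0/1)]; label 1 = bad risk.
X = [[0.5, 1], [1.0, 1], [1.5, 1], [2.0, 0], [3.0, 0], [4.0, 0], [1.0, 0], [3.5, 1]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

w, b = fit_logistic(X, y)

if __name__ == "__main__":
    print("weights:", w, "intercept:", b)
    print("P(bad | 4-year loan, no savings):", predict_proba(w, b, [4.0, 0]))
    print("P(bad | 6-month loan, savings):  ", predict_proba(w, b, [0.5, 1]))
```

On real data one would normally use an established statistics package and hold out part of the 1000 applicants for validation; the hand-written loop here only shows what the model estimates.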
///////////////////////////////////////////////////////////////////////////////////////////////////////