Data Analysis

Name of Student



Predictive data analysis is a branch of statistical data analysis that uses a provided data set to algorithmically predict or forecast the outcome or behavior of individual variables. The integration of technology in scientific studies has led to the development and improvement of advanced tools that can easily handle these kinds of analysis (WN & Ripley, 2013) and (MJ, 2012). IBM SPSS Modeler version 18 is interactive software used to conduct such advanced analytics, among others. Its interactive interface guides the analyst in designing a model to analyze the data set provided. This paper applies techniques learnt in an IT management course to conduct the advanced analysis, provide a model for each data set, and interpret the results.

Universal Bank Analysis

k-NN (k Nearest Neighbors) is a non-parametric method that has been used for classification in statistical estimation and pattern recognition since the 1970s. It is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure such as a distance function.
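The idea of storing all cases and classifying a new case by its nearest neighbors can be sketched in a few lines of Python. This is only an illustration, not the SPSS Modeler implementation: the function name `knn_classify`, the Euclidean distance, and the two-feature records are assumptions made up for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs; distance is Euclidean,
    which is an assumption of this sketch.
    """
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled records (not taken from the Universal Bank data).
train = [
    ((0.10, 0.20), "No"),
    ((0.20, 0.10), "No"),
    ((0.15, 0.30), "No"),
    ((0.90, 0.80), "Yes"),
    ((0.80, 0.90), "Yes"),
]

prediction = knn_classify(train, (0.20, 0.20), k=3)
print(prediction)
```

Because distances depend on the scale of each predictor, real data would normally be normalized before a function like this is applied.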

Figure 1: k-NN Analysis of Universal Bank Dataset

Source: IBM SPSS Modeler version 18 Analysis, 2017

Figure 2: Summary of k Nearest Neighbors and Distances

Source: IBM SPSS Modeler version 18 Analysis, 2017

This analysis was done using the Universal Bank dataset, which was partitioned into 60% training and 40% validation. Using k = 10 to perform the classification with 11 out of 13 predictors, the prescribed customer's response to the last personal loan campaign would be classified as “No”. The choice is based on the focal records in Figure 2, where the 10 nearest neighbors had a similar response of “No” to accepting the offer of a personal loan.

K Value That Balances Overfitting and Predictor Information

The choice of k that balances overfitting against ignoring the predictor information can be determined by setting SPSS Modeler to predict a target field in the Objectives tab and customizing the analysis in the Settings tab to automatically select k in a range from 3 (minimum) to 10 (maximum). The resulting predictor space provides the choice of k that balances overfitting, as shown below in Figure 3. In this case, the chosen value is k = 4.

Figure 3: Choice of K Value Balancing Overfitting

Source: IBM SPSS Modeler version 18 Analysis, 2017
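One common way to pick k automatically within a range such as 3 to 10 is to compare the error on the validation partition for each candidate. The sketch below illustrates that idea under stated assumptions: the data are synthetic, the helper names (`validation_error`, `choose_k`) are hypothetical, and the selection criterion is not necessarily the one SPSS Modeler applies internally.

```python
import math
import random
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k nearest training records (Euclidean distance)."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def validation_error(train, valid, k):
    """Fraction of validation records that k-NN misclassifies."""
    wrong = sum(knn_classify(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

def choose_k(train, valid, k_min=3, k_max=10):
    """Return the k in [k_min, k_max] with the lowest validation error.

    Ties are broken in favor of the smaller k.
    """
    errors = {k: validation_error(train, valid, k) for k in range(k_min, k_max + 1)}
    best_k = min(errors, key=lambda k: (errors[k], k))
    return best_k, errors

# Synthetic records standing in for the bank data (an assumption of this sketch).
random.seed(0)
records = []
for _ in range(200):
    x = (random.random(), random.random())
    label = "Yes" if x[0] + x[1] + random.gauss(0, 0.15) > 1.2 else "No"
    records.append((x, label))

# 60% training, 40% validation, mirroring the partition described in the text.
split = len(records) * 60 // 100
train, valid = records[:split], records[split:]

best_k, errors = choose_k(train, valid)
print(best_k, errors[best_k])
```

A very small k tends to follow noise in the training data, while a very large k averages away the predictor information, which is why the error is compared across the whole range.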

Classification of Prescribed Customer Using the Best K

Following the previous model settings and adjusting the k value to 4 as the best choice, the k-NN classification output was as shown in Figure 4 below.

Figure 4: k-NN Classification

Source: IBM SPSS Modeler version 18 Analysis, 2017

Figure 5: Summary of k Nearest Neighbors and Distances

Source: IBM SPSS Modeler version 18 Analysis, 2017

Basing our classification on the k-NN output in Figures 4 and 5 above, most of the neighbors were in the “No” category for accepting personal loans. Therefore, we can conclude that the prescribed customer will also be classified under the “No” category.

eBay Auctions Analysis

Classification and prediction analysis of the eBay auction dataset was done using the CART Decision Tree in SPSS Modeler. Classification and Regression Tree (CART) analysis is a tool that partitions data into smaller clusters to improve the fit (Kuhn & Johnson, 2013) and (Shirali, 2016). The resulting output for the data set is shown in Figure 6 below.

Figure 6: CART Decision Tree Analysis

Source: IBM SPSS Modeler version 18 Analysis, 2017

The above classification shows that the most important predictors of the competitiveness of an auction were the initial price set by the seller (OpenPrice) and the price at which the item was sold (ClosePrice), while the rating of the seller by eBay (sellerRating) was the least important predictor in the model. In the above analysis, 52.518% of the auctions were competitive, as indicated by the first node, which represented 438 out of 834 auctions.

The model is expected to predict the outcome of any new auction transacted at any time, under the ceteris paribus assumption. This can be evaluated using the gain or lift chart analysis, as shown in Figure 7 below. The greater the area between the lift curve and the baseline, the better the model. Figure 8 also provides an evaluation of how records in each partition were classified by the model. The test indicated that over 82% of records were classified correctly in both the training and validation partitions.

Figure 7: Gain/Lift Chart Analysis

Source: IBM SPSS Modeler version 18 Analysis, 2017

Figure 8: Evaluation Test

Source: IBM SPSS Modeler version 18 Analysis, 2017
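The statement that a larger area between the curve and the baseline means a better model can be made concrete with a cumulative gains calculation. The following is a minimal sketch with made-up scores and outcomes; the function names are hypothetical and the numbers are not taken from the eBay model.

```python
def cumulative_gains(scores, actuals):
    """Cumulative share of positives captured when records are ranked by score.

    Returns a list starting at 0.0 with one entry per record added.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(actuals)
    gains, captured = [0.0], 0
    for i in order:
        captured += actuals[i]
        gains.append(captured / total_positives)
    return gains

def area_above_baseline(gains):
    """Area between the gains curve and the diagonal baseline (trapezoid rule)."""
    n = len(gains) - 1
    area_under_curve = sum((gains[j] + gains[j + 1]) / 2 for j in range(n)) / n
    return area_under_curve - 0.5  # the diagonal baseline encloses an area of 0.5

# Hypothetical model scores and true outcomes (1 = competitive auction).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
actuals = [1, 1, 0, 1, 0, 0]

gains = cumulative_gains(scores, actuals)
area = area_above_baseline(gains)
print(gains, round(area, 4))
```

A model that ranks records no better than chance would follow the diagonal and give an area close to zero.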

The information that the CART decision rules provide, some of it useful and some not, includes the ability of the decision tree to partition data into smaller clusters to improve the fit, which is indicated within the tree (Gordon, 2013). The decision tree can also be adjusted to the size desired by the analyst (Mangal, Nachiappan, Elangovan, & Sugumaran, 2016).

Adjusted CART Analysis

The fitted classification tree was built using only the three important predictors, Open Price, Close Price and Seller Rating, as per the previous analysis. Figure 9 below shows the output of the adjusted CART analysis.

A new auction is likely to be competitive if its Open Price is less than $1.805, based on the high probability value of 0.821 in node 1. Only 27.458% of the auction items sell below this price (less than $1.805), while the remaining 72.542% sell for more than $1.805 and those auctions are not competitive. If the item's price exceeds 1.810 on the second bid, the auction ends and closes at that price.

Figure 9: Adjusted CART Analysis

Source: IBM SPSS Modeler version 18 Analysis, 2017
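The node percentages quoted for the tree can be checked with simple arithmetic, and the quality of a split can be illustrated with the Gini impurity that CART-style trees commonly use. In the sketch below the root counts (438 of 834) come from the text; the child-node counts are assumptions back-calculated from the reported 27.458% and 0.821 figures, supposing the adjusted tree has the same 834-auction root, and were not read from the SPSS output.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * gini(child) for child in children)
    return gini(parent) - weighted

# Root node from the text: 438 competitive auctions out of 834.
root = (438, 834 - 438)
root_share = round(100 * root[0] / sum(root), 3)

# ASSUMED child counts (competitive, non-competitive) for the Open Price split,
# chosen only to be consistent with the percentages reported in the text.
low_open_price = (188, 41)
high_open_price = (250, 355)

node1_share = round(100 * sum(low_open_price) / sum(root), 3)
node1_probability = round(low_open_price[0] / sum(low_open_price), 3)
reduction = impurity_reduction(root, [low_open_price, high_open_price])

print(root_share, node1_share, node1_probability, round(reduction, 4))
```

A positive impurity reduction means the child nodes are, on average, purer than the parent, which is the criterion a tree uses to prefer one split over another.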

Lift Curve and Classification Matrix

Lift charts are a graphical representation of the advantages of using predictive models in decision making. They evaluate model performance using portions of the available data. A classification matrix, on the other hand, evaluates model performance using the whole population.

Figure 10: Lift Curve

Source: IBM SPSS Modeler version 18 Analysis, 2017

The lift curve shows that the model used to predict competitiveness is highly reliable, with a prediction probability of more than 0.5. The lift, or rise, of the model curve above the baseline also indicates the predictive power of the model.

Figure 11: Classification Matrices

A: Category by End Day Matrix

Source: IBM SPSS Modeler version 18 Analysis, 2017

B: Category by Currency Matrix

Source: IBM SPSS Modeler version 18 Analysis, 2017

The Chi-square test on the cross tabulation between the Category of the auctioned item and the day of the week on which the auction closed indicates that the two variables are related at the 5% level of significance (Figure 11: A). Similarly, the Chi-square test on the Category by Currency matrix indicates a relationship at the 5% level of significance.
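For readers unfamiliar with the test, the Chi-square statistic for a cross tabulation compares observed counts with the counts expected if the two variables were independent. The sketch below uses a hypothetical 2x2 table, not the Category matrices from Figure 11, and compares the statistic with the standard 5% critical value for one degree of freedom (3.841).

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: rows are two item categories, columns are two closing days.
table = [[30, 10],
         [15, 25]]

statistic, dof = chi_square(table)
CRITICAL_5_PERCENT_DOF_1 = 3.841  # standard table value for 1 degree of freedom

related = statistic > CRITICAL_5_PERCENT_DOF_1
print(round(statistic, 3), dof, related)
```

When the statistic exceeds the critical value, the hypothesis that the two variables are independent is rejected at the 5% level.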

In conclusion, the chance of an auction obtaining at least two bids is 0.413. The item has a chance of 0.525 of getting more than 9 bids; if there are fewer, the predicted value is 0.265. The bidding is expected to pick up from a new open price set by the seller, and the bids are predicted at 0.725 to be less than 10.345. The auction is predicted at 0.805 to close with more than 10 bids. If there is no close price, the seller will set a new open price for the auction, which is predicted to close with less than 8.725 bids.

Referring to the adjusted classification tree above, an item will be more competitive if the seller sets a high open price and a longer auction duration.

Analysis of the Pharmaceutical Industry

An Auto Cluster node was used to determine the best cluster model for the defined quantitative variables. Of the three clustering models, K-Means, Kohonen and TwoStep, the first was determined algorithmically by the modeler to be the best, with a silhouette measure of cohesion and separation of 0.475, which is considered fair cluster quality. Only 5 clusters were formed from the 9 quantitative variables used. The figure below provides a summary of the K-Means cluster model used.

Figure 12: K-Means Cluster Summary

Source: IBM SPSS Modeler version 18 Analysis, 2017
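The silhouette measure of cohesion and separation compares, for each record, the average distance to its own cluster with the average distance to the nearest other cluster. The sketch below computes it for four made-up points; the function name and data are assumptions for illustration, and the value it prints is unrelated to the 0.475 reported for the firms.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for a single-member cluster
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(                 # separation: mean distance to nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical points forming two well-separated clusters.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]

score = silhouette_score(points, labels)
print(round(score, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.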

Cluster Output Interpretation

The most important predictors according to the K-Means clusters were Market Cap, Beta and Leverage in all clusters across all the firms. The contribution of each quantitative variable in each cluster is shown in the appendix below. The figure below shows the general predictor importance for a quick view.

Figure 13: Predictor Importance

Source: IBM SPSS Modeler version 18 Analysis, 2017

There is an irregular pattern in the clustering depending on cluster sizes, which were determined by the quantitative variables. The largest was cluster 1, with a 57.1% share of the equity market, followed by cluster 5 with a 19% share. The chart below shows the cluster sizes and their percentage equity market shares.

Chart 1: Cluster Size and Their Equity Shares

Source: IBM SPSS Modeler version 18 Analysis, 2017
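Cluster sizes of this kind come from counting how many records K-Means assigns to each cluster. The sketch below shows the assignment and update steps on made-up two-dimensional points with fixed starting centers; it is a simplified illustration under those assumptions, not the SPSS Modeler procedure, and the firms' nine variables are not reproduced here.

```python
import math

def kmeans(points, centers, iterations=10):
    """Basic K-Means: alternate assignment and center-update steps."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# Hypothetical points and starting centers (assumptions of this sketch).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[(1.0, 1.0), (8.0, 8.0)])

# Cluster sizes expressed as a percentage of all records.
shares = {c: 100 * labels.count(c) / len(labels) for c in sorted(set(labels))}
print(labels, shares)
```

In practice the variables would be standardized first, since a variable with a large range such as market capitalization would otherwise dominate the distances.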

The results of the cluster analysis above will help the equity analyst determine the best firm out of the 21 in which to invest or buy shares, and recommend to clients what to look for before purchasing shares in a lucrative firm. For the above analysis, the best firm to invest in is one that has the largest market capitalization with a high leverage value and a high Beta during up markets.


The use of predictive analysis in different sectors may be old, but it is gradually gaining interest as more and more analysts adopt it for forecasting and predicting future outcomes of risk factors, consumer tiers, distribution networks, profitability of mergers and resource management (Valencia, 2017).


Gordon, L. (2013). Using Classification and Regression Trees (CART) in SAS Enterprise Miner. Lexington: SAS Global Forum.

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer.

Mangal, N., Nachiappan, M. R., Elangovan, M., & Sugumaran, V. (2016). Fault Diagnosis of a Single Point Cutting Tool using Statistical Features by Simple CART Classifier. Indian Journal of Science & Technology, 1-7.

Shirali, R. (2016). Classification Trees and Rule-Based Modeling Using the C5.0 Algorithm for Self-Image Across Sex and Race in St. Louis. Arts & Sciences Electronic Theses and Dissertations, 1-23.

Appendix

Figure 14: Cluster and Predictor Importance Summary

Source: IBM SPSS Modeler version 18 Analysis, 2017