
Predictive Analytics: Amlgo Labs Seven Step Process

Hi Friends, today we are going to walk you through a simple sequence of steps for solving a predictive modelling problem. This will help you understand how an analytics problem is solved, which step in the process matters most, whether these steps fit every analytics problem, and most importantly why we need to follow them.
So, before we jump to any of these questions and the seven-step process, let us look at the example below to understand "what is predictive analytics?". My loan application was rejected by the Bank because they predicted I may not repay the loan. In this example, the bank used some parameters to check the probability of me defaulting, and this probability must be below a certain threshold. So, how did the Bank predict it? The bank has an algorithm into which they feed some parameters (like salary, age, job type, experience, etc.) and it returns the probability of default (like 78%, 35%, etc.). The process of building this algorithm by analyzing past data is called predictive analytics. This is a simple way of understanding predictive analytics; refer to https://en.wikipedia.org/wiki/Predictive_analytics for a technical definition. Today we will discuss how to build this algorithm.

Once we have our problem statement defined, i.e. the predictive analytics use case, and we know what kind of prediction we want to make with our data, let's follow the steps below. We will take the same example of "predicting the probability of a loan applicant defaulting".

Step-1: Data Quality

Since the base of predictive modelling is historical data, one must have high-quality data to conclude anything from the algorithm. So, remember: if your data is not of good quality, you can never trust your results. Now the question is – what is good-quality data? You can run some basic checks – fill rate or missing values in your data (i.e. NULLs), bad groupings (e.g. month represented as months, mnth, month, m, M, etc.), outliers (https://en.wikipedia.org/wiki/Outlier), a significant number of records, etc.

[Image: Data Cleaning (source: pingcap.com)]

Data quality can be improved by techniques like missing value treatment, outlier treatment, oversampling, etc., and these techniques are collectively referred to as data cleaning.
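As a rough illustration, here is a minimal Python sketch of these checks and treatments, assuming a hypothetical pandas DataFrame of historical loan applications (the file name and column names such as "salary" and "month" are invented for the example):

```python
import pandas as pd

# Hypothetical historical loan data; file and column names are assumptions.
loans = pd.read_csv("loan_history.csv")

# Fill rate / missing values: share of NULLs per column
print(loans.isnull().mean().sort_values(ascending=False))

# Bad groupings: inconsistent labels such as "months", "mnth", "m", "M" show up here
print(loans["month"].value_counts())

# Missing value treatment: impute numeric salary with the median
loans["salary"] = loans["salary"].fillna(loans["salary"].median())

# Outlier treatment: cap salary at the 1st and 99th percentiles
low, high = loans["salary"].quantile([0.01, 0.99])
loans["salary"] = loans["salary"].clip(lower=low, upper=high)
```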

Step-2: Exploratory Data Analysis

Once you have good-quality data in place, the important part is to understand what the data is saying; prediction always comes second. The more you analyze your data, the better you can predict the probability. You can find any number of insights in your data that can help you and the customer decide on certain actions even before doing any modelling. We can do univariate and bivariate analysis, correlation, box plots, scatter plots, etc. in order to find trends in the data. If you are using Python or R, you can use libraries such as Seaborn, Matplotlib, or R Shiny dashboards to visualize the analysis.

[Image: Exploratory Data Analysis (source: Researchgate.net)]
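A minimal sketch of this kind of analysis in Python with Seaborn and Matplotlib, reusing the hypothetical `loans` DataFrame from the previous step and assuming a binary target column "default":

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate analysis: distribution of a single feature
sns.histplot(loans["salary"], bins=30)
plt.show()

# Bivariate analysis: box plot of salary against the target "default"
sns.boxplot(x="default", y="salary", data=loans)
plt.show()

# Correlation between numeric features
sns.heatmap(loans.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```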

Step-3: Data Enrichment

Now we have good-quality data along with the analysis (EDA), so we know what the distribution looks like, what the target variable (Y) is, and how it relates to the independent variables (X1, X2, ..., XN). The next step is to enrich the data. This can include data preprocessing, standardizing and scaling the data, deriving attributes, calculation-based columns, etc. You can find built-in libraries to transform and enrich the data in both R and Python. Also, we need to be very careful while transforming the data, because sometimes it is very difficult to recover the actual value from the transformed data. For some algorithms, such as SVM, the transformation is mandatory; for others it is good to have.
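For illustration, here is a small Python sketch of enrichment with scikit-learn's StandardScaler; the column names and the derived loan-to-income ratio are hypothetical, and the scaling shown is the kind of transformation that algorithms such as SVM require:

```python
from sklearn.preprocessing import StandardScaler

# Derived / calculation-based column (hypothetical columns "loan_amount" and "salary")
loans["loan_to_income"] = loans["loan_amount"] / loans["salary"]

# Standardize the numeric features; mandatory for algorithms such as SVM
numeric_cols = ["salary", "loan_amount", "age", "experience", "loan_to_income"]
scaler = StandardScaler()
loans[numeric_cols] = scaler.fit_transform(loans[numeric_cols])

# Keep the fitted scaler: it can invert the transformation if the
# original values are needed again later
original_values = scaler.inverse_transform(loans[numeric_cols])
```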

Step-4: Algorithm Selection and Model Development

We now have a complete dataset with our historical features that is ready to use. Here comes the next step: choosing the algorithm. There is no free lunch, i.e. one algorithm cannot be suitable for all analytical problems, so you need to decide very wisely which algorithm works best for the problem. Based on the problem statement – supervised or unsupervised, discrete or continuous target – we can select the most suitable algorithm, and we might need to iterate this step over multiple algorithms. So, split your dataset into training and test sets and get some idea of the algorithm's performance. Remember, up to this point we have not tuned the algorithm; we are running it as a blank algorithm with default settings. The next important step is hyperparameter tuning. Let's jump to that.
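Here is a minimal sketch of this step in Python with scikit-learn, using logistic regression purely as an illustrative baseline (untuned, default parameters) on the hypothetical `loans` data from the earlier steps:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Keep only numeric features for this simple baseline;
# "default" is assumed to be a 0/1 target column
X = loans.select_dtypes(include="number").drop(columns=["default"])
y = loans["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A "blank" algorithm: default parameters, no tuning yet
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Probability of default on the test set and a first performance estimate
probs = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, probs))
```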

Step-5: Optimize the Parameters 

This is an important part of the process: finding the most relevant parameters for your algorithm. As there is no free lunch, there is no fixed set of parameters that works for all problems, so we need to do hyperparameter tuning. This is nothing but finding the most relevant parameters for our given problem or dataset. You can use random search or grid search to get the best parameters.

[Image: Hyperparameter Tuning (source: scikit-learn.org)]
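As an illustration, a grid search over a small, hypothetical parameter grid for the baseline logistic regression from the previous step might look like this (RandomizedSearchCV works the same way with a sampled grid):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the right grid depends on the chosen algorithm
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
best_model = search.best_estimator_
```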

Step-6: Validation

This is an important step once you have an accuracy score on your test dataset. Check whether the performance is consistent across all the samples. You can do this using cross-validation (you can also use cross-validation in Step 5), because our goal is to get consistent results across all dataset samples. Validate the results using other metrics as well, like the confusion matrix, recall, false positive rate, etc., depending on the problem statement.

[Image: Data Validation]
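A minimal sketch of this validation step, reusing the tuned model and the train/test split from the previous steps (and still assuming a 0/1 "default" target):

```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, recall_score

# Consistency check: the score should be similar across all folds
scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring="roc_auc")
print("Fold AUCs:", scores, "mean:", scores.mean())

# Additional metrics on the held-out test set
preds = best_model.predict(X_test)
print(confusion_matrix(y_test, preds))
print("Recall:", recall_score(y_test, preds))
```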

Step-7: Presentation is the Key 

You have built a perfect model; its results are super cool. But unless you present it well to the key audience/customer, it is not usable. So, remember to present the results in a nice dashboard that tells a story. Know your audience and present the information according to their skills; don't make it too technical if you are presenting to end users.

This brings us to the end of the step-wise process. The idea here was to give you a sequence of steps in layman's terms so that everybody can relate to it, and no step is less important than the others – all of them matter. Following this process keeps you from deviating from the path until you get the best results. We manage end-to-end development of data solutions, from analysis to prediction to visualization and presentation. Feel free to reach us at info@amlgolabs.com.

About Amlgo Labs: Amlgo Labs is an advanced data analytics and decision sciences company based out of Gurgaon and Bangalore, India. We help our clients in different areas of data solutions, including design/development of end-to-end solutions (Cloud, Big Data, UI/UX, Data Engineering, Advanced Analytics and Data Sciences), with a focus on improving businesses and providing insights to make intelligent data-driven decisions across verticals. We have another vertical of business that we call Financial Regulatory Reporting for all major regulators in the world (MAS, APRA, HKMA, EBA, FED, RBI, etc.), and our team is specialized in commonly used regulatory tools across the globe (AxiomSL Controller View, OneSumX Development, Moody's Risk, IBM Open Pages, etc.). We build innovative concepts and then solutions to give an extra edge to business outcomes and help visualize and execute effective decision strategies. We are among the top 10 Data Analytics Start-ups in India, 2019 and 2020.

Please feel free to comment or share your views and thoughts. You can always reach out to us by sending an email to info@amlgolabs.com or filling out the contact form at the end of the page.

 

