
Predictive Analytics: Amlgo Labs' Seven-Step Process

Hi Friends, today we are going to walk you through a sequence of steps for solving a predictive modelling problem. This will help you understand how an analytics problem is solved, which step in the process is the most important, whether these steps fit every analytics problem, and, most importantly, why we need to follow them.
So, before we jump into any of these questions and the seven-step process, let us look at the example below to understand what predictive analytics is. My loan application was rejected by the bank because they predicted I may not repay the loan. In this example the bank used some parameters to estimate the probability of my defaulting, and for the loan to be approved this probability must be below a certain threshold. So, how did the bank predict it? The bank has an algorithm into which it feeds some parameters (like salary, age, job type, experience, etc.) and which returns the probability of default (like 78%, 35%, etc.). The process of building this algorithm by analyzing past data is called predictive analytics. This is a simple way of understanding predictive analytics; refer to https://en.wikipedia.org/wiki/Predictive_analytics for the technical definition. Today we will discuss how to build this algorithm.

Once we have our problem statement defined, i.e. the predictive analytics use case, and we know what kind of prediction we want to make with our data, let's follow the steps below. We will stay with the same example of predicting the probability of a loan applicant defaulting.

Step-1: Data Quality

Since the base of predictive modelling is historical data, you must have high-quality data to conclude anything from the algorithm. Remember: if your data is not of good quality, you can never trust your results. Now the question is: what is good-quality data? You can do some basic checks, such as the fill rate or missing values in your data (NULLs etc.), bad groupings (for example, month represented inconsistently as months, mnth, month, m, M, etc.), outliers (https://en.wikipedia.org/wiki/Outlier), and whether you have a significant number of records.
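As a rough illustration, a few of these checks could be scripted in Python with pandas. The file name and column names (salary, month) below are purely hypothetical placeholders for a loan-application dataset, not part of any real data.

```python
import pandas as pd

# Hypothetical loan-application dataset; file and column names are illustrative only.
df = pd.read_csv("loan_applications.csv")

# Fill rate / missing values per column
print("Fill rate per column:\n", df.notna().mean())

# Bad groupings: inconsistent labels in a categorical column
print("Distinct values in 'month':", df["month"].unique())

# Simple outlier check on salary using the IQR rule
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print("Potential salary outliers:", len(outliers))

# Do we have a significant number of records?
print("Total records:", len(df))
```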

Data Cleaning
Source: pingcap.com

Data quality can be improved by techniques like missing value treatment, outlier treatment, oversampling, etc., and these techniques are collectively referred to as data cleaning.
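A minimal sketch of these cleaning techniques, again assuming the same hypothetical loan dataset and columns (salary, job_type, default), might look like this:

```python
import pandas as pd

# Same hypothetical loan dataset as above.
df = pd.read_csv("loan_applications.csv")

# Missing value treatment: impute numeric columns with the median,
# categorical columns with the mode.
df["salary"] = df["salary"].fillna(df["salary"].median())
df["job_type"] = df["job_type"].fillna(df["job_type"].mode()[0])

# Outlier treatment: cap salary at the 1st and 99th percentiles.
low, high = df["salary"].quantile([0.01, 0.99])
df["salary"] = df["salary"].clip(lower=low, upper=high)

# Oversampling: balance the rare 'default' class by resampling with replacement.
defaults = df[df["default"] == 1]
non_defaults = df[df["default"] == 0]
oversampled = defaults.sample(len(non_defaults), replace=True, random_state=42)
df_balanced = pd.concat([non_defaults, oversampled])
```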

Step-2: Exploratory Data Analysis

Once you have good-quality data in place, the important part is to understand what the data is saying; prediction is always the second part. The better you analyze your data, the better you can predict the probability. You can find any number of insights in your data that can help you and the customer decide on certain actions even before doing any modelling. We can do univariate and bivariate analysis, correlation, box plots, scatter plots, etc. to find trends in the data. If you are using Python or R, you can visualize the analysis with libraries like Seaborn, Matplotlib, or R Shiny dashboards.
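For example, a few typical EDA plots with Seaborn and Matplotlib could look like the sketch below (again, the dataset and column names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("loan_applications.csv")  # hypothetical cleaned dataset

# Univariate analysis: distribution of salary
sns.histplot(df["salary"], kde=True)
plt.show()

# Bivariate analysis: salary vs. default status as a box plot
sns.boxplot(x="default", y="salary", data=df)
plt.show()

# Scatter plot of two independent variables, coloured by the target
sns.scatterplot(x="salary", y="experience", hue="default", data=df)
plt.show()

# Correlation between numeric features as a heatmap
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```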

Exploratory Data Analysis
Source: Researchgate.net

Step-3: Data Enrichment

Now we have good-quality data along with the analysis (EDA), so we know what the distributions look like for the target variable (Y) and how it relates to the independent variables (X1, X2, ..., XN). The next step is to enrich the data. This can mean data preprocessing, standardizing and scaling the data, deriving attributes, adding calculation-based columns, etc. Both R and Python have built-in libraries to transform and enrich the data. We also need to be very careful while transforming the data, because sometimes it is very difficult to recover the actual values from the transformed data. For some algorithms, such as SVM, this transformation is mandatory; for others it is merely good to have.
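Here is a minimal sketch of this kind of enrichment in Python with scikit-learn, under the same hypothetical column names (loan_amount, salary, job_type, etc.):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan_applications.csv")  # hypothetical cleaned dataset

# Derived / calculation-based attribute: loan amount relative to salary
df["loan_to_income"] = df["loan_amount"] / df["salary"]

# Turn a categorical column into numeric dummy columns
df = pd.get_dummies(df, columns=["job_type"], drop_first=True)

# Standardize the numeric features (important for algorithms such as SVM).
numeric_cols = ["salary", "age", "experience", "loan_to_income"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Keep the fitted scaler so the transformation can be reversed later
# (scaler.inverse_transform) when original values are needed for reporting.
```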

Step-4: Algorithm Selection and Model Development

We now have a complete dataset with our historical features that is ready to use. The next step is to choose the algorithm. There is no free lunch, i.e. one algorithm cannot be suitable for all analytical problems, so you need to decide very wisely which algorithm works best for the problem. Based on the problem statement (supervised or unsupervised, discrete or continuous target) we can select the most suitable algorithm, and we might need to iterate this step over multiple algorithms. Split your dataset into training and test sets and get some idea of each algorithm's performance. Remember, we have not tuned the algorithm yet; we are running it with its default settings. The next important step is hyperparameter tuning. Let's jump to that.
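A sketch of this step for our supervised, discrete-target loan-default example could be to split the data and try a couple of candidate classifiers with default settings. Everything below, including the file and column names, is an assumed illustration rather than a prescribed choice.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_applications.csv")  # hypothetical enriched dataset
X = df.drop(columns=["default"])
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Try a few candidate algorithms with their default (untuned) parameters.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```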

Step-5: Optimize the Parameters 

This is an important part of the process: getting the most relevant parameters for your algorithm. As there is no free lunch, there is no fixed set of parameters that works for all problems, so we need to do hyperparameter tuning. This is nothing but finding the most relevant parameters for our given problem or dataset. You can use random search or grid search to get the best parameters.
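As a sketch, a grid search with scikit-learn might look like this (X_train and y_train are assumed from the previous step's sketch, and the grid itself is arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small, illustrative grid; in practice the grid and scoring metric
# are chosen to match the problem.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
```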

Hyperparameter Tuning
Source: scikit-learn.org

Step-6: Validation

This is an important step once you have an accuracy score on your test dataset. Check whether the performance is consistent across all samples. You can do this using cross-validation (you can use cross-validation in Step 5 as well), because our goal is to get consistent results across all dataset samples. Validate the results using other metrics too, such as the confusion matrix, recall, and false positive rate, depending on the problem statement.
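For illustration, assuming the tuned model and the train/test split from the earlier sketches, this validation could be scripted roughly as follows:

```python
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_score

# Consistency check: cross-validated recall for the tuned model
# (search, X, y, X_test, y_test are assumed from the previous sketches).
best_model = search.best_estimator_
scores = cross_val_score(best_model, X, y, cv=5, scoring="recall")
print("Recall per fold:", scores, "mean:", scores.mean())

# Additional metrics on the held-out test set
y_pred = best_model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Recall:", recall_score(y_test, y_pred))
print("False positive rate:", fp / (fp + tn))
```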

Data Validation

Step-7: Presentation is the Key 

You have built a perfect model and its results are super cool. But unless you present it well to the key audience or customer, it is not usable. So remember to present the results in a nice dashboard that tells a story. Know your audience and present the information according to their skills; don't make it too technical if you are presenting to an end user.

This brings us to the end of the step-wise process. The idea here was to give you a sequence of steps in layman's terms so that everybody can relate to it. None of the steps is less important than the others; each one matters. Following this process makes sure you don't deviate from the path until you get the best results. We manage end-to-end development of data solutions, from analysis to prediction to visualization and presentation. Feel free to reach us at info@amlgolabs.com.

About Amlgo Labs: Amlgo Labs is an advanced data analytics and decision sciences company based out of Gurgaon and Bangalore, India. We help our clients in different areas of data solutions, including the design and development of end-to-end solutions (Cloud, Big Data, UI/UX, Data Engineering, Advanced Analytics and Data Sciences), with a focus on improving businesses and providing insights for intelligent data-driven decisions across verticals. We have another business vertical that we call Financial Regulatory Reporting, covering all major regulators in the world (MAS, APRA, HKMA, EBA, FED, RBI, etc.), and our team is specialized in the regulatory tools commonly used across the globe (AxiomSL Controller View, OneSumX Development, Moody's Risk, IBM Open Pages, etc.). We build innovative concepts and solutions to give an extra edge to business outcomes and help visualize and execute effective decision strategies. We are among the top 10 Data Analytics Start-ups in India for 2019 and 2020.

Please feel free to comment or share your views and thoughts. You can always reach out to us by sending an email to info@amlgolabs.com or filling out the contact form at the end of the page.

 
