Once the problem statement is defined, i.e. we know the predictive analytics use case and what kind of prediction we want to make from our data, we can follow the steps below. Let's continue with the same example: predicting the probability of a loan applicant defaulting.
Step-1: Data Quality
Since predictive modelling is built on historical data, you must have high-quality data before you can conclude anything from an algorithm. If your data is not of good quality, you can never trust your results. So what is good-quality data? You can run some basic checks: the fill rate, i.e. missing values (NULLs) in your data; bad groupings, e.g. the month field represented inconsistently as months, mnth, month, m, M, etc.; outliers (https://en.wikipedia.org/wiki/Outlier); and whether you have a significant number of records.
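As a quick illustration, these checks can be scripted in a few lines of pandas. This is only a sketch; the file and column names (loan_applications.csv, month, applicant_income) are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical loan-applications dataset; names are illustrative.
df = pd.read_csv("loan_applications.csv")

# Fill rate: share of non-null values in each column.
print(df.notna().mean().sort_values())

# Bad groupings: inconsistent labels for the same category.
print(df["month"].unique())  # e.g. ['Jan', 'jan', 'month', 'm', 'M']

# Outliers: flag values outside 1.5 * IQR on a numeric column.
q1, q3 = df["applicant_income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["applicant_income"] < q1 - 1.5 * iqr) | \
       (df["applicant_income"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} potential outliers out of {len(df)} records")
```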
*Image source: pingcap.com*
Data quality can be improved with techniques such as missing value treatment, outlier treatment, and oversampling; collectively, these techniques are referred to as data cleaning.
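Here is a minimal cleaning sketch continuing the same hypothetical dataset. It assumes a numeric applicant_income column and a binary default target, and uses SMOTE from the third-party imbalanced-learn package for oversampling (SMOTE expects numeric features):

```python
from imblearn.over_sampling import SMOTE

# Missing value treatment: impute numeric NaNs with the column median.
df["applicant_income"] = df["applicant_income"].fillna(
    df["applicant_income"].median())

# Outlier treatment: cap extreme values at the 1st and 99th percentiles.
low, high = df["applicant_income"].quantile([0.01, 0.99])
df["applicant_income"] = df["applicant_income"].clip(low, high)

# Oversampling: balance the rare "default" class (assumed target column).
X, y = df.drop(columns=["default"]), df["default"]
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```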
Step-2: Exploratory Data Analysis
Once you have good-quality data in place, the important part is to understand what the data is saying; prediction always comes second. The more you analyze your data, the better you can predict the probability. You can find any number of insights in your data that help you and the customer decide on actions even before modelling. We can use univariate and bivariate analysis, correlation, box plots, scatter plots, etc. to find trends in the data. If you are working in Python or R, libraries and tools such as Seaborn, Matplotlib, and R Shiny dashboards can help visualize the analysis.
*Image source: Researchgate.net*
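A short EDA sketch with Seaborn and Matplotlib, again assuming the hypothetical applicant_income and default columns from earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of a single numeric feature.
sns.histplot(df["applicant_income"], kde=True)
plt.show()

# Bivariate: how income differs between defaulters and non-defaulters.
sns.boxplot(x="default", y="applicant_income", data=df)
plt.show()

# Correlation between numeric features.
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()
```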
Step-3: Data Enrichment
We now have good-quality data along with the analysis (EDA), so we know what the distribution looks like for the target variable (Y) and how it relates to the independent variables (X1, X2...XN). The next step is to enrich the data: data preprocessing, standardizing and scaling, deriving attributes, calculation-based columns, etc. Both R and Python provide built-in libraries for transforming and enriching the data. Be careful while transforming the data, because it can be difficult to recover the actual value from the transformed data. For some algorithms, such as SVM, transformation is mandatory; for others it is merely good to have.
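A small enrichment sketch under the same assumptions (loan_amount and employment_type are illustrative columns). Note how keeping the fitted scaler lets you map transformed values back to the originals, which addresses the concern above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Derived attribute: loan amount relative to income.
df["loan_to_income"] = df["loan_amount"] / df["applicant_income"]

# Calculation-based column: one-hot encode a categorical field.
df = pd.get_dummies(df, columns=["employment_type"])

# Standardize numeric features (zero mean, unit variance); mandatory
# for scale-sensitive algorithms such as SVM.
num_cols = ["applicant_income", "loan_amount", "loan_to_income"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# The fitted scaler can recover the original values later.
original_values = scaler.inverse_transform(df[num_cols])
```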
Step-4: Algorithm Selection and Model Development
We now have a complete dataset of historical features that is ready to use. The next step is to choose the algorithm. There is no free lunch: no single algorithm suits every analytical problem, so you need to decide wisely which one works best. Based on the problem statement (supervised or unsupervised, discrete or continuous target), we can select the most suitable algorithm, and we may need to iterate this step over multiple algorithms. Split your dataset into training and test sets to get an idea of each algorithm's performance. Remember, at this point we have not tuned anything; we are running each algorithm with its default settings. The next important step is hyperparameter tuning, so let's jump to that.
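Continuing the sketch, here is one way to split the data and compare a couple of untuned scikit-learn classifiers. The default column is still the assumed target; the two algorithms are just examples:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = df.drop(columns=["default"]), df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Compare untuned (default-parameter) algorithms to get a baseline.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(type(model).__name__, round(auc, 3))
```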
Step-5: Optimize the Parameters
This is an important part of the process: finding the most relevant parameters for your algorithm. Since there is no free lunch, there is no fixed set of parameters that works for all problems, so we need to do hyperparameter tuning. This is nothing but finding the most relevant parameter values for our given problem or dataset. You can use random search or grid search to find the best parameters.
*Image source: scikit-learn.org*
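A grid search sketch with scikit-learn, reusing the training split from the previous step; the parameter grid is illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate parameter values to search over.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

# Grid search evaluates every combination with 5-fold cross-validation.
# RandomizedSearchCV samples a fixed number of combinations instead,
# which is cheaper when the grid is large.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```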
Step-6: Validation
This is an important step once you have an accuracy score on your test dataset: check whether the performance is consistent across all samples. You can do this using cross-validation (which you can also use in Step 5), because our goal is to get consistent results across all dataset samples. Also validate the results using other metrics, such as the confusion matrix, recall, and false positive rate, depending on the problem statement.
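Continuing the sketch, the consistency check and the extra metrics can be computed as follows (search is the fitted GridSearchCV object from Step 5):

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# Consistency check: the fold-to-fold spread should be small.
scores = cross_val_score(search.best_estimator_, X, y,
                         cv=5, scoring="roc_auc")
print(scores, f"mean={scores.mean():.3f}", f"std={scores.std():.3f}")

# Confusion matrix, recall, precision, etc. on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```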
Step-7: Presentation is the Key
You may have built a perfect model with great results, but unless you present it well to the key audience/customer, it is not usable. So remember to present the results in a clean dashboard that tells a story. Know your audience and present the information according to their skills; don't make it too technical if you are presenting to end users.
This brings us to the end of the step-wise process. The idea was to give you a sequence of steps in layman's terms so that everybody can relate to it. No step is less important than another; all of them matter. This process makes it clear that you should not deviate from the path until you get the best results. We manage end-to-end development of data solutions, from analysis to prediction to visualization and presentation. Feel free to reach us at info@amlgolabs.com
About Amlgo Labs: Amlgo Labs is an advanced data analytics and decision sciences company based out of Gurgaon and Bangalore, India. We help our clients in different areas of data solutions, including design/development of end-to-end solutions (Cloud, Big Data, UI/UX, Data Engineering, Advanced Analytics and Data Sciences), with a focus on improving businesses and providing insights to make intelligent data-driven decisions across verticals. We have another vertical of business that we call Financial Regulatory Reporting for all major regulators in the world (MAS, APRA, HKMA, EBA, FED, RBI, etc.), and our team specializes in commonly used regulatory tools across the globe (AxiomSL Controller View, OneSumX Development, Moody's Risk, IBM OpenPages, etc.). We build innovative concepts and solutions to give an extra edge to business outcomes and help visualize and execute effective decision strategies. We are among the top 10 Data Analytics Start-ups in India, 2019 and 2020.
Please feel free to comment or share your views and thoughts. You can always reach out to us by sending an email to info@amlgolabs.com or filling in the contact form at the end of the page.