A Beginner's Guide to Data Strategy

Amit Bhardwaj
3 min read · Aug 21, 2020

Ever wondered why there is so much hype around data analytics and how it is being leveraged to improve business? Alright, then you are already thinking about it.

You occasionally run into problems with your business or process, and you can see that people around you are solving the same problems using data that you also have. They are keeping up with trends and improving, while you are yet to start and don't know HOW? Congratulations! You have already crossed your first step by identifying the problem you are trying to solve.

Since necessity is the mother of invention, your problem is leading you towards creating your own Data Strategy Framework.

Let's get started, then.

I am listing four easy steps with examples to get you started with your data strategy framework. I have tried to make this framework as flexible as possible, so that it can be applied to many types of businesses, and robust enough to grow with your data maturity.

Knowing the problem

Tools required: paper, pen (strongly recommended)

Write down your objectives: what are you trying to solve with data? If there are multiple objectives, prioritize them.

For example, suppose you run a fashion store, want to open stores in other countries, and want to find the best locations so that your customer reach is maximized.

Data, the right type of data

Tools required: paper, pen, references, public data repositories

Yes, there is such a thing as wrong data, and choosing the right data is a crucial step in building a robust strategy. Look inside your business for data that can be used; look outside your business for data that can be leveraged and that you have access to. Concentrating on quality early on helps when the volume of data grows.

Continuing with the example above, before opening a store in any country you first need to understand what kind of customers you currently have (for example, by budget segment), and then evaluate candidate locations in that country accordingly. If you can also access demographics and trends data, it will add to your analysis.
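As a rough sketch of what this can look like in practice, here is how you might combine an internal customer export with a public demographics table using pandas. The city names, segments and numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical internal export: one row per customer with a budget segment.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "city": ["Berlin", "Berlin", "Munich", "Hamburg"],
    "budget_segment": ["premium", "mid", "premium", "budget"],
})

# Hypothetical public demographics table for candidate cities.
demographics = pd.DataFrame({
    "city": ["Berlin", "Munich", "Hamburg"],
    "population": [3_700_000, 1_500_000, 1_900_000],
    "median_income_eur": [43_000, 56_000, 48_000],
})

# Count existing premium customers per city and join with demographics
# to get a first, rough view of which locations look promising.
premium_counts = (
    customers[customers["budget_segment"] == "premium"]
    .groupby("city")
    .size()
    .rename("premium_customers")
)

candidate_view = demographics.merge(premium_counts, on="city", how="left").fillna(0)
print(candidate_view.sort_values("median_income_eur", ascending=False))
```

This is only a toy view, but it shows the idea: internal data tells you who your customers are, external data tells you where similar customers might be.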

Technology

Tools: Apache Kafka, Python

Now, put down the pen and paper, since it's time to convert all this data into insights and solve your problem. For the next step, I would like to introduce you to the Data Analysis Pipeline (automated, no manual work), which has the following three major tasks:

1. Extract data from the multiple data sources that matter to you. (Congratulations! Step 2 is now automated.)

2. Clean, transform and enrich this data to make it analysis-ready.

3. Load this data into a single source of truth, most often a data lake or data warehouse.

I will be using open source technologies to create the pipeline.
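Before bringing in Kafka, here is a toy sketch of those three tasks with open source tools (pandas plus SQLite). The file names, column names and table names are assumptions for illustration only:

```python
import sqlite3
import pandas as pd

# 1. Extract: read raw exports from the sources that matter to you.
orders = pd.read_csv("orders_export.csv")        # hypothetical internal export
demographics = pd.read_csv("demographics.csv")   # hypothetical public dataset

# 2. Clean, transform and enrich to make the data analysis-ready.
orders = orders.dropna(subset=["city"])
enriched = orders.merge(demographics, on="city", how="left")

# 3. Load into a single source of truth (here a local SQLite file,
#    standing in for a data warehouse or data lake).
with sqlite3.connect("warehouse.db") as conn:
    enriched.to_sql("orders_enriched", conn, if_exists="replace", index=False)
```

A script like this works for small volumes; the Kafka-based pipeline below is what you move to when the data keeps flowing in continuously.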

A Simple Data Pipeline with Open Source Apache Kafka

Using Apache Kafka and its features such as Kafka Connect and Kafka Streams, all three steps above can be achieved. The basic working of Apache Kafka can be understood from the diagram below.

[Diagram: Working of Apache Kafka]

Each data source produces events for Kafka (clicks, trades, logins for SaaS, or any manually defined event), and these events are stored as a log in a topic, which can be retained for as long or as short a period as you need. The pipeline collects the data (events, in Kafka terms), processes it using Kafka Streams, and can redirect the results to any RDBMS or other sink.
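As a minimal sketch, you can try out this event flow from Python with the kafka-python client (Kafka Streams itself is a Java library, so this only illustrates the produce/consume/enrich idea). The broker address, topic name and event fields below are assumptions:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed local Kafka broker
TOPIC = "store-events"      # hypothetical topic name

# Producer: publish a hypothetical click event to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"event": "click", "customer_id": 42, "city": "Berlin"})
producer.flush()

# Consumer: read events back, enrich them, and hand them to whatever
# sink you choose (RDBMS, data lake, ...). Here we just print them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    event["source"] = "web"   # simple enrichment step
    print(event)              # stand-in for loading into a sink
```

In a real deployment the consumption and transformation would run continuously (for example as a Kafka Streams or Kafka Connect job), but the shape of the flow is the same.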

Visualize

Tools: Tableau (free for one year with a student account), Python, good ol' HTML and CSS

Now you have the data pipeline sorted, everything is in place and deployed, and you are getting your beautiful data. But wait, what? All numbers and text? That's not beautiful, that's not what you had in mind. To visualize your cooked data you need a good visualization tool, or you can do the same with Python, which provides many packages such as matplotlib and seaborn, to name a few.
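For example, a quick seaborn bar chart of the hypothetical candidate-city numbers from earlier (the figures are made up for illustration) could look like this:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical output of the pipeline; replace with your own data.
candidate_view = pd.DataFrame({
    "city": ["Berlin", "Munich", "Hamburg"],
    "premium_customers": [120, 85, 40],
})

sns.barplot(data=candidate_view, x="city", y="premium_customers")
plt.title("Existing premium customers per candidate city")
plt.ylabel("Premium customers")
plt.tight_layout()
plt.show()
```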

When you visualize, then you materialize

All of the above steps together make a robust and flexible data strategy framework.

Once each module is understood, the latest cloud technologies (GCP, Azure and AWS) can be integrated later on for fault tolerance and scalability.

Thanks!
