Data Collection – Data Architecture – Part 1

Hello all! In this video series on the end-to-end machine learning lifecycle, we are going to touch upon data collection. I have divided data collection into two parts. In the first part, we are going to talk about data architecture: the various architectures and their relevance to the machine learning or artificial intelligence lifecycle. In the second part, we are going to talk about the data collection mechanism from multiple systems of record. So, let's get started.

Now, if you look at this architecture, it is the end-to-end machine learning lifecycle architecture. The data collection part comes right after you understand your business, identify your data needs, and define your high-level non-functional requirements. So this is the first technical step in your machine learning lifecycle. I am not going to talk much about the overall architecture in this series, but coming to the details of it, this is the typical lifecycle of your data.
The ultimate goal of any enterprise or organization is to be action driven. Actions can be either opinion driven or data driven; both have their merits and demerits, but in this series I'm going to focus on data-driven actions. To be action driven, I need insight into my business, and in a data-driven initiative those insights come from data. Data is the underlying knowledge on which our business operates, or the underlying knowledge on which our organization operates. When I separate business and organization, I see a business as a for-profit entity and an organization as a non-profit entity, so this is applicable to both. So data is our underlying knowledge, and insight is our understanding of that knowledge. We take data, generate insights to understand the underlying business, and finally use that understanding of the business to execute actions.

Now, what are these actions? An action can be anything that impacts your business: a marketing initiative to personalize offers, an initiative to increase your top line, an initiative to reduce your costs, or an initiative to detect fraud in your business environment. The action can be anything.
This looks simple, right? But the challenge is that in order to get to that action, to get to that insight, I have to collect data from disparate source systems.

First, there are the operational systems, typically databases or file systems, which are structured in nature but spread across the enterprise and across my lines of business; the data is not in one single place. That is the structured information. Second, there is a lot of unstructured data in the form of voice call records, PDF files, emails, and text; that is the second part of the challenge. The third part is external information: it can be the company's social media pages generating customer information, external data sources like data.gov from which you pull weather, market, or economic information, or partner sources like the credit bureaus that the finance industry uses, or any other partner relevant to the business. Apart from these three, there are high-velocity streaming sources. This can be a self-driving car or IoT; if you are in the manufacturing industry, the machinery has embedded intelligence that continuously generates data to monitor the health of the machinery. It can be a smart device, or the smartphones your customers are using; they may be using some of your apps as well.
So the data is spread across multiple source systems, and when you bring in an expert data scientist and tell them the data is all over the place, it is very difficult for them to generate any insights out of it. The data scientist's first question will be "How do I find my data?" The data is everywhere, and if you don't centralize it, they will spend the rest of their life just integrating it. Each source is in its own format: XML, PDF, JSON from the IoT devices, or mainframe data. With so many formats, data scientists end up spending more time on data integration than on insight generation. Then come the other questions: "How do I pull the data?", "How do I centralize it?", "How do I get this data into an insight-ready format, when my model ultimately understands only numerical input?" So how can we take these multiple data formats, collect them, and generate a model-ready file that my model can act on? This is a pretty big challenge in any enterprise today.
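To make the format problem concrete, here is a minimal sketch, assuming pandas; the feeds, their contents, and the column names are hypothetical, not from the video. It pulls three differently formatted sources into one uniform, model-ready table:

    # Minimal sketch, assuming pandas; feeds and columns are hypothetical.
    import io
    import pandas as pd

    # Stand-ins for three source-system feeds, each in its own format.
    csv_feed = io.StringIO("customer_id,amount\n1,100.0\n2,250.5\n")
    json_feed = io.StringIO('[{"customer_id": 3, "amount": 75.25}]')
    xml_feed = io.StringIO(
        "<rows><row><customer_id>4</customer_id>"
        "<amount>310.0</amount></row></rows>"
    )

    # Standardize: parse every format into the same columnar structure.
    frames = [
        pd.read_csv(csv_feed),
        pd.read_json(json_feed),
        pd.read_xml(xml_feed, parser="etree"),
    ]
    model_ready = pd.concat(frames, ignore_index=True)

    # One uniform numeric table that feature engineering and the model can act on.
    print(model_ready)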
So one of the key foundations of data architecture here is centralizing data assets within the enterprise. We take all the data and centralize it in a uniform format that a data scientist can consume. That's the first part.

Second, now that I have centralized my data, I also need to govern my data assets, because if you throw the data sets open, there may be a lot of customer-sensitive information that everybody gets access to. I need to govern and secure the data assets and give access only to those who require it, not to everyone. Third, I also need to catalog my data. Say I have 5000 source systems and I have centralized all my data in a single place: 5000 feeds. How will my data scientist know where the data resides and what the definitions of the data are? That is when I catalog my data: I create technical metadata and business metadata, and also add operational metadata to it. Finally, I need to organize the data, that is, create a common structure which is easily accessible and at least insight ready, where a data scientist can work and do feature engineering and other tasks. The ultimate goal of the entire data architecture phase is to democratize data for everyone who has access to it, in a secure and reliable format.
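As an illustration of cataloging, here is a minimal sketch of what one catalog entry might look like; the structure and every field name are my own hypothetical choices, not from the video:

    # Hypothetical catalog entry combining the three kinds of metadata
    # mentioned above for one centralized feed.
    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        # Technical metadata: where and how the data physically lives.
        dataset: str
        location: str
        file_format: str
        schema: dict = field(default_factory=dict)
        # Business metadata: what the data means to the business.
        description: str = ""
        owner: str = ""
        sensitivity: str = "internal"  # drives the governance/access rules
        # Operational metadata: how the feed behaves over time.
        refresh_cadence: str = "daily"
        last_loaded: str = ""

    entry = CatalogEntry(
        dataset="customer_transactions",
        location="s3://enterprise-lake/raw/transactions/",
        file_format="parquet",
        schema={"customer_id": "bigint", "amount": "double"},
        description="Card transactions from the core system of record",
        owner="payments-lob",
        sensitivity="customer-pii",  # restricted: govern and secure access
        last_loaded="2020-05-01",
    )
    print(entry)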
To achieve this, we talk about data architecture concepts that you might already know, and we are not going to go into the details of them: the data warehouses of the world and the data lakes. These two are not technologies; they are concepts! The data warehouse is a concept, and there are technologies that support it, like Teradata, Exadata, and Netezza. Similarly, the data lake is a concept: you can build a data lake on the Hadoop file system, on S3, on Google Cloud Storage, or on any other storage technology.

Now, the key differentiation between the data warehouse and the data lake is how the data is organized. Take the same source systems I mentioned earlier. With a data warehouse, you take the data, transform it, and then load it, so when you load data into the warehouse it is already consumption ready. The data is transformed; you don't keep track of the raw data at all, of how it looked in the system of record. You have only a version of the transformed data sets. You can create raw data sets in the EDW (enterprise data warehouse); there is nothing wrong with that, it is just very expensive, and the data warehouse does not have a good mechanism for storing and processing unstructured data. You can take unstructured data, transform it, extract metadata, and store that, but simply landing the raw data and doing analysis on it later is pretty difficult in a traditional enterprise data warehouse. So with an EDW you extract, transform, and load, and you end up with a consumption-ready data set.
Contrarily, in the data lake what we do is just extract and load. So what we call ETL in the data warehouse becomes ELT in the data lake: Extract, Load, Transform. We keep the data as close to the source system as possible, that is, in a pretty much raw format. We take all the data, collect it, and land it as raw as possible. The only thing we may additionally do is convert the data into a standardized format: if I have data in mainframe COBOL format, JSON, CSV, XML, and so on, I give it a structure, either a physical structure or a logical one. I am going to cover this in more detail in the next part of the series; for now, just understand that the data stays as raw as possible. Once you have this data, you can transform it per your downstream requirement, per your consumption requirement: a BI reporting requirement, a machine learning requirement, or an advanced analytics requirement.

One more aspect I forgot to mention when I talked about being action driven: your action may not involve machine learning at all. Your action can come from advanced analytics, or it can even be rules; you are either using the insights to derive rules or using the insights to build your models. That's the difference, right?
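To make the ETL versus ELT distinction concrete, here is a minimal sketch; the extract, transform, and load functions are hypothetical stand-ins, not a real library:

    # Hypothetical stand-ins for the three pipeline stages.
    def extract(source):
        return {"raw": f"records from {source}"}

    def transform(data):
        return {"clean": data["raw"].upper()}

    def load(data, target):
        print(f"loaded {data} into {target}")

    # Data warehouse, ETL: transform before loading, so only the
    # consumption-ready version is stored; the raw shape is lost.
    load(transform(extract("core-banking")), target="warehouse")

    # Data lake, ELT: load first, keep the raw data, transform later
    # per each downstream need (BI, ML, advanced analytics).
    raw = extract("core-banking")
    load(raw, target="lake/raw")                    # raw copy is preserved
    load(transform(raw), target="lake/insight-ready")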
Having said that, the key difference between the data warehouse and the data lake is that data warehouses are schema on write: you define the schema first, transform the data to fit that definition, and then populate it. It is very good for interactive reporting. There is no sense in which the data lake is superior to the data warehouse or the other way around; it all depends on your requirements, and each has its purpose. The data warehouse is good for low-latency interactive reporting; dashboards are pretty fast compared to a data lake. There are technologies supporting the data lake that can make it fast too, but that means adding more tools to make it work. The data warehouse is also pretty good at high-frequency updates: if you have a lot of updates coming in, the warehouse can handle them, whereas in a data lake you can do it, but you have to do a lot of work and batch your updates in such a way that they are reflected in the file system.
At the other end, the data lake is schema on read: you load the data as raw as possible, and you define the schema on that raw data later, based on the consuming application's need; you define a schema and then access the data through it. It is pretty good for heavy-lifting jobs like machine learning, advanced analytics, data mining, you name it. It is also good with unstructured data: you structure it only for insights, and the lake remains a good place to store historical unstructured data and to archive it for later use.
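Here is a minimal sketch of schema on read, assuming pandas; the raw feed and the chosen column types are hypothetical. The raw JSON stays untouched in the lake, and each consumer applies its own schema at read time:

    # Minimal sketch of schema on read, assuming pandas; feed is hypothetical.
    import io
    import pandas as pd

    raw_json = io.StringIO(
        '[{"customer_id": "7", "amount": "12.50", "note": "gift"}]'
    )

    # An ML consumer applies its own schema at read time: it casts to the
    # numeric types its model needs and keeps only the columns it cares about.
    ml_view = pd.read_json(raw_json).astype(
        {"customer_id": "int64", "amount": "float64"}
    )[["customer_id", "amount"]]
    print(ml_view.dtypes)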
So this is about data architecture. In the second part, we will talk about the multiple source systems: the source data may come in batch, in real time, or API based, and we will talk about how we can collect that data and then use it in a machine learning model. Please subscribe to the AI Engineering channel if you want to get updates. Thank you very much.
