Data Explosion and Importance of Analytics in Business
The world is experiencing unprecedented growth in the volume of data generated through business activities and social media. Since their inception, business houses have accumulated terabytes of data without a clear strategy for using it effectively and optimally. Now, these businesses expect to use that data to increase the precision of their decision making at all levels. Even small and medium-sized organizations are looking to inculcate a habit of “Data-Driven Decision Making”.
Businesses today have realized the significance of social media for knowing the sentiments of the general public and of customers, and for using those sentiments effectively in business decision making, especially in the face of growing cut-throat competition. Many social media platforms have become popular among the masses as tools for gathering opinions and thoughts on diverse subjects reflecting society's mindset.
Challenges before Industry
The days are gone when enterprises made decisions on gut feeling. They are striving to become data-driven enterprises that make decisions on facts and figures, though many are still pursuing that goal. A paradigm shift across the technology spectrum is needed to gain and sustain momentum in this arena. Industry poses many challenges, and to meet them, technology is going through rigorous phases of evolution. A holistic view of these challenges is essential to understand the need for, and the process of, evolving new solutions. A glance at the following list gives a sense of the gravity of the herculean task of overcoming them.
- Data Formats: Unlike in the past, when data was mostly text and figures, it is now also generated as images, videos, audio, and columnar or document data, in formats such as JSON, Binary, Parquet, Avro, and ORC.
- Data Storage: Unlike in the past, when data was preserved in flat file systems and RDBMSs, a variety of storage systems are needed today to handle big data size (Volume), high-speed reads and writes (Velocity), and data in many formats (Variety), in addition to basic needs like cost, simplicity, and security of data on storage.
- Data Sources: Unlike in the past, when data was generated mostly by business transactions, today it comes from mobile devices, IoT devices, logs, social media, B2B and B2C transactions, and so on. Ingestion tools must offer a wide variety of connectors to interact with such data sources.
- Impedance in Data Formats and Structures: Ingesting data from many sources multiplies the complexity of dealing with mismatches in data formats and structures. Different systems may use different schemas, views, indexes, and biases for the same type of data. Reconciling such heterogeneous data and establishing its veracity is a great challenge to address.
- Dirty Data: Unlike in the past, when data came from a handful of sources and cleaning and wrangling it was much simpler, accumulating data from a variety of sources in a variety of formats, with serious impedances, has aggravated this problem to a serious level of complexity. Incoming data typically contains noise, outliers, discrepancies, distortions, duplicates, missing values, and so on.
- Integration with Other Systems: Today's applications are complex, heterogeneous systems in the sense that a single system integrates with B2B, B2C, legacy, microservice, cloud, and hybrid applications. Such systems cannot be designed properly by sticking to old trends, tools, architectures, and practices; an altogether new shift in the center of gravity of the problems is needed on the architectural canvas to approach solutions today.
- Organizational Operations: Every organization struggles on many fronts, such as working in silos, high rates of attrition, encouraging human resources to build multiple skill sets, unfriendly environments, and complex or clumsy systems, making even moderate shifts in approach difficult.
- Data Security and Privacy: Cybercrime has become a serious pain point for the whole IT industry, which must protect data and its privacy from being destroyed, stolen, and misused. The threat grows with the number and skill of ill-intentioned actors in the world.
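The dirty-data challenge above can be illustrated with a minimal cleaning step in plain Python. This is only a sketch; the field names (`id`, `amount`) and the outlier bounds are hypothetical.

```python
# Minimal data-cleaning sketch: drop duplicates, missing values,
# and outliers from ingested records (fields are illustrative).

def clean(records):
    seen = set()
    cleaned = []
    for r in records:
        key = r.get("id")
        if key is None or key in seen:          # duplicate or missing id
            continue
        if r.get("amount") is None:             # missing value
            continue
        if not (0 <= r["amount"] <= 1_000_000): # outlier outside valid range
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

raw = [
    {"id": 1, "amount": 120.0},
    {"id": 1, "amount": 120.0},   # duplicate
    {"id": 2, "amount": None},    # missing value
    {"id": 3, "amount": -5.0},    # outlier
    {"id": 4, "amount": 999.0},
]
print(clean(raw))  # keeps only the records with ids 1 and 4
```

In a real pipeline this logic would run inside a distributed engine rather than a Python loop, but the shape of the work is the same.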
Evolution of Data Architecture for Lake House
It is now clear that evolution on one front alone cannot solve all these challenges; it has to happen on many fronts, introducing new storage systems, compute engines, tools, services, practices, architectures, approaches, and so on. The industry, in the course of this evolution, must look for a versatile system with the following features:
a) Storage that handles humongous volumes of data diligently and reliably, with high read/write speed, optimized for performance and cost.
b) Tools to integrate and ingest data from a variety of data sources seamlessly, capable of resolving data impedances easily and quickly.
c) Tools for high-speed data cleaning, munging, and wrangling, to make big data ready for analytics processing.
d) Systems that process big data in minimal time to create alerts, predictions, reports, and visuals in near real time, even on mobile devices.
e) Systems compliant with regional norms, with globalization and localization features built in.
f) Systems that are automated, fault tolerant, resilient, disaster safe, highly available, self-healing, and fully consistent, yet still optimized for performance and efficiency.
For the past few years, the IT industry has been evolving on many fronts, through many phases. Just a few of them are discussed here.
I) The Data Warehouse: It was introduced to overcome the inability of databases to handle complex business processing. It had large storage, with massively parallel processing units executing complex queries concurrently in a short span of time, and it could respond to many queries at peak hours with minimal latency. The product was typically well equipped with phenomenal data management and business intelligence capabilities. It served business needs diligently in its era, but later started failing consistently to meet newer needs: it could not represent data in different formats, offered the outside world little beyond a SQL interface, often did not support advanced analytics, and lacked a built-in ecosystem to meet new data processing needs.
II) The Data Lake: As industry demand changed over time, businesses started looking for bulk storage that could represent data in different formats, be accessed with high read/write speed, and be integrated with a variety of tools and applications. Data Lake storage offered a solution, meeting needs such as representing data in formats including images, audio, video, and so on. It also offered interfaces to connect to other, parallelly evolving systems such as big data processing, advanced analytics, and business intelligence tools (e.g., Azure Data Lake Store with Azure Data Lake Analytics, HDInsight, and so on).
Although it met most industry needs, the system soon fell short of services such as a built-in data management platform, built-in compute and analytics services, and support for more complex data and data processing scenarios. Once again, changing demands and trends triggered the search for another system.
In the meantime, industry tried different alternatives, each with its own trade-offs between pain points and benefits.
a) Apache Hive on HDFS/S3: Hive is a data warehouse system built on top of Hadoop. It facilitates analyzing large data volumes, handling queries on structured data such as that collected in logs. Hive runs on a highly scalable, fault-tolerant, distributed cluster of nodes through MapReduce. The mole on this beautiful face was its inability to process data and produce results in real time.
b) The Lambda Architecture: A typical approach in which batch and stream pipelines process records concurrently and in coordination, with the results then blended to offer a complete answer. Industry took notice because the architecture meets strict latency requirements for processing both old and freshly produced events. But its biggest disadvantage is the development and operational burden of maintaining two independent systems.
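The blending of the two Lambda pipelines can be sketched with toy counters in Python: a batch view over historical events, a speed view over fresh events, and a serving layer that merges them at query time. Event names and view structures here are purely illustrative.

```python
# Lambda-architecture sketch: batch view + speed view merged at query time.
from collections import Counter

historical_events = ["click", "view", "click"]  # handled by the batch layer
fresh_events = ["click", "view"]                # handled by the speed layer

batch_view = Counter(historical_events)  # recomputed periodically in batch
speed_view = Counter(fresh_events)       # updated continuously in streaming

def query(event_type):
    # Serving layer: blend both views into one complete answer.
    return batch_view[event_type] + speed_view[event_type]

print(query("click"))  # 3 (two historical clicks + one fresh click)
```

The disadvantage noted above is visible even in this toy: the same counting logic exists twice, once per pipeline, and both copies must be kept consistent.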
III) The Lake House: The industry's desperation finally came to an end with the introduction of the Lake House architecture. The Lake House is an open data management architecture that unifies data warehousing and advanced analytics. Different vendors quickly implemented the concept in their products, among them Databricks and Azure Databricks, Azure Synapse Analytics, and so on. The Lake House combines the best parts of the Data Warehouse and the Data Lake and addresses almost all of the concerns listed above.
The solution brings many features under one roof to manage batch and real-time data processing pipelines. The specification addresses the need for features such as:
- Phenomenal Data Management and Business Intelligence
- Features to cater needs of roles like
- Data Engineer for designing Data Processing Pipelines.
- Data Scientist for building AI and ML processing pipelines.
- Data Analyst for building BI and analysis pipelines.
- Data Administrator, with dashboards to manage administration.
- Connectors to connect to variety of Data Sources and Data Sinks.
- Support for a variety of data formats and humongous storage.
- Real-time processing of both historical data and stream data.
- Dramatically low latency to process data in real-time.
- Processing engines for batch and stream data pipelines.
In addition to all the above features, the specification mandates the following:
- Offers the Data Management and ACID Transaction guarantees of a Data Warehouse.
- Enables Business Intelligence, Machine Learning and Deep Learning on all types of data.
- Exists in a two-tier architecture.
- Support for third-party libraries such as TensorFlow, Keras, and PyTorch.
- Support of Audit History and Time Travel.
- A separate, detachable Metadata Layer providing Schema Enforcement, Table Versioning, and Data Validation.
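The last two points above, table versioning with time travel and schema enforcement, can be sketched in miniature in Python. This is a toy model, not the mechanism any specific Lake House product uses; the class and column names are illustrative.

```python
# Sketch of a metadata layer: schema enforcement on write, plus
# simple "time travel" via immutable table versions.
import copy

class VersionedTable:
    def __init__(self, schema):
        self.schema = schema    # column name -> expected Python type
        self.versions = [[]]    # version 0 is the empty table

    def append(self, rows):
        for row in rows:        # schema enforcement: reject bad writes
            for col, typ in self.schema.items():
                if not isinstance(row.get(col), typ):
                    raise TypeError(f"column {col!r} must be {typ.__name__}")
        new_version = copy.deepcopy(self.versions[-1]) + list(rows)
        self.versions.append(new_version)  # commit a new table version

    def read(self, version=None):
        # Time travel: read the latest version, or any past one by number.
        return self.versions[-1 if version is None else version]

t = VersionedTable({"id": int, "name": str})
t.append([{"id": 1, "name": "a"}])
t.append([{"id": 2, "name": "b"}])
print(len(t.read()))   # 2 rows at the latest version
print(len(t.read(1)))  # 1 row when time-traveling back to version 1
```

Real metadata layers add transaction logs, concurrency control, and audit history on top of this idea, but the core is the same: writes are validated against a schema and committed as new, queryable versions.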
The Lake House brings hope of easily designing a combined batch and stream processing pipeline, which is otherwise difficult to implement as a Lambda or Kappa architecture. Our next module will discuss implementations of the Lake House in Azure Databricks and Azure Synapse Analytics.
Blog by Mr. Chandrashekhar Deshpande (MCT)
Technology expertise: Analytics, Machine Learning and Artificial Intelligence, the Hadoop spectrum and Azure HDInsight, Apache Spark, Azure Databricks, Azure ML Services, Azure Cognitive Services, Java, Python, Scala, GoF, JEE, and frameworks.