Exploring Data Science with Microsoft Tools and Frameworks


1. Data Science and its growing importance

Data science is an interdisciplinary field concerned with the processes and systems used to extract knowledge or insights from large amounts of data. It draws on theories and techniques from other fields such as information science, mathematics, statistics, chemometrics, and computer science.

Over the last decade there has been a massive explosion in the data generated and retained, both by companies and by individuals like you and me. Ninety percent of the data in the world today has been created in the last two years alone, and our current output of data is roughly 2.5 quintillion bytes a day (Infographic 2017). An entirely different ecosystem is emerging to process and analyze data at this scale. Big data platforms are the outcome: they rely on parallel processing to handle such huge volumes of data in less time.

Data science is not restricted to big data, since big data solutions focus more on organizing and pre-processing data than on analyzing it.

A few of the analysis methods at the core of data science are probability models, machine learning, signal processing, data mining, statistical learning, databases, data engineering, visualization, pattern recognition and learning, uncertainty modeling, and computer programming, among others. Each of them is gaining importance at the enterprise level.

2. How can data science add value to the business?

[Image: data science]

3. A few trending Data Science platforms

The rapidly growing importance of the subject in almost every business has led to a wide spectrum of competing tools in the market. Cloud and data platform providers such as Azure, AWS, Google, and Teradata are leading the bandwagon and providing highly user-friendly services.

Microsoft Azure provides a broad range of products and tools to facilitate end-to-end, unified development of analytical solutions. I am limiting this blog to the range of solutions in Azure and their significance on the canvas of analytical technologies.

4. Data Science support in Azure

To meet this need, many elegant, specialized tools and solutions for data science and machine learning workloads have been introduced in the form of libraries, frameworks, and language APIs, for both development and production-level deployment.

(A) Analytical Language interfaces and tools:

The prominent languages that data scientists and analysts use are Python and R (Java and Scala are also used). In an interactive environment, data scientists run this code to query and explore the data, generating visualizations and statistics that help determine the relationships within it. The commonly used tools include…

  1. Jupyter Notebook, along with ‘Azure Jupyter Notebook’, an online Jupyter service that lets data scientists create, run, and share Jupyter notebooks in cloud-based libraries.
  2. Spyder: A Python IDE included in the Anaconda Python distribution.
  3. RStudio: An IDE for the R programming language.
  4. Visual Studio Code: A lightweight, cross-platform coding environment that supports Python as well as commonly used frameworks for machine learning and AI development.
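To make the idea of interactive exploration concrete, here is a minimal sketch of the kind of code a data scientist might type into a notebook cell: summary statistics for two columns and a correlation between them. The data values and the helper names `summarize` and `pearson` are my own illustrative assumptions, and only the Python standard library is used so that it runs anywhere.

```python
import statistics

# Hypothetical sample data (illustrative numbers only, not real measurements).
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 41.0, 62.0, 79.0, 103.0]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


print(summarize(ad_spend))
print(summarize(sales))
print(round(pearson(ad_spend, sales), 3))
```

In a real notebook you would more likely load the data into a data frame and plot it, but the loop is the same: run a cell, look at the numbers, refine the question.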

(B) Data Science Virtual Machine (DSVM):

The DSVM is an Azure virtual machine image that includes the tools and frameworks commonly used by data scientists, including R, Python, Jupyter Notebooks, Visual Studio Code, and libraries for machine learning modeling such as the Microsoft Cognitive Toolkit, PySpark, and Matplotlib. It gives you a ready-to-use environment without the complexities of installation or of managing version inter-dependencies among the libraries and tools for analytics, data science, machine learning, deep learning, cognitive services, and neural networks. Just a few of the advantages are…

  1. The latest versions of all commonly used tools and frameworks are included.
  2. Virtual machine options include highly scalable images with GPU capabilities for intensive data modeling.

(C) Azure Machine Learning Services:

Azure Machine Learning Services is a cloud-based service for managing machine learning experiments and models. It includes an experimentation service that tracks data preparation and model training scripts, maintaining a history of all executions so you can compare model performance across iterations and choose the best-performing model. The Azure Machine Learning Model Management service then enables you to track and manage model deployments in the cloud, on edge devices, or across the enterprise.
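The idea of keeping a run history and comparing iterations can be sketched in a few lines. This is not the Azure Machine Learning API; `Run`, `RunHistory`, `log`, and `best` are hypothetical names I made up for illustration, and the accuracy figures are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One execution of a training script: its settings and its results."""
    run_id: int
    params: Dict[str, float]
    metrics: Dict[str, float]


@dataclass
class RunHistory:
    """In-memory history of runs (a real service would persist this)."""
    runs: List[Run] = field(default_factory=list)

    def log(self, params: Dict[str, float], metrics: Dict[str, float]) -> Run:
        run = Run(run_id=len(self.runs) + 1, params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded the metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Invented example values: three iterations with different learning rates.
history = RunHistory()
history.log({"learning_rate": 0.1}, {"accuracy": 0.81})
history.log({"learning_rate": 0.01}, {"accuracy": 0.88})
history.log({"learning_rate": 0.001}, {"accuracy": 0.85})

best = history.best("accuracy")
print(best.run_id, best.params)
```

The point of the design is that every execution is recorded together with its parameters, so choosing a model becomes a query over the history rather than a matter of memory.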

  1. Azure Machine Learning Workbench: A cross-platform client tool that provides a central interface for script management and history, while still enabling data scientists to create scripts in their tool of choice, such as Jupyter Notebooks or Visual Studio Code. The Workbench follows the discipline of the Team Data Science Process and covers the life cycle from data transformation up to deploying the best-performing analytical or machine learning model to production. It offers several script execution environments: you can run model training scripts locally, in a scalable Docker container, or in Spark. When you are ready to deploy your model, use the Workbench environment to package the model and deploy it as a web service to a Docker container, Spark on Azure HDInsight, Microsoft Machine Learning Server, or SQL Server.
  2. Azure Machine Learning Studio: A cloud-based, visual development environment for creating data experiments, training machine learning models, and publishing them as web services in Azure. Its visual drag-and-drop interface lets data scientists and power users create machine learning solutions quickly. It comes with a wide range of established statistical algorithms and techniques for machine learning modeling tasks, plus built-in support for Jupyter Notebooks. Trained models can be deployed directly as Azure web services. It is a boon for data scientists who want a quick solution without getting involved in a cycle of code development.
  3. Azure Batch AI: It enables you to run your machine learning experiments in parallel and perform model training at scale across a cluster of virtual machines with GPUs. Batch AI lets you scale out deep learning jobs across clustered GPUs, using frameworks such as Cognitive Toolkit, Caffe, Chainer, and TensorFlow. Azure Machine Learning Model Management can then be used to deploy, manage, and monitor the models trained with Batch AI.
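Running experiments in parallel, as described for Batch AI above, boils down to launching the same training routine with different settings and collecting the results. The sketch below illustrates that pattern on a single machine. It does not use Batch AI: a thread pool stands in for the cluster, the `train` function is a toy gradient descent on made-up data, and all names are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data following y = 2x (made up for illustration).
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.0, 4.0, 6.0, 8.0]


def train(learning_rate, steps=50):
    """Fit y = w * x by gradient descent and report the final mean squared error."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(XS, YS)) / len(XS)
        w -= learning_rate * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(XS, YS)) / len(XS)
    return {"learning_rate": learning_rate, "weight": w, "loss": loss}


learning_rates = [0.001, 0.01, 0.05]

# The pool stands in for a cluster: each experiment is submitted independently.
# Python threads do not speed up CPU-bound work; a real setup would use
# separate processes or machines.
with ThreadPoolExecutor(max_workers=len(learning_rates)) as pool:
    results = list(pool.map(train, learning_rates))

best = min(results, key=lambda r: r["loss"])
print(best["learning_rate"], best["weight"])
```

Because each experiment is independent of the others, the same structure scales from a thread pool to a cluster without changing the training function itself.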

(D) Tools for deploying Machine Learning Models:

After a data scientist has created a machine learning model, it typically needs to be deployed and consumed from applications or other data flows. There are numerous potential deployment targets for machine learning models.

  1. Apache Spark on HDInsight: Apache Spark is a distributed platform that offers high scalability for high-volume machine learning processes. It supports both batch and real-time processing in a distributed manner. It comes with a range of analytical and ML libraries, including Spark MLlib, a framework and library for machine learning models. In addition, the Microsoft Machine Learning library for Spark (MMLSpark) provides deep learning algorithm support for predictive models in Spark. You can deploy models directly to Spark on HDInsight from Azure Machine Learning Workbench and manage them using the Azure Machine Learning Model Management service. The HDInsight instance of Spark can consume data from a variety of sources such as Hadoop HBase, Hive, Azure Storage, Azure Data Lake, Azure Event Hubs and, last but not least, Apache Kafka.
  2. Web Services in Containers: Containers are a lightweight and generally cost-effective way to package and deploy services. Machine learning models can be deployed on a variety of platforms other than Azure Model Management. You can deploy a model as a Python web service in a Docker container, or to an edge device, where it can be used locally with the data on which it operates. The ability to deploy to an edge device lets you move your predictive logic closer to the data.
  3. Microsoft R Server/Microsoft Machine Learning Server: A highly scalable platform for R and Python code, specifically designed for machine learning scenarios. Models designed in Azure Machine Learning Workbench can also be deployed to these servers. Server instances can be created on-premises, which makes this a good option when business or company policies must be followed.
  4. Microsoft SQL Server: It supports R and Python natively, enabling you to encapsulate machine learning models built in these languages as Transact-SQL functions in a database. With the predictive logic held in a database function, it is easy to include in data-tier logic.
  5. Azure Machine Learning Web Services: A machine learning model created using Azure Machine Learning Studio can be deployed as a web service and consumed through a REST interface from any client application capable of communicating over HTTP. There is also built-in support for calling Azure Machine Learning web services from Azure Data Lake Analytics, Azure Data Factory, and Azure Stream Analytics.
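Several of the targets above come down to the same pattern: wrap the model in a web service and let clients score data over HTTP with a REST-style call. Here is a minimal, self-contained sketch of that pattern using only the Python standard library. The "model" is a hard-coded linear formula, and the `/score` path, the JSON shape, and the names `predict`, `ScoringHandler`, and `score_remote` are all my assumptions for illustration, not the interface of any Azure service.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained model": fixed coefficients of a linear model.
WEIGHTS = [0.5, -1.0]
BIAS = 2.0


def predict(features):
    """Score one record with the linear model above."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


class ScoringHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON like {"features": [..]} and returns a prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body, status = {"prediction": predict(payload["features"])}, 200
        except (ValueError, KeyError, TypeError) as exc:
            body, status = {"error": str(exc)}, 400
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def score_remote(url, features):
    """Client side: POST the features as JSON and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"features": features}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler bypasses any system proxy for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), ScoringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = score_remote(f"http://127.0.0.1:{server.server_port}/score", [4.0, 1.0])
print(result)

server.shutdown()
server.server_close()
```

Packaging a service like this in a Docker image is what makes it portable between the cloud and an edge device; the client code stays the same wherever the service runs.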

(E) Visualization services:

Microsoft’s Power BI content pack for Microsoft Azure Enterprise users provides solutions on par with Tableau (a BI and data visualization tool) or Spotfire (an enterprise-grade analytical platform). Power BI is a suite of business analytics tools that lets you explore data, deliver insights, and create visually compelling reports. It can connect to hundreds of data sources, simplify data prep, and drive ad hoc analysis. You can produce beautiful reports and publish them for your organization to consume on the web and across mobile devices. Everyone can create personalized dashboards with a unique, 360-degree view of their business, and the solution scales across the enterprise with governance and security built in.

5. Conclusion

Analytics and ML are among the top trends today and will certainly remain so in the coming years. Microsoft is striving to provide efficient, highly scalable, reliable end-to-end solutions for the complete data science cycle: procuring, cleansing, wrangling, and transforming data; applying different kinds of analytics and machine learning to it; publishing the model to production; and visualizing static or real-time analytical reports. Their solutions are enriched with the latest trends such as deep learning, neural analytics, and cognitive services for predictive and prescriptive analytics. These solutions are not only cost-effective but are also available as PaaS and SaaS services on the Azure cloud, which is an additional advantage. I will take the opportunity to discuss Spark with Kafka and Azure Workbench in more detail in my upcoming blogs.