In my last post, I discussed the concept of a data lake as a means of collecting data organized by a data-driven design pattern. This pattern can capture a wide variety of data at large scale. As an approach to organizing, cataloguing and retrieving data on top of a technology platform, the data lake has emerged as part of a larger Big Data and advanced analytics platform. This post looks at one specific technology platform that enables the pattern: Microsoft’s Azure Data Lake services.
Like other data lake solutions, Azure Data Lake leverages the scale and convenience of the cloud and runs in a platform-as-a-service model. By combining two services (Azure Data Lake Store for data storage and Azure Data Lake Analytics for data analytics), this cloud-based platform delivers a highly elastic and available solution with low start-up costs, easy configuration and no hardware purchase required.
What is Data Lake Store?
The Data Lake Store is a cost-effective repository for unstructured data that is built for Hadoop and provides petabytes of storage. You can store your data in its native format (even individual files can be petabytes in size).
The separation of storage and compute is an important advantage of the Azure solution. Many other Hadoop-based data lake solutions combine the two: each node in the cluster both stores data and is responsible for managing and retrieving that data.
With the Azure solution, storage is managed (and priced) separately from compute. There are three advantages to this approach:
1. It provides a reduction in cost. You can store more data, even before you are sure how you will use it (but don’t forget about governance to avoid your data lake store becoming a data swamp).
2. You can use multiple tools to access your data in the data lake depending on your requirements.
3. You can persist your data in the data lake and scale your computing power up and down (or even turn it off) depending on the type and frequency of your analysis.
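To make the cost argument above concrete, here is a minimal sketch in Python that compares a decoupled billing model with a coupled one. Every rate, the node capacity, and the function names are hypothetical placeholders chosen for illustration; they are not Azure prices or any Azure API. The coupled model assumes nodes hold the data on local disk, so the cluster must be sized for the data and stay up all month.

```python
import math
from dataclasses import dataclass

# All figures below are made-up placeholders, NOT real Azure prices.
STORAGE_RATE_PER_TB_MONTH = 40.0    # hypothetical storage rate
COMPUTE_RATE_PER_NODE_HOUR = 2.0    # hypothetical compute rate
HOURS_PER_MONTH = 730
TB_PER_COUPLED_NODE = 10            # hypothetical local disk capacity per node


@dataclass
class Workload:
    stored_tb: float       # data kept in the lake
    compute_nodes: int     # nodes needed while analysis is running
    compute_hours: float   # hours per month the analysis actually runs


def decoupled_monthly_cost(w: Workload) -> float:
    """Storage is billed on its own; compute only for the hours it runs."""
    storage = w.stored_tb * STORAGE_RATE_PER_TB_MONTH
    compute = w.compute_nodes * w.compute_hours * COMPUTE_RATE_PER_NODE_HOUR
    return storage + compute


def coupled_monthly_cost(w: Workload) -> float:
    """Nodes hold the data, so enough of them must stay up all month."""
    nodes_for_storage = math.ceil(w.stored_tb / TB_PER_COUPLED_NODE)
    nodes = max(nodes_for_storage, w.compute_nodes)
    return nodes * HOURS_PER_MONTH * COMPUTE_RATE_PER_NODE_HOUR


if __name__ == "__main__":
    occasional = Workload(stored_tb=100, compute_nodes=10, compute_hours=20)
    print("decoupled:", decoupled_monthly_cost(occasional))
    print("coupled:  ", coupled_monthly_cost(occasional))
```

Under these assumed numbers, the gap comes entirely from paying for compute only while the analysis runs; setting compute_hours to zero leaves just the storage charge, which is the "turn it off" case in point 3.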
What is Data Lake Analytics?
Data Lake Analytics is a built-in tool for accessing and querying the data lake with the scalability and performance that the cloud delivers. It uses U-SQL, a query language that allows for distributed compute. For the Hadoop purists, U-SQL is another YARN-based HDFS query language like Hive or Pig. For those new to Hadoop, it is like SQL and is accessible to users who are familiar with SQL-based databases. It also allows developers to extend U-SQL functionality using C#, which enables you to write your own U-SQL procedures and functions.
It is important to note that Data Lake Analytics is not the only way to query data from the Data Lake Store. You can also use other tools such as HDInsight, which we’ll cover in a subsequent blog post.
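For readers who have not seen this style of query before, the following is a rough Python analogue of the extract, aggregate and output flow that a SQL-like query over raw files follows. It is not U-SQL and does not call any Azure service; the sample log, its three columns and the function names are all hypothetical. The point it illustrates is that the data stays in its native format and a schema is applied only when the data is read.

```python
import csv
import io
from collections import Counter

# Hypothetical raw tab-separated log: user id, region, search query.
RAW_LOG = "1\ten-ca\tweather\n2\ten-us\tnews\n3\ten-ca\thockey\n"


def extract(raw_text):
    """Apply a schema while reading the raw text (schema-on-read)."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for user_id, region, query in reader:
        yield {"user_id": int(user_id), "region": region, "query": query}


def count_by_region(rows):
    """Rough equivalent of SELECT region, COUNT(*) ... GROUP BY region."""
    return Counter(row["region"] for row in rows)


def to_csv(counts):
    """Write the aggregated result as CSV text, sorted by region."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", "total"])
    for region, total in sorted(counts.items()):
        writer.writerow([region, total])
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(count_by_region(extract(RAW_LOG))))
```

In a distributed engine the same three stages would be spread across many workers; this single-process version only shows the shape of the query.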
Combatting the Issue of Security
Azure Data Lake offers simplified security and integrates with Azure Active Directory to allow for single sign-on. For those who have worked with traditional Hadoop systems, configuring security is a major concern, particularly if you need to integrate it with your directory services.
With Azure Data Lake, the entire platform is managed and supported by Microsoft and backed by an enterprise-grade SLA. The data lake is tied into Azure Active Directory for identity management and access control, and it can be encrypted and audited when necessary.
Are you interested in learning more about Azure Data Lake Services? Join me on June 7th for a Big Data discussion with Microsoft Canada experts. We’ll be talking about big data and how it is impacting businesses today. Check out the registration details here.