How to Build a Data Lake in Azure

This document captures the considerations and best practices that we have learned from working with our customers. At the folder level, you can set fine-grained access controls using ACLs. This would be enriched sales data, ensuring that the sales data is schematized, enriched with other product or inventory information, and separated into multiple datasets for the different business units inside Contoso. Storage accounts, containers. The following table shows the main Azure services you can use to build your data lake architecture. As our enterprise customers serve the needs of multiple organizations, including analytics use cases, on a central data lake, their data and transactions tend to increase dramatically. In addition, they can use the same sales data and social media trends in the data lake to build intelligent machine learning models for personalized recommendations on their website. When deciding the structure of your data, consider both the semantics of the data itself and the consumers who access the data to identify the right data organization strategy for you. When processing real-time data, you can use a real-time streaming engine (such as Azure Stream Analytics or Spark Streaming) in conjunction with a message broker (such as Event Hubs or Apache Kafka) to store your data as larger files. You can find more information about access control here. A folder also has access control lists (ACLs) associated with it. There are two types of ACLs associated with a folder, access ACLs and default ACLs; you can read more about them here. High-quality sales data (that is, data in the enriched data zone correlated with other demand forecasting signals such as social media trending patterns) for a business unit, used for predictive analytics to determine the sales projections for the next fiscal year. A common question that comes up is when to use a data warehouse versus a data lake.
The SPNs/MSIs for ADF, as well as the users and the service engineering team, can be added to the LogsWriter group. Object/file: A file is an entity that holds data that can be read/written. Given this is customer data, there are sovereignty requirements that need to be met, so the data cannot leave the region. This data has structure and can be served to the consumers either as is (E.g. This document assumes that you have an account in Azure. A data lake is a store for all types of data from various sources. For the purposes of this document, we will focus on the Modern Data Warehouse pattern used widely by our large-scale enterprise customers on Azure, including our solutions such as Azure Synapse Analytics. Consider the access control model you would want to follow when deciding your folder structures. Data may arrive in your data lake account in a variety of formats (human-readable formats such as JSON, CSV or XML files, and compressed binary formats such as .tar.gz) and in a variety of sizes: huge files (a few TBs), such as an export of a SQL table from your on-premises systems, or a large number of tiny files (a few KBs), such as real-time events from your IoT solution. Data lakes on Azure are based on Azure's core infrastructure, including Blob Storage, Azure Data Factory, and Hadoop YARN. In many cases, if your raw data (from various sources) itself is not large, you have the following options to ensure the data set your analytics engines operate on is still optimized with large file sizes. Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters. Folder structure mirrors organization, e.g. An enterprise data lake is designed to be a central repository of unstructured, semi-structured and structured data used in your big data platform.
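To make the point about tiny files concrete, here is a minimal sketch of how many small files could be grouped into batches that are later rewritten as larger files. It is not an ADLS feature or API: the function name plan_compaction and the 256 MiB target are assumptions chosen only for illustration, and the actual rewrite step (reading and merging the files) is left out.

```python
from typing import List

# Assumed target size for each compacted output file (256 MiB); tune for your workload.
TARGET_BYTES = 256 * 1024 * 1024


def plan_compaction(file_sizes: List[int], target_bytes: int = TARGET_BYTES) -> List[List[int]]:
    """Group files (by index) into batches whose total size stays at or below target_bytes.

    A single file larger than the target ends up in a batch of its own.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(file_sizes):
        # Close the current batch when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Ten hypothetical 4 KiB event files, compacted toward a 16 KiB target.
    print(plan_compaction([4 * 1024] * 10, target_bytes=16 * 1024))
```

Each inner list holds the indexes of the files that would be merged into one output file. A streaming engine that buffers events before writing achieves the same effect at ingestion time.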
A retail customer can store the past 5 years' worth of sales data in a data lake. In addition, they can process data from social media to extract new trends in consumption, gather intelligence on the competitive landscape from retail analytics solutions, and use all of these together as input to generate a data set that can be used to project the next year's sales targets. If you want to store your logs for both near real-time query and long-term retention, you can configure your diagnostic settings to send logs to both a Log Analytics workspace and a storage account. Another common question that our customers ask is when to use containers and when to use folders to organize the data. When ingesting data into a data lake, you should plan the data structure to facilitate security, efficient processing and partitioning. You can view the number of role assignments per subscription in any of the access control (IAM) blades in the portal. There are two types of ACLs: access ACLs, which control access to a file or a directory, and default ACLs, which are templates of ACLs set on a directory; a snapshot of these ACLs is inherited by any child items that are created under that directory. As an example, let us follow the journey of sales data as it travels through the data analytics platform of Contoso.com. Consider the workload's target recovery time objective (RTO) and recovery point objective (RPO). Before we talk about the best practices in building your data lake, it's important to get familiar with the terminology we will use in this document in the context of building your data lake with ADLS Gen2. Contoso wants to provide a personalized buyer experience based on each buyer's profile and buying patterns. LogsWriter is added to the ACLs of the /logs folder with rwx permissions. ADLS Gen2 offers a data lake store for your analytics scenarios with the goal of lowering your total cost of ownership.
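The snapshot behaviour of default ACLs is easy to misread, so here is a small in-memory model of it. This is a simplified sketch, not the ADLS Gen2 API: the Node class and demo function are hypothetical, masks and umask are ignored, and the r-x permission given to LogsReader is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Node:
    """A file or directory with ACL entries stored as principal -> permission string."""
    name: str
    is_dir: bool
    access_acl: Dict[str, str] = field(default_factory=dict)
    default_acl: Dict[str, str] = field(default_factory=dict)  # only meaningful on directories
    children: Dict[str, "Node"] = field(default_factory=dict)

    def create_child(self, name: str, is_dir: bool) -> "Node":
        # The child receives a copy (snapshot) of the parent's default ACL at creation time.
        child = Node(name=name, is_dir=is_dir, access_acl=dict(self.default_acl))
        if is_dir:
            # Sub-directories carry the defaults forward for their own children.
            child.default_acl = dict(self.default_acl)
        self.children[name] = child
        return child


def demo() -> Tuple[Node, Node]:
    logs = Node("logs", is_dir=True)
    logs.access_acl["LogsWriter"] = "rwx"
    logs.default_acl["LogsWriter"] = "rwx"
    first = logs.create_child("day1.json", is_dir=False)
    # Changing the default ACL later affects only items created afterwards.
    logs.default_acl["LogsReader"] = "r-x"  # assumed permission, for illustration
    second = logs.create_child("day2.json", is_dir=False)
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(first.access_acl)
    print(second.access_acl)
```

The first file keeps only the entry that existed when it was created, which is why groups are best added to the default ACL of a directory before data lands in it.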
The table below provides a framework for you to think about the different zones of the data and the associated management of the zones with a commonly observed pattern. In the meantime, while we call out specific engines as examples, please note that these samples talk primarily about storage performance. The SPNs/MSIs for Databricks will be added to the LogsReader group. Azure Data Lake is based on Azure Blob Storage, an elastic object storage solution that provides low-cost tiered storage, high availability, and robust disaster recovery capabilities. Parquet is one such prevalent data format that is worth exploring for your big data analytics pipeline. One common question that our customers ask is whether a single storage account can continue to scale indefinitely to meet their data, transaction and throughput needs. In addition, you have various Databricks clusters analyzing the logs. If you want to optimize for ease of management, especially if you adopt a centralized data lake strategy, this would be a good model to consider. RBACs are essentially scoped to top-level resources: either storage accounts or containers in ADLS Gen2. Data that can be shared globally across all regions E.g. Data that needs to be isolated to a region E.g. It lets you store data in two ways. Azure Data Lake Analytics is a compute service that lets you connect to and process data from ADLS. When using RBAC at the container level as the only mechanism for data access control, be cautious of the limit of 2000 role assignments per subscription, particularly if you are likely to have a large number of containers. Folder structure to mirror the ingestion patterns. You can use the Cool and Archive tiers in ADLS Gen2 to store this data. Identify the different logical sets of your data and think about your needs to manage them in a unified or isolated fashion; this will help determine your account boundaries. While at a high level both are used for logical organization of the data, they have a few key differences.
When is ADLS Gen2 the right choice for your data lake? In this case, you would want to optimize for the organization by date and attribute over the sensorID. Beyond this, organizations can optionally use Azure Data Lake Storage, a specialized storage service for large-scale datasets, and Azure Data Lake Analytics, a compute service that processes large-scale data sets using U-SQL. This would be raw sales data that is ingested from Contoso's sales management tool, which runs in their on-premises systems. Do I want a centralized or a federated data lake implementation? A reference for the full list of metrics and resource logs and their associated schemas can be found in the Azure Storage monitoring data reference. Please note that the scenarios we talk about focus primarily on optimizing ADLS Gen2 performance. You can read more about resource groups here. RBACs let you assign roles to security principals (user, group, service principal or managed identity in AAD), and these roles are associated with sets of permissions to the data in your container. These RBACs apply to all data inside the container. A storage account has no limit on the number of containers, and a container can store an unlimited number of folders and files. The goal of the enterprise data lake is to eliminate data silos (where the data can only be accessed by one part of your organization) and promote a single storage layer that can accommodate the various data needs of the organization. For more information on picking the right storage for your solution, please visit the Choosing a big data storage technology in Azure article. There are scenarios where enterprise data lakes serve multiple customer (internal/external) scenarios that may be subject to different requirements, such as different query patterns and different access requirements. Apache Parquet is an open source file format that is optimized for read-heavy analytics pipelines.
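As a rough illustration of organizing by date and attribute ahead of the sensor ID, the sketch below builds folder paths in that order. The layout, the function names sensor_path and day_prefix, and the sample attribute and sensor values are all hypothetical; the right layout depends on your own query patterns.

```python
from datetime import date


def day_prefix(attribute: str, day: date) -> str:
    """Folder prefix that holds every sensor's data for one attribute on one day."""
    return f"/{attribute}/{day.strftime('%Y/%m/%d')}/"


def sensor_path(attribute: str, day: date, sensor_id: str, extension: str = "csv") -> str:
    """Full path for one sensor's file; the sensor ID comes last, after attribute and date."""
    return f"{day_prefix(attribute, day)}{sensor_id}.{extension}"


if __name__ == "__main__":
    # Hypothetical attribute and sensor ID, used only to show the resulting layout.
    print(sensor_path("temperature", date(2024, 3, 7), "sensor-0042"))
    print(day_prefix("temperature", date(2024, 3, 7)))
```

With this ordering, a job that analyzes one attribute for one day lists a single folder instead of scanning a folder per sensor.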
As you are building your enterprise data lake on ADLS Gen2, it's important to understand your requirements around your key use cases. Hadoop has a set of file formats it supports for optimized storage and processing of structured data. However, when we talk about optimizing your data lake for performance, scalability and even cost, it boils down to two key factors. A comprehensive guide on key considerations involved in building your enterprise data lake. Share this page using https://aka.ms/adls/hitchhikersguide. Container (also referred to as a file system in HNS-enabled accounts): A container organizes a set of objects (or files). A data lake solution in Azure typically consists of four building blocks. Use Azure Data Factory to migrate data from an on-premises Hadoop cluster to ADLS Gen2 (Azure Storage); Use Azure Data Factory to migrate data from AWS S3 to ADLS Gen2 (Azure Storage); Securing access to ADLS Gen2 from Azure Databricks; Understanding access control and data lake configurations in ADLS Gen2. Create security groups for the level of permissions you want for an object (typically a directory, from what we have seen with our customers) and add them to the ACLs. Workspace data: In addition to the data that is ingested by the data engineering team from the source, the consumers of the data can also choose to bring other data sets that could be valuable. You can read more about these policies. Ensure that you are choosing the right replication option for your accounts. Being able to audit your data lake in terms of frequent operations; having visibility into key performance indicators such as operations with high latency; understanding common errors, the operations that caused the error, and operations which cause service-side throttling.
Where you choose to store your Azure Storage logs becomes important when you consider how you will access them: If you want to access your logs in near real-time and be able to correlate events in logs with other metrics from Azure Monitor, you can store your logs in a Log Analytics workspace. A very common point of discussion as we work with our customers to build their data lake strategy is how they can best organize their data. The Avro file format is favored where the I/O patterns are more write-heavy or the query patterns favor retrieving multiple rows of records in their entirety. Azure Data Lake Storage has a capability called Query Acceleration, available in preview, that is intended to optimize your performance while lowering the cost. Here are some best practices that will help you make the most of your data lake deployment on Azure. A file has an access control list associated with it. Let us look at some common file formats: Avro, Parquet and ORC. In this section, we will focus on the basic principles that help you optimize the storage transactions. Depending on what your business needs, you can choose to leave the data as is (E.g. In this case, they have various data sources (employee data, customers/campaign data and financial data) that are subject to different governance and access rules and are also possibly managed by different organizations within the company. Azure Storage logs in Azure Monitor can be enabled through the Azure Portal, PowerShell, the Azure CLI, and Azure Resource Manager templates. For more information on RBACs, you can read this article. For an enterprise data lake, you have two available options: either centralize all the data management for your analytics needs within one organization, or adopt a federated model, where your customers manage their own data lakes while the centralized data team provides guidance and also manages a few key aspects of the data lake such as security and data governance.
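The difference between formats that favor whole-row retrieval and formats that favor read-heavy analytics can be pictured with plain Python structures. The sketch below is only an in-memory analogy, not the actual Avro, Parquet or ORC encodings, and the sample records and function names are made up for illustration.

```python
from typing import Any, Dict, List

# Hypothetical sales records, stored row by row (every field of a record kept together).
ROWS: List[Dict[str, Any]] = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.25},
]


def to_columnar(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row records into one list per column (assumes every row has the same keys)."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def total_from_rows(rows: List[Dict[str, Any]]) -> float:
    # Row layout: every record is visited in full even though only one field is needed.
    return sum(row["amount"] for row in rows)


def total_from_columns(columns: Dict[str, List[Any]]) -> float:
    # Column layout: only the "amount" column is touched.
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columnar(ROWS)
    print(total_from_rows(ROWS), total_from_columns(columns))
    print(ROWS[1])  # fetching an entire record is a single lookup in the row layout
```

Both functions return the same total; the difference lies in how much data each layout has to read, which is the trade-off behind choosing a row-oriented or a column-oriented format.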
ACLs let you manage a specific set of permissions for a security principal at a much narrower scope: a file or a directory in ADLS Gen2. ADLS Gen2 provides policy management that you can use to manage the lifecycle of data stored in your Gen2 account.
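As a sketch of what a lifecycle policy might look like, the snippet below assembles a rule that moves data under a prefix to the Cool tier and later to the Archive tier, then prints it as JSON. The field names follow the Azure Storage lifecycle management policy format as we understand it, so check them against the current documentation before use. The helper build_lifecycle_policy, the rule name, the prefix and the day thresholds are assumptions for illustration.

```python
import json
from typing import Any, Dict


def build_lifecycle_policy(prefix: str, cool_after_days: int, archive_after_days: int) -> Dict[str, Any]:
    """Build a policy document that tiers block blobs under `prefix` by days since last modification."""
    if archive_after_days <= cool_after_days:
        raise ValueError("archive_after_days must be greater than cool_after_days")
    return {
        "rules": [
            {
                "name": "tier-aging-data",  # hypothetical rule name
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],  # container name followed by folder path
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                            "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical prefix and thresholds: cool after 30 days, archive after 180 days.
    print(json.dumps(build_lifecycle_policy("data/raw", 30, 180), indent=2))
```

The resulting JSON would then be applied to the storage account through whichever management tool you already use; that step is deliberately left out here.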