data lake hardware requirements

These two steps form a data catalog and generate security settings and access control policies. I would like to answer this question by making the following points: As the big data infrastructure evolves, enterprises and organizations manage data as an important asset type. This requires permission management capabilities. Big data analytics is essential for the conversion from advertising traffic into sales. This is one of the reasons why many fast-growing startups find it difficult to build a data warehouse or data mid-end to meet their needs. A data lake is designed based on a consideration of asset management. However, data warehouses do not support flexible data analytics and processing. First, let's take a look at a comparison table from the AWS website. The Lambda architecture integrates stream computing and batch processing. This article attempts to answer these questions and provide deep insights into the concept of data lakes. It stored its advertising data in a data lake built on Amazon S3 and used Amazon Athena for interactive analytics. A data lake holds a vast amount of raw data in its native format until it is needed. That is all that needs to be done in the data analysis step. The interface layer is WebHDFS, which provides an HDFS interface for Azure Object Storage. As shown in Figure 8, User A is only granted data access permission and cannot create tables in the specified bucket of Amazon S3. Figure 23: Basic Data Lake Construction Process. Then, the engine writes the data processing results to OSS, a relational database management system (RDBMS), NoSQL database, or data warehouse as needed. An especially important aspect of your architectural plans is a good data-management strategy that includes data governance and metadata, and how you will capture that. Azure allows you to develop a custom data lake solution based on Visual Studio. 
First, let's take a look at existing stream computing models: When the real-time stream computing model is processing data, the data is still in the network or memory but has not yet been stored in the data lake. Data applications in the data lake are closely related to data governance. The permissions of AWS Lake Formation are divided into data catalog access permissions and underlying data access permissions, which control access to metadata and to the data actually stored, respectively. If you need these two functions, you can develop them as needed. There is a core logic behind data analytics. Enterprise data accumulates over a long period of time, which is analogous to how rain fills a lake. Quality management and data governance are closely related to the organizational structure and business type of an enterprise, requiring a great deal of customization and development work. What are the differences between a data lake and a big data platform? Organizations that use data warehouses often do so to guide management decisions: all those data-driven decisions you always hear about. Data in the ODS is processed based on the predefined EDW paradigm and then transferred to the EDW. The Lambda architecture allows applications to read data consistently. With support for data governance, Huawei Cloud's data lake solution covers the entire data processing lifecycle and provides data governance process tools based on schemas and metrics. I currently work in the data business unit. However, the rapid development of Internet advertising has created several challenges for the advertising industry. Figure 8: Permission Separation for AWS's Data Lake Solution. DLI uses its built-in OBS as the core storage engine, which provides capabilities comparable to those of Amazon S3. It retrieves full and incremental data from data sources and stores the retrieved data in a standard manner. 
Merchants can analyze this tracking data at a fine granularity through data analytics services. Figure 11: Implementation of the DAYU Data Governance Methodology. Therefore, a general solution usually does not provide these two functions. If you have any questions when reading this article, please do not hesitate to let me know. Then, you can analyze the data based on the click-through rate (CTR) to develop strategies that lead to better performance and higher productivity. What is a data lake? It should be flexible to run in batch, one-time, or real-time modes, and it should support all types of data along with new data sources. In contrast, a data lake can be built in an agile manner. Data lakes are designed for data scientists and analysts. DLA processes incoming data based on SQL and Spark and externally provides visual data integration and development capabilities based on DataWorks and DMS. The difference is that a data lake is built in a more agile manner. When necessary, a data lake can collaborate with external storage engines to meet the needs of various applications. Once you have the business alignment and you know what your priorities are, you need to define the upfront architecture: what are the different components you will need, and what will the end technical platform look like? Since business changes are unpredictable, you can always keep data as-is and process data as needed. A data lake provides comprehensive capabilities for data retrieval and publishing. Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports it. Other cloud products may also be involved. However, the separation of batch processing and stream computing complicates research and development. The data lake has become a middle ground between EDWs and data dumps in offering systems that are still agile and flexible, but have the safeguards and auditing features that are necessary for business-critical data. 
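The split between data catalog permissions and underlying data permissions mentioned above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Principal` class, the `CREATE_TABLE` permission name, and the `s3://example-bucket/ads/` location are hypothetical and are not the AWS Lake Formation API. It shows a user who, like User A in Figure 8, can read stored data but cannot create tables.

```python
from dataclasses import dataclass, field


@dataclass
class Principal:
    """A user with two independent permission sets (hypothetical model)."""
    name: str
    # Catalog (metadata) permissions, e.g. {"CREATE_TABLE", "DESCRIBE"}.
    catalog_permissions: set = field(default_factory=set)
    # Storage prefixes whose objects the user may read.
    readable_locations: set = field(default_factory=set)


def can_create_table(user: Principal) -> bool:
    """Metadata-level check: is the user allowed to create catalog tables?"""
    return "CREATE_TABLE" in user.catalog_permissions


def can_read(user: Principal, object_path: str) -> bool:
    """Data-level check: does the object fall under a readable prefix?"""
    return any(object_path.startswith(prefix) for prefix in user.readable_locations)


# User A is granted access to the stored data only, not to the catalog.
user_a = Principal(name="User A", readable_locations={"s3://example-bucket/ads/"})

print(can_read(user_a, "s3://example-bucket/ads/2020/01/clicks.csv"))
print(can_create_table(user_a))
```

Keeping the two checks separate means that granting read access to stored objects never implies the right to change metadata, which is the point of the permission separation.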
Data flows through the Lambda platform from left to right, as shown in Figure 3. DLA implements data discovery, data access, and data migration to build a complete data lake that is adaptable to all types of data sources. It is highly possible that data lakes and data warehouses will be integrated in the future. It is a technology that is better adapted to customers' businesses. An effective way to break out of ineffective online advertising methods is to use data tools to convert more of your website visitors into paying customers and refine ad serving comprehensively. AWS's data lake solution provides all of the functions shown in the reference architecture except quality management and data governance. In terms of the total cost of ownership (TCO) for ad serving and analytics, DLA provides serverless elastic services that are billed in pay-as-you-go mode, with no need to purchase fixed resources. The data stored in a data lake is raw data or a complete replica of business data. Customers can purchase resources based on the peaks and valleys of their businesses. The data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes. What are the building blocks of a data lake project? A river flows freely and eventually converges with the sea. In addition, a data lake provides the following more sophisticated data management capabilities: Figure 5: The Reference Architecture of Data Lake Components. Therefore, to meet the access needs of different applications, a data lake must support additional storage engines as well as core storage engines, such as Amazon S3, Alibaba Cloud OSS, OBS, and HDFS. DMS manages data at four granularities, database, table, column, and row, providing control over data security required by enterprises. The preceding table compares the differences between a data lake and a traditional data warehouse. 
Data from data sources of an operational or transactional system is extracted or loaded to the ODS. Figure 16: DG's Transformed Advertising Data Lake Solution. Figure 13: Data Application Architecture of Alibaba Cloud's Data Lake Solution. The DAYU platform is an implementation of Huawei's methodology for data lake governance and operations. The top-down (DM-DW) data schema provides a fast warehouse setup, quick ROI, and agility. Business demands always evolve along with business growth. This reflects the maturity of the ecosystem surrounding AWS's data lake solution. It manages full data throughout its lifecycle. AWS Glue implements basic computing through batch ETL tasks. Unlike traditional data warehouses, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be. Compared with Figure 22, which illustrates the basic process of building a data warehouse or data mid-end, Figure 23 illustrates a five-step process for building a data lake in a simpler and more feasible manner: 1. Data Analysis: Analyze basic information about the data, including the data sources, data types, data forms, data schemas, total data volume, and incremental data volumes. Figure 14 illustrates a classic implementation of data lake-data warehouse integration in the gaming industry. This post is a collaboration between O'Reilly and Zaloni. Huawei Cloud provides the DAYU platform to better support advanced data lake functions, such as data integration, data development, data governance, and quality management. This problem is solved by stream computing engines, such as Storm, Spark Streaming, and Flink. 
For example, TensorFlow and PyTorch can be trained on sample data from the Hadoop Distributed File System (HDFS), Amazon S3, or Alibaba Cloud Object Storage Service (OSS). When it was founded, DG decided to build its IT infrastructure on a public cloud. To drive down analytics costs resulting from exponential data growth, DG decided on full migration from the AWS platform to the Alibaba Cloud platform after meticulous testing and analysis. Deep support for data asset management has yet to be achieved. Once, I read an interesting article that raised this question: Why do we use the term "data lake" instead of data river or data sea? Does Alibaba Cloud have a mature data lake solution? Going back to the people and skills point, it's critical to have the right people with experience managing these environments, to put together an operations team to support the SLAs and meet the business requirements. As business increasingly becomes purely digital, access to data will become a critical priority, as will speed of development and deployment. Figure 10: Data Lake Solution Provided by Huawei. In recent years, the cost of traffic acquisition has been increasing, forcing many companies to invest heavily to attract new online customers. At the same time, create a data schema, metrics system, and quality standards. Is there a single system to solve all these problems? This solution is based on AWS Lake Formation, which is a management component and works with AWS's other services to form an enterprise data lake. Figure 20: SaaS Model of YM's Data Intelligence Services. Hadoop is an extraordinary technology. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that's ready to meet your current and future business needs. 
This means AWS Glue is closely integrated with AWS Lake Formation and has higher access permissions for the data stored in AWS Lake Formation. This section analyzes the data lake solutions of major cloud service providers and illustrates them in a data lake reference architecture to compare the advantages and disadvantages of different solutions. In theory, a well-managed data lake retains raw data permanently, while constantly improving and evolving process data to meet your business needs. Both DLA and AnalyticDB inherently support OSS access. I would like to thank Nanjing for compiling the cases in Section 5.1 of this article and thank Xibi for his review. Huawei's official documentation offers suggestions for building a data governance organization. It splits each computing process by dividing a job into logical stages based on aggregation operations. It processes data in multiple modes and manages data throughout its lifecycle. For example, Amazon S3 stores raw data, NoSQL stores processed data that can be accessed in KV mode, and online analytical processing (OLAP) engines store data that is used to create reports in real time or support ad hoc queries. You can use the built-in capabilities of Spark or Flink to support machine learning. Azure Data Lake includes all of the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. Figure 5 shows the reference architecture of a data lake system. Behavior data and structured data were separated and could not be associated for analysis; intelligent behavior data retrieval was supported, but deep data mining and analytics services were not; and OSS was only used for data storage, so its deeper data value was not utilized. The combination of big data and artificial intelligence (AI) gave birth to a variety of machine learning and deep learning algorithms. 
DAYU covers the core processes of data lake governance and provides tool support. For the architecture, I used a format similar to that of AWS's data lake solution, so we can intuitively compare the two solutions. Big data platforms process the full data of an enterprise or organization while providing a full range of data processing capabilities to meet application needs. In short, AWS's data lake solution provides full support for metadata management and permission management. The data lake setup process is business-oriented and different from the process of building a data warehouse or data mid-end, which are also popular technologies. One part is subject to batch processing, and the other part is subject to stream computing. Figure 7 illustrates the AWS-recommended data lake solution. This section introduces the basic characteristics of a data lake, especially the characteristics that differentiate a data lake from a big data platform or a traditional data warehouse. It provides a unified area for the storage of the internal data of an enterprise or organization. The concept of data lakes has recently become a hot topic. Gamer behavior data was collected by Logtail in Log Service (SLS) and then shipped to OSS and Elasticsearch. On the other hand, DLA provides powerful computing capabilities, allowing DG to analyze ad serving on a monthly and quarterly basis, accurately calculate the number of activities for each brand, and analyze the ad performance of each activity in terms of media, markets, channels, and data management platforms (DMPs). 6. Multi-Modal Storage Engine: In theory, a data lake should provide a built-in multi-modal storage engine to enable data access by different applications, while considering a series of factors, such as the response time (RT), concurrency, access frequency, and costs. Support for stream computing varies in different data lake solutions. 
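The idea that one part of the data goes through batch processing and the other through stream computing can be sketched as a toy example. This is a simplified illustration of a Lambda-style merge at query time, not the implementation of any product named here; the function names and the `ad_id` field are assumptions made for the example.

```python
from collections import Counter


def build_batch_view(stored_events):
    """Batch path: recompute click counts from all events already stored."""
    return Counter(event["ad_id"] for event in stored_events)


def update_realtime_view(realtime_view, event):
    """Stream path: incrementally count an event that is not yet stored."""
    realtime_view[event["ad_id"]] += 1


def query(batch_view, realtime_view, ad_id):
    """Serving step: merge both views so readers see one consistent total."""
    return batch_view[ad_id] + realtime_view[ad_id]


# Events already landed in storage (handled by the batch path).
stored = [{"ad_id": "a1"}, {"ad_id": "a1"}, {"ad_id": "a2"}]
batch_view = build_batch_view(stored)

# Events still in flight (handled by the stream path).
realtime_view = Counter()
for incoming in [{"ad_id": "a1"}]:
    update_realtime_view(realtime_view, incoming)

print(query(batch_view, realtime_view, "a1"))  # 2 stored + 1 in flight = 3
```

The sketch also hints at the cost the text mentions: the same counting logic has to be written once for the batch path and once for the stream path.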
A data lake is intended for data scientists and business analysts that usually need highly efficient data processing and analytics and prefer to use visual tools. This enables effective adaptation between data and your businesses but increases the cost of data warehouse maintenance at the early stage. The data lake solution uses AWS Glue for basic data processing. A data lake focuses on raw data storage and exploratory data analytics and applications. 3. Data Access: Determine the data sources to be accessed and complete full data extraction and incremental data access. For many fast-growing game companies, a popular game often results in extremely fast data growth in a short time. Data lakes can give retailers profitable insights from raw data, such as log files, streaming audio and video, text files, and social media content, among other sources, to quickly identify real-time consumer behavior and convert actions into sales. The centralized storage shown in Figure 5 is a business-related concept. Integrated data lake management solutions like Bedrock and Mica are now delivering the necessary controls without making Hadoop as slow and inflexible as its predecessor solutions. If you plan to perform actions like these, you need to thoroughly understand the throughput requirements from a data ingestion standpoint, which will dictate throughput for storage and network as well as whether you can process the data in a timely manner. A data lake is the infrastructure for next-generation big data analytics and processing. The global data lake market was valued at $7.9 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 20.6 percent by 2024 to reach $20.1 billion. As the gaming industry becomes more competitive, gamers are demanding higher quality products and the lifecycles of game projects are becoming increasingly short, which directly affects projects' return on investment (ROI). 
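The market figures quoted above can be sanity-checked with the compound growth formula. The sketch below assumes five compounding years (2019 to 2024), which the text does not state explicitly, and the function names are illustrative only.

```python
def project_value(start_value, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start_value * (1 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that takes start_value to end_value in `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


YEARS = 2024 - 2019  # assumption: five full compounding periods

# Grow $7.9 billion at 20.6 percent per year.
projected = project_value(7.9, 0.206, YEARS)
print(f"Projected 2024 market size: ${projected:.2f} billion")

# Growth rate implied by the two quoted endpoints.
rate = implied_cagr(7.9, 20.1, YEARS)
print(f"Implied CAGR: {rate * 100:.1f} percent")
```

Under that assumption the projection lands within about $0.1 billion of the quoted $20.1 billion, so the quoted figures are consistent with each other up to rounding.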
To implement these capabilities, a data lake provides a series of data management components, including data access, data migration, data governance, quality management, asset catalog, access control, task management, task orchestration, and metadata management. DG's mobile advertising system frequently encountered exponential increases in traffic volume during its rapid business growth and introduction of multiple business lines. Data storage permission specifies the explicit or implicit access permissions for each specific data catalog in Amazon S3. Different from a data warehouse, a data lake stores raw data, whose format, schema, and content cannot be modified. Scalability means a data lake is scalable in terms of size and capabilities. Regardless of the source, most definitions of the data lake concept focus on the following characteristics of data lakes: In short, a data lake is an evolving and scalable infrastructure for big data storage, processing, and analytics. AWS's data lake solution provides stream computing and machine learning only as extended computing capabilities rather than required capabilities, but you can easily integrate them with the solution. Data access permission is similar to the access permission that a database grants to its tables. http://www.gartner.com/newsroom/id/2809117? However, this SaaS model has the following problems: Therefore, we introduced a data lake to the SaaS model shown in Figure 20 to provide an infrastructure for data accumulation, modeling, and operations analytics. The solution also introduces QuickBI in the frontend for visual analysis. Phase 1: This shows the offline data processing infrastructure, as represented by Hadoop. As part of your technology platform and architecture, you must think about what the scale-out capabilities of your data lake will look like. 
Compared with a traditional data warehouse, this near-source layer provides the following advantages: By integrating DLA and AnalyticDB, you can enjoy the processing capabilities of a big data platform and a data warehouse at the same time. As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle the new requirements. The following sections explain each of these phases in detail. In terms of machine learning, Amazon SageMaker reads training data from Amazon S3 and then writes trained models back to Amazon S3. This helps you trace the entire production process of any data record. The solution can use all Alibaba Cloud databases as data sources, including online transaction processing (OLTP), OLAP, and NoSQL databases. After migration, we integrated DLA and OSS to provide superior analytics capabilities for DG. This may be from solving pain points or by creating net new revenue streams that you can enable business teams to deliver. Outline a robust security strategy, especially if your data lake will be a shared platform used by multiple lines of business units or both internal and external stakeholders. Thinking that you can simply hire more developers to write code or rely on your legacy rows-and-columns-centric tools is a recipe to sink in a data swamp instead of swimming in a data lake. A data lake stores your business data as-is. This architecture had the following problems: Our analysis showed that YJ's architecture was a prototype of a data lake because full data is stored in OSS. The stored data can include data of any format and of any type. 
It stores structured, semi-structured, and unstructured data. It supports access to all types of disparate data sources. It discovers, manages, and synchronizes metadata. It provides the built-in SQL and Spark computing engines, which can process various types of data more effectively.