1. What is big data?
In this article, the author uses the metaphor of cooking in a restaurant to explain the whole big data processing workflow. I hope this easy-to-understand approach is helpful to everyone.
Big data is a fairly abstract and complex concept, so I want to introduce it in an accessible way. Processing big data is really no different from cooking in a restaurant: it involves buying ingredients, washing and cutting them, preparing side dishes, cooking, and finally plating and presentation. Let's walk through these steps.
The first step is buying the ingredients. Sourcing ingredients is important and far from simple; in big data terms, this is "data acquisition" or "data collection".
In "Big Data Restaurant", there are various channels for obtaining data, just as restaurants obtain ingredients from different suppliers, the data ingredients of "Big Data Restaurant" can be obtained from databases, hodoop, cloud and other channels.
The ingredients these suppliers provide arrive in different states of preparation, just like the different types of data.
Some are pre-processed ingredients, such as packaged vegetables and meat; these are like structured data, with a clear format and content.
Some have been rinsed once but are unpackaged, like vegetables with a little soil still on them; these are like semi-structured data, which is fairly regular but still needs processing.
And some are completely unprocessed, such as muddy vegetables and live chickens, ducks, and fish; these are like unstructured data, with no fixed format, requiring further sorting and handling.
How often to buy also needs to be considered, just like the frequency of data acquisition: once a day, once an hour, or even once a second.
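As a rough sketch of this "buying" step, the snippet below reads each type of data ingredient in Python; it assumes pandas is installed, and the file names (orders.csv, clicks.json, reviews.txt) are purely hypothetical:

```python
import json

import pandas as pd  # assumed to be installed

# Structured data: a packaged, ready-to-use ingredient with clear rows and columns.
orders = pd.read_csv("orders.csv")             # hypothetical CSV export from a database

# Semi-structured data: fairly regular, but still needs handling (e.g. JSON logs).
with open("clicks.json", encoding="utf-8") as f:
    clicks = [json.loads(line) for line in f]  # one JSON record per line

# Unstructured data: raw text with no fixed format, to be sorted out later.
with open("reviews.txt", encoding="utf-8") as f:
    reviews = f.read()

print(len(orders), len(clicks), len(reviews))
```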
The second step is storage, like the restaurant's warehouse. The purchased ingredients need enough space to be stored, and their freshness and safety must be guaranteed; this is "data storage".
For big data, storage systems such as Hadoop's HDFS (a distributed file system) provide that space. HDFS spreads large volumes of data across many nodes, like storing ingredients on different shelves in different areas of the warehouse. The benefit is that it can handle huge amounts of data, and when one storage node fails, the rest of the data is unaffected, just as a damaged shelf does not spoil the ingredients on the other shelves.
Different types of data are stored in different ways: structured data may be stored in relational databases, while unstructured data (such as texts and images) may be stored in special file systems or object storage, just like different ingredients need to be stored in different warehouse areas, dry goods are stored in normal temperature areas, fresh food is stored in cold storage areas, and meat is stored in frozen areas.
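As a minimal sketch of moving an ingredient into the warehouse, the snippet below copies a local file into HDFS from Python. It assumes a configured Hadoop client is on the PATH; the local file and HDFS directory names are hypothetical:

```python
import subprocess

local_file = "orders.csv"        # hypothetical local "ingredient"
hdfs_dir = "/data/raw/orders"    # hypothetical "shelf" in the warehouse

# Create the target directory and upload the file using the standard HDFS CLI.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

# HDFS keeps several replicas of each block on different nodes, so losing one
# "shelf" (node) does not lose the ingredient.
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```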
The third step is picking and washing the vegetables. You cannot throw muddy vegetables or unplucked meat straight into the pot, and you cannot cook with spoiled ingredients; at best the dish tastes bad, at worst it causes an accident. So picking and washing the vegetables is a necessary step.
In the same way, raw data can rarely be used directly, and dirty data must not go straight into the "cooking", otherwise it will distort everything that follows. This step is commonly called "data cleaning" or "data preprocessing"; only after cleaning can analysis and mining proceed.
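A minimal cleaning sketch with pandas, using a small made-up table to show the typical kinds of "dirt": duplicates, missing values, obviously wrong values, and malformed dates:

```python
import pandas as pd

# Hypothetical raw records with several kinds of "dirt".
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None, "Cat"],
    "amount":   [100.0, 100.0, -5.0, 42.0, 60.0],
    "date":     ["2024-01-03", "2024-01-03", "2024-01-04", "2024-01-05", "not-a-date"],
})

df = df.drop_duplicates()                                  # remove duplicate records
df = df.dropna(subset=["customer"])                        # drop rows missing key fields
df = df[df["amount"] >= 0]                                 # discard obviously wrong values
df["date"] = pd.to_datetime(df["date"], errors="coerce")   # invalid dates become NaT
df = df.dropna(subset=["date"])                            # then drop those rows too

print(df)
```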
The fourth step is cutting and preparing the ingredients. In a restaurant, the chef cuts ingredients into the right shapes and sizes for each dish and combines them to get the best cooking result.
In big data processing, this step is equivalent to "data processing and conversion".
For big data, the data may come from different sources, with different formats and structures. Through processing and transformation, the data is standardized and formatted to meet the requirements of later analysis. For example, values recorded in different units are converted into the same unit, and date formats are unified into one standard format.
At the same time, you can also filter, aggregate and split the data according to the analysis requirements, just as a chef cuts and matches the ingredients according to the needs of dishes.
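As a small sketch of this "cutting and preparing" step with pandas, using made-up sales records in which some amounts are in yuan and some in wan (ten thousand yuan):

```python
import pandas as pd

# Hypothetical sales records arriving with mixed units and date strings.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "amount": [1200, 0.9, 800, 1.5],
    "unit":   ["yuan", "wan", "yuan", "wan"],   # "wan" = ten thousand yuan
    "date":   ["2024-01-03", "2024-01-04", "2024-01-03", "2024-01-04"],
})

# Standardize: convert every amount into the same unit and every date into one format.
sales["amount_yuan"] = sales.apply(
    lambda r: r["amount"] * 10_000 if r["unit"] == "wan" else r["amount"], axis=1)
sales["date"] = pd.to_datetime(sales["date"])

# Filter, aggregate, and split according to the analysis requirements.
recent = sales[sales["date"] >= "2024-01-04"]              # filter
by_region = sales.groupby("region")["amount_yuan"].sum()   # aggregate

print(by_region)
print(recent)
```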
The fifth step is cooking. In the kitchen, the chef uses various cooking techniques and seasonings to turn the cut ingredients into delicious dishes.
In the field of big data, this step corresponds to "data analysis and mining".
Data analysis and mining is the core of big data processing. By applying various analysis methods and algorithms, valuable information and knowledge can be extracted from large volumes of data. For example, statistical analysis is used to calculate the mean, variance, correlation, and other indicators of the data to understand its basic characteristics, while machine learning algorithms are used for classification, clustering, prediction, and other tasks to discover patterns and regularities in the data.
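A minimal "cooking" sketch, assuming NumPy and scikit-learn are available and using a made-up table of customer spend and visit counts:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: monthly spend and visit count for a handful of customers.
X = np.array([[120, 3], [130, 4], [90, 2], [900, 20], [950, 22], [870, 18]])

# Statistical analysis: basic characteristics of the data.
print("mean:", X.mean(axis=0))
print("variance:", X.var(axis=0))
print("correlation:", np.corrcoef(X[:, 0], X[:, 1])[0, 1])

# Machine learning: cluster customers into groups with similar behaviour.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", labels)
```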
Just as chefs create delicious dishes through different cooking methods and seasoning combinations, data analysts dig out valuable insights from data through various analytical means.
The sixth step is plating. A good dish should not only taste good but also look appealing.
In restaurants, chefs carefully plate and garnish dishes to make them more attractive. For big data, this step is "data visualization". Data visualization presents the results of analysis and mining as intuitive, easy-to-understand graphics and charts, so that users can quickly grasp the meaning and value of the data.
For example, the distribution, trend and proportional relationship of data are displayed by visual tools such as histogram, line chart and pie chart. Just as exquisite plate setting can enhance the appeal of dishes, data visualization can enhance the readability and understandability of data and help users make better decisions.
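As a small plating sketch with matplotlib (assumed to be installed), using made-up sales figures to draw a bar chart standing in for the histogram, a line chart, and a pie chart:

```python
import matplotlib.pyplot as plt

# Hypothetical analysis results to be "plated" for the reader.
regions = ["North", "South", "East", "West"]
sales = [1200, 800, 950, 600]
months = ["Jan", "Feb", "Mar", "Apr"]
trend = [300, 320, 410, 470]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(regions, sales)                             # bar chart: distribution
axes[0].set_title("Sales by region")
axes[1].plot(months, trend, marker="o")                 # line chart: trend
axes[1].set_title("Monthly trend")
axes[2].pie(sales, labels=regions, autopct="%1.0f%%")   # pie chart: proportions
axes[2].set_title("Share of sales")
plt.tight_layout()
plt.show()
```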
By comparing the big data processing workflow to cooking in a restaurant, we can clearly see the importance of each stage and how they connect. Big data is like the art of cooking: from collecting the data (buying ingredients), to storing it (stocking the warehouse), through cleaning, processing, analysis and mining, and finally to visual presentation, just like producing a dish with good color, aroma, and flavor.
These steps are closely linked, and any problem in any link may affect the final "food quality", that is, the effective mining and utilization of data value.
Whether it is enterprise decision-making, scientific research exploration or social governance, understanding and mastering the process of big data processing can help us cook our own "delicious food" from massive data and provide strong support and guidance for our actions and choices.
So far, we have mainly introduced the big data processing workflow, likened to cooking: the process of turning raw data into a valuable product.
However, this is only part of the picture, and there is still a gap between this and building a complete "big data restaurant". Running a real restaurant takes far more than mastering the cooking methods; it needs many kinds of people working together, such as buyers carefully selecting ingredients, chefs showing off their skills, and waitstaff providing attentive service to customers.
Similarly, in the field of big data, we need the right people to keep everything running smoothly, and we need to equip them with suitable tools to support their work. People and equipment are both indispensable parts of this "big data restaurant".
So, in this "big data restaurant", what role do people play in all aspects?
Data acquisition personnel (buyer)
Just as a buyer in a restaurant is responsible for finding and obtaining high-quality ingredients, a data collector is responsible for collecting data from various data sources. They need to understand different data sources and be able to use appropriate tools and technologies to obtain data. For example, for collecting data from website logs, they should be familiar with the use of log collection software to ensure the integrity and accuracy of the data. These people also need to pay attention to the legality and compliance of data collection, just as buyers need to ensure that the sources of ingredients are legal, and avoid problems such as data privacy disclosure.
Data storage engineer (warehouse manager)
Similar to the restaurant warehouse manager responsible for warehouse planning, food storage and management, the data storage engineer should design and maintain the data storage system. They need to be proficient in distributed storage systems such as Hadoop’s HDFS, allocate storage resources reasonably, and ensure that massive data has enough space to store. When there are problems in data storage, such as storage node failure or data loss, they should take timely measures to restore and repair as warehouse administrators deal with food damage or loss. Moreover, they are also responsible for the security of data storage, setting access rights and preventing unauthorized access, just as warehouse administrators want to ensure the security of warehouses.
Data cleaning expert (vegetable washer)
Data cleaning experts are like serious vegetable washers in restaurants. Their task is to carefully check and clean up the "dirt" in the data. These "dirt" include missing values, wrong values, duplicate data and data with irregular format.
They use various data cleaning tools and methods, such as data cleaning software to identify and handle missing values, or scripts and specialized tools to check and correct logical errors in the data. The quality of their work directly affects all subsequent processing, just as the dishes suffer if the vegetable washer does not wash the ingredients properly.
Data processing and analysis personnel (chef)
Data processors and analysts are the core role of big data restaurants, just as chefs are the soul of restaurants. They should be proficient in various data processing frameworks (such as MapReduce and Spark) and data analysis methods (such as statistical analysis and machine learning algorithms). They carefully "cook" (process and analyze) the cleaned "ingredients" (data), and dig out valuable information in the data, such as discovering association rules in the data, classifying and clustering the data, etc. They also need to flexibly use different "cooking skills" (analysis methods) according to different "dish needs" (business problems) to produce "dishes" (analysis results) that meet the needs of "customers" (data users).
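As a tiny illustration of the chef's tools, the sketch below aggregates cleaned data with Spark; it assumes PySpark is installed and uses a hypothetical orders.csv with region and amount columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical input file and column names.
spark = SparkSession.builder.appName("big-data-kitchen").getOrCreate()
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Distributed "cooking": summarize the cleaned ingredients at scale.
summary = (orders
           .groupBy("region")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("order_count")))

summary.show()
spark.stop()
```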
Data visualization designer (dish setter)
Data visualization designers are like dish setters in restaurants. They are responsible for presenting the analysis results in an attractive way. They should understand users’ needs and visual habits, and choose appropriate visualization tools (such as Tableau and PowerBI) and chart types (such as histogram, line chart and pie chart).
Their job is to make the data "dishes" visually appealing, so that users can quickly understand the meaning and value of the data, just as the dish setter uses exquisite plating to make dishes more attractive and easier for customers to appreciate and enjoy.
Data application expert (waiter)
Data application experts are like waiters in restaurants. They pass on the results of data processing and analysis to users (enterprise decision makers, business personnel, etc.) and help users understand and apply these results. They need to understand business scenarios and user needs, and can turn data insights into practical action suggestions.
For example, in an enterprise’s precision marketing scenario, data application experts should provide personalized marketing solutions for marketers according to customer preferences obtained from data analysis, just as waiters recommend appropriate dishes according to customers’ tastes, so as to ensure that the value of data can be fully exerted in actual business.
Big Data System Administrator (Restaurant Manager)
Big data system administrators play the role of restaurant managers, and they have to coordinate the operation of the whole big data system. They are responsible for coordinating the personnel in each link to ensure the smooth connection of data collection, storage, processing, visualization and application.
They should also pay attention to the performance and resource utilization of big data systems, just as restaurant managers should pay attention to the operational efficiency and cost of restaurants. When problems arise, they should dispatch resources to solve them in time, and make plans for the development and optimization of big data systems to ensure that big data restaurants can operate continuously and efficiently.
Finally, there are the tools and equipment. In the big data field, many vendors claim to "do big data", but in fact they are suppliers of pots, that is, they make the cookware. For example, Hadoop, MPP databases, big data platforms, and BI tools are all pots.
However, the pot is only one part of cooking a delicious meal; no matter how excellent the pot is, it cannot realize its real value without a skilled chef to use it.
In the world of big data, although these pots are important, what is more important is the people who use them.
Data scientists, analysts and engineers are like chefs. They use their professional knowledge and experience to carefully "cook" data in these "pots" and turn it into valuable information to promote decision-making, innovation and development. At the same time, different "pots" are suitable for different "ingredients" and "cooking styles". Enterprises and organizations need to choose the appropriate big data tools and platforms according to their own data characteristics and business needs, in order to truly cook a "data feast" that meets their own tastes and nutritional needs, and thrive and stand out in this data-driven era.