Sep 10, 2024 5 min read

Big Data Hadoop Basics

You can’t have a discussion about big data without some Hadoop basics, but when you ask people what Hadoop is, they struggle to define it. I define Hadoop as a software framework for harnessing the power of distributed, parallel computing to store, process, and analyze large volumes of structured, semi-structured, and unstructured data.

Imagine having to write a 30-chapter book. You have two options: one, write it yourself, or two, write the one chapter you know the most about and farm out the other 29 chapters to experts on those topics. Which option would deliver the best book in the least amount of time? Obviously, distributing the work among 30 experts would expedite the process and result in a higher quality product. The same is true with distributed, parallel computing; instead of having one computer do all the work, you distribute tasks to dozens, hundreds, or even thousands of "commodity" (off-the-shelf) servers that perform the work as a collective. This group of servers is referred to as a Hadoop cluster, and Hadoop coordinates their efforts.

Three Key Features

Hadoop has three key features that differentiate it from traditional relational databases, which make it an attractive option for working with big data:

  • Support for semi-structured and unstructured data: With a traditional relational database, any incoming data must be structured to fit the database's existing schema before it can be loaded. With Hadoop, data can be imported without regard to structure; structure is applied when the data is read, an approach known as schema-on-read.
  • Hadoop Distributed File System (HDFS): Instead of storing data in interrelated tables, Hadoop stores it as files, splits each file into large blocks, distributes the blocks across different nodes, and replicates them (by default, each block is stored on three nodes). Storing identical data on separate nodes facilitates distributed, parallel computing while making the system more fault tolerant. If a node fails, another can pick up the slack.
  • MapReduce: MapReduce is a Java-based programming model for distributed, parallel computing. It assigns work to the various nodes in a Hadoop cluster, broken down into two procedures: map, which performs the filtering and sorting, and reduce, which pulls together the results of the map procedure and summarizes them.
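The two-phase flow described above can be sketched in plain Python. This is a conceptual illustration only, not the Hadoop Java API; the sample lines and function names are invented for the example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group the pairs by key, as Hadoop does between phases
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        # Reduce: summarize the values collected for each key
        counts[key] = sum(v for _, v in group)
    return counts

lines = ["big data needs big tools", "hadoop handles big data"]
result = reduce_phase(map_phase(lines))
```

In a real cluster, the map calls run on many nodes at once and the framework handles moving the intermediate pairs between them; the logic per key, however, is just this simple.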

Advantages of Hadoop

Hadoop has several advantages over traditional relational databases, including the following:

  • Flexible: Hadoop supports every imaginable data type, so organizations can load diverse data into their data warehouse without spending a lot of time transforming it to match an existing schema.
  • Scalable: If more storage or compute resources are required, Hadoop can simply bring more servers online to handle the load.
  • Fast: By distributing workloads across numerous servers and coordinating their parallel processing, Hadoop significantly increases the speed of processing and analytics. Also, because capacity can be expanded by adding nodes, performance doesn't suffer when multiple users need to access the same data or when the system is engaged in different processes, such as loading data versus analyzing data.
  • Economical: Cloud storage and processing are quickly becoming inexpensive commodities, and companies that sell these commodities often charge according to the amount of storage and compute resources used. With Hadoop, usage can be scaled up and down at a moment’s notice to increase performance when needed and reduce costs when less storage and compute are needed.
  • Fault tolerant: Because storage and compute are distributed, if one server fails, its workload can be transferred to another server with no loss of data or performance.
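The divide-and-combine idea behind these advantages can be sketched with Python's standard library, using worker threads to stand in for cluster nodes. This is a toy illustration of the pattern, not how Hadoop actually distributes work:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "node" works on its own slice of the data independently
    return sum(chunk)

data = list(range(1, 101))                                   # 1..100
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]   # 4 "nodes"

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

# Combine the partial results, as a reduce step would
total = sum(partial_sums)
```

Adding capacity in this model means adding workers and splitting the data into more chunks, which is exactly the scaling story the bullets above describe.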

Disadvantages of Hadoop

Hadoop certainly helps to overcome many of the challenges of big data, but it is not, in itself, an ideal solution. It is weak in the following areas:

  • Small files: Hadoop is geared to handle large files. HDFS copes poorly with large numbers of files significantly smaller than its default block size (128 MB in Hadoop 2 and later, 64 MB in Hadoop 1), because the NameNode must keep in-memory metadata for every file.
  • Security: Hadoop wasn’t designed with security in mind. Security features were added later, but the built-in security features aren’t considered adequate for most enterprises, especially in industries that require strict security and privacy, such as healthcare and finance.
  • Data consolidation: Many organizations that run Hadoop essentially have two data warehouses. One for internal (structured) data and another for external (semi-structured and unstructured data). They struggle to combine data sets from the two warehouses to run queries or analytics.
  • Ease of use: Most database users are familiar with Structured Query Language (SQL), which enables them to easily query data in traditional databases. With Hadoop, users must hand-code each operation in MapReduce, which is significantly more complicated. However, certain add-ons, such as Hive and Pig, provide higher-level languages that simplify the process.
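The small-files weakness above comes down to NameNode metadata: the NameNode holds an in-memory record for every file and block, regardless of how small the file is. A commonly cited rule of thumb puts each such object on the order of 150 bytes of heap; this sketch uses that figure to estimate the overhead (the constant is an approximation, not an exact measurement):

```python
# Rough rule of thumb: each file or block object occupies on the
# order of 150 bytes of NameNode heap memory.
BYTES_PER_OBJECT = 150

def namenode_memory_mb(num_files, blocks_per_file=1):
    # Every file costs one file entry plus one entry per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / (1024 * 1024)

# One million tiny files consume roughly the same metadata budget as
# one million huge files, even though they hold far less data.
small = namenode_memory_mb(1_000_000)
```

This is why the usual advice is to merge many small files into fewer large ones before loading them into HDFS.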

Commercial versions of Hadoop are available to help overcome these and other limitations of the open source Apache Hadoop, including Amazon Web Services Elastic MapReduce, Cloudera, Hortonworks, and IBM InfoSphere BigInsights.

Frequently Asked Questions

What is Big Data?

Big Data refers to the vast volumes of data that traditional data processing software cannot manage effectively. These large datasets come from a variety of sources like social media, sensors, and transactions.

What is Apache Hadoop?

Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines.

What are the main components of the Hadoop Ecosystem?

The Hadoop Ecosystem consists of several key components including Hadoop Distributed File System (HDFS) for storage, Yet Another Resource Negotiator (YARN) for resource management, and MapReduce for data processing. Other important components are HBase, Hive, Pig, and Sqoop.

What is YARN and how does it function?

YARN stands for Yet Another Resource Negotiator. It is a core component of Hadoop responsible for resource management. YARN allocates resources to various applications running in a Hadoop cluster and schedules tasks.
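YARN's role can be pictured with a toy scheduler that hands out fixed-size memory "containers" to applications until the cluster is exhausted. This is a deliberately simplified sketch; the application names and sizes are invented, and real YARN scheduling (queues, fairness, locality) is far richer:

```python
# Toy illustration of resource allocation, not YARN's actual API.
cluster_memory_mb = 8192   # total memory the "cluster" offers
container_mb = 1024        # each container is a fixed-size grant

def allocate(requests):
    granted = {}
    available = cluster_memory_mb // container_mb  # 8 containers
    for app, wanted in requests:
        # Grant as many containers as possible, never more than remain
        n = min(wanted, available)
        granted[app] = n
        available -= n
    return granted

grants = allocate([("app-1", 5), ("app-2", 6)])
```

Here the second application asks for six containers but receives only the three still available, mirroring how YARN arbitrates a finite pool of cluster resources among competing applications.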

Why is Hadoop suitable for Big Data processing?

Hadoop is suitable for Big Data processing because it can handle and process large data sets efficiently across clusters of computers. The distributed nature of Hadoop allows it to manage the storage and processing of vast amounts of data.

What is the role of Hive in the Hadoop ecosystem?

Hive is a data warehousing solution built on top of Hadoop. It allows users to query and manage large datasets stored in Hadoop using a SQL-like language known as HiveQL. It simplifies data analysis in the Hadoop ecosystem.

What are the prerequisites to learn Big Data Hadoop?

Prerequisites to learn Big Data Hadoop include a basic understanding of Java and Linux commands. Familiarity with database concepts and SQL is also beneficial. Some knowledge of distributed computing and data warehousing concepts can be helpful too.

How can machine learning be integrated into Hadoop?

Machine learning can be integrated into Hadoop using libraries and frameworks such as Apache Mahout and Apache Spark. These tools allow for the development of scalable machine learning algorithms that can process large data sets within the Hadoop ecosystem.

This is my weekly newsletter that I call The Deep End because I want to go deeper than results you’ll see from searches or AI, incorporating insights from the history of data and data science. Each week I’ll go deep to explain a topic that’s relevant to people who work with technology. I’ll be posting about artificial intelligence, data science, and data ethics.

This newsletter is 100% human written 💪 (* aside from a quick run through grammar and spell check).
