Utilizing big data has many benefits, including reducing costs, improving productivity, and informing key decisions. Studies show that Fortune 1000 companies can gain more than $65 million in additional net income by increasing their data accessibility by just 10%.

As the world becomes increasingly digitized, large companies see their data pools grow and face mounting problems in using big data effectively. At that point, what needs to be done is to extract the key information hidden in the data through accurate analysis, which is something like looking for a needle in a haystack. Intimidated? Well, you don't have to be. We will guide you through how we used the Hadoop ecosystem to address a specific big data problem.

The problem: Our customers who use the portfolio builder create their own financial portfolios using stock data. This stock data is updated daily via a separate API. At the beginning of the project there was no problem, as our data size was relatively manageable. However, once we added mutual funds and ETFs, the volume of data increased and performance in the PostgreSQL database noticeably decreased. So we decided to try big data tools to remedy this problem.

For us, using big data as a solution broke down into three parts. First, we chose the Hadoop Distributed File System (HDFS) as data storage. Second, we used Sqoop to transfer the data from PostgreSQL to HDFS. Once all the data was in place, we experimented with queries using Hive and HBase.

First step: Solving the Storage Problem

We needed a storage infrastructure designed specifically to store, manage and retrieve massive amounts of data. Big data storage infrastructures enable data to be stored and organized so that it is easily accessed, used, and processed by applications and services.

HDFS: In Hadoop applications, HDFS is the main data storage system: a distributed file system that provides high-throughput access to application data. It is a core part of the big data environment and offers a way to handle vast quantities of structured and unstructured data by spreading massive data sets across clusters of low-cost machines. One thing to bear in mind is that HDFS is not suitable for real-time processing; if you have such a need, the final topic of this article, the HBase database, will be of help to you.

We have two tips for using HDFS. First, spend time understanding the system and becoming familiar with your data. Second, be clear about what your company needs and expects from the operation. Once these two boxes have been ticked, all that is left is to prepare the necessary environments and move the data to HDFS. Companies usually make this shift when they are running batch processing.
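
As a minimal sketch of that last step (the directory and file names below are placeholders, not our actual layout), copying a local export into HDFS looks like this:

# create a target directory in HDFS and copy a local CSV export into it
hdfs dfs -mkdir -p /data/stocks
hdfs dfs -put /tmp/stock_prices.csv /data/stocks/
# confirm the file landed where we expect
hdfs dfs -ls /data/stocks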

The screen below illustrates a single-node cluster configuration for saving DataNode and NameNode data. YARN is a major component of Hadoop and allows the data stored in HDFS to be processed by the various processing engines. As every process should be tested to make sure it works, we ran the YARN and HDFS services separately on the platform. Below is an illustration of the process.
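
For reference, on a single-node (pseudo-distributed) setup the two services can be started and checked separately with the stock Hadoop scripts; this is a generic sketch rather than our exact configuration, and it assumes $HADOOP_HOME/sbin is on your PATH:

# start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh
# start the YARN daemons (ResourceManager, NodeManager)
start-yarn.sh
# list the running Java daemons to confirm everything came up
jps
# check HDFS capacity and the number of live DataNodes
hdfs dfsadmin -report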

Next step: Data Ingestion into New Environment

The next step is to transfer the data to the Hadoop data lake. These transfers can be made in real time or in batches.

Sqoop: When you are ready to conduct data analysis, Sqoop helps you transfer the data into the Hadoop environment. Sqoop is an open-source tool that lets you ingest data from many different databases into HDFS. It can also export data from HDFS back into an external database such as Oracle or MSSQL.
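
For completeness, the export direction is a single command as well; everything below (connection string, credentials, table and directory) is a placeholder sketch rather than a command we ran:

# push processed results from HDFS back into an external relational database
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username analytics \
  --password-file /user/hadoop/.db_password \
  --table PORTFOLIO_RESULTS \
  --export-dir /data/results \
  --input-fields-terminated-by ','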

Many companies use a Relational Database Management System (RDBMS) for daily transactions such as customer movements. Below is a sample Sqoop script of the kind we used to transfer over 75 million records from PostgreSQL to HDFS. The script can be tailored to your company's needs and reused for different analyses by transferring newly arriving stock data from any RDBMS into the Hadoop environment.

You can use the code block below to transfer your local system data to the Hadoop environment.
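
Our production script is not reproduced here, but a minimal sketch looks like the following; the connection string, credentials, table name and split column are placeholders you would replace with your own:

# import a large PostgreSQL table into HDFS, splitting the work across 4 mappers
sqoop import \
  --connect jdbc:postgresql://dbhost:5432/portfolio \
  --username etl_user \
  --password-file /user/hadoop/.pg_password \
  --table stock_prices \
  --target-dir /data/stocks/stock_prices \
  --split-by id \
  --num-mappers 4 \
  --fields-terminated-by ','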

Final Step: Performance Comparison 

Our existing setup was built around a PostgreSQL database, and we detail here our experiences from a few trials. Although we did not run complex queries, there were still delays ranging from 2-3 seconds up to 7-10 seconds.
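
For context, this is roughly how such a check can be timed from the command line; the database, table and column names below are placeholders, not our schema:

# time a representative filter query against the PostgreSQL source
psql -d portfolio <<'SQL'
\timing on
SELECT symbol, close_price
FROM stock_prices
WHERE symbol = 'AAPL' AND price_date >= '2020-01-01';
SQL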

Hive: Hive provides easy, familiar batch processing for Apache Hadoop and lets you use existing Structured Query Language (SQL) skills to run batch queries on data stored in Hadoop. Queries are written in HiveQL, a SQL-like language, and executed via MapReduce or Apache Spark. This makes it easy for more users to process and analyze huge quantities of data, and makes Hive most useful for data preparation, ETL, and data mining.

Hive lets companies whose data files already live in HDFS turn them into a source for SQL queries. We can leverage Hive to open up Hadoop data lakes and connect them to BI tools (like OracleBI or Tableau) for visibility.
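
BI tools typically reach Hive through the HiveServer2 JDBC endpoint, which you can test from the command line with beeline; the host, port and user below are common defaults, not necessarily yours:

# connect to HiveServer2 over JDBC (the same endpoint BI tools use)
beeline -u "jdbc:hive2://hiveserver-host:10000/default" -n hive_user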

Here are the steps you need to take to use Hive after uploading the files to HDFS. First, you create a table. Then you point the table at the file location on HDFS. The images below illustrate these two steps.
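
A rough sketch of those two steps, assuming the Sqoop output sits under /data/stocks/stock_prices and using placeholder column names:

# create an external Hive table whose location points at the files already in HDFS
hive -e "
CREATE EXTERNAL TABLE stock_prices (
  id BIGINT,
  symbol STRING,
  price_date DATE,
  close_price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/stocks/stock_prices';
"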

After the table has been connected, we can easily filter and pre-process our file on HDFS by accessing it via Hive.
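
For example, a simple aggregation over the same placeholder table might look like this:

# filter and pre-process the HDFS data through a HiveQL query
hive -e "
SELECT symbol, AVG(close_price) AS avg_close
FROM stock_prices
WHERE price_date >= '2020-01-01'
GROUP BY symbol;
"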

HBase: Apache HBase is a non-relational, column-oriented database management system that runs on top of HDFS and supports jobs via MapReduce. Being column-oriented means that the values of a column are stored contiguously on disk rather than row by row. An HBase column represents an attribute of an object; if a table stores diagnostic logs from the servers in your environment, each row might be a log record, and a typical column could be the timestamp of when the log record was written, or the name of the server the record originated from. HBase can also be accessed from higher-level tools for data processing. HBase suits your current process if you do not need a relational database but do require quick access to data.

As we mentioned before, HBase runs on top of HDFS rather than storing files on its own, so we need to connect directly to HDFS and transfer the stored files into an HBase table. You can refer to the sample code block below to initiate the transfer. Don't forget to create a table in HBase before doing so.
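
Our exact commands are not shown here, but a minimal sketch using the stock ImportTsv job looks like this; the table name, column family and column names are placeholders:

# create the target table with a single column family before loading
hbase shell <<'EOF'
create 'stock_prices', 'd'
EOF

# bulk-load the CSV from HDFS; the first CSV column becomes the row key
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,d:symbol,d:price_date,d:close_price \
  stock_prices /data/stocks/stock_prices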

In HBase, there are no data types; data is stored as byte arrays in the HBase table cells. The content of a cell is distinguished by its timestamp, which means every cell in an HBase table can contain multiple versions of the data. In the picture below, you can see how HBase has stored our data: each value is addressed by a row key, column, and timestamp, and the rows are sorted by row key.
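
To see that behaviour for yourself, you can write the same cell twice and read the versions back; the row key and column below are illustrative, and the alter assumes the column family was created with the default of a single version:

hbase shell <<'EOF'
# keep up to three versions per cell in the 'd' column family
alter 'stock_prices', {NAME => 'd', VERSIONS => 3}
# write two values into the same cell; each write gets its own timestamp
put 'stock_prices', 'AAPL#2020-01-02', 'd:close_price', '75.1'
put 'stock_prices', 'AAPL#2020-01-02', 'd:close_price', '75.4'
# read back up to three versions of that cell, newest first
get 'stock_prices', 'AAPL#2020-01-02', {COLUMN => 'd:close_price', VERSIONS => 3}
EOF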

Results: When we analyzed historical data, Hive gave us faster performance. However, when users wanted to filter stock data and see the results instantly, PostgreSQL was faster. Hive loses a lot of time preparing to run MapReduce jobs, so we use it only for historical batch analysis; it is not suitable for Online Transaction Processing (OLTP).

When we tested HBase against PostgreSQL, we saw some performance improvement, but not enough to satisfy us. When processing a small amount of data, only a single node is utilized and all the other nodes sit idle; petabytes of data must be stored in the distributed environment to use HBase effectively. Since we do not have such a large data pool and prefer a standard SQL structure, we chose not to proceed with HBase.

Summary

In this walkthrough, we have illustrated how Bambu utilized big data tools to solve a problem we were facing. We hope this demystifies big data tools and gives you insight into deploying them effectively.

We have also shown that there is more than one data processing tool in the Hadoop environment. To determine which tool to use, first look at your data and focus on your problem. When the appropriate big data tool is chosen, data processing becomes much more approachable.