
© 2022 Mangosteen BCC Pte Ltd. All Rights Reserved.

Hadoop Ecosystem for a Specific Big Data Problem

As the world becomes more digitized, data processing and management have become crucial steps for sizable companies looking to turn large data pools into better business decisions and strategies. Studies in recent years have shown that Fortune 1000 companies can gain more than $65 million in additional net income by increasing their data accessibility by just 10%. Recognizing the many benefits big data analysis has to offer, from cost savings to productivity improvements, large companies are progressively jumping on the bandwagon.

While managing and reaping the benefits of big data is ideal, it's not without its challenges. Companies dabbling in it for the first time need to be well prepared for the roadblocks ahead. The obstacles are many, from analysis and data security to finding the right tools and talent to manage it all, but so are the available solutions. Let's take a look at how Bambu uses the Hadoop ecosystem to address a specific big data problem.

The Data Volume Problem

At Bambu, we offer Portfolio Builders that allow clients to create their own financial portfolios from stock data, which is updated daily by an independent API. At the beginning of the project, our data size was relatively manageable and we didn't encounter any performance problems. However, once we added Mutual Funds and ETFs, the data size and volume increased considerably, and performance in the PostgreSQL database declined noticeably.

Our solution for analyzing and managing this data comprises three components. First, we opted for the Hadoop Distributed File System (HDFS) as data storage. Second, we used Sqoop to transfer the data from PostgreSQL to HDFS. Once all the data was ready, we experimented with queries using Hive and HBase.

Solving the Storage Problem

We needed a storage infrastructure specifically designed to store, manage, and retrieve massive amounts of data. Such big data storage infrastructures enable the storing and sorting of data so that it's easily accessed, used, and processed by applications and services.

In Hadoop applications, HDFS is the main data storage system: a distributed file system that offers high-throughput access to application data. It's a core part of the big data environment and offers a way to handle vast quantities of structured and unstructured data. To handle the computational load, HDFS spreads massive data sets over low-cost computer clusters. One thing to bear in mind is that HDFS is not suitable for real-time processing; if you have such a need, the final section of this article on the HBase database will offer a more suitable option.

If you've never used HDFS before, we have two tips for you. First, spend time understanding the system and becoming familiar with your data. Second, it's essential to recognize what your company needs and expects from the operation. Once these two boxes have been ticked, all that's left is to prepare the necessary environments and move the data to HDFS. Companies usually undergo this shift when they are running batch processing.
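As a minimal sketch of that final step, moving local files into HDFS only takes a few commands. A configured Hadoop installation is assumed, and the paths and file names below are illustrative, not the ones we used:

```shell
# Create a target directory in HDFS and copy local CSV exports into it.
hdfs dfs -mkdir -p /data/stock_prices
hdfs dfs -put /tmp/exports/stock_prices_*.csv /data/stock_prices/

# Verify the files landed, then check how HDFS split them into blocks.
hdfs dfs -ls /data/stock_prices
hdfs fsck /data/stock_prices -files -blocks
```
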

Figure 1: Single node cluster configuration for DataNode and NameNode data storage

YARN is a major component of Hadoop that allocates cluster resources to the jobs processing the data stored in HDFS. As every component should be tested to make sure it works, we brought up the YARN and HDFS services separately on the platform and verified each one.
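On a single-node setup like the one in Figure 1, a few standard commands are enough to check that both services came up. This assumes Hadoop's bin directory is on the PATH; the exact daemon list will vary with your configuration:

```shell
# List the running Java daemons; a healthy single-node setup typically
# shows NameNode, DataNode, ResourceManager, and NodeManager.
jps

# Check HDFS capacity and the number of live DataNodes.
hdfs dfsadmin -report

# Confirm YARN can see its NodeManagers.
yarn node -list
```
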

Figure 2: Testing YARN and HDFS to ensure platform is operational

Data Ingestion into New Environment

The next step is to transfer the data to the Hadoop data lake. These transfers can be made in real-time or in batches. 

When you're ready to conduct data analysis, Sqoop helps you transfer the data into the Hadoop environment. Sqoop is an open-source tool that ingests data from many different databases into HDFS, and it can also export data from HDFS back into an external database such as Oracle or MSSQL.

Many companies use a Relational Database Management System (RDBMS) for daily transactions such as customer movements. We used Sqoop to transfer over 75 million records from PostgreSQL to HDFS. The script can be tailored to your company's needs and reused for different analyses by transferring new incoming stock data from any RDBMS to the Hadoop environment.

Figure 3: Sample Sqoop script transferring records into HDFS
Figure 4: Sample code block for transferring local system data to the Hadoop environment
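Since the script itself appears only as a screenshot, here is a hedged sketch of what such a Sqoop import can look like. The connection string, credentials, table, and column names are all illustrative:

```shell
# Import a stock price table from PostgreSQL into HDFS.
# Host, database, user, table, and column names are placeholders.
sqoop import \
  --connect jdbc:postgresql://db-host:5432/portfolio \
  --username etl_user -P \
  --table stock_prices \
  --split-by price_date \
  --target-dir /data/stock_prices \
  --num-mappers 4
```

The `--split-by` column lets Sqoop partition the table across the parallel mappers, which matters once you're moving tens of millions of rows.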

Performance Comparison 

The PostgreSQL database is our structure of choice, and while we didn't run complex queries, we saw delays ranging from 2-3 seconds up to 7-10 seconds.

Hive provides easy, familiar batch processing for Apache Hadoop, letting you apply existing SQL skills to batch queries on data stored in Hadoop. Queries are written in HiveQL, an SQL-like language, and executed via MapReduce or Apache Spark. This makes it easy for more users to process and analyze vast quantities of data, and makes Hive most useful for data preparation, ETL, and data mining.

Hive lets companies treat the data files they already keep in HDFS as a full-fledged source for SQL queries. We can use it to tap Hadoop data lakes and connect them to BI tools (such as Oracle BI or Tableau) for visibility.

To use Hive after uploading the files to HDFS, we first need to create a table and then point it at the file location on HDFS.

Figure 5: Creating a table to house the stock price data files
Figure 6: Connect and load data into the table
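The two steps in Figures 5 and 6 can be sketched in HiveQL as follows. The table name, columns, and HDFS path are illustrative, not our production schema:

```sql
-- Create an external table over the files already sitting in HDFS.
CREATE EXTERNAL TABLE stock_prices (
  ticker      STRING,
  price_date  DATE,
  close_price DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/stock_prices';

-- The data is now queryable in place, without copying it anywhere:
SELECT ticker, MAX(close_price)
FROM stock_prices
GROUP BY ticker;
```

Declaring the table `EXTERNAL` with a `LOCATION` means Hive only records metadata; dropping the table later leaves the underlying HDFS files untouched.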

After the table has been connected, we can easily filter and pre-process our files on HDFS by accessing them via Hive.

Apache HBase is a non-relational, column-oriented database management system that runs on top of HDFS and supports jobs via MapReduce. Being column-oriented means the values of each column are stored contiguously on disk. An HBase column represents an object attribute: if the table stores diagnostic logs from the servers in your environment, each row might be a log record, one column the timestamp of when the record was written, and another the name of the server it originated from. HBase also supports other high-level languages for data processing. It's a good fit for your process if you don't need a relational database and require quick access to data.

Figure 7: Features of HDFS and HBase

As HBase doesn't store files internally, we need to connect directly to HDFS and transfer the stored files into HBase. You can refer to the sample code block we have used below to initiate the transfer. Don't forget to create a table in HBase before doing so.

Figure 8: Sample code block for transferring stored files into HBase
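A hedged sketch of both steps, using the ImportTsv tool that ships with HBase. The table name, column family, and HDFS path are illustrative:

```shell
# Create the destination table with a single column family first.
echo "create 'stock_prices', 'prices'" | hbase shell -n

# Load the delimited files from HDFS into the new table.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,prices:close_price \
  stock_prices /data/stock_prices
```
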

In HBase, there are no data types; data is stored as byte arrays in the HBase table cells. Each value stored in a cell is distinguished by its timestamp, which means every cell in an HBase table can hold multiple versions of its data. In Figure 9, you can see how HBase has stored our data: the row key maps each date to its column values, and the rows are sorted by row key.

Figure 9: Data stored in HBase
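The cell-versioning behaviour described above can be sketched in a few lines of Python. This is a toy model to illustrate the storage scheme, not HBase client code:

```python
# Toy model of HBase cell versioning: each (row key, column) pair maps to
# a list of (timestamp, raw bytes) versions, kept newest first.
from collections import defaultdict

class ToyHBaseTable:
    def __init__(self):
        self._cells = defaultdict(list)  # (row, column) -> [(ts, bytes)]

    def put(self, row, column, value, timestamp):
        # Values are kept as raw byte arrays, as HBase stores them.
        versions = self._cells[(row, column)]
        versions.append((timestamp, value.encode()))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row, column, timestamp=None):
        # With no timestamp, return the newest version; otherwise the
        # newest version at or before the requested timestamp.
        for ts, value in self._cells[(row, column)]:
            if timestamp is None or ts <= timestamp:
                return value.decode()
        return None

table = ToyHBaseTable()
table.put("AAPL", "prices:close", "150.10", timestamp=1)
table.put("AAPL", "prices:close", "151.75", timestamp=2)
print(table.get("AAPL", "prices:close"))               # -> 151.75
print(table.get("AAPL", "prices:close", timestamp=1))  # -> 150.10
```

The same cell holds both prices; which one a read returns depends only on the timestamp you ask for, which is exactly how HBase keeps multiple data versions per cell.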

When we analyzed the historical data, Hive gave us faster performance. However, when users wanted to instantly see the stock data they were filtering, PostgreSQL performed better. Hive loses a lot of time preparing and launching MapReduce jobs, so we only used it for historical batch analysis; it's not suitable for Online Transaction Processing (OLTP).

Once we tested HBase's performance against PostgreSQL, we saw some improvement, but it still fell short. When processing a small amount of data, only a single node is utilized while all the other nodes sit idle; the distributed environment needs to hold petabytes of data for HBase to be used effectively. Since we don't have such a large data pool and prefer a standard SQL structure, HBase didn't make the final cut.

Final Thoughts

As seen in this walkthrough, big data tools can provide productive solutions when deployed effectively. There’s more than one type of data processing tool that’s available in the Hadoop environment. In order to determine which tool to implement, begin by looking at your data and focus on a problem to solve.
