As the world becomes more digitized, data processing and management has become a crucial step for sizable companies tackling large data pools to maximize their business decisions and strategies. In recent years, studies have shown that Fortune 1000 companies can gain more than $65 million in additional net income when increasing their data accessibility by just 10%. Recognizing the many benefits analyzing big data has to offer, from cost savings to productivity improvements, large companies are progressively jumping on the bandwagon.
While managing and reaping the benefits of big data is ideal, it’s not without its challenges. Companies that wish to dabble in it for the first time need to be well prepared to tackle any roadblocks that lie ahead. While there could be multiple obstacles to overcome including analysis and data security, along with finding the right tools and talent to manage it all, there are a myriad of solutions available to help overcome them. Let’s take a look at how Bambu uses the Hadoop ecosystem to address a specific big data problem.
The Data Volume Problem
At Bambu, we offer Portfolio Builders that allow clients to create their own financial portfolios with the stock data, which is updated daily by an independent API. At the beginning of the project, our data size was relatively manageable and we didn’t encounter any problems with our performance output. However, once we added Mutual Funds with ETFs, the data size and volume increased considerably. As a result, performance declined noticeably in the PostgreSQL database.
Our solution to analyze and manage big data systems comprises 3 components. First of all, we opted for the Hadoop Distributed File System (HDFS) as data storage. Secondly, we used Sqoop to transfer the data from PostgreSQL to HDFS. After all the data was ready, we experimented using Hive and HBase with queries.
Solving the Storage Problem
We needed a storage infrastructure that’s specifically designed to store, manage and retrieve massive amounts of data assets. These big data storage infrastructures enable the storing and sorting of data so that it’s easily accessed, used, and processed by applications and services.
In Hadoop applications, HDFS is the main data storage system and represents a distributed file system that offers access to application data for high-throughput. It’s a part of the big data environment and offers a way for vast quantities of structured and unstructured data to be handled. To handle the computational load, HDFS distributes the processing of massive data sets over low-cost computer clusters. One thing to bear in mind is that HDFS is not suitable for real-time processing. If you have such a need, the final topic of this article on the HBase database will provide a more suitable solution.
If you’ve never used the HDFS system before, we’ve two tips for you. Firstly, spend time understanding the system and become familiar with the data. Following this, it’s essential to recognize what your company needs and expects from the operation. Once these two checkboxes have been ticked, the only thing left is to prepare the necessary environments and move the data to HDFS. Companies usually undergo this shift when they are running batch processing.
YARN is a major component of Hadoop, and allows data to be processed through the various procedures stored in HDFS. As all processes should be tested to make sure they work, we ran the YARN and HDFS systems separately on the platform.
Data Ingestion into New Environment
The next step is to transfer the data to the Hadoop data lake. These transfers can be made in real-time or in batches.
When you’re ready to conduct data analysis, Sqoop helps you transfer the data to the Hadoop environment. Sqoop is an open-source tool that ingests data from many different databases into HDFS. It can export data from HDFS back into an external database like Oracle or MSSQL as well.
Many companies use a Relatable Database Management System (RDBMS) for daily transactions such as customer movements. We’ve used Sqoop to transfer over 75 million records from PostgreSQL to HDFS. This script can be tailored to your company’s needs and can be used for different analyses by transferring new incoming stock data from any RDBMS database to the Hadoop environment.
The PostgreSQL database is our choice of structure, and while we didn’t utilize complex queries, there were slight delays spanning 2-3 to 7-10 seconds.
Hive provides easy, familiar batch processing for Apache Hadoop and uses current SQL competencies to conduct batch queries on data stored in Hadoop. Queries are written using HiveQ, an SQL-like language, and executed via MapReduce or Apache Spark. This makes it easy for more users to process and analyze infinite quantities of data, making Hive the most useful for data preparation, ETL, and data mining.
Hive enables companies that have their data files in HDFS to be a significant source of SQL queries. We can leverage Hive to tackle Hadoop data lakes and connect them to BI tools (like OracleBI or Tableau) for visibility.
To use Hive after uploading the files to HDFS, we need to first create a table. Next, connect the table with the file extension on HDFS.
After the table has been connected, we can easily filter and pre-process our file on HDFS by accessing it via Hive.
Apache HBase is a non-relational, column-oriented database management system operating on HDFS and supports jobs via MapReduce. Being column-oriented means that each column in the system is a contiguous unit of page. An HBase column represents an object attribute; if the table stores diagnostic logs from servers in your setting, each row may be a log record, and a regular column may be the timestamp of when the log record was written. The column could also represent the name of the server from which the record originated. HBase also supports other high-level languages for data processing. HBase is suitable for your current process if you don’t need a relational database and require quick access to data.
As HBase doesn’t store files internally, we need to connect directly to HDFS and transfer the stored files into HBase. You can refer to the sample code blog we have used below to initiate the transfer. Don’t forget to create a table in HBase before doing so.
In HBase, there are no data types; data is stored as byte arrays in the HBase table cells. When the value is stored in the cell, the content or value is distinguished by the timestamp. This means that every cell in the HBase table can contain multiple data versions. In Figure 9, you can see how HBase has stored our data. A key assigns values for each column when given a date, and the rows are sorted according to row keys.
When we analyzed the historical data, Hive gave us faster performances. However, when users wanted to instantly see the stock data they were filtering, PostgreSQL performed better. Hive loses a lot of time preparing to run map-reduce, so it’s only used in the historical batch analysis. Ultimately, it’s not suitable for Online Transaction Procession (OLTP).
Once we tested the HBase performance over PostgreSQL, we saw some performance improvement, but it failed to satisfy. When processing a small amount of data, only a single node is utilized while all other nodes are left idle. Petabytes of data must be stored in this distributed environment to use HBase effectively. Since we don’t have such a large data pool and prefer an official SQL structure, HBase didn’t make the final cut.
As seen in this walkthrough, big data tools can provide productive solutions when deployed effectively. There’s more than one type of data processing tool that’s available in the Hadoop environment. In order to determine which tool to implement, begin by looking at your data and focus on a problem to solve.