High performance analytics with Celebrus

Published: Tuesday, 29 December 2015 by Ant Phillips, Senior Developer
Big Data Analytics

It is just about a year since we first added Hadoop support to Celebrus. That release focused primarily on feeding data in the Apache Avro file format into Apache Hive. With our latest release of the Celebrus Big Data Engine, Version 8 Update 14, we’ve expanded this to include the Apache Parquet file format and the Apache Impala analytics database.

From an analytics processing perspective, Parquet is a hugely efficient data storage format. In traditional databases and data storage solutions, the data in a table is laid out in rows. Row-based data storage isn’t a particularly good fit for the very wide tables which data warehouses often demand. For example, if a query has a WHERE clause then ideally the only data accessed is the columns needed to check whether a row is accepted or not. Once a row has been accepted, further data might be read to obtain the columns required by the SELECT or JOIN clauses. In row-based layouts the database engine has to read the entire row, much of which won’t be needed if the row is not accepted. Even if the row is accepted, much of the data in those wide rows is still likely to be discarded. At internet scale, these design choices have huge implications.
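The cost of reading whole rows can be made concrete with some back-of-the-envelope arithmetic. The following is a minimal sketch, not a model of any real database engine: the column names and byte sizes are made up for illustration, and it simply compares the bytes touched when every row is read in full against the ideal case described above.

```python
# Illustrative model only: hypothetical columns with assumed fixed sizes in bytes.
COLUMN_SIZES = {
    "visitor_id": 8,
    "country": 2,
    "page_url": 200,
    "page_title": 80,
    "referrer_url": 300,
}


def row_layout_bytes_read(num_rows):
    """Row layout: every row is read in full, even if only one column is tested."""
    return num_rows * sum(COLUMN_SIZES.values())


def ideal_bytes_read(num_rows, accepted_rows, where_cols, select_cols):
    """Ideal case: WHERE columns for every row, remaining SELECT columns
    only for the rows that were accepted."""
    where_bytes = num_rows * sum(COLUMN_SIZES[c] for c in where_cols)
    extra_cols = [c for c in select_cols if c not in where_cols]
    select_bytes = accepted_rows * sum(COLUMN_SIZES[c] for c in extra_cols)
    return where_bytes + select_bytes


rows = 1_000_000
accepted = 10_000  # assumed: 1% of rows pass the WHERE clause
row_cost = row_layout_bytes_read(rows)
ideal_cost = ideal_bytes_read(rows, accepted, ["country"], ["visitor_id", "country"])
print(f"row layout: {row_cost:,} bytes")
print(f"ideal:      {ideal_cost:,} bytes")
```

With these assumed sizes the row layout touches 590,000,000 bytes against 2,080,000 for the ideal case, and the gap widens as the table gets wider.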

Parquet takes a different approach. Parquet stores the data one column at a time. A group of rows is taken (a so-called row group), and each column in that row group is written out as a chunk. This process is repeated, row group by row group, until all the rows have been written. This approach has some dramatic effects on the amount of data which has to be read off the disk. For example, if a query has two predicates in the WHERE clause then only the chunks corresponding to those two columns are read from disk. All the other chunks are left alone until the database engine knows it needs them. Furthermore, the chunks corresponding to columns which are not required for further processing are simply never read.
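The row group and chunk idea can be sketched in a few lines. This is a toy in-memory model rather than the real Parquet file format or Impala’s scanner: the names `write_row_groups`, `ChunkReader` and `query` are hypothetical, chunks are plain Python lists, and a counter stands in for disk reads.

```python
def write_row_groups(rows, columns, rows_per_group):
    """Split rows into row groups; within each group keep one chunk (a list) per column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        groups.append({col: [row[col] for row in batch] for col in columns})
    return groups


class ChunkReader:
    """Counts how many column chunks are fetched, standing in for disk reads."""

    def __init__(self, groups):
        self.groups = groups
        self.chunks_read = 0

    def read_chunk(self, group_index, column):
        self.chunks_read += 1
        return self.groups[group_index][column]


def query(reader, predicates, select_cols):
    """Read the WHERE chunks first; fetch SELECT chunks only for row groups with matches.
    Assumes at least one predicate is supplied."""
    results = []
    for g in range(len(reader.groups)):
        pred_chunks = {col: reader.read_chunk(g, col) for col in predicates}
        num_rows = len(next(iter(pred_chunks.values())))
        matches = [
            i for i in range(num_rows)
            if all(test(pred_chunks[col][i]) for col, test in predicates.items())
        ]
        if not matches:
            continue  # no rows accepted: the other chunks in this group stay unread
        out_chunks = {
            col: pred_chunks[col] if col in pred_chunks else reader.read_chunk(g, col)
            for col in select_cols
        }
        results.extend({col: out_chunks[col][i] for col in select_cols} for i in matches)
    return results


columns = ["visitor_id", "country", "device", "page_url", "page_title"]
rows = [
    {"visitor_id": 1, "country": "GB", "device": "mobile", "page_url": "/home", "page_title": "Home"},
    {"visitor_id": 2, "country": "GB", "device": "desktop", "page_url": "/shop", "page_title": "Shop"},
    {"visitor_id": 3, "country": "US", "device": "mobile", "page_url": "/home", "page_title": "Home"},
    {"visitor_id": 4, "country": "FR", "device": "desktop", "page_url": "/help", "page_title": "Help"},
]
reader = ChunkReader(write_row_groups(rows, columns, rows_per_group=2))
predicates = {"country": lambda v: v == "GB", "device": lambda v: v == "mobile"}
results = query(reader, predicates, ["visitor_id"])
total_chunks = len(reader.groups) * len(columns)
print(results, f"{reader.chunks_read} of {total_chunks} chunks read")
```

In this toy run only 5 of the 10 chunks are fetched: the two predicate chunks in each row group, plus the `visitor_id` chunk in the one row group that contains a match. The `page_url` and `page_title` chunks are never touched.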

The result is massively improved query performance. This is especially true for Celebrus data, which often contains long strings relevant to the website. For example, the referring URL describes how a visitor arrived at the site. For search engine referrals this often contains the visitor’s search terms as a sequence of query string parameters. Likewise, each page that a visitor lands on is stored along with its URL and title. These strings are very useful in some queries, and the appeal of Parquet is that they have no performance impact on queries that do not need them.
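To show why these strings are worth keeping, here is a minimal sketch of pulling search terms out of a referring URL with Python’s standard library. The example URL and the `q` parameter name are assumptions made for illustration; real search engines vary in the parameters they use, and many no longer pass search terms in the referrer at all.

```python
from urllib.parse import urlparse, parse_qs


def search_terms(referrer_url, param="q"):
    """Return the search terms carried in a referrer URL's query string, or [] if absent.
    The default parameter name "q" is an assumption, not a universal convention."""
    query = urlparse(referrer_url).query
    values = parse_qs(query).get(param, [])
    return values[0].split() if values else []


# Hypothetical referrer, for illustration only.
referrer = "https://search.example.com/search?q=columnar+storage+format&hl=en"
terms = search_terms(referrer)
print(terms)
```

A query that never looks at search terms does not need the referring URL column at all, which is exactly the case where a columnar layout leaves those long strings on disk.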

To prove this out we’ve been writing Apache Parquet files into HDFS and querying them with Apache Impala. Impala makes excellent use of Parquet and its columnar storage format. The result has been amazing; we’ve seen queries running interactively across huge data sets. Ultimately, the end user experience is what this feature is all about. With Celebrus you can get accurate and timely information to help you understand what your customers are doing on your website, and with these high performance tools you can now analyse the data in real time and act at the speed that today’s consumers demand.