Hadoop vs. Redshift Comparison
The Big Data world has its own share of epic battles. In November 2012 Amazon announced Redshift, their cutting-edge data warehouse-as-a-service, with prices starting as low as $1,000 per terabyte per year. Apache Hadoop, created in 2005, is not the only big data superhero on the block anymore. Now that we have our own Superman vs. Batman, we gotta ask: how does Hadoop compare with Amazon Redshift? Let’s get them in the ring and find out.
In the left corner wearing a black cape we have Apache Hadoop. Hadoop is an open source framework for distributed storage and processing of Big Data on commodity machines. It uses HDFS, a dedicated file system that cuts files into blocks and spreads them, with replicas, across the cluster. The data is processed in parallel on the machines via MapReduce (Hadoop 2.0 introduced YARN, which allows other kinds of applications to run as well).
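To make the split-then-process idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It is not Hadoop code: the function names (`split_into_blocks`, `map_phase`, `shuffle`, `reduce_phase`) and the tiny block size are made up for illustration, and a real cluster would run the map step on many machines at once.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Cut the input into fixed-size blocks, loosely mimicking how a file is split."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be mapped on a different machine.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "big data big battles",
        "hadoop stores big data",
        "redshift queries data",
    ]
    print(word_count(sample))
```

The point of the pattern is that `map_phase` never needs to see more than one block, which is what lets the work spread across machines.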
In the right corner wearing a red cape we have Redshift. Redshift’s data warehouse-as-a-service is based on technology acquired from ParAccel. It is built on an old version of PostgreSQL with three major enhancements:
- Columnar database – this type of database stores data by column rather than by row, so a query reads only the columns it needs. That gives better performance when aggregating large sets of data, which is perfect for analytical querying.
- Sharding – Redshift supports data sharding, that is, partitioning the tables across different servers for better performance.
- Scalability – With everything running on the cloud, Redshift clusters can easily be scaled up or down as needed.
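The columnar point is easier to see with a toy example. The sketch below is plain Python, not Redshift internals; the `rows` data and the helper names are invented for illustration. It stores the same records row by row and column by column, then sums one field both ways.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "us-east", "amount": 120.0},
    {"order_id": 2, "region": "eu-west", "amount": 80.0},
    {"order_id": 3, "region": "us-east", "amount": 50.0},
]


def to_columns(records):
    """Pivot row records into one list per column (a toy columnar layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(records):
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_column_store(columns):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(columns["amount"])


columns = to_columns(rows)

if __name__ == "__main__":
    print(total_amount_row_store(rows))
    print(total_amount_column_store(columns))
```

Both functions return the same total. The difference is how much data has to be scanned to get there, which is what matters once a table has hundreds of columns and billions of rows.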
Traditional solutions by companies like Oracle and EMC have been around for a while, though only as $1,000,000 on-premises racks of dedicated machines. Amazon’s innovation, therefore, lies in pricing and capacity. Their pay-as-you-go promise, as low as $1,000/TB/year, makes a powerful data warehouse affordable for small to medium businesses that couldn’t previously manage it. Because Redshift is on the cloud, it shrinks and grows as needed, instead of leaving you with big dust-gathering machines in the office that need maintenance.
The largest Redshift node comes with 16TB of storage, and a cluster can have a maximum of 100 nodes. Therefore, if your Big Data goes beyond 1.6PB, Redshift will not do. Also, when a Redshift cluster is resized, the data needs to be reshuffled among the machines. That could take several days and plenty of CPU power, slowing your system for regular operations.
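The 1.6PB ceiling is just the two limits multiplied together. A quick sketch of that arithmetic, using only the figures quoted above (the constant and function names are made up for this example, and decimal units are assumed, 1PB = 1,000TB):

```python
MAX_NODE_STORAGE_TB = 16  # largest node size quoted in the text
MAX_NODES = 100           # node limit quoted in the text
TB_PER_PB = 1000          # assumption: decimal units, matching the 1.6PB figure


def max_cluster_capacity_tb(node_tb=MAX_NODE_STORAGE_TB, nodes=MAX_NODES):
    """Upper bound on raw cluster storage, in terabytes."""
    return node_tb * nodes


def fits_in_one_cluster(dataset_tb):
    """True if a dataset of this size stays within the quoted ceiling."""
    return dataset_tb <= max_cluster_capacity_tb()


if __name__ == "__main__":
    print(max_cluster_capacity_tb() / TB_PER_PB, "PB")
    print(fits_in_one_cluster(500))
    print(fits_in_one_cluster(2000))
```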
Hadoop scales to as many petabytes as you want, all the more so on the cloud. Scaling Hadoop doesn’t require reshuffling, since new data will simply be saved on the new machines. In case you do want to balance the existing data, HDFS ships with a balancer utility.
First round goes to Hadoop!
According to several performance tests run by the Airbnb nerds, a 16-node Redshift cluster performed a lot faster than a 44-node Hive/Elastic MapReduce cluster. Another Hadoop vs. Amazon Redshift benchmark, run by FlyData, a data synchronization solution for Redshift, confirms that Redshift performs faster for terabytes of data.
Nonetheless, there are some constraints to Redshift’s super speed. Certain Redshift maintenance tasks have limited resources, so procedures like deleting old data could take a while. Although Redshift shards data, it doesn’t do it optimally. You might end up joining data across different nodes and miss out on the improved performance.
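Why does the choice of shard key matter for joins? Here is a rough sketch in plain Python, not Redshift’s actual distribution logic. The four-node cluster, the `orders` data, and the helper names are all hypothetical. It counts how many orders end up on a different node than the customer row they join to, assuming customers are sharded by `customer_id`.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Pick a node by hashing the shard key with a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def cross_node_lookups(orders, order_shard_column):
    """Count orders stored on a different node than the customer they join to."""
    count = 0
    for order in orders:
        order_node = node_for(order[order_shard_column])
        # Assumption: the customers table is sharded by customer_id.
        customer_node = node_for(order["customer_id"])
        if order_node != customer_node:
            count += 1
    return count


# 1,000 made-up orders spread over 7 made-up customers.
orders = [{"order_id": i, "customer_id": i % 7} for i in range(1000)]

if __name__ == "__main__":
    print("sharded by customer_id:", cross_node_lookups(orders, "customer_id"))
    print("sharded by order_id:", cross_node_lookups(orders, "order_id"))
```

When both tables share the join column as their shard key, every order sits next to its customer and nothing crosses the network. Shard the orders by anything else and a good share of the join has to pull rows from other nodes.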
Hadoop still has some tricks up its utility belt. FlyData’s benchmark concludes that while Redshift performs faster for terabytes, Hadoop performs better for petabytes. Airbnb agrees, stating that Hadoop does a better job of running big joins over billions of rows. Unlike Redshift, Hadoop doesn’t have hard resource limitations for maintenance tasks. As for spreading data across nodes optimally, saving it in a hierarchical document format should do the trick. It may take extra work, but at least Hadoop has a solution.
We have a tie – Redshift wins for TBs, Hadoop for PBs
This is a tricky one. Redshift’s pricing depends on the choice of region, node size, storage type (newly introduced), and whether you work with on-demand or reserved resources. The $1,000/TB/year price only applies to a 3-year reservation of an XL node with 2TB of storage in US East (N. Virginia). Working with the same node in the same region on-demand costs $3,723/TB/year, more than triple the price. Choosing the Asia Pacific region costs even more.
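It helps to turn the per-terabyte figures back into per-node costs. The sketch below uses only the numbers quoted above; the hourly rate it prints is derived from them, not taken from a price list, and the names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760, ignoring leap years

RESERVED_PER_TB_YEAR = 1000   # 3-year reserved figure from the text (USD)
ON_DEMAND_PER_TB_YEAR = 3723  # on-demand figure from the text (USD)
NODE_STORAGE_TB = 2           # XL node storage from the text


def per_node_year(per_tb_year, storage_tb=NODE_STORAGE_TB):
    """Yearly cost of one node, given a cost per terabyte per year."""
    return per_tb_year * storage_tb


def implied_hourly_rate(per_tb_year, storage_tb=NODE_STORAGE_TB):
    """Hourly node cost implied by a per-terabyte yearly figure."""
    return per_node_year(per_tb_year, storage_tb) / HOURS_PER_YEAR


if __name__ == "__main__":
    print("on-demand node/year:", per_node_year(ON_DEMAND_PER_TB_YEAR))
    print("reserved node/year:", per_node_year(RESERVED_PER_TB_YEAR))
    print("implied on-demand $/hour:", round(implied_hourly_rate(ON_DEMAND_PER_TB_YEAR), 2))
    print("on-demand vs reserved:", round(ON_DEMAND_PER_TB_YEAR / RESERVED_PER_TB_YEAR, 3))
```

So the on-demand figure works out to $7,446 per node per year against $2,000 reserved, a ratio of about 3.7.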
On-premises Hadoop is definitely more expensive. According to Accenture’s “Hadoop Deployment Comparison Study”, the total cost of ownership of a bare-metal Hadoop cluster with 24 nodes and 50TB of HDFS is more than $21,000 per month. That’s about $5,040/TB/year, including maintenance and everything. However, it doesn’t make sense to compare pears with pineapples; let’s compare Redshift with Hadoop as a service.
Pricing for Hadoop as a service isn’t that clear, since it depends on how much juice you need. FlyData’s benchmark claims that running Hadoop via Amazon’s Elastic MapReduce is 10 times more expensive than Redshift. Using Hadoop on Amazon’s EC2 is a different story. Running a relatively low-cost m1.xlarge machine with 1.68TB of storage for 3 years (heavy-utilization reserved billing) in the US East region costs about $124 per month, so that’s about $886/TB/year. Working on-demand, using SSD-backed machines, or choosing a different region increases prices.
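All of these quotes become comparable once they are normalized to dollars per terabyte per year. A small sketch of that conversion, using the monthly costs and storage sizes from the text (the function name is made up for this example):

```python
MONTHS_PER_YEAR = 12


def cost_per_tb_year(monthly_cost_usd, storage_tb):
    """Normalize a monthly bill to USD per terabyte per year."""
    return monthly_cost_usd * MONTHS_PER_YEAR / storage_tb


# Figures quoted in the text.
bare_metal = cost_per_tb_year(monthly_cost_usd=21_000, storage_tb=50)
ec2_m1_xlarge = cost_per_tb_year(monthly_cost_usd=124, storage_tb=1.68)

if __name__ == "__main__":
    print("bare-metal Hadoop:", round(bare_metal))
    print("Hadoop on EC2 m1.xlarge:", round(ec2_m1_xlarge))
```

Note that this counts raw storage only. It ignores HDFS replication, which would shrink the usable terabytes on the EC2 machine and push its real per-terabyte cost up.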
No winner – it depends on your needs
Ease of Use
Redshift has automated tasks for data warehouse administration and automatic backups to Amazon S3. Transitioning to Redshift should be a piece of cake for PostgreSQL developers since they can use the same queries and SQL clients that they’re used to.
Handling Hadoop, whether on the cloud or not, is trickier. Your system administrators will need to learn Hadoop architecture and tools, and your developers will need to learn coding in Pig or MapReduce. Heck, you might need to hire new staff with Hadoop expertise. There are Hadoop-as-a-service solutions that save you from all that trouble (ahem). Still, most data warehouse devs and admins will find it easier to use Redshift.
Redshift takes the round