Red Hat thinks it will be used to harvest data from its Gluster storage systems, so it must be time for us to look at Apache Hadoop. The Apache HTTP Server is, of course, the web server software that drove the phenomenal growth of the World Wide Web; it is Open Source software and is currently in use on 57% of all active websites, according to Netcraft’s research.
Apache Hadoop itself is, in the project’s own words, ‘a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model’ (a short illustration of that model follows the list below). It has been widely adopted, with 165 references provided on its site, and significant use is reported by the social networking/Public Cloud suppliers. In particular:

  • eBay is running a 532-node cluster (8 cores per node, some 4.3k cores in total) with 5.3PB of associated storage, using it for search optimisation and research; it also reports heavy usage of Java MapReduce[1], Pig[2], Hive[3] and HBase[4]
  • Facebook is running it on a 1.1k-machine cluster (8.8k cores) with 12PB of storage and a 300-machine cluster (2.4k cores) with around 3PB of storage; each ‘commodity’ node has 8 cores and 12TB of storage; it reports heavy use of the streaming and Java APIs, and has built a higher-level data warehousing framework known as Apache Hive as well as a FUSE implementation over HDFS (the Hadoop Distributed File System)
  • LinkedIn has multiple grids ‘divided up based upon purpose’; by our calculations it runs Hadoop on around 1,900 machines in total, with some 20k cores, 45.6TB of RAM and 60PB of storage; it uses CentOS 5.5 Linux, the Sun JDK and a ‘heavily customised’ Pig
  • Spotify uses Hadoop for ‘content generation, data aggregation, reporting and analysis’ on a 60-node cluster with 1,440 cores, 1TB RAM and 1.2PB storage
  • Twitter uses Hadoop to store and process Tweets, log files and other data generated across Twitter; it runs Cloudera’s CDH distribution of Hadoop and accesses the MapReduce APIs from Scala and Java; Twitter handles 12TB of Tweets every day, according to Red Hat
  • Yahoo! has 40k computers with 100k CPUs running Hadoop; its largest cluster has 4.5k nodes, which it uses to support research into Ad Systems and Web Search, and for scaling tests to support Hadoop on larger clusters; it claims that 60% of Hadoop developers within the company work on Pig

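For readers unfamiliar with the ‘simple programming model’ referred to above, the canonical illustration is a word-count job written against Hadoop’s Java MapReduce API – the same style of API that eBay, Facebook and Twitter report using. The sketch below is illustrative rather than drawn from any of those deployments; the class name, job name and input/output paths are placeholders of our own.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts gathered for each word after the shuffle
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this would typically be packaged as a JAR and submitted with something like ‘hadoop jar wordcount.jar WordCount /input /output’; the framework then handles splitting the input data across the cluster and shuffling the intermediate (word, count) pairs between the map and reduce phases.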
The massive scale of some of these clusters is testament to the value of Open Source development and a continuation of the work done in the HPC market many years ago. Their use of Hadoop also goes beyond analytics to data warehousing, Web search and optimisation.
We expect Hadoop to become much more widely used for analytics over time, in enterprises as well as at suppliers and aggregators, although for many it implies a change in data centre architecture. Public Cloud suppliers typically run simple applications at massive scale; enterprises will use Hadoop to embrace social business more effectively, but in most cases it will involve new systems, increasing their management burden. We also know that a number of systems suppliers are planning to release Hadoop appliances in the next few months – so watch this space.


[1] A software framework for distributed processing of large data sets on compute clusters
[2] A high-level data-flow language and execution framework for parallel computation
[3] A data warehouse infrastructure providing data summarization and ad hoc querying
[4] Hadoop distributed database supporting structured data storage for large tables
