I spent last week in México and will be spending this week there too, meeting with clients, press and partners. It’s been a great experience with a lot of learning opportunities. During these discussions I have been struck by the perception that Hadoop runs on ‘commodity hardware’. That was clearly the case around two years ago, when cheap servers were enough to build a high-performance, fault-tolerant, scalable cluster. But, as I mentioned previously, that was fine for clusters delivering batch-processed, overnight jobs for actionable insights or reports. With the continuing development of the Hadoop ecosystem, and Cloudera in particular, this has changed completely. Here’s why:
- Spark requires much more memory; machines with 32 or 64 GB cannot deliver good Spark performance. 128 GB, 256 GB or even more is really the standard now for Spark, and as Spark replaces MapReduce this requirement will only grow.
- The transition from batch to real-time, in particular the heavy adoption of NoSQL databases like HBase, means HBase RegionServers need 128 GB as a minimum, 256 GB as standard, or 512 GB for in-memory performance. Combine HBase with Spark and you need some very high-end machines.
- The increasing requirement for streaming and/or transactional data using Kafka and other tools means the servers that ingest the data and then serve up the analysis in real time have much greater memory requirements.
- With the move to real-time analytics and services, most new systems really benefit from SSD storage. While the cost of SSD storage is declining, it’s still an expensive option.
- Take all of the above into account, and quad-core systems are now the absolute minimum.
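To make the rules of thumb above concrete, here is a minimal Python sketch that checks a proposed node against them. The thresholds are simply the figures quoted in the list (128 GB for Spark and HBase, four cores, SSD preferred); the `NodeSpec` class and `check_node` function are hypothetical names invented for this illustration and are not part of any Hadoop or Cloudera tool.

```python
from dataclasses import dataclass

# Rule-of-thumb floors taken from the bullets above. They are this post's
# estimates, not official vendor requirements.
MIN_MEMORY_GB = {"spark": 128, "hbase": 128}
MIN_CORES = 4


@dataclass
class NodeSpec:
    """Hypothetical description of one server in the cluster."""
    name: str
    memory_gb: int
    cores: int
    has_ssd: bool


def check_node(node, workloads):
    """Return a list of warnings for a node, given the workloads it will run."""
    warnings = []
    for workload in workloads:
        floor = MIN_MEMORY_GB.get(workload)
        if floor is not None and node.memory_gb < floor:
            warnings.append(
                f"{workload}: {node.memory_gb} GB is below the {floor} GB floor"
            )
    if node.cores < MIN_CORES:
        warnings.append(f"{node.cores} cores is below the {MIN_CORES}-core minimum")
    if not node.has_ssd:
        warnings.append("no SSD: real-time workloads may suffer")
    return warnings


if __name__ == "__main__":
    old_node = NodeSpec(name="commodity-01", memory_gb=64, cores=2, has_ssd=False)
    for warning in check_node(old_node, ["spark", "hbase"]):
        print(f"{old_node.name}: {warning}")
```

A check like this is only a starting point for a sizing conversation; real requirements depend on data volumes and workload mix.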
So, when thinking about Big Data, and Hadoop/Cloudera in particular, it is probably a good idea to reset your expectations on hardware costs, as they are going up and will continue to go up. The good news is that as the Hadoop ecosystem grows in capability, organizations will be able to deliver a much broader spread of use cases (see my post next week for a use case discussion), covering not just BI/analytics but actual services to consumers and users.
What do you think? Is Hadoop moving beyond commodity hardware and becoming more expensive? Will this slow down Hadoop adoption?