The 3rd edition of Big Data Spain in Nov 2014 was a resounding success.
Watch the video below and find out why our attendees, speakers, partners and friends turned Big Data Spain into one of the largest events in Europe about Hadoop, Spark, NoSQL and cloud technologies.

Storing and processing data in Hadoop by Jacek Juraszek

This presentation will cover the labyrinth of decisions we made to keep our data processing smooth and easy. As with any puzzle, it was a long and not quite self-tuning process with many dead ends and very long shortcuts. I will cover three main topics of big data processing to ease your way into implementing it.

Shape of stored data, including:
• Directory structure and its implications for Hadoop cluster performance
• Importance of declaring a record schema
• Versioning and updating of records, aka managing historical data
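
To make the directory structure point concrete, here is a minimal sketch of one possible layout. The abstract does not say which layout the team chose; this one assumes Hive-style `key=value` partition directories plus a schema version in the path, and the root, dataset name and `partition_path` helper are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, dataset: str, schema_version: int, day: date) -> PurePosixPath:
    """Build a date-partitioned directory path for one day of records.

    Assumption: one directory per day, grouped under a schema version,
    so that old and new record layouts never share a directory.
    """
    return (
        PurePosixPath(root)
        / dataset
        / f"v{schema_version}"
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
    )


# Hypothetical dataset name and root directory, for illustration only.
print(partition_path("/data", "events", 2, date(2014, 11, 17)))
# prints /data/events/v2/year=2014/month=11/day=17
```

A layout like this lets a job read only the days it needs, and keeping the schema version in the path is one way to keep historical records readable after the record schema changes.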

Calculation process:
• Make it faster by using proper data structures in every single step
• Bugs are inevitable, so easy recalculation is a must
• Cluster malfunction or maintenance triggers the recalculation process without human intervention
• Design for an arbitrary time frame, but be prepared for streaming data as well
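
One way to read the recalculation points above is as a planning step that decides which parts of a time frame must be (re)computed. The sketch below assumes daily granularity and assumes the caller already knows which days finished successfully (for example by checking for the `_SUCCESS` marker that MapReduce jobs write to their output directory). The function names are hypothetical and this is not necessarily the approach used in the talk.

```python
from datetime import date, timedelta
from typing import Iterable, Iterator, List, Set


def days_in_frame(start: date, end: date) -> Iterator[date]:
    """Yield every day in the half-open interval [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


def days_to_recalculate(
    start: date,
    end: date,
    completed: Set[date],
    invalidated: Iterable[date] = (),
) -> List[date]:
    """Return the days that need a (re)run.

    A day is selected when it never completed (for example after a cluster
    outage) or when it was explicitly invalidated (for example after a bug fix).
    """
    invalid = set(invalidated)
    return [d for d in days_in_frame(start, end) if d not in completed or d in invalid]


completed = {date(2014, 11, 1), date(2014, 11, 2), date(2014, 11, 4)}
plan = days_to_recalculate(
    date(2014, 11, 1), date(2014, 11, 5), completed, invalidated=[date(2014, 11, 2)]
)
print(plan)  # 2 November (invalidated) and 3 November (never completed)
```

Because the plan is derived from the state of the outputs rather than from a log of what was run, a scheduler can call it repeatedly after an outage without anyone tracking failed runs by hand.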

Be a good citizen:
• Small files and empty directories are an issue
• Make it as small as possible with data compression
• Base YARN resource requests on real needs
• CSV files should be your last choice, but they are not forbidden
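
The small-files problem is often handled by periodically merging many small files into fewer large ones. The sketch below only plans such a compaction: it greedily groups files into batches of roughly one HDFS block (128 MiB is a common default block size, used here as an assumption). The `plan_compaction` helper and the file names are hypothetical, and the actual merge job is out of scope.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int = 128 * 1024 * 1024) -> List[List[str]]:
    """Greedily group files (visited in name order) into batches.

    A batch is closed as soon as adding the next file would push it past
    target_bytes. A single file larger than the target gets its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Tiny sizes and a tiny target keep the example easy to follow.
sizes = {"part-a": 40, "part-b": 50, "part-c": 30, "part-d": 100}
print(plan_compaction(sizes, target_bytes=100))
# prints [['part-a', 'part-b'], ['part-c'], ['part-d']]
```

Each batch would then be rewritten as one larger, compressed file, which reduces the number of objects the NameNode has to track.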

Moreover, I will talk about Hadoop ecosystem surprises, where we expected some specific behavior based on our experience with classic RDBMSs and single-server apps. I don't have to tell you that such intuition is misleading in the context of Hadoop. We have spent many sleepless hours on cases including:
• Killing a Hive process with a simple select query
• Retries that overload the cluster
• Non-deterministic measurements of MR performance on the same data
• Object reuse in the MR reduce phase
• The client machine and its capacity may sometimes be the bottleneck
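
The object reuse surprise is easiest to see in a small simulation. In Hadoop MapReduce, the framework reuses the same key and value objects while iterating over a reducer's input, so keeping references to them yields repeated copies of the last value. The Python sketch below only imitates that behavior: `IntBox` is a hypothetical stand-in for a mutable Writable, and none of this is Hadoop API.

```python
from typing import Iterable, Iterator, List


class IntBox:
    """Hypothetical stand-in for a mutable Hadoop Writable holding an int."""

    def __init__(self) -> None:
        self.value = 0

    def set(self, value: int) -> None:
        self.value = value

    def get(self) -> int:
        return self.value


def reused_values(raw: Iterable[int]) -> Iterator[IntBox]:
    """Imitate the reduce-side iterator: one box, overwritten for every value."""
    box = IntBox()
    for v in raw:
        box.set(v)
        yield box


def collect_buggy(values: Iterable[IntBox]) -> List[IntBox]:
    """Keep references to the yielded objects (they are all the same object)."""
    return [v for v in values]


def collect_safe(values: Iterable[IntBox]) -> List[int]:
    """Copy the payload out of the box before the iterator overwrites it."""
    return [v.get() for v in values]


buggy = [b.get() for b in collect_buggy(reused_values([1, 2, 3]))]
safe = collect_safe(reused_values([1, 2, 3]))
print(buggy)  # prints [3, 3, 3]
print(safe)   # prints [1, 2, 3]
```

In a real reducer the equivalent fix is to copy each value before storing it in a collection.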

For each troublesome aspect mentioned, I will provide a sample solution that we applied, or at least evaluated, in real life. The problems mentioned may seem like a pessimistic and overwhelming vision of applying Hadoop, but that is not the case. We really enjoy working with the framework, but I will speak about the dark side of implementing big data because hours of issue fixing hide in the shadow of every success story. I will be more than happy if anyone manages to avoid the pitfalls presented and use the Hadoop ecosystem successfully.
