Hadoop is a software stack for managing enormous amounts of distributed data, and it is a core part of the modern data management ecosystem. Subject-matter expert Jari Putula lifts the lid for you. Enjoy!
Everyone (at least the readers working in related fields) can connect the sympathetic little elephant mascot with the term ‘Hadoop’. But what lies beneath the figure, which was named after one developer’s child’s toy elephant and which has already gone past its hype peak? Hadoop is usually associated with other, often overloaded terms like Big Data, Data Lake and stupendous amounts of free-form data, and perhaps with the technology that came to complement more traditional data warehousing. If somebody talks about Hadoop without specifying what they mean, what is that person most likely talking about? This blog ponders what this software entity is, and whether it is even possible to define and categorise it. Additionally, we’ll cover the vital companions of this particular solution.
Let’s go back in time five years – to five days on Cloudera’s Hadoop MapReduce programming course in London. It was one of the most mentally taxing occasions I have had the privilege to witness. Beyond that, I’ve experienced only one other event that kept my own CPU and I/O clocked at 100% hour after hour: a five-day Oracle DBA course in the 90s, run entirely from a console. But back to Hadoop and that first “excursion”. Despite preparing beforehand, the course was such a bombardment of new terms and topics that brain overflow was inevitable, and understanding the bigger picture was often left for afterwards. Anyone who jumps into the world of Hadoop should prepare for the swarm of terms, tools and funny names, and also accept that you don’t have to understand, or even remember, everything.
Hadoop data management – What is it?
Back to basics. Hadoop is a fault-tolerant, Java-based, open-source distributed computing framework. Its core components are the HDFS file system, the YARN resource manager and the MapReduce programming model. On top of these, the Hadoop ecosystem includes many components that integrate directly with the core. Because a real-world platform can consist of, say, twenty different tools, each with its own release cycle and compatibility problems with the other systems, vendors have entered the market with their own Hadoop ecosystems. When you take one of these ready-to-use Hadoop distributions into use, you don’t have to maintain a complex configuration and version management abyss yourself; the distribution does it for you. The best-known distribution providers are Hortonworks, Cloudera and MapR, though there are of course many more. Note that the cloud service providers also have their own Hadoop distributions – for example Microsoft Azure HDInsight, which is based on Hortonworks’ HDP distribution. The added value in that case is a fully integrated data platform within the Azure cloud services.
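To give a feel for the MapReduce programming model mentioned above, here is a minimal word-count sketch in plain Python. It only imitates the map → shuffle → reduce phases that Hadoop distributes across a cluster; the function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big hype", "big elephant"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 3
```

In real Hadoop the map and reduce functions run as tasks on many machines and the shuffle moves data over the network, but the logical shape of the computation is exactly this.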
How to learn Hadoop?
A good basis for learning both the theory and the practice is to choose a ready-made distribution and explore what it offers. The tools in the different distributions are mostly the same, but there are differences. If you are keen on hands-on testing, you’ll find downloadable virtual machines for VMware and VirtualBox on the providers’ web pages – I’ve used both Cloudera’s and Hortonworks’ virtual desktops. Another nifty option for practising Hadoop is to spin up an instance in Azure or AWS, which also makes it easy to share the work with other involved personnel.
A comprehensive tool presentation won’t fit into this text, but let’s take a couple of examples from the ones I’ve used myself:
- Hive – SQL layer on top of Hadoop. Note that Hive resembles a traditional relational database only in some respects, and this style of implementation has many consequences; for example, it has not traditionally suited ad hoc queries that well, although this is changing or has already changed.
- Ambari – graphical administration tool, e.g. for the Hortonworks Data Platform
- Ranger – a versatile, platform-integrated tool, for example for data authorisation
- Atlas – data governance and metadata tool
- Spark – Hadoop data processing engine (supports, for example, the Scala, Java, Python and R languages)
- Sqoop – tool for transferring data between Hadoop and relational databases
- Oozie – data pipeline and scheduling tool
- Nifi – a very versatile graphical tool for transferring, monitoring and governing data
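Of the tools above, Spark is the one a developer is most likely to meet first. Its appeal is a functional API that chains transformations over distributed datasets. The sketch below imitates that chaining style in plain Python with a tiny illustrative class – `MiniRDD` is my own toy name, not a Spark API – just to show the shape of Spark code without a cluster.

```python
from functools import reduce

class MiniRDD:
    """A toy, single-machine imitation of Spark's RDD chaining style.
    Illustrative only: real Spark partitions the data across a cluster."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return reduce(fn, self.data)

# Total length of the 'error' lines in a small log, Spark style.
log = ["error disk full", "info ok", "error timeout"]
total = (MiniRDD(log)
         .filter(lambda line: line.startswith("error"))
         .map(len)
         .reduce(lambda a, b: a + b))
print(total)  # 28
```

In actual PySpark the code looks almost identical (`sc.parallelize(log).filter(...).map(...).reduce(...)`), but each transformation is evaluated lazily and executed in parallel across the cluster.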
The situation now
I’m writing this from a Hadoop and Big Data seminar in Berlin, sponsored by Hortonworks. After all these years, the information flood still causes a nice swell in my head. Perhaps listing the topics sheds some light on the seminar’s nature:
- Data warehousing and operational data stores
- Artificial intelligence and data science
- Big compute and storage
- Cloud and operations
- Governance and security
- IoT & streaming
- Enterprise adoption
The speed of development in data platforms and technologies is immense, and it doesn’t stem only from the fact that almost every company developing IT software has been pulled along in one way or another. One reason for the fast penetration is the Apache open-source licence: almost all the tools are projects under that licence, with massive user and developer communities behind them. On top of this, huge companies like Google, Facebook, Yahoo, LinkedIn and Twitter (among others) hand their frameworks and tools over to Apache to manage, which brings more reliability, visibility and continuity to these trends. Naturally, the commercial BI and reporting (et cetera) software providers (SAS, Oracle, IBM, Microsoft…) have developed their own platforms and tools to integrate with the Hadoop ecosystem, and most of them have created their own Hadoop-based commercial solutions. If you haven’t already dug into this subject: you really should!