
Introducing Trunk Data Platform: Open-Source Big Data Distribution Curated by TOSIT
Ever since Cloudera and Hortonworks merged, the choice of commercial Hadoop distributions for on-premises workloads essentially comes down to CDP Private Cloud. CDP can be seen as the "best of both worlds" between CDH and HDP. With HDP 3.1's End of Support coming in December 2021, Cloudera's customers are "forced" to migrate to CDP.
What about customers who cannot upgrade regularly to keep up with EOS dates? Other customers are not interested in the cloud features promoted by Cloudera and just want to keep running their "legacy" Hadoop workloads. Hortonworks' HDP used to be free to download, and some companies are still interested in running an unsupported Big Data distribution for non-business-critical workloads.
Finally, some are concerned about the noticeable decline in open source contributions since the two companies merged.
Trunk Data Platform (TDP) was designed with these concerns in mind: shared governance of the distribution's future, freely available and 100% open source.
TOSIT
TOSIT (The Open Source I Trust) is a French non-profit organization that promotes open source software. Its founders include industry leaders such as Carrefour (retail), EDF (energy) and Orange (telecommunications), as well as the French Ministry of Economy and Finance.
Work on the Trunk Data Platform (TDP) started through talks between EDF and the French Ministry of Economy and Finance regarding the status of their respective Big Data platforms.
Trunk Data Platform
Apache components
The core idea of the Trunk Data Platform (TDP) is to have a secure, robust base of well-known Apache projects of the Hadoop ecosystem. These projects should cover most Big Data use cases: distributed file systems and computing resources as well as SQL and NoSQL abstractions for querying data.
The following table summarizes the components of the TDP:
| Component | Version | Base Apache branch name |
|---|---|---|
| Apache ZooKeeper | 3.4.6 | release-3.4.6 |
| Apache Hadoop | 3.1.1-TDP-0.1.0-SNAPSHOT | rel/release-3.1.1 |
| Apache Hive | 3.1.3-TDP-0.1.0-SNAPSHOT | branch-3.1 |
| Apache Hive 1 | 1.2.3-TDP-0.1.0-SNAPSHOT | branch-1.2 |
| Apache Tez | 0.9.1-TDP-0.1.0-SNAPSHOT | branch-0.9.1 |
| Apache Spark | 2.3.5-TDP-0.1.0-SNAPSHOT | branch-2.3 |
| Apache Ranger | 2.0.1-TDP-0.1.0-SNAPSHOT | ranger-2.0 |
| Apache HBase | 2.1.10-TDP-0.1.0-SNAPSHOT | branch-2.1 |
| Apache Phoenix | 5.1.3-TDP-0.1.0-SNAPSHOT | 5.1 |
| Apache Phoenix Query Server | 6.0.0-TDP-0.1.0-SNAPSHOT | 6.0.0 |
| Apache Knox | 1.6.1-TDP-0.1.0-SNAPSHOT | v1.6.1 |
Note: The versions of the components were chosen to ensure interoperability. They are roughly based on those of HDP 3.1.5, the latest HDP release.
The table above is maintained in the TDP repository.
Our repositories are mostly forks of the specific tags or branches mentioned in the table above. There is no deviation from the Apache codebase except for the version name and some backported patches. Should we write meaningful code for any of the components that would benefit the community, we will go through the process of submitting these contributions to the Apache codebase of each project.
Another core concept of TDP is mastering everything from building to deploying the components. Let's see what this implies.
Building TDP
Building TDP involves building the underlying Apache projects from source with some minor modifications.
The difficulty lies in the complexity of the projects and their many interdependencies. For example, Apache Hadoop is a 15+ year old project with more than 200,000 lines of code. While most of the components in TDP are Java projects, the code we compile also includes C, C++, Scala, Ruby and JavaScript. To ensure reproducible and reliable builds, we use a Docker image containing everything needed to build and test the components above. This image was heavily inspired by the one present in the Apache Hadoop project, but we plan on maintaining our own.
Most of the components in TDP have dependencies on other components. For example, here is an excerpt of TDP Hive's pom.xml file:
<storage-api.version>2.7.0</storage-api.version>
<tez.version>0.9.1-TDP-0.1.0-SNAPSHOT</tez.version>
<super-csv.version>2.2.0</super-csv.version>
<spark.version>2.3.5-TDP-0.1.0-SNAPSHOT</spark.version>
<scala.binary.version>2.11</scala.binary.version>
<scala.version>2.11.8</scala.version>
<tempus-fugit.version>1.1</tempus-fugit.version>
In this case, Hive depends on both Tez and Spark.
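Because these version properties must line up across every fork, it can be worth verifying them before launching a long build. Below is a minimal Python sketch of such a check; the checkout path, property names and expected versions are illustrative assumptions, not part of TDP's actual tooling.

import xml.etree.ElementTree as ET

# Maven pom.xml files live in the default POM XML namespace.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical expectations: the TDP versions Hive should be built against.
EXPECTED = {
    "tez.version": "0.9.1-TDP-0.1.0-SNAPSHOT",
    "spark.version": "2.3.5-TDP-0.1.0-SNAPSHOT",
}

# Hypothetical path to a local checkout of the TDP Hive fork.
properties = ET.parse("hive/pom.xml").getroot().find("m:properties", NS)

for name, expected in EXPECTED.items():
    actual = properties.find(f"m:{name}", NS).text
    assert actual == expected, f"{name} is {actual}, expected {expected}"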
We created a tdp directory in each of the TDP project repositories (e.g. here for Hadoop) in which we provide the commands used to build, test (covered in the next section) and package the components.
Note: Be sure to check out our previous articles "Build your open source big data distribution with Hadoop, HBase, Spark, Hive & Zeppelin" and "Installing Hadoop from Source: Build, Patch and Run" for more information on the process of building the interdependent Apache projects of the Hadoop ecosystem.
Testing TDP
Testing is a critical part of the TDP release process. Since we package our own releases of each project in an interdependent manner, we need to ensure that these releases are compatible with one another. This is achieved by running unit tests and integration tests.
Since most of our projects are written in Java, we chose Jenkins to automate the building and testing of the TDP distribution. Jenkins' JUnit plugin is very useful for comprehensive reporting of the tests we run on each project after compiling the code.
Here is an example of the Apache Hadoop test report:
As with the builds, we have also committed the TDP test commands and flags in the tdp/README.md file of each repository.
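The JUnit plugin consumes the XML reports that Maven's Surefire plugin writes during these test runs. As a rough illustration of what such reporting aggregates, here is a minimal Python sketch that tallies those reports across a build tree; it follows the standard Surefire layout but is an assumption for illustration, not part of our pipeline.

import glob
import xml.etree.ElementTree as ET

# Surefire writes one TEST-*.xml file per test class under target/surefire-reports.
totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}

for report in glob.glob("**/target/surefire-reports/TEST-*.xml", recursive=True):
    testsuite = ET.parse(report).getroot()
    for key in totals:
        totals[key] += int(testsuite.get(key, 0))

print(totals)  # e.g. {'tests': 4321, 'failures': 2, 'errors': 0, 'skipped': 17}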
Note: Some high-level information about our Kubernetes-based build/test environment can be found in our repository.
Installing TDP
After the build phase we just described, we are left with .tar.gz archives of the components of our Hadoop distribution. These archives contain binaries, compiled JARs, and configuration files. Where do we go from here?
To stay consistent with our philosophy of having control over the entire stack, we decided to write our own Ansible collection. It comes with roles and playbooks to manage the deployment and configuration of the TDP stack.
The tdp collection is designed to deploy all components with security (Kerberos authentication and TLS) and high availability by default (where possible).
Here is an excerpt from the hdfs_nn task file of the Hadoop role deploying the HDFS NameNode:
- name: Create HDFS Namenode directory
  file:
    path: "{{ hdfs_site['dfs.namenode.name.dir'] }}"
    state: directory
    group: '{{ hadoop_group }}'
    owner: '{{ hdfs_user }}'

- name: Create HDFS Namenode configuration directory
  file:
    path: '{{ hadoop_nn_conf_dir }}'
    state: directory
    group: '{{ hadoop_group }}'
    owner: '{{ hdfs_user }}'

- name: Template HDFS Namenode service file
  template:
    src: hadoop-hdfs-namenode.service.j2
    dest: /usr/lib/systemd/system/hadoop-hdfs-namenode.service
TDP Lib
The Ansible playbooks can be run manually or via the TDP lib, a Python CLI we developed for TDP. Using it provides several benefits:
- the lib uses a generated DAG based on the dependencies between the components to deploy everything in the right order (see the sketch after this list);
- all deployment logs are saved in a database;
- the lib also handles configuration versioning of the components.
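To give an idea of the first two points, here is a minimal Python sketch of DAG-ordered deployment with logged runs. The dependency graph, table schema and deploy() placeholder are illustrative assumptions; this is not the actual TDP lib code.

import sqlite3
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Hypothetical dependency graph: each component lists what must be deployed first.
DEPENDENCIES = {
    "zookeeper": [],
    "hadoop": ["zookeeper"],
    "tez": ["hadoop"],
    "hbase": ["zookeeper", "hadoop"],
    "hive": ["tez"],
}

def deploy(component: str) -> str:
    # Placeholder for running the component's Ansible playbook.
    return f"{component}: success"

db = sqlite3.connect("deployment_logs.db")
db.execute("CREATE TABLE IF NOT EXISTS logs (component TEXT, result TEXT)")

# static_order() yields the components in a valid deployment order.
for component in TopologicalSorter(DEPENDENCIES).static_order():
    db.execute("INSERT INTO logs VALUES (?, ?)", (component, deploy(component)))

db.commit()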
What about Apache Ambari?
Apache Ambari is an open source Hadoop cluster management interface. It was maintained by Hortonworks and has been discontinued in favor of Cloudera Manager, which is not open source. Although an open source Apache project, Ambari was strongly tied to HDP and was only capable of managing Hortonworks Data Platform (HDP) clusters. HDP was distributed as RPM packages, and the process used by Hortonworks to build these RPMs (i.e. the underlying SPEC files) was never open sourced.
We judged the technical debt involved in maintaining Ambari for TDP's sake to be too great, and decided to start from scratch and automate the deployment of our Hadoop distribution with the industry standard for IT automation: Ansible.
What comes next?
TDP is still a work in progress. While we already have a solid base of Hadoop-oriented projects, we plan on expanding the list of components of the distribution and experimenting with new Apache Incubator projects such as Apache DataLab or Apache YuniKorn. We also hope to soon be able to contribute code to the Apache trunk of each project.
The design of a web interface is also in progress. It should be able to handle everything from configuration management to service monitoring and alerting. This web interface will be powered by the TDP lib.
We invested a lot of time in the Ansible roles and plan to leverage them in the future admin interface.
Get involved
The easiest way to get involved with TDP is to go through the "Getting Started" repository, where you will be able to run a fully functional, secure and highly available TDP installation in virtual machines. You can also contribute via pull requests or raise issues in the Ansible collection or in any of the TOSIT-IO repositories.
If you have any questions, please get in touch at david@adaltas.com.