An Overview of Cloudera Data Platform (CDP)

Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated, multi-functional self-service tools to analyze and centralize data, together with enterprise-level security and governance across public, private and multi-cloud deployments. CDP is the successor to Cloudera's two previous Hadoop distributions: Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP). In this article, we dive into the new Cloudera Big Data offering and how it differs from its predecessors.

Overview

CDP combines a public-private approach, real-time data analytics, scalable on-premises, cloud and hybrid cloud deployment options, and a privacy-oriented architecture. According to its official website, CDP lets you:

  • Automatically spin up workloads when needed and suspend them when they are finished, which keeps cloud costs under control
  • Use analytics and machine learning to optimize workloads
  • Show data lineage of all cloud and transient clusters
  • Use a single pane of glass across hybrid and multi-cloud environments
  • Scale to petabytes of data and thousands of different users
  • Use multi-cloud and hybrid environments to centralize control of customer and operational data
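
To make the cost argument behind the first point concrete, here is a minimal Python sketch of a start-on-demand, suspend-when-idle policy. It is an illustration only: the `Workload` class, the `run_day` function and the hourly rate are hypothetical names and values chosen for this example, and none of them are part of a CDP API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload that is only billed while it is running."""
    name: str
    hourly_cost: float
    running: bool = False
    billed_hours: float = 0.0

    def start(self) -> None:
        self.running = True

    def suspend(self) -> None:
        self.running = False

    def tick(self, hours: float) -> None:
        # Accumulate billed time only while the workload is running.
        if self.running:
            self.billed_hours += hours

    @property
    def cost(self) -> float:
        return self.billed_hours * self.hourly_cost


def run_day(workload: Workload, busy_hours: set[int], auto_suspend: bool) -> float:
    """Simulate 24 one-hour slots and return the resulting cost."""
    for hour in range(24):
        if hour in busy_hours:
            workload.start()      # demand arrives: make sure it is running
        elif auto_suspend:
            workload.suspend()    # no demand: stop paying for idle compute
        workload.tick(1.0)
    return workload.cost


busy = {9, 10, 11}  # assumed: the job is only needed three hours a day
always_on = run_day(Workload("etl", 2.0, running=True), busy, auto_suspend=False)
on_demand = run_day(Workload("etl", 2.0), busy, auto_suspend=True)
print(always_on, on_demand)
```

Under these assumed numbers, the always-on workload is billed for all 24 hours, while the auto-suspended one is billed only for the three busy hours.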

CDP is available in two editions: CDP Public Cloud and CDP Private Cloud.

CDP Public Cloud

CDP Public Cloud is a Platform-as-a-Service (PaaS) that is compatible with cloud infrastructure and can be moved without difficulty between cloud providers, including private solutions such as OpenShift. CDP was built to be fully hybrid as well as multi-cloud, meaning that one platform can handle all data lifecycle use cases, regardless of location or cloud, with a consistent security and governance model. CDP can work with data in a variety of settings, including public clouds such as AWS, Azure and GCP. In addition, it can automatically scale workloads and resources up and down to improve performance and lower costs.

CDP Public Cloud Services

Here are the main elements that make up the CDP Public Cloud:

  • Data Engineering

    CDP Data Engineering is an all-in-one toolkit for data engineering. Built on Apache Spark, it streamlines ETL processes across enterprise analytics teams by enabling orchestration and automation with Apache Airflow, and it provides sophisticated pipeline monitoring, visual troubleshooting and comprehensive management tools. It runs workloads in isolated environments and is containerized, scalable and portable.

  • Data Hub

    CDP Data Hub is a service that enables high-value analytics from Edge to AI. Streaming, ETL, data marts, databases and machine learning are just some of the analytical workloads it covers.

  • Data warehouse

    CDP Data Warehouse is a service that allows IT to provide a cloud-based self-service analytics experience to BI analysts. Streaming, Data Engineering and Machine Learning (ML) analytics are all fully integrated into the CDP Data Warehouse. It has a unified framework that makes it possible to secure and control all your data and metadata on private, multiple public or hybrid clouds.

  • Machine learning

    CDP Machine Learning optimizes ML workflows by using built-in and comprehensive tools to deploy, serve and monitor models. With the extended Cloudera Shared Data Experience (SDX) for models, it governs and automates model cataloging, then makes it easy to share results and collaborate across CDP experiences such as Data Warehouse and Operational Database.

  • Data visualization

    With Cloudera Data Visualization, users can model data in the virtual data warehouse without having to delete or update underlying data structures or tables, and query large amounts of data without constantly loading data, saving time and money.

  • Operational database

    The Cloudera Operational Database experience is a managed solution that encapsulates the underlying cluster instance as a database. It automatically scales based on the cluster's workload, improves performance within the same infrastructure and automatically resolves operational issues.

Architecture

In this section, we present all the services available on the CDP Public Cloud. The components shown here can be used independently or as a whole.

  • Data Hub
    • Management Console: service used by CDP administrators to manage environments, users and services
  • Data warehouse
    • Database catalogs: A logical collection of metadata definitions for managed data, as well as the data context that accompanies it
    • Virtual warehouse: An instance of compute resources that corresponds to a cluster
  • Machine Learning: Deploy workspaces for machine learning
  • Data Engineering (CDE is currently only available on AWS)
    • Environment: A logical subset of your cloud provider account that includes a particular virtual network
    • CDE service: The long-lived Kubernetes cluster and services that manage the virtual clusters
    • Virtual cluster: An individual self-scaling cluster with its own CPU and memory pools
    • Job: Application code, as well as specified configurations and resources
    • Resource: A defined set of files necessary for a job
  • Security and governance
    • Data catalog: understand, manage, secure and control data assets
    • Workload Manager: offers insights to help you better understand the workloads you send to clusters managed by Cloudera Manager
    • Replication Manager: service to copy and migrate data from CDH clusters to CDP Public Cloud
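
The Data Engineering concepts listed above nest inside one another: a service hosts virtual clusters, a virtual cluster runs jobs, and a job pulls in resources. The following Python sketch models that containment to show how the pieces relate. It is a reading aid built only from the definitions in the list; the class names, field names and sample values are hypothetical and do not mirror the actual CDE API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    """A defined set of files that a job needs."""
    name: str
    files: list[str]


@dataclass
class Job:
    """Application code plus its configuration and resources."""
    name: str
    application_file: str
    config: dict[str, str] = field(default_factory=dict)
    resources: list[Resource] = field(default_factory=list)

    def required_files(self) -> list[str]:
        # The application file followed by every file of the attached resources.
        files = [self.application_file]
        for resource in self.resources:
            files.extend(resource.files)
        return files


@dataclass
class VirtualCluster:
    """A self-scaling cluster with its own CPU and memory limits."""
    name: str
    max_cpu_cores: int
    max_memory_gb: int
    jobs: list[Job] = field(default_factory=list)


@dataclass
class CdeService:
    """The long-lived service that manages virtual clusters in one environment."""
    name: str
    environment: str
    virtual_clusters: list[VirtualCluster] = field(default_factory=list)


# Hypothetical sample data, for illustration only.
libs = Resource("etl-libs", ["helpers.py", "schema.json"])
job = Job("nightly-etl", "etl.py", {"spark.executor.memory": "4g"}, [libs])
cluster = VirtualCluster("analytics-vc", max_cpu_cores=20, max_memory_gb=80, jobs=[job])
service = CdeService("cde-prod", environment="aws-env-1", virtual_clusters=[cluster])
print(job.required_files())
```

Keeping resources separate from jobs, as in this sketch, lets several jobs share the same set of files instead of each carrying its own copy.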

CDP Private Cloud

CDP Private Cloud is designed for hybrid cloud deployments, enabling on-premises environments to connect to public clouds while maintaining consistent, integrated security and governance. Compute and storage are decoupled in CDP Private Cloud, allowing compute and storage clusters to scale independently. Cloudera Shared Data Experience (SDX) is available on a CDP Private Cloud Base cluster and delivers unified security, governance and metadata management. Using the Management Console, CDP Private Cloud users can quickly deploy the Data Warehouse and Machine Learning services and scale them in and out as needed.

CDP Private Cloud Services

Some of the components of CDP Public Cloud, such as Machine Learning and Data Warehouse, are available on CDP Private Cloud. In addition, it uses a collection of analytical engines covering streaming, data engineering, data marts, operational database and data science to support traditional workloads.

Architecture

In this section, we present various services and components available for the private cloud. Unlike the Public Cloud offering, the components are much more flexible as the user has more control over the cluster deployment.

Figure: Cloudera Private Cloud Architecture (provided by Cloudera, Inc.)

  • CDP PVC base
    • Cloudera Manager
    • Hadoop
      • HDFS: distributed file system that handles large data sets
      • YARN: resource manager that allocates and schedules resources for distributed applications
    • Storage, databases
      • Hive: data warehouse software designed to provide data querying and analysis
      • HBase: non-relational distributed database for storing massive amounts of sparse data in a fault-tolerant manner
      • Kudu: column-oriented distributed data storage engine for fast analytics data
    • Stream
      • Kafka: streaming messaging platform
      • Stream Messaging Manager (SMM): Operational monitoring and management tools that provide end-to-end visibility into an Apache Kafka enterprise environment.
      • Stream Replication Manager (SRM): enterprise-grade replication solution for fault-tolerant, scalable, and robust replication of Kafka topics across multiple clusters
    • Query
      • Impala: a SQL query engine for Apache Hadoop
      • Spark: a unified analytics engine for large-scale data processing
    • UI
      • Hue: SQL assistant for querying databases and data warehouses and collaborating
      • Zeppelin: a web interface to easily analyze and format large amounts of data processed via Spark
      • Data Analytics Studio (DAS): application that provides diagnostic tools and smart recommendations to help business analysts become more self-sufficient and productive with Hive
    • Security, administration
      • Ranger: provides a centralized platform to define, administer and manage security policies across the entire Hadoop ecosystem in a consistent manner
      • Atlas: exchanges metadata with other tools and processes, both inside and outside the Hadoop stack
  • CDP PVC Plus
    • OpenShift: deploy projects in containers
    • Experiences
      • Data warehouse: create self-service, independent data warehouses and data marts that automatically scale up and down in response to changing workload demands
      • Machine Learning: deploy Machine Learning workspaces
  • Cloudera Data Science Workbench (CDSW): platform that enables data scientists to manage their own analysis pipelines
  • Cloudera Flow Management (CFM)
    • NiFi: automate data movement between different systems

Benefits of CDP Private Cloud

  • Flexibility – your organization's cloud environment can be tailored to meet specific business requirements.
  • Control – higher levels of control and privacy due to non-shared resources.
  • Scalability – private clouds often offer higher scalability compared to on-premises infrastructure.

Conclusion

Cloudera Data Platform (CDP) gives you a great deal of versatility in building and maintaining a cloud-based production data warehouse: it makes it easy to migrate data to the cloud and to run the data warehouse in production. Both editions, CDP Public Cloud and CDP Private Cloud, depend on the Shared Data Experience (SDX), which is responsible for security and governance. Overall, CDP is a suitable solution for organizations that need a reliable, scalable and secure cloud environment. It gives you the flexibility to choose between private and public clouds, each of which comes with its own advantages.
