Adaltas Summit 2022 Morzine

For its third edition, the entire Adaltas crew gathers in Morzine for a full week with two days dedicated to technology on the 15th and 16th of September 2022.

The speakers choose one of the three available formats:

  • Presentation: from 20 minutes to 1 hour
  • Demonstration: from 45 minutes to 2 hours
  • Training: from 1 hour to 2 hours

Program

Once a session has taken place, its supporting materials as well as an article covering it will be published on the Adaltas website. Here is the schedule and the list of topics covered this week.

Thursday, September 15, 2022

  • 9:30 a.m Kubernetes Networking Lab
  • 10:45 a.m Running Kafka clusters on Kubernetes with Strimzi
  • 12:00 Expose your containers and virtual machines with a public IP
  • 2:30 p.m DuckDB introduction
  • 3:30 p.m Using LXD with Terraform for local development environments
  • 4:30 p.m Comparison of Data Quality Validation Frameworks
  • 5:15 p.m A brief look at Apache Arrow

Friday, September 16, 2022

  • 9:30 a.m Introduction to SingleStoreDB, the database for transactional and analytical workloads
  • 10:45 a.m Introduction to Apache Iceberg, the open table format
  • 12:00 Introduction to Apache Kyuubi
  • 2:30 p.m Vector databases, Milvus Overview
  • 3:30 p.m tdp-server, REST service management for tdp cluster
  • 4:30 p.m Ballista, a Rust-based distributed query engine
  • 5:15 p.m Data protection around the world

Abstracts

Kubernetes Networking Lab

  • Speaker: Paul-Adrien CORDONNIER
  • Duration: 1h15
  • Format: talk + demo
  • Schedule: Thursday, September 15, 2022 at 9:30 a.m

The goal of this lab is to give everyone an introduction to Kubernetes network communication. We will try to cover most of the concepts at a high level and practice them in a sandbox environment.

By the end of the session, we should all know the purpose of each element of the networking stack and how it is used. The lab should also serve as a reference when confusion inevitably arises during your Kubernetes journey.

Here are the concepts covered:

  • Basic Low-Level Networking (CNI)
  • Kubernetes Network API (Services)
  • DNS
  • Exposing Kubernetes applications externally (LoadBalancer, Ingress, Gateways)
  • Service Mesh
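
As a small illustration of the Services and DNS items above, the sketch below builds the name under which a Service is conventionally resolvable from inside a cluster. It assumes the default cluster domain `cluster.local`, which can be configured differently; the function name and the example Service are hypothetical.

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Return the in-cluster DNS name of a Kubernetes Service.

    Follows the conventional <service>.<namespace>.svc.<cluster-domain>
    scheme. The cluster domain is an assumption and may differ per cluster.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Hypothetical Service "web" in a hypothetical namespace "shop".
    print(service_fqdn("web", "shop"))
```

Pods in the same namespace can usually reach the Service with the short name alone, while the fully qualified form works from any namespace.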

Running Kafka clusters on Kubernetes with Strimzi

  • Speaker: Leo SCHOUKROUN
  • Duration: 1h15
  • Format: talk + demo
  • Schedule: Thursday, September 15, 2022 at 10:45 a.m

Kubernetes is not the first platform that comes to mind for running Apache Kafka clusters.

We'll walk through the basics of Strimzi, a Kafka operator for Kubernetes curated by Red Hat. A particular focus will be placed on the storage issue which is often a pain point on bare metal Kubernetes clusters.

We will also compare Strimzi with other Kafka operators by giving their pros and cons.

The presentation ends with a demonstration that presents different use cases for Strimzi.

Expose your containers and virtual machines with a public IP

  • Speaker: David WORMS
  • Duration: 1h
  • Format: discussion + demo
  • Schedule: Thursday, September 15, 2022 at 12:00

Virtual machines and containers are typically exposed to the web using port forwarding. In such cases, the public IP address is shared with the host computer. While this works well in many scenarios, it is sometimes necessary to give the guest machine its own distinct public IP, for example to host your own mail server, to access an internal network, or to expose Kubernetes services.

The general idea is to route traffic from a public IP or CIDR subnet to a guest computer running inside a host computer. Said differently, the connection exposes containers and virtual machines with a static public address.
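
The bookkeeping side of this idea can be sketched with Python's standard `ipaddress` module: given a routed CIDR subnet, hand one address to each guest and check whether an address belongs to the subnet. This is only an illustration, not the procedure of the talk. The subnet comes from a documentation-only range, the guest names are hypothetical, and skipping the network and broadcast addresses is an assumption that depends on how the subnet is actually routed.

```python
import ipaddress


def assign_public_ips(subnet: str, guests: list[str]) -> dict[str, str]:
    """Map each guest name to one host address of the routed subnet."""
    network = ipaddress.ip_network(subnet)
    # hosts() skips the network and broadcast addresses (an assumption here).
    hosts = list(network.hosts())
    if len(guests) > len(hosts):
        raise ValueError("not enough addresses in the routed subnet")
    return {guest: str(ip) for guest, ip in zip(guests, hosts)}


def is_routed(address: str, subnet: str) -> bool:
    """Tell whether an address belongs to the routed subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


if __name__ == "__main__":
    # 203.0.113.0/24 is reserved for documentation (TEST-NET-3).
    print(assign_public_ips("203.0.113.0/29", ["mail", "k8s-ingress"]))
    print(is_routed("203.0.113.5", "203.0.113.0/29"))
```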

It works seamlessly with any hypervisor, including VMware ESXi, Citrix XenServer, OpenStack and Proxmox. The procedure covered in the session uses LXD in cluster mode.

DuckDB introduction

  • Speaker: Stephan BAUM
  • Duration: 1h
  • Format: presentation + demo
  • Schedule: Thursday, September 15, 2022 at 2:30 p.m

DuckDB is an embedded, columnar, vectorized OLAP DBMS that is queried with SQL.

We will present the architecture and characteristics of DuckDB, explain why it was created and how it achieves its performance, with a description of its ART indexing, and discuss the cases in which DuckDB should or should not be used. Finally, a demo will illustrate the basic use of DuckDB in a Python notebook and how it relates to Pandas and Apache Arrow.
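
To make the columnar, vectorized idea concrete, here is a minimal pure-Python sketch. It is not DuckDB's implementation and does not use the DuckDB library: it only shows values stored column by column and an aggregate computed over fixed-size batches ("vectors") of those columns instead of row by row. The table contents, the `VECTOR_SIZE` value and the function names are all assumptions chosen for readability.

```python
from array import array
from typing import Iterator

VECTOR_SIZE = 4  # illustrative batch size, kept small for readability

# Column-oriented layout: one contiguous typed array per column.
table = {
    "quantity": array("q", [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]),
    "price": array("d", [2.0, 7.5, 1.0, 8.0, 2.5, 1.0, 6.0, 3.0, 4.0, 5.0]),
}


def vectors(column: array, size: int = VECTOR_SIZE) -> Iterator[array]:
    """Yield consecutive fixed-size slices of a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def total_revenue(quantity: array, price: array) -> float:
    """Compute sum(quantity * price) one vector at a time."""
    total = 0.0
    for q_vec, p_vec in zip(vectors(quantity), vectors(price)):
        total += sum(q * p for q, p in zip(q_vec, p_vec))
    return total


if __name__ == "__main__":
    print(total_revenue(table["quantity"], table["price"]))
```

Only the two columns needed by the aggregate are touched, which is the main appeal of a columnar layout for analytical queries.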

Using LXD with Terraform for local development environments

  • Speaker: Gauthier LEONARD
  • Duration: 1h
  • Format: talk + demo
  • Schedule: Thursday, September 15, 2022 at 3:30 p.m

LXD is a modern, secure and powerful system container and virtual machine manager. LXD presents significant advantages over other common virtualization tools (namely Vagrant):

  • Unified interface for managing containers, virtual machines and networks
  • Super fast provisioning thanks to system containers
  • Live resizing of containers/VMs
  • Works both locally and on multi-host clusters (therefore useful for both development and production)

Nevertheless, the LXD API, LXC CLI and cloud-init are quite difficult to understand for new users and do not allow easy versioning of environment configurations.

The LXD Terraform provider is an elegant solution for doing infra-as-code on top of LXD. In the demo we will see how to migrate from Vagrant+VirtualBox to Terraform+LXD for local development environments.

Comparison of Data Quality Validation Frameworks

Data quality is an important issue that many companies have yet to address effectively.

Even when tests are implemented, they are often executed manually on a subset of tables. I was recently involved in setting up an automated pipeline. Based on the project requirements and the technical stack, I suggested several libraries suited to the purpose and built a PoC with the chosen one.

I would like to share my experience on the subject, describe the currently most popular frameworks for data validation and present their pros and cons. These frameworks are:

  • Deequ
  • Great Expectations
  • Delta Live Tables (DLT)
  • Soda

A brief look at Apache Arrow

  • Speaker: Albert Konrad
  • Duration: 45 min
  • Format: talk + demo
  • Schedule: Thursday, September 15, 2022 at 5:15 p.m

Is it a software development platform? Is it an in-memory data storage format? Or is it just a file format? No, it's Apache Arrow.

We'll take a very brief look at what Apache Arrow is, what problems it solves, and discuss how it appeals to data engineers. In a quick demo, we will also test if Apache Arrow lives up to its promise.

Introduction to SingleStoreDB, the database for transactional and analytical workloads

  • Speaker: Sergey Kudinov
  • Duration: 1h15
  • Format: presentation
  • Schedule: Friday, September 16, 2022 at 9:30 a.m

SingleStoreDB unifies transactions and analytics in a single engine to power low-latency access to large data sets. With its patented Universal Storage, SingleStore allows operational and analytical workloads to be processed using a single table type. Built for developers and architects, SingleStoreDB is based on a distributed SQL architecture that delivers 10-100 millisecond performance on complex queries.

The presentation will cover the architecture and optimization techniques by which SingleStore gains performance.

Introduction to Apache Iceberg, the open table format

  • Speaker: Yanis Bariteau
  • Duration: 1h15
  • Format: presentation + demo
  • Schedule: Friday, September 16, 2022 at 10:45 a.m

Iceberg is currently employed by organizations including Netflix, Apple, Adobe, LinkedIn, Expedia, Stripe and others as the open standard for large analytic tables in the cloud.

It is a table format for analytical datasets that can interoperate with a wide range of compute engines. It offers many capabilities that enable data professionals to handle big data successfully, even at tens of petabytes in size, in addition to high-performance queries on data at rest.

Introduction to Apache Kyuubi

  • Speaker: Guillaume Holdorf
  • Duration: 45 min
  • Format: presentation
  • Schedule: Friday, September 16, 2022 at 12:00

Apache Kyuubi democratizes access to your data storage solution by allowing SQL queries from any ODBC/JDBC client. Kyuubi servers let you serve a large number of requests in a distributed manner and ensure high availability, high performance and secure access to your data.

In this presentation we will go through the different features of Apache Kyuubi and what they make possible.

Vector databases, Milvus Overview

  • Speaker: Tobias Chavarria
  • Duration: 45 min
  • Format: presentation + demo
  • Schedule: Friday, September 16, 2022 at 2:30 p.m

Milvus is an open source vector database, built for scalable similarity search. It is part of the LF AI & Data Foundation.

Providing features such as CRUD operations, metadata filtering and horizontal scaling, Milvus is:

  • Highly available
  • Highly scalable
  • Cloud native
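
The similarity search mentioned above can be illustrated with a brute-force sketch in plain Python. This is not how Milvus works internally and it does not use the Milvus client: it simply ranks stored vectors by cosine similarity to a query vector. The document names and vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Return the k stored vectors most similar to the query."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings.
    store = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], store, k=2))
```

A scan like this is linear in the number of stored vectors; vector databases rely on dedicated index structures to keep searches fast at scale.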

tdp-server, REST service management for tdp cluster

  • Speaker: Guillaume BOUTRY
  • Duration: 1h
  • Format: talk + demo
  • Schedule: Friday, September 16, 2022 at 3:15 p.m

tdp-server is the web service that exposes REST APIs over tdp-lib core functionality while providing multi-user capabilities, security and more contextual information to implementations.

As a reminder, tdp-lib's core features are task scheduling (through a DAG definition) and variable versioning (through git repositories).
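
As a rough illustration of task scheduling through a DAG, the sketch below uses Python's standard `graphlib` module to derive an execution order from task dependencies. It does not use tdp-lib, and the task names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key depends on the tasks listed in its set.
dependencies = {
    "zookeeper_install": set(),
    "zookeeper_start": {"zookeeper_install"},
    "hdfs_install": set(),
    "hdfs_start": {"hdfs_install", "zookeeper_start"},
}


def schedule(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order: every task after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(schedule(dependencies))
```

Several orders can be valid for the same graph; the only guarantee is that a task never runs before the tasks it depends on.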

With tdp-server you can manage services and components as resources, using the different endpoints to read and change their configuration: GET, PUT (replaces the configuration) and PATCH (modifies it partially). You cannot add services or components with POST, nor remove them with DELETE. Discovering which services and components are available is done through tdp-lib and its discovery functions.
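
The difference between replacing and partially modifying a configuration can be sketched on plain dictionaries. This is not tdp-server code: the function names and the configuration keys are hypothetical, and the recursive merge used for the partial update is an assumption about how such an update could behave.

```python
import copy


def put(current: dict, payload: dict) -> dict:
    """PUT semantics: the payload replaces the whole configuration."""
    return copy.deepcopy(payload)


def patch(current: dict, payload: dict) -> dict:
    """PATCH semantics: merge the payload into the existing configuration."""
    result = copy.deepcopy(current)
    for key, value in payload.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Nested sections are merged recursively (an assumption).
            result[key] = patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


if __name__ == "__main__":
    # Hypothetical component configuration.
    config = {"heap": "1g", "ports": {"http": 8080, "rpc": 9000}}
    print(put(config, {"heap": "2g"}))
    print(patch(config, {"ports": {"http": 8081}}))
```

With the replace operation, keys missing from the payload disappear; with the partial update, they are kept as they were.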

Then comes the most important function, deploy, which lets you perform operations on the cluster. It is a simple endpoint that takes three parameters: targets, sources and filter.

Ballista, a Rust-based distributed query engine

  • Speaker: Gonzalo Etse
  • Duration: 45 min
  • Format: presentation
  • Schedule: Friday, September 16, 2022 at 4:15 p.m

Ballista is a distributed compute engine built with Rust that leverages Apache Arrow, Arrow Flight and DataFusion. Its modern architecture allows other programming languages, such as Python, C++ and Java, to be supported without paying a serialization penalty.

Apache Arrow provides the in-memory data format, while Arrow Flight enables efficient data transfer between processes. DataFusion, together with technologies such as Google Protocol Buffers, makes for fast and efficient use of memory in various applications.

Ballista is still a work in progress and is implemented on top of DataFusion. Although still in its early stages, the architecture provides excellent memory efficiency and memory usage can be 5x – 10x lower than Apache Spark in some cases, meaning more processing can fit on a single node, reducing the overhead of distributed computing.

Data protection around the world

  • Speaker: Paul Farault
  • Duration: 45 min
  • Format: talk
  • Schedule: Friday, September 16, 2022 at 5:00 p.m

Data protection is a fundamental topic for companies. Not only for personal data (for customers, users or employees), but also for the company's own data.

Both are discussed, starting from the Alstom case, in which the company faced the FCPA and the DOJ in 2014, and moving on to the basic rules on personal data protection introduced by the GDPR.

This presentation marks the first step in a series on data protection. Future sections will address technical responses to these problems.
