Building and Protecting Data Lakehouse Projects with Cloudian and Vertica

Henry Golas

Over the past year, Cloudian has greatly expanded its support for data analytics through new partnerships. One of those key partnerships is with Vertica, where the combination of Vertica and Cloudian HyperStore enables organizations to build and protect data lakehouses for modern data analytics applications.

This blog highlights the three main use cases we’re currently serving together:

Starting a data lakehouse with Vertica in Eon mode and Cloudian
Extending the data lakehouse with Vertica external tables and Cloudian
Protecting Vertica datasets with data backup to Cloudian

Just as a reminder, Vertica is a unified analytics data warehouse platform, based on a massively scalable architecture, and Cloudian is a software-defined, limitlessly scalable, S3-compatible object storage platform for on-premises and hybrid cloud environments.

Starting a Data Lakehouse with Vertica in Eon Mode and Cloudian

In the data analytics space, Vertica is known for performance, whether it runs in “Enterprise Mode” or “Eon Mode.” In Enterprise Mode, each database node stores a portion of the dataset and performs a portion of the computation. In Eon Mode, Vertica brings its cloud architecture to on-premises deployments and decouples compute from storage: each Vertica node accesses a shared communal storage space via the S3 API. The advantages are: a) compute can be scaled as required without having to scale storage, meaning no more server sprawl, and b) storage can be consolidated into a single platform and accessed by various data tools.

Building out Vertica communal storage on Cloudian is easy. For this exercise, we assume that functional Vertica and Cloudian HyperStore instances are in place and can communicate over HTTP(S):

Configure a bucket via Cloudian Management Console (CMC) on your HyperStore cluster:
- Let’s use the name “verticabucketoncloudian” for this example.
Create an auth_params.conf file:
- On your Vertica node, create an auth_params.conf file that will be accessible when you create the Vertica database instance.
  The required auth_params.conf values are:
  awsauth = Access_Key:Secret_Key
  awsendpoint = HyperstoreAddress:Port (either 443 or 80)
  awsenablehttps = 0 (required only if not using HTTPS)
Create your Vertica in Eon Mode database instance:
- On your Vertica node, create the database instance. Specify the location of your auth_params.conf file to leverage a Cloudian S3 bucket for communal storage.
  admintools -t create_db -x auth_params.conf \
    --communal-storage-location=s3://verticabucketoncloudian \
    --depot-path=/home/dbadmin/depot --shard-count=6 \
    -s vnode01,vnode02,vnode03,vnode04,vnode05,vnode06 \
    -d verticadb -p 'YourDBAdminPasswordHere'
Success! Let’s test.
- Once the above command returns successfully, you can test the Vertica in Eon Mode instance.
- Connect to your db instance and load a dataset.
- Connect to Cloudian bucket “verticabucketoncloudian” via CMC or S3 browser, and you will see objects in the bucket.
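
The auth_params.conf step above can also be scripted. What follows is a minimal sketch, assuming placeholder credentials and a hypothetical HyperStore address (hyperstore.example.com:80) reached over plain HTTP; substitute your own values before using the file with admintools.

```shell
# Sketch: generate auth_params.conf for Vertica communal storage on HyperStore.
# Every value below is a placeholder (assumption), not a real credential or host.
ACCESS_KEY="Access_Key"
SECRET_KEY="Secret_Key"
ENDPOINT="hyperstore.example.com:80"   # hypothetical HyperStore address:port
ENABLE_HTTPS=0                         # 0 because this example endpoint uses plain HTTP

# The file holds S3 credentials, so create it readable by the owner only.
umask 077
cat > auth_params.conf <<EOF
awsauth = ${ACCESS_KEY}:${SECRET_KEY}
awsendpoint = ${ENDPOINT}
awsenablehttps = ${ENABLE_HTTPS}
EOF

echo "Wrote auth_params.conf"
```

Generating the file from variables keeps the credentials in one place and makes it easy to switch the endpoint between port 80 and port 443.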

Extending the Data Lakehouse with Vertica External Tables and Cloudian

One of the key tenets of a successful data lakehouse initiative is the ability to access and analyze datasets that have been generated by other analytics platforms.

Prior to the data lakehouse, an ETL (Extract Transform Load) operation would have been required to move data from one analytics platform to another. Today, Vertica can analyze the data in-place by leveraging external tables, without the need for complex and expensive data moves.

Let’s consider the following scenario: we have an ORC dataset, generated by an Apache Hive instance and stored on Cloudian, and we need to connect to it with Vertica. To analyze this dataset in-place, use the following Vertica syntax to connect to the ORC dataset:
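
As a minimal sketch of what such a statement might look like, the script below writes the SQL to a file that can then be run with vsql. The table name, column list, bucket path, endpoint, and credentials are all hypothetical, and the column definitions would need to match the actual ORC schema.

```shell
# Sketch: write the external-table SQL to a file. All names, paths, and
# credentials inside the SQL are hypothetical placeholders (assumptions).
cat > create_external_table.sql <<'SQL'
-- Point this session at HyperStore (placeholder credentials and endpoint).
ALTER SESSION SET AWSAuth = 'accesskeyid:secretaccesskey';
ALTER SESSION SET AWSEndpoint = 'hyperstore.example.com:80';
ALTER SESSION SET AWSEnableHttps = 0;

-- Define an external table over the ORC files; the data stays on Cloudian.
-- The column list must match the schema of the ORC files.
CREATE EXTERNAL TABLE sales_orc (
    sale_id  INT,
    customer VARCHAR(64),
    amount   FLOAT
) AS COPY FROM 's3://hivebucketoncloudian/sales/*' ORC;

-- Query the dataset in place.
SELECT COUNT(*) FROM sales_orc;
SQL

# Run it against the database, for example:
#   vsql -U dbadmin -f create_external_table.sql
echo "Wrote create_external_table.sql"
```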

That is much simpler and easier than working through any data ETL.

Here are the details for the S3 parameters and configuration.

Protecting Vertica Datasets with Data Backup to Cloudian

As with all datasets, backups are key to protecting and preserving data. For this purpose, Vertica has its own backup and recovery tool called “vbr,” and vbr can use Cloudian as a backup target.

Vertica has thoroughly documented the process, but here’s a condensed version:

Configure connectivity and credentials for HyperStore
1. HyperStore credentials must be configured in two places: within the database, as a security function, and as environment variables so that vbr can connect.
  - For the database that is going to be backed up, set the AWSAuth credentials (S3 credentials):
    ALTER DATABASE DEFAULT SET AWSAuth = 'accesskeyid:secretaccesskey';
2. Configure the HyperStore endpoint URL and credentials for vbr:
  export VBR_COMMUNAL_STORAGE_ENDPOINT_URL=http://
  export VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID=
  export VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY=
  export VBR_BACKUP_STORAGE_ENDPOINT_URL=http://
  export VBR_BACKUP_STORAGE_ACCESS_KEY_ID=
  export VBR_BACKUP_STORAGE_SECRET_ACCESS_KEY=
  - Keep in mind that you can back up to the same endpoint as the communal storage, using the same credentials but a different bucket. Alternatively, you can back up to a second endpoint with different credentials. Most users will want to back up to a different bucket to reduce associated cost.
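
The environment setup above can be kept in a small file and checked before each backup run. This is a minimal sketch, assuming a hypothetical endpoint (hyperstore.example.com:80), placeholder keys, and the case where backup and communal storage share one endpoint and one set of credentials; the file name vbr_env.sh is also an assumption.

```shell
# Sketch: keep the vbr connection settings in one file and load them before a backup.
# The endpoint and keys are placeholders (assumptions), not real values.
umask 077   # the file contains credentials, so keep it private to the owner
cat > vbr_env.sh <<'EOF'
export VBR_COMMUNAL_STORAGE_ENDPOINT_URL="http://hyperstore.example.com:80"
export VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID="accesskeyid"
export VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY="secretaccesskey"
export VBR_BACKUP_STORAGE_ENDPOINT_URL="$VBR_COMMUNAL_STORAGE_ENDPOINT_URL"
export VBR_BACKUP_STORAGE_ACCESS_KEY_ID="$VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID"
export VBR_BACKUP_STORAGE_SECRET_ACCESS_KEY="$VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY"
EOF

# Load the variables into the current shell, then confirm none of them is empty.
. ./vbr_env.sh
missing=0
for name in VBR_COMMUNAL_STORAGE_ENDPOINT_URL VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID \
            VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY VBR_BACKUP_STORAGE_ENDPOINT_URL \
            VBR_BACKUP_STORAGE_ACCESS_KEY_ID VBR_BACKUP_STORAGE_SECRET_ACCESS_KEY; do
  eval "value=\${$name}"
  if [ -z "$value" ]; then
    echo "Not set: $name" >&2
    missing=1
  fi
done

if [ "$missing" -eq 0 ]; then
  echo "vbr environment is ready"
fi
```

To back up to a second endpoint instead, replace the three VBR_BACKUP_STORAGE values with that endpoint's URL and keys.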
Setting the configuration file for vbr
1. Some additional parameters must be stored in a configuration file for Vertica to back up to and restore from Cloudian successfully.
2. Create a file called “eon_backup_restore.ini” in the home directory of dbadmin.
  As a quick reference, /opt/vertica/share/vbr/example_configs contains examples for cloud backups.
  eon_backup_restore.ini
  [CloudStorage]
  cloud_storage_backup_path = s3://verticabackuponcloudian/fullbackup/
  cloud_storage_backup_file_system_path = []:/home/dbadmin/backup_locks_dir/
  cloud_storage_concurrency_backup = 10
  cloud_storage_concurrency_restore = 10
  [Misc]
  snapshotName = EONbackup_snapshot
  tempDir = /tmp/vbr
  restorePointLimit = 1
  [Database]
  dbName =
  dbPromptForPassword = True
  dbUser = dbadmin
Target initialization and performing data backup
1. Vertica requires the S3 bucket to be initialized prior to use:
  - vbr -t init -c eon_backup_restore.ini
    Initializing backup locations.
    Backup locations initialized.
2. Run the Vertica backup
  - vbr -t backup -c eon_backup_restore.ini
    Enter vertica password:
    Starting backup of database VMart.
    Participating nodes: v_vmart_node0001, …., v_vmart_node0006.
    Snapshotting database.
    Snapshot complete.
    Approximate bytes to copy: x of y total.
    [================================================] 100%
    Copying backup metadata.
    Finalizing backup.
    Backup complete!

I hope this tech blog post helps make your Cloudian and Vertica data lakehouse project a success.

For more information about Cloudian data lakehouse / data analytics solutions, go to S3 Data Lakehouse for Modern Data Analytics.

Henry Golas, Director of Technology, Cloudian
