Path to Data Insights – Data Observability Platform with an S3-Compatible Data Lake

Organizations today face the challenge of not only storing large data volumes but also being able to ingest, access and analyze the data to get insights on demand. That’s why they’re turning to observability solutions backed by scalable S3-compatible data lakes.

 

Steve Connors, Senior Alliances Manager, Cloudian


 More than 90% of all data in the world was generated in just the last few years, and, according to IDC, the amount of data generated annually is expected to nearly triple from 2020 to 2025. This incredible growth presents the challenge of not only storing large data volumes but also, more importantly, of being able to ingest, access and analyze the data to get insights on demand. That’s why organizations are turning to data observability solutions backed by scalable S3 data lakes to break the silos and get business insights.

Monitoring vs Observability

Let’s first distinguish between monitoring and observability.

Monitoring refers to the processes and systems used by organizations to flag data for performance degradation and/or potential security breaches. Organizations can configure these systems to send alerts and to automate responses to certain events. Useful as it is, monitoring typically stays at one layer, either application or infrastructure, and is therefore only a partial solution.

Observability, on the other hand, provides full visibility into the health and performance of each layer of your environment. It starts with the capability to connect and ingest data from multiple sources into multiple tools, giving IT teams a view of everything from applications to infrastructure assets. Organizations can then observe their entire IT environment together and get insights to forecast future trends and outcomes, which helps save time and money.

The Cloudian-Cribl Observability Platform

Cribl, a leading data analytics company and a key Cloudian technology alliance partner, has developed technology to solve the challenge of ingesting, managing and analyzing data more productively and efficiently. Cribl’s Stream product is an observability pipeline that connects to various sources of data, such as networks, servers, applications, and software agents, and centralizes all of your observability data with full fidelity into an S3 data lake on Cloudian HyperStore object storage. This forms a modern observability platform based on a scale-out, S3-compatible data lake designed for managing massive amounts and varieties of data, giving IT teams full control over every aspect of the data. The data is stored in HyperStore and is always available for search and analysis. Cribl’s Replay function then lets customers pull the data back from Cloudian, applying filters and transformations as required, and feed it to higher-level analytics tools such as Splunk, Elastic and others.

The observability solution is also hybrid cloud-compatible, enabling organizations to reduce costs by having both on-prem and cloud assets simultaneously. Together, the solution built on Cribl and Cloudian lets you parse, restructure, and enrich data in flight – ensuring that you get the right data, where you want, and in the format you need. This provides organizations the opportunity to discover and understand their dynamic environments in real time and use the resulting data insights to make better informed decisions.

To learn more about the newly minted Cloudian-Cribl partnership and how it can help you, register for the upcoming Cribl-Cloudian webinar on Nov. 9, 8am PST. You can also read more about the Cribl-Cloudian solution here.

Building and Protecting Data Lakehouse Projects with Cloudian and Vertica

See how to start a data lakehouse with Vertica EON mode and Cloudian, extend the data lakehouse with Vertica external tables and Cloudian, and protect Vertica datasets with data backup to Cloudian.

Over the past year, Cloudian has greatly expanded its support for data analytics through new partnerships. One of those key partnerships is with Vertica, where the combination of Vertica and Cloudian HyperStore enables organizations to build and protect data lakehouses for modern data analytics applications.

This blog highlights the three main use cases we’re currently serving together:

  • Starting a data lakehouse with Vertica in Eon mode and Cloudian
  • Extending the data lakehouse with Vertica external tables and Cloudian
  • Protecting Vertica datasets with data backup to Cloudian

Just as a reminder, Vertica is a unified analytics data warehouse platform, based on a massively scalable architecture, and Cloudian is a software-defined, limitlessly scalable, S3-compatible object storage platform for on-premises and hybrid cloud environments.

Starting a Data Lakehouse with Vertica in Eon Mode and Cloudian

In the data analytics space, Vertica is known for performance, whether it is run in “Enterprise Mode” or “Eon Mode.” In Enterprise Mode, each database node stores a portion of the dataset and performs a portion of the computation. In Eon Mode, Vertica brings its cloud architecture to on-premises deployments and decouples compute from storage: each Vertica node accesses a shared communal storage space via the S3 API. The advantages are: a) compute can be scaled as required without having to scale storage, meaning no more server sprawl, and b) storage can be consolidated into a single platform and accessed by various data tools.

Building out Vertica communal storage on Cloudian is easy. For this exercise we are going to assume we have functional Vertica and Cloudian HyperStore instances that can communicate over HTTP(S):

  1. Configure a bucket via Cloudian Management Console (CMC) on your HyperStore cluster:
      • Let’s use the name “verticabucketoncloudian” for this example.

  2. Create an auth_params.conf file:
    • On your Vertica node, create an auth_params.conf file that will be accessible when you create the Vertica database instance.
      The auth_params.conf file requires the following values:

      awsauth = Access_Key:Secret_Key
      awsendpoint = HyperstoreAddress:Port
      awsenablehttps = 0

      Port is either 443 (HTTPS) or 80 (HTTP). The awsenablehttps = 0 line is required only when not using HTTPS.
  3. Create your Vertica in Eon Mode database instance:
    • On your Vertica node, create the database instance. Specify the location of your auth_params.conf file to leverage a Cloudian S3 bucket for communal storage.

      admintools -t create_db -x auth_params.conf \
      --communal-storage-location=s3://verticabucketoncloudian \
      --depot-path=/home/dbadmin/depot --shard-count=6 \
      -s vnode01,vnode02,vnode03,vnode04,vnode05,vnode06 -d verticadb -p 'YourDBAdminPasswordHere'
  4. Success! Let’s test.
    • Once the above command returns successfully, you can test the Vertica in Eon Mode instance.
    • Connect to your db instance and load a dataset.
    • Connect to Cloudian bucket “verticabucketoncloudian” via CMC or S3 browser, and you will see objects in the bucket.
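
Step 2 above can also be scripted. The following is a minimal sketch, not an official tool: the function name write_auth_params and its argument order are hypothetical, and it assumes that any endpoint port other than 443 means plain HTTP. The three keys it writes are the ones listed in step 2.

```shell
# Sketch: write an auth_params.conf for Vertica communal storage on HyperStore.
# Arguments: <file> <access_key> <secret_key> <host:port>
# The function name and argument order are illustrative assumptions.
write_auth_params() {
  conf_file=$1 access_key=$2 secret_key=$3 endpoint=$4

  # Start with an empty file readable only by its owner (it holds the secret key).
  : > "$conf_file" && chmod 600 "$conf_file" || return 1

  {
    echo "awsauth = ${access_key}:${secret_key}"
    echo "awsendpoint = ${endpoint}"
    # awsenablehttps = 0 is required only when not using HTTPS.
    # Assumption for this sketch: any port other than 443 means plain HTTP.
    case "$endpoint" in
      *:443) ;;
      *) echo "awsenablehttps = 0" ;;
    esac
  } >> "$conf_file"
}
```

For example, `write_auth_params auth_params.conf Access_Key Secret_Key HyperstoreAddress:80` produces a file that can be passed to admintools with -x, as in step 3.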

Extending the Data Lakehouse with Vertica External Tables and Cloudian

One of the key tenets of a successful data lakehouse initiative is the ability to access and analyze datasets that have been generated by other analytics platforms.

Prior to the data lakehouse, an ETL (Extract, Transform, Load) operation would have been required to move data from one analytics platform to another. Today, Vertica can analyze the data in place by leveraging external tables, without the need for complex and expensive data moves.

Consider the following scenario: we have an ORC dataset, generated by an Apache Hive instance and stored on Cloudian, and we need to connect to it with Vertica. To analyze this dataset in place, define a Vertica external table over the ORC files so that queries read them directly from Cloudian.
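
As a minimal sketch of what such a definition could look like: the bucket hivedataoncloudian, the path, the table name sales_orc and the three columns are all hypothetical placeholders, and the column list has to match the actual ORC schema. It also assumes the database’s S3 parameters already point at HyperStore.

```shell
# Sketch: emit a CREATE EXTERNAL TABLE statement for ORC files on S3-compatible storage.
# Bucket, path, table and column names are hypothetical placeholders.
orc_external_table_sql() {
  cat <<'EOF'
CREATE EXTERNAL TABLE sales_orc (
    sale_id   INT,
    sale_date DATE,
    amount    FLOAT
)
AS COPY FROM 's3://hivedataoncloudian/warehouse/sales/*.orc' ORC;
EOF
}

# Print the statement. To run it, pipe it to vsql with your own connection
# options, e.g.:  orc_external_table_sql | vsql -d verticadb -U dbadmin
orc_external_table_sql
```

Once the table exists, it can be queried like any other Vertica table while the data stays in the bucket.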

That is much simpler and easier than working through any data ETL.

Here are the details for the S3 parameters and configuration.

Protecting Vertica Datasets with Data Backup to Cloudian

As with all datasets, backups of data are key to protecting and preserving data. For this purpose, Vertica has its own backup and recovery tool called “vbr,” and Vertica can leverage Cloudian as a backup target.

Vertica has thoroughly documented the process, but here’s a condensed version:

  1. Configure connectivity and credentials for HyperStore
    1. HyperStore credentials are important. They are configured within the database, as a security function, and also as environment variables so that vbr can connect.
      • For the database that is going to be backed up, set the AWSAuth credentials (S3 credentials):
        ALTER DATABASE DEFAULT SET AWSAuth = 'accesskeyid:secretaccesskey';
    2. Configure vbr HyperStore URL address and credentials

      export VBR_COMMUNAL_STORAGE_ENDPOINT_URL=http://
      export VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID=
      export VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY=
      export VBR_BACKUP_STORAGE_ENDPOINT_URL=http://
      export VBR_BACKUP_STORAGE_ACCESS_KEY_ID=
      export VBR_BACKUP_STORAGE_SECRET_ACCESS_KEY=

      • Keep in mind that you can back up to the same endpoint using the same credentials as the communal storage, but to a different bucket. Alternatively, you can back up to a second endpoint with different credentials. Most users will want to back up to a different bucket to reduce associated cost.
  2. Setting the configuration file for vbr
    1. Some additional parameters must be stored in a configuration file for Vertica to successfully back up to and restore from Cloudian.
    2. Create a file called “eon_backup_restore.ini” in the home directory of dbadmin.
      As a quick reference, /opt/vertica/share/vbr/example_configs contains examples for cloud backups.

      eon_backup_restore.ini
      [CloudStorage]
      cloud_storage_backup_path = s3://verticabackuponcloudian/fullbackup/
      cloud_storage_backup_file_system_path = []:/home/dbadmin/backup_locks_dir/
      cloud_storage_concurrency_backup = 10
      cloud_storage_concurrency_restore = 10
      [Misc]
      snapshotName = EONbackup_snapshot
      tempDir = /tmp/vbr
      restorePointLimit = 1
      [Database]
      dbName =
      dbPromptForPassword = True
      dbUser = dbadmin
  3. Target initialization and performing data backup
    1. Vertica requires the S3 bucket to be initialized prior to use
      • vbr -t init -c eon_backup_restore.ini
        Initializing backup locations.
        Backup locations initialized.
    2. Run the Vertica backup
      • vbr -t backup -c eon_backup_restore.ini
        Enter vertica password:
        Starting backup of database VMart.
        Participating nodes: v_vmart_node0001, …., v_vmart_node0006.
        Snapshotting database.
        Snapshot complete.
        Approximate bytes to copy: x of y total.
        [================================================] 100%
        Copying backup metadata.
        Finalizing backup.
        Backup complete!
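
Before running vbr, it can help to confirm that the environment variables from step 1 are actually populated. The following is a minimal sketch; the function name check_vbr_env is hypothetical, and it only checks that each variable is set and non-empty in the current shell, not that it is exported or that the values are valid.

```shell
# Sketch: check that every vbr environment variable from step 1 is set and
# non-empty in the current shell. Returns non-zero if any is missing.
# The function name is an illustrative assumption.
check_vbr_env() {
  missing=0
  for name in \
    VBR_COMMUNAL_STORAGE_ENDPOINT_URL \
    VBR_COMMUNAL_STORAGE_ACCESS_KEY_ID \
    VBR_COMMUNAL_STORAGE_SECRET_ACCESS_KEY \
    VBR_BACKUP_STORAGE_ENDPOINT_URL \
    VBR_BACKUP_STORAGE_ACCESS_KEY_ID \
    VBR_BACKUP_STORAGE_SECRET_ACCESS_KEY
  do
    # Look the variable up by name; eval is safe here because the names are fixed.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be `check_vbr_env && vbr -t backup -c eon_backup_restore.ini`, so that the backup only starts when all six variables are present.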

I hope this tech blog post helps make your Cloudian and Vertica data lakehouse project a success.

For more information about Cloudian data lakehouse / data analytics solutions, go to S3 Data Lakehouse for Modern Data Analytics.

 

Henry Golas, Director of Technology, Cloudian

On-prem S3 Data Lakehouse for Modern Analytics and More

The modernization of the data analytics architecture started in the cloud, but not everyone is able to or willing to move their data to the cloud, for data gravity, security, and/or compliance reasons. Organizations can now implement an S3 data lakehouse and get the same benefits with greater control.

Amit Rawlani, Director of Solutions & Technology Alliances, Cloudian

Data Lakehouse Origin

The modernization of the data analytics architecture started in the cloud. This was driven by the limits of traditional, appliance-based data warehouses, which could not provide the needed scalability, were cumbersome and expensive to work with, and could serve only one use case. In response, companies such as Snowflake and Databricks took the traditional OLAP operations and showed the benefits of combining the flexibility, cost-efficiency, and scale of a data lake built on cloud storage (based on the S3 API) with the data management and ACID transactions of a data warehouse, thereby giving us the modern data lakehouse.

However, not everyone is able to or willing to move their data to the cloud, for data gravity, security, and/or compliance reasons. Customers – especially enterprise customers – have started replicating the same architecture that Snowflake pioneered within their own data centers and/or in hybrid cloud configurations. Specifically, by using S3-compatible object storage platforms like Cloudian HyperStore, customers can now implement an S3 data lakehouse and get the same efficiencies with greater control.

The Many Benefits of a Data Lakehouse Architecture

An on-prem data lakehouse solution offers the same cloudification of the data analytics environment, but behind the security of users’ firewalls. This gives organizations full control to implement the right security protocols, compliance measures, and audit-logging for their needs.

In addition to providing public cloud-like scalability, the data lakehouse architecture also gives customers the ability to scale the data lake (storage) independent of the compute, a big difference from the shared-everything architectures that bogged down the big data world a few years ago.

Standardization on the S3 API, the de facto storage standard of the cloud, enables enterprises to reuse applications already built for the cloud with their on-prem data lakehouse. Standardization also allows for use cases beyond analytics, such as a repository for Splunk data, immutable backup storage for ransomware protection, and more generally a repository for all unstructured data (media, images, videos, etc.), all supported in the same S3 data lakehouse.

These are just a few reasons for considering S3-compatible storage such as Cloudian for your data lakehouse solution. You can read about others at 7 Reasons to Run Data Analytics on a Cloudian S3 Data Lakehouse. In addition, to learn more about data lakehouses and how Cloudian can help you meet your data analytics needs, please visit https://cloudian.com/solutions/data-lakehouse/

From Data Warehouse to Data Lakehouse: The Evolution of Data Analytics Platforms

As a data management company, Cloudian has always been interested in how organizations manage their data. A lot of attention has been paid to the WHY and HOW of data interactions, as well as WHERE data is stored. One particularly interesting combination of why, how and where is data analytics.

Henry Golas, Director of Technology, Cloudian

To date, object storage has not had a defining role in the data analytics space. Instead, organizations have mostly relied on traditional block and file storage solutions housing structured/semi-structured data. At best organizations might have placed a database backup onto an S3 object storage target, but object storage was rarely used as a primary data repository.

Today, with the business driver of building out successful data lakehouses, analytics platforms such as Greenplum, Vertica and SQL Server 2022 now support object storage data repositories via the S3 API. Many other platforms, such as Teradata, have the functionality coming soon. This means that, as an S3-compatible object storage platform, Cloudian can be used to house a variety of data sets for a variety of analytics (and non-analytics) use cases.

WHY is this important?

A brief history of data warehousing and analytics will help explain.

 

Data warehouses have existed for decades and are great for performing specific queries on structured data, such as a company billing/invoicing system. In a data warehouse, data inputs are structured; the data isn’t growing exponentially; and many frameworks/workflows exist as part of business intelligence (BI) and reporting tools. The challenge here is that as organizations have developed and evolved, so has the relationship between the organization and its data.

In the mid-to-late 2000s, a need to collect, query and monetize a large amount of company data began to emerge. This new data was structured, semi-structured and unstructured and came from different data sources at blinding speeds. Organizations wanted to leverage data science or machine learning techniques to provide some desired output or piece of monetizable information, such as a formula that would predict the failure rate of a widget based on millions of data points. The term “data lake” was coined, and a data lake’s purpose was to store data in raw formats. The challenge here was that data lakes are good for storing data, not enforcing data quality or running transactions on top of them.

Coming back to the present, object storage and the standardization of the S3 API for communication have changed the game. From a storage perspective, object stores can store a variety of data sets, everything from structured to unstructured data. From an analytics platform/BI tool perspective, it is now possible to tap into the entire data set via the S3 API.

HOW does this all come together?

S3-based object storage enables the creation of a modern data lakehouse, where storage can be decoupled from compute, diverse analytic workloads can be supported and tools/platforms are able to access data directly with standard S3 API calls.

WHERE does this all happen?

This all happens on premises, where Cloudian underpins a data lakehouse by providing scalable, cost-effective storage which is accessible by the S3 API.

 

Use Cases

Check out Cloudian’s data lakehouse/data analytics-focused solution briefs at: Hybrid Cloud Storage for Data Analytics