Splunk Architecture: Data Flow, Components and Topologies
Splunk is a distributed system that aggregates, parses and analyzes log data. In this article we’ll help you understand how the Splunk big data pipeline works, how components like the forwarder, indexer and search head interact, and the different topologies you can use to scale your Splunk deployment.
In this article you will learn:
- Stages in the Splunk data pipeline
- Splunk Enterprise vs Splunk Light
- Splunk components
- Putting it all together: the Splunk architecture
How Splunk Works: Stages in the Data Pipeline
Splunk is a distributed system that ingests, processes and indexes log data. Splunk processes data in three stages:
- Data Input – Splunk ingests the raw data stream from the source, breaks it into 64K blocks, and adds metadata keys, including hostname, source, character encoding, and the index the data should be stored in (see the input configuration sketch after this list).
- Data Storage – Splunk parses log data by breaking it into lines, identifying timestamps, creating individual events, and annotating them with metadata keys. It then transforms event data using transformation rules defined by the operator. Finally, Splunk writes the parsed events to disk, pointing to them from an index file which enables fast search across huge data volumes.
- Data Search – at this stage Splunk enables users to query, view and use the event data. Based on the user’s reporting needs, it creates objects like reports, dashboards and alerts.
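To make the Data Input stage concrete, here is a minimal sketch of a file-monitoring input as you might define it in inputs.conf on a forwarder. The file path, index name, and sourcetype are hypothetical placeholders, not values from this article.

```ini
# inputs.conf (on a forwarder) -- hypothetical example
# Monitor a log file, tag events with a sourcetype, and route them
# to a named index; Splunk adds host/source metadata automatically.
[monitor:///var/log/nginx/access.log]
index = web
sourcetype = nginx:access
disabled = false
```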
Splunk Enterprise vs Splunk Light: How Does it Affect Your Architecture?
Splunk is available in three versions:
- Splunk Light – the free version
- Splunk Enterprise – the paid version
- Splunk Cloud – provided as a service with subscription pricing
Your selection of a Splunk edition will affect your architecture, as summarized in the table below.
| Splunk Edition | Limitations | Architectural Considerations |
| --- | --- | --- |
| Light | Up to 500MB of indexing per day | Supports only a single instance |
| Enterprise | Unlimited | Supports single-site clustering and multi-site clustering for disaster recovery |
| Cloud | Depends on the service package | Clustering managed by Splunk |
Splunk Components
The primary components in the Splunk architecture are the forwarder, the indexer, and the search head.
Splunk Forwarder
The forwarder is an agent you deploy on IT systems, which collects logs and sends them to the indexer. Splunk has two types of forwarders (a sample forwarding configuration follows this list):
- Universal Forwarder – forwards the raw data without any prior treatment. This is faster and requires fewer resources on the host, but results in huge quantities of data being sent to the indexer.
- Heavy Forwarder – performs parsing and indexing at the source, on the host machine, and sends only the parsed events to the indexer.
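As a minimal sketch, a universal forwarder's destination indexers are defined in outputs.conf. The host names below are assumptions; 9997 is simply Splunk's conventional receiving port.

```ini
# outputs.conf (on a universal forwarder) -- hypothetical example
# Send all collected data to a group of indexers; the forwarder
# load-balances across the servers in the target group.
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
```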
Splunk Indexer
The indexer transforms incoming data into events (unless the data was received pre-processed from a heavy forwarder), stores them on disk, and adds them to an index, enabling searchability.
The indexer creates the following files, separating them into directories called buckets:
- Compressed raw data
- Indexes pointing to the raw data (.tsidx files)
- Metadata files
The indexer performs generic event processing on log data, such as applying timestamps and adding source metadata, and can also execute user-defined transformation actions to extract specific information or apply special rules, such as filtering unwanted events.
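For example, filtering unwanted events is typically done by routing them to Splunk's null queue at parse time. The sketch below assumes a hypothetical sourcetype and health-check pattern:

```ini
# props.conf (on the indexer or heavy forwarder) -- hypothetical example
# Apply a named transform to all events of this sourcetype.
[nginx:access]
TRANSFORMS-drop_health = drop_healthchecks

# transforms.conf -- send matching events to the null queue (discard them)
[drop_healthchecks]
REGEX = GET /healthz
DEST_KEY = queue
FORMAT = nullQueue
```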
In Splunk Enterprise, you can set up a cluster of indexers with replication between them, to avoid data loss and provide more system resources and storage space to handle large data volumes.
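As a rough sketch, an indexer cluster is defined in server.conf; replication_factor and search_factor control how many copies of the data and of its searchable indexes the cluster keeps. The values below are illustrative, not recommendations.

```ini
# server.conf (on the cluster manager node) -- hypothetical example
# Older Splunk versions use "mode = master" instead of "mode = manager".
[clustering]
mode = manager
replication_factor = 3
search_factor = 2
# pass4SymmKey = <shared secret used by all cluster members>
```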
Splunk Search Head
The search head provides the UI through which users interact with Splunk. It allows users to search and query Splunk data, and interfaces with the indexers to gain access to the specific data they request.
Splunk provides a distributed search architecture, which allows you to scale up to handle large data volumes, and better handle access control and geo-dispersed data. In a distributed search scenario, the search head sends search requests to a group of indexers, also called search peers. The indexers perform the search locally and return results to the search head, which merges the results and returns them to the user.
[Diagram: Splunk distributed search. Source: Splunk Documentation]
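As a minimal sketch, the search peers a search head fans out to are listed in distsearch.conf. Peers are normally registered with the `splunk add search-server` CLI command, which also exchanges trust keys; the host names below are placeholders.

```ini
# distsearch.conf (on the search head) -- hypothetical example
# Each entry is the management URI of an indexer acting as a search peer.
[distributedSearch]
servers = https://idx1.example.com:8089, https://idx2.example.com:8089
```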
There are a few common topologies for distributed search in Splunk:
- One or more independent search heads to search across indexers (each can be used for a different type of data)
- Multiple search heads in a search head cluster – with all search heads sharing the same configuration and jobs. This is a way to scale up search.
- Search heads as part of an indexer cluster – promotes data availability and data recovery.
Putting it All Together: Splunk Architecture
The following diagram illustrates the Splunk architecture as a whole.
[Diagram: end-to-end Splunk architecture. Source: Splunk Documentation]
From top to bottom:
- Splunk gathers logs by monitoring files, detecting file changes, listening on ports or running scripts to collect log data – all of these are carried out by the Splunk forwarder.
- The indexing mechanism, composed of one or more indexers, processes the data, or may receive it pre-processed by the forwarders.
- The deployment server manages indexers and search heads, distributing configuration and policies across the entire Splunk deployment (a sample server-class definition follows this list).
- User access and controls are applied at the indexer level – each indexer can be used for a different data store, which may have different user permissions.
- The search head is used to provide on-demand search functionality, and also powers scheduled searches initiated by automatic reports.
- The user can define Scheduling, Reporting and Knowledge objects to schedule searches and create alerts.
- Data can be accessed from the UI, the Splunk CLI, or APIs integrating with numerous external systems.
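To illustrate the deployment server's role, the sketch below shows a hypothetical server class in serverclass.conf that pushes a configuration app to matching clients; the class name, whitelist pattern, and app name are invented for the example.

```ini
# serverclass.conf (on the deployment server) -- hypothetical example
# Push the "nginx_inputs" app to every client whose name matches web-*.
[serverClass:web_servers]
whitelist.0 = web-*

[serverClass:web_servers:app:nginx_inputs]
stateOnClient = enabled
restartSplunkd = true
```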
Reduce Splunk Storage Costs by 70% with SmartStore and Cloudian
Splunk’s new SmartStore feature allows indexers to store indexed data on cloud object storage such as Amazon S3. Cloudian HyperStore is an S3-compatible, exabyte-scalable on-prem storage pool that SmartStore can connect to. Cloudian lets you decouple compute and storage in your Splunk architecture and scale storage independently of compute resources.
You can configure SmartStore to retain hot data on the indexer machine, and move warm or cold data to on-prem Cloudian storage. Cloudian creates a single data lake with seamless, modular growth. You can simply add more Cloudian units, with up to 840TB in a 4U chassis, to expand from terabytes to an exabyte. It also offers up to 14 nines durability.
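As a rough sketch, SmartStore is enabled per index in indexes.conf by defining a remote volume and pointing the index's remotePath at it. The bucket name, endpoint, and index name below are placeholders for your Cloudian (or other S3-compatible) target.

```ini
# indexes.conf (on the indexers) -- hypothetical example
[volume:remote_store]
storageType = remote
path = s3://splunk-smartstore-bucket
remote.s3.endpoint = https://s3.cloudian.example.com

# The index keeps hot buckets locally; warm buckets are uploaded
# to the remote volume and fetched back into a local cache on demand.
[web]
homePath   = $SPLUNK_DB/web/db
coldPath   = $SPLUNK_DB/web/colddb
thawedPath = $SPLUNK_DB/web/thaweddb
remotePath = volume:remote_store/$_index_name
```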
Learn more about Cloudian’s big data storage solutions and Cloudian’s solution for Splunk storage.
Learn More About Splunk Architecture
Splunk architecture is a broad topic. To understand how Splunk works, you should learn about its components, how you can save storage costs when scaling your deployment, and how to analyze big data with Splunk. Read our additional articles below for information that will help you understand and optimize Splunk storage.
Splunk Big Data: a Beginner’s Guide
Splunk is a tool you can use to derive value from your big data. It enables you to incorporate insights from a variety of tools, allowing you to collect, search, index, analyze, and visualize your data from a central location. Splunk supports extracting and organizing real-time insights from big data regardless of source. You can integrate Splunk with NoSQL and relational databases, and establish connections between your workflow tools and Splunk.
Read more: Splunk Big Data: a Beginner’s Guide
Splunk Data Analytics: Splunk Enterprise or Splunk Hunk?
There are two main ways to use Splunk for data analytics: Splunk Enterprise, which collects log data from across the enterprise and makes it available for analysis, and Splunk Hunk, which indexes and queries Hadoop data, creating dashboards and reports directly from Hadoop datasets.
This article reviews the second method, explaining how Hunk can help you make sense of legacy Hadoop datasets.
Read more: Splunk Data Analytics: Splunk Enterprise or Splunk Hunk?
Splunk Storage Calculator: Learn to Estimate Your Storage Costs
In Splunk, you store data in indexes made up of file buckets. These buckets contain data structures that enable Splunk to quickly determine whether the data contains particular terms or words, alongside the compressed raw data itself. Once compressed, raw data is typically reduced to about 15% of its original size, which helps Splunk store data efficiently.
Unfortunately, there is no official Splunk storage calculator, but there are techniques you can use to estimate storage requirements yourself. All you need is an understanding of Splunk data and storage tiers and the ability to use CLI commands.
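As a back-of-the-envelope sketch, assume 100 GB/day of ingest and 90 days of retention; using the ~15% raw-compression figure cited above plus a commonly assumed ~35% overhead for index (tsidx and metadata) files, per-copy storage works out roughly as:

```latex
% Assumed inputs: D = 100 GB/day ingest, R = 90 days retention.
\text{daily disk} \approx D \times (0.15 + 0.35) = 100 \times 0.5 = 50\ \text{GB/day}
\text{total} \approx 50\ \text{GB/day} \times 90\ \text{days} = 4.5\ \text{TB per index copy}
```

If you run an indexer cluster, multiply the total by your replication factor to account for the extra copies.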
Learn more: Splunk Storage Calculator: Learn to Estimate Your Storage Costs
Anatomy of a Splunk Data Model
Splunk data models are used to generate pivot reports for users. Pivot reports are visualizations, tables, or charts displaying information from a dataset search. Pivot is also the name of the tool used to create pivot reports in Splunk. In Pivot, you select the data model appropriate to the data you work with, and within that model, the dataset specific to the data you want to report on.
This article explains how Splunk data models and datasets work, how to define a data model using the Splunk editor, and important best practices for efficient data model design.
Read more: Anatomy of a Splunk Data Model
Splunk Backup: What are Your Options?
Like any enterprise system, Splunk must be supported by a data backup plan. There are two main backup strategies to protect your Splunk data. You can back up Splunk index data to an on-premises storage device using the Splunk data lifecycle stages, or you can use the SmartStore indexer to back up data to cloud storage such as Amazon S3, or to local S3-compatible storage devices.
This article explains how to back up Splunk workloads using on-premises storage, cloud storage via the new SmartStore connector, and on-premises object storage.
Read more: Splunk Backup: What are Your Options?
See Our Additional Guides on Key Data Storage Topics:
We have authored in-depth guides on several other data storage topics that can also be useful as you explore the world of Splunk.
Data Protection Guide
Data protection relies on technologies such as data loss prevention (DLP), storage with built-in data protection, firewalls, encryption, and endpoint protection. Learn the difference between data protection and data privacy, and how to leverage best practices to ensure the continual protection of your data.
See top articles in our data protection guide:
- Keeping Up with Data Protection Regulations
- Data Availability: Ensuring the Continued Functioning of Business Operations
- How You Can Maintain Secure Data Storage
Data Backup Guide
Data backup is a practice that combines techniques and solutions for efficient and cost-effective backup. Your data is copied to one or more locations, at pre-determined frequencies, and at different capacities. Learn what a data backup is and how it differs from archiving, what benefits it offers, and what considerations you should take into account before deploying data backup solutions.
See top articles in our data backup guide:
- Ensuring Your Data with Effective Backup Storage
- NAS Backup
- Using Storage Archives to Secure Data and Reduce Cost
Hybrid IT Guide
Hybrid IT is a blend of on-premises and cloud-based services that has emerged with the increasing migration of businesses to cloud environments. Learn about hybrid IT, its implementation solutions and practices, and discover how Cloudian can help optimize your implementation.
See top articles in our Hybrid IT guide:
- Hybrid Cloud Management: What You Need to Know
- Hybrid Cloud Architecture: Selecting the Best of Both Worlds
- Multi-Cloud Management: 5 Critical Considerations
IT Disaster Recovery Guide
IT disaster recovery is the practice of anticipating, planning for, surviving, and recovering from a disaster that may affect a business. Learn what disaster recovery is, how it can benefit your business, and four essential features any disaster recovery program must include to be effective.
See top articles in our IT disaster recovery guide:
VMware Storage Guide
VMware provides a variety of ways for virtual machines to access storage. It supports multiple traditional storage models, including SAN, NFS, and Fibre Channel (FC), which allow virtualized applications to access storage resources in the same way they would on a regular physical machine.
See top articles in our VMware storage guide:
- VMware Data Protection is EOA: 5 Great Alternatives
- VMware Backup: Three Approaches
- VMware Cloud Director 101: Architecture, Features and Concepts
Health Data Management Guide
Health Data Management (HDM), also known as Health Information Management (HIM), is the systematic organization of health data in digital form. Learn what health data management is, the types of data it encompasses, and the unique challenges and considerations for storing petabytes of health data.
See top articles in our health data management guide: