Elasticsearch is a free and open-source search and analytics engine built on Apache Lucene. Elasticsearch is distributed and supports all data types, including numerical, textual, structured, unstructured, and geospatial data. Elasticsearch comes with simple REST APIs and provides features for scalability and fast search.
Elasticsearch is an important part of the Elastic Stack, a set of open-source tools for data ingestion, storage, enrichment, visualization, and analysis. Notable tools in the stack are Elasticsearch, Logstash, and Kibana (ELK).
The Elasticsearch architecture leverages the Lucene indexing build and combines it with a distributed model that separates the architecture into small components, called shards, which can be distributed across multiple nodes.
Elasticsearch is a scalable search and analytics solution that supports multi-tenancy and provides near real-time search. You can use Elasticsearch to enable searches for all types of data and various locations. The engine ingests data from multiple locations, stores it, and indexes the data according to predefined manual or automated mapping.
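To make the idea of a predefined (manual) mapping concrete, here is a minimal sketch of the JSON body you might send when creating an index with an explicit mapping. The index name (`articles`) and all field names are illustrative assumptions, not part of the original text:

```python
import json

# Illustrative explicit mapping for a hypothetical "articles" index.
# This would be the JSON body of: PUT /articles
# Field names and types are examples only.
articles_mapping = {
    "mappings": {
        "properties": {
            "title":        {"type": "text"},       # textual, full-text analyzed
            "views":        {"type": "integer"},    # numerical
            "published_on": {"type": "date"},       # structured
            "location":     {"type": "geo_point"},  # geospatial
        }
    }
}

# Serialize to the JSON payload that would accompany the request
payload = json.dumps(articles_mapping)
```

If no mapping is provided, Elasticsearch infers one automatically from the first documents it indexes (dynamic mapping); an explicit mapping like the one above gives you control over field types up front.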
Because Elasticsearch works with a distributed architecture, users can search and analyze massive volumes of data in near real-time. Additionally, Elasticsearch introduces scalability into the searching process, enabling you to start with just one machine and scale up to the hundreds.
You can use Elasticsearch to run a full-text search cluster for use cases such as document search, product search, and email search. However, this requires a high level of skill and experience. You can also use Elasticsearch to store data pending slicing and dicing, and data that needs to be grouped into categories, such as metrics, traces, and logs.
Elasticsearch is deployable in various cloud environments as well as on-premises. You can self-host Elasticsearch or use a managed cloud service such as Elastic Cloud or Amazon Elasticsearch Service (now Amazon OpenSearch Service).
Elasticsearch uses lightweight shipping agents, called Beats, to transfer raw data from multiple sources into Elasticsearch. After data is shipped into Elasticsearch, the engine runs data ingestion processes, which parse, normalize, enrich, and prepare data for indexing. After the data is indexed, users can run complex queries and use aggregations to retrieve complex data summaries.
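The combination of a query with an aggregation mentioned above can be sketched as a request body in the Elasticsearch query DSL. The index and field names (`logs`, `message`, `status`) are assumptions for illustration:

```python
import json

# Hypothetical search request combining a full-text query with an
# aggregation. This would be the body of: GET /logs/_search
search_body = {
    "query": {
        "match": {"message": "connection timeout"}  # full-text match query
    },
    "aggs": {
        "by_status": {
            "terms": {"field": "status"}  # bucket document counts per status value
        }
    },
    "size": 10,  # return at most 10 matching documents alongside the aggregation
}

payload = json.dumps(search_body)
```

A single request like this returns both the matching documents and a summary (the per-status counts), which is what makes aggregations useful for dashboards and analytics.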
For visualization and management, the Elastic Stack offers a tool called Kibana, which enables users to create real-time data visualizations, such as pie charts, maps, line graphs, and histograms. Kibana also lets you share dashboards, use Canvas to create custom dynamic infographics, and use Elastic Maps to visualize geospatial data.
The Elasticsearch architecture is built for scalability and flexibility. The core components are Elasticsearch clusters, nodes, shards, and analyzers.
An Elasticsearch cluster is composed of a group of nodes that store data. You can specify the number of nodes that start running with the cluster, as well as the IP address of the virtual or physical server. You can specify this information in the config/elasticsearch.yml file, which contains all configuration settings.
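For illustration, a minimal config/elasticsearch.yml might look like the following; the cluster name, node name, and addresses are placeholders:

```yaml
# config/elasticsearch.yml -- minimal illustrative settings (values are placeholders)
cluster.name: my-cluster          # nodes with the same cluster name join the same cluster
node.name: node-1                 # human-readable name for this node
network.host: 192.168.1.10        # IP address of the virtual or physical server to bind to
discovery.seed_hosts: ["192.168.1.10", "192.168.1.11"]  # hosts to contact on startup
```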
Nodes in an Elasticsearch cluster are connected to each other, and each node contains a chunk of cluster data. You can run as many nodes as needed; for small workloads, a single node is often sufficient. The system automatically creates a cluster when a new node starts. The nodes participate in the overall cluster processes in charge of searching and indexing.
In general, the term node refers to a server that works as part of the cluster. In Elasticsearch, a node is an instance—it is not a machine. This means you can run multiple nodes on a single machine. An Elasticsearch cluster consists of one or more nodes. By default, when an Elasticsearch instance starts, a node also starts running.
Here are the three main options for configuring an Elasticsearch node: a master-eligible node, which manages the cluster state; a data node, which stores shards and performs indexing and search operations; and a coordinating (client) node, which routes requests and distributes search work across the data nodes.
The Elasticsearch architecture uses two main ports for communication: port 9200, which exposes the REST API over HTTP for client requests, and port 9300, which handles internal node-to-node communication over the transport protocol.
There is no hard limit to the number of documents you can store in each index. However, a single index cannot grow beyond the storage limits of the node hosting it, and pushing against those limits degrades performance and can destabilize the cluster. To prevent this issue, indices are split into smaller pieces called shards.
Shards are small and scalable indexing units that serve as the building blocks of the Elasticsearch architecture. Shards enable you to distribute operations and improve overall performance. When you create an index, you specify the number of primary shards; this number cannot be changed afterward without reindexing. Each shard works as an independent Lucene index, which can be hosted anywhere in the cluster.
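As a sketch, the primary shard count is set in the index settings at creation time. The index name and numbers below are placeholders:

```python
import json

# Illustrative index-creation body fixing the number of primary shards.
# This would be the body of: PUT /my-index
index_settings = {
    "settings": {
        "number_of_shards": 3,     # primary shards, fixed when the index is created
        "number_of_replicas": 1,   # replica copies per primary, adjustable later
    }
}

payload = json.dumps(index_settings)
```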
In Elasticsearch, replicas are copies of index shards. Replicas serve as a fail-safe mechanism for backup and recovery purposes. A replica is never placed on the same node as its primary shard.
To ensure availability, replicas are stored in different locations. Unlike the primary shard count, the number of replicas can be changed after the index is created, and you can create as many replicas as needed. This means you can store more replicas than primary shards.
Analyzers are responsible for parsing phrases and expressions into their constituent terms. This occurs during the indexing process. Each analyzer is composed of one tokenizer and, optionally, one or more token filters. The tokenizer splits a string into terms according to pre-defined rules, and the token filters then modify, add, or remove the resulting terms.
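The tokenizer-plus-filters structure can be sketched as a custom analyzer definition in the index settings. The analyzer name is an illustrative assumption; `standard`, `lowercase`, and `stop` are built-in Elasticsearch components:

```python
# Sketch of a custom analyzer: one tokenizer plus a chain of token filters.
# This would go in the "settings" section of an index-creation request.
analysis_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {                     # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",         # splits text into terms
                    "filter": ["lowercase", "stop"]  # filters applied in order
                }
            }
        }
    }
}
```

With this definition, a phrase like "The Quick Fox" would be tokenized into terms, lowercased, and stripped of stopwords before being indexed.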
The Elasticsearch architecture is designed to support the retrieval of documents, which are stored as JSON objects. Elasticsearch supports nested structures, which helps handle complex data and queries. To track information, Elasticsearch uses keys prefixed with an underscore (such as _index, _id, and _source), which represent metadata.
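The shape of a stored document as it comes back from a search illustrates the underscore convention. The index name, field names, and values here are illustrative:

```python
# Shape of a single search hit as returned by Elasticsearch: metadata keys
# are prefixed with an underscore, and the original JSON document sits
# under "_source". All values are examples.
hit = {
    "_index": "articles",       # which index the document lives in
    "_id": "1",                 # unique document identifier
    "_score": 1.3,              # relevance score for the query
    "_source": {                # the document itself, possibly nested
        "title": "Intro to Elasticsearch",
        "author": {"name": "Jane", "team": "search"},  # nested structure
    },
}

# Every top-level key of a hit is metadata, marked by the underscore prefix
metadata_keys = sorted(k for k in hit if k.startswith("_"))
```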
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure and Google Cloud. Cloud Volumes ONTAP supports up to a capacity of 368TB, and supports various use cases such as file services, databases, DevOps or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering to NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.
For more on optimizing Elasticsearch deployment with NetApp, download our free eBook Optimize Elasticsearch Performance and Costs with Cloud Volumes ONTAP today.