Elasticsearch Architecture I: Clusters, Nodes, and Indices
Hey there!
Today, we’ll have a fun and simple chat about Elasticsearch architecture. So, sit back, grab a coffee and settle in.
“Elasticsearch is one of the most powerful search engines available today. Its ability to index and search vast amounts of data quickly and accurately makes it an indispensable tool for any company looking to improve their search capabilities.” — Werner Vogels, CTO of Amazon
Introduction
In today's world, data is expanding at an unprecedented rate, creating a need to efficiently organize, search, and analyze it. Elasticsearch is a widely used solution for managing large amounts of data. It is a distributed search and analytics engine that is open source, allowing for quick and efficient storage and retrieval of data at scale.
So, let’s get into the details of Elasticsearch’s design! We’ll start off by talking about the fundamentals of Elasticsearch, including its nodes, clusters, and indices. We will also discuss how Elasticsearch distributes data and scales to suit the demands of modern applications.
Understanding Elasticsearch
To better understand Elasticsearch’s architecture, it’s important to understand what it is and what it’s used for. Elasticsearch is a distributed search and analytics engine that can handle massive amounts of data. It is written in Java and is based on the Apache Lucene search library.
The primary goal of Elasticsearch is to make it simple to search through and analyze massive amounts of data quickly and effectively. It can store and search data across several nodes, which makes it very scalable. This makes it well suited to modern applications where real-time data organization, search, and analysis are required to keep up with the rapid growth of data.
Elasticsearch architecture is like a symphony orchestra, with each node playing a different instrument, but coming together to create beautiful music.
“Once upon a time, when you need a distributed search engine, but all you have is a bunch of old servers and some duct tape 😂”
Core Components of Elasticsearch
It’s important to understand Elasticsearch’s fundamental parts before exploring its architecture. Nodes, clusters, and indices are the three primary parts of Elasticsearch.
Nodes
Elasticsearch is built on a distributed design, meaning that data is saved across numerous nodes.
So what does a “node” mean in Elasticsearch? A node is a running instance of Elasticsearch that stores data. Think of it as a program that holds one piece of the data puzzle. The nicest feature is that you can run as many nodes as you need to store as many terabytes of data as necessary. Each node stores only a portion of the data, so a large amount of data can be spread across numerous virtual or physical machines. This is quite helpful because it allows us to store several terabytes of data even though each machine has only a few hundred gigabytes of disk space.
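The arithmetic behind this is simple enough to sketch in a few lines of Python. This is only a rough illustration: it assumes the data spreads evenly, that every byte of disk is usable, and that no extra copies of the data are kept. The function name and the sizes are made up for the example.

```python
import math

def nodes_needed(total_data_gb: float, disk_per_node_gb: float) -> int:
    """Smallest number of nodes whose combined disk space can hold the data.

    Simplifying assumptions: data is spread evenly, all disk space is usable,
    and no extra copies of the data are stored.
    """
    if total_data_gb < 0 or disk_per_node_gb <= 0:
        raise ValueError("data size must be >= 0 and disk size must be > 0")
    return math.ceil(total_data_gb / disk_per_node_gb)

# Hypothetical numbers: 4 TB of data, 500 GB of disk per machine.
print(nodes_needed(4000, 500))  # prints 8
```

A real cluster needs headroom on top of this, so treat the result as a lower bound rather than a sizing recommendation.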
It’s crucial to remember that a node is an instance of Elasticsearch rather than a physical machine. So you can start up five nodes on your development system without having to deal with virtual machines or containers. In a production environment, however, it’s better to keep things separate so that each node runs on its own server, virtual machine, or container.
Perhaps you’re wondering how all of this coordination takes place. How is data truly spread among nodes, and how does Elasticsearch know where a particular piece of data is kept? The quick answer is that every node is a part of a cluster.
Clusters
Clusters are collections of related nodes that store all of our data. While it’s possible to have many clusters, one is usually sufficient.
By default, clusters are completely independent of each other. While it’s possible to perform cross-cluster searches, it’s not very common. It’s more typical to run multiple clusters that serve different purposes. For instance, you could have a cluster for powering the search of an e-commerce application, and another for application performance management (APM). Typically, we split things into multiple clusters to separate them logically and to be able to configure them differently.
However, one cluster is typically enough, so we will be working with a single cluster. But wait a minute, you might be thinking, how do we create a cluster?
When we start up a node, a cluster is formed automatically. A node will either join an existing cluster if configured to do so, or it will create its own cluster consisting of just that node. An Elasticsearch node will always be part of a cluster, even if there are no other nodes.
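If you’d like to see this for yourself, the cluster health API reports the cluster name and how many nodes have joined. Here is a minimal Python sketch. It assumes an unsecured development node on the default HTTP port (recent versions enable security by default, in which case you would need HTTPS and credentials). The helper names and the sample response are illustrative, not taken from a real cluster.

```python
import json
from urllib.request import urlopen

# Assumption: an unsecured development node on the default HTTP port.
BASE_URL = "http://localhost:9200"

def fetch_cluster_health(base_url: str = BASE_URL) -> dict:
    """Call the cluster health API and return the parsed JSON response."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as response:
        return json.load(response)

def describe_cluster(health: dict) -> str:
    """Turn a cluster health response into a one-line summary."""
    name = health["cluster_name"]
    count = health["number_of_nodes"]
    noun = "node" if count == 1 else "nodes"
    return f"cluster '{name}' has {count} {noun} (status: {health['status']})"

# Illustrative response for a single node started with default settings.
sample = {"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1}
print(describe_cluster(sample))
```

With a node running locally, `describe_cluster(fetch_cluster_health())` would give you the same kind of summary for your own cluster.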
There are some problems with having only a single node in terms of availability and scalability, but for development purposes, it’s perfectly fine to have a cluster consisting of a single node. We’ll get back to these problems soon, but for now, let’s focus on understanding the basics.
Now that you know what clusters and nodes are, let’s take a closer look at how data is organized and stored.
Each unit of data that you store within your cluster is called a document. Documents are JSON objects containing whatever data you desire. When you index a document, Elasticsearch stores the original JSON object you sent, plus some metadata for its own use. To store info about a person, your object might have just two fields, name and country.
You aren’t limited to those two fields, though. You can add more fields to the object, so you have full control over what each document contains.
When we send a JSON object to Elasticsearch, it gets stored within a field called “_source”. Elasticsearch also stores some metadata alongside the object, but we’ll cover that later. For now, just relax and don’t worry about it.
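To make that concrete, here is a small Python sketch of a person document and the general shape it takes once stored. The field values, the index name “people”, the ID, and the as_stored helper are all made up for illustration. A real response carries more metadata fields than the two shown here.

```python
import json

# Hypothetical document; the field values are made up for illustration.
person = {"name": "Jane Doe", "country": "Denmark"}

def as_stored(document: dict, index: str, doc_id: str) -> dict:
    """Mimic the shape of a stored document: metadata fields start with an
    underscore, and the original JSON sits unchanged inside _source."""
    return {
        "_index": index,
        "_id": doc_id,
        "_source": document,
    }

stored = as_stored(person, index="people", doc_id="1")
print(json.dumps(stored, indent=2))
```

Notice that the object we sent is untouched. Everything Elasticsearch adds lives next to it, not inside it.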
Now, let me answer a question that you might have: how are documents organized? The answer is simple: they are organized within indices. Every document within Elasticsearch is stored within an index. Pretty neat, huh?
Indices
An index is a way to logically group documents together, and it also offers configuration options for scalability and availability. Later, we’ll explore those options in more detail.
In other words, an index is just a group of documents that are logically related and have similar characteristics. Say you run a shopping website. Each product might have its own document, and you might collect those documents into a single index called “products.” Similarly, if you had a social networking platform, you might have a document for each user and collect those into an index called “users.”
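Here is a rough Python sketch of what that grouping looks like at the API level: a document is stored with a request whose path names the index it belongs to. The index_request helper and the documents are hypothetical. The sketch only builds the requests and prints them, so nothing is sent anywhere.

```python
import json

def index_request(index: str, doc_id: str, document: dict) -> tuple[str, str, str]:
    """Build the HTTP method, path, and JSON body for storing one document
    in the given index under the given ID."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(document)

# Hypothetical documents for the two example indices.
requests_to_send = [
    index_request("products", "1", {"name": "Coffee mug", "price": 12.5}),
    index_request("users", "1", {"name": "Jane Doe", "country": "Denmark"}),
]

for method, path, body in requests_to_send:
    print(method, path, body)
```

The index name in the path is what puts the product in “products” and the user in “users”.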
An index can contain as many documents as you like, since there is no fixed cap on the number of documents at the index level. In practice, what limits you is the disk space available across your nodes. When searching for data, we specify the index we want to search for documents in. In other words, search queries are run against indices.
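As a small illustration, here is how a search scoped to one index could be put together in Python. The search_request helper, the field name, and the search text are assumptions made for the example. The sketch builds the request without sending it.

```python
import json

def search_request(index: str, field: str, text: str) -> tuple[str, str, str]:
    """Build a search request scoped to a single index, using a match query
    on one field."""
    body = {"query": {"match": {field: text}}}
    return "GET", f"/{index}/_search", json.dumps(body)

method, path, body = search_request("products", "name", "coffee")
print(method, path)
print(body)
```

Swapping “products” for “users” in the call would run the same kind of query against a different set of documents.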
In general, using indices in Elasticsearch is a great way to organize your data. They give you options for scalability and availability and let you group related documents together.
Thanks so much for taking the time to read this article. I really hope you enjoyed it!
If you have any questions or comments, please don’t hesitate to let me know! I’m always here to help and would love to hear your thoughts. 😊