Elasticsearch Architecture V: Node Roles

Mahmoud Yasser
14 min read · Apr 8, 2023


Hey there! So, you probably already know that an Elasticsearch cluster is made up of one or more nodes. Each node can store data in the form of shards, and each shard lives on some node. Up until now, every node you’ve seen has stored at least one shard, but it’s worth noting that a node doesn’t always have to store shards.

This is because each node can have one or more roles, which determine what the node is used for and how it behaves. Based on the cluster setup, we can assign each node different roles. If we don’t define any roles for a node, Elasticsearch assigns it all roles by default, and that default behavior holds across Elasticsearch versions, by the way.

Types of node roles:

  1. Master Node Role
  2. Data Node Role
  3. Coordinating Node Role
  4. Ingest Node Role
  5. Machine Learning Node Role
  6. Remote Eligible Node Role
  7. Transform Node Role

Let’s take a look at the available roles.

Master Node

Now, let’s talk about the big cheese in Elasticsearch: the master node. This is the node in charge of managing the whole cluster, so you know it’s pretty darn important. It handles cluster-wide operations such as creating and deleting indices, tracking which nodes are part of the cluster, and deciding how shards are allocated. And get this: if the master node fails or becomes unavailable, another master-eligible node can take over. So it takes a collaborative effort to keep the cluster working like a well-oiled machine.

The master node functions as the operation’s brain. It maintains the cluster metadata that describes the indices, such as the index name, number of shards, and where each primary shard lives. This metadata is what tells the cluster which indices exist and where their data is located. The master also handles node-level operations such as allocating shards across the nodes and enforcing cluster-level settings and policies. That’s how all the nodes work together seamlessly.

But hold on, there’s more! The master node also monitors the health of all nodes. It checks that they are responsive and have sufficient disk space for storing data. If a node fails or runs low on space, the master node can relocate shards to another node. This ensures that the cluster remains stable and responsive even during peak usage periods.

The master role is built for fault tolerance and high availability. It’s like a superhuman who can keep the cluster going even when everything else fails. If the current master fails, another master-eligible node can be elected without disrupting the rest of the cluster. Based on factors like the number of available nodes and the amount of data to be stored, the master node decides which shards go where. This keeps data efficiently distributed throughout the cluster.

But wait, just because a node has the master role does not automatically make it the master node. The actual master is elected by a voting procedure among the master-eligible nodes, so even if one or more of them fail, the cluster can still elect a master and keep operating. This voting mechanism helps the cluster stay stable and effective even under extreme stress.

In large clusters, it’s a good idea to have dedicated master nodes that are responsible only for cluster management. This allows the master node to focus on its crucial coordination tasks while maintaining cluster stability. The cluster may become unstable if the master node tries to handle too many other tasks, such as serving search queries. So make sure the master node is not overburdened with extra work.

Consider how to configure master nodes when setting up an Elasticsearch cluster. For big clusters, dedicated master nodes separated from the data nodes may be the best option. For smaller clusters with lower demand, letting the same nodes carry the master role alongside other roles may suffice. It all comes down to finding the right balance of performance and stability.

If you see that the master-eligible nodes are consuming a lot of CPU, memory, and I/O, consider adding a dedicated master node that handles cluster management exclusively. That way the workload is distributed across multiple nodes, and the cluster stays stable and responsive during peak usage periods.

To assign a node role, modify the node’s “elasticsearch.yml” file and add the following line:

node.roles: ["master"]

Data Node

Data nodes in Elasticsearch are the nodes that actually store your data, a bit like data warehouses. When a client sends a query to the cluster, the request is routed to the data nodes that hold the relevant data; those nodes execute the query, and the results are returned to the client.

When data is added to the cluster, it is indexed and stored on one or more data nodes. Elasticsearch uses the previously mentioned sharding technology to distribute data across several nodes, enhancing performance and scalability.
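For instance, a minimal sketch of creating an index whose shards will be spread across the data nodes might look like this (the index name and shard counts are purely illustrative):

PUT /web-logs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Each of the three primary shards, along with its replica, can then be allocated to a different data node, which is where the performance and scalability benefits come from.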

Using dedicated data nodes in Elasticsearch provides numerous advantages. One of the most significant is speed. By dividing data and search requests across numerous nodes, Elasticsearch can handle more concurrent queries and return results faster. This is especially critical in large clusters, where a single node could otherwise be overwhelmed by the demand.

Another advantage of data nodes is that they make scaling easier. Elasticsearch is meant to scale horizontally, which means you can add more nodes to the cluster to improve its performance. With dedicated data nodes, users can grow the storage and search capacity of the cluster independently. This enables more efficient resource utilization and decreases the possibility of bottlenecks.

Of course, leveraging data nodes isn’t all rainbows and sunshine. There are certain difficulties involved. One of the most difficult challenges is managing all of that data. It becomes more difficult to manage all of the data on the nodes as the cluster grows larger. This is especially true in clusters with many shards or a high rate of search requests. Users may need to consider adding more data nodes or optimizing their data management strategy to overcome this difficulty.

Another issue that data nodes face is resource utilization. To operate properly, data nodes demand a significant amount of resources such as memory, disk space, and CPU. Data nodes can act as a bottleneck in clusters with limited resources, making it difficult for the cluster to operate efficiently. Users may need to optimize their hardware configuration or set up a dedicated master node to offload part of the coordination work from the data nodes to avoid this.

To set the node role, edit the node’s “elasticsearch.yml” and add the following line:

node.roles: ["data"]

So, when you’re running a multi-tier deployment architecture, you use the specialized data roles (data_content, data_hot, data_warm, data_cold, and data_frozen) to place data nodes into specific tiers, as sketched below. A node can belong to multiple tiers, but if it has any of those specialized data roles, it can't also have the plain ol' data role. If you want to read more about these tiers, just hit up the Elasticsearch documentation.
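As a rough illustration, a node intended for the hot tier might be configured like this; the exact combination of tier roles depends entirely on your deployment:

node.roles: ["data_hot", "data_content"]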

Coordinating Node

The coordinating role is in charge of routing and distributing queries throughout an Elasticsearch cluster. In other words, it handles the work delegation required to process a request. Coordinating nodes, unlike data nodes, do not search the data themselves; instead, they delegate that work to the data nodes and merge the partial results.

To have a node function exclusively as a coordination node, all other roles must be removed from it. This can be accomplished by defining the node’s settings to exclude all other roles. As a result, a dedicated coordination node may handle query coordination and delegation activities within the Elasticsearch cluster.

One potential application for coordinating nodes is in large-scale search operations, such as those seen in e-commerce sites or data warehouses. A significant number of queries are running concurrently in these cases, and it is critical to distribute the load evenly across the cluster to avoid any one node from getting overloaded. The load can be distributed evenly by using dedicated coordination nodes, ensuring that queries are processed quickly and effectively.

Another potential use case for coordination nodes is when dealing with complex queries. In these scenarios, a coordination node can break the query down into smaller sub-queries and distribute them across the cluster for processing. This approach can greatly improve the performance of complex queries, as each sub-query can be processed in parallel by different nodes within the cluster.

Coordination node role is not a standalone solution. In order to achieve optimal performance, all roles within the cluster must be optimized and configured appropriately. This includes proper node sizing, network configuration, and query tuning.

To set the node role, edit the node’s “elasticsearch.yml” and add the following line:

node.roles: []

Just a heads up: if you add too many coordinating-only nodes to a cluster, things can go wrong. Because the master node has to wait for cluster state updates to be acknowledged by every single node, extra nodes add overhead for the entire cluster. Don’t get me wrong: coordinating nodes are useful, but they should be used sparingly. Data nodes can perform the same coordination work and be just as happy about it!

Ingest Node

Elasticsearch’s ingest role allows a node to run pipelines that pre-process documents before they are indexed into Elasticsearch. But what exactly is an ingest pipeline? Essentially, it is a series of processing steps applied to a document on its way in. These steps, known as processors, can be used to fine-tune the document before it is indexed.

Assume you have a web server access log that you wish to ingest into Elasticsearch. Each request in the log is saved as a separate document, but you might want to spice things up by adding geographical information based on the visitor’s IP address. You can use an ingest pipeline with a processor that converts the IP address into latitude and longitude coordinates, as well as the country of origin.
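A minimal sketch of such a pipeline might look like the following, assuming a hypothetical field named client_ip and a pipeline name I made up; the geoip processor enriches each document with location details derived from that IP:

PUT _ingest/pipeline/access-log-geo
{
  "description": "Add geo information based on the client IP",
  "processors": [
    {
      "geoip": {
        "field": "client_ip"
      }
    }
  ]
}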

Ingest pipelines can be as basic or as complicated as you require. Logstash, another component in the Elastic Stack, can be used for extremely complicated data transformations. However, if you only need to make minimal changes to your data, the ingest pipeline is a handy tool to have.

However, when you’re ingesting a ton of data, running all of the documents through an ingest pipeline can be a real drag on your hardware. That’s where dedicated ingest nodes come in handy. By enabling or disabling ingest pipelines for each node, you can spread the workload and avoid any one node from getting bogged down.

Keep in mind that some Elastic Stack products, such as Beats and Logstash, provide their own pipelines. These pipelines are capable of collecting and transforming data from a variety of sources, including logs, metrics, and network traffic.

In order to use the ingest role in Elasticsearch, you must first set up an ingest pipeline. You can accomplish this by defining a collection of processors and configuring how they should be applied to your data using the Ingest API. After you’ve built your pipeline, you can use the Index API to ingest data into Elasticsearch.

The Ingest API has a number of data processors that allow you to add, remove, or rename fields, divide or combine fields, convert data types, and execute conditional actions. Third-party processors can also be used to extend the capability of the ingest pipeline.
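Once a pipeline exists, indexing a document through it is just a matter of pointing the Index API at it. Here is a sketch using the hypothetical pipeline and index names from above:

POST web-logs/_doc?pipeline=access-log-geo
{
  "client_ip": "8.8.8.8",
  "message": "GET /products HTTP/1.1 200"
}

The geoip processor then adds the derived location fields to the document before it is written to the index.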

When designing an ingest pipeline, it’s critical to consider the performance consequences of each processor. Some processors are extremely resource-intensive, and using too many of them, or applying them in the wrong order, can severely degrade the performance of your Elasticsearch cluster.

Test your ingest pipeline on a meaningful sample of data before releasing it into your production environment to ensure optimal performance. You may also monitor the performance of your pipeline using the Elasticsearch Monitoring API, which gives real-time metrics and statistics on your Elasticsearch cluster.
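For example, one general-purpose way to see how much time your pipelines are spending on each node is the node stats API:

GET _nodes/stats/ingest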

To assign a role to a node, update its “elasticsearch.yml” file and add the following line:

node.roles: ["ingest"]

Machine Learning Node

Elasticsearch has a fantastic machine learning node that is ideal for handling machine learning API calls.

With Elasticsearch’s machine learning capabilities, it all comes down to employing advanced algorithms and statistical models to analyze data and uncover patterns and insights.

You may utilize Elasticsearch’s machine learning capabilities to analyze massive datasets, find anomalies, and even generate predictions! Isn’t that amazing?

But here’s the catch. Your CPU must be capable of handling SSE4.2 instructions in order to use the machine learning algorithms. So, if your CPU does not support this instruction set, you’re out of luck.

But wait, there’s more! Setting up a machine learning node is super easy. All you gotta do is add this config to the “elasticsearch.yml” file:

node.roles: ["ml"]

This instructs Elasticsearch to establish a super sweet machine learning node that can handle all of your machine learning API calls like a boss, after which you can start executing machine learning algorithms on your Elasticsearch cluster like a pro.

Oh, and if you’re utilizing the remote_cluster_client functionality for machine learning, be sure to give the machine learning nodes the “remote_cluster_client” role as well. This ensures that all nodes are ready to handle remote cluster requests like a boss.

node.roles: ["ml", "remote_cluster_client"]

By the way, utilizing a dedicated master or coordinating node as a machine learning node isn’t always the best option. If you run machine learning algorithms on nodes that aren’t optimized for them, they can consume a lot of resources and cause performance issues.

Let’s have a look at some of the other interesting things you can do with the machine learning node. It contains, for example, an API for generating and managing machine learning jobs, as well as tools for monitoring machine learning model performance and identifying abnormalities in real time.

The machine learning node in Elasticsearch also supports a wide range of machine learning algorithms and models, including supervised and unsupervised learning, regression, clustering, and anomaly detection. This means you can choose the best algorithm for your specific use case and data set.
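To make that concrete, here is a rough sketch of creating a simple anomaly detection job through the machine learning API; the job id and field names are purely illustrative:

PUT _ml/anomaly_detectors/response-time-anomalies
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "mean", "field_name": "response_time" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

Requests like this one are exactly the kind of machine learning API calls that end up running on ml nodes.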

One of the best things about Elasticsearch’s machine learning capabilities is that it allows you to quickly and easily analyze large datasets and extract meaningful insights. For example, businesses can use machine learning algorithms to analyze customer data and identify patterns and trends that can inform marketing and product development strategies.

Another great thing about Elasticsearch’s machine learning node is that it can help businesses detect and prevent fraud in real-time. By analyzing transaction data and identifying patterns and anomalies, businesses can quickly detect and prevent fraudulent activity before it causes significant harm.

Remote Eligible Node

Elasticsearch can support remote clusters, which is one of its most appealing features. This means you can have clusters in separate locations that replicate indices via cross-cluster replication and query each other using cross-cluster search.

To get this going, you need to use a special node role called “remote_cluster_client”. This role lets a node communicate with remote clusters and do cross-cluster replication and search operations.

To set up a node as a remote eligible node, all you have to do is add this configuration to the “elasticsearch.yml” file:

node.roles: ["remote_cluster_client"]

But that’s just the start. You’ll also need to configure your clusters and indices so they can be accessed and searched across multiple nodes.
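One common way to do that is to register the remote cluster through the cluster settings API; the alias and seed address below are just placeholders for whatever your remote cluster actually is:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": {
          "seeds": ["10.0.0.5:9300"]
        }
      }
    }
  }
}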

The remote_cluster_client node role is particularly beneficial for large organizations with many data centers or geographical regions. This allows them to replicate data across many sites for redundancy and disaster recovery. With cross-cluster replication, data can be replicated to remote clusters in real-time or near-real-time, so it’s always available even if the primary cluster goes down.

Also, remote eligible nodes can be used for cross-cluster search, which is awesome for big organizations with lots of teams or departments working on different projects. They can all search across all the data that is being generated.
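For example, once a remote cluster is registered under an alias, a cross-cluster search just prefixes the index pattern with that alias (the names here are illustrative):

GET cluster_two:web-logs-*/_search
{
  "query": {
    "match": { "message": "error" }
  }
}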

Essentially, the remote_cluster_client node role is a critical tool for organizations that need to configure cross-cluster replication and search activities. You may ensure that data is replicated across many locations and can be searched from any node in the cluster by making nodes remote eligible.

Transform Node

Transform nodes are a great tool in Elasticsearch for creating new, summarized indices and gaining useful analytical insights. With these nodes you can condense large amounts of data into entity-centric summaries and keep your Transform API requests fast and efficient.

Furthermore, the advantages of transform nodes go beyond simple summarizing jobs. With their dope capabilities, you can also perform data normalization and filtering, which can be critical in making sure your analysis is legit and reliable.

Simply put, transform nodes are critical for every Elasticsearch user trying to get the most out of their data. Without at least one node carrying the transform role, Elasticsearch can’t build summarized indices, and your Transform API requests won’t get anywhere.
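As a rough sketch, a transform that summarizes orders per customer might look like this; the transform id, index names, and field names are all hypothetical:

PUT _transform/orders-per-customer
{
  "source": { "index": "orders" },
  "dest": { "index": "orders-summary" },
  "pivot": {
    "group_by": {
      "customer_id": { "terms": { "field": "customer_id" } }
    },
    "aggregations": {
      "total_spent": { "sum": { "field": "order_total" } }
    }
  }
}

POST _transform/orders-per-customer/_start

Once started, the transform builds the summarized destination index from the source data.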

However, how can you make sure your transform nodes are properly configured for optimal performance? The answer is to include the necessary configuration in your “elasticsearch.yml” file by inserting this line:

node.roles: ["transform", "remote_cluster_client"]

You can make sure that your transform nodes are also remote_cluster_client nodes. This is especially important if you’re working with data across multiple clusters, as it ensures that your transform nodes are fully equipped to handle the task at hand.

You can also fine-tune the destination indices that your transforms write to by adjusting settings like the number of shards and replicas. This lets you optimize performance and ensure the transforms can handle even the most complex data transformations.

Transform nodes, on the other hand, shouldn’t double as dedicated coordinating or master nodes. Those nodes are optimized for their own tasks, and you want your transform nodes to focus solely on data transformations. By following these guidelines, you can ensure that you are getting the most out of your transform nodes and creating the best indices for your data.

Getting the Cluster’s Node Roles

So, what roles do our nodes actually have? You’ve seen this output before, but I didn’t explain it at the time. Let’s go to Kibana and look at the nodes in our cluster with “GET /_cat/nodes?v”.

We’re particularly interested in two columns: “node.role” and “master.” The first column describes the roles that each node has. Because the roles are abbreviated, “dim” stands for “data,” “ingest,” and “master.”
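The output looks something like this (the values are purely illustrative and some columns are omitted); the asterisk in the master column marks the elected master:

ip        heap.percent ram.percent cpu node.role master name
10.0.0.1            42          78   3 dim       *      node-1
10.0.0.2            35          70   2 dim       -      node-2
10.0.0.3            40          75   2 dim       -      node-3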

These three roles are the default roles for all of our nodes unless you modify them. As a result, any of the three nodes can be chosen as the master node. The first node was picked as the master in this scenario because it was the first node to start up and there were no other nodes accessible at the time.

However, changing node roles is generally meant for larger clusters, not smaller ones. It also makes sense when you need to scale up the cluster’s throughput, for example when usage is high.

However, you should usually start by adjusting the number of shards and nodes.

Another reason for changing roles in a large cluster is to make it easier to see how hardware resources are being used and optimize the cluster accordingly.

As a general guideline, if you don’t know what you’re doing or don’t have a strong cause, avoid changing roles. You are able to change your roles later, so stick with the defaults until you need to.

Just so you know, I’m not showing you this so you can change roles. It’s just to let you know that you can modify them whenever you choose.

Thank you for taking the time to read this article. I genuinely hope you enjoyed it. I also recommend reading the previous articles in the series to help you connect the dots.

If you have any questions or comments, please don’t hesitate to let me know! I’m always here to help and would love to hear your thoughts. 😊
