Building a Reliable Notification Service: Solving Duplication and Scaling Issues

Mahmoud Yasser
17 min read · Oct 22, 2024


Introduction

One of the biggest issues you will face as your app scales is building scheduled tasks that hold up under heavy load in a distributed system, particularly one like ours that must send notifications. Delivering notifications is not just about firing off messages; it is about orchestrating a complex interplay of data and timing.

As the lifeblood of user engagement, notifications are a balance between valuable and overwhelming. You are not only handling basic message dispatch; you are navigating a battlefield of potential challenges. There is notification duplication, where the same message reaches a user multiple times from different sources. There is the problem of inefficient retry mechanisms, which waste precious system resources on pointless attempts. And there is the constant battle of resource allocation: making sure the different parts of your system play nicely with each other instead of competing for the same resources.

But fear not! Read on, and let this article guide you toward notification nirvana. Today, we will share our experience building a robust and scalable notification system that won't slow your company down. We'll explore the toolbox of solutions we used to overcome the problems that arose as our user base grew to enormous numbers, and we'll show how we used BullMQ to build a highly reliable job scheduling system. Redis locks? We'll see how they became our secret weapon for keeping the whole system in sync. And that's without even mentioning our heartbeat mechanism, which brings the notification pipeline back to life whenever part of it goes cold, ensuring your notifications stay active and efficient.

After reading, you’ll have the knowledge to transform your app from a simple messenger to a notification powerhouse. So, let’s embark on this exciting journey to supercharge your app’s communication prowess!

The Initial Design: A Simple Notification System

Our notification system had a clear and meaningful goal: we wanted to send each user one personalized message every day, thoughtfully designed to be both timely and valuable. But we wanted to go beyond just sending messages; we wanted to make sure that every notification had a true purpose for everyone.

To achieve this, we developed smart rules based on user behavior to figure out who would benefit from receiving a notification that day. If the algorithm determined that the user didn't need a notification, we omitted the message altogether. It was our way of respecting users' time and keeping their inboxes clutter-free.

Although it sounds straightforward — a system that sends notifications every day and only when it matters — building and scaling it has been a journey full of exciting challenges. As we delved into this process, we encountered complex problems that forced us to find innovative solutions. The story of how we developed, fine-tuned, and scaled this system is one of creativity and tenacity.

Event-Driven Architecture

We set up pub/sub listeners that subscribed to multiple topics, each representing a different type of event the user participated in. These events gave us the insights we needed to create personalized user profiles, allowing us to determine which notifications would be the most valuable to each user. For example, if our metrics indicated that a particular user would benefit from a certain piece of information, our system would trigger a rule and a notification would be scheduled via Firebase (a minimal sketch of this flow follows the list below).

Our approach is designed with real-time, event-driven logic that allows us to immediately react to user behavior and deliver timely, meaningful notifications:

  • User Profiles: By leveraging the incoming Pub/Sub events, we created customized user profiles to fine-tune who receives notifications and when.
  • Daily Notifications: Based on these personalized profiles, we ensured that each user received one carefully curated notification per day. Nothing more, nothing less.
  • Firebase Magic: We relied on Firebase’s ability to process large-scale device messaging, allowing us to seamlessly deliver notifications to millions of users simultaneously. This was a simple, yet powerful tool that made managing large volumes of notifications a breeze.
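
To make the flow concrete, here is a minimal TypeScript sketch of that event-driven path, assuming the @google-cloud/pubsub and firebase-admin SDKs. The subscription name, profile shape, and rule helpers are hypothetical placeholders rather than our production code.

```typescript
import { PubSub, Message } from '@google-cloud/pubsub';
import * as admin from 'firebase-admin';

admin.initializeApp();

interface UserProfile {
  userId: string;
  deviceToken: string;
  wantsDailyTip: boolean;
}

// Hypothetical stand-ins for the real profile store and rule engine.
async function updateUserProfile(event: { userId: string }): Promise<UserProfile> {
  return { userId: event.userId, deviceToken: 'device-token', wantsDailyTip: true };
}
function shouldNotify(profile: UserProfile): boolean {
  return profile.wantsDailyTip;
}

new PubSub()
  .subscription('user-activity-events') // hypothetical subscription name
  .on('message', async (message: Message) => {
    const event = JSON.parse(message.data.toString());
    const profile = await updateUserProfile(event);

    // Trigger a Firebase push only when the behavior rules say it adds value today.
    if (shouldNotify(profile)) {
      await admin.messaging().send({
        token: profile.deviceToken,
        notification: { title: 'Your daily update', body: 'Something we think you will find useful' },
      });
    }

    message.ack();
  });
```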

At first, everything went smoothly — our event-driven system worked like a charm when things were steady. But as more users joined and traffic really picked up, some issues arose. Suddenly, our seemingly perfect setup started showing cracks, especially when it came to retries and duplicate notifications. That’s when we realized that to scale this real-time system, we needed an innovative solution to ensure its reliability.

The Challenges of Scaling with Kubernetes Cron Jobs

One of the biggest challenges we faced with Kubernetes cron jobs was dealing with retries, especially if something went wrong. At first glance, Kubernetes’ automatic retry of failed tasks seemed like a good safeguard: If something didn’t go as planned, no problem — the system would just try again, giving the notification a second chance to be delivered.

As we began scaling, we noticed some significant issues with our retry mechanism. The main challenge was in how Kubernetes managed retries: when a cron job encountered a failure, it would retry the entire batch of notifications, not just the ones that failed. So, if we sent out 50 notifications and only 10 of them didn’t go through, Kubernetes would, by default, retry all 50 on the next run — even the ones that had already been successfully delivered.

This approach led to a cascade of unintended consequences:

  • Inefficient Retries: Kubernetes was retrying every notification in the batch, even the ones that had already been successfully sent. This led to users receiving duplicate notifications, which not only caused confusion but also frustration among our audience.
  • Resource Waste: These unnecessary retries also put a heavy load on our system. Each time Kubernetes triggered a retry, it consumed additional CPU and memory. As our user base expanded, this issue became more apparent. Resending notifications that had already gone through successfully added unnecessary overhead, gradually affecting the overall performance and efficiency of our infrastructure.

This wasn’t just a small glitch; it quickly turned into a major bottleneck as we scaled. The system we had chosen for its simplicity and scalability was now turning into a hurdle. It wasn’t just risking the user experience by sending duplicate notifications; it was also putting unnecessary strain on our servers. What began as a safeguard to ensure reliable delivery was now threatening to overwhelm the very infrastructure it was supposed to protect.

Realizing the Need for a Finer-Grained Approach

As the volume of notifications continued to grow, the inefficiencies of the retry process could no longer be ignored. We were wasting resources left and right, but even more concerning was the impact on our core mission: delivering meaningful, targeted notifications to our users. The system was no longer delivering the value we wanted to provide, and it was clear that changes were needed to get back on track.

We realized that we needed a more fine-tuned retry mechanism that could intelligently distinguish between successful and failed notifications. Our system needed to pick up where it left off, leaving successfully delivered messages untouched and retrying only those that weren’t sent. This approach would ensure that users didn’t receive duplicate notifications and would significantly reduce the load on our infrastructure.

At this point, it became clear that Kubernetes cron jobs, despite their strengths, couldn’t provide the fine-grained control and efficiency we needed. We needed a more adaptable solution, one that could manage retries with the precision required to scale efficiently while preserving the integrity of the user experience.

The lesson was clear: even the simplest tools can have limitations when scaling systems becomes a priority. And when those limitations start affecting both performance and user experience, it’s time to pivot and find a solution that can keep pace with growing demands. This brought us to BullMQ, a system that would give us the level of control we needed to manage retries smartly and efficiently.

BullMQ: Managing Notifications with Fine-Grained Control

After facing the limitations of Kubernetes cron jobs, we realized that we needed a more sophisticated solution that could handle the complexity of retries in a growing system. That’s when we switched to BullMQ, a job queue system built on top of Redis. BullMQ gave us the granular control we had been missing, especially in scheduling, managing, and retrying jobs. With this new flexibility, we were able to customize the retry logic to fit our exact needs, ensuring notifications were delivered efficiently and effectively, even as the system scaled.

Our previous approach with Kubernetes had been too broad, retrying every notification in a batch without checking which had already been delivered. BullMQ allowed us to create a more intelligent system, retrying only failed notifications. This not only streamlined the process but also improved the user experience by eliminating redundant notifications.

How BullMQ Solved Our Retry Problem

One of the standout features of BullMQ was its ability to configure retries with pinpoint accuracy. For example, if we sent out 50 notifications and 10 of them failed, BullMQ would retry only those 10 failed messages, leaving the other 40 that were successfully delivered untouched. This was a game-changer in both resource efficiency and in preventing users from being overwhelmed with duplicate messages.
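
The key is to enqueue one job per notification rather than one job per batch. Below is a minimal sketch of what that might look like with BullMQ; the queue name, Redis connection, and backoff values are illustrative assumptions, not our exact production configuration.

```typescript
import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed local Redis
const notificationQueue = new Queue('daily-notifications', { connection });

// One job per user: if a single delivery fails, only that job is retried,
// never the whole day's batch of notifications.
async function enqueueDailyBatch(userIds: string[]): Promise<void> {
  for (const userId of userIds) {
    await notificationQueue.add(
      'send-notification',
      { userId },
      {
        attempts: 3,                                   // our three-retry limit
        backoff: { type: 'exponential', delay: 5000 }, // wait longer between attempts
        removeOnComplete: true,                        // keep the queue tidy
      },
    );
  }
}
```

Because each notification is its own job, the 40 successful deliveries from the example above simply complete and disappear, while only the 10 failed jobs re-enter the queue for another attempt.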

The benefits of using BullMQ became clear right away:

  • Targeted Retries: Unlike our experience with Kubernetes cron jobs, BullMQ allowed us to focus solely on the failed notifications for retry. This meant users who had already received their notification wouldn’t be bothered with duplicates, enhancing the overall user experience.
  • Optimized Resource Usage: With BullMQ, unnecessary retries were minimized, which drastically reduced resource consumption. CPU and memory weren’t wasted on jobs that had already been successfully completed, allowing our infrastructure to run more efficiently.
  • Retry Limits: We introduced a limit of three retries per notification. After three failed attempts, BullMQ would log the failure, and an automatic alert would be sent to a dedicated Slack channel. This real-time notification allowed our team to quickly investigate and resolve any persistent issues before they affected a larger portion of the user base.

This flexibility wasn’t just about gracefully managing failures — it gave us complete control over our notification pipeline. BullMQ allowed us to fine-tune the retry logic to meet the demands of production traffic, ensuring that our system could adapt to changing conditions without sacrificing performance or user experience.

Monitoring and Error Handling with BullMQ

Beyond targeted retries, BullMQ also integrated seamlessly into our monitoring and alerting infrastructure. When a job failed to execute after the set number of retries, we configured the system to generate detailed error reports that were instantly sent to a Slack channel, providing the team with critical information on why the job failed and which notifications were impacted. This real-time visibility allowed us to address issues quickly and prevent them from escalating.

  • Real-Time Alerts: BullMQ ensured that failures weren’t left in the dark. Each job that still failed after its third retry was logged, and an immediate Slack alert was generated, allowing our team to quickly evaluate whether the issue was a one-off problem or part of a larger system issue (a sketch of this alerting hook follows the list).
  • Detailed Error Reports: BullMQ gave us detailed insight into each failure, allowing us to track patterns and ensure that the system remained stable and efficient over time.
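
A worker’s failed event is one natural place to hang this kind of alerting. The sketch below assumes a Slack incoming-webhook URL in an environment variable and a hypothetical sendPushNotification helper; it is illustrative rather than our exact setup.

```typescript
import { Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };    // assumed local Redis
const slackWebhookUrl = process.env.SLACK_WEBHOOK_URL!;  // hypothetical incoming-webhook URL

// Hypothetical delivery helper wrapping the Firebase call.
async function sendPushNotification(userId: string): Promise<void> {
  // admin.messaging().send(...) would go here
}

const worker = new Worker<{ userId: string }>(
  'daily-notifications',
  async (job) => {
    await sendPushNotification(job.data.userId);
  },
  { connection },
);

// Only alert once a job has exhausted its retries (three attempts in our setup);
// earlier failures are handled silently by BullMQ's retry logic.
worker.on('failed', async (job, err) => {
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    await fetch(slackWebhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `Notification job ${job.id} for user ${job.data.userId} failed after ${job.attemptsMade} attempts: ${err.message}`,
      }),
    });
  }
});
```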

However, as we scaled the system and increased the number of pods running the notification service, a new challenge emerged: duplicate notifications being sent to users. This led us to the next problem we needed to solve — controlling how jobs were pushed into the queue to prevent these unnecessary duplicates from reaching users.

Managing Duplicate Notifications: The New Challenge

While BullMQ offered a powerful and efficient solution for managing retries, scaling the notification system to accommodate increasing traffic introduced a new challenge — duplicate notifications. As we deployed more pods to handle the growing user base, the system began to behave unpredictably. Each pod, operating independently, was pushing jobs into the queue, leading to the same notification being sent to users multiple times. This wasn’t just a minor inconvenience; it had a direct impact on user experience, as repeated notifications could easily frustrate recipients.

The root of the issue lay in the lack of coordination between the pods. Since each pod was unaware of the jobs being handled by the others, they occasionally duplicated efforts. This highlighted the need for a more centralized way of controlling how jobs were pushed into the queue, ensuring that each notification was sent only once, regardless of how many pods were running. Solving this would be key to maintaining a smooth and reliable notification system as we continued to scale.

The Duplication Dilemma

Imagine this scenario: multiple pods are running the notification service, each unaware of what the others are doing. If two pods are active, both might unknowingly push the same notification job into the queue for the same user. This results in the user receiving two identical notifications, which can be frustrating. Now imagine three or four pods running concurrently, all processing the same notification job. In this case, users might receive three or even four identical notifications, far beyond what was intended.

This wasn’t a minor issue. It posed a serious threat to the user experience we had worked so hard to design, where notifications were supposed to be personalized, relevant, and non-intrusive. Instead, users were now receiving multiple, redundant messages, which not only led to frustration but also diminished the perceived value of our service. The very goal of delivering meaningful, timely notifications was at risk, and it became clear that we needed a solution to coordinate the pods and prevent these unwanted duplicates.

Why Duplication Happened

The duplication problem arose as a byproduct of how Kubernetes handled scaling. As traffic increased, Kubernetes would spin up more pods to help manage and process the workload, keeping the system responsive under heavy load. However, each pod operated independently. This meant that when multiple pods were running, each one would process the same schedule, resulting in multiple jobs being pushed into the queue for the same notification.

  • Multiple Pods, Same Job: Each pod, unaware of the others, would independently attempt to schedule a notification job for the same user. The result? Duplicate jobs for a single notification, and consequently, multiple notifications delivered to the user.
  • Unintended Consequences: While Kubernetes scaling ensured that our system could handle increased traffic, it also led to unintended duplication. Users who received the same notification multiple times grew frustrated, and the user experience we had carefully crafted began to erode.

It became evident that simply scaling the number of pods wasn’t enough. We needed to find a way to ensure that only one pod could push a notification job into the queue at a time, no matter how many were running. This was essential to maintaining the integrity of the user experience and avoiding unnecessary duplication.

Redis Locks: The Solution to Job Duplication

As our notification system scaled, the introduction of multiple pods processing the same jobs created an unforeseen issue: duplicate notifications. To tackle this, we introduced a Redis-based locking mechanism that allowed us to coordinate the actions of multiple pods and prevent the same job from being processed by more than one pod at a time. This ensured that each user would receive only one notification, no matter how many pods were running.

How Redis Locks Worked

The Redis lock acted as a gatekeeper, ensuring that only one pod could perform specific tasks — such as pushing a job into the queue — at any given moment. Here’s how the process worked in practice:

  1. Acquiring the Lock: When a pod wanted to push a notification job into the queue, it first had to acquire the Redis lock. If another pod had already acquired the lock, the second pod would be blocked from proceeding until the lock was released.
  2. Executing the Job: Once a pod successfully acquired the lock, it could safely push the notification job into the queue, knowing that no other pod could perform the same task. The lock ensured exclusive control over the job scheduling process.
  3. Releasing the Lock: After the job was successfully scheduled, the pod would release the lock, allowing other pods the opportunity to proceed with their tasks.

This mechanism ensured that, even when multiple pods were running concurrently, they wouldn’t interfere with one another, maintaining concurrency control and preventing multiple jobs from being scheduled for the same notification. It guaranteed that only one notification job was pushed into the queue for each user, no matter how many pods were active.
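
Here is a minimal sketch of that gatekeeper using ioredis, with the classic SET NX PX pattern for acquisition and a Lua script for a safe release. The key name and TTL are illustrative assumptions rather than our exact values.

```typescript
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

const redis = new Redis(); // assumed local Redis instance

const LOCK_KEY = 'locks:notification-scheduler'; // hypothetical key name
const LOCK_TTL_MS = 30_000;                      // the lock expires if it is never renewed

// SET ... NX PX succeeds only if nobody currently holds the lock,
// so exactly one pod becomes the scheduler at any given moment.
export async function acquireLock(): Promise<string | null> {
  const token = randomUUID(); // unique token so a pod can only ever release its own lock
  const result = await redis.set(LOCK_KEY, token, 'PX', LOCK_TTL_MS, 'NX');
  return result === 'OK' ? token : null;
}

// Check-and-delete atomically in Lua so a pod can never delete a lock
// that another pod has acquired in the meantime.
const RELEASE_SCRIPT = `
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  end
  return 0
`;

export async function releaseLock(token: string): Promise<void> {
  await redis.eval(RELEASE_SCRIPT, 1, LOCK_KEY, token);
}
```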

Redis Locks in Action: A Real-World Example

To better understand how Redis locks prevented job duplication, consider the following scenario with two pods — Pod A and Pod B — running simultaneously.

  1. Pod A and Pod B Start: Both pods come online and begin their processes. They each try to acquire the Redis lock to push a notification job into the queue.
  • Pod A successfully acquires the lock, giving it exclusive control over scheduling the notification.
  • Pod B is blocked from acquiring the lock and must wait.
  2. Pod A Executes the Job: With the lock in hand, Pod A pushes the notification job into the queue. It processes the job, fetching user data and content, and sends notifications in batches. During this process, the lock is periodically renewed to ensure that Pod A retains control over the job.
  3. Pod B Waits: Since Pod B can’t acquire the lock, it waits and periodically retries. Once Pod A finishes its task, Pod B will have the opportunity to proceed.
  4. Pod A Finishes: Once Pod A has finished processing the job, it releases the lock, freeing up the system for other pods to acquire it.
  5. Pod B Acquires the Lock: After Pod A releases the lock, Pod B successfully acquires it on its next retry. Now, Pod B can proceed with scheduling and processing the next notification job.
  6. The Cycle Repeats: This cycle continues, ensuring that only one pod pushes the job into the queue at a time, preventing job duplication even as the system scales. A sketch of this scheduling loop follows below.
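
Putting the pieces together, each pod can run the same scheduling loop and simply do nothing on the ticks where it fails to win the lock. This sketch builds on the acquireLock/releaseLock helpers and the enqueueDailyBatch function from the earlier snippets; the tick interval and the user query are hypothetical.

```typescript
// Builds on the earlier sketches; declared here so the snippet stands alone.
declare function acquireLock(): Promise<string | null>;
declare function releaseLock(token: string): Promise<void>;
declare function enqueueDailyBatch(userIds: string[]): Promise<void>;
declare function fetchUsersDueForNotification(): Promise<string[]>; // hypothetical query

// Every pod runs this loop; only the pod that wins the lock pushes jobs on a given tick.
function runSchedulerLoop(): void {
  setInterval(async () => {
    const token = await acquireLock();
    if (!token) return; // another pod holds the lock, so skip this tick and retry later

    try {
      const userIds = await fetchUsersDueForNotification();
      await enqueueDailyBatch(userIds);
    } finally {
      await releaseLock(token); // always release so a waiting pod can take its turn
    }
  }, 60_000); // hypothetical one-minute tick
}
```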

By implementing Redis locks, we achieved several critical improvements that ensured both the reliability and scalability of our notification system:

  • No More Duplicate Notifications: The Redis lock guaranteed that only one pod could push a notification job at a time, eliminating the issue of users receiving multiple identical messages.
  • Improved User Experience: With the duplication problem solved, we were able to maintain the high-quality, personalized user experience that was central to our service. Users no longer faced the frustration of receiving the same notification multiple times.
  • Efficient Pod Coordination: Redis locks enabled better coordination between multiple pods, ensuring that even under heavy traffic loads, the system remained organized and predictable.
  • Scalability without Compromise: As we continued to scale the number of pods to meet growing traffic demands, Redis locks ensured that this scaling did not come at the cost of system integrity or user experience. Our notification system could now grow in capacity without introducing any chaos or duplication issues.

While Redis locks solved the critical issue of duplicate notifications, they surfaced a new challenge: what happens if the pod holding the lock crashes? The job it was processing might remain uncompleted, or worse, the lock might stay held in the system, preventing other pods from processing notifications.

This led us to develop a heartbeat mechanism to monitor the health of the pods and ensure that, if a pod failed, the lock would be released, allowing other pods to take over and maintain continuity in job processing. In the next section, we will explore how we implemented this heartbeat system to ensure that our Redis-based locking mechanism remained robust and fault-tolerant, even in the face of pod failures.

Handling Pod Crashes: The Role of the Heartbeat Mechanism

While the Redis lock was an effective solution for preventing duplicate notifications, it also introduced a new challenge: what happens if a pod holding the lock crashes or is terminated unexpectedly? Without a mechanism to handle this, the job would remain unowned in the queue, and the entire system could grind to a halt, leaving important notifications unsent.

In a system designed for scalability and reliability, such a failure scenario had to be addressed. We needed to ensure that when one pod failed, another could seamlessly take over and continue processing jobs without any disruption. This is where the heartbeat mechanism came into play.

The Heartbeat Mechanism for Fault Tolerance

The heartbeat mechanism was designed to ensure that our system could recover quickly in the event of a pod failure. Here’s how it worked:

Every 30 seconds, each pod would send a “heartbeat” signal, essentially checking in to confirm that it was still active and holding the Redis lock. This regular check-in ensured that the system was constantly monitoring the health of the pods responsible for processing notifications.

Key Elements of the Heartbeat Mechanism:

  • Periodic Check-Ins: Each pod would periodically check the status of the Redis lock it held. If a pod crashed or was terminated unexpectedly, the lock would be released, and the system would become aware that no pod was actively processing the job.
  • Quick Recovery: When a lock was released due to a pod failure, other active pods would race to acquire the lock. The fastest available pod would take over the job, ensuring that the notification was processed without delay. This ability to quickly recover from pod failures kept the system running smoothly, even during periods of instability or high load.

Let’s consider the following scenario where a pod crashes unexpectedly:

  1. Pod A Acquires the Lock: Pod A acquires the Redis lock and begins processing a notification job. It periodically renews its “heartbeat” to indicate that it’s still active and holding the lock.
  2. Pod A Crashes: Halfway through processing, Pod A crashes unexpectedly. The Redis lock remains but is no longer being actively renewed.
  3. Redis Lock Expires: After 30 seconds, the system detects that Pod A is no longer sending a heartbeat. The Redis lock expires, and other pods, such as Pod B and Pod C, become aware that the job is unowned.
  4. Pod B Takes Over: Pod B is the fastest to react and successfully acquires the lock. It takes over the job and continues processing where Pod A left off, ensuring the notification is sent without further delays.
  5. The Cycle Continues: As long as the system is running, the heartbeat mechanism ensures that jobs are monitored, and any disruptions due to pod failures are handled automatically, preventing jobs from being left in limbo. A sketch of the lock-renewal logic follows below.
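
Here is a minimal sketch of what that renewal loop can look like, again using ioredis and a Lua script so the lock’s TTL is only extended by the pod that actually owns it. The key name, TTL, and heartbeat interval are illustrative assumptions; the only requirement is that the renewal interval stays comfortably shorter than the TTL.

```typescript
import Redis from 'ioredis';

const redis = new Redis(); // assumed local Redis instance

const LOCK_KEY = 'locks:notification-scheduler'; // hypothetical key name
const LOCK_TTL_MS = 30_000;  // lock disappears ~30s after the holder stops renewing it
const HEARTBEAT_MS = 10_000; // renew well before the TTL can run out

// Extend the TTL only if the stored token is still ours; atomic thanks to Lua.
const RENEW_SCRIPT = `
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
  end
  return 0
`;

// Start the heartbeat for a lock this pod holds. If the pod crashes, the timer dies
// with it, the TTL expires, and another pod can acquire the lock and take over.
function startHeartbeat(token: string): NodeJS.Timeout {
  return setInterval(async () => {
    const renewed = (await redis.eval(RENEW_SCRIPT, 1, LOCK_KEY, token, LOCK_TTL_MS)) as number;
    if (renewed === 0) {
      console.warn('Lost the scheduler lock; another pod may have taken over');
    }
  }, HEARTBEAT_MS);
}

// On normal completion: clearInterval(heartbeatTimer) and release the lock as before.
```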

Key Takeaways and Lessons Learned

Building a reliable and scalable notification service demanded addressing several intricate issues such as concurrency control, job duplication, and efficient retries. Along the way, we encountered key challenges and solutions that shaped our understanding of managing distributed systems at scale. Here are the critical lessons we learned:

1. Concurrency Control with Redis Locks

Handling concurrency in a distributed environment, especially with multiple pods running the same service, is complex. By implementing Redis locks, we ensured that only one pod at a time could process a job, thereby preventing duplicate notifications from being sent to users.

  • Locks as a Gatekeeper: Redis locks acted as a protective gate, ensuring that no two pods could push the same notification job at once, effectively eliminating job duplication.
  • Maintaining Consistency: With Redis locks in place, we preserved consistency in how notifications were managed and sent, significantly improving the user experience by preventing repetitive messages.

2. Fault Tolerance with the Heartbeat Mechanism

In any distributed system, fault tolerance is crucial for ensuring uninterrupted service. We incorporated a heartbeat mechanism to enable the system to detect and recover quickly from pod crashes, ensuring jobs were always processed, even if one pod failed.

  • Resilience in Failure: The heartbeat mechanism ensured that if a pod crashed while holding the Redis lock, the system would quickly detect it and allow another pod to take over the job, preventing tasks from stalling.
  • Continuous Operation: This approach made the system highly fault-tolerant, ensuring that notifications were still delivered during high traffic or unexpected failures without service disruptions.

3. Efficient Retry Management with BullMQ

Retries are often necessary in any distributed job scheduling system, but handling retries efficiently is critical to avoid resource waste and unnecessary repetition. BullMQ allowed us to implement a smarter retry mechanism that focused only on failed notifications, leaving successfully processed jobs untouched.

  • Targeted Retries: BullMQ’s ability to target only failed jobs for retry meant that the system didn’t waste resources or risk sending users duplicate messages.
  • Error Monitoring: By setting retry limits, we could monitor persistent failures and send alerts after a set number of unsuccessful retries, enabling us to take proactive action when necessary.

Creating a scalable and reliable notification service was no simple task, but through careful design choices and key technological solutions, we achieved a system that could handle the challenges of distributed architecture. Combining Redis locks for effective concurrency control, BullMQ for efficient job scheduling and retries, and a heartbeat mechanism for fault tolerance, we developed a system capable of managing traffic spikes while preventing job duplication and maintaining reliability.

The secret to success lies in understanding the intricacies of distributed systems and implementing strategies that address core challenges such as concurrency, fault tolerance, and resource efficiency. Whether you’re building a notification service or any other large-scale distributed system, these principles — concurrency control, granular retries, and fault tolerance — are fundamental to ensuring that your system remains robust and reliable as it grows.

By incorporating these lessons into your system design, you’ll be well-prepared to build solutions that not only scale effortlessly but also maintain resilience in the face of unexpected challenges.

Thank you for taking the time to read this article. I genuinely hope you enjoyed it. Don’t forget to take a look at the Elasticsearch Architecture & V8 Engine Series.


If you have any questions or comments, please don’t hesitate to let me know! I’m always here to help and would love to hear your thoughts. 😊
