As promised, we continue our series of articles on data feed delivery issues. The previous post was dedicated to the challenges of feed metadata discovery. Real-time data delivery mechanisms are no less important to feed management, and handling them successfully requires knowing the typical drawbacks that stem from their peculiarities. Today's post therefore deals with the technical challenges of delivering large volumes of data to numerous subscribers in real time via the most common push and pull feed delivery mechanisms.
Pull-Based Feed Delivery
Typically, data feed subscribers in pull-based systems access data through a database, filesystem, or web server interface. Regardless of the interface used, the issues discussed below are common to all of them.
The first challenge a subscriber faces when trying to retrieve a file from a server is learning that the file exists. Two mechanisms are employed for this purpose:
- The subscriber handles the metadata discovery issues in order to find out the file generation patterns and arrival frequency.
- To detect new files, the subscriber polls, at regular intervals, the directories where the feed files are expected to appear.
When real-time retrieval and propagation of feed data is not critical, this method works well enough - the remote directory is scanned for new files at the estimated frequency of file arrival. However, when feed files must be retrieved in real time with a well-defined bound on tardiness, performance problems usually arise. They result from the incessant polling that this requirement makes indispensable, which increases the load on the filesystem.
The issue is more severe in pull-based systems where many subscribers share the same data feeds: the remote directories are scanned simultaneously by a large number of users, which seriously affects performance.
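The polling approach described above can be sketched roughly as follows. This is a minimal illustration rather than any particular system's implementation: the function name `scan_for_new_files`, the file names, and the use of a local temporary directory in place of a remote feed directory are all assumptions made for the example.

```python
import os
import tempfile


def scan_for_new_files(directory, seen):
    """Return names of files in `directory` not yet in `seen`, and record them."""
    # Every polling cycle lists the whole directory, even when nothing is new.
    current = {entry.name for entry in os.scandir(directory) if entry.is_file()}
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


# Each call below stands for one polling cycle; a real subscriber would
# sleep for the polling interval between calls.
with tempfile.TemporaryDirectory() as feed_dir:
    seen = set()
    open(os.path.join(feed_dir, "feed_001.csv"), "w").close()
    first = scan_for_new_files(feed_dir, seen)    # one new file
    second = scan_for_new_files(feed_dir, seen)   # nothing new, full scan anyway
    open(os.path.join(feed_dir, "feed_002.csv"), "w").close()
    third = scan_for_new_files(feed_dir, seen)    # the second file shows up
```

Note that the second cycle costs a full directory listing although it finds nothing. Shortening the interval to reduce tardiness multiplies that cost, and so does every additional subscriber scanning the same directory.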
Push-Based Feed Delivery
Push-based feed delivery mechanisms appeared in response to the performance problems caused by the regular directory polling of the pull-based approach discussed above. Instead of being polled for, the specific feeds are automatically delivered to the subscribers that registered for them. While push-based delivery solves the polling problem, it has its own technical challenges to cope with:
- Reliable feed delivery. Delivery of every data feed file to every subscriber registered for it must be guaranteed. A data feed management system (DFMS) should also be able to withstand subscriber failures. For this purpose, a DFMS needs to track delivery state, keeping a delivery receipt for each file and subscriber.
- Real-time delivery. It is not enough for a DFMS to guarantee eventual delivery of feed data. Delivery should also be timely, with minimal delay. This can be achieved by implementing a data transmission schedule that takes into account the bandwidth and resource limitations on the feed provider's side. In addition, real-time delivery must be guaranteed even when some subscribers fail.
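To make the idea of delivery receipts more concrete, here is a minimal in-memory sketch. The class `DeliveryTracker`, its methods, and the feed and subscriber names are hypothetical, and a real DFMS would presumably persist this state rather than keep it in memory.

```python
from collections import defaultdict


class DeliveryTracker:
    """Tracks, per subscriber, which feed files have been acknowledged."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # feed name -> subscriber ids
        self.published = defaultdict(list)     # feed name -> files in arrival order
        self.receipts = defaultdict(set)       # subscriber id -> acknowledged files

    def subscribe(self, subscriber, feed):
        self.subscriptions[feed].add(subscriber)

    def publish(self, feed, filename):
        self.published[feed].append(filename)

    def acknowledge(self, subscriber, filename):
        # A receipt is recorded only once the subscriber confirms delivery.
        self.receipts[subscriber].add(filename)

    def pending(self, subscriber):
        """Files the subscriber is registered for but has not acknowledged."""
        return [
            f
            for feed, subs in self.subscriptions.items() if subscriber in subs
            for f in self.published[feed]
            if f not in self.receipts[subscriber]
        ]


tracker = DeliveryTracker()
tracker.subscribe("warehouse", "trades")
tracker.subscribe("dashboard", "trades")
tracker.publish("trades", "trades_0900.csv")
tracker.publish("trades", "trades_0905.csv")
tracker.acknowledge("warehouse", "trades_0900.csv")
tracker.acknowledge("warehouse", "trades_0905.csv")
# "dashboard" is down and has acknowledged nothing, so its backlog is
# kept for retry without holding up "warehouse".
```

Because the state is kept per subscriber, a failed subscriber only grows its own pending list. One simplification to be aware of: this sketch also treats files published before a subscription as pending for the new subscriber.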
A common basis for the push mechanism is rsync, an open-source utility. It synchronizes the contents of two directories while minimizing the amount of feed data transmitted, sending only the differences between files via its delta-transfer algorithm. However, rsync has no built-in scheduling, which is critical for real-time data delivery. Scheduling is therefore provided by the Unix cron facility, which runs rsync jobs for all feed subscribers at set intervals. Nevertheless, while solving certain issues, the combination of cron and rsync creates new ones:
- Performance issues. Since there is no mechanism for notifying subscribers that a new file has arrived, they need to scan directories regularly, which degrades performance. It also creates problems for applications that need to process the data as soon as it arrives.
- Longer data transmission time. Since rsync keeps no record of which subscriber received which files, it obtains this information by scanning the local and remote directories. As history data accumulates on both sides, the scan slows down significantly, and so does the data transmission itself.
- Lack of control over the destination directory structure. Essentially, rsync attempts to make the target directory mirror the source one, filling it with as much data as the source holds, even when not all of it is needed. It also creates a risk that partially transferred files get loaded.
- Unwanted delays. The weak point of the cron utility is its lack of prioritized resource management. Moreover, unlike triggered processing, interval-based scheduling is prone to delays and therefore does a poor job of delivery scheduling.
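The cost of working without delivery records can be illustrated with a deliberately simplified sketch. It is not rsync's actual algorithm: it compares files by name and size only, and the helper names `list_files`, `files_to_send`, and `write` are hypothetical. The point it shows is that deciding what to send requires listing both directories in full on every run.

```python
import os
import tempfile


def list_files(directory):
    """Map file name -> size for every regular file in `directory`."""
    return {e.name: e.stat().st_size for e in os.scandir(directory) if e.is_file()}


def files_to_send(source_dir, dest_dir):
    """Names present in the source but missing or different in size at the destination."""
    source = list_files(source_dir)  # cost grows with the source history
    dest = list_files(dest_dir)      # cost grows with the destination history
    return sorted(name for name, size in source.items() if dest.get(name) != size)


def write(directory, name, data):
    with open(os.path.join(directory, name), "w") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("a.csv", "b.csv", "c.csv"):
        write(src, name, "payload")
    write(dst, "a.csv", "payload")  # already delivered in full
    write(dst, "b.csv", "pay")      # shorter copy, e.g. an interrupted transfer
    to_send = files_to_send(src, dst)
```

Here `to_send` contains `b.csv` and `c.csv`, but finding that out meant examining every file on both sides, including `a.csv`, which was delivered long ago. With a large history, that comparison comes to dominate each scheduled run.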
New File Notifications as a Viable Solution
For applications such as visualization systems that need to “know” immediately when new data arrives, the mechanisms mentioned above are not very effective. To eliminate these drawbacks, the data feed management system should notify such applications when a new data file has arrived or is ready to be retrieved.
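A per-file notification mechanism might look roughly like the sketch below. The class `FileNotifier`, its method names, and the feed names are assumptions for illustration; an actual DFMS would deliver notifications over a network rather than through in-process callbacks.

```python
from collections import defaultdict


class FileNotifier:
    """Calls every callback registered for a feed when one of its files arrives."""

    def __init__(self):
        self._callbacks = defaultdict(list)  # feed name -> list of callbacks

    def register(self, feed, callback):
        self._callbacks[feed].append(callback)

    def file_arrived(self, feed, filename):
        # Feeds nobody registered for simply produce no notifications.
        for callback in self._callbacks.get(feed, []):
            callback(feed, filename)


received = []
notifier = FileNotifier()
notifier.register("sensors", lambda feed, name: received.append((feed, name)))
notifier.file_arrived("sensors", "sensors_1200.dat")  # subscriber is told at once
notifier.file_arrived("logs", "logs_1200.txt")        # nobody registered for "logs"
```

The application does no scanning at all: it learns about `sensors_1200.dat` at the moment the DFMS records the arrival.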
Some applications, however, don't need a notification for each newly delivered file, for example, streaming data warehouses that maintain multiple temporally partitioned materialized views over unprocessed feed data. For such applications, per-batch notification is preferable: it lets them recalculate only the set of recently updated partitions once per batch instead of updating the whole warehouse with each new file. An effective DFMS should therefore be able to define the batch limits for each application, in order to trigger notifications according to its needs. However, this task is very complex for dynamic feeds with a changing number of sources.
The point is that a batch defined by a fixed time window may result in delays or unnecessary recalculations for such feeds if the feed composition isn't taken into consideration. So, designing a trigger mechanism for notifications requires two key elements: a specification language to define the notion of a batch, and a method to determine the batch limits effectively, even for highly dynamic feeds.
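One possible way to define a batch by feed composition rather than by a fixed time window is sketched below. It assumes, purely for illustration, that each file can be attributed to a source and a time partition, and that a batch is complete once every currently expected source has delivered for that partition. The class `BatchTrigger` and all names in it are hypothetical.

```python
class BatchTrigger:
    """Fires one notification per partition once every expected source has delivered."""

    def __init__(self, expected_sources, on_batch):
        self.expected = set(expected_sources)
        self.on_batch = on_batch
        self._arrived = {}  # partition key -> set of sources seen so far

    def set_sources(self, expected_sources):
        """The feed composition changed, so re-check the partitions still open."""
        self.expected = set(expected_sources)
        for partition in list(self._arrived):
            self._maybe_fire(partition)

    def file_arrived(self, partition, source):
        self._arrived.setdefault(partition, set()).add(source)
        self._maybe_fire(partition)

    def _maybe_fire(self, partition):
        # The batch is complete when the expected sources are all present.
        if self.expected <= self._arrived[partition]:
            del self._arrived[partition]
            self.on_batch(partition)


fired = []
trigger = BatchTrigger({"nyc", "lon"}, fired.append)
trigger.file_arrived("09:00", "nyc")  # batch still open
trigger.file_arrived("09:00", "lon")  # all sources in: notification fires
trigger.file_arrived("09:05", "nyc")
trigger.set_sources({"nyc"})          # "lon" left the feed: 09:05 is now complete
```

The application is notified once per partition, so it recalculates each partition once instead of once per file. A practical trigger would presumably also need a timeout for sources that never deliver, and a policy for files that arrive after their batch has fired; both are omitted here.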
This post points out the most typical real-time data feed delivery challenges that DFMS vendors face when designing their software. The solutions offered suggest the direction in which to move if you wish to make your product highly competitive and truly effective in meeting clients' demands. With this purpose in mind, you'll surely succeed in figuring out how to achieve the desired results and refine your data feed management solution.