The Internet of Things, Database Systems and Data Distribution, Part 2
In Part 1 of this two-part series, I explained where in the Internet of Things data needs to be collected: on edge devices, gateways, and servers in public or private clouds. And I discussed the characteristics of these systems, as well as the implications for choosing an appropriate database management system technology.
In this article, I’ll talk about legacy data distribution models and how the Internet of Things introduces a new model that you, or your database system vendor of choice, will need to adapt to.
Historically, there have been three common models of data distribution in the context of database systems: high availability, database clusters, and partitioning (sharding).
I touched on sharding in Part 1. Sharding is the distribution of the contents of one logical database over two or more physical databases. Logically, it remains a single database, and it is the responsibility of the database system to maintain the integrity and consistency of the logical database as a unit. The exact way in which data is distributed varies greatly from one database system to another. Some systems delegate (or at least allow delegating) the decision to the application (for example, “put this data on partition three”). Others are at the opposite end of the spectrum, deploying intelligent agents that monitor how data is queried and by which clients, then moving data between partitions to co-locate data that is queried together and/or to place data on the partition closest to the client(s) that use it most often. In every case, the database system must isolate applications from the physical implementation. See Figure 2.
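The application-directed end of that spectrum can be sketched with a simple key-to-partition routing function. This is a hypothetical illustration, not any vendor’s API: the shard count, function names, and in-memory “shards” are all invented for the example, and a stable hash stands in for whatever placement policy a real system would use.

```python
# Hypothetical sketch of application-directed sharding: the application
# (not the database system) decides which physical partition holds a key.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a record key to one of NUM_SHARDS physical partitions.

    A cryptographic hash keeps the mapping stable across processes;
    Python's built-in hash() is salted per process, so we avoid it.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Each "shard" is just an in-memory dict standing in for a physical database.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("sensor-17", {"temp": 21.5})
```

A real database system would hide this routing behind the query interface, which is exactly the isolation from the physical implementation described above.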
The goal of high availability is, as the name suggests, to create redundancy and provide resiliency against the loss of a system that stores a database. In a high availability configuration, there is a master (primary) database and one or more replica (standby) databases to which the overall system can fail over in the event of a master failure. A database system that provides high availability as a service must replicate changes (insert, update and delete operations) from the master to the replica(s) and, in the event of a master failure, provide a means to promote a replica to master. The specific mechanisms vary from one database system to another. See Figures 3a and 3b.
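The master/replica relationship can be sketched as follows. This is a deliberately minimal, hypothetical model: real systems add write-ahead logs, acknowledgments, and leader election, and all class and function names here are invented for illustration.

```python
# Hypothetical sketch of primary/replica high availability: the primary
# forwards every change to its replicas, and a replica can be promoted
# to primary if the primary fails.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.replicas = []

    def apply(self, op, key, value=None):
        """Apply a single insert/update/delete locally."""
        if op == "delete":
            self.data.pop(key, None)
        else:
            self.data[key] = value

    def execute(self, op, key, value=None):
        """Primary entry point: apply locally, then replicate."""
        self.apply(op, key, value)
        for replica in self.replicas:
            replica.apply(op, key, value)

def promote(replica, remaining_replicas):
    """Fail over: the replica becomes the new primary."""
    replica.replicas = list(remaining_replicas)
    return replica

primary = Node("primary")
standby = Node("standby")
primary.replicas.append(standby)

primary.execute("insert", "k1", "v1")
primary.execute("delete", "k1")
primary.execute("insert", "k2", "v2")

# The primary "fails"; the standby already holds an identical copy.
new_primary = promote(standby, [])
```

Because every change flowed through `execute`, the standby is a mirror image of the master at the moment of failover, which is what makes promotion safe.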
The purpose of database clusters is to facilitate distributed computing and/or scalability. Clusters do not use the master/standby concept; each database system instance in a cluster is a peer of every other instance, and all work cooperatively on the contents of the database. There are two main database cluster architectures. See Figures 4a and 4b.
As shown in Figure 4a, in the first cluster architecture (and unlike sharding, where each shard contains a fraction of the entire logical database), each database system instance in the cluster maintains a copy of the entire database. Local reads are extremely fast. Insert, update and delete operations must be replicated to every other node in the cluster, which hurts the performance of each individual write, but the performance of the cluster in the aggregate is better because N nodes share the workload. Nonetheless, this architecture is better suited to read-intensive usage patterns than to write-intensive ones.
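The read/write asymmetry of this architecture can be made concrete with a toy model. Everything here is hypothetical scaffolding for the example: the point is simply that one logical write costs N physical writes, while a read touches only the local copy.

```python
# Hypothetical sketch of a fully replicated cluster (the Figure 4a style):
# every node holds the full database, so reads are local, but every
# write must be applied on all N nodes.

class ClusterNode:
    def __init__(self):
        self.data = {}
        self.writes_applied = 0

    def apply_write(self, key, value):
        self.data[key] = value
        self.writes_applied += 1

    def read(self, key):
        # Served entirely from the local copy -- no network hop.
        return self.data.get(key)

class Cluster:
    def __init__(self, n):
        self.nodes = [ClusterNode() for _ in range(n)]

    def write(self, key, value):
        # One logical write fans out to N physical writes.
        for node in self.nodes:
            node.apply_write(key, value)

cluster = Cluster(3)
cluster.write("k", 42)
```

Counting `writes_applied` across the nodes shows the write amplification that makes this design a poor fit for write-intensive workloads.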
High availability is sometimes combined with sharding, in which case each shard has a master database and a standby database. Because all shards together represent a single logical database, the failure of a node hosting one shard would make the entire logical database unavailable. Adding high availability to a sharded database therefore improves the availability of the logical database by insulating it from node failure.
Replicating databases on edge devices adds a new dimension to data distribution. Recall that with high availability, and, depending on the architecture, with clusters, the content of the entire database is replicated between nodes, and each database is a mirror image of the others. IoT cloud server database systems, however, must receive replicated data from many edge devices: a single logical cloud database contains the contents of many edge device databases. In other words, the cloud database is not a mirror image of a single device database; it is an aggregation of many. See Figure 5.
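The difference between mirroring and aggregation can be shown in a few lines. The device IDs, row shapes, and `upload` function below are invented for the sketch; the point is that the cloud table is a tagged union of many device databases, not a copy of any one of them.

```python
# Hypothetical sketch of edge-to-cloud aggregation: the cloud database is
# not a mirror of any single device database, but a union of all of them,
# with each row tagged by its originating device.

edge_db_a = [{"t": 1, "temp": 20.1}, {"t": 2, "temp": 20.4}]
edge_db_b = [{"t": 1, "temp": 19.8}]

cloud_db = []

def upload(device_id, rows):
    """Replicate a device's local rows into the aggregated cloud table."""
    for row in rows:
        cloud_db.append({"device": device_id, **row})

upload("device-a", edge_db_a)
upload("device-b", edge_db_b)
```

Note that the device identifier is added on ingestion: in a mirror, the replica's schema and contents match the source exactly, whereas an aggregate generally needs an extra key to keep rows from different devices distinguishable.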
Additionally, with replication in a high availability context, the sending node is always the master and the receiving node is always the replica. In the IoT context, the receiver is certainly not a replica of the edge device.
Similarly, with replication in a cluster environment such as the one shown in Figure 4a, the database system must guarantee consistency across every database in the cluster. This requires two-phase commit and synchronous replication; in other words, a guarantee that a transaction succeeds on every node in the cluster or on none of them. Synchronous replication, however, is neither desirable nor necessary for replicating data from the edge to the cloud.
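The all-or-nothing guarantee that a cluster needs can be sketched as a simplified two-phase commit. This is a bare-bones, hypothetical illustration (real 2PC adds persistent logs, timeouts, and coordinator recovery); edge-to-cloud replication, by contrast, would simply queue changes locally and ship them asynchronously, with no prepare/vote round at all.

```python
# Hypothetical, simplified two-phase commit: commit a transaction on
# every participant node, or on none of them.

def two_phase_commit(nodes, txn):
    # Phase 1: ask every participant to prepare (vote).
    # Note: all() short-circuits on the first "no" vote.
    if not all(node.prepare(txn) for node in nodes):
        for node in nodes:
            node.abort(txn)
        return False
    # Phase 2: everyone voted yes, so commit everywhere.
    for node in nodes:
        node.commit(txn)
    return True

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.committed = []

    def prepare(self, txn):
        return self.can_commit

    def commit(self, txn):
        self.committed.append(txn)

    def abort(self, txn):
        pass  # a real participant would roll back prepared state here

good = [Participant(), Participant()]
ok = two_phase_commit(good, "txn-1")

mixed = [Participant(), Participant(can_commit=False)]
failed = two_phase_commit(mixed, "txn-2")
```

The prepare round is exactly the synchronous coordination that is unnecessary when a cloud database is merely aggregating edge data: a lost upload can be retried later without violating any cross-node invariant.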
Thus, the relationship between the sender and the receiver of replicated data in an IoT system is different from the relationship between primary and standby, and from the relationship between peer nodes in a database cluster, so the database system must support this unique relationship.
In Part 1 of this series, I made the statement, “If you want to collect data, you have to collect it somewhere.” But it is not always necessary to actually collect data at the edge. Sometimes an edge device is just a data producer and there is no requirement for local data storage, so a database is unnecessary. In that case, you can consider another means of moving the data: the Data Distribution Service (DDS). DDS is a standard defined by the Object Management Group to “enable scalable, real-time, reliable, high performance and interoperable data exchange.” There are commercial and open source implementations of DDS. Simply put, DDS is publish-subscribe middleware that handles the bulk of transporting data from publishers (e.g., an edge IoT device) to subscribers (e.g., gateways and/or servers).
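The publish-subscribe pattern at the heart of DDS can be sketched as follows. To be clear, this is not the DDS API (which is defined by the OMG specification and differs across implementations); the `Bus` class and topic names are invented solely to show the topic-based decoupling of publishers from subscribers.

```python
# Hypothetical sketch of the publish-subscribe pattern that DDS
# implements: publishers send samples to a named topic without knowing
# who, or how many, subscribers will receive them.

from collections import defaultdict

class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, sample):
        # Deliver to every subscriber of this topic, and only this topic.
        for callback in self.subscribers[topic]:
            callback(sample)

bus = Bus()
received = []
bus.subscribe("sensors/temperature", received.append)   # a "gateway"
bus.publish("sensors/temperature", {"device": "edge-1", "temp": 21.5})
```

This decoupling is what lets an edge device remain a pure data producer: it publishes and forgets, and any number of gateways or servers can subscribe without the device being configured to know about them.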
DDS is not limited to use cases where there is no local storage on an edge device. Another use case is replicating data between two different database systems: for example, between an embedded database used on edge devices and SAP HANA, or between a NoSQL database and a classic relational database management system.
In conclusion, designing an Internet of Things architecture includes consideration of the characteristics and capabilities of the various hardware components and their implications for selecting an appropriate database system technology. A designer must also decide where to collect data, where to process it (for example, to provide command and control of an industrial environment, or to perform analytics that yield actionable information), and what data must be moved, and when. These considerations will, in turn, inform the choice of database systems and replication/distribution solutions.
All contributors to the IoT Agenda network are responsible for the content and accuracy of their posts. Opinions are those of the authors and do not necessarily reflect the thoughts of IoT Agenda.