A PDF version of the full document can be found here

A. M. AbdelRahman
PhD Student
Dept. of Systems and Computer Engineering
Carleton University
Ottawa, ON
Canada K1S 5B6
amoham16@connect.carleton.ca

Abstract

Access to stored data in main memory is much faster than access to local disks/hard drives. Well maintained data structures can utilize such access. In addition, the performance of the memory fetch, dispatch and store operations on Distributed Operating Systems would augment if they exploit a highly scalable distributed data structures. This paper describes the concept of Distributed Data Structures (DDS) which has been made viable by multicomputers. Scalability aspects and their implementation are reviewed for a chosen subset of classic data structures. Open research issues for SDDS are also discussed.

Introduction

Multicomputers are collections of autonomous workstations or PCs on a network (network multicomputers), or of share-nothing processors with local storage linked through a high-speed network or bus (switched multicomputer)<ref> A. S. Tanenbaum, Distributed Operating Systems. Prentice Hall, 1995.</ref>. Recent advances in multicomputers showed that they became in need for new data structures that scale well with the number of components to make effective use of their aggregate performance <ref> W. Litwin, R. Moussa, and T. Schwarz, “Lh*_RS —a highly- available scalable distributed data structure,” ACM Trans. Database Syst., vol. 30, pp. 769–811, September 2005. [Online]. Available: http://doi.acm.org/10.1145/1093382.1093386 </ref>. In addition, low latency interconnected mulicomputers allow significant flexibility to the size of data structure without suffering from space utilization or access time penalties <ref>V. Gupta, M. Modi, and A. D. Pimentel, “Performance evaluation of the lh*lh scalable, distributed data structure for a cluster of workstations,” in Proceedings of the 2001 ACM symposium on Applied computing, ser. SAC ’01. New York, NY, USA: ACM, 2001, pp. 544–548. [Online]. Available: http://doi.acm.org/10.1145/372202.372458</ref>

The rest of this paper is organized as follows. Section 2 presents the work done to implement a distributed version of classical data structures including trees, sets, lists, graphs and queues, binary indexed trees (BITs) and distributed hash tables (DHT). The described work hide the distinction between local and distributed data structure manipulation, simplifying the programming model at some performance cost. Section 3 focuses on scalability issues in distributed data structure models that are based on linear hashing. Section 4 provides a performance parameters dealing with SDDS research outcomes. Finally, in section 5, a conclusion is drawn summarizing the presented work.

Distributed Versions of Classic Data Structures

Hash Tables

A distributed hash table can provide more scalability of throughput and capacity of data more than a locally deployed one. An architecture for a distributed hash table was proposed in <ref name="ref3">S. D. Gribble, E. A. Brewer, J. M. Hellerstein, and D. Culler, “Scalable, distributed data structures for internet service construction,” in Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4, ser. OSDI’00. Berkeley, CA, USA: USENIX Association, 2000, pp. 22–22.</ref> which consisted of the following components:

Client (a web browser for example): a component that is totally unaware of the underlying DDS infrastructure. i.e., DDS parts don't runs on a client.
Service: software processes (each is called a service instance) that are cooperating together.
Hash table API: the boundary between DDS libraries and their service instances.
DDS library: a library (in Java) that presents the API of the hash table to services.
Brick: the component that manages durable data. It consists of a lock manager, a buffer cache, a persistent chained hash table implementation and network stubs as well as skeletons for remote communication. Each CPU in the distributed clusters runs one brick.

(Image)

The partitioning and replication operations work by horizontally partitioning tables to spread data and operations across bricks. Each brick saves a number of partitions of each table in the system and when new nodes are added to the cluster, this partitioning is altered so that the data is spread onto the new node. For avoiding unavailability of portions in the hash table due to node failures, each partition in the hash table is replicated on more than one cluster node. Two problems arise, finding a partition that manages a specific key, and determining the list of replicas in partitions' replica groups. To solve these, the DDS libraries consult two metadata maps (namely data partitioning map DP and replica group membership map RG) that are replicated on each node of the cluster. That pair of metadata maps exist on each hash table in the cluster. The DP map controls the horizontal partitioning of data across the bricks. Whereas the RG map returns a list of bricks that are currently working as replicas. An asynchronous event-driven programming style builds all components of the distributed hash table. Each layer in the hash table is designed so that only a single thread executes at a time. In this manner, distributed hash tables succeed in simplifying the construction of services which exploited the data consistency and scalability of the hash table <ref name="ref3" />.

Another method of distributing hash tables is by distributing their insert and lookup operations (i.e., calling them locally on each processor) <ref> S. Chakrabarti, E. Deprit, E.-J. Im, J. Jones, A. Krishnamurthy, C.-P. Wen, and K. Yelick, “Multipol: A distributed data structure library,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/CSD-95-879, Jul 1995. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/1995/5483.html </ref>. In this method, such local operations will allow each processor to process its local portion of the hash table. No synchronization is to be performed on such local operations and also no definition for any interaction scheme between the concurrent distributed operations. In addition, it distributes the buckets over the processors. In case of collision, chaining is used to resolve it. Moreover, the authors in <ref name="ref4"> S. Chakrabarti, E. Deprit, E.-J. Im, J. Jones, A. Krishnamurthy, C.-P. Wen, and K. Yelick, “Multipol: A distributed data structure library,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/CSD-95-879, Jul 1995. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/1995/5483.html </ref> demonstrated two performance techniques which are a one-way communication and latency masking. They tested their proposed model on a puzzle where their algorithm switches iterations between a defined set of states and a new generated set of valid moves, exploring around 100,000 positions. They compared three implementations for their model:

A simple blocking algorithm where each processor iterates only over its local set and executes insert operations into the hash table. On a 32-node CM5, the process took around 28 seconds.
With pipelining the inserts (where each processor must wait for the synchronization counters), the running time was reduced to approximately 16 seconds.
Finally, by eliminating the acknowledgment traffic and using a global synchronization point, the process took 11 seconds.

To conclude, this multi-ported nature of the hash table trades powerful semantics for good performance.

The main difference between the later method and the former one is not in the management of data but in the manipulation of the metadata. The former method is deemed having higher data availability due to the replication scheme that is implemented. However, it is the hardware resources, software achievements tradeoff.

Replicated List

Replicated lists have great similarities to hash tables <ref name="ref4" />. They also have an iterator for local element traversal. One difference between both approaches arises from the fact that in hash tables, local iterators produce subsets of the elements that are disjoint, whereas the iterator of the replicated list does not. In <ref name="ref4" />, the authors proposed two performance improvements on replicated lists. The first is by aggressive replication of the list. This is acceptable when the list is used to store ONLY object IDs or global pointers. The second is by operating the list over a distributed object library that manipulates the global address space. Obviously, this design of replicated lists works better with large objects that come in relatively smaller sets.

Octree

Octree data structure represents 3D objects as a disjoint union of cubes of varying sizes. Because of their simplicity and non-semantic properties, octrees' structure is easy to manipulate <ref name="ref5">K. Yamaguchi, T. Kunii, K. Fujimura, and H. Toriya, “Octree-related data structures and algorithms,” Computer Graphics and Applications, IEEE, vol. 4, no. 1, pp. 53 –59, Jan. 1984.</ref>. The authors in <ref>P. Sojan Lal, A. Unnikrishnan, and K. Poulose Jacob, “Parallel implementation of octtree generation algorithm,” in Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, Oct. 1998, pp. 1005 – 1009 vol.3. </ref> have proposed a parallel implementation of octrees for the sake of medical purposes. Their proposal focused on the bottom up (figure \ref{Sojan1}) algorithm where they later ported it to a Transputer.

(image)

Upon the implementation of a distributed version of octrees, two challenges arise. The first is how to disseminate the octree nodes between processors while retaining good load balancing and locality. The second is how to decrease the cost of remote node accessing for higher efficiency <ref name="ref5" />. It can be argued that an octree needs to be partitioned across processors to allow equal number of inter-particle interactions in order to achieve good load balancing \cite{Chakrabarti:CSD-95-879}. Besides, it is highly preferable that tree distribution among processors occurs such that each processor gets a complete subtree(s). This will help in preserving locality. Obviously, this approach can guarantee minimum number of remote accesses for the computation entirety <ref name="ref5" />.

Distributed Search Trees (DST)

No class of the usual single site leaf search trees can guarantee a logarithmic bound on the path length <ref name="ref7">B. Krll and P. Widmayer, “Balanced distributed search trees do not exist,” in Algorithms and Data Structures, ser. Lecture Notes in Computer Science, S. Akl, F. Dehne, J.-R. Sack, and N. Santoro, Eds. Springer Berlin / Heidelberg, 1995, vol. 955, pp. 50–61, 10.1007/3- 540-60220-8 50. [Online]. Available: http://dx.doi.org/10.1007/3-540-60220-8 50 </ref>. However, research on DSTs have attracted considerable attention, and lots of seminal work has been suggested based on the distributed random binary search trees <ref>B. Kroll and P. Widmayer, “Distributing a search tree among a growing number of processors,” in Proceedings of the 1994 ACM SIGMOD international conference on Management of data, ser. SIGMOD ’94. New York, NY, USA: ACM, 1994, pp. 265–276.[Online].Available:http://doi.acm.org/10.1145/191839.191891</ref> and the distributed variant of B-trees <ref>W. Litwin, M.-A. Neimat, and D. A. Schneider, “Rp*: A family of order preserving scalable distributed data structures,” in Proceedings of the 20th International Conference on Very Large Data Bases, ser. VLDB ’94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 342–353. [Online]. Available: http://portal.acm.org/citation.cfm?id=645920.672842</ref>. The authors in <ref name="ref7" />. have proofed that the distributed height of a DST that results from a stable distribution for m keys is $\Omega(\sqrt{m})$ for the worst case. They have also proposed a method that builds up a stable distribution satisfying the bound of $O(\sqrt{m})$ on the height of its DST.

An interesting implementation for a DST was proposed by the authors in <ref>A. Di Pasquale and E. Nardelli, “A very efﬁcient order preserving scalable distributed data structure,” in Database and Expert Systems Applications, ser. Lecture Notes in Computer Science, H. Mayr, J. Lazansky, G. Quirchmayr, and P. Vogel, Eds. Springer Berlin / Heidelberg, 2001, vol. 2113, pp. 186–199, 10.1007/3-540-44759-820.[Online]. Available: http://dx.doi.org/10.1007/3-540-44759-8 20 </ref> where each server holds a unique set of keys which has a fixed capacity b. If a server gets added more than b keys, it is said to be in overflow, where it splits to two subtrees. In principle, a virtual distributed tree is build from this splitting operation. Needless to say, for a sequence of m insertions, there will be at most $\lfloor\frac{2m}{b}\rfloor$ splits.

SD-Rtrees

Scalable-Distributed-Rtree (SD-Rtree) is a scalable distributed version of Rtrees <ref name="ref11">C. du Mouza, W. Litwin, and P. Rigaux, “Large-scale indexing of spatial data in distributed repositories: the sd-rtree,” The VLDB Journal, vol. 18, pp. 933–958, 2009, 10.1007/s00778-009-0135-4.[Online].Available: http://dx.doi.org/10.1007/s00778-009-0135-4 </ref>. It uses a distributed balanced binary tree that can scale to any number of storage servers via splitting the overloaded ones. The structure of an SD-Rtree is considered a hybrid approach between an AVL tree and Rtree (where the principle of the data organization is taken). Their kernel structure is defined as a binary tree that is mapped to a set of servers and satisfies the following properties:

All but leaf nodes refers to two children
All but leaf nodes maintain left and right directory rectangle which are the bounding boxes of the left and right subtrees respectively.
All leaf nodes store a subset of the indexed objects.
The height of the two subtrees at any node differs by one at most.

An internal node maintains the ID of its parent node. Each link carries four pieces of information which are: ID, that ID of the server that stores the referenced node, height, the height of the subtree rooted at the referenced node, dr, the directory rectangle of the referenced node and type, either data or routing. Both data and routing define and overlapping coverage and the ID of the parent routing node. Nevertheless, data nodes define the local data sets and the underlying directory rectangle. Whereas the routing node defines a description of the routing node and links to the left and right children.

Balancing is preserved via rotations during a bottom-up height adjusting traversal. In SD-Rtrees particularly, balancing exploits the fact that the rectangle containment relationship allows for more freedom for reorganizing an unbalanced tree.

The authors in <ref name="ref11" />. claim that SD-Rtrees constitute scalable, compact and efficient structure that make use of the processing and storage space of a pool of distributed data servers and that they are particularly important in clusters of servers.

Binary Indexed Trees (BITs)

BITs were developed for maintaining the cumulative frequencies which are needed to support dynamic arithmetic data compression \cite{Fenwick:1994:NDS:182408.182416}. A parallel implementation for BITs was proposed in <ref>A. M. Elhabashy, A. B. Mohamed, and A. E. N. Mohamad, “An enhanced distributed system to improve the time complexity of binary indexed trees,” in World Academy of Science, Engineering and Technology, 2009, pp. 154–159.</ref> to improve the complexity of the read and update queries. This work was achieved by efficiently partitioning the problem space into small distributed fragments which operate concurrently. The amount of communication between the system fragments has been reduced as much as possible in order to obtain a speed up that is near to the maximum enhancement offered by the distributed parallel system. The authors implemented their proposed system on parallel processors for simulation purposes. The code in listing \ref{Habashy1} resembles the function of each processor, where the \emph{check\_zeros} function is to check that all the digits of the \emph{idx} are set to zeros, and the convert_zeros function is to convert all the variables stated in the range given in its parameter list to zeros.

Task Queue

Task queues <ref name="ref4" />. can provide dynamic load balancing of a set of structures used to identify tasks. Task queues have four main functions: insertion of a new task, extraction of an existing task, load balancing and detection of whether the system is idle or not. For the load balancing task, two algorithms have been chosen by the authors. One uses a simple randomized load balancing scheme and the other uses heuristics for locality along with round-robin task pushing. The authors claim that both of them work well in cooperation with the task queue, where locality is not a major concern.

Linear and Dynamic Hashing Based SDDS

Distributed Linear Hashing

Linear hashing, a technique proposed by Witold Litwin in 1988, is a hashing that allows the address space to dynamically expand or contract <ref name="ref14">W. Litwin, “Linear hashing: a new tool for ﬁle and table addressing.” in Readings in database systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988, pp. 570–581. [Online]. Available: http://portal.acm.org/citation.cfm?id=48751.48788</ref>. Hence, it allows a file or a table to support any number of deletions or insertions with deteriorating the memory load performance. Linear hashing works by organizing files into buckets; a collection of buckets forms an LH file. Each collection is addressable through a pair of hashing functions.

Linear hashing can be applied in a distributed manner <ref name="ref15">W. Litwin, M. A. Neimat, and D. A. Schneider, “Lh: Linear hashing for distributed ﬁles,” SIGMOD Rec., vol. 22, pp. 327–336, June 1993. [Online]. Available: http://doi.acm.org/10.1145/170036.170084</ref>. To do so, each bucket is placed at a different server and retains its bucket level in its header \ref{Litwin1}.

(image)

Expansion of linear hashing occurs as an LH file. If a bucket n overflows, it splits and bucket n+1 is created. The definition of a bucket overflow is beyond the scope of this paper (please refer to <ref name="ref14"/> for a detailed description). In addition, if collision occurs, another splitting occurs for the bucket that suffered collision. This scheme was further enhanced by the authors in <ref name="ref16">W. Litwin, M.-A. Neimat, and D. A. Schneider, “Lh*— a scalable, distributed data structure,” ACM Trans. Database Syst., vol. 21, pp. 480–525, December 1996.[Online].Available:http://doi.acm.org/10.1145/236711.236713 </ref> by adding support for load control that performed at no additional message cost for bandwidth preserving. The old mechanism, that lacks load control, gave a load factor of 65-70\%. On the other hand, after adding the refinement, the load factor increased to 80-95\% regardless of the file size.

References

DistOS-2011W Distributed Data Structures: a survey

Contents

Abstract

Introduction

Distributed Versions of Classic Data Structures

Hash Tables

Replicated List

Octree

Distributed Search Trees (DST)

SD-Rtrees

Binary Indexed Trees (BITs)

Task Queue

Linear and Dynamic Hashing Based SDDS

Distributed Linear Hashing

References

Navigation menu