Abstract
Sets and maps are two essential collection types for programming used widely in data analytics [Shaikhha et al., 2022]. The underlying implementation for both are normally based on 1) hash tables or 2) ordered data structures. The former provides (averagecase) constanttime lookup, insertion, and deletion operations, while the latter performs these operations in a logarithmic time. The tradeoff between these two approaches has been heavily investigated in systems communities [Kim et al., 2009].
An important class of operations are those dealing with two collection types, such as setsetunion or the merge of two maps. One of the main advantages of hashbased implementations is a straightforward implementation for such operations with a linear computational complexity. However, naïvely using ordered dictionaries results in an implementation with a computational complexity of O(n log(n)).
Motivating Example. The following C++ code computes the intersection of two sets, implemented by std::unordered_set, a hashtablebased set:
std::unordered_set<K> result;
for(auto& e : set1) {
if(set2.count(e))
result.emplace(e);
}
However, the same fact is not true for ordered data structures; changing the dictionary type to std::set, an ordered implementation, results in a program with O(n log(n)) computational complexity. This is because both the count (lookup) and emplace (insertion) methods have logarithmic computational complexity. As a partial remedy, the standard library of C++ provides an alternative insertion method that can take linear time, if used appropriately. The emplace_hint method takes a hint for the position that the element will be inserted. If the hint correctly specifies the insertion point, the computational complexity will be amortized to constant time.
std::set<K> result;
auto hint = result.begin();
for(auto& e : set1) {
if(set2.count(e))
hint = result.emplace_hint(hint, e);
}
However, the above implementation still suffers from an O(n log(n)) computational complexity, due to the logarithmic computational complexity of the lookup operation (count) of the second set. Thanks to the orderedness of the second set, one can observe that once an element is looked up, there is no longer any need to search its preceding elements at the next iterations. By leveraging this feature, we can provide a hinted lookup method with an amortized constant runtime.
Hinted Data Structures. The following code, shows an alternative implementation for set intersection that uses such hinted lookup operations:
hinted_set<K> result;
hinted_set<K>::hint_t hint = result.begin();
for(auto& e : set1) {
hinted_set<K>::hint_t hint2 = set2.seek(e);
if(hint2.found)
hint = result.insert_hint(hint, e);
set2.after(hint2);
}
The above hinted set datastructure enables faster insertion and lookup by providing a cursor through a hint object (of type hint_t). The seek method returns the hint object hint2 pointing to element e. Thanks to the invocation of set2.after(hint2), the irrelevant elements of set2 (which are smaller than e) are no longer considered in the next iterations. The expression hint2.found specifies if the element exists in set2 or not. Finally, if an element exists in the second set (specified by hint2.found), it is inserted into its correct position using insert_hint.
The existing work on efficient ordered dictionaries can be divided into two categories. First, in the imperative world, there are C++ ordered dictionaries (e.g., std::map) with limited hinting capabilities only for insertion through emplace_hint, but not for deletion and lookup, as observed previously. Second, from the functional world, Adams' sets [Adams, 1993] provide efficient implementations for setset operators. Functional languages such as Haskell have implemented ordered sets and maps based on them for more than twenty years [Straka, 2010]. Furthermore, it has been shown [Blelloch et al., 2016] that Adams' maps can be used to provide a parallel implementation for balanced trees such as AVL, RedBlack, WeightBalanced, and Treaps. However, Adams' maps do not expose any hintbased operations to the programmer. At first glance, these two approaches seem completely irrelevant to each other.
The key contribution of this paper is hinted dictionaries, an ordered data structure that unifies the techniques from both imperative and functional worlds. The essential building block of hinted dictionaries are hint objects, that enable faster operations (than the traditional O(log n) complexity) by maintaining a pointer into the data structure. The underlying representation for hinted dictionaries can be sorted arrays, unbalanced trees, and balanced trees by sharing the same interface. In our running example, alternative data structures can be provided by simply changing the type signature of the hinted set from hinted_set to another implementation, without modifying anything else.
BibTeX  Entry
@InProceedings{shaikhha_et_al:LIPIcs.ECOOP.2022.33,
author = {Shaikhha, Amir and Ghorbani, Mahdi and Shahrokhi, Hesam},
title = {{Hinted Dictionaries: Efficient Functional Ordered Sets and Maps}},
booktitle = {36th European Conference on ObjectOriented Programming (ECOOP 2022)},
pages = {33:133:3},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959772259},
ISSN = {18688969},
year = {2022},
volume = {222},
editor = {Ali, Karim and Vitek, Jan},
publisher = {Schloss Dagstuhl  LeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/16261},
URN = {urn:nbn:de:0030drops162610},
doi = {10.4230/LIPIcs.ECOOP.2022.33},
annote = {Keywords: Functional Collections, Ordered Dictionaries, Sparse Linear Algebra}
}
Keywords: 

Functional Collections, Ordered Dictionaries, Sparse Linear Algebra 
Collection: 

36th European Conference on ObjectOriented Programming (ECOOP 2022) 
Issue Date: 

2022 
Date of publication: 

23.06.2022 