PatTrieSort - External String Sorting based on Patricia Tries

Sven Groppe, Dennis Heinrich, Stefan Werner, Christopher Blochwitz, Thilo Pionteck

Abstract

External merge sort belongs to the most efficient and widely used algorithms to sort big data: As much data as fits inside is sorted in main memory and afterwards swapped to external storage as so called initial run. After sorting all the data in this way block-wise, the initial runs are merged in a merging phase in order to retrieve the final sorted run containing the completely sorted original data. Patricia tries are one of the most space-efficient ways to store strings especially those with common prefixes. Hence, we propose to use patricia tries for initial run generation in an external merge sort variant, such that initial runs can become large compared to traditional external merge sort using the same main memory size. Furthermore, we store the initial runs as patricia tries instead of lists of sorted strings. As we will show in this paper, patricia tries can be efficiently merged having a superior performance in comparison to merging runs of sorted strings. We complete our discussion with a complexity analysis as well as a comprehensive performance evaluation, where our new approach outperforms traditional external merge sort by a factor of 4 for sorting over 4 billion strings of real world data.
Original languageEnglish
JournalOpen Journal of Databases (OJDB)
Volume2
Issue number1
Pages (from-to)36-50
Number of pages15
ISSN2199-3459
Publication statusPublished - 01.10.2015

Research Areas and Centers

  • Research Area: Intelligent Systems
  • Centers: Center for Artificial Intelligence Luebeck (ZKIL)

DFG Research Classification Scheme

  • 409-06 Information Systems, Process and Knowledge Management
  • 409-04 Operating, Communication, Database and Distributed Systems

Fingerprint

Dive into the research topics of 'PatTrieSort - External String Sorting based on Patricia Tries'. Together they form a unique fingerprint.

Cite this