
Given are two sorted arrays a, b of type T with size n and m. I am looking for an algorithm that merges the two arrays into a new array (of maximum size n+m).

If you have a cheap comparison operation, this is pretty simple. Just take from the array with the lowest first element until one or both arrays are completely traversed, then add the remaining elements. Something like this http://stackoverflow.com/questions/5958169/how-to-merge-two-sorted-arrays-into-a-sorted-array
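For reference, the simple merge can be sketched in Python like this (a minimal sketch; the function name is mine):

```python
def simple_merge(a, b):
    """Standard two-pointer merge: at most n + m - 1 comparisons."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:                 # one comparison per element emitted
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    # one side is exhausted; the rest is copied without any comparisons
    result.extend(a[i:])
    result.extend(b[j:])
    return result
```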

However, the situation changes when comparing two elements is much more expensive than copying an element from the source array to the target array. For example you might have an array of large arbitrary precision integers, or strings, where a comparison can be quite expensive. Just assume that creating arrays and copying elements is free, and the only thing that costs is comparing elements.

In this case, you want to merge the two arrays with a minimum number of element comparisons. Here are some examples where you should be able to do much better than the simple merge algorithm:

a = [1,2,3,4, ... 1000]
b = [1001,1002,1003,1004, ... 2000]

Or

a = [1,2,3,4, ... 1000]
b = [0,100,200, ... 1000]

There are some cases where the simple merge algorithm will be optimal, like

a = [1,3,5,7,9,....,999]
b = [2,4,6,8,10,....,1000]

So the algorithm should ideally gracefully degrade and perform a maximum of n+m-1 comparisons in case the arrays are interleaved, or at least not be significantly worse.

One thing that should do pretty well for lists with a large size difference would be to use binary search to insert the elements of the smaller array into the larger array. But that won't degrade gracefully in case both lists are of the same size and interleaved.
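That idea can be sketched with Python's `bisect` module (a sketch under the question's assumption that copying is free; the function name is mine):

```python
import bisect

def merge_by_insertion(small, large):
    """Binary-search each element of the smaller array into the larger one:
    roughly m * log2(n) comparisons for m = len(small), n = len(large).
    (The copying inside insort is O(n), but the question treats copying
    as free and only counts comparisons.)"""
    result = list(large)
    for x in small:
        bisect.insort_right(result, x)   # O(log n) comparisons per element
    return result
```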

Any ideas?

Update for clarification: The only thing available for the elements is a (total) ordering function, so any scheme that makes comparisons cheaper is not possible.

There's no way to do fewer comparisons than in the "simple merge algorithm". You can try to handle edge cases like the first one you mention, but this will worsen the average case. –  Mephy 20 hours ago

@Mephy: enlighten us and give us a formal proof, please. Or if you can't, consider deleting (or at least refining) your comment. –  Doc Brown 19 hours ago

@DocBrown if I had a formal proof, I would give an answer, not a comment. Anyway, it is a pretty obvious linear problem, because trying to find a better-than-linear solution would need at least linear time. –  Mephy 18 hours ago

@Mephy: I suggest you take the time to read the answer below, and think twice about what you wrote. –  Doc Brown 17 hours ago

@Mephy Most things that are obvious ("you can't do multiplication in less than O(n^2)", "if I change which door I picked I won't improve my chances to win a prize", "you can't sort in less than O(n log n)", ...) are wrong. Using a binary search approach on the shorter list, for example, should conceivably improve the average case. –  Voo 16 hours ago

2 Answers

The merge step of the normal merge sort algorithm applies n + m - 1 comparisons, where one list is of size n and the other list is of size m. Using this algorithm is the simplest approach to combining two sorted lists.

If the comparisons are too expensive you could do two things - either you minimize the number of comparisons or you minimize the cost of comparisons.

Let's focus on minimizing the cost of comparisons. Only you can decide whether the data you are comparing can be quantized. Quantizing is essentially an order-preserving hash. For example, if your data is compared by name, you can take the first two characters of each name part: "Klaehn, Ruediger" reduces to "Kl.Ru", and "Packer, The" to "Pa.Th". The ordering is preserved, so you can apply a cheaper comparison to the reduced values. Only when you encounter another "Kl.Ru" do you have a near-tie and need to switch to the more expensive comparison of the full elements.

If you can extract this quantized value from your data faster than a full comparison, this is the first thing to do: compare the quantized or hashed value first. Keep in mind that this value needs to be computed only once, so you can compute it when creating the data element.
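As a sketch of that idea (the key format and function names are mine, and whether a two-character prefix key actually preserves your ordering depends on your data):

```python
def make_key(name):
    """Precompute an order-preserving reduced key once per element:
    'Klaehn, Ruediger' -> ('Kl', 'Ru'), 'Packer, The' -> ('Pa', 'Th')."""
    last, _, first = name.partition(', ')
    return (last[:2], first[:2])

def compare(x, y, key_of):
    """Compare the cheap precomputed keys first; fall back to the
    expensive full comparison only when the keys tie."""
    kx, ky = key_of[x], key_of[y]
    if kx != ky:
        return -1 if kx < ky else 1
    return (x > y) - (x < y)   # expensive path, needed only on key ties
```

Typical usage would be to build `key_of = {n: make_key(n) for n in names}` once, when the elements are created, and hand `compare` to the merge.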

The other route is to minimize the number of comparisons.

I had a look into the classic book TAOCP, Volume 3: Sorting and Searching (pp. 197-207, section 5.3.2), which devotes a full ten pages to this topic. I found two references to algorithms which are faster than n + m - 1 comparisons.

First there is the Hwang-Lin merge algorithm, and second an improvement by Glenn K. Manacher. Both are cited by TAOCP, as is an algorithm by Christen, which approaches the lower bound on the number of comparisons needed, under special conditions on the lengths n and m of the lists.

Manacher's algorithm was presented in the Journal of the ACM, Vol. 26, No. 3, pp. 434-440: "Significant Improvements to the Hwang-Lin Merging Algorithm". The list with m items and the list with n items may be of different lengths, but they must be ordered by the number of elements they contain: m <= n.

The Hwang-Lin algorithm breaks the lists to be merged into smaller sublists, and decides by comparing the first element of each sublist whether the remaining elements of the sublist need to be compared at all. If the first list is much smaller than the second, the chance is high that consecutive elements of the longer list can be transferred into the resulting list without comparison: if the first element of the smaller list is greater than the first element of a block split off the larger list, the whole block in front of it can be copied without comparison.

In "Average case analysis of the merging algorithm of Hwang and Lin" (Vega, Frieze, Santha), Section 2 contains pseudocode for the HL algorithm, which is a lot clearer than my description. You can also see there why fewer comparisons suffice: the algorithm uses binary search to find the index at which to insert the element from the shorter list.

If the lists are not interleaved (unlike your last example), you should in most cases end up with a remaining smaller and a remaining larger list. That is when the HL algorithm starts to perform better.
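To make the block-copying idea concrete, here is a galloping merge in the spirit of Hwang-Lin (a simplified sketch, not the algorithm as published; it uses binary search via Python's `bisect`, and the function name is mine):

```python
import bisect

def galloping_merge(a, b):
    """Repeatedly find, by binary search, how long a run of one array fits
    before the other array's next element, then copy that whole run without
    any further comparisons. Disjoint inputs cost only O(log n) comparisons;
    fully interleaved inputs degrade toward the simple merge."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        k = bisect.bisect_right(a, b[j], i)   # run of a with a[...] <= b[j]
        result.extend(a[i:k])                 # bulk copy, no comparisons
        i = k
        if i < len(a):
            k = bisect.bisect_left(b, a[i], j)  # run of b with b[...] < a[i]
            result.extend(b[j:k])
            j = k
    result.extend(a[i:])
    result.extend(b[j:])
    return result
```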

Thank you for your comment on this - I checked my answer and found that Knuth spends a full ten pages on this topic. Then I took the JACM from my bookshelf and looked there for more. I will improve my answer. No need for downvoting. The hash (quantizer) algorithm is a simple idea that can be applied to many datasets - but only the guy who asked can decide whether it is applicable to his data or not. –  thepacker 19 hours ago

After you improve your answer, everyone who downvoted you will get a chance to upvote you again ;-) –  Doc Brown 19 hours ago

+1 for noting that if the sizes are very different then the standard merge is not optimal. –  Florian F 17 hours ago

Newer answer

I will outline a simple attack. Of course the answer is data-dependent; every answer to OP's question has to be data-dependent.

Let's define a function IndexOf(sequence, query) which returns:

  • k if Compare(sequence[k], query) == EQ,
  • k + 0.5 if Compare(sequence[k], query) == LT and Compare(query, sequence[k+1]) == LT,
  • other values, such as -1 or sequence.Count, as normally expected for out-of-range queries.

Let's say we have two sorted sequences, A and B. Assume A.Count is much larger than B.Count. Furthermore, we look at the range of items of B with respect to A:

  • Let RankOfMinBInA = IndexOf(A, B[0])
  • Let RankOfMaxBInA = IndexOf(A, B[B.Count - 1])

If it turns out that (RankOfMaxBInA - RankOfMinBInA) is much smaller than A.Count, then we know there are some simple (stupid) tricks that can exclude a (proportionally) large number of A's elements from the comparisons entirely.
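A sketch of that trick (function name is mine; assumes B is non-empty, and uses Python's `bisect` for the two rank queries):

```python
import bisect

def merge_with_rank_range(a, b):
    """Locate where b's min and max fall in a (2 * O(log n) comparisons),
    copy a's prefix and suffix without inspecting them, and merge only the
    overlapping window element by element."""
    lo = bisect.bisect_left(a, b[0])     # ~ RankOfMinBInA
    hi = bisect.bisect_right(a, b[-1])   # ~ RankOfMaxBInA
    result = list(a[:lo])                # prefix: no comparisons
    i, j = lo, 0
    while i < hi and j < len(b):         # simple merge, window only
        if a[i] <= b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    result.extend(a[i:hi])
    result.extend(b[j:])                 # any b leftovers precede a[hi:]
    result.extend(a[hi:])                # suffix: no comparisons
    return result
```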


Older answer

Aside from the excellent answer by thepacker, I would add that in general it is useful to learn about the theory behind the bitonic merge algorithm.

In practice, bitonic mergesort is not as efficient as Batcher's odd-even mergesort (which, in turn, is not among the algorithms actually used in practice either). However, gaining an intuitive understanding of why the bitonic merge works provides some insight into what one can do about the OP's question.

  • Lecture page
  • Figure of interest, showing what happens when one tries to overlap one sorted sequence with another, reversely-sorted sequence.

The insight is that when one sequence "A" is sorted in ascending order while the other "B" is sorted in descending order, one can try to find the "crossover point". Note that the "unique crossover point" mentioned in this lecture happens because there was an additional constraint imposed: To find the crossover point that also partitions the two sequences into equal-sized sub-sequences. If the two sequences do not have the same length, as in OP's question, then there is no unique crossover point; instead, just pick one which is most convenient in the construction of your overall algorithmic system.

For two sequences with same length, finding the crossover point that also partitions into equal-sized sub-sequences is O(log N), logarithmic of the length of the sequences. It is a simple binary search.
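For two equal-length sequences, that binary search can be sketched as follows (a sketch; the function name is mine, with `asc` sorted ascending and `desc` sorted descending):

```python
def crossover(asc, desc):
    """First index i where asc[i] >= desc[i]. Since asc is non-decreasing
    and desc is non-increasing, the predicate is monotone in i, so plain
    binary search applies: O(log N) comparisons. Returns len(asc) if the
    predicate never holds. Assumes len(asc) == len(desc)."""
    lo, hi = 0, len(asc)
    while lo < hi:
        mid = (lo + hi) // 2
        if asc[mid] >= desc[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo
```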

In parallel merge-sort, it is important to partition into two equal-sized sub-sequences because then those sub-sequences can then be merged by two separate execution units without communication. In OP's question, there is no mention of parallelism, so this preference doesn't apply.

For the first two cases in OP's question, the crossover point will be at one sequence's end and another sequence's start. This can be taken as a clue that merging is a simple matter of contiguous copying of values.

For the interleaved values case, the crossover point is not useful, unless one uses that information in the construction of a parallel mergesort system.


There is also a curiosity known as "nuts and bolts mergesort" which may be relevant to OP's question. This is typically used as a programmer's interview question. (Because of this, I wouldn't give much detail about it.)

When it is used as an interview question, a careful analysis is usually not expected, so I don't know whether it has any chance of beating the other well-known techniques. For this reason I don't have much to say about it aside from mentioning it.
