Friday, April 29, 2011

Sorted Combination of Multiple Lists

Let L1, L2 and L3 be lists containing n1, n2 and n3 integers respectively, each in sorted order.

The task is to construct a sorted list L such that:

L[0] = L1[0] + L2[0] + L3[0]
L[i] = L1[i1] + L2[i2] + L3[i3]
L[n1 * n2 * n3 - 1] = L1[n1 - 1] + L2[n2 - 1] + L3[n3 - 1]

But n1, n2, n3 are very large and therefore L cannot be constructed in one go and then sorted.
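
For small inputs the target list can still be written down by brute force, which pins down the definition and gives a reference to check any incremental method against. This is only a sketch: the function name `all_sums_sorted` and the sample lists are made up for illustration, and the approach is exactly the "construct everything, then sort" one that does not scale to large n1, n2, n3.

```python
from itertools import product

def all_sums_sorted(*lists):
    """Brute force: every sum of one element per list, sorted ascending."""
    return sorted(sum(combo) for combo in product(*lists))

# Hypothetical small inputs, chosen only for illustration.
L1, L2, L3 = [1, 5, 7], [1, 2, 3], [0, 10]
L = all_sums_sorted(L1, L2, L3)

# The properties stated in the task, with 0-based indexing.
assert len(L) == len(L1) * len(L2) * len(L3)
assert L[0] == L1[0] + L2[0] + L3[0]
assert L[-1] == L1[-1] + L2[-1] + L3[-1]
```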

Therefore the list has to be constructed in stages, so that we can display the top k integers, save the state of the computation, and later resume by computing the (k+1)th integer.

Which data structures and algorithms can be used to achieve this?

From Stack Overflow:
  • Can't you just use a modified merge sort, since you already have three sorted lists? (By "modified" I mean something that takes advantage of the fact that you know that each input list is already sorted.)

    Assuming you cannot use a merge sort directly, because you don't want to compute the entire merged sorted list in memory, how about this: use a modified merge sort where you calculate the first group of merged entries and display those, keeping the pointers used in the merge. You just persist where you are in each list, with one pointer to the current location in each list, and pick up where you left off for each chunk.

  • OK, maybe I will be flamed for this answer. But since you only need the algorithm, the best solution would be to traverse every list at the same time, building the result list with the best element (in this case the lowest, or whichever you prefer in a tie). With this method you have 4 positions: one for every list you are traversing, and a last one pointing to the position in the result list where you need to insert (or to the last position inserted). With this, the only structure you need is a list.

    I see a problem with merge sort in this case. The data you display might not be exact, since the next portion still has to be sorted and could interleave with the current one.

  • OK, first an example in two dimensions:

        1  2  3
    
    1   2  3  4
    5   6  7  8
    7   8  9 10
    

    You start in the top left corner, obviously, and put the value into the result list. Next, you have to add all candidates that are reachable (through incrementing exactly one index) from there to some sort of sorted collection, here, that is the cells with the values 3 and 6. Then you take the lowest member out of that collection, put its value into the result list, add all candidates reachable from there that are not yet in the collection into that, and so on.

    You will need:

    • a data structure holding a candidate, with all indices and the result value (I represent that below as "((i1 i2) value)").
    • a data structure for a collection of candidates, sorted by value. A heap seems ideal for that.

    You will have to make sure that all candidates are unique by their indices when you put them into the collection. The values are not necessarily unique, but the heap should be sorted by them. Since a given set of indices always produces the same value, you only have to check for uniqueness of the indices when you encounter that value while inserting into the heap. It might be an optimization to make the nodes of the heap not single candidates but lists of candidates with the same value.

    Doing this with the above example: First, the result list is (2). The candidates are ((1 2) 3) and ((2 1) 6). Take the candidate with the lowest value out, put the value into the result list -> (2 3), find all new candidates' coordinates -> (2 2) and (1 3), calculate their values -> ((2 2) 7) and ((1 3) 4), put them into the candidates' heap (serialized representation here) -> ((1 3) 4) ((2 1) 6) ((2 2) 7), lather, rinse, repeat.

    In tabular form:

    result-list          candidates
    (2)                  ((1 2) 3) ((2 1) 6)
    (2 3)                ((1 3) 4) ((2 1) 6) ((2 2) 7)
    (2 3 4)              ((2 1) 6) ((2 2) 7) ((2 3) 8)
    (2 3 4 6)            ((2 2) 7) ((3 1) 8) ((2 3) 8)
    (2 3 4 6 7)          ((3 1) 8) ((2 3) 8) ((3 2) 9)
    (2 3 4 6 7 8)        ((2 3) 8) ((3 2) 9)
    (2 3 4 6 7 8 8)      ((3 2) 9) ((3 3) 10)
    (2 3 4 6 7 8 8 9)    ((3 3) 10)
    (2 3 4 6 7 8 8 9 10)
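
The walk-through above can be sketched in Python with the standard `heapq` module. This is a minimal sketch under stated assumptions, not a definitive implementation: the name `sorted_sums` is made up, indices are 0-based (the text uses 1-based), every input list is assumed non-empty and sorted ascending, and uniqueness is checked with a plain set of index tuples rather than only on equal values.

```python
import heapq

def sorted_sums(*lists):
    """Lazily yield (value, indices) for all sums of one element per list,
    in ascending order of value. Indices are 0-based tuples."""
    start = (0,) * len(lists)
    heap = [(sum(lst[0] for lst in lists), start)]
    seen = {start}  # index tuples that have already been pushed
    while heap:
        value, idx = heapq.heappop(heap)
        yield value, idx
        # Candidates reachable by incrementing exactly one index.
        for d, lst in enumerate(lists):
            if idx[d] + 1 < len(lst):
                nxt = idx[:d] + (idx[d] + 1,) + idx[d + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    new_value = value - lst[idx[d]] + lst[idx[d] + 1]
                    heapq.heappush(heap, (new_value, nxt))

# The two-dimensional example: row headers 1, 5, 7 and column headers 1, 2, 3.
rows, cols = [1, 5, 7], [1, 2, 3]
print([v for v, _ in sorted_sums(rows, cols)])
# prints [2, 3, 4, 6, 7, 8, 8, 9, 10]
```

Candidates with equal values may come out in a different order than in the table, because the heap here breaks ties by comparing index tuples; the sequence of values is the same.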
    

    I don't see a better way at the moment. The heap seems to need a number of nodes in the magnitude of the sum of n1, n2, and n3.
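
To address the question's requirement of showing k results, saving the state, and resuming later, the heap and the set of visited indices can be kept as plain data and serialized. The sketch below is one possible arrangement, not the answerer's code: the class `SumEnumerator`, its methods `take`, `state` and `from_state`, and the sample lists are all hypothetical, and it assumes the input lists themselves are stored elsewhere and supplied again on resume.

```python
import heapq
import pickle

class SumEnumerator:
    """Resumable ascending enumeration of sums (hypothetical helper)."""

    def __init__(self, lists, heap=None, seen=None):
        self.lists = lists
        start = (0,) * len(lists)
        # Fresh start unless a saved heap and seen set are supplied.
        self.heap = heap if heap is not None else [(sum(l[0] for l in lists), start)]
        self.seen = seen if seen is not None else {start}

    def take(self, k):
        """Return the next k sums (fewer if the enumeration runs out)."""
        out = []
        while self.heap and len(out) < k:
            value, idx = heapq.heappop(self.heap)
            out.append(value)
            for d, lst in enumerate(self.lists):
                if idx[d] + 1 < len(lst):
                    nxt = idx[:d] + (idx[d] + 1,) + idx[d + 1:]
                    if nxt not in self.seen:
                        self.seen.add(nxt)
                        new_value = value - lst[idx[d]] + lst[idx[d] + 1]
                        heapq.heappush(self.heap, (new_value, nxt))
        return out

    def state(self):
        """Plain-data snapshot: only the heap and the visited index tuples."""
        return {"heap": list(self.heap), "seen": set(self.seen)}

    @classmethod
    def from_state(cls, lists, state):
        return cls(lists, heap=list(state["heap"]), seen=set(state["seen"]))

# Hypothetical inputs, chosen only for illustration.
lists = ([1, 5, 7], [1, 2, 3], [0, 10])
e = SumEnumerator(lists)
first = e.take(4)                   # display the first k = 4 results
saved = pickle.dumps(e.state())     # bytes that could be written to disk
resumed = SumEnumerator.from_state(lists, pickle.loads(saved))
rest = resumed.take(100)            # continues with the 5th result onward
```

Note that in this sketch the set of visited indices grows with the number of results already produced, so the saved state is not bounded by the heap size alone.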
