Friday, April 8, 2011

How much is it fair to assume about implementation when doing big-O calculations?

When programmers versed in a wide variety of languages are discussing the merits of an algorithm expressed in pseudocode, and the talk turns to efficiency, how much should be assumed about the performance of the language ultimately used?

For example, when I see something like:

add x to the list of xs

I see an O(1) operation (cons), while someone else may see an O(n) array copy-to-extend or worse.
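
To make the two readings concrete, here's a minimal Python sketch (my own illustration, not part of the original discussion): prepending a cons cell touches one node, while the pessimistic "copy-to-extend" reading touches every existing element.

class Cons:
    """A linked-list cell: adding to the front is O(1)."""
    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail

def add_cons(x, xs):
    return Cons(x, xs)  # O(1): allocate one cell, no copying

def add_copy_to_extend(x, xs):
    """The pessimistic reading: a fixed-size array that is copied
    into a new, larger array on every append, so O(n) per add."""
    new_xs = [None] * (len(xs) + 1)
    for i, item in enumerate(xs):  # copy every existing element
        new_xs[i] = item
    new_xs[-1] = x
    return new_xs

xs = add_cons(3, add_cons(2, add_cons(1, None)))  # constant work per add
ys = add_copy_to_extend(4, [1, 2, 3])             # linear work per add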

The harder you look, the more such cases there are, at least potentially. For example,

add x to the running total

may look O(1) to most of us (integer addition, right?) but it could plausibly be O(ln(max(x,total))) if the numbers were very large (BigNums). Or it could be O(x) if add were implemented as

b = x
while b > 0
    inc total
    dec b

and I'll bet there are even worse ways to do it.
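
For the BigNum case, a rough sketch (illustrative only, not taken from any particular library) of schoolbook addition over digit lists shows where the O(ln(max(x,total))) comes from: the loop does work proportional to the number of digits.

def bignum_add(a_digits, b_digits, base=10):
    """Schoolbook addition of two numbers stored as lists of digits,
    least significant digit first. The loop runs once per digit, so
    the cost grows with the number of digits, i.e. with the log of
    the values involved; it is not O(1)."""
    result, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        d = carry
        d += a_digits[i] if i < len(a_digits) else 0
        d += b_digits[i] if i < len(b_digits) else 0
        result.append(d % base)
        carry = d // base
    if carry:
        result.append(carry)
    return result

assert bignum_add([9, 9], [5, 1]) == [4, 1, 1]  # 99 + 15 = 114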

So what's reasonable? Always assume the most efficient possible implementation (even though no extant language provides it) or...?


Edit:

To stir the pot some more here:

I note several people saying you should state your assumptions.

Fine, but suppose I assume (as I do) that advancing to the next line of code is free. Should I have to state that? Someone could, after all, be imagining some sort of mechanical implementation where going to the next line is a time-consuming step. Until they bring it up, I might never have considered the possibility.

I'd contend for this reason that it isn't possible to state all your assumptions.

So my question is really "since there are some assumptions you can make without having to state them, and some you can't, where do you draw the line?"

From stackoverflow
  • While some caveats about implementation might be appropriate, I think if you're trying to compare the merits of algorithms as algorithms, the meat of it is in the optimal case. Once you know the theoretical boundaries of performance, there's another set of problems in trying to get your implementation to approximate that, but you're never going to exceed your theoretical bound.

    Jon Skeet : But you shouldn't assume that *every* operation will be optimal for the same type. For example, different lists give constant-time insertion anywhere if you've got a reference to one node (linked list) *or* constant-time index access (array list), but not both at the same time.
    Jon Skeet : (Continued) Other implementations will be balanced between the two (e.g. O(log n)), etc. If you assume both operations will be constant, you may miss an algorithm which is faster in *practical* big-O terms. (A sketch of this trade-off follows these comments.)
    chaos : Yeah. You can assume an implementation that's as good as possible, but no better. :)
    MarkusQ : But (to stir the pot) ought you assume that all instances of the same abstract "type" have the same concrete implementation? Is it reasonable to assume three collections, one which is fast to add to, one that is fast to index into, and one that is fast to search?
    chaos : Sure, why not? If your algorithm would benefit from that, then that's part of its performance boundaries.
    David Thornley : O(n) limits are usually for the worst case.
    MarkusQ : Worst case data in the best case implementation.
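
    To illustrate the trade-off discussed in the comments above, here is a minimal Python sketch (my own example, assuming CPython's usual container behaviour): each standard container is fast at some operations and slow at others, so "optimal" can't mean optimal at everything simultaneously.

    from collections import deque

    xs_list = list(range(1_000_000))   # array-backed list
    xs_deque = deque(xs_list)          # linked blocks

    # Array-backed list: O(1) random access, O(n) insert at the front.
    xs_list[500_000]         # fast: direct indexing
    xs_list.insert(0, -1)    # slow: shifts every element

    # Deque: O(1) insert at either end, O(n) access in the middle.
    xs_deque.appendleft(-1)  # fast: constant time at the ends
    xs_deque[500_000]        # slow: walks block by block to the middle
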
  • I think it's reasonable to assume a normal, but not best possible, implementation, unless you are already aware of special scenarios such as the one you described. The difference between the two should be constant. If you always assume the worst, as in your example, I doubt you could do any big-O calculations at all.

  • I think it's okay to make assumptions, so long as you state them where there's any cause for doubt. I would assume without statement that adding fixed-size integers is a constant time operation, but as there are so many wildly different implementations of lists (depending on what you need) it's worth stating the assumptions there. Likewise adding variable-size integers would be worth calling out explicitly - particularly because it may not be obvious when just reading through it relatively casually.

    That way when you consider the algorithm in a concrete setting, you can test the assumptions before starting the implementation. It's always nice if the API you're working with (collections etc) states the complexity of operations. Java and .NET are both getting quite good on this front, IIRC.

  • Interesting question.

    In my opinion, the steps involved in the algorithm in question should not be complex algorithms themselves. In other words, if "add x to the list of xs" is a complex operation, it needs to be fully specified, and not just a single sentence. If it is expressed as a single, straightforward sentence, then assume it is appropriately fast (e.g. O(1) or O(log n) or similar, depending on the apparent operation).

    Another case where this is clear is in string-building. Sure, these days we have the string-builder class, and understand the problem well. But it was only a few years ago that I saw an application brought to its knees because the developer coded a simple "append thousands of small strings together to make a big string", where each append did a full memory copy.
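
    That trap is easy to reproduce. A small Python sketch (illustrative, not the application described above) contrasting repeated concatenation with a builder-style join:

    def build_naive(parts):
        """In general each concatenation may copy the whole string built
        so far (some runtimes optimize this), so the total work can be
        O(n^2) in the combined length."""
        s = ""
        for p in parts:
            s = s + p
        return s

    def build_with_builder(parts):
        """Collect the pieces and join once at the end: O(n) total."""
        return "".join(parts)

    parts = ["chunk"] * 10_000
    assert build_naive(parts) == build_with_builder(parts)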

  • I believe the first step of doing big-O calculations is to decide which actions you consider elementary. In order to do that, you can stand by the common point of view for standard actions (+, -, *, / are elementary, as are insertions into lists, and so forth).

    But in particular cases (those you mentioned, for instance), these actions are not elementary any more and you have to make new assumptions. You can make them based on simple tests you have run, or just from reasoning. Anyway, whatever works for you is good!

  • The answer is to look at your use cases, and analyze in terms of them, rather than in the absolute.

    What kind of data are you going to be consuming in the real world? If you're going to be crunching longs, assume constant time addition. If you're going to be crunching lots of unbounded BigNums, that's not a good assumption.

    Does a list fit with the rest of your algorithm, or do you need lots of random access? An O(n) array copy-to-extend is a worst-case scenario, but is it legit to amortize your analysis and consider what happens if you double the array size each time (giving you an amortized O(1) cost)? (See the sketch after this answer.)

    Being able to answer these questions in terms of your own algorithms makes you a good algorithmatist. Being able to answer them in terms of algorithms you're considering for a project you're working on makes you a responsible implementer.

    If your algorithm is being designed purely in the abstract, then be straightforward and explicit in your assumptions. If you assume constant time addition, just mention it. If you're doing amortized analysis, be sure you mention that.

    MarkusQ : It was a discussion here that prompted the question, and I think we all thought we were being clear in our assumptions. But there comes a point when you really don't want to have to type "and I also assume that checking if N is odd is O(1)" for every operation.
    rampion : True enough. Lots of common assumptions can be derived from simple statements - if you mention constant data size, we can assume constant addition.
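
    As a concrete version of the amortized-doubling point in the answer above, here is a toy Python sketch (my own illustration) of a growable array: an individual append that triggers a resize costs O(n), but doubling the capacity keeps the cost of n appends at O(n) overall, i.e. amortized O(1) each.

    class GrowableArray:
        """Toy growable array. Doubling the backing storage on resize
        makes append amortized O(1), even though the occasional
        resizing append is O(n)."""
        def __init__(self):
            self._items = [None] * 4  # backing storage
            self._size = 0

        def append(self, x):
            if self._size == len(self._items):
                # Resize: copy everything into storage twice as large.
                bigger = [None] * (2 * len(self._items))
                for i in range(self._size):
                    bigger[i] = self._items[i]
                self._items = bigger
            self._items[self._size] = x
            self._size += 1

    arr = GrowableArray()
    for i in range(100):
        arr.append(i)   # occasional O(n) resizes, O(1) amortized
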
  • Assume fair to decent performance. If no guarantees are provided in some documented form, assume as little as possible, but it should be safe to expect that the basic data structures and their operations perform as taught by the computer science department.

    But what language you're running kind of matters.

    A good C programmer is able to compile the code in his head. Meaning, he is able to glance at the code and get a fairly good approximation of what it would look like in machine code.

    A Haskell programmer can't do that. It's the exact opposite. A Haskell programmer is going to care more about the complexity of his function and with good reason.

    However, the Haskell programmer will never write code that's going to outperform what the C programmer writes. It's a misunderstanding among many that this is the case. Programmers who know how to write fast code on modern hardware are superior in the performance segment. However, it would be fair to assume that the Haskell programmer will be more productive.

    There's a sizable overhead to any programming language, and notably to managed environments. This doesn't make them slow, but it makes them slower. The less hardware-specific the code, the slower the execution.

    There are plenty of ways you can write code to eliminate performance bottlenecks without modifying your original algorithm, and a lot of this has to do with how the hardware is expected to execute your code.

    What you assume about code and its performance has to be based on measurements and profiling. It's simply not fair to disregard the hardware performance, which big-O notation does (I'll admit it's very theoretical, but not that practical).

  • I don't think it is helpful, when discussing the efficiency of algorithms, to go into the details of the different languages that can be used (or OS or hardware architecture or ...).

    More generally, it's always good to isolate the different aspects of a question, even if they influence each other.

    My experience is that in most cases, when it comes to complex algorithms, the most efficient one will work well in most language implementations. Optimization for the particular language is, in most cases, a matter of detail. When discussing the algorithm in pseudocode, the details should express the what in the most understandable way.

    Your examples are such details, parts used in complex algorithms, and can be discussed and adjusted during implementation in some language. When discussing those details, pseudocode can be used, but the syntax of the actual language(s) would do a better job.

  • I usually assume the best implementation, and explicitly mention the assumed implementation if necessary (e.g. use a HashSet, or a BST). I would define the data structures being used, and not leave them ambiguous (i.e. I'd clearly state that this is an array or a linked list). This would give a clear view of the order of any internal operations.

    I also consider basic, elementary operations to be O(1), or otherwise I clearly state that. Adding two numbers is O(1), unless we know that our numbers will cause overflow for primitive data types. For example, when assessing a shortest path algorithm, you don't consider the addition of costs to be a heavy operation. In my opinion, we are assuming that all similar algorithms in one category will have the same problem if numbers are larger than a certain bound, and thus they will still be comparable.

    Note that we (should) use the big-O notation to compare algorithms, rather than calculate their running time. For this purpose, simple operations that need to be performed in all algorithms of a certain category (e.g. addition) can safely be assumed to have O(1). It's also acceptable to say that the cost of mathematical and logical operations is almost the same. We say a linear search is O(n), and a linear sum is also O(n). So, following this logic, all elementary operations can be thought of as O(1).

    As a final note, what if we write an algorithm that requires 128-bit, or even 256-bit integers? I believe we should not assume it will be implemented with a BigInteger, because in a few years, these data types will probably be mainstream. Think about the same idea for algorithms that were invented in the 1960's, and you'll see how the same logic applies.

  • I would also assume the best implementation. It usually doesn't matter what it really is unless it becomes a bottleneck, in which case you should be using a better data structure anyways.

  • In some applications, hashtable inserts/reads are considered O(1) as well. It depends on what your baseline is.

    For something even as 'simple' as sorting, if you're comparing sorts of large data that won't fit inside memory, it might be worth considering the access time for the hard disk as well. In this case, mergesort is much better than almost all of the other algorithms.

    In other cases, specialized hardware may make some algorithms perform much better, by allowing things to be done in constant time that might normally take O(n).

    Hosam Aly : Can you support your claim that mergesort is better for disk access? I have seen a presentation by Jon Bentley before, in which he showed that quicksort is better than mergesort when it comes to memory or disk access.
    FryGuy : Hey, this isn't Wikipedia :). And I didn't specifically mention it being faster than Quicksort. It may, or may not be. However, it will be much faster than something like heapsort, which requires a lot of random access, even though they are both O(n log n).
  • Not exactly an answer to the question, but this question reminds me how lucky it is that somebody had already invented RAM (RANDOM Access Memory). The question would be much more difficult without it.

    I think big-O notation is not an absolute property of the system. Any big-O statement needs to come with assumptions about what kind of input is of interest and expected to grow.
