Friday, February 11, 2011

Time spent on reading from file

I am doing this in C++:

    if (myfile.is_open()){
        while (! myfile.eof()){
            getline (myfile,line);
            DO STUFF
        }
        myfile.close();
    }
    else{
        cout << "Unable to open file";
    }

I am trying to read the lines from a text file and do stuff with each one, and I want to see what the run time of my algorithm would be. Would reading lines from the file slow down my program so much that I won't get an accurate result?

  • Yes - if you want to accurately benchmark "DO STUFF" then you shouldn't include disk I/O in the measurement. One thing you could do is buffer the whole file into memory, then process it and time only that (a sketch follows this answer). But if the file is too large, or if that would make your line handler more difficult, another thing you could do is read the file into memory line by line 10 times or so WITHOUT processing, average those times, then time the function the way you have it now (including processing and disk I/O), and subtract the average disk-reading time from the total time.

    Edit: I don't know why I didn't think of this before, but you could also just put a timer around the execution of "DO STUFF" and add to a sum after each execution.

    roe : Careful though: reading the file beforehand will cause I/O caching to play a large role. Working on dummy data is better.
    danben : Right, that was why I suggested reading the file into memory a lot of times. Maybe the median time would be better than the average.
    SuperString : what's the best way to put it into memory? Put the lines as strings in an array?
    danben : Yes, that would work just fine
    From danben
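
    A minimal sketch of the buffer-into-memory approach suggested above, holding the lines in a std::vector<std::string> and timing with std::chrono (the file name and the timing choices here are illustrative assumptions, not anything prescribed in the answer):

        #include <chrono>
        #include <fstream>
        #include <iostream>
        #include <string>
        #include <vector>

        int main() {
            // Load every line into memory first, so disk I/O is excluded from the timing.
            std::vector<std::string> lines;
            std::ifstream myfile("input.txt");   // hypothetical file name
            for (std::string line; std::getline(myfile, line); )
                lines.push_back(line);

            // Time only the processing pass over the in-memory lines.
            auto start = std::chrono::steady_clock::now();
            for (const std::string& line : lines) {
                // DO STUFF with line
            }
            auto stop = std::chrono::steady_clock::now();

            std::cout << "processing took "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                      << " ms\n";
        }
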
  • From the pseudocode you've pasted I assume you are "doing stuff" on each line of the file. If the time your algorithm needs to process a single line is considerably longer than the time required to read that line from the file, then you can just ignore disk I/O. Otherwise, read the file into a vector of lines before running your algorithm, and measure only from that point.

  • while (! myfile.eof()){
      getline (myfile,line);
      ...
    }

    is not the correct way to read a file - you want:

    while ( getline (myfile,line) ){
      ...
    }

    For a detailed explanation of why this is so, you might want to take a look at my blog at http://punchlet.wordpress.com/

    From anon
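
    To make the corrected loop concrete, here is a rough, self-contained sketch (the file name is hypothetical). The key point is that getline reports failure through the stream itself, so the loop body never runs on a read that failed at end of file, which is exactly what the eof() test gets wrong.

        #include <fstream>
        #include <iostream>
        #include <string>

        int main() {
            std::ifstream myfile("input.txt");   // hypothetical file name
            if (!myfile.is_open()) {
                std::cout << "Unable to open file";
                return 1;
            }
            std::string line;
            // getline returns the stream, which tests false once a read fails,
            // so the last (failed) read never reaches the loop body.
            while (std::getline(myfile, line)) {
                // DO STUFF with line
            }
        }
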
  • Disk I/O times vary widely due to a large variety of factors including rotational speed, location on the disk, fragmentation, etc. Accessing a file twice in a very short time will usually result in the second access completing significantly faster than the first. For these reasons, you should not include I/O speed in any formal benchmarking tests.

    Are you expecting to measure performance externally, by instrumentation, or using local time accumulators? If you're doing it externally, you're pretty much screwed. If you're using instrumentation, then the post-processor program should be able to provide you with a per-method breakdown, so you can simply eliminate those methods that do I/O (assuming you've structured the code to do that). If you're using local accumulators, simply place calls to the time functions around the major blocks you're interested in and have the program print out the total accumulator(s) at the end (a sketch of this follows the comments below).

    SuperString : what are the differences between the three?
    jfawcett : External means you're timing the program from the outside. Something like using 'time ' to get the execution time. Instrumented means you're using a program like AppVerifier for Windows. They're called Performance Tools and there are lots of them. Local accumulators mean you have one or more internal variables that accumulate time. You call, for example, gettimeofday() before and after the interesting section and calculate and accumulate the differences. Can be accurate to the line of code, but is usually way overkill.
    From jfawcett
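
    A minimal sketch of the local-accumulator approach, using the gettimeofday() call mentioned in the comment above (POSIX-specific; the file name and the helper function are just placeholders):

        #include <sys/time.h>

        #include <fstream>
        #include <iostream>
        #include <string>

        // Current wall-clock time in seconds, via POSIX gettimeofday().
        static double now_seconds() {
            timeval tv;
            gettimeofday(&tv, 0);
            return tv.tv_sec + tv.tv_usec / 1e6;
        }

        int main() {
            std::ifstream myfile("input.txt");   // hypothetical file name
            std::string line;
            double stuff_seconds = 0.0;          // local accumulator for "DO STUFF" only

            while (std::getline(myfile, line)) {
                double t0 = now_seconds();
                // DO STUFF with line
                stuff_seconds += now_seconds() - t0;
            }
            std::cout << "time spent in DO STUFF: " << stuff_seconds << " s\n";
        }
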
  • If you want to speed up your program, read more data with each I/O request. In other words, instead of reading one line at a time, read 1MB of data into a buffer and get your text lines out of the buffer (a rough sketch follows this answer). Accessing internal memory is always faster than reading from a file.

    If your file resides on a hard drive, another method is to read often enough to keep the hard drive spinning. Much of the overhead in reading from a file is waiting for the hard drive to get up to speed. Hard drives love to keep reading and hate to stop and then start again; think "an object in motion tends to stay in motion."

    Also research double buffering. This technique allows data to be read into one buffer while the processor is searching another.

    SuperString : how do you read enough so that your hard drive keeps spinning?
    Thomas Matthews : Very tricky, depends on the I/O channel, whether it is external or not and the rotation speed of the platters. One method may be to send it enough requests to keep it busy.
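
    A rough sketch of the read-in-large-chunks idea: pull 1 MB per read request and split lines out of the in-memory buffer yourself (the file name, buffer size, and line handling are illustrative assumptions):

        #include <fstream>
        #include <iostream>
        #include <string>
        #include <vector>

        int main() {
            std::ifstream myfile("input.txt", std::ios::binary);   // hypothetical file name
            std::vector<char> buffer(1 << 20);                     // 1 MB per read request
            std::string partial;                                   // a line may straddle two chunks

            while (myfile) {
                myfile.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
                std::streamsize got = myfile.gcount();             // bytes actually read this time

                for (std::streamsize i = 0; i < got; ++i) {
                    if (buffer[i] == '\n') {
                        // "partial" now holds one complete line: DO STUFF with it
                        partial.clear();
                    } else {
                        partial += buffer[i];
                    }
                }
            }
            if (!partial.empty()) {
                // DO STUFF with the final line (the file may lack a trailing newline)
            }
        }
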
