Sunday, January 23, 2011

Maximum number of files in one ext3 directory while still getting acceptable performance?

I have an application writing to an ext3 directory which over time has grown to roughly three million files. Needless to say, reading the file listing of this directory is unbearably slow.

I don't blame ext3. The proper solution would have been to let the application code write to sub-directories such as ./a/b/c/abc.ext rather than using only ./abc.ext.

I'm changing to such a sub-directory structure and my question is simply: roughly how many files should I expect to store in one ext3 directory while still getting acceptable performance? What's your experience?

Or, in other words: assuming that I need to store three million files in the structure, how many levels deep should the ./a/b/c/abc.ext structure be?

Obviously this is a question that cannot be answered exactly, but I'm looking for a ballpark estimate.

  • http://en.wikipedia.org/wiki/Ext3#Functionality - This mentions that a directory can only have approximately 32000 subdirectories, but makes no mention of files.

    http://roopindersingh.com/2008/05/10/ext3-handling-large-number-of-files-in-a-directory/

    Also, I hate Experts Exchange, but I read a comment on this question that it's ideal to have fewer than 10,000-15,000 files per directory.

    From bradlis7
  • I would suggest you try testing various directory sizes with a benchmarking tool such as postmark, because there are a lot of variables like cache size (both in the OS and in the disk subsystem) that depend on your particular environment.

    My personal rule of thumb is to aim for a directory size of <= 20k files, although I've seen relatively decent performance with up to 100k files/directory.

  • Provided you have a distro that supports the dir_index capability, you can easily have 200,000 files in a single directory. I'd keep it at about 25,000 though, just to be safe. Without dir_index, try to keep it at 5,000.

  • I have all files go into folders like:

    uploads/[date]/[hour]/yo.png

    and don't have any performance problems.

    Jefromi : And how many files do you get per hour?
    From Coronatus
  • In my experience, the best approach is to not over-engineer the file structure in advance. As mentioned in at least one other answer, there are filesystem extensions that deal with the performance-issue end of things.

    The problem that I have hit more frequently is usability on the administrative end. The least amount of work you can do to decrease the number of files in a directory is probably the approach you need right now.

    sqrt(3_000_000) ≈ 1732

    A couple thousand files in a single directory sounds reasonable to me. Be your own judge for your own situation. To achieve this, try splitting the files into a single level of hash directories so that the average number of files per directory is about the same as the number of directories.

    Given your example, this would be ./a/abc.ext, ./ab/abc.ext, ./abc/abc.ext, and so on.

    The spread of files will depend heavily upon the actual filenames. Imagine applying this technique to a directory of a million files each named foobar???.txt. There are ways to accomplish a more even spread, like hashing based on the value of a particular number of bits from the MD5 sum of each filename, but I'd dare to guess that would be overkill for what you are trying to accomplish.

  • I think you're putting too much thought into this. Even if you chose just a single additional level of directories and were able to balance things evenly, you'd have 1732* directories and 1732 files per directory.

    Unless you plan on needing tens of billions of files, you could pretty much pick a number between 1000 and 100,000 and get good results.

    * square root of 3 million.
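
For anyone who wants to try the MD5-based spreading mentioned in one of the answers, here is a minimal sketch in Python. The function names (shard_path, expected_files_per_dir) and the choice of three hex characters per directory level are illustrative assumptions, not anything taken from the answers above; treat it as a starting point to benchmark against your own filenames rather than a definitive layout.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(filename: str, levels: int = 1, chars_per_level: int = 3) -> PurePosixPath:
    """Build a relative path whose directory names come from the MD5 of the filename.

    Hypothetical helper: each level uses `chars_per_level` hex characters of the
    digest, so names like foobar001.txt and foobar002.txt land in unrelated
    directories instead of piling up under a shared prefix.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [
        digest[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return PurePosixPath(*parts, filename)


def expected_files_per_dir(total_files: int, levels: int = 1, chars_per_level: int = 3) -> float:
    """Average files per leaf directory, assuming the hash spreads names evenly."""
    fanout = 16 ** chars_per_level  # one hex character has 16 possible values
    return total_files / fanout ** levels


if __name__ == "__main__":
    print(shard_path("abc.ext"))
    # Three million files, one level of 3 hex characters (4096 directories).
    print(round(expected_files_per_dir(3_000_000)))
    # Same files, one level of 2 hex characters (256 directories).
    print(round(expected_files_per_dir(3_000_000, chars_per_level=2)))
```

With three hex characters and a single level, three million files average out to roughly 730 per directory across 4096 directories; with two hex characters it is closer to 11,700 per directory across 256. Both fall inside the ranges suggested in the answers above, so one level is probably enough here.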
