Wednesday, April 6, 2011

Is there a fast and scalable solution to save data?

I'm developing a service that needs to be scalable on the Windows platform.

Initially it will receive approximately 50 connections per second (each connection sending roughly 5 KB of data), but it needs to scale to more than 500 per second in the future.

I suspect it's impractical to save the received data directly to a conventional database like Microsoft SQL Server.

Is there another way to save the data, considering that the service will receive more than 6 million "records" per day?
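For scale, the averages implied by those numbers are easy to check. A quick back-of-the-envelope sketch (in Java rather than the poster's C#; the 5 KB-per-record and rate figures are taken from the question):

```java
// Back-of-the-envelope check of the rates in the question:
// ~5 KB per record, 6 million records/day, 50-500 connections/sec.
public class Rates {
    static final long SECONDS_PER_DAY = 86_400L;

    // Average per-second rate implied by a daily total.
    static double avgRecordsPerSecond(long recordsPerDay) {
        return (double) recordsPerDay / SECONDS_PER_DAY;
    }

    // Sustained byte throughput at a given record rate.
    static long bytesPerSecond(long recordsPerSecond, long bytesPerRecord) {
        return recordsPerSecond * bytesPerRecord;
    }

    public static void main(String[] args) {
        System.out.printf("6M/day averages %.1f records/sec%n",
                avgRecordsPerSecond(6_000_000L));
        System.out.printf("500 rec/sec at 5 KB = %d bytes/sec%n",
                bytesPerSecond(500, 5_120));
    }
}
```

So 6 million/day averages only about 69 records per second; sustained 500/sec at 5 KB is roughly 2.5 MB/s, which is the peak the design must absorb, not the average.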

There are 5 steps:

  1. Receive the data via an HTTP handler (C#);
  2. Save the received data; <- HERE
  3. Request the saved data to be processed;
  4. Process the requested data;
  5. Save the processed data. <- HERE

My pre-solution is:

  1. Receive the data via an HTTP handler (C#);
  2. Save the received data to a message queue;
  3. Request the saved data from the message queue, to be processed by a Windows service;
  4. Process the requested data;
  5. Save the processed data to Microsoft SQL Server (here's the bottleneck);
From Stack Overflow:
  • 6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?

    I wouldn't personally use message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers will just dump records to be written into the queue.

    Have a separate batch task which will insert a bunch of records into the database at a time.

    Benchmark to find the optimal (or at least a "good") number of records to batch-upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.

    I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
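The producer/consumer shape described in that answer can be sketched briefly. This is in Java rather than the poster's C#, and the batch-flush callback is an assumed stand-in for "append to disk" or "bulk-insert into SQL Server"; the batch size of 100 below is arbitrary and is exactly the number the answer suggests benchmarking:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Producers (e.g. HTTP handlers) dump records into an in-memory queue;
// a single consumer thread drains it and hands the store one batch at a
// time, so request threads never touch disk or the database directly.
public class SpoolingConsumer {
    private static final String STOP = "\u0000"; // sentinel ending the consumer
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    private final Thread consumer;

    public SpoolingConsumer(int batchSize, Consumer<List<String>> flushBatch) {
        consumer = new Thread(() -> {
            List<String> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    String rec = queue.take();       // blocks until data arrives
                    if (rec.equals(STOP)) break;
                    batch.add(rec);
                    if (batch.size() >= batchSize) { // benchmark this threshold
                        flushBatch.accept(new ArrayList<>(batch));
                        batch.clear();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (!batch.isEmpty()) flushBatch.accept(batch); // final partial batch
        });
        consumer.start();
    }

    // Called by producers; blocks only when the queue is full (backpressure).
    public void submit(String record) {
        try {
            queue.put(record);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void shutdown() {
        try {
            queue.put(STOP);
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The bounded queue gives natural backpressure: if the writer falls behind, producers block on `submit` instead of exhausting memory.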

  • I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.

    If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.

  • Why not do this:

    1.) Receive data
    2.) Process data
    3.) Save the original and processed data at once

    That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure your inserts are as cheap as possible. If that isn't possible, then queuing up the work makes some sense. I wouldn't use a message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, as long as you're not writing a ton of data in each record.
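That "process first, save once" flow can be sketched as follows (again in Java rather than C#; `process()` is a hypothetical placeholder for the real processing step, and the in-memory list stands in for a single insert that stores both columns):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "receive, process inline, save once": the handler processes
// the payload immediately and stores raw + processed together, so the
// original five-step flow's "request the saved data" step disappears.
public class InlinePipeline {
    // Each saved row holds [0] = raw payload, [1] = processed payload.
    private final List<String[]> store = new ArrayList<>();

    static String process(String raw) {
        return raw.toUpperCase(); // placeholder for the real processing
    }

    public void handle(String raw) {
        store.add(new String[] { raw, process(raw) }); // one save, no re-read
    }

    public List<String[]> saved() {
        return store;
    }
}
```

The trade-off: processing now happens on the request path, so this only works if `process()` is cheap enough to keep up with the connection rate.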
