Monday, March 7, 2011

Approach for large data/reporting project

What is a good approach for the following:

On a weekly basis, approximately 250,000 records representing transactions will be appended to a large table in an SQL 2005 database. From there the data moves through several stages:

1. It is required to augment this data and append it to another table. For example, based on the client ID on the transaction, various data will be computed based on certain business rules; the data is dependent on the client ID and the transaction ID. Think of this table as preprocessed input for an analytical engine. (A rough sketch of this step follows below.)

2. This input data will then be put through the analytical engine (a vendor solution), which will produce a third database table.

3. That table will then require further processing (e.g., aggregation by client ID and some in-house analytics) to be appended to a table containing results in a form our team can use for report production.

It is likely that in the future the same data will feed other applications, such as a web-based viewer of the data.
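A rough T-SQL sketch of what the augmentation step might look like (every object name here, WeeklyTransactions, ClientRules, EngineInput, and their columns, is a placeholder rather than part of our actual schema):

    -- Append this week's transactions, augmented with client-rule-derived values,
    -- to the table that feeds the analytical engine.
    DECLARE @CurrentBatchID int;
    SET @CurrentBatchID = 42;   -- identifies this week's load of ~250,000 rows

    INSERT INTO dbo.EngineInput (TransactionID, ClientID, Amount, AdjustedAmount, RiskBand)
    SELECT  t.TransactionID,
            t.ClientID,
            t.Amount,
            t.Amount * r.AdjustmentFactor,   -- example business rule
            r.RiskBand                       -- example client-derived attribute
    FROM    dbo.WeeklyTransactions AS t
    JOIN    dbo.ClientRules        AS r
              ON r.ClientID = t.ClientID
    WHERE   t.LoadBatchID = @CurrentBatchID;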

Our skill base is C, C++, C#, and .NET, plus a passing familiarity with ADO.NET and LINQ. If this is not sufficient for such a project, please let me know. While our budget for new talent is non-existent right now, we would probably work to enhance our in-house skill base or borrow from another team to meet the project requirements.

From stackoverflow
  • Based on your description this sounds like it should be totally driven through the database, e.g., with T-SQL and SSIS. Loading to tables and pre- and post-processing (aggregations, subsequent loads, etc.) is where SSIS will shine (a rough sketch of one such post-processing step follows below). Let me know if I missed the intent.

    Dillie-O: I was thinking the same thing. The one thing to note is that the scripting language within SSIS is VB.NET. Most of the modules/tools/whatever in SSIS are quite robust, so you shouldn't have to do anything too intense with the scripting.
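    As a rough illustration of the post-processing T-SQL can handle, here is a sketch of the aggregation-by-client step; the object names (EngineOutput, ClientReportResults, Score, ReportWeek) are invented for the example:

        -- Aggregate the analytical engine's output by client and append it to the
        -- table the reporting team works from.
        DECLARE @ReportWeek datetime;
        SET @ReportWeek = '20110307';   -- week being processed

        INSERT INTO dbo.ClientReportResults (ClientID, ReportWeek, TxnCount, TotalScore, AvgScore)
        SELECT  e.ClientID,
                @ReportWeek,
                COUNT(*),
                SUM(e.Score),
                AVG(e.Score)
        FROM    dbo.EngineOutput AS e
        WHERE   e.ReportWeek = @ReportWeek
        GROUP BY e.ClientID;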
  • It sounds like an ETL project. You want to use SQL Server Integration Services (SSIS), the replacement for DTS in SQL 2000.

  • I'm reading about dimensional modeling, star schemas, and data warehousing these days, so forgive me for seeing a nail after studying hammers. I'd ask if you have a good data modeler on hand. I really like Ralph Kimball's ideas about dimensional modeling. I'll bet that vendor reporting solutions would plug right in. It seems to me that the difference in mission between transactional and reporting schemas calls for different approaches.

    1. Database operations are S... L... O... W...

    2. You put stuff in a database to query it.

    Therefore, your intermediate results are not good candidates for database storage.

    "250,000 records representing transactions .... It is required to augment this data ... based on the client ID on the transaction, various data will be computed based on certain business rules; the data is dependent on the client ID and the transaction ID."

    The first few operations do not lead to data you're going to query. You're simply staging it in the database. That's (1) slow and (2) not very helpful. Just leave it in files to do all of this augmentation processing.

    Put the files in directories that can be backed up as simply as you do your database backups and you'll be fine.

    Since your design seems pretty well fixed, that's about all you can do. However, if your database design is flexible, look at Kimball's Data Warehouse Toolkit; you'll see that when you have fields dependent on Client ID, that really means you have a client dimension.

    Define a table of clients. Use a simple join between fact and dimension to locate the client data.
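    For example (a sketch only; DimClient, FactTransaction, and their columns are made-up names, with ClientKey as the surrogate key in the Kimball sense):

        CREATE TABLE dbo.DimClient (
            ClientKey  int IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
            ClientID   varchar(20)  NOT NULL,           -- natural (business) key
            ClientName varchar(100) NOT NULL,
            Region     varchar(50)  NOT NULL
        );

        -- The "data augmentation" then becomes a simple join at query time:
        SELECT  f.TransactionID,
                f.Amount,
                c.ClientName,
                c.Region
        FROM    dbo.FactTransaction AS f
        JOIN    dbo.DimClient       AS c
                  ON c.ClientKey = f.ClientKey;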

    This dimensional lookup technique works for all of these "data augmentation" operations. Consequently, there's little actual processing to be done in your "data augmentation".

    Mostly, you figure out how to match the incoming transaction's keys against your defined dimensions. Sometimes the dimension values have changed; sometimes you need to add new dimension values. Consequently, your "augmentation" devolves to checking the dimensions and appending some keys to each incoming record, then loading the fact tables with the appropriate measures and foreign keys to the dimensions.
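    A sketch of that weekly load in SQL Server 2005 T-SQL (no MERGE until 2008), with StagingTransactions and the other object names again being placeholders:

        -- 1. Add any clients in this week's data that are not yet in the dimension.
        INSERT INTO dbo.DimClient (ClientID, ClientName, Region)
        SELECT DISTINCT s.ClientID, s.ClientName, s.Region
        FROM   dbo.StagingTransactions AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimClient AS c WHERE c.ClientID = s.ClientID);

        -- 2. Load the fact table, swapping the natural client key for the surrogate key.
        INSERT INTO dbo.FactTransaction (TransactionID, ClientKey, Amount)
        SELECT  s.TransactionID,
                c.ClientKey,
                s.Amount
        FROM    dbo.StagingTransactions AS s
        JOIN    dbo.DimClient           AS c
                  ON c.ClientID = s.ClientID;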
