Performance

May 15, 2012 at 3:54 PM

Hi Keith, me again.

I am using MultipleHash in anger now, and was after a gauge of the performance you expect.  Previously I was using my own C# code to generate an MD5 across approx 1000 input columns from a text file.  This was taking approx 30 min on a 4 GB file.

Using MultipleHash to create 3 MD5s on different portions of the same columns is taking 2.5 hrs (all other things being equal).  This is with multithreading set to on.

Do you have any benchmarks to compare against?

May 17, 2012 at 4:06 AM

G'day,

Can you let me know the average width of the 1000 columns (and what mix of data types), and approximate number of rows?

Also, was your C# code running within SSIS (as a transformation)?

I haven't got benchmarks of hashes at that number of columns.

FYI:

The multithreading passes each output column to a separate thread (in your case 3 threads), waits for all the output columns to be calculated, and then goes on to the next row.
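As a rough illustration of that threading model (not the component's actual C# code), here is a minimal Python sketch. The names `hash_columns` and `process_rows` are hypothetical, and the plain concatenation used here is a simplification. It submits one task per output column and waits for all of them before moving to the next row, so each row takes as long as its slowest output column.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_columns(row, indexes):
    """Hash the selected columns of one row (simplified: plain concatenation)."""
    joined = b"".join(row[i].encode("utf-8") for i in indexes)
    return hashlib.md5(joined).hexdigest()


def process_rows(rows, outputs):
    """outputs is a list of column-index lists, one list per output hash column."""
    results = []
    with ThreadPoolExecutor(max_workers=len(outputs)) as pool:
        for row in rows:
            # One task per output column for this row.
            futures = [pool.submit(hash_columns, row, idx) for idx in outputs]
            # Wait for every output column before starting the next row.
            results.append([f.result() for f in futures])
    return results
```

With output columns of very different widths, the threads handling the narrow hashes would sit idle for most of each row in a model like this.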

The output column is calculated by converting each input column into a byte array, concatenated together with all the other required input columns (as byte arrays), and then a set of flags on nullability and length of input data.  This concatenated byte array, along with the flags, is hashed via the chosen algorithm.
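A minimal Python sketch of that idea follows. The byte layout is an assumption made for illustration (one null-flag byte plus a 4-byte length per column, appended after the data) and is not the component's real format; `row_digest` is a hypothetical name.

```python
import hashlib
import struct


def row_digest(values, algorithm="md5"):
    """Hash a list of column values, including null and length flags."""
    data = bytearray()
    flags = bytearray()
    for value in values:
        is_null = value is None
        raw = b"" if is_null else str(value).encode("utf-8")
        data += raw
        # Assumed flag layout: 1 byte for nullability, 4 bytes for length.
        flags += struct.pack("<BI", 1 if is_null else 0, len(raw))
    return hashlib.new(algorithm, bytes(data + flags)).hexdigest()
```

The flags are what keep different rows from colliding by construction: without them, `["ab", "c"]` and `["a", "bc"]` would concatenate to the same bytes, and a null would look the same as an empty string.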

I haven't looked specifically at speeding up the transformation from Pipeline datatypes to byte arrays (reused someone else's code as per their license), and this is the only place that I can see where improvements may be found.

May 22, 2012 at 11:22 AM
Edited May 22, 2012 at 12:19 PM

Hi

It's actually 1102 columns in total, with one MD5 across 2 columns, one across 6 and the other across the rest.  I suspect the multithreading is not helping much due to the lack of balance.

This is being done in the pipeline from a text file so all types are character, average width 8 (min of 1 and max of 34).  There are approx 1.2 million rows.

My C# was indeed within a transformation using cryptography services, which also converted to a byte array.

Thanks for the info.  The hash we use is not of importance - would there be any benefit in using a different algorithm?

May 23, 2012 at 12:34 AM

G'day,

The hash calculation is only a very small percentage of the time spent in execution...  Changing from MD5 to another algorithm wouldn't make significant difference.

I've done some performance testing, using a file of 50 columns (all string) with 512 MB of data, on a dual CPU virtual machine.  The data is pushed back out to another text file.

V1.5 takes 153 seconds to process this with three hashes on all columns with threading turned off, and 150 seconds with threading turned on.

V1.5.1 (not released yet) takes 86 seconds to process this with three hashes on all columns with threading turned off.  I haven't tested with threading turned on, but don't expect much difference.

The difference in the versions is to do with ArrayList versus List performance in C#, and changing to only extending the byte array every 1000 bytes.
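To illustrate the second change (growing the byte buffer in fixed blocks rather than on every append), here is a small Python sketch. It is not the component's C# code: `BlockBuffer` is a hypothetical class, and only the 1000-byte step comes from the description above. It rounds the capacity up to whole blocks so that a single value longer than one block still fits.

```python
BLOCK = 1000  # growth step taken from the description above


class BlockBuffer:
    """Byte buffer that grows in whole BLOCK-sized steps."""

    def __init__(self):
        self._buf = bytearray(BLOCK)  # preallocated storage
        self._len = 0                 # number of bytes actually used

    def append(self, chunk):
        needed = self._len + len(chunk)
        if needed > len(self._buf):
            # Round up to whole blocks so a chunk longer than BLOCK still fits.
            blocks = -(-needed // BLOCK)
            self._buf.extend(bytes(blocks * BLOCK - len(self._buf)))
        # Same-length slice assignment copies the chunk into place.
        self._buf[self._len:needed] = chunk
        self._len = needed

    def to_bytes(self):
        return bytes(self._buf[:self._len])
```

Growing by exactly one block at a time, without checking the size of the incoming value, is the kind of shortcut that would fail on an input longer than the block size.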

I don't recommend using the currently checked in code (although it does perform faster), as it hasn't been tested yet, and I know there is at least one unwanted "feature" in it at the moment (if you hash a column > 1000 characters it will crash).

I'll be doing some more performance testing, with data that matches your profile above, to see what other improvements can be made.

Keith.

May 23, 2012 at 3:22 PM

Excellent, thanks Keith.  We are in UAT at the moment so I will give v1.5.1 a go if you like and report back.

May 24, 2012 at 2:19 PM

Parody,

V1.5.1 has been released.

For your particular test scenario it should take just less than half the time it was taking before.

Due to the need to handle all different types of data, and dynamic number of columns etc., I don't believe that I can speed it up much more than this.

FYI:

50000 records at 1102 columns wide, with 3 hashes (across 2 columns, 6 columns, and the rest).  Not multithreaded.  592 MB file.

MD5 takes 178s, SHA1 takes 180s, FNV1a64 takes 182s...

Keith.

May 24, 2012 at 3:36 PM

OK, will try this now.  I am just running the dll I built from your source and it's still aiming for 2.5 hrs.  But I probably did something wrong...

May 25, 2012 at 8:36 AM

Hmmm...

My testing today in full SSIS packages (not just a source, multiple hash and then a raw data file), has shown no performance difference at all.  This is on both SQL 2008 and SQL 2012.

FYI, 2012 is faster than 2008 for the same SSIS package (just upgraded), and x64 is faster than x86 (32-bit).

2008 -> 569s, 2012 -> 384s.

I'll do some more digging.

Keith.

May 25, 2012 at 9:26 AM

OK, I did a full un-install of 1.5 and re-installed your release of 1.5.1 and have gone down to 93 minutes for a run on the same file, so that's a good improvement, although it might not be good enough for us to use.  The entire package currently runs in 2.5 hrs, and the MD5 is allowing us to cut out 60% of the rows.  So if the later steps are improved by the same margin the overall time might be equivalent.

I will see if I can get 2012 installed and try that.  If we get the improvement you suggest, this would be a very strong case for us to migrate this package.

May 28, 2012 at 12:06 PM

79 minutes with 2012...