I have 22 datasets in Excel worksheets, each with over 100,000 rows by 125 columns.  I have written a VBA script that goes through each row and compares each value in that row to the other values in the same row to find numbers that are no more than 0.0002 apart.

Just so you know, there are many qualifications required before one value can be compared to another (for example, if value1 is in the first five columns, then it cannot be compared to any values in columns 30-35), so it is not just a simple statistical computation.

The script has three kinds of output: the median of two values when it finds a pair no more than 0.0002 apart, the median of three such numbers, and the median of four.  In some rows it finds nothing; in others it finds multiple 2-matches, multiple 3-matches, etc.  It places those outputs starting at defined columns 125, 145, and 155 (I do this because I have to reference these outputs in another computation: with defined locations I know whether a value represents a 2-, 3-, or 4-match by which column it sits in).  If it finds more than one 2-match, for example, it puts the outputs in columns 125 and 126.
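
To make the logic concrete, here is a simplified sketch of the per-row processing, written in Python purely as illustrative pseudocode: it leaves out the column-qualification rules, it treats a 3- or 4-match as a group of values that all fall within 0.0002 of each other, and the names are made up for the example rather than taken from my actual VBA.

from itertools import combinations
from statistics import median

TOLERANCE = 0.0002

def process_row(values):
    # Simplified per-row logic: find groups of 2, 3 and 4 values that all
    # fall within TOLERANCE of each other and collect their medians.
    # The real script also applies the column-based qualification rules;
    # this brute-force version only shows the logic, not the speed.
    pair_medians, triple_medians, quad_medians = [], [], []
    for size, output in ((2, pair_medians), (3, triple_medians), (4, quad_medians)):
        for combo in combinations(values, size):
            if max(combo) - min(combo) <= TOLERANCE:
                output.append(median(combo))
    return pair_medians, triple_medians, quad_medians

# Pair medians are written starting at column 125 (125, 126, ...),
# triple medians starting at column 145, and quad medians at column 155.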

When I run this script, it processed 189 rows in 6.5 hours.  At that rate it will take about 2.9 million hours to complete one dataset, and I have 22 datasets to get through.  It is not a hardware issue either: the same computation rate occurred with 32 cores/128 GB of RAM as with 4 cores/16 GB of RAM (because VBA runs serially on a single thread).

What is a faster way to do this and achieve the defined outputs?

I have intermediate coding experience in various languages (in the past I have been able to figure out how to do what I want by referring to documentation or asking on forums).  I say this because I need a solution I could learn to operate fairly easily, not necessarily the most advanced option available.

I googled and saw some add-ins for Excel (Python and Java).  I don't know whether that would be faster in this situation.  Or should I import the datasets into a database, and if so, which database software, and how would I do the calculation?  (I know nothing about databases, SQL, or writing queries.)

1 Answer

In this case you should definitely switch to a SQL RDBMS and write a stored procedure with temporary tables and some functions. There are two types of functions: scalar and aggregate.

SQL is not that big or hard to understand.

Free RDBMS options are Microsoft SQL Server Developer Edition, MySQL, and PostgreSQL. MS SQL is more user-friendly, in my humble opinion.
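
Since you mentioned the Python add-ins: getting the worksheets into the database does not have to be done by hand. Here is a rough sketch of one way to load a worksheet into a table from Python, assuming pandas and SQLAlchemy are installed (plus an Excel reader such as openpyxl and a PostgreSQL driver) and a PostgreSQL database already exists; the file name, table name and connection string are placeholders you would replace.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for a local PostgreSQL database.
engine = create_engine("postgresql://user:password@localhost:5432/matches")

# Load one of the 22 workbooks and push it into a table.
df = pd.read_excel("dataset01.xlsx", sheet_name=0)
df.to_sql("dataset01", engine, if_exists="replace", index=False)

Once the data is in a table, the matching and the median outputs can be written as queries or a stored procedure and run inside the database, which should be far faster than looping through cells in VBA.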