Aug 17, 2025

Finding Primary Keys


Late 2021 - Early 2022.

While at work, I had a bit of a problem: I was often given new datasets where either there was no Primary Key or it wasn't made readily apparent. So we had to do a bit of manual work, typically in something like Excel, to make a best-guess search for which column(s) could work as a primary key, so that each row in the dataset could be uniquely identified.

To improve this process, we built a Python tool at work to search through a dataset and attempt to figure out what the primary key(s) were. That tool is what I used day-to-day, and we exposed it via another internal tool that could send requests to its API. It was hosted in a Docker container and given a URL.

I wanted to know if it could be faster.


Starting Out


I really only had three self-imposed requirements at this stage:

  1. Make it faster
  2. Learn something new
  3. Don't let it interfere with my daily work

So I started playing around with the concept on weekends.

If I wanted to make it faster, I knew from the start that Python wouldn't be the tool to use. It's more of a "glue" language, and I already knew from previous experience that the libraries available to me in Python could help a bit, but they couldn't make use of the tricks I wanted to use to speed up the searching process.

This left me with compiled languages -- something that can work on the data quickly and with little overhead. Something where I can be a lot more specific in not just what gets done but how it does it. The three I considered were C++, Rust, and Go.

C++ seemed like a good choice, but from what I read while searching around, it's not as easy as it should be to get the environment set up and you're likely to run into issues with pointers.

Go also seemed like it could work, but it sounded like there'd be some slowdowns with the garbage collection. I want speed!

Rust seemed like a good split between the two: the speed of C++ without the pointer-related issues, and the ease of use of Go without garbage collection and other things that could slow it down. Plus (and this was a big one!) I'd heard it had great error messages, and that even though the learning curve started off steep, it would force you to learn good patterns and practices that would carry over if I ever needed to swap to another compiled language.

Rust it is!


Keys


Starting out with some definitions, a Primary Key column for a dataset is a column that can be used to uniquely identify a row. If a specific key exists in a dataset and you search the dataset for it, you will only ever receive one row. This is useful for joining multiple tables together and for ensuring you don't get duplicate data.

A Composite Key is a set of columns that, when used together, work as a Primary Key. Each column individually won't have enough info to sufficiently differentiate rows, but the combination will always lead to a unique row.

Your key or keys typically get added to a Primary Key constraint, which enforces this uniqueness.

I may be a little loose with my usage of the term "primary key". Just know that I'm referring to any column that either is or is part of the primary key for the dataset as "a primary key".

Having this info is crucial when adding new data to a dataset. It'll let you know when the incoming row needs to be appended to the dataset vs. when it needs to update an existing row with new information.
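
As a minimal Rust sketch of that append-vs-update decision (assuming, purely for illustration, a table keyed by a single string primary key and held in memory as a HashMap):

```rust
use std::collections::HashMap;

/// Upsert sketch: given the primary key of an incoming row, either append
/// the row as new or update the existing row in place.
/// Returns true if the row was appended, false if it updated an existing row.
fn upsert(table: &mut HashMap<String, Vec<String>>, key: String, row: Vec<String>) -> bool {
    // HashMap::insert returns the previous value, if any, so "no previous
    // value" means the incoming row was appended rather than updated.
    table.insert(key, row).is_none()
}
```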


Process overview


The program needs an interface so the user can run it against a file that's representative of the dataset. For the Rust version, I used the Clap library to provide a command-line interface; the Python version had its API, and the SQL approach was just a query against a temp table created from the dataset.

Once the file is loaded, we need to:

  1. Determine which column(s) we're testing. We start with individual columns, then move on to testing combinations of two columns, then three, and so on, once all smaller combinations have been exhausted. I'll call the current selection the columnset.
  2. Read the contents of that columnset for each row in the dataset.
  3. Hash the data and check whether the hash exists in a store of all the hashes we have so far. If it doesn't exist, add it and move on to the next row. If it does exist, this columnset won't work as a Primary Key, so we short-circuit here and go back to step 1 to attempt the next columnset.
  4. If we haven't short-circuited and have no more rows to test, congrats! We've found a columnset that will work as a Primary Key for the example dataset!
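
The steps above can be sketched in Rust. This is a minimal, single-threaded version that assumes the dataset is already loaded into memory as rows of strings; the function name and types are illustrative, not the actual tool's:

```rust
use std::collections::HashSet;

/// Returns true if the given columns (by index) uniquely identify every row.
/// Short-circuits as soon as a duplicate combination is seen.
fn is_unique_key(rows: &[Vec<String>], columnset: &[usize]) -> bool {
    let mut seen = HashSet::new();
    for row in rows {
        // Pull out just the columnset's values for this row.
        let key: Vec<&str> = columnset.iter().map(|&i| row[i].as_str()).collect();
        // HashSet::insert hashes the key and returns false if it was
        // already present -- a duplicate, so this columnset can't be a key.
        if !seen.insert(key) {
            return false;
        }
    }
    true
}
```

A real implementation would likely store only the hashes rather than the borrowed values, but the short-circuit behavior is the same.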

The SQL approach had the problem of not being able to short-circuit as soon as a duplicate was found. With this process, we can speed up the search by quite a bit!


Comparisons


When I first started benchmarking on a specific dataset, I attempted it in SQL on a pretty high end server. I don't remember the exact time, but it was... slow. It couldn't short-circuit either!

Then I moved on to Python and found it could run 10x faster on a much less powerful computer, and I wasn't even doing anything special. I just had the ability to short-circuit the process when duplicates were found.

Rust? 20x faster than the Python version. And once multithreading was added, easily 200x faster.


Extra Speed!


While trying to figure out additional ways of speeding up the process beyond just short-circuiting, I had a great idea.

Datasets whose primary key spans many columns were hard to figure out, especially as that number grew. Combinations of 1 are easy because it's just one test per column. Combinations of 2 aren't too bad, but the count starts growing fast!

Let's say we have 50 columns to look through (yes, this is reasonable for some tables).

Combinations of 1 mean we have 50 iterations to try.
Combinations of 2 mean we have 1,225 iterations to try. Computers are really good at processing tons of data nowadays, so this doesn't make much of a difference.
Combinations of 3 mean we have 19,600 iterations to try. We're starting to see the processing take a bit of time but it's still reasonable.
Combinations of 4 mean we have 230,300 iterations to try. Now it's getting concerning.
Combinations of 5 mean we have 2,118,760 iterations to try. This is pretty ridiculous.
Combinations of 6 mean we have 15,890,700 iterations to try. This could take hours, maybe even days.
...

This goes on, growing immensely until we hit the halfway point (25), after which it starts decreasing again.
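
For reference, the counts above are just the binomial coefficient ("n choose k"). A small sketch to compute it (the function name is mine):

```rust
/// Number of k-column combinations from n columns: n choose k.
fn combinations(n: u64, k: u64) -> u64 {
    if k > n {
        return 0;
    }
    // C(n, k) == C(n, n - k); use the smaller side to keep the loop short.
    let k = k.min(n - k);
    let mut result: u64 = 1;
    for i in 0..k {
        // Each step goes from C(n, i) to C(n, i + 1); the division is exact.
        result = result * (n - i) / (i + 1);
    }
    result
}
```

For example, combinations(50, 6) gives the 15,890,700 figure shown above.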

I've encountered a dataset with 70-ish columns whose Composite Key consisted of 13 columns. Yikes.

What can we do? We're already short-circuiting once a duplicate hash is found while processing rows. We've already introduced multithreading where each thread (limited to one or two per CPU core) is processing its own combination. We've already got as much of the data loaded into RAM as possible.

What if, instead of starting from the bottom going up, we start from the top going down?

We can check for true duplicates by testing with all columns selected. If we get any duplicates, frankly we should consider talking to someone about the dataset because something odd is going on. But we'll move forward anyway because we can still get some insightful information.

Let's say we have 20 true duplicates. That sets our baseline for our next step.

Instead of checking all columns together, we'll check all columns... minus one. If the number of duplicates doesn't change from the baseline above, then the column we removed wasn't providing any "uniqueness" to the dataset, so it can't be a necessary part of the key. But if the duplicate count goes up, then the difference between that column being present and absent is additional duplicates. Therefore, it must be part of the key.

We do this one-by-one for all of the columns, each time testing all columns minus the one. This gives us a list of known keys to seed the bottom-up search, which reduces the columns we still need to check and, therefore, the number of combinations. A massive win for certain datasets!
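
That top-down pass can be sketched like this; as before, the in-memory row representation and function names are illustrative:

```rust
use std::collections::HashMap;

/// Count duplicate rows under the given columnset: every row beyond the
/// first occurrence of each distinct value combination counts as one.
fn count_duplicates(rows: &[Vec<String>], columnset: &[usize]) -> usize {
    let mut counts: HashMap<Vec<&str>, usize> = HashMap::new();
    for row in rows {
        let key: Vec<&str> = columnset.iter().map(|&i| row[i].as_str()).collect();
        *counts.entry(key).or_insert(0) += 1;
    }
    counts.values().map(|&c| c - 1).sum()
}

/// Top-down pass: a column whose removal increases the duplicate count
/// above the all-columns baseline must be part of the key.
fn known_keys(rows: &[Vec<String>], n_cols: usize) -> Vec<usize> {
    let all: Vec<usize> = (0..n_cols).collect();
    let baseline = count_duplicates(rows, &all); // "true duplicates"
    (0..n_cols)
        .filter(|&c| {
            // Test all columns minus this one.
            let without: Vec<usize> = all.iter().copied().filter(|&i| i != c).collect();
            count_duplicates(rows, &without) > baseline
        })
        .collect()
}
```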


Additional Functionality


Now that we have a function for testing a combination of keys, let's make it available to the user directly.

If the user already has a good idea of what columns are part of the composite key, they can provide them as "known keys" and have it search combinations that include those keys. So we'll add a flag for that.

The user may also want to test a specific combination, so we'll make that single test available as a command as well. We can even expand on it by having it return not just whether or not the combination works as a composite key, but also how many duplicates are found and even the duplicate rows themselves! With this information, they'll be able to determine if a column maybe shouldn't be a key but should instead have values prioritized over nulls when upserting.
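
A sketch of what that single-combination test might look like, returning both whether the columnset works and which row indices repeat an earlier combination (the names and in-memory representation are, again, illustrative):

```rust
use std::collections::HashMap;

/// Test one specific columnset: reports whether it works as a key and
/// which row indices duplicate an earlier row's combination.
fn test_combination(rows: &[Vec<String>], columnset: &[usize]) -> (bool, Vec<usize>) {
    let mut first_seen: HashMap<Vec<&str>, usize> = HashMap::new();
    let mut dup_rows = Vec::new();
    for (idx, row) in rows.iter().enumerate() {
        let key: Vec<&str> = columnset.iter().map(|&i| row[i].as_str()).collect();
        // insert returns the previously stored index if this combination
        // was already seen, marking the current row as a duplicate.
        if first_seen.insert(key, idx).is_some() {
            dup_rows.push(idx);
        }
    }
    (dup_rows.is_empty(), dup_rows)
}
```

With the duplicate indices in hand, fetching the duplicate rows themselves for display is straightforward.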

Next steps


Maybe we can take the top-down approach even further and start testing combinations with two columns removed? At that point, though, I have a hard time visualizing the consequences of the approach given all the factors involved.

It's been a while since I've worked on this, but I learned a ton along the way, and I think working in Rust has really helped my understanding of programming and of better ways to approach writing code. I know I didn't really discuss Rust much here, but I'm sure I'll go into more detail at some point.

My question for anyone reading this is -- can you think of anything that can be done to improve the process further? I'd love to hear your thoughts!