Project #1: Do The Anagram Jam

June 25th, 2020

Last night I was browsing the world's greatest and personal favorite cesspool (also known as Twitter) when I came across a rather interesting tweet having to do with anagrams:

Clever algorithm to find out whether or not 2 words are anagrams pic.twitter.com/p14sNx654v

— Fermat's Library (@fermatslibrary) June 22, 2020

Now, even though I'm not the most, er, mathematically minded person, this still itched the part of my brain that wants to implement weird things like this to see how they work. That same part of my brain also likes to take things apart and not be able to put them back together again (although that stopped when I was about 7).

The basic algorithm is pretty simple too:

  1. Get the prime equivalent for each letter in a word
  2. Multiply those prime equivalents together for each word
  3. If the products are the same, the words are anagrams. If they aren't, then they're not

So, even though I was tired as hell and my allergies were starting to act up, I got cracking on an implementation. As a matter of course, I try to implement the algorithm as literally as possible for my initial attempt. This way I can make sure I understand the algorithm correctly and make sure it works properly. Only then do I try and cut out the excess to get a more efficient and compact script.

Thankfully, it only took about 10 minutes to code.

Version 1

This implementation uses a lookup table for the primes, and just cycles through each word. The offset on each index variable is to correct for the conversion between unicode and the actual alphabet value of each letter. I ended up figuring it out through trial-and-error, so nothing special.

At this point I was pretty satisfied with myself, so I put my laptop away and got ready for bed. However, as I was lying there, I couldn't help but think I could make the script better. "Is it even worth it to improve the script?" I asked myself. I realized the answer almost immediately: "Of course it is. Not only is it good practice and somewhat fun, but it's also free blog content. So, with that sobering realization, I gave myself the task of improving the script - but it could wait until tomorrow. I was tired.

Version 2

For version 2, I wanted to see if I could move away from generating every prime I might need and only generate the exact prime I know I need. So, I got rid of the lookup table and just used sym.prime() to generate each prime right there:

I also realized I could save an tiny bit of memory by getting rid of the index variable and just doing the char-to-unicode-to-alphabet-index conversion calculation right in the sym.prime() arguments. That was really the only other improvement this version.

At this point, I had to go out and do some family activities, but like an addict who can only think about their next fix, I couldn't stop thinking about my script and those damn anagrams. But that could wait for later.

Version 3

Version 3 tries to compress the script as much as possible:

I realized that if I zip the words together, I only needed one for loop instead of the two in earlier versions. However, the biggest change comes in what I call the length clause: if the length of the two words don't immediately match, there is no way they can be anagrams. Embracing python's "fail fast" ideology, the script immediately returns that the two words aren't anagrams before any expensive math work is done.

We can still make improvements, but I think we should first look back and see if I made any bad choices.

Version 4

For this version I initially just wanted to move away from the Sympy module and implement my own algorithm for generating primes, but the more I looked for resources the more I realized that all the algorithms in this category generate a list of primes between 0 and n. I initially started using sym.prime() becuase I thought it just generated the nth prime, but could it be that it generates a list and just returns the final element in that list? In short, how does sym.prime() generate primes?

Taking a trip over to the Sympy github page, we can look at the raw code for the function. The short answer is (as far as I understand) yes, it does generate a pseudo-list for each inputted number, but they introduce a bunch of really cool optimizations that reduce the quantity of numbers tested and increase the efficiency of the operation:

Side note on that bitwise shift, I found that if you shift right it returns the number inputted divided by the number of places shifted squared. Shifting left is the same but it multiplies instead of divding. So n >> 2 is equivalent to n // 4, and n << 7 is equivalent to n * 49. Again, within a base 2 system (like in computer memory) this makes complete sense, but I just never realized it before. So not exactly the most relevant thing, but a cool tidbit of information to hold onto nonetheless.

I will admit, regardless of what I found in the Sympy code, I was going to build my own prime number function. To do this I'll also use a Sieve of Eratosthenes, as it's one of the simplest methods of finding primes. The algorithm works like this:

  1. Make a list of the integers equal to or less than n and greater than 0 ([1 ... n])
  2. Remove the multiples of all numbers less than or equal to n
  3. The remaining numbers must be prime

Using that, I created this sieve function:

I started the list from 2 because we know that 1 isn't prime. We also know that prime numbers are, in essence, numbers that cannot be broken down anymore. So, the sieve basically just removes multiples while leaving the originals. That's the best I can explain it. I was originally going to use a flag system with another array of booleans, but the overview on Wikipedia convinced me to do it like this.

All that's left is to implement it in this version of the anagram script:

If you really wanted to nitpick, technically I did cheat by finding the exact number to plug into the sieve in order to get a list of exactly 26 primes, but I feel like that's arguing semantics. If you do want to argue about it, send me an email. I'll happily let you know that I don't care and that you're a dick :P. Other changes made include changing the variable names in the zipped loop so that they're not the same as the ones in the sieve. Even though both sets of variables lie within the local scope, I wanted to stray from confusion as much as possible. I also adjusted the main function. I have never really been sure whether to put my driver code in the if __name__ == '__main__': bit or link that to a main function. I'm doing the latter just to be safe, I guess.

Also, we return to the lookup table! It occured to me that if the length of word1 and word2 together is greater than 26, it is far less efficient to construct primes in place (even though doing it that way looks much cleaner). This will come in handy later on...

Now, I was happy to call it quits there. The pseudo-perfectionist in my brain wasn't, unfortunately. "What about phrases, huh? You forget about those? Hmmmmm?. *sigh*

Version 5

To make this version possible I needed to flatten each word into a list without the punctuation or spaces. At that point, the original algorithm can take over. This version only required a couple lines changed, so not *too* much work:

That's the final version I'll do. I don't really know how to improve it from here.

All the code for this project is on github.

All the best, and stay safe.
- Jordan Streete

P.S. I'll hopefully have an actual "project" project coming soon. Gotta figure out what to do first :)