Project #1: Do The Anagram Jam
June 25th, 2020
Last night I was browsing the world’s greatest and personal favorite cesspool (also known as Twitter) when I came across a rather interesting tweet having to do with anagrams:
Clever algorithm to find out whether or not 2 words are anagrams pic.twitter.com/p14sNx654v
— Fermat’s Library (@fermatslibrary) June 22, 2020
Now, even though I’m not the most, er, mathematically minded person, this still itched the part of my brain that wants to implement weird things like this to see how they work. That same part of my brain also likes to take things apart and not be able to put them back together again (although that stopped when I was about 7).
The basic algorithm is pretty simple too:
- Get the prime equivalent for each letter in a word
- Multiply those prime equivalents together for each word
- If the products are the same, the words are anagrams. If they aren’t, then they’re not
So, even though I was tired as hell and my allergies were starting to act up, I got cracking on an implementation. As a matter of course, I try to implement the algorithm as literally as possible for my initial attempt. This way I can make sure I understand the algorithm correctly and make sure it works properly. Only then do I try and cut out the excess to get a more efficient and compact script.
Thankfully, it only took about 10 minutes to code.
Version 1
This implementation uses a lookup table for the primes, and just cycles
through each word. The offset on each index
variable is to
correct for the conversion between unicode and the actual alphabet value
of each letter. I ended up figuring it out through trial-and-error, so
nothing special.
At this point I was pretty satisfied with myself, so I put my laptop away and got ready for bed. However, as I was lying there, I couldn’t help but think I could make the script better. “Is it even worth it to improve the script?” I asked myself. I realized the answer almost immediately: “Of course it is. Not only is it good practice and somewhat fun, but it’s also free blog content. So, with that sobering realization, I gave myself the task of improving the script - but it could wait until tomorrow. I was tired.
Version 2
For version 2, I wanted to see if I could move away from generating
every prime I might need and only generate the exact prime I know I
need. So, I got rid of the lookup table and just used
sym.prime()
to generate each prime right there:
I also realized I could save an tiny bit of memory by getting rid of the
index variable and just doing the char-to-unicode-to-alphabet-index
conversion calculation right in the sym.prime()
arguments.
That was really the only other improvement this version.
At this point, I had to go out and do some family activities, but like an addict who can only think about their next fix, I couldn’t stop thinking about my script and those damn anagrams. But that could wait for later.
Version 3
Version 3 tries to compress the script as much as possible:
I realized that if I zip the words together, I only needed one
for
loop instead of the two in earlier versions. However,
the biggest change comes in what I call
the length clause :
if the length of the two words don’t immediately match, there is no
way they can be anagrams. Embracing python’s “fail fast” ideology, the script immediately
returns that the two words aren’t anagrams before any expensive math
work is done.
We can still make improvements, but I think we should first look back and see if I made any bad choices.
Version 4
For this version I initially just wanted to move away from the Sympy
module and implement my own algorithm for generating primes, but the
more I looked for resources the more I realized that all the algorithms
in this category generate a list of primes between 0 and n. I
initially started using sym.prime()
becuase I thought it
just generated the n th prime, but could it be that it
generates a list and just returns the final element in that list? In
short,
how doessym.prime()
generate primes?
Taking a trip over to the Sympy github page, we can look at the raw code for the function. The short answer is (as far as I understand) yes, it does generate a pseudo-list for each inputted number, but they introduce a bunch of really cool optimizations that reduce the quantity of numbers tested and increase the efficiency of the operation:
- Adjusting the upper and lower bounds of the search using logarithms
- If the number is within the pre-generated sieve [of eratosthenes], it just returns it from that list
- Finding the midpoint of the upper and lower bounds by using a right bitwise shift of one place, insead of doing integer division ((a + b) >> 1 instead of (a + b) // 2). I had no clue you could do this, but makes a lot of sense
Side note on that bitwise shift, I found that if you shift right it returns the number inputted divided by the number of places shifted squared. Shifting left is the same but it multiplies instead of divding. So n >> 2 is equivalent to n // 4, and n << 7 is equivalent to n * 49. Again, within a base 2 system (like in computer memory) this makes complete sense, but I just never realized it before. So not exactly the most relevant thing, but a cool tidbit of information to hold onto nonetheless.
I will admit, regardless of what I found in the Sympy code, I was going to build my own prime number function. To do this I’ll also use a Sieve of Eratosthenes, as it’s one of the simplest methods of finding primes. The algorithm works like this:
- Make a list of the integers equal to or less than n and greater than 0 ([1 … n])
- Remove the multiples of all numbers less than or equal to n
- The remaining numbers must be prime
Using that, I created this sieve function:
I started the list from 2 because we know that 1 isn’t prime. We also know that prime numbers are, in essence, numbers that cannot be broken down anymore. So, the sieve basically just removes multiples while leaving the originals. That’s the best I can explain it. I was originally going to use a flag system with another array of booleans, but the overview on Wikipedia convinced me to do it like this.
All that’s left is to implement it in this version of the anagram script:
If you really wanted to nitpick, technically I did cheat by
finding the exact number to plug into the sieve in order to get a list
of exactly 26 primes, but I feel like that’s arguing semantics. If you
do want to argue about it, send me an email. I’ll happily let you know
that I don’t care and that you’re a dick :P. Other changes made include
changing the variable names in the zipped loop so that they’re not the
same as the ones in the sieve. Even though both sets of variables lie
within the local scope, I wanted to stray from confusion as much as
possible. I also adjusted the main function. I have never really been
sure whether to put my driver code in the
if __name__ == '__main__':
bit or link that to a main
function. I’m doing the latter just to be safe, I guess.
Also, we return to the lookup table! It occured to me that if the length
of word1
and word2
together is greater than
26, it is far less efficient to construct primes in place (even though
doing it that way looks much cleaner). This will come in handy later on…
Now, I was happy to call it quits there. The pseudo-perfectionist in my brain wasn’t, unfortunately. “What about phrases, huh? You forget about those? Hmmmmm?. sigh
Version 5
To make this version possible I needed to flatten each word into a list without the punctuation or spaces. At that point, the original algorithm can take over. This version only required a couple lines changed, so not too much work:
That’s the final version I’ll do. I don’t really know how to improve it from here.
All the code for this project is on github.
All the best, and stay safe.
- Jordan Streete
P.S. I’ll hopefully have an actual “project” project coming soon. Gotta figure out what to do first :)