Shakespeare Cipher
In an earlier post I analyzed a PHP web shell using a version of transposition obfuscation. Recently, a friend of mine taking a cybersecurity class reached out for assistance with a cipher problem; I suspected this too was transposition.
Overview:
But first things first. Here is the cipher text: What can we infer right away? Should we assume the message is made up entirely of 5-letter words (i.e. “Today comes every other…etc”), or that it has simply been broken into easily transmitted block sizes, in this case blocks of 5? I guessed the latter was more plausible. In fact, many encoding schemes strip out spaces and punctuation and break the message into blocks. So this message could in fact be something like “Todayisthefirstdayoftherest…etc”; we just do not know yet. What else can we infer? Take a look at the second-to-last row: what kind of English words (we are assuming this message is in English) have a pattern like yhzhh or gggcx? I cannot think of any legitimate words that fit those patterns, so I already know a more complex cipher is being used.
First Assumption (First Act):
Let's first assume this is some variant of a substitution cipher, and therefore use the most common technique, frequency analysis, as a starting point for decoding the text. In English, the letters that appear most frequently are, roughly in order, e t o a i n s h r, and their approximate shares of normal text are e=12.7% t=9.1% o=7.5% a=8.2% i=7.0% n=6.7% s=6.3% h=6.1% r=6.0% (a is actually slightly more common than o, but the two are close enough that the order matters little here). So I counted the number of times each letter appears in the cipher text; dividing each count by the total gives the relative frequency of each letter. For the cipher text above, here are the raw counts:
f:26; x:16; g:15; z:15; q:12; h:12; s:11; a:11; d:10; m:9; c:9; n:7; b:7; etc.
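The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `letter_counts` is my own, and the sample input is just the three cipher blocks quoted in this post (gasaz, yhzhh, gggcx), not the full cipher text.

```python
from collections import Counter


def letter_counts(text: str) -> Counter:
    """Count alphabetic characters, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


# Sample input: only the three cipher blocks quoted in this post,
# NOT the full cipher text.
sample = "gasaz yhzhh gggcx"
counts = letter_counts(sample)
total = sum(counts.values())

# Print each letter with its raw count and its share of the total.
for letter, n in counts.most_common():
    print(f"{letter}:{n} ({n / total:.1%})")
```

Running the same function over the whole cipher text would produce a count list like the one above.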
So what might the cipher letters “f”, “x”, “g”, etc. represent in English? Based on the most common letters listed above, “f” might really be an “e”, “x” might really be a “t”, “g” might be an “o”, and so on. It’s not an exact science, but I tried this out by mapping my frequency analysis to the most frequent letters:
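A rank-for-rank first guess of this kind can be sketched as follows. The counts are the nine highest from the list above and the English ordering is the one used in this post; the variable names are my own, and ties (such as g and z) are simply kept in the order listed.

```python
# Nine most frequent cipher letters, with the counts listed above.
cipher_counts = {
    "f": 26, "x": 16, "g": 15, "z": 15, "q": 12,
    "h": 12, "s": 11, "a": 11, "d": 10,
}

# English letters in the frequency order used in this post.
english_order = "etoainshr"

# Sort cipher letters from most to least frequent. Python's sort is stable,
# so tied letters keep the order in which they were listed.
ranked = sorted(cipher_counts, key=lambda letter: -cipher_counts[letter])

# Pair the n-th most frequent cipher letter with the n-th English letter.
first_guess = dict(zip(ranked, english_order))

for cipher_letter, plain_letter in first_guess.items():
    print(f"{cipher_letter} -> {plain_letter}")
```

This is only a starting point; with counts this close together, neighbouring ranks can easily be swapped.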
Does this look correct to you? It could be, but I doubt a cipher would map an “s” exactly to an “s”. Furthermore, “z” maps to an “a”, which is only one character away (wrapping around the alphabet), but our other mappings do not follow that pattern, so this might be incorrect too (it’s not impossible, but I wanted to stick to patterns and easy solutions first). Note that “f” and “e” are also one character apart, but in the opposite direction: “e” precedes “f”, whereas “a” follows “z”. So there is no consistent pattern yet, and for now I am assuming the “f” is correctly mapped to “e”. The frequency mapping, then, is not perfect. I decided to rearrange it a little by switching the “s” and “a” mappings for now, since they share the same frequency.
Second Act:
Pairs and triplets of letters also follow patterns in English that can help solve this encoding. Common pairs include TH, EA, ES, EN, OF, TO, IN, IT, IS, BE, AS, AN, AE, AT, SO, WE, HE, BY, OR, ON, DO, IF, ME, MY, UP, NT. Common pairs of repeated letters are SS, EE, TT, FF, LL, MM and OO. Common triplets are THE, EST, FOR, AND, HIS, ENT and THA. In our example, here are the most frequent pairs of letters:
df:5; sd:4; fz:4; af:4
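Counting pairs works the same way as counting single letters. The sketch below is an assumption-laden illustration: `pair_counts` is a hypothetical helper, it treats the 5-character block breaks as arbitrary (so pairs that span a space are counted too), and the sample is again only the three cipher blocks quoted in this post.

```python
from collections import Counter


def pair_counts(text: str) -> Counter:
    """Count adjacent letter pairs, ignoring spaces and punctuation."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    # zip(letters, letters[1:]) yields every overlapping pair of neighbours.
    return Counter(a + b for a, b in zip(letters, letters[1:]))


# Sample input: only the three cipher blocks quoted in this post.
pairs = pair_counts("gasaz yhzhh gggcx")
for pair, n in pairs.most_common(3):
    print(f"{pair}:{n}")
```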
If “f” really represents an “e” as I'm assuming, what could “df”, “fz”, and “af” represent based on our frequencies and the common letter pairs? Well, “df” could be “ae”, “be”, “we”, “he”, etc. However, in the screenshot above I currently have “d” mapped to an “r”, so "df" would become "re", and "re" is not one of the most common pairs in English (yes, lots of words contain “re”, but we’re talking about frequency). So I decided to assign “d” to one of the letters most commonly paired with an “e” in the pair list I noted above. The only one suitable given the letters we initially mapped is an "a", so "df" becomes "ae".
Third Act:
Coincidentally, “fz” and “af” already seem to map to common pairs of letters (fz=“es” and af=“he”), so I assume these are in the correct positions. But what about the other pair in our frequency output, “sd”? Currently that would map to “ra”, which is not one of the most common pairs, but I held off on this for now. Do we see any pattern in how things are mapped yet? Perhaps we have one: “d” is mapped to an “a” and “f” is mapped to an “e”, and "a" and "e" are 4 characters apart. Can I rearrange any of the other mappings so that they are also 4 characters apart? I tried with the letters I had so far. Notice in the following screenshot that I just rearranged the existing letters into positions that reflect mappings 4 characters apart, for example: a --> e, e --> i, n --> r and o --> s are each 4 characters apart.
So, can this 4-character substitution be the “key”? I gave it a try and filled in all remaining mappings under this assumption:
Using this as my key, I can begin to decode the cipher text with the substitutions above. For example, the first 5-character block of the original cipher text, gasaz, becomes THRHS. Applying this to the entire message we get:
THRHS HEOEB INWSY SWNUN
IEBNE SAETC TRHHE HEAES
ESVMS EIIOI XCOOT CKUNY
EIRAF LNWNO LFEDO EOMTL
NRAHS TTKEB FUESY ONGTH
PEUAE POIRA EFLSV RTTAE
YEYSN ONOIL FTFFY THOWC
HEUEO WSRWM WUDEP ORIRU
RFSEL LEAVS DISII TTTLO
HOELN AFRAX
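Applying a substitution key is mechanical once the key is known. The sketch below uses only the part of the key that can be read off the three block pairs quoted in this post (gasaz becomes THRHS, yhzhh becomes DISII, gggcx becomes TTTLO); the full key is not reproduced here, so `PARTIAL_KEY` and `substitute` are illustrative names and any unmapped letter is shown as "?".

```python
# Partial key recovered from the block pairs quoted in the post:
#   gasaz -> THRHS, yhzhh -> DISII, gggcx -> TTTLO
# The remaining letters of the key are not known here.
PARTIAL_KEY = {
    "g": "T", "a": "H", "s": "R", "z": "S",
    "y": "D", "h": "I", "c": "L", "x": "O",
}


def substitute(cipher_text: str, key: dict) -> str:
    """Replace each cipher letter via the key; keep spaces, mark unknowns."""
    out = []
    for ch in cipher_text.lower():
        if ch.isalpha():
            out.append(key.get(ch, "?"))
        else:
            out.append(ch)
    return "".join(out)


print(substitute("gasaz yhzhh gggcx", PARTIAL_KEY))
```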
Fourth Act:
Hmm, that doesn’t look like English at all, or does it? Remember at the beginning I asked what English words could match “yhzhh” or “gggcx” in the original cipher text? Those blocks now appear as “DISII” and “TTTLO” in the second-to-last row. So far we have only used substitution to decipher this text, and no words match these new representations either. I decided to run another frequency test on this output and see the results:
e:26; o:16; t:15; s:15; i:12; n:12; r:11; h:11; a:10; f:9; l:9
Remember that “e t o a i n s h r” are the most common letters in English text. Doesn’t this output seem to follow that frequency? Yes, it does, so why is it not readable yet? Well, if a frequency analysis delivers the expected distribution, as in this example, but the words are not readable, the text has most likely also been run through what is called a transposition cipher. A transposition cipher spreads the information across the message by rearranging the letters (not substituting them), often using columns and rows. Transposition attempts to break up the letter patterns (pairs and triplets) that frequency analysis relies on.
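A simple column transposition of this kind can be sketched in Python. This is a minimal illustration under my own assumptions, not necessarily the exact scheme used on this message: the function names and the row width of 4 are made up for the demo, and the sample input is the short phrase "This is a test".

```python
def columnar_encode(plaintext: str, width: int) -> str:
    """Write the letters into rows of `width`, then read down each column."""
    letters = [ch for ch in plaintext if ch.isalpha()]
    rows = [letters[i:i + width] for i in range(0, len(letters), width)]
    out = []
    for col in range(width):
        for row in rows:
            # The last row may be shorter than the others.
            if col < len(row):
                out.append(row[col])
    return "".join(out)


def in_blocks(text: str, size: int = 5) -> str:
    """Split a string into space-separated blocks for transmission."""
    return " ".join(text[i:i + size] for i in range(0, len(text), size))


encoded = columnar_encode("This is a test", 4)
print(in_blocks(encoded))  # Tiehs siats t
```

Note that the letters themselves are unchanged, so single-letter frequencies survive; only their order, and therefore the pairs and triplets, is disturbed.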
Notice again those repeating letters like “DISIITTTLO”. It's actually the lack of expected letter patterns, strange repeating characters, and especially a lack of any normal triplet combinations thus far that give away the encoding scheme - transposition. Here is an example of a transposition column cipher:
This is a test <-- the plain text we are going to encode
This
isat <-- the plain text written into rows of 4 characters
est
"Tiehs siats t" <-- the plain text rearranged by reading down each column in turn
Fifth and Final Act:
So how do we decode a column cipher given the output above? By trial and error, using frequency and patterns again. But wait a second; didn’t I just say transposition eliminates pattern matching? Yes, but we can brute-force this encoding with the help of the most common pairs of letters in English from the Second Act: TH, EA, ES, EN, OF, TO, IN, IT, IS, BE, AS, AN, AE, AT, SO, WE, HE, BY, OR, ON, DO, IF, ME, MY, UP, NT. Common pairs of repeated letters are SS, EE, TT, FF, LL, MM and OO.
Under the assumption that this is transposition, I start with the substituted text again and pick the first 5 characters to brute-force:
THRHS HEOEB INWSY SWNUN IEBNE SAETC TRHHE HEAES ESVMS EIIOI XCOOT CKUNY EIRAF LNWNO LFEDO EOMTL NRAHS TTKEB FUESY ONGTH PEUAE POIRA EFLSV RTTAE YEYSN ONOIL FTFFY THOWC HEUEO WSRWM WUDEP ORIRU RFSEL LEAVS DISII TTTLO HOELN AFRAX
Next, I will “map” these 5 characters to the next 5 characters to see if we can get any letter-pair patterns.
THRHS HEOEB INWSY SWNUN IEBNE SAETC TRHHE HEAES ESVMS EIIOI XCOOT CKUNY EIRAF LNWNO LFEDO EOMTL NRAHS TTKEB FUESY ONGTH PEUAE POIRA EFLSV RTTAE YEYSN ONOIL FTFFY THOWC HEUEO WSRWM WUDEP ORIRU RFSEL LEAVS DISII TTTLO HOELN AFRAX
The mapping is achieved by placing the first 5 characters above the next 5 characters as shown below; remember, this is a column cipher structure:
T H R H S
H E O E B
Now read each column downward for any letter pairs that are frequently seen in English. We end up with “TH”, and a couple of “HE” pairs, so this is starting to look reliable and our assumptions might have gotten lucky on the first try!
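The same check can be done programmatically. In this sketch the pair list is the one from the Second Act, while `column_pairs` and the variable names are my own; it stacks one block above another and flags the vertical pairs that appear in the list.

```python
# Common English letter pairs, as listed in the Second Act.
COMMON_PAIRS = {
    "TH", "EA", "ES", "EN", "OF", "TO", "IN", "IT", "IS", "BE", "AS", "AN",
    "AE", "AT", "SO", "WE", "HE", "BY", "OR", "ON", "DO", "IF", "ME", "MY",
    "UP", "NT",
}


def column_pairs(top: str, bottom: str) -> list:
    """Stack two blocks and read each column downward as a letter pair."""
    return [a + b for a, b in zip(top, bottom)]


pairs = column_pairs("THRHS", "HEOEB")
hits = [p for p in pairs if p in COMMON_PAIRS]

print(pairs)  # ['TH', 'HE', 'RO', 'HE', 'SB']
print(hits)   # ['TH', 'HE', 'HE']
```

Three hits out of five columns is encouraging, though on its own it is suggestive rather than conclusive.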
Let's keep this going:
We have readable text appearing: "This is the...". In fact, the first letters of all the 5-character blocks concatenate together, followed by the second letters of all the blocks, and so on. Following this grouping we finally end up with:
“This is the excellent foppery of the world: that when we are sick in fortune, often the surfeit of our own behaviour, we make guilty of our disasters the sun, the moon, and the stars, as if we were villains by necessity, fools by heavenly compulsion …”
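The final read-off can be sketched as below, using the substituted text from the Fourth Act as input. The function name `read_down_columns` is my own, and the sketch assumes every block has the same length. The output is one unbroken run of capitals, so word breaks and punctuation still have to be restored by hand.

```python
# The substituted text from the Fourth Act, in 5-character blocks.
SUBSTITUTED = """
THRHS HEOEB INWSY SWNUN IEBNE SAETC TRHHE HEAES ESVMS EIIOI
XCOOT CKUNY EIRAF LNWNO LFEDO EOMTL NRAHS TTKEB FUESY ONGTH
PEUAE POIRA EFLSV RTTAE YEYSN ONOIL FTFFY THOWC HEUEO WSRWM
WUDEP ORIRU RFSEL LEAVS DISII TTTLO HOELN AFRAX
"""


def read_down_columns(text: str) -> str:
    """Stack the blocks as rows, then read column by column."""
    blocks = text.split()
    width = len(blocks[0])  # assumes all blocks are the same length
    # Outer loop: letter position; inner loop: every block in order.
    return "".join(block[i] for i in range(width) for block in blocks)


plain = read_down_columns(SUBSTITUTED)
print(plain[:27])   # THISISTHEEXCELLENTFOPPERYOF
print(plain[-26:])  # FOOLSBYHEAVENLYCOMPULSIONX
```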