Assignment 1 Questions

In the following, suppose that ct is a ciphertext, or, when there is 
more than one, ct1 and ct2 are ciphertexts.  Let pt, pt1, pt2 be the 
corresponding plaintexts, or any standard text such as "blackcat.txt" 
by Edgar Allan Poe.  

################################################################################
################################################################################

Q: How to I read the ciphertext strings into SAGE?

Note that you may read the strings in as 'string monoid elements':

sage: S = AlphabeticStrings()
sage: pt = S.encoding(open("Plaintext/blackcat.txt").read())
sage: ct = S.encoding(open("Ciphertext/cipher06.txt").read())

or as plain vanilla Python strings:

sage: pt = strip_encoding(open("Plaintext/blackcat.txt").read())
sage: ct = strip_encoding(open("Ciphertext/cipher06.txt").read())

but remember in the former case to use apply strip_encoding to remove 
the end-of-line characters '\n'.  Also some of the *'s below, for 
multiplication of 'string monoid elements', must be +'s, for Python 
string concatenation. 

If you use the former 'string monoid elements' then either of the 
following syntax can be used to compute frequency distributions:

sage: X1 = frequency_distribution(ct,1)

or 

sage: X1 = ct.frequency_distribution(1)

and similarly for coincidence_index.

################################################################################
################################################################################

Q: (Question 1) How do I break a Vigenere cipher?

The first unknown you must determine is the period.  For this you 
can use the coincidence index of the decimations

for m in range(1,20):
    print "%s: %s" % (m, sum([ coincidenceindex(ct[i::m]) for i in range(m) ])/m)

A high value, near 0.065, suggests the value of m is the period.

X = frequency_distribution(pt)
eps = 0.33
matches = [ [] for i in range(m) ]
for i in range(1,m):
    Z = ct[i::m].frequency_distribution()
    Y = DiscreteRandomVariable(X,Z.function())
    for j range(26):
        corr = X.translation_correlation(Y)
	if corr > eps:
	   matches[i].append(j)
matches

################################################################################
################################################################################

Q: (Question 2) How do I break a substitution cipher?

Single character frequencies should give you the image y of the 
character E, and guesses for the next most frequent characters.
In order to break the substitution, you should apply the map

               x -> (P(x),P(xy|y),P(yx|y))

where y is the image of a known character, such as E.  You can 
compare these vectors to the data in "digraph_frequencies.pdf" 
from the course web page.  

In what follows, suppose we find that y = 'C' is the image of 'E'.

# Compute the 1 and 2-character frequency distributions:
S = AlphabeticStrings()
ct = S.encoding(open("Ciphertext/cipher06.txt").read())
X1 = frequency_distribution(ct,1)
X2 = frequency_distribution(ct,2)

# Determine the vectors (P(x),P(xy|y),p(yx|y)) in R^3 for ciphertext y = 'E':
E = S('C') # 'C' is the ciphertext image of E
V = VectorSpace(RealField(),3)
AZ = S('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
XE_ct = [ V([X1(AZ[i]),X2(AZ[i]*E)/X1(E),X2(E*AZ[i])/X1(E)]) for i in range(26) ]

# User digraph_frequencies.pdf: identify some isolated vectors of the form
# (P(x),P(xy|y),p(yx|y)) in R^3 for x = 'E' -- set up as a dictionary {} of 
# key:value pairs which can be indexed by the keys.

XE_pt = {
   S('H'):V([0.05744, 0.18046, 0.02125]), # expected image of 'H'
   S('R'):V([0.05674, 0.13071, 0.11352]), # expected image of 'R'
   S('S'):V([0.05669, 0.07553, 0.08050]), # expected image of 'S'
   S('T'):V([0.09364, 0.06919, 0.06151])} # expected image of 'T'

# Define the norm of a vector to be the sqrt of the self-inner product:
def norm(v):
    return v.inner_product(v).sqrt()

# Search for vectors near to those for H, R, S, and T:
for eps in [0.100,0.050,0.025,0.010]:
    print 'eps:', eps
    for x in (S('H'),S('R'),S('S'),S('T')):
        matches = []
        for i in range(26):
	    if norm(XE_ct[i]-XE_pt[x]) < eps:
	        matches.append(AZ[i])
        print '%s: %s' % (x, matches)

Note that by forming the standard vectors for 'HE', 'RE', 'SE', 'TE'
in XE_pt, and elements of XE_ct as vectors in R^3, we can take their 
differences as vectors, followed by their norm to determine what 
elements of XE_ct are close to each of the standard vectors.

Once several characters are known, one has many choices of y to form 
the images x -> (P(x),P(xy|y),P(yx|y)), each of which can be used to 
determine (the preimage of) additional characters x.

################################################################################
################################################################################

Q: (Question 3) There are some repeat entries in the rows defining the 
cryptosystem.  Doesn't this mean that the cryptosystem is not a bijection?

Yes, this is in fact true.  This implies that there will be some plaintexts 
which can not be deciphered.  This is not a good feature of a cryptosystem -- 
it decreases the entropy.  However remember that some modes of operation for 
block ciphers only require an enciphering operation, so it is not strictly 
necessary that a cipher be injective.  The entropy is still well-defined but 
lower than it should be (for a strong cryptosystem).


Q: (Question 3 continued) How do I read in all of that data in a useful why 
in SAGE?

# Everyone should have the same alphabet of 16 elements (and 8 keys):
Alpha = ['NTH','AND','EST','ERE','ETH','THA','THE','TER','WIT','HIS','HAT','INT','ING','HER','HES','ITH']

# Read in the 16 probabilities of the 16 plaintext elements:
Probs = open("Ciphertext/LastFirstMiddle.5.txt").read().split("\n")

Set up a the dictionary of probabilities of these characters, e.g. using 
a Python dictionary type P = { 'NTH':P['NTH'], etc. }.  Note that each line 
of Probs is of the form 'P(XYZ) = 0.nmpq' so split() of this gives 
['P(XYZ)', '=', '0.nmpq'] and we us eval of the entry at position 2 
(= '0.nmpq') to interpret it as a real number.

P = { } # Python 'dictionary'
for i in range(16):
    P[Alpha[i]] = eval(Probs[i].split()[2]) 
# To be really fancy, we can create this as a DiscreteProbabilitySpace:
MSpace = DiscreteProbabilitySpace(Alpha,P)

# Read in the 16 x 8 array of images ciphertexts E(K,M): 
Imags = open("Ciphertext/LastFirstMiddle.4.txt").read().split("\n")

# The conditional entropy should be defined in terms of the 
Crypt = { } #[ '' for j in range(8) ] for i in range(16) ]
for i in range(16):
    Crypt[Alpha[i]] = Imags[i].split()

Now Crypt is the dictionary of ciphertext images for each message, such that 

Crypt['NTH'][0] 

determines the ciphertext image 'NTH' with binary key 000 = 0 and 

Crypt['NTH'][7] 

is the ciphertext image of 'NTH' with binary key 111 = 7.

Now create CSpace as the DiscreteProbabilitySpace for the ciphertext space!

Note that you can determine the entropy of a DiscreteProbabilitySpace (with entropy), but 
check that this agrees with the formula!

Using Crypt as the cryptosystem map and MSpace, you should be able to easily calculate 
conditional probabilties and conditional entropies.

################################################################################
################################################################################

Q: (Question 4) How do I break a substitution-transposition cipher?

Break the transposition cipher first!  Then reduce to a substitution 
ciphertext.

Let coincidence_discriminant be defined as in the text, then you need to look 
for peaks in the coincidence discriminant some sequences of (i,j)-decimations 
of various periods m.  You may fix i = 1 and run through various j (why?).

ct = S.encoding(open("Ciphertext/cipher10.txt").read()); eps = 0.0080

N = len(ct)
for m in range(2,20):
    i = 1
    for j in range(2,m):
        ctij = [ ct[i+k*m]*ct[j+k*m] for k in range(N//m) ]
        dsc = coincidence_discriminant(ctij)
        if dsc > eps:
            print "(1,%s) mod %s: %s" % (j, m, dsc)   

Once you have decided on the correct value of the period m, you can 
repeat this loop with m fixed and varying (i,j), with i < j (why?):

m = 11 # conjectural period 
for i in range(m):
    for j in range(i+1,m):
        ctij = [ ct[i+k*m]*ct[j+k*m] for k in range(N//m) ]
        dsc = coincidence_discriminant(ctij)
        if dsc > eps:
            print "(%s,%s) mod %s: %s" % (i, j, m, dsc)