Wednesday, December 10, 2014

WPS - The secret Numbers in Letters

We all have secrets. Some we keep and some we share. The secrets we keep are generally easily managed. Our brain is an excellent safe that holds numerous secrets that no one will ever know. The secrets we share are harder to keep. If we want to send them to others then we need to encrypt them.

However, sometimes we don't want anyone to know that we share a secret. When the sharing itself must stay secret, we need more than cryptography to send it. We need steganography, the art of hiding messages.

By using steganography (lit. hidden writing) we can send a message through any open insecure channel without others even knowing that a message was sent. It doesn't draw attention or suspicion, as an encrypted e-mail or letter would, and the hidden message is deniable.

In this age of non-existent digital privacy there is still a method of processing and sending messages that resists even the best hackers and "Men in Black" organizations: pen and paper. Just as there is unbreakable pen-and-paper encryption, there is also fully deniable steganography.

Many steganographic techniques were invented in past centuries: drawings with embedded codes or signs, invisible ink, and harmless-looking text with minuscule typographic differences or grammatical alterations controlled by some algorithm. Most of them, however, fail when it comes to hiding the fact that steganography has been used.

Typographic changes, however small they may be, are visible, since the receiver must be able to see them. Obviously, unusual font changes or extra spaces in digital text files are easily detected. Secret words, embedded at certain places, might be out of context. The required grammatical changes or rules, applied to the cover text, often don't withstand the scrutiny of a human reader, who can easily spot subtle but suspicious changes in natural language that don't fit the content or style of the cover text.

Fully deniable steganography has some important requirements: it should be impossible to detect the use of steganography, as this would in essence be a failure. After all, its goal was to hide the fact that an encrypted message was sent. Also, any attempt to extract the hidden message should never reveal the message nor the use of steganography, even when the method is known. Therefore, the message should always be encrypted prior to hiding. Otherwise, any eavesdropper who knows the steganographic method could extract the plain message.

One method that meets these conditions is the Words-Per-Sentence system or WPS. It's a simple yet effective text-based method to conceal a message without the use of complex mathematical or grammatical tricks, and it offers complete freedom of writing style and content. The system consists of three steps: converting text into digits, encrypting those digits and hiding them in an innocent cover text.

Step 1 - Convert text into digits

This can be done with a straddling checkerboard. Such a table converts the high-frequency letters into one-digit values and the other letters into two-digit values, producing a relatively economical conversion.
 
Optionally, to compress the message considerably, you can use three- or four-digit codes (preceded by 0 - CODE) that represent common words, expressions or even whole phrases, taken from a code book or sheet (more about code books in section VI of this paper (pdf)).


Let's convert the phrase "meeting at 14 PM in NY." Note that we repeat figures three times to exclude errors.

M  E E T I N G     A T     1   4     P  M     I N    N Y  (.)
79 2 2 6 3 4 74 99 1 6 90 111 444 90 80 79 99 3 4 99 4 88 91
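
The conversion above can be sketched in a few lines of Python. This is only an illustration: the checkerboard layout below is an assumption chosen so that it reproduces the digits of the example (the entries the example does not use are guesses), and the helper names `build_table`, `text_to_digits` and `group5` are hypothetical. It also assumes that a number is wrapped in the figure code 90 with each digit tripled, and that the final group is padded by repeating the period code, as the example suggests.

```python
import re

# Assumed checkerboard, consistent with the example digits above.
SINGLE = {"A": "1", "E": "2", "I": "3", "N": "4", "O": "5", "T": "6"}
ROW7 = "BCDFGHJKLM"  # assumed codes 70..79
ROW8 = "PQRSUVWXYZ"  # assumed codes 80..89
FIG, PERIOD, SPACE = "90", "91", "99"


def build_table():
    """Return a character-to-digits mapping for the assumed checkerboard."""
    table = dict(SINGLE)
    for i, ch in enumerate(ROW7):
        table[ch] = f"7{i}"
    for i, ch in enumerate(ROW8):
        table[ch] = f"8{i}"
    table["."] = PERIOD
    table[" "] = SPACE
    return table


def text_to_digits(text, table):
    """Convert text to one digit string; numbers become FIG ddd ddd ... FIG."""
    out = []
    # Split off runs of figures; the spaces around a number are absorbed.
    for part in re.split(r"\s*(\d+)\s*", text.upper()):
        if part.isdigit():
            out.append(FIG + "".join(d * 3 for d in part) + FIG)
        else:
            # Characters missing from the table raise KeyError.
            out.extend(table[ch] for ch in part)
    return "".join(out)


def group5(digits, pad=PERIOD):
    """Split into groups of five, padding the last group with the period code."""
    short = -len(digits) % 5
    digits += (pad * 3)[:short]
    return [digits[i:i + 5] for i in range(0, len(digits), 5)]


groups = group5(text_to_digits("meeting at 14 PM in NY.", build_table()))
print(" ".join(groups))
# 79226 34749 91690 11144 49080 79993 49948 89191
```

Code-book entries (the 0 prefix) are left out of this sketch.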

Step 2 - Encrypt the digits

The letter-to-digit conversion is no protection whatsoever! We could scramble the letters of the checkerboard, but this provides only very limited protection. So, we must encrypt the digits. There are various manual cipher systems, but the most secure one is the unbreakable one-time pad. More detailed info in this paper.

Suppose our truly random one-time pad key starts with the following groups:

68496 47757 10126 36660 25066 07418 79781 48209 28600

The one-time pad key is written out underneath the plaintext digits. The first group of the pad serves as key indicator for the receiver and must be skipped in the encryption process. The key is subtracted from left to right from the plaintext without borrowing (a so-called modulo 10 subtraction):

Plain : KEYID 79226 34749 91690 11144 49080 79993 49948 89191
OTP(-): 68496 47757 10126 36660 25066 07418 79781 48209 28600
        -----------------------------------------------------
Cipher: 68496 32579 24623 65030 96188 42672 00212 01749 61591
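
The subtraction without borrowing can be checked with a short Python sketch. The function name `otp_encrypt` is hypothetical, and the key and plaintext groups are simply copied from the example above.

```python
def otp_encrypt(plain_groups, key_groups):
    """Subtract the key from the plaintext digit by digit, modulo 10.

    The first key group is the key indicator: it is sent as-is and is
    not used for encryption.
    """
    key_id, *key_body = key_groups
    if len(key_body) < len(plain_groups):
        raise ValueError("one-time pad key is too short for this message")
    cipher = [key_id]
    for p, k in zip(plain_groups, key_body):
        cipher.append("".join(str((int(a) - int(b)) % 10) for a, b in zip(p, k)))
    return cipher


plain = "79226 34749 91690 11144 49080 79993 49948 89191".split()
key = "68496 47757 10126 36660 25066 07418 79781 48209 28600".split()

print(" ".join(otp_encrypt(plain, key)))
# 68496 32579 24623 65030 96188 42672 00212 01749 61591
```

Each digit is handled on its own, so no borrow ever moves to the neighbouring column, which is what keeps the method easy to do by hand.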

Step 3 - Hide the encrypted digits

Now that we have a secure message, we must hide the ciphertext digits in a text. For each digit, a sentence is composed with as many words as the digit + 5 (or any other pre-arranged value). Adding 5 to the total ensures that all sentences have at least five words. Words like “it’s”, “you’re” or “set-up” are regarded as one word. To avoid statistical bias, some sentences with fewer than 5 or more than 14 words should be added (these are later simply ignored). The first ciphertext group 68496 from our example message is hidden in the first part of a letter, shown here below:

Dear John,

I hope everything is going well with you and the family. If possible, Katherine and I would love to visit you somewhere next month. We could make it a weekend at the lake. The next few weeks are rather quiet so any date is fine for us. What do you think? If you’re interested, just pick a date and I’ll arrange everything.
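
A writer can check a draft like the letter above mechanically. The Python sketch below is an illustration under simple assumptions: sentences end at a period, question mark or exclamation mark followed by whitespace, words are whitespace-separated (so "you're" and "set-up" count as one word), and the salutation is left out. The names `target_lengths`, `split_sentences` and `extract_digits` are hypothetical.

```python
import re

BASE = 5  # pre-arranged value added to every digit


def target_lengths(digits, base=BASE):
    """Sentence lengths the writer has to hit for the given ciphertext digits."""
    return [int(d) + base for d in digits]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_digits(text, base=BASE):
    """Recover the hidden digits; sentences outside base..base+9 are nulls."""
    digits = []
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if base <= n_words <= base + 9:
            digits.append(str(n_words - base))
    return "".join(digits)


letter = (
    "I hope everything is going well with you and the family. "
    "If possible, Katherine and I would love to visit you somewhere next month. "
    "We could make it a weekend at the lake. "
    "The next few weeks are rather quiet so any date is fine for us. "
    "What do you think? "
    "If you're interested, just pick a date and I'll arrange everything."
)

print(target_lengths("68496"))  # [11, 13, 9, 14, 11]
print(extract_digits(letter))   # 68496
```

The four-word question "What do you think?" falls below the minimum of five words and is skipped as a null. Abbreviations with periods would confuse this naive splitter, so a human count remains the reference.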

To retrieve the original digits, the receiver simply subtracts 5 from the total number of words in each sentence, ignoring sentences with fewer than 5 or more than 14 words. He counts 11 words in the first sentence and thus knows that the first digit is 11 - 5 = 6, and so on. He writes the proper one-time pad key underneath the extracted digits (skipping the key indicator) and adds ciphertext and key together without carry (modulo 10 addition). Finally, he converts the plaintext digits back into readable text with his own checkerboard.
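
The receiver's last two steps, adding the key without carry and reading the digits back through the checkerboard, can be sketched as follows. As before, this is only an illustration: the checkerboard layout is an assumption that matches the example digits, the function names are hypothetical, the figure code 90 is rendered as a space, and code-book entries (prefix 0) are not handled.

```python
def otp_decrypt(cipher_groups, key_groups):
    """Add the key to the ciphertext digit by digit, modulo 10 (no carry)."""
    key_id, *key_body = key_groups
    if cipher_groups[0] != key_id:
        raise ValueError("key indicator does not match this one-time pad")
    plain = []
    for c, k in zip(cipher_groups[1:], key_body):
        plain.append("".join(str((int(a) + int(b)) % 10) for a, b in zip(c, k)))
    return "".join(plain)


def build_reverse_table():
    """Digits-to-character mapping for the assumed checkerboard."""
    rev = {"1": "A", "2": "E", "3": "I", "4": "N", "5": "O", "6": "T"}
    for i, ch in enumerate("BCDFGHJKLM"):  # assumed codes 70..79
        rev[f"7{i}"] = ch
    for i, ch in enumerate("PQRSUVWXYZ"):  # assumed codes 80..89
        rev[f"8{i}"] = ch
    rev["91"] = "."
    rev["99"] = " "
    return rev


def digits_to_text(digits, rev):
    """Read the digit string back into text."""
    out, i, figures = [], 0, False
    while i < len(digits):
        if digits[i:i + 2] == "90":      # figure code toggles number mode
            figures = not figures
            out.append(" ")
            i += 2
        elif figures:                    # each figure was written three times
            out.append(digits[i])
            i += 3
        elif digits[i] in "123456":      # one-digit letters
            out.append(rev[digits[i]])
            i += 1
        else:                            # two-digit codes; unknown ones raise KeyError
            out.append(rev[digits[i:i + 2]])
            i += 2
    return "".join(out)


cipher = "68496 32579 24623 65030 96188 42672 00212 01749 61591".split()
key = "68496 47757 10126 36660 25066 07418 79781 48209 28600".split()

plain_digits = otp_decrypt(cipher, key)
print(digits_to_text(plain_digits, build_reverse_table()))
# MEETING AT 14 PM IN NY..
```

The doubled period at the end is the padding of the last five-digit group; a human reader simply ignores it.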

The advantages of WPS are excellent literary freedom and the lack of complex calculations or algorithms. Always start by writing a meaningful text and then play with the words to obtain the required sentence length. Exclude the salutation in a letter from the system, as a nine-word salutation would obviously arouse suspicion.

Thanks to WPS, the hidden message is fully deniable. There is no way to ever prove the existence of a message inside the innocent looking letter without having the proper one-time pad key. Even when the eavesdropper knows the method used, he can merely extract some meaningless digits, as he would retrieve from any other "clean" text. We now have a safe method to send encrypted messages openly by postal mail, e-mail or Internet forums.

Or how you can hide numbers in letters ;-)

This pen-and-paper WPS system is an important advantage in today's digital world, where secure personal computers, smartphones or tablets are a fairytale and virtually all means of communication are prone to eavesdropping. Of course, the cover text itself can be read by anyone, and you will need a good excuse for any nonsense you wrote and for whom you wrote it to. It's better to write a meaningful text and story based on facts.

10 comments:

Anonymous said...

WPS does not seem very efficient. You need to write a long message to encode one sentence. Maybe more practical in combination with a codebook?

It might help if you change this "prove you're not a robot" verification system in the "comments". There are more user friendly systems (like the house number).

Dirk Rijmenants said...

@ Anon,

WPS is actually pretty efficient. The checkerboard also contains a code prefix (0). Also, additional extensive information about using code sheets/books to compress the inserted message is found in the links I provided. I might indeed state this more clearly. I will add a reference to this in the post.

Steganographic methods to achieve a higher payload for a given carrier text do exist but these all have a common flaw: the higher the payload, the more obvious it gets that the carrier text is manipulated. You can only achieve a denser payload by modifying adjectives, conjugations, adverbs, changing the syntax or replacing words by prearranged synonyms. Software exists that does a nice job in manipulating text to insert more information than WPS, but this inevitably results in curious phrases, up to plain ridiculous nonsense. Virtually none of these will survive the scrutiny of a human, reading the carrier text, and will at the least arouse serious suspicion. WPS gives complete linguistic freedom to write in the words and style of that specific person about a subject that makes sense. Forcing a higher payload produces the weirdest pieces of text.

Inevitably, steganography will enable you to insert a short message in an essay, but never an essay in a short message.

Jann Horn said...

Can't an attacker detect the use of steganography because the sentence lengths have higher entropy than normal?

Dirk Rijmenants said...

@TheJH, entropy is a wonderful statistical tool in cryptography that can provide various clues about a set of letters or digits in a ciphertext. The entropy of enough letters can give a pretty accurate indication of the nature of a language. Letters, their combinations and their position in words are never random but follow strict linguistic rules that define each specific natural language. This is where entropy is at its best.

Calculating the entropy of the lengths of sentences is a whole other thing. The length of a given sentence does not relate to the lengths of the sentences before or after it, nor does it follow strict linguistic rules. The length of a sentence isn’t determined by linguistic rules and their typical statistical properties, but rather by the linguistic skill of the writer and the complexity of the subject (small talk, literature, technical). Also, less adept writers and readers find sentences with fewer than 10 words easy to write/read and sentences with more than 15 words more complex, something for the more skilled. WPS with its (random) ciphertext digits and the digit + 5 rule will produce random sentence lengths between 5 and 14 words, with an optionally added variable number of longer or shorter non-digit-hiding sentences (call them nulls). Therefore, entropy calculation of these lengths will not provide any conclusive results, let alone determine whether they indicate steganography.

As the writer will start by composing a cover text before adapting it for hiding the digits, he will quickly notice whether the to-be-adjusted sentence lengths suit his style. If he is used to writing longer sentences, he can either write more long null sentences or, when he really hates/avoids shorter sentences and wants to raise the whole spectrum of lengths, simply change the digit + 5 rule into, for instance, digit + 9, or any other value, while still being able to write some short null sentences if required.

Anonymous said...

seems like a pseudosteganographic "noise" of "traffic" with zero actual meaning would tend to occupy a disproportionate effort by those actors surveying the stream and thus make statistical probability of the examination of an Nth discrete real communication take longer to achieve - thus a metasteganographic technique that creates a crypto-crypto is a logically implied strategy. one might name this strategy "LBCW" (little boy cries wolf).

Greg Melton said...

Ironically, it appears that WPS would be ideally suited for the "digital age" in that with the increased amount of digital surveillance by both governments as well as commercial interests, a conventionally encrypted email (gpg or pgp comes to mind) or sms would very quickly trigger an automated flag in someone's database. WPS provides plausible deniability particularly if OTP has been properly employed. Most likely a code book would be required to keep the payload as small as possible.

Thanks Dirk for a clear lucid explanation of WPS!

Anonymous said...

Hey Dirk,

Rather than include null sentences and have to add an arbitrary number to your encrypted message to transmit zeros, could one instead use the number of words modulo the number of possible digits?

For example: If I am only sending digits between 0 - 9, and I want to transmit a zero, could I make my sentence 10 words since 10 modulo 10 is zero? To send a 1 I could make the sentence 11 words long, etc.

Thank you for all your great work.

-R

Dirk Rijmenants said...

Hi R,

Thanks for your comment! Using mod 10 is indeed another solution, just a bit more difficult and prone to errors than simply adding a fixed number. With mod 10 you can't use dummy sentences of, say, 3, 4 or 15 words as in the fixed +5 system, because under mod 10 a sentence of any length always produces some digit from 0 to 9, unless you agree on a min/max number of words in a sentence. On the other hand, in a mod 10 system, a digit can be represented by different numbers of words. Still, mod 10 is also a valid option and there are more systems possible, as long as they are simple and not prone to errors, so they can be used by untrained people. In cryptography, the mod function is pretty much the golden nugget, but not everyone knows how to use it.
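
The difference between the two rules discussed in this exchange can be made concrete with a small Python sketch. The function names and the length bounds of 5 and 24 words are hypothetical, chosen only for illustration.

```python
def digit_fixed_offset(n_words, base=5):
    """Fixed-offset rule: lengths outside base..base+9 are nulls (None)."""
    if base <= n_words <= base + 9:
        return n_words - base
    return None


def digit_mod10(n_words):
    """Mod-10 rule: every sentence length maps to some digit."""
    return n_words % 10


def lengths_for_digit(digit, shortest=5, longest=24):
    """All sentence lengths in the agreed range that encode the given digit."""
    return [n for n in range(shortest, longest + 1) if n % 10 == digit]


for n in (4, 10, 15, 20):
    print(n, digit_fixed_offset(n), digit_mod10(n))
# 4 None 4
# 10 5 0
# 15 None 5
# 20 None 0

print(lengths_for_digit(0))  # [10, 20]
```

Under the fixed offset, lengths of 4, 15 and 20 words are free nulls. Under mod 10 they all carry a digit, but each digit can be written with more than one sentence length.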

Ilya said...

>There is no way to ever prove the existence of a message inside the innocent looking letter without having the proper one-time pad key. Even when the eavesdropper knows the method used, he can merely extract some meaningless digits, as he would retrieve from any other "clean" text.
This seems very wrong, since sentence lengths follow very specific distributions, roughly bell-shaped and skewed, very different from the uniform distribution of digits in the ciphertext and especially from the distribution of lengths of the null sentences. For example, in the Brown corpus the frequency of 5-word sentences is 3.04 and of 15-word sentences is 3.75, with a mode around 15. A lengthy enough letter will surely disclose that the frequencies stand out from the linguistic norm, but it's much worse with the null sentences! These distributions have long tails: for example, the frequency of 35-word sentences is 1.05.
In fact, in your example one "can easily spot subtle but suspicious changes in natural language that don't fit in the content or style of the cover text" because the fragment so lacks long sentences! I'm not a native speaker but I would write "If possible, Katherine and I would love to visit you somewhere next month. We could make it a weekend at the lake" as one sentence, possibly separated by a comma, a dash or a colon.
I'd argue that it will be plainly impossible to write a text with plausible null sentence frequencies without calculating all the frequencies and editing the text which looks suspicious. While doing that with pen and paper, a user of such a system is likely to just switch to PGP or any other normal modern electronic cryptography.

Dirk Rijmenants said...

Hi Ilya,

Thanks for your comment. The point of WPS is that, apart from the sentences that represent a digit, you have a completely free choice of sentence length, and can use/insert any number of shorter or longer sentences. We're not talking about a huge text with many sentences that, by their length, show the average frequency distributions found in the Brown Corpus. Don't forget that sentence length, construction and the use of punctuation marks (which enable very long sentences) differ greatly, depending on writing skills.

Take 100 small random texts on various subjects, written by people with different degrees of writing proficiency, and you will find far more bias than in large, statistically random texts. You wrote "can easily spot subtle but suspicious changes". As with any analysis, you could find statistical bias that you find "suspicious", but that's not proof of encrypted text. You could add shorter or longer sentences, but there's not even a need, and who cares, as they can't actually prove you used WPS.

However, when you know the writer, and know he is an operative, you still have no proof whatsoever that the text contains a message. If you test your assumptions, and count the words (that is, assuming you know the secret pre-arranged base value) you get truly random and meaningless numbers from those sentence lengths that "could" fit the WPS system. Because they used one-time pad encryption, you can never prove there's a message, and you can never decrypt the message.

Also, when having to use massive metadata collection and analysis to search for and detect that "suspicious" e-mail or letter among the millions that are sent each minute, good luck. Even if you have a target, a suspect, and check all his letters/e-mails, and find a letter that doesn't follow the Brown Corpus stats, you have no proof whatsoever.

And if you'd switch to PGP, good luck. It might protect your everyday e-mails, but don't count on it if you're up to no good and they suspect you. In theory it is unbreakable, but using it on the average computer creates so many flaws, from bad random generation, over weak public-key pairs, to insecure computers or encryption that is just that bit too weak. Remember Operation Ghost Stories, with encrypted messages hidden by steganography? You could encrypt love letters for your mistress, but I would not bet on regular cryptography when you're on the radar, as cracking codes is more than cryptanalysis.

Actually, today, writing a letter is far more secure than sending an e-mail, as the likelihood of interception is low, unless they suspect you. Remember the Stasi (and others), who checked massive amounts of letters and had special machines to open them without you noticing. That was a daunting task, and it would still be today. Could you imagine re-routing all paper letters to the NSA to open and read them? And even then, there's no proof of a secret message whatsoever with WPS.

To conclude, we can theorize as much as we like, but in the end, most are simply caught by operational errors. And if you use encryption, the only one you can trust to be really unbreakable is the one-time pad, done manually on paper, because today we're walking digital targets. But even when going completely dark/offline, or using OTP, a tiny operational error is easily made. If they can catch Hanssen, the FBI's own counterintelligence hotshot, then don't believe that stealth, secrecy or encryption will save your butt ;-]