Problem analysis: Steganography (stage III, XXXIII OI)

Problem statement

A string consisting of 26 lowercase letters of the English alphabet of length \(n \leq 200,000\) is given. Our task is to create a bit array of size \(k \times k\), where \(k = 1000\), in such a way that after permuting the rows and columns of this array, we are able to recover the original message.

Subtask \(n \leq 4\)

Note that permuting the rows and columns of the array does not change the number of bits set to 1 in it. Let's treat the string as a number written in base-27 (26 letters of the alphabet + end-of-string character). Note that this number does not exceed \(27^4 < 1,000,000\), which allows us to set exactly this many bits in the array to 1.

Subtask \(n \leq 35\)

We will use the fact that if row \(w_i\) is moved to position \(j\) as a result of permuting the array's rows and columns, then row \(w_i\) of the original array contains the same number of ones as row \(w_j\) of the new array. We can therefore use the number of ones in a row to store information that will allow its identification in the permuted array. Using the first 35 rows, we will encode a pair in each: (row index, character), by setting the appropriate number of bits to 1. We need to encode the indices of 35 rows, and the number of characters we need is 27 (26 letters of the alphabet and the end-of-string character), which gives \(35 \cdot 27 = 945\) possible pairs to encode, which fits within the limit of set bits in a row (maximum \(k = 1000\) bits).

Message encoding

The first step we must take to fully solve the problem is to encode the text message as a sequence of bits. To do this, we can encode each character sequentially, representing the letters of the English alphabet as numbers from 1 to 26. The number 0 can denote the end of the string – if we encounter it while reading, it means that it is the end of the message. We will need 5 bits to encode each character. This gives a total of \(5 \cdot n\) bits needed to encode the message and is sufficient to pass the subtasks, where \(n \leq 190,000\).

To obtain full points for the task, we need to encode the string more efficiently. We will use the first 18 bits to encode the length of the string (since \(2^{18} > 200,000\)). Then, we will group consecutive letters into blocks of 4 characters. We will treat each such block as a number written in base-26. To encode a given block, we will convert the corresponding number to the binary system – we will need \(\lceil \log_2{26^4} \rceil = 19\) bits for this. In this way, we are able to encode a string of length \(n\) using \(18 + \lceil \frac{n}{4} \rceil \cdot 19\) bits, which for \(n = 200,000\) is \(950,018 < 980 \cdot 980\).

What if only the rows of the array are permuted?

To read the message, we must recover the order of the rows. We can use the last 10 columns to store the row indices, and encode the message in the remaining area. To recover the message, we must first read the row indices. Thanks to this, we can restore the array to its original state and read the encoded message.

Subtask \(n \leq 20,000\)

Again, we will use the fact that if row \(w_i\) is moved to position \(j\) as a result of permuting the array's rows and columns, then row \(w_i\) of the original array contains the same number of ones as row \(w_j\) of the new array. We want the number of ones in a row to increase as the row indices increase. We will encode the message using the first 323 rows and the first 333 columns. Additionally, we will encode the indices of the first 333 columns in rows numbered 324 to 332. Then, for all rows numbered from 1 to 332, we will fill them with such a number of ones that as the row numbers increase, the number of ones in the row also increases. Formally, for row \(w_i\), let the number of set bits in the first 333 fields of \(w_i\) be \(b_i\), then exactly \(333 - b_i + i\) bits should be set to 1 in the columns with indices from 334 to 1000 in row \(w_i\). Next, we will also fill row 333 so that zeros are entered in the first 333 columns, and ones in the remaining ones – it will be used to mark the first 333 columns. In rows numbered 667 to 1000, we write only ones.

Figure showing the above construction

To recover the message, we should sort the rows according to the number of ones in a given row. Then we iterate over all columns and check if a 0 is entered in a given column in row 333 – if not, we skip this column; if yes, we read the column index from it. In this way, we have recovered both the order of the rows and the order of the columns in which the string was encoded, which allows us to read the message without any problems.

Model solution

We will use an area of size \(980 \times 980\) to encode the message. For each of the rows from 1 to 990 in columns 991 to 1000, we will enter the index of this row (indexing from 1). Then, for each of the columns from 1 to 980 in rows 981 to 990, we will enter the index of this column (again, indexing from 1). For each of the rows from 1 to 990 in columns 981 to 990, we will enter only ones (each of these 10 columns will therefore contain exactly 990 ones). We will use the last 10 rows to encode the positions of the last 10 columns – corresponding to the row indices, so that they can be easily found after permuting the rows and columns of the array. In row 991, we will enter zeros in all positions except the last column, in row 992 there will be zeros in all positions except the last two columns, and so on up to the last row.

Figure showing the above construction

The decoding program will first recover the indices of the last 10 rows from the original array. The last row will be the one that has exactly 10 ones, the penultimate one will be the one that has exactly 9 of them, etc. Note that they are determined uniquely, because every row with a number less than 991 has at least 11 ones (we added 10 in columns 981 to 990, and at least one set bit appears in the row index).

The only field in row 991 from the original array that contains a 1 indicates the last column from the original array. Similarly, the two fields with ones from row 992 indicate the last two columns from the original array, three fields with ones from row 993 indicate the last three columns, etc. Applying this reasoning, we can recover the last 10 columns from the original array and their order.

Then, having recovered the last 10 columns, we are able to read the row indices from 1 to 990, and then from rows 981 to 990 we read the column indices from 1 to 980 (we discard the 10 columns from 981 to 990, which contain exactly 990 ones), which allows us to recover the entire message.