The last 31 bases are handled specially because we are encoding 4 bases per byte, yet the RNA sequence consists of 29,903 bases, which is not a multiple of 4. Thus we must either (i) leave some bases out of the encoding process or (ii) pad the last byte with 2 dummy bits. We choose the former approach because it results in shorter code. Sensible choices are the last 3, 7, 11, 15, 19, 23, 27, or 31 bases, because (i) these bases are all As and so are easily appended, and (ii) the size of the sequence to be encoded is then a multiple of 4. The optimal choice is 31: each lower value successively increases the size of f by a byte, but the size of the decoding program either doesn't change (for 11, 15, 19, 23, or 27) or decreases by only a byte (for 3 or 7).
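The packing scheme described above can be sketched roughly as follows. This is only an illustration: the 2-bit code in `CODE` and the names `pack` and `unpack` are assumptions, since the actual answer's mapping and decoder are not shown here.

```python
# Hypothetical 2-bit code per base; the original answer's mapping is not shown.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
TAIL = 31  # number of trailing As left out of the encoding


def pack(seq):
    """Pack all but the last TAIL bases into bytes, 4 bases per byte."""
    body = seq[:-TAIL]
    # The scheme only works if the tail is all As and the rest is a multiple of 4.
    assert seq[-TAIL:] == "A" * TAIL and len(body) % 4 == 0
    out = bytearray()
    for i in range(0, len(body), 4):
        b = 0
        for ch in body[i:i + 4]:
            b = (b << 2) | CODE[ch]  # first base ends up in the top 2 bits
        out.append(b)
    return bytes(out)


def unpack(data):
    """Inverse of pack: expand each byte to 4 bases, then append the As."""
    bases = []
    for b in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(b >> shift) & 3])
    return "".join(bases) + "A" * TAIL
```

With 31 bases left out, the 29,903-base sequence would pack into (29903 - 31) / 4 = 7468 bytes, and the decoder restores the tail simply by appending 31 As.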
Note: this was done with version 1.10 of zpaq, released in 2009, obtained today for Ubuntu 18.04 via apt-get. The latest release 7.15 of zpaq not only has much more verbose option syntax, but does not even support the older options for omitting metadata. Therefore this cannot be reproduced with the current release of zpaq.
Posted here only to show the nucleotide-grouping approach, as the decoder overhead is very large.

So let's start with the explanations. As can be seen from MN908947, the virus mRNA contains a "head" (265 nucleotides), a "tail" (229 nucleotides), and translated parts with some non-translated parts between them. So, we have 10 genes, named orf1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10 in the link above.

It turns out we can use codon usage bias, as the picture of codon (= nucleotide triple) usage frequencies is very impressive, e.g. for orf1ab:

[figure: codon usage frequencies for orf1ab]

or for S (looks very similar):

[figure: codon usage frequencies for S]

and for N (looks different):

[figure: codon usage frequencies for N]

More precisely, here's the table of normed covariance of the codon usage distributions:
Looking back at the comments, it only remains to explain how cut works. The Python reduce function takes \$3\$ arguments: function(accumulator,item), the sequence, and the optional initial value of the accumulator. At each step, the new accumulator = function(accumulator from the previous step, item from the sequence). In other words, reduce feeds the sequence into the function item by item and returns the accumulator when the sequence runs out. E.g. a way to replace math.prod: lambda z:reduce(lambda x,y:x*y,z,1).
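To make the accumulator steps concrete, here is a small sketch. `prod` is the lambda from the text; `reduce_trace` is a hypothetical helper, written only for illustration, that does what reduce does while also recording every intermediate accumulator value.

```python
from functools import reduce

# reduce(function, sequence, initial): the accumulator starts at `initial`
# and is replaced by function(accumulator, item) for each item in turn.
prod = lambda z: reduce(lambda x, y: x * y, z, 1)


def reduce_trace(function, sequence, initial):
    """Hand-written equivalent of reduce that also records each accumulator."""
    acc = initial
    steps = [acc]
    for item in sequence:
        acc = function(acc, item)
        steps.append(acc)
    return acc, steps
```

For the list [2, 3, 4] the accumulator goes 1, 2, 6, 24, and 24 is what reduce returns.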
So the accumulator in this particular case is chosen to contain [the sequence to cut, array of already cut pieces]: acc[0] is the sequence to cut and acc[1] is the already cut pieces, so acc[0][n:],acc[1]+[acc[0][:n]] makes perfect sense now, doesn't it? And we need only the reduce return value at index \$1\$ (i.e. the pieces sequence).
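An ungolfed sketch of such a cut function could look like this. The name `cut` comes from the text, but the argument order and the use of a tuple for the accumulator are assumptions here, not the answer's actual code.

```python
from functools import reduce


def cut(seq, lengths):
    """Split seq into consecutive pieces with the given lengths."""
    # acc[0] is the part still to be cut, acc[1] is the list of pieces so far.
    step = lambda acc, n: (acc[0][n:], acc[1] + [acc[0][:n]])
    # Start with the whole sequence and no pieces; keep only the pieces.
    return reduce(step, lengths, (seq, []))[1]
```

E.g. cutting "ABCDEFGHIJ" with lengths [2, 3, 4] gives the pieces "AB", "CDE", "FGHI"; whatever is left in acc[0] at the end (here "J") is dropped.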
Mathematica somehow knows when you're performing one-huge-exact-number arithmetic coding, and segfaults on every single action. It feels like this solution somehow works both only in theory and only in practice.
Didn't go quite like I'd hoped. I was hoping for better compression by turning it into a byte array first. I'm also really bad at combination theory and thought I could use only 4 bits to store the compression dictionary. Base64 really exploded my compressed string, something like 7.5k to over 9.5k bytes.
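The Base64 blow-up is easy to check: Base64 emits 4 characters for every 3 input bytes, so roughly a third is added on top. A quick sketch (`base64_size` is a hypothetical helper name, and the all-zero input is just a stand-in for the real compressed data):

```python
import base64


def base64_size(n_bytes):
    """Length in characters of the standard Base64 encoding of n_bytes bytes."""
    # The content of the input does not matter for the size, so zeros will do.
    return len(base64.b64encode(bytes(n_bytes)))
```

For an input of 7500 bytes this gives 10000 characters, in line with the jump from about 7.5k to over 9.5k.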
To encode the string in the first place, the following code was used. It is not golfed, as the answer above is completely standalone and does not need the following code, only its result (which is the string b):