Harrison Green

Frameshift resilient fuzzing

TL;DR: how a biology experiment from the 1960s inspired a modern-day fuzzing improvement (and a new default feature in AFL++ 4.40).

📄 Paper (ICSE 2026) · AFL++ 4.40c Release · LibAFL Implementation

Inside our cells, we have intricate machinery (in the form of proteins) that processes and interprets our genetic information, not unlike how our computers run intricate programs (in the form of code) that process and interpret other kinds of information (image files, codecs, serialized data).

In our cells, however, information naturally gets mutated fairly often. The copying machinery is much less reliable than in computers. The interesting question is: what is the effect of a mutation relative to the change itself? Some mutations have little or no effect, while others have a drastic effect.

Why is that?

In computers, information ideally doesn’t get mutated. But in some cases it’s unavoidable, such as when trying to transmit data over a lossy medium, and in other cases it’s desirable, such as when we are trying to fuzz a program. In the latter case, we intentionally mutate data over and over to try to explore new program states.

In this post, I want to briefly explore both of these cases: genetic information mutation (natural process) and data information mutation (during fuzzing), compare the parallels between them, and then discuss how these thoughts led to research on improving fuzzers, eventually resulting in a paper at ICSE 2026 and a new feature enabled by default in AFL++ 4.40.

Part 1: The Genetic Machine

The “genetic machine” takes DNA as input and produces (observable) protein function as output.

[Diagram: DNA → Genetic Machine → Protein function]

The Central Dogma of Molecular Biology

At a high level, DNA stores information about how to make proteins. These proteins are basically cellular machines, carrying out most of the functions of the cell.

Each protein is effectively nothing more than a sequence of amino-acids (most life uses just 20 options). However, a lot of variation is possible. After linking together anywhere from a few dozen to thousands of amino-acids, the resulting protein folds up into a specific shape (called a conformation), dictated by molecular interaction forces between the different amino acids and the environment (mostly water).

Ignoring the (many) edge cases and variations on this process, the core idea (indeed the central dogma of molecular biology) is that DNA is transcribed into RNA (effectively just a copy of the DNA sequence) which is then translated into a protein sequence.

Notably, because there are 20 different amino-acids but only 4 different nucleotides (A, C, G, T), we need multiple nucleotides to represent each amino-acid. In practice, it turns out that three consecutive nucleotides (called a codon) map to a single amino-acid. Three of the 64 codons instead act as stop codons, which tell the protein translation machinery that it has reached the end of the sequence.
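As a sketch, this codon translation can be written in a few lines of Python. Note that the codon table here is deliberately partial (a real table covers all 64 codons); it includes just the codons used in this post:

```python
# Partial codon table for illustration; a full table maps all 64 codons.
CODON_TABLE = {
    "GCT": "Ala", "TCT": "Ser", "AAA": "Lys", "GGT": "Gly",
    "GAA": "Glu", "CTG": "Leu", "TTT": "Phe", "ACT": "Thr",
    "CAC": "His", "CAT": "His", "TAC": "Tyr",
    "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
}

def translate(dna: str) -> list[str]:
    """Group the sequence into 3-nucleotide codons and map each to an
    amino-acid, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "?")  # "?" = not in partial table
        if aa == "STOP":
            break
        protein.append(aa)
    return protein

# The start of the GFP coding sequence used later in this post:
print(translate("GCTTCTAAAGGTGAAGAACTGTTTACT"))
# ['Ala', 'Ser', 'Lys', 'Gly', 'Glu', 'Glu', 'Leu', 'Phe', 'Thr']
```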

[Interactive table: all 64 DNA codons, their amino-acid mapping, and per-codon details]

As an example, here is the DNA and corresponding amino-acid sequence for the green fluorescent protein (GFP) from Aequorea victoria, a bio-luminescent jellyfish. This protein is commonly used in microbiology experiments because it lets us measure the amount of protein produced indirectly by measuring the amount of light emitted by the protein.

Note: you can highlight and select parts of the DNA sequence to see the corresponding amino-acid sequence and 3D structure of the resulting protein.

[Interactive viewer: DNA (synthetic coding sequence) · amino-acid sequence (1GFL chain A) · 3D structure (RCSB 1GFL)]

Genetic Mutations

When our cells replicate, they need to copy the whole genome. While this is surprisingly accurate, mistakes do happen, leading to mutations in the DNA sequence.

These mutations have different effects on the resulting amino-acid sequence and therefore the structure of the resulting protein.

In order of least-to-most destructive:

1. Synonymous Mutation

A synonymous mutation changes a single nucleotide such that the codon still codes for the same amino-acid. For example, changing CAC to CAT still maps to Histidine, so the resulting protein looks exactly the same.

2. Missense Mutation

A missense mutation instead changes a single nucleotide such that the codon now maps to a different amino-acid. For example, changing CAC to TAC would produce Tyrosine instead of Histidine. The resulting protein would now have a single different amino-acid in its sequence.

Depending on how different this new amino acid is from the old one, the protein may be more or less affected. Generally though, changing just one amino acid has a pretty small effect on the protein's resulting structure (and therefore function).

3. Frameshift Mutation

Of interest to this post is the frameshift mutation, where nucleotides are inserted or deleted. These generally have a much larger effect on the resulting structure because the reading frame (the specific grouping of three nucleotides) shifts, affecting everything downstream.

For example, the start of our GFP protein currently goes:

GCTTCTAAAGGTGAAGAACTGTTTACT

If we insert a T at index 7, it becomes:

GCTTCTATAAGGTGAAGAACTGTTTACT

Notice how everything after the insertion is re-grouped into different triplets; a single inserted nucleotide can therefore rewrite the rest of the amino-acid sequence.

So everything after the insertion is now corrupted, and usually we will end up running into a stop codon early (as we did here), resulting in a truncated protein.
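This regrouping is easy to reproduce mechanically; a minimal sketch:

```python
def codons(dna: str) -> list[str]:
    """Split a sequence into consecutive 3-nucleotide codons
    (any trailing 1-2 nucleotides are dropped)."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

original = "GCTTCTAAAGGTGAAGAACTGTTTACT"
mutated = original[:7] + "T" + original[7:]  # insert a T at index 7

print(codons(original))
# ['GCT', 'TCT', 'AAA', 'GGT', 'GAA', 'GAA', 'CTG', 'TTT', 'ACT']
print(codons(mutated))
# ['GCT', 'TCT', 'ATA', 'AGG', 'TGA', 'AGA', 'ACT', 'GTT', 'TAC']
```

Every codon from the third onward has changed, and the fifth codon of the mutated sequence is now TGA, a stop codon.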

In this case, while our normal GFP protein glows green, this frameshift mutation would destroy the structure and therefore the function of the protein. A single missense mutation, on the other hand, would probably still result in a protein that glows since it is a much smaller, localized change (although maybe a bit dimmer, or maybe a slightly different color).

Part 2: The Program Machine

The “program machine” takes data as input and produces (observable) program behavior as output.

Data Program Machine Program behavior

Data-Handling Programs

For the purposes of this post, we’ll restrict the scope of programs we consider to those which parse some data – i.e. the same sort of programs we fuzz.

For example, consider libpng, a popular C library for parsing PNG files.

We can set up this program (via a fuzz harness) to take a PNG file as input and attempt to parse it and read the image data.

The PNG format is a fairly simple binary format consisting of a fixed header followed by a series of chunks.

Every PNG file starts with a fixed header:

89 50 4E 47 0D 0A 1A 0A

This is followed by a series of chunks. Each chunk consists of:

  • a 4-byte length
  • a 4-byte type (e.g. IHDR, IDAT, IEND)
  • a variable-length data payload
  • a 4-byte CRC checksum

For example, here is a sample PNG file (seed.png) and a byte-level map of how it is laid out on disk.

[Figure: seed.png rendered as an image, alongside a byte-level hex map of its on-disk layout]

The actual image data is stored in the IDAT chunk. So in order to reach it, the program needs to start at the beginning of the file and iterate over the chunks (reading and skipping each chunk’s length) until it finds the IDAT chunk.
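A minimal sketch of this chunk walk, assuming a well-formed file (the `build_chunk` helper here just constructs a toy input for the demo; it is not part of libpng):

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def build_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Serialize one chunk: length, type, payload, CRC over type+payload."""
    return (struct.pack(">I", len(payload)) + ctype + payload
            + struct.pack(">I", zlib.crc32(ctype + payload)))

def iter_chunks(data: bytes):
    """Walk the chunk list, yielding (type, payload) until IEND."""
    assert data[:8] == PNG_SIGNATURE, "bad signature"
    pos = 8
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 8 + length + 4  # header + payload + CRC trailer
        if ctype == b"IEND":
            break

# A reader must skip every preceding chunk (via its length field) to reach IDAT:
toy = (PNG_SIGNATURE
       + build_chunk(b"IHDR", bytes(13))
       + build_chunk(b"IDAT", b"...compressed pixels...")
       + build_chunk(b"IEND", b""))
print([t for t, _ in iter_chunks(toy)])  # [b'IHDR', b'IDAT', b'IEND']
```

Note how every step of the walk trusts the length field of the current chunk; this is exactly what makes size-field mutations so destructive.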

Data Mutations

During fuzzing, we intentionally mutate the data being handled by the program to try to explore new program states. Different types of mutations have different effects on the resulting program behavior.

Like DNA, we can try to classify these mutations based on their effects.

1. Synonymous Mutation

Some mutations have no effect on the resulting program behavior. For example, this PNG file has four tEXt chunks, which each contain a comment that is stored as metadata.

Mutating the data in these comments has no effect on how our libpng harness operates since this data is never actually parsed.

2. Missense Mutation

In DNA, a missense mutation changes just a single amino acid, leaving the protein structure (and therefore function) largely unchanged.

For a PNG file, analogous changes could be:

  1. changing the color profile or gamma values
  2. resizing a chunk
  3. editing image pixel data
  4. inserting a new tEXt chunk

These sorts of changes are likely to have just a small effect on the resulting program behavior. If the image was valid before, the mutated image is still likely to be valid, and will probably be parsed in mostly the same way.

Note that apart from the first option, these mutations are actually nontrivial to perform and unlikely to happen by chance with a blind fuzzer. Resizing a chunk requires rewriting the length field, editing pixel data requires decompressing and recompressing the stored image data, and inserting a new chunk is only possible at very specific offsets.

3. Frameshift Mutation

In DNA, a frameshift mutation effectively breaks the way the sequence is interpreted, leading to downstream corruption and a large effect on the resulting protein structure and function.

In a PNG file, we see similar types of effects. For example, what happens if we edit the length of the gAMA chunk, increasing it by 1 without actually inserting another byte?

[Hex view: the mutated file, showing where the parser becomes misaligned]

Here everything before the gAMA chunk is parsed normally. But right after it, the parser is misaligned: it reads the next chunk type as RGB\x00 instead of sRGB with a much larger length of 371. The chunk after that is then completely corrupted, since this size field points into the compressed image data of the original file.

In this case, we see that modifying a size field has a large cascading effect on the resulting parser state, eventually leading to early termination (just like in DNA).

Frameshifts are a problem for fuzzing

One of the key problems with fuzzing is that these sorts of “frameshift” mutations are super common. Without knowledge of how size fields in a data format are organized and what data they point to, it’s difficult to properly detect and handle them.

We could try to build a PNG-specific mutation engine or some sort of more general grammar fuzzer (as others have done), but the challenge is that this approach struggles to generalize for a few reasons:

  1. Grammars are often overly prescriptive. In the real world, it turns out that data formats are often program-specific. For example, libpng has certain rules about which chunks are resizable that differ from image-io, a Rust-based image parsing library.
  2. In some cases, programs parse many different data formats. For example bloaty, a tool for parsing object binaries, has support for parsing ELF, PE, Mach-O, and more, and can do so through a unified interface.
  3. Fuzz harness inputs often themselves form ad-hoc data formats. What if our fuzz harness doesn’t take a raw PNG directly, but instead it takes a JSON object that contains a PNG string? How do we collate these different formats?
  4. Finally, during fuzzing we’re often handling lots of semi-valid inputs, and we want to be able to perform structure-aware mutations in this space even when the full input itself might not technically parse for a given grammar. Thus our data specification needs to actually be context dependent for both the program and the specific input being fuzzed.

Part 3: The Double-Mutant Experiment

Let’s revisit DNA for a moment to see how this problem was solved in the 1960s.

In particular, scientists at the time (Crick, Brenner, et al.) were also interested in this sort of structure inference problem. They knew that DNA consisted of nucleotides in a sequence but they were not sure how exactly the sequence coded for an amino-acid sequence.

Was it 3 nucleotides per amino-acid? Or 4? Or 2? Did the coding frames overlap, or was each codon independent?

They could only run indirect experiments: inserting or deleting nucleotides at certain positions in a gene, and then observing whether the resulting protein was still functional.

In particular, they were working with a protein from bacteriophage T4, a virus that infects E. coli bacteria. If the protein is functional, the phage proliferates (visible plaque); if the protein is non-functional, the phage is unable to grow (no plaque). This yields a macroscopic observable effect that they could use to indirectly infer the underlying coding structure.

The Experiment

The key insight is that they could perform a two-part test to identify the underlying structure.

1: Destructive Mutation

First, they performed a destructive mutation by inserting or removing a single nucleotide. In most cases, this caused a frameshift, resulting in a non-functional protein.

2: Restorative Mutation

Second, they tried to identify a second mutation that could restore the functionality of the protein.

Interestingly, they found that if a protein was non-functional due to a nucleotide insertion, it was often possible to restore functionality by performing a nearby nucleotide deletion.

Conceptually: this second mutation restores the original reading frame. As long as it happens near the original insertion, the likelihood of hitting a stop codon during the shifted reading frame is minimized.

Similarly, any combination of insertions and deletions with a net length change of zero modulo 3 preserves the downstream reading frame, and therefore only risks affecting the amino-acids translated from the shifted portion of the protein. (A net change of 1 or 2 modulo 3 leaves the frame shifted.)
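The frame arithmetic is small enough to write down directly; a sketch:

```python
def downstream_frame_preserved(indels: list[int]) -> bool:
    """indels: +n for an n-nucleotide insertion, -n for a deletion.
    The reading frame downstream of the last indel is intact iff the
    net length change is a multiple of 3."""
    return sum(indels) % 3 == 0

print(downstream_frame_preserved([+1]))          # False: classic frameshift
print(downstream_frame_preserved([+1, -1]))      # True: nearby deletion restores the frame
print(downstream_frame_preserved([+1, +1, +1]))  # True: three insertions restore the frame
print(downstream_frame_preserved([+2]))          # False: frame still shifted
```

The third case is the famous one: three single-nucleotide insertions restoring function is what pinned the codon size at three.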

If you’re interested in the details, check out the original 1961 paper or the associated Wikipedia article.

Part 4: FrameShift Algorithm

So how does this relate to fuzzing?

Well, the idea is that we can use this same sort of indirect double-mutant experiment to identify structure in data formats that are parsed by programs.

In our case, we don’t want to do this just once, but many times during the course of a fuzz campaign to learn input- and program-specific structures.

In our work, we restrict the sorts of structures we consider to the identification of offset and size fields (which we generalize together as “relation fields”). Empirically, this was the only structure that conferred a net benefit to fuzzing; see the comments at the end for more ideas.

For example, in our PNG format example, we’re interested in identifying all of the chunk size fields and specifically what regions they point to. If we learn this knowledge, then we can augment fuzzer mutations to preserve this information during mutation.

Measuring Destructiveness and Restoration

In order to perform the experiment, however, we need a way to measure how destructive or restorative a given mutation is.

We could use something like whether the PNG is “valid” or “invalid” as a proxy, but this doesn’t help when we’re trying to learn structure from partially valid inputs, and validity is program-specific, so it doesn’t generalize well.

Instead, we use a coverage-based heuristic as a proxy. A given input takes some path through the program and covers some set of basic blocks. A destructive frameshift mutation will likely lead to early termination, and thus fail to cover some of these previously covered basic blocks. I.e., a corrupted PNG may take an error path and no longer reach the part of the program that parses the image data.

A restorative mutation, on the other hand, should restore some of this original coverage. It may take a slightly different path, but it should still reach roughly the same set of basic blocks as the original input.
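A sketch of how destructive and restorative mutations might be scored from coverage sets. The 10%/90% thresholds and the function names are illustrative assumptions, not the values used in the actual implementation:

```python
def is_destructive(base_cov: set, mutated_cov: set) -> bool:
    """Destructive: the mutation loses a meaningful fraction of the basic
    blocks the original input covered (10% is an illustrative threshold)."""
    lost = base_cov - mutated_cov
    return len(lost) >= 0.1 * len(base_cov)

def is_restorative(base_cov: set, destructive_cov: set, restored_cov: set) -> bool:
    """Restorative: the second mutation wins back most of the coverage
    the destructive mutation lost (90% is an illustrative threshold)."""
    lost = base_cov - destructive_cov
    return len(lost & restored_cov) >= 0.9 * len(lost)

# Toy coverage sets for the gAMA example (block names are illustrative):
base = {"read_sig", "handle_IHDR", "handle_gAMA", "handle_sRGB", "handle_IDAT"}
after_len_bump = {"read_sig", "handle_IHDR", "handle_gAMA"}  # parser bails early
after_insertion = base  # inserting a byte re-synchronizes the parser

print(is_destructive(base, after_len_bump))                   # True
print(is_restorative(base, after_len_bump, after_insertion))  # True
```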

Let’s consider our original PNG file again. It consists of a signature followed by 11 chunks.

Now when we mutate the gAMA chunk to have a length of 5, the parser is misaligned just as before: it reads the next chunk type as RGB\x00 instead of sRGB, with a much larger length of 371, and the chunk after that is completely corrupted, since its size field points into the compressed image data of the original file.

We can easily measure the destructiveness of this mutation by observing that functions like handle_sRGB and handle_cHRM are no longer called, since the parser fails before reaching them.

However, if we then actually insert a single byte of data into the gAMA chunk, we restore the original parsing behavior: the chunk size is now valid, and the parser can continue parsing the rest of the file normally.

Thus, the insertion restores coverage of the functions that were previously skipped, and we can detect it as a restorative mutation.

Demo

To demonstrate the FrameShift algorithm in practice, I’ve built a small demo that simulates how this algorithm would work on our PNG file.

In particular here we’re making a few assumptions:

  1. Instead of actually running a PNG parser and measuring coverage, we’re just simulating parsing all the chunks and interpreting the number of unique chunks parsed as the “coverage” of the parser.
  2. I’ve constrained it to only search for big-endian 32-bit size fields for the sake of visualization simplicity.

Click the button to run the demo. The hex view will show the raw bytes of the PNG file and which bytes, when modified, lead to a coverage loss. Each prospective field is also displayed in a list view on the right.

Notice that size fields near the start result in a more destructive effect since there are more bytes downstream that are affected by the mutation.

For every prospective field, the algorithm then searches for places it can insert bytes and restore coverage. The graph below the hex viewer shows, for each prospective field, which insertion points can be used to restore coverage (shown in green).

You can also click on a field in the list view to show where valid insertion points are for that field.

Critically, note that fields which are not actually real size fields (such as 0x0022) result in no valid insertion points, therefore we would eliminate these during the search.

Valid size fields, such as the one at offset 0x0008, can be restored by inserting bytes anywhere within the data region for that field.

[Interactive demo: FrameShift scan on seed.png. The hex view heat-map shows the destructive loss (drop in unique parsed chunks) for field mutations covering each byte; a list view shows the identified big-endian 4-byte candidate fields; a restoration overview graph plots valid insertion points (x: insertion point, y: field start offset).]

Making it Practical

While this kind of test works in theory, there is a lot more that actually needs to be considered to make it practical. Note that the above demo probably took around 15 seconds to run, for just a small PNG file.

In our optimized implementation, the above file can be analyzed in just 313 milliseconds, using real libpng coverage.

Specifically, when we run this algorithm, we have many choices for which destructive mutations to try and where to attempt restoring the original coverage. In general, the search space grows quadratically with the size of the input, and a naive implementation would eat up the entire fuzzing budget.

Furthermore, in many cases we need to do this search repeatedly. Some fields are only discoverable once we are able to identify and preserve other fields that they depend on.

In our paper for example, we show that this algorithm can learn indirect size fields (such as the offset/size of ELF chunks) by performing this search in a loop and it can even handle nested size fields (common in TLV type formats).

I’ll briefly touch on some of these considerations here, but please refer to our paper for full details.

1. Search Heuristics

The main optimization is that we very selectively choose which mutations to try. First, we scan the input for N-byte sequences (N = 1, 2, 4, or 8) that could be decoded as a big- or little-endian integer smaller than the file size.

For each position, we try incrementing the size by between 32 to 255 (forcing a byte-wrap in most cases) and checking if this results in coverage loss.

If so, this is a candidate for a length/offset field. We then try to pick insertion points where we can insert the corresponding number of bytes and check if this results in coverage restoration. We use a fairly restricted set of possible insertion points to avoid the combinatorial explosion of possible mutations.

If the mutation is restorative, we have (likely) found a valid relationship between a length/offset field and the span it describes.
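A simplified sketch of this search loop. The `cover` oracle here is a hypothetical stand-in for executing the instrumented target; the bump delta and insertion-point strategy are illustrative, and the real implementation is far more heavily optimized:

```python
def candidate_fields(data: bytes):
    """Scan for positions that decode (big- or little-endian, N = 1, 2, 4,
    or 8 bytes) to an integer no larger than the file size; these are
    candidate size/offset fields."""
    found = []
    for width in (1, 2, 4, 8):
        for pos in range(len(data) - width + 1):
            for endian in ("big", "little"):
                if int.from_bytes(data[pos:pos + width], endian) <= len(data):
                    found.append((pos, width, endian))
    return found

def confirm_field(data: bytes, pos: int, width: int, endian: str, cover):
    """Double-mutant test for one candidate. cover(data) -> set is a
    hypothetical coverage oracle. Bump the candidate value (destructive),
    then search for an insertion point that wins the lost coverage back
    (restorative)."""
    base = cover(data)
    value = int.from_bytes(data[pos:pos + width], endian)
    delta = 32  # illustrative; the real search tries a range of bumps
    bumped_value = (value + delta) % (256 ** width)
    bumped = (data[:pos] + bumped_value.to_bytes(width, endian)
              + data[pos + width:])
    lost = base - cover(bumped)
    if not lost:
        return None  # not destructive: probably not a real size field
    for insert_at in range(pos + width, len(data) + 1):
        repaired = bumped[:insert_at] + b"\x00" * delta + bumped[insert_at:]
        if lost <= cover(repaired):
            return (pos, insert_at)  # field confirmed; span covers insert_at
    return None
```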

2. Structure-Aware Mutations

After learning a set of relation fields, we hook the byte-level mutators for the fuzzer to preserve these structures.

Conceptually, if we insert 5 bytes into some part of the input, we may need to increase the size or offset of some field by 5 to keep the frame synchronized and avoid a frameshift. In a sense, we are turning what would be frameshift mutations into missense mutations.
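Conceptually, the hooked mutator might look like this sketch. The `RelationField` representation is illustrative, and it assumes the size field itself sits before the insertion point (a real implementation must also shift the positions of other learned fields):

```python
import struct
from dataclasses import dataclass

@dataclass
class RelationField:
    pos: int     # offset of the size/offset field itself
    width: int   # field width in bytes
    endian: str  # "big" or "little"
    span: range  # the region of bytes this field measures

def insert_preserving(data: bytes, at: int, blob: bytes,
                      fields: list) -> bytes:
    """Insert blob at offset at, then bump every learned size field whose
    span contains the insertion point, turning what would be a frameshift
    into a missense mutation."""
    out = bytearray(data[:at] + blob + data[at:])
    for f in fields:
        if at in f.span:  # assumes f.pos < at, so f.pos did not shift
            value = int.from_bytes(out[f.pos:f.pos + f.width], f.endian)
            out[f.pos:f.pos + f.width] = (value + len(blob)).to_bytes(
                f.width, f.endian)
    return bytes(out)

# Toy layout: a 4-byte big-endian length at offset 0 describing bytes 4..8.
data = struct.pack(">I", 4) + b"AAAA" + b"TAIL"
f = RelationField(pos=0, width=4, endian="big", span=range(4, 8))
print(insert_preserving(data, at=6, blob=b"XX", fields=[f]))
# b'\x00\x00\x00\x06AAXXAATAIL'  -- length field bumped from 4 to 6
```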

3. Overhead Bounds

During our initial evaluation, we found that while this algorithm was effective in most cases (nearly doubling code coverage on some targets), in the worst case it could be adversarial due to imposing a large overhead on the fuzzer.

During integration into AFL++, we imposed an overhead budget such that the fuzzer will only perform FrameShift experiments if the amount of time spent on analysis is less than a certain percentage of the total fuzzing time.

A default overhead of 10% was found to be generally effective, resulting in a net neutral or positive effect on every target, and this is currently what is enabled by default in AFL++ 4.40.

Part 5: Conclusion

Our implementation of FrameShift runs by default in AFL++ 4.40.

We also have a LibAFL Implementation which is compatible with any byte-level mutator and any target that produces coverage (we ran it on Python and Rust targets in our evaluation!).

If you’re interested in more details, check out our paper which was accepted to ICSE 2026 and will be presented at the conference in April.

FrameShift Resistant Formats

Interestingly, we found a continuation of the DNA-fuzzing metaphor while evaluating this approach on certain targets.

Most DNA does not actually code for a protein; rather, it acts as metadata, encoding when certain genes should be expressed and where to find them. (It is actually functional enough to be repurposed and engineered into genetic circuits, which is super cool.)

Genes therefore have certain markers called “promoters” near the start which act as beacons for the DNA transcription machinery to locate the start of the coding region.

Mutations such as insertions and deletions outside the coding region (i.e. before a promoter) don’t have the same kind of frame-shifting effect as they would inside the coding region since there is actually no notion of a “reading frame” here.

This is pretty important because it adds a lot of resiliency for genetic mutation in between genes; the machinery is robust to any changes here as long as the promoter and other important markers are still there.

For programs that operate on data which is expected to be transmitted over lossy mediums (for example video and audio codecs), it is reasonable to expect that packets might be lost during flight. In order to avoid corrupting the entire stream, these formats are designed to include markers (much like promoters in DNA) which act as beacons for the parser and allow it to re-synchronize after a packet loss.

As a result, these data formats are already “frameshift resistant” by design, and the double-mutant experiment fails to find any structure, specifically because mutations that might be normally destructive (such as inserting or deleting bytes) don’t actually prevent the parser from re-synchronizing.

Identifying Other Structures

During our research, we tried using this kind of double-mutant experiment to identify other structures in data formats. For example, you can use it to find regions of data that are compressed or encoded: a destructive mutation edits the raw bytes directly, while a restorative mutation deserializes the region, applies an edit, and then reserializes it.

You can also use it to identify checksum-like fields. In the libpng example, we typically disable checksum validation during fuzzing. But in practice, you can validate that a given checksum is required with the same experiment.

In practice though, everything else we tried was much too expensive to confer any sort of improvement to fuzzing. The reality is that there is a fairly small window for structure inference overhead during fuzzing, especially in standard evaluations where you’re interested in optimizing coverage over a 24-48 hour horizon.