GMail detects duplication

GMail’s auto-quoting feature had me confused for a while, and I thought it was just broken. Parsing MIME messages can be tricky, so I didn’t really mind, but bugs are still annoying. For example, take a look at the stray line below, the source of our mutual confusion.

GMail quoting example from the XP list

It’s marked in purple, which indicates it’s quoted from an earlier message, but there are no quote chars (“>”) like in the quote block above, so it looks a bit random. After some head-scratching and scrutiny, I think I know what’s going on.

It appears that GMail, for every line in every message, scans backwards through the conversation, and if a duplicate line is found, it’s considered a quote and marked as such.

At first I thought leading quote chars were a magic give-away, but GMail actually notices modifications in quoted lines, so things like snipped citations are rendered as content, not quotes. The duplicate detection seems to be the only rule at work.
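To make the guess concrete, here is a minimal Python sketch of the rule as I imagine it. This is not GMail's actual code, just my reading of the behaviour. The names `normalize` and `mark_quotes` are made up for illustration, and so is the assumption that leading “>” markers and surrounding whitespace are ignored when comparing lines. Instead of literally scanning backwards, it keeps a set of every line seen in earlier messages, which should give the same result.

```python
def normalize(line: str) -> str:
    # Assumption: leading ">" markers and surrounding whitespace
    # do not count when two lines are compared.
    return line.lstrip("> \t").strip()


def mark_quotes(conversation: list[str]) -> list[list[tuple[str, bool]]]:
    """Return, per message, each line paired with a 'looks quoted' flag.

    `conversation` holds the message bodies, oldest first.
    """
    seen: set[str] = set()  # normalized lines from all earlier messages
    result = []
    for message in conversation:
        lines = message.splitlines()
        marked = []
        for line in lines:
            key = normalize(line)
            # Blank lines are never treated as quotes.
            marked.append((line, bool(key) and key in seen))
        # Register this message's lines only after scanning it, so a line
        # repeated inside one message is not flagged as a quote of itself.
        seen.update(key for key in map(normalize, lines) if key)
        result.append(marked)
    return result
```

Fed a two-message thread where the reply repeats a line from the first message without any “>” in front, the repeated line is flagged as quoted anyway, which is exactly the stray purple line in the screenshot. A quoted line that has been edited no longer matches anything, so it comes out as ordinary content.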

I wonder if the same idea could be used to detect duplication in source code. Duplication comes in many guises, but simple line-by-line equality would be a nice first step to detect in a code base, on either a per-file or per-project basis. It would be interesting to see duplicate lines marked in Visual Studio, for example.
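As a rough first stab, the line-by-line idea might look something like the Python sketch below. Everything in it is an assumption on my part: the function name `find_duplicate_lines`, the choice to ignore indentation, and the `min_length` cutoff, which is there so that trivial lines like a lone closing brace don't drown out everything else.

```python
from collections import defaultdict


def find_duplicate_lines(
    sources: dict[str, str], min_length: int = 10
) -> dict[str, list[tuple[str, int]]]:
    """Map each duplicated line to every (file name, line number) where it occurs.

    `sources` maps a file name to that file's full text.
    """
    locations: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            key = line.strip()  # assumption: indentation is not significant
            if len(key) >= min_length:  # skip short, noisy lines
                locations[key].append((name, number))
    # Keep only the lines that occur more than once.
    return {line: places for line, places in locations.items() if len(places) > 1}
```

Passing in a single file gives the per-file view, and passing in every file of a project gives the per-project view. An editor plugin could then highlight the reported line numbers, much like GMail's purple.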