Overview -------- spamsum is a tool for generating and testing signatures on files. The signature is designed to be particularly suitable for producing a result that can be used to compare two emails and see if they are 'similar'. This can provide the core of a SPAM detection system. The algorithms in spamsum are in two parts. The first part generates a signature which is encoded as a string of ascii characters less than 72 characters long. The second part takes a new signature and a database of existing signatures (actually just a text file with one signature per line) and finds the existing signature that best matches the new signature. A match result in the range of 0 to 100 is generated, where 100 is a perfect match and 0 is a complete mismatch. Signature Algorithm ------------------- The signature algorithm in spamsum has a number of interesting properties that make is especially suitable for SPAM detection. * non-propogation In most hash algorithms a change in any part of a plaintext will either change the resulting hash completely or will change all parts of the hash after the part corresponding with the changed plaintext. In the spamsum algorithm only the part of the spamsum signature that corresponds linearly with the changed part of the plaintext will be changed. This means that small changes in any part of the plaintext will leave most of the signature the same. This is essential for SPAM detection as it is common for varients of the same SPAM to have small changes in their body and we need to ensure that the matching algorithm can cope with these changes. * alignment robustness Most hash algorithms are very alignment sensitive. If you shift the plaintext by a byte (say by inserting a character at the start) then a completely different hash is generated. The spamsum algorithm is robust to alignment changes, and will automatically re-align the resulting signature after insertions or deletions. This works in combination with the non-propogation property to make spamsum suitable for telling if two emails are 'similar'. The core of the spamsum algorithm is a rolling hash similar to the rolling hash used in 'rsync'. The rolling hash is used to produce a series of 'reset points' in the plaintext that depend only on the immediate context (with a default context width of seven characters) and not on the earlier or later parts of the plaintext. A stronger hash based on the FNV algorithm is then used to produce hash values of the areas between two reset points. The resulting signature comes from the concatenation of a single character from the FNV hash per reset point. The frequency of the reset points determines how many characters in the plaintext will be used for each character of output in the signature. At startup spamsum scans the plaintext to determine how many valid input characters are in the plaintext (whitespace is ignored). The algorithm then estimates the reset frequency needed to produce a signature of length 64 and starts producing the signature. If after the signature is produced the result is less than a third of the desired length then the reset frequency is adjusted and the signature re-generated. Similarity Testing ------------------ Once a set of signatures has been generated you need to be able to take a new plaintext and see if it matches one of the signatures. The way this is done is to generate a spamsum signature of the new plaintext then compute a distance measure between each of the existing signatures and the new signature. The distance measure that spamsum uses is based on the 'string edit distance'. The string edit distance is a measure of how many edit operations are required to take one of the signatures and turn it into the other. In spamsum the 'insert' and 'delete' edit operations are given a weight of 1 while substitution is given a weight of 3 and transposition is given a weight of 5. The resulting string edit distance is then scaled to produce a 'score' in the range 0-100. A score of 100 indicates a perfect match and a score of 0 indicates a complete mismatch. If the two signatures used a different 'reset frequency' (also known as block_size) then the score is automatically set as 0. The score is weighted so that a value of 50 is a reasonable threshold to use for a 'good match'. Dual hashes ----------- A significant problem with the above algorithm is the sensitivity to the chosen hash strength of the rolling hash. The initial implementation used a single hash strength chosen based on the file size and rounded to a power of 2. This works, but it means that if the two files being compared cross over a boundary then they will not be able to be compared. To reduce this problem the current implementation chooses two different hash strengths and generates two hashes for each file. This means that the two files will have to have very different lengths for their respected spamsum signatures not to share a common hash strength. Running spamsum --------------- Please run 'spamsum -h' for help with command line operation. In spam-checking mode (-c or -C) spamsum returns 0 (success) for messages that score at least the threshold and are probably spam, and 1 for other errors, and 2 for non-spam messages. Using "success" to indicate a match is consistent with Vipul's Razor, and means that internal errors should not cause messages to be discarded or tagged. Note that the "action" option (-c or -C) must be *last* on the command line. Procmail recipes ---------------- spamsum works well from procmail(1). It should be called as a message filter, using the 'c' flag to feed it a copy of the message, and the 'W' flag to examine its exit code. The following rule should have the 'e' (error) flag if it will be used for non-spam message, and the 'e' (and) flag to be used for spam messages. For example, to file spam in a separate mailbox: :0 Wc | spamsum -H -T50 -d /home/mbp/spamsum/sigs.txt -C - :0 a: /home/mbp/Mail/spamsum-caught Infrastructure -------------- spamsum is useless without a good quality database of signatures for known spam. I am hoping that the spamsum algorithm will be incorportated into a online system for capturing known SPAM (such as razor). Author ------ spamsum was written by Andrew Tridgell