Mining Historical Archives for Near-Duplicate Figures

Supporting Material
• ICDM2011 paper and presentation.
• Slide to explain overall ideas about the paper.
• Slide to show all figures in higher resolution.
• Slide to theoretically analyze the upper bound of false positive.

FAQ Section
There are some questions which we want to clarify.

Q:  Does our algorithm guarantee to return an exact motif?
A:  The answer is no. We cannot guarantee that our algorithm will return the exact motif but, in practical, two windows of the motif is usually hashed in to the same bucket. We do multiple time of random hashing and we union the results; if the motif hash into the same bucket once (very high chance), bingo! In that case, our algorithm will return the correct answer. Examples of approximate motifs from real dataset are shown below and you can download more here.

Q:  If user want to be guaranteed that an exact motif will be discovered with probability 99%, what is the number of false positive?
A:  Our analysis shows that if the motif distance is known, we can bound the number of false positive; this number is usually small although we do random projection many times. You can see more in our theoretical analysis slide For quick review, please see figure below.  Note that we can also allow our algorithm set parameters automatically to obtain this upper bound by using equation in our analysis.

Q:  What is more related area for this work between data mining and image processing?
A:   It is absolutely in data mining area. It is true that the input of our algorithm are documents which may contain a lot of images. The image processing techniques are required to do preprocessing, and that's it. After converting the cleaned documents to ternary (or binary) matrix, all stuffs after that are in data mining area. We do 2-step hashing which contains downsampling part and random removing part. We also propose an easy but non-trivial method to locate the potential windows which can reduce the search space dramatically.
The main objective are to discover motifs on large documents; for example, at the end of this page, we show examples motifs across two heraldry books of size 221 pages, and 450 pages. To best of our knowledge, most work in image processing area do similarity search with many fine-tuning techniques and it cannot apply to large documents. Then, our method is more related to data mining area than image processing area.

Q:  Is our algorithm scale invariant?
A:
The short answer is no. However, because of downsampling technique, motifs which contain two windows in small different scale, say 10-15%, can be discovered, as you can see the tigers in previous question are in different scale.

Q:  Is our algorithm rotation invariant?
A:
No, but it is not much hard to change the hash function to support scale invariance. However, before that, a new distance measure which supports a rotation invariance is also required.

Motif Join with User Interface
Examples of our user interface are shown in figures below.

• Slide to explain how to use the user interface steps by steps.
• Matlab source code to run our algorithm using user interface.

Motif Join (Command Line)
User interface can help us to discover, browse, understand the approximate motifs. However, it is suitable to use UI for small documents. If input documents are larger than 8 pages, we recommended you to use command-line version which is much faster because many drawing steps are excluded.
• Slide to explain how to run our algorithm.
• Matlab source code to run our algorithm in command-line version.
• You can run our algorithm on toy examples from previous section, real documents which you have, or documents created from our data generator

Dataset and Data Generator
We inspired an idea from 14-segment display. Each character contains 14 segments which be turned on/off randomly, and there are 100 characters in each pages of our artificial book..

• Slide to explain how to use the data generator..
• Matlab source code to generate an artificial book..
• Set of images contains each segment of the 14-segment display.
• Slide to show motifs of different masking ratio.

Example of Motifs: Peer of Scotland
The figure below shows the motifs in the book "Peer of Scotland: Genealogical and Historical accounts of all the Peers of that Ancient Kingdom" by J. Almon from Internet Archive.
• Slide to show example motifs of this datasets discovered by our algorithm.

Example of Motifs: Heraldry
The same shield in three different books: "Leopards of England" by E. E. Dorling in year 1913, "Handbook of Heraldry" by Chatto and Windus in year 1893, and "A Manual of Heraldry, Historical and Popular" by Charles Boutell in year 1863.
• More motifs across heraldry books can be seen here.

The figures below shows motifs across four heraldry books which are "Scottish Heraldry made easy" by G. Harvey Johnston in 1912, "Peeps at Heraldry" by Phoebe Allen in 1912, "English Heraldic Book-Stamps" by Cyril Davenport in 1909, and "British Heraldry" by Methuen in 1912. We discover motifs from two books at a time.

Contact Information: rakthant "at" cs "dot" ucr "dot" edu