Mining Historical Archives for Near-Duplicate Figures



Supporting Material


FAQ Section 
There are some questions which we want to clarify.

Q:  Does our algorithm guarantee to return an exact motif?
A:  The answer is no. We cannot guarantee that our algorithm will return the exact motif but, in practical, two windows of the motif is usually hashed in to the same bucket. We do multiple time of random hashing and we union the results; if the motif hash into the same bucket once (very high chance), bingo! In that case, our algorithm will return the correct answer. Examples of approximate motifs from real dataset are shown below and you can download more here.
Q1-thai Q1-arabic Q1-pic1 Q1-pic2 Q1-pic 
 
Q:  If user want to be guaranteed that an exact motif will be discovered with probability 99%, what is the number of false positive?
A:  Our analysis shows that if the motif distance is known, we can bound the number of false positive; this number is usually small although we do random projection many times. You can see more in our theoretical analysis slide For quick review, please see figure below.  Note that we can also allow our algorithm set parameters automatically to obtain this upper bound by using equation in our analysis.
Q1-plot1 Q2-plot2 Q3-plot

Q:  What is more related area for this work between data mining and image processing?
A:   It is absolutely in data mining area. It is true that the input of our algorithm are documents which may contain a lot of images. The image processing techniques are required to do preprocessing, and that's it. After converting the cleaned documents to ternary (or binary) matrix, all stuffs after that are in data mining area. We do 2-step hashing which contains downsampling part and random removing part. We also propose an easy but non-trivial method to locate the potential windows which can reduce the search space dramatically.
       The main objective are to discover motifs on large documents; for example, at the end of this page, we show examples motifs across two heraldry books of size 221 pages, and 450 pages. To best of our knowledge, most work in image processing area do similarity search with many fine-tuning techniques and it cannot apply to large documents. Then, our method is more related to data mining area than image processing area.

Q:  Is our algorithm scale invariant?
A:  
The short answer is no. However, because of downsampling technique, motifs which contain two windows in small different scale, say 10-15%, can be discovered, as you can see the tigers in previous question are in different scale.

Q:  Is our algorithm rotation invariant?
A:  
No, but it is not much hard to change the hash function to support scale invariance. However, before that, a new distance measure which supports a rotation invariance is also required.



Motif Join with User Interface
Examples of our user interface are shown in figures below.

UI-1       UI-2


Motif Join (Command Line)
User interface can help us to discover, browse, understand the approximate motifs. However, it is suitable to use UI for small documents. If input documents are larger than 8 pages, we recommended you to use command-line version which is much faster because many drawing steps are excluded.


Dataset and Data Generator
We inspired an idea from 14-segment display. Each character contains 14 segments which be turned on/off randomly, and there are 100 characters in each pages of our artificial book..

14-segment 14seg_00 14seg_normal
14seg_Gaussian noise var 0.15 14seg_distortion


Example of Motifs: Peer of Scotland
The figure below shows the motifs in the book "Peer of Scotland: Genealogical and Historical accounts of all the Peers of that Ancient Kingdom" by J. Almon from Internet Archive.
Book: Peer of Scotland Book: Peer of Scotland

Example of Motifs: Heraldry
The same shield in three different books: "Leopards of England" by E. E. Dorling in year 1913, "Handbook of Heraldry" by Chatto and Windus in year 1893, and "A Manual of Heraldry, Historical and Popular" by Charles Boutell in year 1863.
Shield_1 Shield_2 Shield_3


The figures below shows motifs across four heraldry books which are "Scottish Heraldry made easy" by G. Harvey Johnston in 1912, "Peeps at Heraldry" by Phoebe Allen in 1912, "English Heraldic Book-Stamps" by Cyril Davenport in 1909, and "British Heraldry" by Methuen in 1912. We discover motifs from two books at a time.
Book: Scottish Heraldry
Icon Book 1
Book: Peeps at Heraldry
Icon Book 2
English Heraldic Book ...
Icon Book 3
Book: British Heraldry
Icon Book 4




Contact Information: rakthant "at" cs "dot" ucr "dot" edu