Pattern Recognition for Finding Similarity of Webpages

International Journal of Computer & Organization Trends (IJCOT)          
 
© 2013 by IJCOT Journal
Volume-3 Issue-2                          
Year of Publication: 2013
Authors: N. Pughazendi, G. Pattusamy

Citation

N. Pughazendi, G. Pattusamy, "Pattern Recognition for Finding Similarity of Webpages". International Journal of Computer & Organization Trends (IJCOT), V3(2):58-62, Mar - Apr 2013, ISSN: 2249-2593, www.ijcotjournal.org. Published by Seventh Sense Research Group.

Abstract

We propose a functional technique for identifying similar Web pages based on measuring tree similarity. In this paper we describe an experiment with two methods for evaluating the similarity of Web pages. The results of these methods can be used in different ways to reorder and cluster a set of Web pages. Both methods belong to the field of Web content mining. The first method focuses purely on the similarity of Web pages: it segments the pages and compares their layouts using image processing and graph matching. The second is based on detecting objects that reflect the user's point of view of the Web page; the similarity of Web pages is then measured as the match between objects on the analyzed pages. The key idea behind the approach is to transform each Web page into a compressed, normalized tree that effectively represents its visual structure.
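
The tree-based idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the compression or normalization rules, so the rule used here (collapsing adjacent sibling subtrees of identical shape) is an assumption, and the distance is a simplified top-down edit distance in which relabeling a node costs 1 and inserting or deleting a subtree costs its size. All names (Node, TreeBuilder, compress, distance, similarity) are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def signature(node):
    """Hashable description of a subtree's shape."""
    return (node.tag, tuple(signature(c) for c in node.children))


def compress(node):
    """Assumed normalization: keep one copy of each run of identical siblings."""
    out = Node(node.tag)
    previous = None
    for child in node.children:
        compact = compress(child)
        sig = signature(compact)
        if sig != previous:
            out.children.append(compact)
            previous = sig
    return out


def size(node):
    return 1 + sum(size(c) for c in node.children)


def distance(a, b):
    """Simplified top-down tree edit distance between two ordered trees."""
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1] + distance(a.children[i - 1],
                                           b.children[j - 1]),
            )
    return (0 if a.tag == b.tag else 1) + d[m][n]


def page_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return compress(builder.root)


def similarity(html_a, html_b):
    """Score in (0, 1]; 1.0 means the compressed trees are identical."""
    ta, tb = page_tree(html_a), page_tree(html_b)
    return 1.0 - distance(ta, tb) / (size(ta) + size(tb))
```

Under these assumptions, two pages that differ only in the number of repeated list items compress to the same tree and score 1.0, while a page built around a table instead of a list scores lower. A clustering step could then group pages by thresholding this score.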

Keywords

Tree structure, Clustering, Web comparison, Edit distance.