DEVISING A ‘HYBRID APPROACH’ FRAMEWORK TO ENHANCE THE EFFICACY OF EXTRACTING TARGETED WEB CONTENT

Jahnvi Gupta

Volume 3, Issue 1 2019

Page: 36-42

Abstract

The World Wide Web is a rich source of voluminous and heterogeneous data, and it keeps growing in size and complexity. Many Web pages are unstructured or semi-structured, so they contain noisy data such as headers, footers, advertisements, and links. This noisy data makes extraction of Web content difficult. Extracting the main content from Web pages is a preprocessing step for Web information systems. Many of the methods proposed for Web content extraction rely on either automatic extraction or hand-crafted rule generation. A hybrid approach is proposed to extract the main content from Web pages. An HTML Web page is converted to a DOM tree, features are extracted from it, and rules are generated from the extracted features. Decision tree classification and Naive Bayes classification are the machine learning methods used for rule generation.
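
As a minimal sketch of the DOM-and-features step described in the abstract, the Python below walks an HTML page with the standard library's html.parser, records text length and link density for each block element, and applies a hand-written rule. The set of block tags, the feature names, the thresholds, and the functions extract_blocks and is_main_content are illustrative assumptions, not the paper's method. In the proposed approach the rules would be generated by decision tree and Naive Bayes classifiers trained on labelled blocks rather than fixed by hand.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags to treat as candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section", "header", "footer", "nav"}


class BlockFeatureParser(HTMLParser):
    """Collects simple features for each block-level element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open block elements (a path in the DOM tree)
        self.blocks = []   # closed blocks with their features, in closing order
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.stack.append(
                {"tag": tag, "depth": len(self.stack), "text_len": 0, "link_len": 0}
            )

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            text_len = block["text_len"]
            block["link_density"] = block["link_len"] / text_len if text_len else 0.0
            self.blocks.append(block)

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Attribute text to the innermost open block only.
            self.stack[-1]["text_len"] += n
            if self.in_link:
                self.stack[-1]["link_len"] += n


def extract_blocks(html):
    """Parse HTML and return one feature dict per block element."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def is_main_content(block, min_text_len=40, max_link_density=0.5):
    """Hand-picked stand-in for a learned rule: long text, few links."""
    return block["text_len"] >= min_text_len and block["link_density"] < max_link_density


SAMPLE_HTML = """
<html><body>
<div><a href="/">Home</a> <a href="/about">About</a></div>
<p>The hybrid approach converts a page to a DOM tree and scores each block by its features.</p>
<div><a href="/ads">Buy now</a></div>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_blocks(SAMPLE_HTML):
        print(block["tag"], block["text_len"], block["link_density"], is_main_content(block))
```

On the sample page, the navigation and advertisement blocks consist entirely of link text and are rejected, while the paragraph is kept. The same feature dictionaries could serve as training rows for a classifier once blocks are labelled as content or noise.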
