Data Extraction from Tables on Web Pages Using a Document Object Model Tree

Memen Akbar, Cici Patmala, Dini Nurmalasari

Abstract


Data on web pages is available in various formats, including tables. As the number of web pages grows, so does the need to extract data from tables. The extracted data can be integrated with other web tables or stored in a database. This study discusses the extraction of data from a table on a web page using a Document Object Model (DOM) tree. The extraction process first transforms the HTML document into a DOM tree. A Depth First Search (DFS) traversal is then applied to the tree, and the table's data is extracted and stored in a CSV file. An engine implementing this process was developed in Visual Basic. The results show that the engine can automatically extract data from tables with the following characteristics: an unlimited number of rows and columns, any table orientation (layout), and merged cells.
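
The pipeline described in the abstract (HTML document, DOM tree, depth-first search, CSV output) can be illustrated with a minimal sketch. This is not the authors' Visual Basic engine: it is written in Python using only the standard library, and the names `Node`, `TreeBuilder`, `dfs`, `text_of`, `table_to_grid`, and `extract_first_table_csv` are hypothetical helpers introduced here. The sketch assumes well-formed HTML with explicit closing tags and no nested tables, and it assumes that a merged cell should be expanded by repeating its value in every grid slot it spans; the abstract does not say how the engine represents merged cells.

```python
import csv
import io
from html.parser import HTMLParser


class Node:
    """Minimal DOM-like element node (illustrative, not a real DOM API)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings, in order


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes well-formed HTML with explicit end tags."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def dfs(node, tag):
    """Yield descendant elements named `tag` in depth-first (document) order."""
    for child in node.children:
        if isinstance(child, Node):
            if child.tag == tag:
                yield child
            yield from dfs(child, tag)


def text_of(node):
    """Concatenate all text below a node, depth-first."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


def table_to_grid(table):
    """Expand rowspan/colspan so every grid slot holds a value."""
    grid = {}  # (row, col) -> cell text
    for r, tr in enumerate(dfs(table, "tr")):
        c = 0
        for cell in tr.children:
            if not isinstance(cell, Node) or cell.tag not in ("td", "th"):
                continue
            while (r, c) in grid:  # slot already taken by an earlier rowspan
                c += 1
            value = " ".join(text_of(cell).split())
            rowspan = int(cell.attrs.get("rowspan") or 1)
            colspan = int(cell.attrs.get("colspan") or 1)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = value
            c += colspan
    if not grid:
        return []
    n_rows = max(row for row, _ in grid) + 1
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def extract_first_table_csv(html):
    """Return the first table in `html` as CSV text ('' if there is none)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    table = next(dfs(builder.root, "table"), None)
    if table is None:
        return ""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(table_to_grid(table))
    return out.getvalue()


SAMPLE_HTML = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Physics</th></tr>
  <tr><td>Ani</td><td>90</td><td>85</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_first_table_csv(SAMPLE_HTML), end="")
```

Repeating the merged value keeps every CSV row the same width, which is convenient when loading the file into a database, but leaving the spanned slots empty would be an equally reasonable choice. Real-world pages often omit closing tags, so a production extractor would likely need a more tolerant HTML parser than this sketch.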


DOI: http://dx.doi.org/10.22146/jnteti.v5i4.273


Copyright (c) 2016 Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI)

 


 

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.