Abstract

Information extraction (IE) aims to extract from textual documents only the fragments that correspond to data fields required by the user. In this paper, we present new experiments evaluating a hybrid machine learning approach for IE that combines text classifiers and hidden Markov models (HMMs). In this approach, a text classifier generates an initial output, which is then refined by an HMM that takes into account dependencies in the order of the data to be extracted. The approach was evaluated on the task of extracting information from bibliographic references. Experiments performed on a corpus of 6000 references showed an improvement in performance over the benchmark IE approaches adopted in previous work.
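
As a rough illustration of the two-stage idea (not the authors' implementation), the sketch below treats the labels predicted by a text classifier as the observations of an HMM whose hidden states are the true fields, and uses Viterbi decoding to recover the most likely field sequence. The field set, the wiring of classifier output into the HMM, and all probabilities are hypothetical assumptions chosen for the example; in practice they would be estimated from labelled references.

```python
import math

# Hypothetical field labels for fragments of a bibliographic reference.
STATES = ["author", "title", "year"]

# Hypothetical HMM parameters (assumed for illustration only).
START = {"author": 0.8, "title": 0.15, "year": 0.05}
TRANS = {  # TRANS[previous_field][next_field]
    "author": {"author": 0.6, "title": 0.35, "year": 0.05},
    "title": {"author": 0.05, "title": 0.7, "year": 0.25},
    "year": {"author": 0.05, "title": 0.05, "year": 0.9},
}
EMIT = {  # EMIT[true_field][label_predicted_by_classifier]
    "author": {"author": 0.8, "title": 0.15, "year": 0.05},
    "title": {"author": 0.2, "title": 0.75, "year": 0.05},
    "year": {"author": 0.05, "title": 0.05, "year": 0.9},
}


def viterbi(observed):
    """Return the most likely true-field sequence given classifier labels."""
    # Work in log space to avoid underflow on long sequences.
    scores = [
        {s: math.log(START[s]) + math.log(EMIT[s][observed[0]]) for s in STATES}
    ]
    backpointers = []
    for obs in observed[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this step.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Initial output of the (hypothetical) classifier, one label per fragment.
# The isolated "author" in the middle of the title is a plausible local error.
classifier_output = ["author", "author", "title", "author", "title", "year"]
refined = viterbi(classifier_output)
print(refined)
```

With these assumed parameters, the isolated "author" label inside the title span is relabelled as "title", because a title-to-author-to-title sequence is far less probable under the transition model than a single classifier error.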

BibTeX

@inproceedings{barros2008hidden,
  title={Hidden Markov Models and Text Classifiers for Information Extraction on Semi-Structured Texts},
  author={Barros, Flavia A and Silva, Eduardo FA and Prud{\^e}ncio, Ricardo BC and Valmir Filho, M and Nascimento, Andr{\'e} CA},
  booktitle={Eighth International Conference on Hybrid Intelligent Systems (HIS'08)},
  pages={417--422},
  year={2008},
  organization={IEEE}
}