UJM at INEX 2007: Document Model Integrating XML Tags

Abstract : Different approaches have been used to represent textual documents, based on the Boolean model, the vector space model, or probabilistic models. In text mining as in information retrieval (IR), these models have shown good results for modeling textual documents. They nevertheless do not take document structure into account. In many applications, however, documents are inherently structured (e.g. XML documents). In this article, we propose an extended probabilistic representation of documents that takes into account a certain kind of structural information: logical tags that represent the different parts of the document, and formatting tags used to emphasize text. Our approach includes a learning step that estimates the weight of each tag. This weight is related to the probability that a given tag distinguishes the relevant terms.
Document type : Conference papers
Contributor : Franck Thollard
Submitted on : Friday, October 17, 2008 - 3:03:11 PM
Mathias Géry, Christine Largeron, Franck Thollard. UJM at INEX 2007: Document Model Integrating XML Tags. INEX, Dec 2007, Germany. pp.103-114, ⟨10.1007/978-3-540-85902-4⟩. ⟨ujm-00331746⟩



