UJM at INEX 2007: Document Model Integrating XML Tags

Mathias Géry; Christine Largeron; Franck Thollard

doi:10.1007/978-3-540-85902-4

Conference paper. Year: 2008

Mathias Géry

Role: Author

Laboratoire Hubert Curien

Christine Largeron

Role: Author
PersonId : 5702
IdHAL : christine-largeron
ORCID : 0000-0003-1059-4095
IdRef : 029304121

Laboratoire Hubert Curien

Franck Thollard

Role: Author
PersonId : 841732

Laboratoire Hubert Curien

Abstract

Different approaches have been used to represent textual documents, based on the Boolean model, the vector space model, or probabilistic models. In text mining as in information retrieval (IR), these models have shown good results for modeling textual documents. They nevertheless do not take document structure into account. In many applications, however, documents are inherently structured (e.g. XML documents). In this article, we propose an extended probabilistic representation of documents that takes into account a certain kind of structural information: logical tags that represent the different parts of the document, and formatting tags used to emphasize text. Our approach includes a learning step that estimates the weight of each tag. This weight is related to the probability for a given tag to distinguish the relevant terms.
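The abstract does not give the estimation formula, so the following is only an illustrative sketch of the idea of a per-tag weight learned from relevance data. It assumes, hypothetically, that training data is available as (tag, is_relevant) pairs, one per term occurrence, and it uses a smoothed relative frequency as the weight. The function name `estimate_tag_weights`, the `smoothing` parameter, and the sample tags are assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def estimate_tag_weights(observations, smoothing=1.0):
    """Estimate a weight per tag as a smoothed P(relevant | tag).

    `observations` is an iterable of (tag, is_relevant) pairs, one per
    term occurrence found inside that tag (hypothetical input format).
    Additive smoothing keeps rarely seen tags away from 0 and 1.
    """
    relevant = defaultdict(int)
    total = defaultdict(int)
    for tag, is_relevant in observations:
        total[tag] += 1
        if is_relevant:
            relevant[tag] += 1
    return {
        tag: (relevant[tag] + smoothing) / (total[tag] + 2 * smoothing)
        for tag in total
    }


# Hypothetical training sample: a logical tag ("title") and a plain
# paragraph tag ("p"), each with relevant/non-relevant term occurrences.
sample = [
    ("title", True), ("title", True), ("title", False),
    ("p", False), ("p", False), ("p", True), ("p", False),
]
weights = estimate_tag_weights(sample)
```

In this toy sample, terms inside `title` are relevant more often than terms inside `p`, so `title` receives the larger weight. A retrieval model could then use such weights to boost or dampen term contributions depending on the enclosing tag.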

Domains

Artificial intelligence [cs.AI]

Contributor: Franck Thollard

https://ujm.hal.science/ujm-00331746

Submitted on: Friday, 17 October 2008, 15:03:11

Last modified on: Friday, 24 March 2023, 14:52:51

Dates and versions

ujm-00331746 , version 1 (17-10-2008)

Identifiers

HAL Id : ujm-00331746 , version 1
DOI : 10.1007/978-3-540-85902-4

Cite

Mathias Géry, Christine Largeron, Franck Thollard. UJM at INEX 2007: Document Model Integrating XML Tags. INEX, Dec 2007, Germany. pp.103-114, ⟨10.1007/978-3-540-85902-4⟩. ⟨ujm-00331746⟩

Collections

UNIV-ST-ETIENNE IOGS CNRS LAHC PARISTECH UDL
