Pattern extraction from textual document collections using heterogeneous networks

Grant number: 11/12823-6
Support type: Scholarships in Brazil - Doctorate
Effective date (Start): October 01, 2011
Effective date (End): September 30, 2015
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Solange Oliveira Rezende
Grantee:
Home Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil

Abstract

Due to the large number of textual document collections available today, there is a need to develop techniques for automatic knowledge extraction and organization of these collections. Documents are normally represented in a vector space model, in which each document is a vector and each position of the vector corresponds to a feature of the document, for example the frequency of a word. Pattern extraction methods that use this representation assume that the documents in a collection, as well as their features, are independent. However, this assumption can lead to erroneous results. To avoid this problem, some representations model textual documents as networks. In this type of representation, however, traditional algorithms assume that the network is composed of objects of a single type connected by relations of a single type, i.e., that the network is homogeneous. This limitation can be overcome by representing text with heterogeneous networks, in which documents are represented using different types of objects, such as document terms or authors, and different types of relationships among these objects. However, relationships between objects of the same type are rarely used in heterogeneous networks. Our hypothesis is that this kind of relationship can also help pattern extraction. To test this hypothesis, this PhD project proposes a representation of textual document collections using heterogeneous networks and will study which ways of relating objects of the same type in a heterogeneous network produce better results for the classification and clustering of textual documents. Algorithms will be adapted or developed for pattern extraction using the proposed representation. (AU)
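
The two representations contrasted in the abstract can be illustrated with a minimal sketch. It is not the project's method: the function names, the whitespace tokenization, and the two toy documents are assumptions made only for illustration. The first function builds term-frequency vectors (the vector space model); the second builds the edges of a bipartite network with two object types, documents and terms, weighted by term frequency.

```python
from collections import Counter


def term_frequencies(text):
    """Lowercase, split on whitespace, and count each term (assumed tokenization)."""
    return Counter(text.lower().split())


def vector_space(docs):
    """Return (vocabulary, vectors): one term-frequency vector per document."""
    counts = [term_frequencies(d) for d in docs]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    vectors = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, vectors


def bipartite_network(docs):
    """Return weighted edges linking document objects to term objects."""
    edges = []
    for i, d in enumerate(docs):
        for term, freq in sorted(term_frequencies(d).items()):
            edges.append((("doc", i), ("term", term), freq))
    return edges


# Hypothetical toy collection.
docs = ["networks model text", "text text classification"]
vocabulary, vectors = vector_space(docs)
edges = bipartite_network(docs)
```

In the network form, the two documents are connected through the shared term object "text", a relation that the independent vectors do not make explicit. Relations between objects of the same type (document to document, or term to term), which the project proposes to study, would be additional edges on top of this structure.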

Articles published in Agência FAPESP about the scholarship:
Algorithms facilitate automated classification of web texts 

Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
FALEIROS, THIAGO DE PAULO; ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE. Optimizing the class information divergence for transductive classification of texts using propagation in bipartite graphs. PATTERN RECOGNITION LETTERS, v. 87, n. SI, p. 127-138, FEB 1 2017. Web of Science Citations: 0.
ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE; REZENDE, SOLANGE OLIVEIRA. Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. INFORMATION PROCESSING & MANAGEMENT, v. 52, n. 2, p. 217-257, MAR 2016. Web of Science Citations: 4.
ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE; FALEIROS, THIAGO DE PAULO; REZENDE, SOLANGE OLIVEIRA. Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, v. 29, n. 3, p. 361-375, MAY 2014. Web of Science Citations: 3.

Academic publications
(References retrieved automatically from State of São Paulo Research Institutions)
RAFAEL GERALDELI ROSSI. Text automatic classification through machine learning based on networks. 2015. Doctoral Thesis - Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação São Carlos.
