Pattern extraction from textual document collections using heterogeneous networks

Grant number: 11/12823-6
Support type: Scholarships in Brazil - Doctorate
Effective date (Start): October 01, 2011
Effective date (End): September 30, 2015
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Solange Oliveira Rezende
Home Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil


Given the large volume of textual document collections available today, techniques are needed to automatically extract knowledge from these collections and organize them. Documents are usually represented in a vector space model, in which each document is a vector and each position of the vector corresponds to a feature of the document, for example the frequency of a word. Pattern extraction methods that use this representation assume that the documents in a collection, as well as their features, are independent. However, this assumption can lead to erroneous results. To avoid this problem, some representations model textual documents as networks. In this type of representation, however, traditional algorithms assume that the network is composed of objects of a single type connected by relations of a single type, i.e., that the network is homogeneous. This limitation can be overcome by representing text as heterogeneous networks, i.e., networks in which documents are represented together with different types of objects, such as their terms or authors, and with different types of relationships among these objects. However, relationships between objects of the same type are rarely used in heterogeneous networks. Our hypothesis is that this kind of relationship can also help pattern extraction. To test this hypothesis, this PhD project proposes a representation of textual document collections using heterogeneous networks and will study which ways of relating objects of the same type in a heterogeneous network produce better results in the classification and clustering of textual documents. Algorithms will be adapted or developed for pattern extraction using the proposed representation. (AU)
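
The two representations contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the project's method: the toy documents, the node naming scheme, and the use of within-document co-occurrence as the term-term (same-type) relation are all assumptions made for illustration, since the abstract does not say which same-type relations will be studied.

```python
from collections import Counter
from itertools import combinations

# Toy collection; the texts are invented for illustration only.
docs = {
    "d1": "network text mining",
    "d2": "text classification network",
    "d3": "image classification",
}

# Vector space model: one term-frequency vector per document,
# with one position per vocabulary term.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def tf_vector(text):
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {doc_id: tf_vector(text) for doc_id, text in docs.items()}

# Heterogeneous network: nodes are (type, name) pairs and
# undirected edges are stored with an accumulated weight.
edges = {}

def add_edge(u, v, weight):
    key = tuple(sorted((u, v)))
    edges[key] = edges.get(key, 0) + weight

for doc_id, text in docs.items():
    counts = Counter(text.split())
    # Document-term relations (objects of different types),
    # weighted by term frequency.
    for term, freq in counts.items():
        add_edge(("doc", doc_id), ("term", term), freq)
    # Term-term relations (objects of the same type). Assumption:
    # two terms are related when they co-occur in a document.
    for t1, t2 in combinations(sorted(counts), 2):
        add_edge(("term", t1), ("term", t2), 1)

for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

In the vector representation, d1 and d2 are compared only through the positions they share. In the network, the term-term edges additionally connect terms directly, which is the kind of same-type relationship the hypothesis refers to.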

Scientific publications
Academic publications
RAFAEL GERALDELI ROSSI. Text automatic classification through machine learning based on networks. 2015. Doctoral Thesis - Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação São Carlos.

