Doctoral Thesis of Alexandre Tadeu Salle


Event Details


Student: Alexandre Tadeu Salle
Advisor: Prof. Dr. Aline Villavicencio

Título: The Role of Negative Information when Learning Dense Word Vectors
Research Area: Machine Learning, Knowledge Representation and Reasoning
Date: December 6, 2021
Time: 2 p.m.
The defense will exceptionally be held entirely remotely. Those interested in attending may access the virtual room via the link: https://mconf.ufrgs.br/webconf/00161896

Examination Committee:
– Prof. Dr. Viviane Pereira Moreira (UFRGS)
– Prof. Dr. Marcelo Finger (USP)
– Prof. Dr. Helena de Medeiros Caseli (UFSCar)

Committee Chair: Prof. Dr. Aline Villavicencio

Abstract: By statistical analysis of the text in a given language, each word in the language's vocabulary can be represented as an m-dimensional word vector (also known as a word embedding) such that these vectors capture semantic and syntactic information. Word embeddings derived from unannotated corpora can be divided into (1) counting methods, which factorize the word-context cooccurrence matrix, and (2) predictive methods, in which neural networks are trained to predict word distributions given a context and which generally outperform counting methods. In this thesis, we hypothesize that the performance gap is due to how counting methods account for, or completely ignore, negative information: word-context pairs where observing one is informative of not observing the other, formulated mathematically as two events (a word and a context) having negative Pointwise Mutual Information (PMI). We validate our hypothesis by creating an efficient factorization algorithm, LexVec, scalable to web-size corpora, that accounts for negative information in a principled way, closing the performance gap with predictive methods. Additionally, we show that strategies for breaking words into smaller units (subwords), an important technique in predictive methods for representing rare words, can be successfully adapted to LexVec. We show that the explicit nature of LexVec, which has access to the underlying cooccurrence matrix, allows us to selectively filter whether, and to what degree, negative information is accounted for in the factorization, and we use this filtering to isolate the impact that negative information has on embeddings. Word- and sentence-level evaluations show that accounting only for positive PMI in the factorization strongly captures both semantics and syntax, whereas using only negative PMI captures little semantic but a surprising amount of syntactic information. Finally, we perform an in-depth investigation of the effect that increasing the relative importance of negative PMI relative to positive PMI has on the geometry of the vector space and on its representation of frequent and rare words. Results reveal two rank-invariant geometric properties (vector norms and directions) and improved rare-word representation induced by incorporating negative information.
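To make the notion of positive versus negative PMI concrete, below is a minimal sketch of the computation described in the abstract, using a toy 3x3 cooccurrence matrix. The counts, the -5 floor on negative PMI, and all variable names are illustrative assumptions for this sketch; they are not taken from the LexVec implementation, which uses its own weighting and sampling scheme over the factorized matrix.

    import numpy as np

    # Toy word-context cooccurrence counts (rows: words, columns: contexts).
    C = np.array([[10., 0., 2.],
                  [ 0., 8., 1.],
                  [ 3., 1., 6.]])

    total = C.sum()
    p_wc = C / total                             # joint probability P(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total   # marginal P(w)
    p_c = C.sum(axis=0, keepdims=True) / total   # marginal P(c)

    # PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ); unobserved pairs give -inf.
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))

    # Positive PMI: the "positive information" counting methods typically keep.
    ppmi = np.maximum(pmi, 0.0)

    # Negative PMI, floored at an illustrative -5 to avoid -inf: the "negative
    # information" whose contribution the thesis isolates by filtering it in
    # or out of the matrix being factorized.
    npmi = np.minimum(np.maximum(pmi, -5.0), 0.0)

    print(ppmi)
    print(npmi)

A pair like the zero-count cell (word 0, context 1) has PMI of -inf: seeing one word tells you the other context does not appear, which is exactly the negative information that is discarded when only the PPMI matrix is factorized.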

Keywords: Word vectors. Matrix factorization. Natural language processing.