Statistical natural language processing and sentiment analysis with time-series : embeddings, modelling and applications
Chalkiadakis, Ioannis M.
This thesis addresses the problem of modelling and understanding the fundamental statistical structure that is present in texts through techniques from natural language processing (NLP) and statistics. This is achieved by proposing a novel framework for feature extraction from text data, studying the properties of the extracted features, constructing models for them, and applying these models in real-world applications, to illustrate why such approaches are relevant and important in the era of Big Data-driven natural language processing. In the first part of the thesis, the challenge of embedding raw text in a stochastic formulation is addressed, such that the resulting time-series preserve the key features of natural language: the time-dependent, sequential nature of information, semantics, and the laws of grammar and syntax. As part of this process, we define mathematically and in detail what noise means in text-based data sources, and show how to de-noise and encode text data for modelling purposes. The latter is achieved by proposing a novel infrastructure of N-ary relations to construct stochastic text embeddings, based on which a sequence of statistical process summaries is constructed to study the statistical properties of the created embedding, including long memory and its multifractal extension, stationarity, and behaviour at the extremes. In the second part of the thesis, we present different ways to construct interpretable sentiment indices from text. We propose lexicon-based models, based on an entropy measure over varying sentiment supports. We then combine different sentiment types (positive, negative, neutral) with a number of aggregation rules and show how to interpret them. In the third part, we address the task of building models from the constructed text time-series. The class of models we consider comprises time-series Mixed-Data Sampling (MIDAS) regression models.
We develop a novel Autoregressive Distributed Lag (ARDL) MIDAS model for multimodal covariates sampled at varying time resolutions. We also show how to incorporate a deep neural network in the ARDL-MIDAS framework to develop the instrumental variables necessary to resolve estimation requirements for the ARDL-MIDAS model. The final part of the thesis demonstrates the application of our theoretical advancements in multimodal sentiment modelling settings, where one modality is text-based sentiment. The first application focuses on statistical causal relationships between investor sentiment in cryptocurrency markets, whilst the second focuses on mixed data modelling and deep neural networks (Transformers) for forecasting end-of-day investors’ sentiment from intra-daily price and technology indicators. Finally, the third application studies the problem of model risk in the epidemiology domain, for models predicting the total number of COVID-19 infected cases, which are additionally enhanced with public news sentiment information via an exposure adjustment.
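
As an illustration of the mixed-frequency idea behind MIDAS regression, the following is a minimal sketch of how a covariate sampled at a high frequency can be collapsed to a lower frequency with a parametric lag weighting. It is not the ARDL-MIDAS model developed in the thesis. The exponential Almon weighting, the function names `exp_almon_weights` and `midas_aggregate`, and the parameters `theta1`, `theta2` and `m` are all assumptions chosen for illustration.

```python
import numpy as np


def exp_almon_weights(n_lags, theta1, theta2):
    """Normalised exponential Almon lag weights (positive, summing to 1).

    Assumed weighting scheme for illustration; the abstract does not
    state which lag polynomial the thesis uses.
    """
    k = np.arange(1, n_lags + 1, dtype=float)
    z = theta1 * k + theta2 * k ** 2
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return w / w.sum()


def midas_aggregate(x_high, n_lags, theta1, theta2, m):
    """Collapse a high-frequency series to one value per low-frequency period.

    x_high : 1-D sequence observed m times per low-frequency period.
    n_lags : number of high-frequency lags entering each weighted sum.
    Periods without n_lags observations of history are skipped.
    """
    x_high = np.asarray(x_high, dtype=float)
    w = exp_almon_weights(n_lags, theta1, theta2)
    n_low = len(x_high) // m
    out = []
    for t in range(1, n_low + 1):
        end = t * m  # index just past the last observation of period t
        if end < n_lags:
            continue
        window = x_high[end - n_lags:end][::-1]  # most recent observation first
        out.append(float(w @ window))
    return np.array(out)
```

With `theta1 = theta2 = 0` the weights are equal, so the aggregate reduces to a simple average over the last `n_lags` high-frequency observations; other values of `theta1` and `theta2` shift the weight towards more recent or more distant lags. In a regression setting these two parameters would be estimated jointly with the regression coefficients.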