dc.description.abstract | This thesis addresses the problem of modelling and understanding the fundamental statistical structure present in text, using techniques from natural
language processing (NLP) and statistics. It does so by proposing a novel
framework for extracting features from text data, studying the properties of these features, constructing models for them, and applying these models in real-world settings
to illustrate the relevance and importance of such approaches in the era
of Big Data-driven natural language processing.
In the first part of the thesis, the challenge of embedding raw text in a stochastic formulation is addressed, such that the resulting time-series preserve the key
features of natural language: the time-dependent, sequential nature of information, semantics, and the laws of grammar and syntax. As part of this process,
we present, mathematically and in detail, what noise means in text-based data
sources, and how to de-noise and encode text data for modelling purposes. The
latter is achieved by proposing a novel infrastructure of N-ary relations to construct stochastic text embeddings. From these embeddings, a sequence of statistical process
summaries is constructed to study the statistical properties of the embedding, including long memory and its multifractal extension, stationarity, and
behaviour at the extremes.
In the second part of the thesis, we present different ways to construct interpretable sentiment indices from text. We propose lexicon-based models built on
an entropy measure over varying sentiment supports. We then combine different
sentiment types (positive, negative, neutral) using a number of aggregation rules
and show how to interpret the resulting indices.
In the third part, we address the task of building models from the constructed
text time-series. The class of models we consider comprises time-series Mixed-Data Sampling (MIDAS) regression models. We develop a novel Autoregressive
Distributed Lag (ARDL) MIDAS model for multimodal covariates sampled at
varying time resolutions. Finally, we show how to incorporate a deep neural
network into the ARDL-MIDAS framework to produce the instrumental variables
needed to satisfy the estimation requirements of the ARDL-MIDAS model.
The final part of the thesis demonstrates the application of our theoretical
advancements in multimodal sentiment modelling settings, where one modality
is text-based sentiment. The first application focuses on statistical causal relationships between investor sentiment in cryptocurrency markets, whilst the second
focuses on mixed data modelling and deep neural networks (Transformers) for
forecasting end-of-day investor sentiment from intra-daily price and technology indicators. Finally, the third application studies the problem of model risk
in the epidemiology domain, for models that predict the total number of COVID-19 infected cases and are additionally enhanced with public news sentiment
information via an exposure adjustment. | en |