Bitcoin price change and trend prediction through twitter sentiment and data volume – Financial Innovation

By Jacques Vella Critien, Albert Gatt and Joshua Ellul

May 5, 2022

Figure 1 provides an overview of the process followed to determine the best model for predicting: (i) the next day’s close price direction (i.e. whether it will increase or decrease); and (ii) the magnitude of the difference in closing prices. Two main datasets are used in this study: (i) Bitcoin price data; and (ii) Twitter tweets. Historical Bitcoin price data, providing a per-minute record of timestamps, opening and closing prices, high and low prices and the volume of Bitcoin traded for the period between 1st January 2012 and 31st December 2020, was retrieved from Kaggle (Footnote 6). A Twitter dataset (Footnote 7), also from Kaggle, was filtered to retrieve tweets that contained either ‘bitcoin’ or ‘btc’. The tweets in the dataset span the period between 1st January 2016 and 29th March 2019 and number over 20 million in total. In addition to the text of each tweet, the dataset provides timestamps, tweet IDs and URLs, the associated authors’ usernames and full names, and the number of replies, likes and retweets each tweet received.
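The keyword filter described above can be sketched as follows. This is a minimal illustration only: the `is_bitcoin_related` helper, the tweet field names and the substring-matching criterion are assumptions, not taken from the authors' code.

```python
# Keep a tweet if its text contains 'bitcoin' or 'btc', case-insensitively.
# A plain substring check is an assumption; the original filtering may
# have matched whole keywords only.
def is_bitcoin_related(text):
    lowered = text.lower()
    return "bitcoin" in lowered or "btc" in lowered

tweets = [
    {"id": 1, "text": "BTC to the moon"},
    {"id": 2, "text": "Nice weather today"},
    {"id": 3, "text": "Bitcoin dipped overnight"},
]
filtered = [t for t in tweets if is_bitcoin_related(t["text"])]
```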

Data cleaning and pre-processing

The following cleaning and pre-processing tasks depicted in Fig. 2 were undertaken on the Twitter dataset:

• Removal of non-English tweets (Footnote 8) and duplicate tweets made by the same user, in a similar manner to Pant (2018), Valencia et al. (2019) and Ranjan et al. (2018);

• Removal of URLs from tweets, as performed in Kraaijeveld and De Smedt (2020) and Ranjan et al. (2018);

• Tokenization and lemmatization in a similar manner to Pagolu et al. (2016) (that is, mapping each token to its morphological base form, so as to enable reasoning about words in a form-agnostic manner);Footnote 9

• Removal of stop words (e.g. ‘a’, ‘the’, etc.), similarly to what was done by Pagolu et al. (2016);

• Replacement of user mentions (akin to tagging, which takes the form of ‘@’ followed by the username) with the text ‘USER’, again following Pagolu et al. (2016);

• Removal of all punctuation (Abraham et al. 2018);

• Processing of hashtags: if the word following the hash sign was not found in a precompiled English wordlist (Footnote 10), it was removed; otherwise, the ‘#’ sign was dropped and the word retained, as was done by Kraaijeveld and De Smedt (2020);

• Removal of tweets containing fewer than 4 words, similarly to Kraaijeveld and De Smedt (2020);

• The normalised datasets are available for download from our repository at https://github.com/jacquesvcritien/fyp.
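The cleaning steps above, excluding language detection and lemmatization (which require external NLP resources), can be sketched as a single pipeline. The stop-word set and English wordlist below are tiny illustrative stand-ins for the actual resources used, and the step ordering is an assumption.

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "and", "or", "at", "in", "of", "to", "with"}
ENGLISH_WORDS = {"bitcoin", "crypto", "market", "price"}  # stand-in wordlist

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "USER", text)       # replace user mentions
    # Hashtags: keep the word if it is in the wordlist, otherwise drop it
    text = re.sub(
        r"#(\w+)",
        lambda m: m.group(1) if m.group(1).lower() in ENGLISH_WORDS else "",
        text,
    )
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t.lower() not in STOP_WORDS]

def keep_tweet(tokens, min_words=4):
    # Tweets with fewer than 4 remaining words are discarded
    return len(tokens) >= min_words
```

For example, `preprocess("Check #bitcoin at https://example.com with @alice now!")` yields `["Check", "bitcoin", "USER", "now"]`, which `keep_tweet` retains.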

In relation to the Bitcoin pricing dataset, the high and low prices were removed from the feature list so as to keep only the average price per minute (Footnote 11). After the cleaning and pre-processing steps, this study ended up with tweets and prices ranging between 30th August 2018 and 23rd November 2019.

A desirable property of the resulting dataset is that Bitcoin prices within the specified date range evinced both downward ($6500 to $3300, −49%) and upward trends ($3300 to $11500, +248%).

Determining polarity scores

Following preprocessing, VADER (Hutto and Gilbert 2015) is used to assign sentiment scores to tweets. Similar approaches are used by Valencia et al. (2019), Abraham et al. (2018) and Kraaijeveld and De Smedt (2020) (Footnote 12). VADER scores each tweet with a negative, positive, neutral and compound polarity score. The compound score is a sum of the individual sentiment scores, adjusted according to a set of rules and normalised to fall within the [−1, +1] range. However, for the purposes of this study, only the positive and negative polarity scores are included in the training and evaluation datasets. VADER has been widely used in related work (Valencia et al. 2019; Abraham et al. 2018; Kraaijeveld and De Smedt 2020; Mohapatra et al. 2020; Serafini et al. 2020) and offers several advantages: it is open source and free; it is human validated and tuned for Twitter content (Valencia et al. 2019); and it has been shown to perform competitively with human annotators and to outperform several benchmarks, especially on social media content (Hutto and Gilbert 2015).
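In practice, the per-tweet scores are obtained by calling VADER's `SentimentIntensityAnalyzer.polarity_scores`. The normalisation of the compound score mentioned above can be illustrated in isolation: VADER maps an unbounded sum of valence scores x into [−1, +1] via x/√(x² + α) with α = 15 (Hutto and Gilbert 2015). The sketch below reproduces only this normalisation step, not the full scorer.

```python
import math

def vader_normalize(score, alpha=15.0):
    """Map an unbounded valence sum into [-1, +1], VADER-style."""
    return score / math.sqrt(score * score + alpha)
```

Large positive sums approach +1 and large negative sums approach −1, while a sum of 0 maps to exactly 0.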

Merging datasets and introducing lag

One of the research questions this work aims to address is the optimal lag to consider in order to uncover a relationship between Bitcoin-related tweets (in particular, the sentiment they express) and actual price change. Indeed, it is not certain that such tweets are the cause of the change in price. However, in this work we investigate whether a potential correlation can be observed and, if so, what the optimal time lag is between tweets and the price being affected. A similar approach was followed by Stenqvist and Lönnö (2017) and Balfagih and Keselj (2019), who explored lags ranging from minutes to hours. In contrast to these approaches, in this paper we investigate lag intervals of a number of days: specifically, 1, 3 or 7 days. To illustrate, Fig. 3 depicts the effect of introducing a lag of 3 days on a dataset, where the original dataset is on the left and the lagged dataset on the right. In the lagged dataset, note how, for example, Day 1’s score is associated with Day 4’s price; that is, tweets from day 1 are assumed to affect prices 3 days later. Lags of 1, 3 and 7 days were chosen for the following reasons: since this study focuses on making a daily prediction, the minimum lag should be at least 1 day; a granularity of a week was chosen as the upper bound; and the 3-day lag represents an interval between these two extremes.

The three different lagged datasets (for 1-, 3- and 7-day lags) were created by shifting the price data (of the cleaned and merged dataset) back by the respective number of days in the lag being tested.
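As a minimal, list-based sketch (with hypothetical per-day inputs), the shift amounts to pairing day i's sentiment with the price of day i + lag:

```python
def apply_lag(scores, prices, lag_days):
    """Pair each day's sentiment score with the price `lag_days` later."""
    return [
        {"score": scores[i], "price": prices[i + lag_days]}
        for i in range(len(scores) - lag_days)
    ]

daily_scores = [0.1, 0.4, -0.2, 0.3, 0.0]
daily_prices = [100, 105, 103, 110, 108]
lagged = apply_lag(daily_scores, daily_prices, 3)
# day 1's score is now paired with day 4's price, as in Fig. 3
```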

Grouping lagged datasets

Lagged datasets consist of preprocessed tweets coupled with their Bitcoin price at the minute the tweet was posted. Subsequently, these are grouped by day in order to allow a model to make daily predictions. Grouping is done in the following manner:

• Timestamps of tweets are floored to the hour or day when the tweet was posted;

• Tweets are grouped by their floored timestamp;

• For a given group, the polarity scores are averaged;

• The tweet volume is added as an additional feature, where the volume is the number of tweets in a given day;

• The closing Bitcoin price for the day is then identified as the price for the last record for the given day.
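The grouping steps above can be sketched as follows; the tuple layout, field names and sample values are illustrative assumptions, not the authors' actual data schema.

```python
from collections import defaultdict
from datetime import date, datetime

def group_by_day(records):
    """records: (timestamp, pos, neg, price) tuples in time order."""
    groups = defaultdict(list)
    for ts, pos, neg, price in records:
        groups[ts.date()].append((pos, neg, price))   # floor to the day
    daily = {}
    for day, rows in groups.items():
        n = len(rows)
        daily[day] = {
            "pos": sum(r[0] for r in rows) / n,   # average polarity
            "neg": sum(r[1] for r in rows) / n,
            "volume": n,                          # tweet volume that day
            "close": rows[-1][2],                 # last price of the day
        }
    return daily

records = [
    (datetime(2019, 1, 1, 9, 30), 0.2, 0.1, 3500),
    (datetime(2019, 1, 1, 18, 0), 0.4, 0.3, 3550),
    (datetime(2019, 1, 2, 8, 0), 0.6, 0.0, 3600),
]
daily = group_by_day(records)
```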

Features and labels

The classifiers described below are trained to predict a fluctuation in price based on the following features:

1. Change: the Bitcoin price change direction of that day (binary, indicating whether the price rises or falls);

2. Close: Bitcoin’s closing price for that day;

3. Positive polarity: the positive polarity score obtained from VADER;

4. Negative polarity: the negative polarity score obtained from VADER;

5. Tweet volume: the volume of tweets in the relevant interval. This was also investigated by Abraham et al. (2018), who demonstrated that price changes were highly correlated with tweet volume.

Note that lagged datasets also include the above features for the previous days. For example, as shown in Fig. 4, if the lag is 2, a training instance would include data from the last 2 days. Finally, the label of that instance would be the price change direction of the day following the last lagged feature.
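This windowing can be sketched as follows; `daily_rows` is a hypothetical stand-in for the grouped per-day feature records described earlier.

```python
def make_instances(daily_rows, lag):
    """Concatenate `lag` consecutive days of features into one instance;
    the label is the change direction of the day after the window."""
    instances = []
    for i in range(len(daily_rows) - lag):
        features = daily_rows[i : i + lag]
        label = daily_rows[i + lag]["change"]
        instances.append((features, label))
    return instances

rows = [{"change": i % 2, "close": 100 + i} for i in range(5)]
instances = make_instances(rows, 2)
```

With 5 days of data and a lag of 2, this produces 3 instances, each carrying 2 days of features and labelled with the following day's direction.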

Data split and resampling

To train and test the models, the dataset was split using a train-test ratio of 85:15. This choice reflects the small number of records available for training and testing after grouping and averaging the original datasets per day: the split gives the model a large share of the available data to train on while retaining a fair number of records for testing.

When testing different models and parameters, each set of parameters is tested on 3 differently shuffled datasets. The same seed is used when shuffling, ensuring that each model and set of parameters is tested on the same three datasets and allowing for a fair comparison. The training and test sets are prepared by first shuffling the original datasets and then using the first 85% as the training set and the last 15% as the test set.
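A seeded shuffle-and-split of this kind might look as follows; the 85:15 ratio comes from the text, while the seed value and function name are arbitrary illustrations.

```python
import random

def split_dataset(instances, train_frac=0.85, seed=42):
    shuffled = instances[:]                  # copy; leave the original intact
    random.Random(seed).shuffle(shuffled)    # fixed seed -> reproducible order
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train_set, test_set = split_dataset(data)
```

Calling `split_dataset` again with the same seed reproduces exactly the same split, which is what makes comparisons across models fair.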

Predicting next day’s close price direction

The direction of the closing price can be framed as a binary classification problem where, given the input corresponding to features extracted from tweets, the task is to predict whether the price will go up or down.
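Deriving these binary labels from a series of daily closing prices can be sketched as:

```python
def direction_labels(closes):
    # 1 if the next day's close is higher than today's, else 0
    return [
        1 if closes[i + 1] > closes[i] else 0
        for i in range(len(closes) - 1)
    ]
```

For instance, `direction_labels([100, 105, 103, 110])` gives `[1, 0, 1]`. (Whether the authors treat an unchanged price as a rise or a fall is not specified; mapping it to 0 here is an assumption.)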

Three different models were implemented for predicting whether the following day’s closing price will increase or decrease: (i) a Long Short-Term Memory network (LSTM); (ii) a Convolutional Neural Network (CNN); and (iii) a Bidirectional LSTM (BiLSTM). These are hereafter referred to as Direction-LSTM, Direction-CNN and Direction-BiLSTM. Table 2 outlines the hyperparameters used for each model, together with its accuracy statistics; these are further discussed in the evaluation section below. It is nonetheless evident that the best-performing model, in terms of mean accuracy, is Direction-BiLSTM. The architecture of this model is depicted in Fig. 5.

Daily price change magnitude prediction

A second prediction task targets the magnitude of the change in closing day prices, framed as a multi-class classification problem: the model predicts which interval the closing day price change will fall into.

Closing day price changes were categorised into ten different bins/classes. An average of the maximum positive ($1563) and maximum negative ($1746) price changes was calculated (and rounded to $1650) to define the lower and upper bins/classes (which were extended to include any greater price change), and equal steps (of $330) were then calculated for each bin in between, as can be seen in Table 3.
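Under the bounds described above (outer bins at ±$1650, inner steps of $330), the binning can be sketched as follows. The class numbering (1 to 10, with 1-5 for negative and 6-10 for positive changes) follows the voting-classifier description later in the paper; the exact handling of bin boundaries here is an assumption.

```python
def magnitude_class(change, step=330.0, n_bins=10):
    """Map a daily price change (USD) to a class from 1 to n_bins."""
    half = step * n_bins / 2           # +/- 1650 with the defaults
    if change < -half + step:          # open-ended lowest bin (< -1320)
        return 1
    if change >= half - step:          # open-ended highest bin (>= 1320)
        return n_bins
    return int((change + half) // step) + 1
```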

As before, to predict the magnitude of change in price on the following day, three models were implemented using an LSTM, CNN and BiLSTM, which we’ll refer to as Magnitude-LSTM, Magnitude-CNN and Magnitude-BiLSTM for the remainder of this paper. Table 4 summarises the hyperparameters and training settings used for these models, together with the evaluation results (see the evaluation section for this discussion). The Magnitude-CNN model outperforms the other two for this task, as is evident from the mean accuracy and F1 scores. Figure 6 depicts the architecture of this model.

Voting classifier

The best-performing models from each of the aforementioned predictive tasks, namely the Direction-BiLSTM and Magnitude-CNN models, were combined to create a voting classifier that takes into consideration the outputs of both. As Fig. 7 shows, the voting classifier first predicts the next day’s closing price direction with the first model and then the magnitude of the next day’s closing price change with the second. It then checks whether the predicted direction matches the direction implied by the predicted change magnitude. In other words, a match occurs: (i) if the first model outputs a 0 (a decrease in price) and the second model outputs a class from 1 to 5 (a negative change magnitude); or (ii) if the first model outputs a 1 (an increase in price) and the second model outputs a class from 6 to 10 (a positive change magnitude). The prediction of the next day’s closing price direction is kept only if the outputs of the two classifiers match. The voting classifier is evaluated over 50 different runs with 50 differently shuffled datasets.
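The agreement check at the heart of the voting classifier can be sketched as follows (class numbering as in the text: 1-5 negative, 6-10 positive; the function names are illustrative):

```python
def directions_match(direction, magnitude_cls):
    """direction: 0 (down) or 1 (up); magnitude_cls: class 1-10."""
    if direction == 0:
        return 1 <= magnitude_cls <= 5
    return 6 <= magnitude_cls <= 10

def vote(direction, magnitude_cls):
    # Keep the direction prediction only when both models agree
    return direction if directions_match(direction, magnitude_cls) else None
```

Returning `None` on disagreement is one possible reading of "the prediction is kept if there is a match"; how non-matching cases are counted in evaluation is not specified here.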