This research aims to develop a depression detection model using demographic characteristics and sentiment analysis from tweets. The research methodology consists of 5 processes as shown in Fig. 2 which are data acquisition, data transformation, data storage, model construction, and model performance evaluation. The details of each process are described as follows.

Fig. 2

The framework of a depression detection model

Data acquisition

The data for model construction in this research are obtained from three sources: the PHQ-9, a personal information questionnaire, and the Twitter API.

Firstly, the data derived from the PHQ-9 are the survey response date and a depression assessment score for each participant.

Secondly, the data collected from the personal information questionnaire comprise Twitter ID and demographic characteristics such as gender, age, weight, education, congenital disease, career, income, number of family members, self-couple status, and parent’s marital status.

Finally, the Twitter API is used to collect Twitter users' information in real time from their Twitter pages. The Twitter ID collected from the first source is used to access each user's information, which covers the 2 months prior to the survey response date. In total, 222 tweets, 1522 retweets, and 16 hashtags were collected from 192 Twitter users, of whom 50 are moderately depressed, 74 are slightly depressed, and 68 show no sign of depression.

Data transformation

The data transformation for model construction using machine learning consists of three steps, detailed as follows.

Extracting Twitter user’s information

The information obtained from Twitter is processed to extract sentiment attributes and their values. The extraction process is illustrated in Fig. 3 and can be explained as follows. Firstly, the Twitter ID is used to access the Twitter user's information via the Twitter API. The extracted information consists of tweets, retweets, hashtags, the number of friends, the number of followers, and periods of tweets. Secondly, terms in the tweets, retweets, and hashtags are extracted using the NLTK library [23]. Thirdly, the extracted terms in Thai are translated to English using the Google Translate API. Finally, the sentiment numerical scores of each term are derived from opinion lexicons in the WordNet database [24], where each term has three sentiment numerical scores, Pos(s), Neg(s), and Obj(s), according to Eq. (1).

$$Pos(s) + Neg(s) + Obj(s) = 1 $$


The extracted sentiment attributes in this research comprise the number of positive and negative tweets, retweets and hashtags, the number of tweets, retweets and hashtags expressing depression, and sentiment score of all tweets.
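As a small illustration of the per-term scoring of Eq. (1), the sketch below looks up terms in a tiny hypothetical lexicon; it is a stand-in for the WordNet/SentiWordNet lookup described above, and the words and scores are invented for the example:

```python
# Toy sentiment lexicon: term -> (Pos(s), Neg(s), Obj(s)); scores sum to 1
# per Eq. (1). A real implementation would query SentiWordNet via NLTK.
TOY_LEXICON = {
    "happy":  (0.75, 0.00, 0.25),
    "sad":    (0.00, 0.75, 0.25),
    "tired":  (0.00, 0.50, 0.50),
    "coffee": (0.00, 0.00, 1.00),
}

def term_scores(term):
    """Return (Pos(s), Neg(s), Obj(s)) for a term.

    Terms absent from the lexicon are treated as fully objective."""
    return TOY_LEXICON.get(term.lower(), (0.0, 0.0, 1.0))

# Sanity check: Eq. (1) holds for every term, known or unknown.
for term in ["happy", "sad", "anything-else"]:
    pos, neg, obj = term_scores(term)
    assert abs(pos + neg + obj - 1.0) < 1e-9
```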

Fig. 3

The process of feature extraction

The sentiment score of all tweets \(\Theta\) is derived from the sentiment score of each tweet \(i\) as shown in Eq. (2).

$$ \Theta = \frac{\sum_{i=1}^{n} Tweet(s)_{i}}{n} $$


The sentiment score of each tweet \(Tweet(s)_{i}\) is calculated as the average positive score minus the average negative score over all terms in the tweet, as shown in Eq. (3).

$$ Tweet(s)_{i} = \frac{\sum_{j=1}^{m} Pos(s)_{j}}{m} - \frac{\sum_{j=1}^{m} Neg(s)_{j}}{m} $$


The number of positive tweets \(Tweet_{pos}\) is the total number of tweets with \(Tweet(s) > 0\). Conversely, the number of negative tweets \(Tweet_{neg}\) is the total number of tweets with \(Tweet(s) < 0\). The number of tweets expressing depression \(Tweet_{depress}\) is the total number of tweets containing at least one depressive term. There are 78 depressive words, collected from the PHQ-9 questionnaire and stored in a depression term corpus [25].

The numbers of positive and negative retweets and hashtags, and the numbers of retweets and hashtags expressing depression, are derived in the same way as for tweets.
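Eqs. (2) and (3) and the tweet counts above can be sketched as follows, assuming a `term_scores(term)` helper that returns `(Pos(s), Neg(s), Obj(s))` per Eq. (1); the function names here are illustrative, not the paper's:

```python
def tweet_score(terms, term_scores):
    """Eq. (3): mean positive score minus mean negative score over m terms."""
    m = len(terms)
    pos = sum(term_scores(t)[0] for t in terms) / m
    neg = sum(term_scores(t)[1] for t in terms) / m
    return pos - neg

def summarize(tweets, term_scores, depressive_terms):
    """Compute Theta (Eq. 2) and the Tweet_pos / Tweet_neg / Tweet_depress counts."""
    scores = [tweet_score(t, term_scores) for t in tweets]
    theta = sum(scores) / len(scores)                    # Eq. (2)
    n_pos = sum(s > 0 for s in scores)                   # Tweet(s) > 0
    n_neg = sum(s < 0 for s in scores)                   # Tweet(s) < 0
    n_dep = sum(any(w in depressive_terms for w in t)    # contains a depressive term
                for t in tweets)
    return theta, n_pos, n_neg, n_dep
```

The same functions apply unchanged to retweets and hashtags, matching the note above that those attributes are derived identically.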

Transforming the depression assessment score

The scores obtained from the PHQ-9 are converted from numeric to nominal values in three classes as follows: 5-8 points are converted to Level 0 (no depression), 9-14 points to Level 1 (slight depression), and 15 points or more to Level 2 (moderate depression).
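A minimal sketch of this score-to-class mapping, assuming the 9-14 band closes at 14 so that scores of 15 and above map to Level 2, and treating scores below the surveyed range as Level 0:

```python
def depression_level(phq9_score):
    """Map a numeric PHQ-9 score to a nominal depression class."""
    if phq9_score <= 8:
        return 0  # Level 0: no depression
    elif phq9_score <= 14:
        return 1  # Level 1: slight depression
    else:
        return 2  # Level 2: moderate depression
```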

Loading relevant features

Before loading into data storage, the attributes obtained from the previous step are reduced to relevant features. Feature selection is the process of selecting a subset of relevant features for model construction. The relevant features, shown in Table 2, include 10 demographic characteristic attributes \((x_{1},\ldots,x_{10})\), 10 sentiment attributes \((x_{11},\ldots,x_{20})\), 4 Twitter user information attributes \((x_{21},\ldots,x_{24})\), and the depression evaluation result attribute \(y\). The table also gives additional details and the data type of each feature. Irrelevant data such as the Twitter ID and the survey response date are eliminated to prevent unnecessary computation in the machine learning techniques. These relevant features are loaded into the next process.
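The elimination of irrelevant attributes can be sketched as a simple filter over each record before loading; the records are modeled as dicts here, and the key names are illustrative, not the paper's:

```python
# Attributes dropped before loading: they identify the participant or the
# survey session but carry no predictive signal for the classifier.
IRRELEVANT = {"twitter_id", "survey_response_date"}

def select_relevant(record):
    """Keep only the relevant features of a participant record."""
    return {k: v for k, v in record.items() if k not in IRRELEVANT}
```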

Table 2 Data characteristics of relevant features

Data storage

Data from the previous process are loaded into a Hadoop cluster, a specialized computer cluster designed for storing and analyzing large amounts of data. This research employs Cloudera, an open-source Hadoop software suite, running on commodity computers. Data gathered from Twitter are kept in the Hadoop Distributed File System (HDFS). The HDFS consists of two principal components: the name node and the data nodes. The name node stores the metadata, while the data nodes store the actual data blocks. Files are split into 128 MB data blocks, and each block is replicated on three different data nodes as illustrated in Fig. 4. The Hadoop cluster serves as a data lake for storing the unstructured data from Twitter.

Fig. 4

Model construction

Machine learning is used to create the automated depression detection model. The type of machine learning applied for model construction is classification. The training set of the model uses the features or input variables \((x_{1},\ldots,x_{24})\) and the target or output variable \(y\), all of which are listed in Table 2. More specifically, this research employs supervised machine learning techniques for model creation, including Support Vector Machine, Naïve Bayes, Decision Tree, Random Forest, and Deep Learning. Model construction is implemented with Spark MLlib, Spark's machine learning library, using its Python interface. An advantage of Spark MLlib is that it can load a huge data set directly from the HDFS, which enables a large volume of tweets to be processed with machine learning.
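The supervised setup can be sketched as below. The paper trains with Spark MLlib on the cluster; scikit-learn is used here as a stand-in so the example runs without a Hadoop/Spark installation, and the data are synthetic (24 random features echoing \(x_{1},\ldots,x_{24}\), with an artificial 3-level target), not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 24))              # 24 features, echoing x1..x24 in Table 2
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # synthetic 3-level target (0, 1, 2)

# Hold out a test split, fit one of the candidate classifiers, and score it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

With Spark MLlib the same pattern applies, except the feature matrix is loaded directly from HDFS into a DataFrame rather than built in memory.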

Model performance evaluation

The last process evaluates the performance of the depression detection model to find the most appropriate machine learning technique for detecting depression. The Support Vector Machine technique is used as the baseline for accuracy comparison, since the majority of previous research has relied on it [10,11,12,13, 19]. The results of the different techniques are compared using standard measures: F-measure, accuracy, precision, and recall [26].
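For reference, the four measures can be computed from raw confusion counts; the sketch below shows the standard binary (per-class) definitions, with the function name being illustrative:

```python
def prf(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)                # of predicted positives, how many correct
    recall    = tp / (tp + fn)                # of actual positives, how many found
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_measure
```

For the three-level depression classes, these are computed per class (one-vs-rest) and averaged.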
