1 Introduction

The human immunodeficiency virus type 1 (HIV-1) is a global health threat that impacts millions of people worldwide. This lentivirus primarily infects the immune system by attaching to the CD4 receptor, entering the cell, integrating into the genome, and then producing viral copies to infect new cells, a process reviewed in greater detail by Mailler et al. (1). Antiretroviral therapy has proven to be an effective treatment for individuals; however, the error-prone replication process leads to rapid diversification of the virus and eventual resistance through the accumulation of mutations which impact the efficacy of the treatment (2, 3). This rapid evolution also allows the virus to evade host antibodies and invade new niches through the changes to the viral envelope and accessory proteins (47).

These phenotypes have drastic impacts on patient morbidity and mortality (5, 8, 9). Given their selective advantage, strains containing drug resistant mutations (DRMs) with low fitness requirements tend to proliferate within infected individuals and have been shown to drastically increase the potential for virologic failure (10, 11). At present, there are five classes of HIV-1 antiretroviral therapies: nucleoside reverse transcriptase inhibitors (NRTIs), non-nucleoside reverse transcriptase inhibitors (NNRTIs), protease inhibitors (PIs), integrase inhibitors (INIs), and entry inhibitors (12, 13). Indeed, DRMs have been identified corresponding to each class of inhibitors, albeit at different proportions (12).

Since these DRMs and other genotypes/phenotypes are transmitted within the proviral genome, there has been a long history of techniques for predicting these traits from genomic sequence. This manuscript has chosen three diverse tasks from the HIV-1 field across the viral lifecycle as described in Figure 1 and below:

1. Protease inhibitor resistance – The protease gene is a 99 amino acid protein responsible for cleaving viral proteins that are translated as a single polyprotein into their active form; thus making it an ideal target for antiretroviral therapy. The Stanford HIV database maintains a prediction server for annotating HIV-1 sequences with known resistance mutations (14). Investigators have also developed predictive tools such as SHIVA, a decision tree classifier (15), generative machine learning algorithms (16), and the ANRS (Agence Nationale de Recherches sur le SIDA) algorithm, a well-regarded program for predicting protease receptor and reverse transcriptase resistance (17).

2. Coreceptor utilization – HIV-1 enters human cells by first binding the CD4 receptor and then recruiting one of two possible coreceptors: CCR5 or CXCR4. This allows for the entry into two different cellular reservoirs with CCR5 primarily responsible for T-cell infection and CXCR4 for macrophages (4, 18). This phenotype is primarily due to the 3rd variable loop of the envelope gene, the V3 loop. This 35 amino acid loop mediates the interaction between the envelope protein and coreceptor primarily at positions 13-21 (19). The earliest methods of coreceptor prediction used alignment-based methods to create a position-specific scoring matrix (PSSM) (20). Newer methodologies like decision trees (21) and XGboost (22) have been applied with great success.

3. Bodysite identification – Due to the ability of HIV-1 to infect immune cell populations, it spreads across the body rapidly and adapts to new compartments (2326). When exploring HIV-1 cure strategies such as “shock & kill” and anti-HIV CRISPR-Cas9 gene editing, it is important to surveil these compartments (2729). This task is traditionally tackled using phylogenetic methods in which the evolutionary relationship between sequences isolated from different bodysites is explored (3032). In our review of the relevant literature this prediction task has not been previously attempted.


Figure 1 A diagram for the biological processes explored. The left column depicts the basic viral lifecycle starting with entry through the binding of CD4 and cell-specific coreceptors, followed by unpackaging of the capsid, reverse transcription of viral RNA, integration into the host genome, subsequent translation of viral proteins and repackaging. Steps where antiretroviral medications interfere with the lifecycle are also noted. The middle panel indicates bodysites that are known to be anatomic reservoirs of latent viral replication. The right panel shows a flow diagram of the experiments and training procedures described in the text. Figure created with BioRender.com.

Using sequence data for machine learning (ML) techniques poses multiple challenges. First, most ML techniques require numeric inputs. This requires a conversion of the amino acid sequence into a numeric form that can be consumed by downstream algorithms. Techniques for doing this can be grouped into two categories: alignment-based and alignment-free.

Alignment-based methods require using alignment tools like MUSCLE (33), T-coffee (34), or minimap2 (35) to associate each position in the query with a consistent column. This is critical when using machine learning methods like SVM, decision trees, or other classic techniques as they assume that each column consistently represents the same physical property. Once aligned, a feature matrix can be constructed by converting each column of the alignment into a number, or set of numbers, either through one-hot encoding or encoding physiochemical properties of each amino acid at each position. This matrix can then be used like any other ML dataset.

However, the rapid mutation of HIV-1 leads to difficulties with alignment-based methods as the insertion and deletion rate of the virus leads to alignments with copious gaps. Alignment-free methods sidestep this problem by summarizing a protein of any length into a fixed length vector. Tools like MathFeature (36) and ProtDCal (37), perform this through calculation of physiochemical properties of the protein. However, this presupposes that these properties are relevant to the prediction task. K-mer based approaches count occurrences of fixed-length subsequences of the protein. This generates a fixed-length vector for each protein that compares the presence, absence, or quantity of short amino acid sequences. However, this requires one to balance the length of the k-mer with the sparsity of the downstream dataset. As each k-mer is a distinct feature in the matrix, it also does not account for the similarities between amino-acids.

Many of these problems are also present in related fields like natural language processing (NLP): not all sentences are the same length, words have different meanings in different contexts, and similar words can be used interchangeably. This has spurred a great deal of crosstalk between the NLP field and the genomic language processing (GLP) field (38, 39). Of particular interest to this manuscript are advances in artificial intelligence (AI) methods, particularly the Transformer models (40, 41). While the initial investment is large (42), once trained, models can be reused on multiple tasks (4345).

In 2021 a high-performance computing group led by Burkhard Rost set out to leverage Summit, the world’s second fastest computer, to accelerate GLP research. They leveraged 1096 GPU containing nodes to train a Bidirectional Encoder Representations from Transformers (BERT) model on a dataset of 393 million amino acids from Uniprot50, Uniref100, and the Big Fantastic Database (BFD) of human isolated metagenomic proteomes (46). This model architecture considers the entire protein at once (Bidirectional) and encodes proteins into fixed length vectors (Encoder Representation) for downstream predictions. Like all transformer style language models, it can be trained on unlabeled data for many tasks through a technique called Masked Language Modeling (MaskedLM). In this process, a subset of amino acids are ‘masked’ from the model during training and it is tasked with predicting them and its weights are updated through gradient back-propagation. Accomplishing this task pretrains the model for future tasks. The Rost-Lab showed that this model could be further refined in a process called “transfer learning” to predict new tasks such as subcellular localization, secondary structure, and enzyme activity (46).

This group released their prediction models into the Hugging Face Model Repository for open-source use (https://huggingface.co/models). The Hugging Face transformer library is a Python-based library for implementing state of the art AI models in a consistent, reproducible, and extensible fashion (47). Pretrained models can be downloaded with a single command and applied to new data with another Hugging Face transformer pipeline (47). The library also provides an interface for refining the pretrained models on new data (48). These studies reported herein applies these tools to the three HIV-1 prediction tasks described above. It explores the effects of pre-training, class weighting, and dataset size on the task. It also releases the models and datasets to the Hugging Face dataset and model hub for the community at large to accelerate future work. These studies also provide advice and expectations on adapting this to other applicable tasks.

2 Materials and Methods

The entire analysis pipeline was implemented as a Snakemake pipeline and is available on the public Github repository (49). This allows any researcher to download, reproduce, transform, or extend this analysis. The deep learning aspect of this pipeline requires access to sufficient computational resources. While the script auto-scales to the user’s available memory the training of the transformer models require at least 6GB of GPU RAM. Once trained, the models can be used on generally available hardware.

2.1 Explanation of Public Datasets

2.1.1 HIV-1 Full Genome Dataset

The Los Alamos National Laboratory HIV sequence (LANL) database maintains a standard dataset of 1,690 high-quality full-length genomes. The most recent version was downloaded (2016-Full-genome) on 12/21/2021 (https://www.hiv.lanl.gov). This was then processed using the LANL GeneCutter tool to extract and splice the DNA sequence of each gene (https://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html). The GeneCutter tool aligns the query to a gold-standard codon aligned multiple-sequence alignment using HMMER v 2.32. This ensures that any gene within the region is properly extracted in the proper reading-frame. These extracted DNA sequences were then translated into the appropriate protein sequence to the first stop-codon using the BioPython library Seq.translate function with a human codon table (50). The database was processed using the script workflow/scripts/process_lanl_flt.py.ipynb and deposited as a Hugging Face Dataset damlab/HIV_FLT.

2.1.2 Protease Drug Resistance

The high-quality interactions from the Stanford HIV Genotype-Phenotype database (51) were downloaded on 12/21/2021 and contained 1959 lines at the time of download. The file stores the amino-acid differences from the provided reference sequence, for example V30I|D46N, and an array of drug-susceptibility scores for each isolate. This was converted into a Hugging Face dataset by inferring the full protease sequence by exchanging the indicated amino acids from the pre-translated reference sequence and labeling any drug with a >4-fold increase in resistance as “True” conforming with the methodology described by Rhee et al. After filtering the dataset to only include the drugs FPV, IDV, NFV, and SQV, and dropping all missing items, there were 1733 total PR sequences for analysis with roughly half being resistant to at least one drug. The database was processed using the script workflow/scripts/process_stanford_pr.py.ipynb and deposited as a Hugging Face Dataset damlab/HIV_PI.

2.1.3 Coreceptor Tropism

V3-loop sequences were downloaded from the LANL database through the Search Interface (https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html) on 12/20/21. The query was generated by limiting the Subtypes to A, B, C, and D; selecting “list in field output” for the Coreceptor and Sample Tissue fields; and selecting V3 in the Genomic Region Selection box. This generated approximately 220,000 results at the time of search. The LANL search tools were used to align, trim, and return the selected sequences with the associated background info. The background information was parsed to create an independent binary variable each for CCR5 and CXCR4 binding status. This search generated 2935 sequences with 91% being CCR5 tropic and 23% labeled as CXCR4 tropic. The database was processed using the script workflow/scripts/process_lanl_v3.py.ipynb and deposited as a Hugging Face Dataset damlab/HIV_V3_coreceptor.

2.1.4 Bodysite Identification

Using the same V3 dataset mentioned above, the Sample Tissue field was aggregated so similar bodysites were grouped together. The grouping was performed as follows:

Periphery-tcell: plasma, PBMC, T cells, CD4+ T cells, resting CD4+ T cells, effector memory CD4+ T cells, transitional memory T cells, central memory T cells, serum, blood, lymph node, CD4+ T cell supernatant, lymph node CD4+ T cells, CD14+ monocytes, activated CD4+ T cells, naive CD4+ T cells, effector memory T cells, T-cell, CD8+ T cells, PMBC, PBMC supernatant, stem memory T cells, terminally differentiated T cells

Periphery-monocyte: lamina propria mononuclear cells, CD14+ monocytes, monocyte, CD16+ monocytes

CNS: brain, CSF, spinal cord, dendrites

Lung: lung, BAL, sputum, diaphragm

Breast-milk: breast milk

Gastric: colon, rectum, jejunum, ileum, GALT, rectal fluid, intestine, feces, stomach, choroid plexus, sigmoideum, gastric aspirate, esophagus

Male-genitals: semen, seminal plasma, foreskin, seminal cells, urethra, prostate, testis, prostatic secretion

Female-genitals: vaginal fluid, cervix, vagina, vaginal cells, cervicovaginal secretions

Umbilical-cord: umbilical cord plasma, placenta

Organ: liver, kidney, epidermis, thymus, pancreas, adrenal gland, spleen, bone marrow

Dropped sequence: supernatant, saliva, urine, meninges, skin tumor, qVOA, urine cells, breast milk supernatant, aorta, glioma

Then, sequences were grouped such that each unique sequence was annotated with all bodysites it was associated with. This allows a sequence to be annotated with multiple bodysites. Due to the over-representation of periphery-tcell tags, a random 95% were discarded. After processing, there were 5510 unique V3 sequences annotated with about half from the periphery-tcell tag and the rest ranging from 5-20% composition. The database was processed using the script workflow/scripts/process_lanl_v3.py.ipynb and deposited as a Hugging Face Dataset damlab/HIV_V3_bodysite.

2.2 Models

Four models were constructed to evaluate the real-world performance on these prediction tasks. The datasets above were split using 5-fold cross-validation with the folds preserved across the different model trainings. This allows for an honest comparison of each fold across each model.

2.2.1 Naive Models

First, a Dummy model was created using the sklearn.dummy.DummyClassifier class with stratified strategy. This model represents randomly guessing utilizing the known class distribution of the training data. For example, 91% of V3 sequences are CCR5 tropic, therefore this model will guess True 91% of the time. This naïve model represents the lowest reasonable bar to set for prediction tasks.

Next, a basic Random Forest Classifier was used as a biology naive model machine learning model. The variable length sequences were encoded into fixed length vectors using term frequency-inverse document frequency (TF-IDF). This technique creates a feature for each N-gram present in the database and encodes proteins by the sum of the number of times the N-gram occurs in the sequence divided by the number of times it occurs in the database. Sequences, V3 or PR, were transformed with the sklearn.feature_extraction.text.TfidfVectorizer class using a ngram_range = (1,3) and analyzer=‘char’. This creates a vector of each k-mer sized 1-3 with the count normalized by the inverse of its prevalence in the rest of the training dataset. This naturally highlights k-mers that are over-represented in the sequence. After a basic variance threshold to remove invariant columns, a classifier was built with the sklearn.ensemble.RandomForestClassifier class using the default parameters. This model represents a purely mathematical approach to the prediction problem. These models were trained using the workflow/scripts/sklearn_train.py script using scikit-learn version 1.0.2.

2.2.2 Transformer Models

Biologically informed Transformer models were imported using the Hugging Face AutoModelForSequenceClassification tool. The Hugging Face library provides a Trainer module for refining pretrained models on new datasets. However, the default Trainer cannot accommodate multi-label prediction problems like those posed here. As such, a CustomTrainer class was developed per recommendations in a related forum post (52) and documentation (53). In brief, there were three changes implemented. First, the CategoricalCrossEntropy loss function was replaced with a BinaryCrossEntropy function; this allows the model to predict multiple True values in each field (i.e. a protease sequence can be resistant to zero, one, or potentially all drugs). Second, as the class labels are not equally distributed for each field, 91% of V3 sequences are CCR5 tropic weighting was used to balance the imbalanced classes as described in the BCELoss documentation. Finally, the loss from each field of the prediction was added together in equal weights. This CustomTrainer is implemented in the module workflow/scripts/common.py. Training was performed using the same parameters across all models: learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule and continued until 3 consecutive epochs did not improve the validation loss metric.

For these experiments, the Prot-Bert-BFD model from the RostLab was used as the basis for pretraining (46), commit-tag 6c5c8a5. It has been trained across a wide array of proteins and is easily available in the Hugging Face library. This pre-trained model was used as the initial weights for training each of the three models described above. This is implemented in the script workflow/scripts/huggingface_train.py.

Previous research has shown that refining language models on domain-specific sequence can improve downstream performance (4345). Using the whole genome sequence data described above, the Prot-BERT-BFD model was refined using MaskedLM training. In this task, a random set of amino acids are masked from the model, which is then asked to predict them. A training script was adapted from the HuggingFace transformers run_mlm.py script (54). Pre-training was performed by concatenating all protein sequences from the Full HIV Genome Dataset described above and chunked into 256 amino acid segments. 80% of the sequences were used for training and 20% were reserved for validation. This is implemented in the script workflow/scripts/huggingface_lm.py using transformers v4.15.0.

2.3 Conceptual Error

Conceptual error was calculated using a set of reference mutations known from other sources of information to induce a known effect. It relies on our knowledge of the structure-to-function relationships involved in these processes and quantifies the biological “unexpectedness” of a model’s output.

Protease resistance conceptual error was measured by generating a dataset of known protease inhibitor resistance mutations. Utilizing the Stanford HIV database mutation explorer, 10 mutations were identified as inducing a >4-fold increase in drug resistance when acting alone: D30N, V32I, M46I, M46L, G48V, I54V, V82F, I84C, N88S and L90M. For each sequence in the testing dataset of each fold, they were examined for these mutations. If absent, they were introduced; if present they were removed. Then, the original and altered sequences were inferred by each trained model. The logit output of each pair was subtracted and squared. Then, the squared-error for pairs which introduced resistance mutations that increased resistance were set to zero; as were pairs which removed resistance mutations that decreased resistance. The resulting masked vector was then averaged. This error represents the mean-squared error change in the unexpected direction.

V3 coreceptor prediction can also be interrogated in this way. Structural alignments by Fouchier et al. (55) have demonstrated an 11/24/25 rule: if there is a positive amino acid at positions 11, 24, or 25 of the V3 loop then it can bind to CXCR4, otherwise it binds to CCR5. Utilizing the same switching and error calculation strategy described above, pairs of sequences were made with S11H, S11R, S11K, G24H, G24R, G24K, E25H, E25R, and E25K.

As there are no known structure-function relationships that can be used as a gold-standard for bodysite identification, it was excluded from the conceptual error calculations.

2.4 Statistical Comparisons

As each prediction task used the same 5-fold cross-validation scheme, it is possible to compare different models trained on the same 80% and evaluated on the last 20%. For this reason, a paired t-test provides the most power to evaluate differences. For each comparison below, a paired t-test was implemented with scipy.stats.ttest_rel (v1.7.3). As these pairwise comparisons result in many tests per experiment, a Bonferroni correction was used to correct for false positives. A corrected p<0.05 was used as a threshold for declaring statistical significance.

3 Results

3.1 Dataset Release

This manuscript publicly releases four HIV-focused datasets. These have been prepared for other researchers by conforming to the Hugging Face Dataset Hub style. This allows the datasets to be downloaded using a simple command like:

from datasets import load_dataset

hiv_dataset = load_dataset(‘damlab/HIV_FLT’)

which are then available as high-speed data objects for downstream use.

As shown in Figure 2A, the whole genome dataset contains a mixture of genes of the correct length as well as those with premature stop codons, with 33.3% of genomes contained at least one gene with a premature stop-codon. When concatenated, this dataset contains 3.9 million characters, approximately 1% of the size of the original BFD training dataset of 393 million characters (46). The classification datasets are independent from the genome dataset as these full-genomes lack drug-resistance, coreceptor binding type, or tissue isolation information. As such, three datasets have been prepared for the three classification tasks relevant to HIV biology. Figure 2B shows the prevalence of drug resistance in Protease sequences across four drugs from the Stanford HIV Database. Out of the 1733 Protease sequences with known drug resistance, 55.8% of sequences have resistance to at least one drug, while 28.0% have resistance to all four. Figure 2C describes the profile of body-sites where 5510 unique V3 sequences have been isolated with 28.3% isolated from multiple locations. A partially overlapping set of 2935 V3 sequences contained coreceptor information with the majority being CCR5 binding 92.1%, 23.9% CXCR4 binding, and 16.0% dual tropic as shown in Figure 2D. Over 200,000 V3 sequences were discarded as having neither body-site nor coreceptor information.


Figure 2 Description of publicly released datasets. (A) The length of each translated sequence is shown as a heatmap with darker regions indicating a greater concentration of sequences at that length. The proteome represents the total length of all translated sequences. (B) The number of protease sequences with observed resistance (orange) and no resistance (green) to each of four drugs. MultiDrug resistance represents the sum of individual drug resistances and is indicated by the key above. (C) The number of V3 sequences observed at each body-site (orange) and not observed (green) to each of the grouped sites. MultiSite represents the total number of sites that a unique V3 sequence was observed in. (D) The number of V3 sequences annotated with known coreceptor usage with those able to bind the coreceptor in orange and those not able to bind the coreceptor in green. DualTropic represents sequences that were determined to utilize both coreceptors in orange and those only able to bind to one are shown in green.

3.2 Classification Tasks

Table 1 and Figure 3 show the precision, recall, and accuracy of each of the trained models when a standard 50% cutoff was used to binarize the predictions. When considering accuracy, the percentage of correctly called sequences from the validation fold, the TF-IDF model best predicted protease resistance mutations with a 91.3% accuracy, while the HIV-BERT model performed the best at coreceptor prediction and bodysite identification with a 92.5% and 89.1% accuracy respectively. However, the TF-IDF and HIV-BERT models performed within 3% mean accuracy across all tasks, which is less than 1 standard deviation when considering models across folds.


Table 1 Average model performance metrics across 5-fold cross-validation.


Figure 3 Pretraining improves prediction metrics across all tasks. The accuracy (A), F1-score (B), precision (C), and AUC (D) are shown for each model and each prediction task. The bar indicates the mean value and the error bars represent the 95% confidence interval of 5-fold cross validation. The Null model is shown in red, the TF-IDF model is show in blue, the Prot-BERT model in green, and the HIV-BERT model is in purple. The test-comparison bars represent the results of a paired t-test between each group; undrawn comparisons p<0.05, *(0.05<p<=0.01), **(0.01<p<=0.001), ***(0.001<p<=0.0001), ****(p<=0.00001). A Bonferroni correction based on all possible tests in the figure.

While commonly measured, accuracy may not be the ideal metric for comparing model performance in a clinical context. Depending on the situation, a researcher may wish to trade precision for recall by altering the cutoff between a positive and negative prediction. To account for this, the area under the receiver-operator characteristic curve (AUC) was calculated and shown in Table 1 and Figure 3D. Here the TF-IDF model performed the best across all tasks but again this was statistically significant in the protease resistance and tissue tasks but not coreceptor prediction.

As these tasks are multi-class predictions, it is important to examine the ability of the model to perform on each class. Figure 4 shows the accuracy of each model across each of the classes of each prediction task. When considering each model individually, there is a consistent level of prediction ability across classes with AUC scores within 5%. In the protease resistance task, TF-IDF and HIV-BERT have an increased ability to predict IDV relative to SQV (p<0.05). When examining the tissue prediction task, the TF-IDF and HIV-BERT did show differences across tasks with breast-milk having the highest accuracy and periphery-tcell having the lowest.


Figure 4 Area under the curve (AUC) scores for individual fields of drug resistance and coreceptor prediction are consistent, but tissue identification is not. The model AUC scores were disambiguated for each field of each prediction task. Each task is shown in (A) protease drug resistance, (B) coreceptor prediction, and (C) tissue isolation with colors indicating the prediction field. The bar indicates the mean value and the error bars represent the 95% confidence interval of 5-fold cross validation. The test-comparison bars represent the results of a paired t-test between each group; undrawn comparisons p<0.05, *(0.05<p<=0.01), **(0.01<p<=0.001), ***(0.001<p<=0.0001), ****(p<=0.00001). A Bonferroni correction based on all possible tests in the figure.

3.3 HIV-BERT Pretraining

Utilizing the full genome dataset described above, the RostLab/prot_bert_bfd model was refined for HIV specific tasks. This pretraining reduced the masked token cross-entropy loss from 1.85 nats for the unrefined model to 0.36 nats. This indicates that the average prediction for the correct amino acid improved from approximately 15% to 70%. This was visualized by subjecting the consensus subtype B V3 loop CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC (19) to single amino acid masking across all 35 positions. Figure 5 shows the difference between the unrefined model and the HIV-refined model across this masking task. The HIV-refined model has a greater predicted probability for the consensus amino acid at most positions (31/35) compared to the unrefined model.


Figure 5 Full genome pretraining of the Prot-BERT model increases HIV-1 sequence awareness. The probability of each consensus amino acid of the V3 loop when performing masked prediction task. Green bars represent the prediction from the Prot-BERT model and red bars represent the HIV-BERT model.

3.4 Conceptual Error

Each sequence in the protease dataset was subjected to single amino acid mutations to either add or remove a known DRM, as listed in the methods above. Figure 6 shows the effect of those substitutions on the model’s prediction of the likelihood of drug resistance with red dots indicating a gain of a DRM and gray dots indicating a loss. A well performing model would show that an addition of a DRM (red dots) increases the resistance likelihood and would therefore be above the x=y line. The converse would be true when removing a known DRM (gray dots). When examining Figure 6, there are many instances in the TF-IDF and Prot-BERT models in which mutations mislead the prediction of the models; red dots that are below the line or grey dots that are above the line. This is noticeably less prevalent in the HIV-BERT model.


Figure 6 Transformer models accurately predict the outcome of drug resistance mutations (DRMs). Each sequence in the resistance dataset was mutated by each of the ten DRMs as described in the methods and each mutated sequence is shown as a single point; there are ten points per sequence. Sequences where a DRM was added are shown in red which should increase the probability of resistance. Points in grey indicate sequences where a DRM was removed and should decrease in probability of resistance. The dashed black line shows the x=y line and the assumption of no change due to the mutation. Each column of axes shows the predictions using each of the three models. With the first column indicating the TF-IDF model; the second indicating the Prot-BERT model; and the third showing the HIV-BERT results. Each row represents the prediction on a different drug in the order FPV, IDV, NFV, and SQV.

A similar pattern can be observed in Figure 7 when adding mutations known to increase CXCR4 binding (orange dots) and removing mutations known to indicate CXCR4 binding (green dots). The TF-IDF model also has poor confidence in classifying CXCR4 values after the addition of promoting mutations with few mutated sequences increasing past a 50% threshold. Prot-BERT and HIV-BERT do not suffer from this limitation.


Figure 7 Transformer models accurately predict the outcome of CXCR4 enhancing mutations. Each sequence in the V3 coreceptor dataset was mutated by each of the nine CXCR4 promoting mutations as described in the methods and each mutated sequence is shown as a single point; there are nine points per sequence. Sequences where a CXCR4 promoting mutation was added is shown in orange which should increase the probability of CXCR4 binding and decrease the probability of CCR5 binding. Points in green indicate sequences where a CXCR4 promoting mutation was removed and should decrease CXCR4 binding and increase CCR5 binding. The dashed black line shows the x=y line and the assumption of no change due to the mutation. Each column of axes shows the predictions using each of the three models with the first column indicating the TF-IDF model; the second indicating the Prot-BERT model; and the third showing the HIV-BERT. Each row represents the prediction on a different coreceptor with CCR5 on the top and CXCR4 on the bottom.

In order to quantify this, a mean-squared error was calculated as described above such that only predictions in the unexpected direction were penalizing. Figure 8A shows the results of this analysis grouped by model. In the coreceptor prediction task, all models improve upon the naïve predictions but are not statistically significantly different from each other. However, for the protease resistance prediction task the Prot-BERT and HIV-BERT models are more statistically significant than the TF-IDF model but not from each other. Across all tasks, Figures 8B, C, the HIV-BERT model outperforms all other models but not to a statistically significant level.


Figure 8 HIV-1 pretraining decreases conceptual error. (A) The average conceptual error of each model is shown across the coreceptor and protease resistance tasks for each model. (B) The conceptual error was disambiguated across each field of the coreceptor model. (C) The conceptual error was disambiguated across each field of the protease resistance model. Across all axes the Null model is shown in red, the TF-IDF model is shown in blue, the Prot-BERT model shown in green and the HIV-BERT model shown in purple. The test-comparison bars represent the results of a paired t-test between each group; undrawn comparisons p<0.05, *(0.05<p<=0.01), **(0.01<p<=0.001), ***(0.001<p<=0.0001), ****(p<=0.00001). A Bonferroni correction based on all possible tests in the figure.

4 Discussion

Over the past decade there has been an explosion in open-source natural language processing tools, especially in the AI field. However, biological datasets are rarely in a form amenable to easy implementation. This is particularly true of specialized datasets like those discussed in this analysis. Creating publicly accessible datasets in an easily retrieved form will help bridge the gap between AI and biological researchers. The study reported herein releases four HIV specific datasets to the public for new researchers to iterate upon. This is coupled with the release of a generically-trained HIV-BERT model as well as the three task-specific refinements discussed above. It is our hope that depositing these in the open-source Hugging Face transformers library will allow for a democratization of research and prevent issues like link-death, a common problem when attempting to build upon the work of others.

Examining previous machine learning attempts at these tasks reveals that our technique is competitive with existing methods. Early work in DRM mutation by Beerenwinkel et al. achieved sensitivities ranging from 58-92% and specificities ranging from 62-92% using decision trees across different inhibitors (56). Later work by Heider et al. was able to achieve AUCs ranging from 0.79-0.89 using chains of classifiers (57). Recent work published in 2020 by Steiner et al. tested older AI techniques like multi-layered perceptrons and recurrent networks achieving AUCs from 0.8 to 0.97 (58). The HIV-BERT model released on the Hugging Face repository has an average AUC of 0.94, making it competitive with current state of the art techniques. When examining our coreceptor prediction model we perform competitively with recent work by Chen et al. in 2019 using the XGBoost method (22).

It is important to note that none of the previous methods used the same dataset for training and validation, making direct comparison difficult. This is likely because HIV-1 sequence data is sequestered within databases that require domain specific knowledge to access and process for modern machine learning and AI tools. Our release of a standard-formatted dataset and processing scripts will help to alleviate these difficulties.

To our knowledge, the tissue identification task has never been posed as a classification problem in this manner. Previous research has used phylogenetic methods to identify the level of compartmentalization by finding mutations unique to a single tissue type when examining isolates from diverse tissues (59, 60). However, when performing routine surveillance sequencing of patients, invasive methods like lumbar punctures for CSF or bronchiolar lavage for lung sampling are impractical. Framing the problem as a classification task allows for sequencing of virus from peripheral blood to detect recent reactivation events from these latent reservoirs. This will be useful when evaluating the success of HIV cure strategies such as anti-HIV-1 CRISPR/Cas9 gene editing or latency reactivation (27).

In our analysis, a simple k-mer and tree-based prediction, TF-IDF, was able to outperform advanced AI models in many of the tasks when considering classic metrics such as accuracy and AUC. However, when it was subjected to biological investigation by introducing mutations with known function, it performed the worst. This reflects poor generalization and indicates that the TF-IDF model may be “memorizing” patterns that are correlated with the prediction task, but not causative. BERT-style models, pre-trained on millions of sequences, can distinguish between functional changes in the sequence and random mutations. This result indicates a need for detailed evaluation of prediction models to ensure that they reflect biological reality. The introduction of conceptual error described above is an initial attempt at performing this at-scale for HIV-1 tasks. Future work should expand this by incorporating deep mutational scanning datasets (61).

When examining the BERT style models across related tasks, there is a notable level of consistency across predicted fields. The AUC for each drug and coreceptor fall within 5% of each other. This may indicate that information is being shared within the model across these related fields, a common strength of transformer models. This pattern was not seen in the tissue classification task, this may indicate more sequences are needed or that other areas of the HIV-1 genome should be examined. Other published research has shown that regions across the genome have a role to play in tissue specificity and cellular tropism. It may be that the V3 loop is important for cellular tropism but the accessory proteins Vpr, Tat, and Nef may play a greater role in tissue specificity (62, 63). Future work should explore other proteins using similar prediction tasks.

The trained models have multiple biological applications. The PI- and Tropism-trained models are immediately applicable to predicting the most useful antiretroviral drug for a given patient. They can also be utilized for exploring the structure-function relationship between inhibitors and viral sequences. The bodysite prediction task allows for the tracking of leakages of strains from viral reservoirs. This may lead to new avenues of research around the exchange of viral quasispecies between compartments. Finally, the HIV-BERT model functions as a base that future HIV predictive models can be built upon.

Taken together, this work shows that AI models, particularly transformers, are well suited to biological prediction tasks. Refining the model on unlabeled sequences improves prediction accuracy with minimal upstream cost; the HIV-BERT model discussed above can be downloaded and finetuned for new prediction tasks with minimal additional effort and has been successful with fewer than 2000 labeled sequences. This puts many biologically relevant HIV-1 prediction tasks within reach and can accelerate any additional protein-to-function prediction task and will be useful to the HIV community as a whole.

Data Availability Statement

All trained models and datasets are available at https://huggingface.co/damlab. The training scripts and the code used to generate figures is available at https://github.com/DamLabResources/hiv-transformers.

Author Contributions

Conceptualization, WD; methodology, WD; software, WD; validation, WD; formal analysis, WD; investigation, WD; resources MN, BW, and WD; data curation, WD; writing—original draft preparation, WD; writing—review and editing, WD, RL, JE, MC, DS, KK, MN, and BW; visualization, WD; supervision, WD; project administration, WD; funding acquisition, MN, BW, and WD. All authors have read and agreed to the published version of the manuscript.


The authors were funded in part by the Public Health Service, National Institutes of Health, through grants from the National Institute of Mental Health (NIMH) R01 MH110360 (Contact PI, BW), the NIMH Comprehensive NeuroAIDS Center (CNAC) P30 MH092177 (KK, PI; BW, PI of the Drexel subcontract involving the Clinical and Translational Research Support Core), the Ruth L. Kirschstein National Research Service Award T32 MH079785 (PI, Tricia Burdo; BW, Principal Investigator of the Drexel University College of Medicine component) and from the National Institute of Neurological Disorders and Stroke (NINDS) R01 NS089435 (PI, MN).

Author Disclaimer

The contents of the paper are solely the responsibility of the authors and do not necessarily reflect the official views of the NIH.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.



This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.frontiersin.org/)