Authors
- Xin Bai
- Jian-an Jia
- Meng Fang
- Shipeng Chen
- Xiaotao Liang
- Shanfeng Zhu
- Shuqin Zhang
- Jianfeng Feng
- Fengzhu Sun
- Chunfang Gao
Abstract
Hepatitis B virus (HBV) infection is a common problem worldwide, especially in China. In regions with high HBV prevalence, more than 60–80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection. Although traditional Sanger sequencing has been used extensively to investigate HBV sequences, next-generation sequencing (NGS) is becoming more common. However, it is unknown whether word pattern frequencies of HBV reads obtained by NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of HBV from 94 HCC patients and 45 chronic HBV (CHB) infected individuals. Word pattern frequencies were calculated from the sequence data of each individual and compared between individuals using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). We showed that analyzing HBV sequences using word patterns is highly powerful. Our key findings include that the first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first groups the individuals according to their major genotypes and then according to their HCC status. In cross-validation, the area under the receiver operating characteristic curve (AUC) was around 0.88 for KNN and 0.92 for SVM. In an independent data set of 46 HCC patients and 31 CHB individuals, SVM achieved an AUC of 0.77. It was further shown that 3000 reads per individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy.
Therefore, our study clearly shows that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.
Author summary
HBV infection can lead to many liver complications including hepatocellular carcinoma (HCC), one of the most common liver cancers in China. High-throughput sequencing technologies have recently been used to study the genotype sequence compositions of HBV infected individuals and to distinguish chronic HBV (CHB) infection from HCC. We used NGS to sequence the pre-S region of a large number of CHB and HCC individuals and designed novel word pattern based approaches to analyze the data. We have several surprising key findings. First, most HBV infected individuals contained mixtures of genotype B and genotype C sequences. Second, multi-dimensional scaling (MDS) analysis of the data showed that the first principal coordinate was closely associated with the fraction of genotype B (or C) sequences and the second principal coordinate was highly associated with the probability of HCC. Third, we designed K-nearest neighbors (KNN) and support vector machine (SVM) based classifiers for CHB and HCC with high prediction accuracy. The results were validated in an independent data set.
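The word-pattern comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word length k=3, the function names, the toy reads, and the placeholder labels are all assumptions made for illustration, and the record does not state which word lengths or preprocessing steps the study used. The sketch pools the reads of one individual into a vector of word (k-mer) frequencies, compares two individuals by Manhattan distance, and makes a KNN-style majority vote over labelled vectors.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k=3):
    """Relative frequency of every DNA word of length k, pooled over all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing ambiguous bases
                counts[word] += 1
    total = sum(counts.values())
    # Fixed word order so that vectors from different individuals are comparable.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_predict(query, labelled, n_neighbors=3):
    """Majority label among the n_neighbors labelled vectors closest to query."""
    nearest = sorted(labelled, key=lambda item: manhattan(query, item[0]))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy reads, invented for illustration only (not real HBV data).
freq_a = word_frequencies(["ACGTACGT", "ACGTAC"])
freq_b = word_frequencies(["ACGTACGT", "ACGTAC"])
freq_c = word_frequencies(["TTTTTTTT"])

# Placeholder labels; in the study the classes are CHB and HCC.
labelled = [(freq_a, "CHB"), (freq_b, "CHB"), (freq_c, "HCC")]
query = word_frequencies(["ACGTACG"])
predicted = knn_predict(query, labelled, n_neighbors=1)
```

A pairwise matrix of such distances is the kind of input that PCoA and hierarchical clustering operate on. The choice of k trades resolution against sparsity: longer words distinguish sequences more finely but need more reads per individual to estimate their frequencies reliably.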
Suggested Citation
Xin Bai & Jian-an Jia & Meng Fang & Shipeng Chen & Xiaotao Liang & Shanfeng Zhu & Shuqin Zhang & Jianfeng Feng & Fengzhu Sun & Chunfang Gao, 2018.
"Deep sequencing of HBV pre-S region reveals high heterogeneity of HBV genotypes and associations of word pattern frequencies with HCC,"
PLOS Genetics, Public Library of Science, vol. 14(2), pages 1-20, February.
Handle: RePEc:plo:pgen00:1007206
DOI: 10.1371/journal.pgen.1007206