Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition

Journal of Biological Chemistry, 2004. Citations: 369. Top venue.
Machine Learning in Bioinformatics; RNA and protein synthesis mechanisms; Computational Drug Discovery Methods

Abstract

Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g., diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of amino acid and dipeptide composition from a sequence of receptors using support vector machines. The training and testing were done on a non-redundant data set of 282 proteins obtained from the NucleaRDB data base (1: Horn F., Vriend G., Cohen F.E. Nucleic Acids Res. 2001; 29: 346-349). The performance of all classifiers was evaluated using a 5-fold cross-validation test. In the 5-fold cross-validation, the data set was randomly partitioned into five equal sets, and each classifier was evaluated five times, each time testing on one distinct set while keeping the remaining four sets for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracies of the amino acid composition-based and dipeptide composition-based classifiers were 82.6 and 97.5%, respectively. Therefore, our results show that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on the above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred.
The availability of sequence data for different genomes in recent years has increased the demand for computational tools that can recognize new proteins from these data. The recognition of nuclear receptors is crucial, because many of them are potential drug targets for developing therapeutic strategies for diseases like breast cancer and diabetes (2: Robinson-Rechavi M., Laudet V. Methods Enzymol. 2003; 364: 95-118). Nuclear receptors are one of the most abundant classes of transcriptional regulators, which regulate diverse functions during reproduction, metabolism, and development.
Nuclear receptors function as ligand-activated transcription factors, providing a direct link between the signaling molecules that control these processes and transcriptional responses (3: Robinson-Rechavi M., Escriva Garcia H., Laudet V. J. Cell Sci. 2003; 116: 585-586). The nuclear receptors share a common structural organization. All nuclear receptors consist of six distinct regions or domains as follows: highly variable N-terminal and C-terminal regions (A/B and F domains) that contain one or more transactivation regions; a central, well conserved DNA binding domain (C); a non-conserved hinge region (D) that contains a nuclear localization signal (NLS); and a moderately conserved ligand binding domain (E) (4: Laudet V., Gronemeyer H. The Nuclear Receptors FactsBook. Academic Press, London, 2002). The DNA binding domain (C region) of nuclear receptors consists of two zinc fingers, which act as a signature for this superfamily (5: Moras D., Gronemeyer H. Curr. Opin. Cell Biol. 1998; 10: 384-391). The presence of these zinc fingers facilitates the recognition of nuclear receptors from a genome sequence using simple similarity-based search tools like BLAST and FASTA (6: Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. J. Mol. Biol. 1990; 215: 403-410; 7: Pearson W.R., Lipman D.J. Proc. Natl. Acad. Sci. U. S. A. 1988; 85: 2444-2448). On the other hand, the major limitation of these search tools is that they are not able to recognize the subfamilies of nuclear receptors. The nuclear receptors have been classified into seven subfamilies according to the NucleaRDB database, which include thyroid and estrogen hormone-like receptors (1).
However, classification of these subfamilies by using phylogeny-based or BLAST-based tools is difficult due to a scarcity of data for some subfamilies. Thus, there is a crucial need for methods that will enable the automated assignment of nuclear receptor subfamilies. In this report, we have made an attempt to develop a method for recognizing the subfamilies of nuclear receptors. We were able to design a method for recognizing four subfamilies of nuclear receptors: (i) thyroid hormone-like (TR, RAR, and ROR); (ii) HNF4-like (HNF4, RXR, TLL, COUP, and USP); (iii) estrogen-like (ER, ERR, GR, MR, PR, and AR); and (iv) Fushi tarazu-F1-like (SF1, FTF, and FTZ-F1). Sequences for the other three subfamilies are not available in significant number (<10). The classification and assignment of nuclear receptors to various subfamilies was done on the basis of amino acid composition and dipeptide composition. Amino acid and dipeptide compositions are simple approaches for producing patterns of fixed length from protein sequences of varying length (8: Shepherd A.J., Gorse D., Thornton J.M. Proteins. 2003; 50: 290-302). In the past, amino acid composition has been used to predict the structural class of domains and the subcellular localization of proteins (9: Hua S., Sun Z. Bioinformatics. 2001; 17: 721-728; 10: Chou K.C., Cai Y.D. J. Biol. Chem. 2002; 277: 45765-45769; 11: Bhasin M., Raghava G.P.S. Nucleic Acids Res. 2004; (in press)). The dipeptide composition is also widely used to encapsulate the global information and gives a fixed pattern length of 400. In the past, dipeptide composition has been used for predicting the subcellular localization of proteins (11) and for fold recognition (12: Grassmann J., Reczko M., Suhai S., Edler L., Lengauer T., Schneider R., Bork P., Brutlag D.L., Glasgow J.I., Mewes H.-W., Zimmer R. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Heidelberg, Germany, August 6-10, 1999. American Association for Artificial Intelligence, Menlo Park, CA, 1999: 106-112; 13: Reczko M., Bohr H. Nucleic Acids Res. 1995; 22: 3616-3619).

In this study, support vector machines (SVMs) were applied to classify nuclear receptors. (The abbreviations used are: SVM, support vector machine; MCC, Matthews correlation coefficient; RI, reliability index.) SVMs are a relatively new type of statistical learning method that has proven to be particularly attractive for biological analysis due to its ability to handle large data sets and avoid overfitting. SVMs have been shown to perform well in multiple areas of biological analysis, including classification of G-protein-coupled receptors (GPCRs) (14: Karchin R., Karplus K., Haussler D. Bioinformatics. 2002; 18: 147-159) and enzyme families (15: Cai C.Z., Han L.Y., Ji Z.L., Chen Y.Z. Proteins. 2004; 55: 66-76), analysis of protein functions and types (16: Cai Y.D., Ricardo P.W., Jen C.H., Chou K.C. J. Theor. Biol. 2004; 226: 373-376; 17: Cai C.Z., Wang W.L., Sun L.Z., Chen Y.Z. Math. Biosci. 2003; 185: 111-122), and prediction of RNA-binding proteins (18: Han L.Y., Cai C.Z., Lo S.L., Chung M.C., Chen Y.Z. RNA. 2004; 10: 355-368). The overall accuracies of the amino acid and dipeptide composition-based classifiers are 82.6 and 97.5%, respectively. The Matthews correlation coefficient (MCC) of the dipeptide composition-based classifier is 0.96, which is significantly higher than that of the amino acid composition-based classifier.
MCC is a better parameter for evaluating the performance of a method, as it accounts for both over- and under-predictions. The performance of both classifiers has been estimated through a 5-fold cross-validation test. It was found that various subfamilies of nuclear receptors are correlated with amino acid or dipeptide composition, implying that the subfamilies of nuclear receptors are predictable with high accuracy if good training data can be established. The method is available via the World Wide Web at www.imtech.res.in/raghava/nrpred.

Data Set—The data for four subfamilies of nuclear receptors were obtained from the NucleaRDB data base available at www.receptors.org/NR/ (1). All entries not marked as fragments were extracted from the data base by text parsing. The initial data set had 577 sequences belonging to four subfamilies of nuclear receptors. Redundancy was reduced so that no sequence had ≥90% sequence identity with any other sequence in the data set, using PROSET software (19: Brendel V. Math. Comput. Model. 1992; 16: 37-43). The final data set contains 282 sequences belonging to different subfamilies of nuclear receptors, as shown in Table I.

Table I. The number of sequences belonging to each nuclear receptor subfamily

Nuclear receptor subfamily    No. of protein sequences
Thyroid hormone-like          114
HNF4-like                     72
Estrogen-like                 75
Fushi tarazu-like             21

Design and Implementation of the Prediction System—The prediction of subfamilies of nuclear receptors is a multi-class classification problem. In this case, the number of subfamilies of nuclear receptors was four. To handle this multi-class situation, we designed a series of binary SVMs. For N-class classification, N SVMs were constructed.
The ith SVM was trained with all samples of the ith subfamily labeled as positive and the samples of all other subfamilies labeled as negative. The SVMs trained in this way are referred to as 1-v-r (one-versus-rest) SVMs (9). In this classification approach, each unknown protein receives four scores. An unknown protein is classified into the subfamily that corresponds to the 1-v-r SVM with the highest output score.

Support Vector Machine—The SVMs were implemented using freely downloadable software, SVM_light, written by T. Joachims (20: Joachims T., Schölkopf B., Burges C.J.C., Smola A.J. Advances in Kernel Methods-Support Vector Learning. MIT Press, Cambridge, MA, 1999: 169-184). This software enables users to define a number of parameters as well as to choose an inbuilt kernel, such as a radial basis function (RBF) or a polynomial kernel (of given degree). In this study, all of the parameters of a kernel were kept constant, except for the regularization parameter C. The experiments were conducted using various types of kernels, such as the polynomial and the radial basis function kernels. The SVMs require a fixed number of inputs for training, thus necessitating a strategy for encapsulating the global information about proteins of variable length in a fixed length format. The fixed length format was obtained from protein sequences of variable length using amino acid and dipeptide composition.

Amino Acid Composition—Protein information can be encapsulated in a vector of 20 dimensions using the amino acid composition of the protein. In the past, this approach has been used for predicting the subcellular localization of proteins (9; 21: Reinhardt A., Hubbard T. Nucleic Acids Res. 1998; 26: 2230-2236).
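The 1-v-r labeling and decision rule described under Design and Implementation of the Prediction System can be sketched in Python as follows. This is a minimal illustration, not the authors' SVM_light pipeline: the function names, the subfamily label strings, and the example scores are hypothetical, and the SVM training itself is left out.

```python
from typing import Dict, List, Sequence

# Hypothetical label strings for the four subfamilies named in the text.
SUBFAMILIES = (
    "thyroid hormone-like",
    "HNF4-like",
    "estrogen-like",
    "Fushi tarazu-F1-like",
)


def one_vs_rest_labels(labels: Sequence[str], positive: str) -> List[int]:
    """Relabel a multi-class data set for the 1-v-r SVM of one subfamily.

    Samples of `positive` become +1, samples of every other subfamily -1.
    """
    return [1 if label == positive else -1 for label in labels]


def assign_subfamily(scores: Dict[str, float]) -> str:
    """Pick the subfamily whose 1-v-r SVM produced the highest output score."""
    missing = [name for name in SUBFAMILIES if name not in scores]
    if missing:
        raise ValueError(f"missing SVM scores for: {missing}")
    return max(SUBFAMILIES, key=lambda name: scores[name])


if __name__ == "__main__":
    # Made-up decision values for one unknown protein, one per binary SVM.
    example_scores = {
        "thyroid hormone-like": -0.8,
        "HNF4-like": -1.1,
        "estrogen-like": 1.3,
        "Fushi tarazu-F1-like": -0.2,
    }
    print(assign_subfamily(example_scores))
```

Each unknown protein is scored by all four binary classifiers, so ties or uniformly negative scores still yield a prediction; the gap between the two highest scores is what the reliability index later builds on.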
The amino acid composition is the fraction of each amino acid type within a protein. The fractions of all 20 natural amino acids were calculated by using Equation 1,

$$\text{Fraction of } aa_i = \frac{\text{total number of amino acids of type } i}{\text{total number of amino acids in protein}} \qquad \text{(Eq. 1)}$$

where i is a specific type of amino acid (aa).

Dipeptide Composition—The dipeptide composition was used to transform the variable length of proteins to fixed length feature vectors. Dipeptide composition has been used earlier by Grassmann et al. (12) and Reczko and Bohr (13) for the development of fold recognition methods. We adopted the same dipeptide composition-based approach in developing an SVM-based method for predicting subcellular localization of eukaryotic proteins (11). The dipeptide composition gave a fixed pattern length of 400. Dipeptide composition encapsulates information about the fraction of amino acids as well as their local order.
It was calculated using Equation 2,

$$\text{Fraction of } dep(i) = \frac{\text{total number of } dep(i)}{\text{total number of all possible dipeptides}} \qquad \text{(Eq. 2)}$$

where dep(i) is one dipeptide i of 400 dipeptides.

The performance of all classifiers was evaluated through 5-fold cross-validation. In 5-fold cross-validation, the data set was partitioned randomly into five equally sized sets. The training and testing of each classifier was carried out five times, using one distinct set for testing and the other four sets for training. The performance of the classifiers was evaluated by accuracy and the MCC for each subfamily of nuclear receptors, as described by Hua and Sun (9) and shown in Equations 3 and 4,

$$\text{Accuracy}(x) = \frac{p(x)}{exp(x)} \qquad \text{(Eq. 3)}$$

$$\text{MCC}(x) = \frac{p(x)\,n(x) - u(x)\,o(x)}{\sqrt{[p(x)+u(x)]\,[p(x)+o(x)]\,[n(x)+u(x)]\,[n(x)+o(x)]}} \qquad \text{(Eq. 4)}$$

where x can be any subfamily of nuclear receptors, exp(x) is the number of sequences in subfamily x, p(x) is the number of correctly predicted sequences of subfamily x, n(x) is the number of correctly predicted sequences not of subfamily x, u(x) is the number of under-predicted sequences, and o(x) is the number of over-predicted sequences.

The assignment of prediction reliability is important when using machine learning to classify subfamilies of nuclear receptors. The reliability index was assigned on the basis of the difference between the highest and second-highest output scores of SVMs in multi-class classification ([...] B., [...] C. J. Mol. Biol. [...]; [...] H., [...] S., [...] G. J. Mol. Biol. [...]). The reliability index was calculated for each sequence (Equation [...]).

Artificial intelligence techniques such as SVM and the [...] are [...] approaches for the extraction of patterns from biological sequence data. They are highly [...] for [...] where fixed length input is used ([...] A., [...] M.C. Advances in [...]. MIT Press, Cambridge, [...]). The major limitation of these techniques is that they need input of fixed length. In this study, amino acid composition and dipeptide composition were used to transform the variable length of proteins to fixed length feature vectors. The classifiers were developed using the support vector machine because it was shown in the past that SVM is better at classifying biological data in comparison with the [...] ([...] Bioinformatics. 2003; [...]; [...] M., Raghava [...] Bioinformatics. 2004; [...]).
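The two fixed-length encodings (Equations 1 and 2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the 20 standard residues are taken in alphabetical one-letter order, residues outside that alphabet are skipped, and the denominator of Equation 2 is taken to be the number of overlapping adjacent residue pairs in the sequence.

```python
from collections import Counter
from itertools import product
from typing import List

# Assumption: the 20 standard amino acids, one-letter codes, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides, which fixes the feature order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence: str) -> List[float]:
    """Equation 1: fraction of each amino acid type (20 values)."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> List[float]:
    """Equation 2: fraction of each dipeptide (400 values).

    Pairs are formed from adjacent positions first and filtered afterwards,
    so skipping a non-standard residue never creates an artificial pair.
    """
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(valid)
    return [counts[dp] / len(valid) for dp in DIPEPTIDES]


if __name__ == "__main__":
    toy = "MKTAYIAKQR"  # made-up sequence, for illustration only
    print(len(amino_acid_composition(toy)), len(dipeptide_composition(toy)))
```

Both vectors sum to 1 for any valid input, which makes sequences of different length directly comparable as SVM inputs.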
The results of 5-fold cross-validation of amino acid composition-based and dipeptide composition-based classification are given in Table [...]. The accuracies on the five different sets used in the testing of the classifier during 5-fold cross-validation are shown in Table [...] of the [...]. The information about the number of [...] and [...] sequences in each set used during 5-fold cross-validation is given in Table [...] of the [...]. Table [...] shows the results achieved with the [...] kernel. The overall accuracy and MCC of the amino acid composition-based classifier for classifying the four subfamilies of nuclear receptors were 82.6% and [...], respectively. This shows that subfamilies of nuclear receptors are correlated with amino acid composition and can be predicted on this basis.

Table [...]. The prediction accuracy and MCC of both amino acid and dipeptide composition-based classifiers with the radial basis function (RBF) type of kernel. Rows: thyroid hormone-like, HNF4-like, estrogen-like, and Fushi tarazu-F1-like subfamilies; table values [...]. The results were obtained through 5-fold cross-validation.

Chou and Cai [...] that the [...] of sequence [...] with amino acid composition (pseudo-amino acid composition) [...] the accuracy of the classification of proteins (Chou K.C., Cai Y.D. J. [...] 2003; [...]; Chou K.C. [...] 2001; [...]). Therefore, we also used pseudo-amino acid composition to classify nuclear receptors with better accuracy and to examine the effect of sequence [...] on classification. Pseudo-amino acid composition was [...] using amino acid composition with [...]. The different pseudo-amino acid compositions were [...] using [...] factors of different [...], as described by Chou and Cai (Chou K.C., Cai Y.D. J. [...] 2003; [...]).
The overall performance of the classifiers developed by using pseudo-amino acid composition in classifying the four subfamilies of nuclear receptors is shown in Table [...]. The results show that the overall MCC and accuracy of the pseudo-amino acid composition-based approach are better in comparison with those of the amino acid composition-based approach. The results also show that accuracy as well as MCC [...] the pseudo-amino acid composition by [...] factors from the [...] to the [...] with amino acid [...] and [...]. Thus, a [...] in predicting the subfamilies of nuclear receptors was achieved using pseudo-amino acid composition.

Table [...]. The performance of different pseudo-amino acid composition-based approaches in classifying the subfamilies of the nuclear receptors. The pseudo-amino acid composition was calculated as described by Chou and Cai (Chou K.C., Cai Y.D. J. [...] 2003; [...]). Table values and footnotes [...].

In a [...], a [...] acid composition based on the [...] correlation was [...] to the [...] of [...] on [...]. The [...] acid composition based on this approach was [...] to the [...] from our data [...]. The [...] of the [...] was done by [...] from Chou and Cai (Chou K.C., Cai Y.D. J. [...] 2003; [...]). The [...] of amino acids were taken from [...] et al. ([...] P. J. [...]). The performance of [...] acid composition using the amino acid composition and [...] correlation is shown in Table [...]. The performance of the classifier based on this approach was [...] that obtained with the amino acid composition-based classifier. [...] a [...] acid composition was [...] using [...] and [...] with amino acid composition.
The performance of the classifier based on this approach [...] is shown in Table [...]. The performance shown in Table [...] was obtained using 5-fold cross-validation. The accuracy and MCC obtained using this approach were significantly better in comparison with those obtained with the amino acid composition-based approach. These results [...] that pseudo-amino acid composition provides more information about a protein in the [...] of prediction. On the other hand, the performance of the dipeptide composition-based approach was better as compared with that of the pseudo-amino acid composition-based approach. The most likely reason for the performance of the pseudo-amino acid composition-based approach may be that it [...] of amino acids and [...]. The dipeptide composition-based approach [...] all of the [...] of amino acids whether they are [...] or [...].

To improve prediction accuracy, a classifier based on the dipeptide composition of the protein was developed. The dipeptide composition encapsulates the global information of the amino acid fraction and the local order of amino acids. Thus, dipeptide composition is a better feature as compared with amino acid composition. The overall accuracy and MCC of the dipeptide composition-based classifier were 97.5% and 0.96, which were significantly higher in comparison with the accuracy and MCC of the amino acid composition-based classifier. The performance of the classifier on each set that was used for testing during 5-fold cross-validation is shown in [...] of the [...]. The overall accuracy of the dipeptide composition-based classifier was higher than that of the amino acid composition-based classifier. The overall MCC of the dipeptide composition-based classifier was 0.96, which was significantly higher than the MCC of the amino acid composition-based classifier. The accuracy and MCC of dipeptide composition-based classifiers in recognizing different subfamilies of nuclear receptors are shown in Table [...]. The results show that the thyroid hormone-like receptor subfamily is classified with higher accuracy and MCC in comparison with other subfamilies. The most likely reason for such results may be the large size of the data set related to thyroid hormone-like receptors.
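The per-subfamily accuracy and MCC quoted in this discussion can be computed from four confusion counts. The sketch below uses the standard definition of the Matthews correlation coefficient and takes accuracy for a subfamily as the fraction of its sequences that are correctly predicted; the function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience.

```python
import math
from typing import Sequence, Tuple


def subfamily_metrics(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    subfamily: str,
) -> Tuple[float, float]:
    """Return (accuracy, MCC) for one subfamily.

    p: sequences of the subfamily predicted as the subfamily (correct)
    n: sequences of other subfamilies not predicted as the subfamily (correct)
    u: sequences of the subfamily predicted as something else (under-predicted)
    o: sequences of other subfamilies predicted as the subfamily (over-predicted)
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    p = n = u = o = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == subfamily and guess == subfamily:
            p += 1
        elif truth == subfamily:
            u += 1
        elif guess == subfamily:
            o += 1
        else:
            n += 1
    observed = p + u  # number of sequences that truly belong to the subfamily
    accuracy = p / observed if observed else 0.0
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denominator if denominator else 0.0
    return accuracy, mcc


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    truth = ["TR", "TR", "ER", "ER"]
    guess = ["TR", "ER", "ER", "ER"]
    print(subfamily_metrics(truth, guess, "TR"))
```

Because the MCC numerator penalizes both u and o, a classifier that over-assigns sequences to a large subfamily scores lower on MCC than on accuracy alone, which is why the text treats MCC as the more informative measure.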
It is expected that the accuracy for the other subfamilies can be improved by enlarging the training data with more new proteins belonging to those subfamilies. The results also show that the dipeptide composition-based classifier is able to recognize the subfamilies of nuclear receptors more accurately as compared with the amino acid composition-based classifier. This shows that different subfamilies of nuclear receptors are correlated with dipeptide composition.

To [...] the [...] about the reliability of the prediction, the reliability index of both amino acid composition-based prediction and dipeptide composition-based prediction was also calculated. The RI assignment provides information about the certainty of prediction for a sequence. The reliability index was assigned according to the difference between the highest and the second-highest 1-v-r SVM output scores (9, 11). If a sequence was found to have a large [...] for a class of nuclear receptors [...] the [...], the sequence had a [...] of belonging to that class. The reliability index is the key [...] for [...] receptors with [...] prediction accuracy (Hua S., Sun Z. J. Mol. Biol. 2001; [...]). The [...] shown in [...] of [...] is the prediction accuracy for sequences labeled with a given reliability index. For example, in the case of amino acid composition-based prediction, the accuracy of sequences with [...] is [...], with [...] of the sequences of the data set [...]. The accuracy at different [...] is shown by a [...] between the accuracy and the [...]; this is also shown for dipeptide composition-based classifiers [...]. The dipeptide composition-based classifier [...] sequences with [...]; [...] sequences are [...].

These results show that our method is able to predict the subfamilies of nuclear receptors with [...]. Thus, this classifier will complement methods like BLAST and FASTA in recognizing the nuclear receptor proteins with [...]. We have developed the [...] for recognizing the subfamilies of nuclear receptors [...]. This method, in combination with a [...] search [...], can be used for automated [...] of [...] data. The [...] also [...] that there is a direct correlation between the composition of the proteins (amino acid and dipeptide composition) and the subfamilies of nuclear receptors.
The [...] of such methods will [...] the [...] of subfamilies of nuclear receptors and will [...] drug [...] for diseases [...].

[...] of [...] as a [...] written in [...] and [...]. The [...] of the [...] is [...] and [...]. The [...] accepts the protein sequence in any [...] format like [...] or in [...] format. The [...] the [...] to the [...]. The [...] the [...] of prediction on the basis of amino acid composition or dipeptide composition. After analysis, the results will be [...] in a [...] format. The [...] information about the subfamily of nuclear receptor, reliability index, and the [...]. The [...] and related information is available from www.imtech.res.in/raghava/nrpred.

[...] information from the different subfamilies of nuclear receptors is most [...]. In the future, more information about different subfamilies of nuclear receptors will be [...] to [...] a large data set, because the performance of any method is dependent on the [...] and [...] of data. This data set will be used to [...] and [...] the performance of the method. The users are [...] to give their [...] about any [...] or [...] of the [...].

We are [...] to [...] and [...] for the [...] with [...]