Possible Computational Filter to Detect Proteins Associated to Influenza A Subtype H1N1

The design of drugs using bioinformatics methods that identify proteins and peptides with a specific toxic action is increasingly common. Here, we identify proteins in the UniProt database with toxic action towards the influenza A virus subtype H1N1. Our quantitative structure-activity relationship (QSAR) approach is based on the analysis of the linear peptide sequence with the so-called Polarity Index Method, which shows an efficiency of 90% for proteins from the UniProt database. This method was exhaustively verified with the APD2, CPPsite, UniProt, and AmyPDB databases as well as with the set of antibacterial peptides studied by del Rio et al. and the set of proteins studied by Oldfield et al.


INTRODUCTION
A pandemic of the influenza A virus subtype H1N1 occurred in Mexico in 2009. It recalls the 1918 flu pandemic outbreak in the USA, commonly known as the Spanish flu. The Spanish influenza pandemic caused the death of approximately 3.7% of the Earth's population (between 50 and 100 million people). Most deaths took place during the first 25 weeks of the outbreak (USCB, 2013). Comparing the two pandemics, the Mexican pandemic was far less lethal because the virus was weak and the means of transmission was bird to human. However, it is only a matter of time until the inevitable occurs and a lethal strain arises in humans. One particular point is that this type of influenza virus (Chowell et al., 2011), which can spread quickly around the world, is a variety of influenza virus with gene segments common to bird, pig and human flu strains. Considering that pigs (Wenjun et al., 2009) are susceptible to both bird and human influenza viruses, which allows a redistribution of gene segments, we can assume that this virus can rapidly mutate in humans. One of the distinctive signs of the influenza A virus subtype H1N1 pandemic is that the virus is easily transmissible among humans. A future pandemic threat can be combated by predicting the location of the outbreak, by carrying out a fast count of the people infected using predictor algorithms (Nishiura, 2011) and other predictive models, or by developing new drugs. Some of these drugs are based on proteins and peptides with toxic action towards influenza A subtype H1N1 and are detected by bioinformatics algorithms. In this sense, scientific and technological research should be oriented to the field of proteomics as well as to the design of efficient computational-mathematical algorithms able to identify and predict peptides and proteins with toxic action on this particular type of virus. These techniques may help to avoid the impractical but so far necessary trial and error involved in the chemical synthesis of new peptides and proteins.
Although nature is recognized as the main source of proteins with toxic action against the influenza A virus subtype H1N1, recent research efforts have been directed to the production of synthetic and hybrid proteins. One procedure is to generate proteins by replacing and/or removing constitutive amino acids from natural proteins known for their anti-influenza A H1N1 action (Barik, 2012; Tsai et al., 2012), reducing their size while keeping or increasing their toxicity. Another technique is to join two peptides or protein chains that individually do not have this property but, combined, become highly toxic (Mohamed et al., 2009). Altering a peptide and quantifying its toxic action in the laboratory by traditional trial and error is impractical, because the number of possible peptides only 7 amino acids long is already $20^{7} = 1.28 \times 10^{9}$. Therefore, new techniques to build proteins against influenza A subtype H1N1 are based on mathematical-computational methods that simulate peptide alterations and then evaluate and qualify them to determine whether a peptide complies with the required criteria. These methods are highly complex in their mathematical-computational design and execution, since they simulate the characteristics necessary to evaluate all possible combinations.
In this work we describe a quantitative structure-activity relationship (QSAR) approach called the Polarity Index Method, which uses a single physicochemical property, polarity, to identify efficiently the influenza A subtype H1N1 proteins from the UniProt database (Magrane, 2011), accessed on March 19, 2014. This method was previously applied to detect antibacterial peptides and selective cationic amphipathic antibacterial peptides (SCAAP) (Polanco & Samaniego, 2009; Polanco et al., 2012), using the existing classification of the 20 proteinogenic amino acids, which are differentiated by their side chain R and divided into four categories according to their polarity profiles (Kawashima & Kanehisa, 2000). The Polarity Index Method uses only this classification to identify the characteristic template of the influenza A subtype H1N1 protein group, which was exhaustively tested with 7 databases and proved highly efficient. Our work shows the efficiency of a computational-mathematical method that identifies influenza A subtype H1N1 proteins with a high level of precision, but it does not intend to carry out any experimental verification on the peptides.

MATERIAL AND METHODS
The Polarity Index Method has already been published as an efficient means of identifying selective antibacterial peptides from the APD2 database (Polanco et al., 2012). For this reason, we mention only the modifications necessary for the identification of influenza A subtype H1N1 proteins. Later, we present a detailed example to clarify its mechanism (Section 2.8).

Polarity Index Method updates
The method essentially measures the polar profile of the peptide in a comprehensive manner by taking into account the 16 polar interactions that arise from the four polarity groups P+, P-, N, and NP (Polanco et al., 2012). Its metric is computed by reading the linear sequence of amino acids of the peptide or protein. In order to perform a comprehensive test, we considered all groups of peptides and proteins that have been studied so far. First, we calibrated with the peptides found in the UniProt database and verified our approach with the following databases: the entire set of antimicrobial peptides from the APD2 database (Wang & Wang, 2009), the sets of cell-penetrating peptides of the endocytic and non-endocytic pathways from the CPPsite database (Gautam et al., 2012), the sets of influenza proteins and human neuronal proteins from the UniProt database (Magrane, 2011), the amyloid peptides from the AmyPDB database (Pawlicki et al., 2008), the set of selective antibacterial peptides studied by del Rio and coworkers (2001), and the sets of natively unfolded and natively folded proteins studied by Oldfield et al. (2005).
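As a minimal sketch of how the 16 polar interactions can be enumerated, the snippet below pairs the four polarity groups in every order. The row-major numbering (interaction 1 = [P+, P+], ..., interaction 16 = [NP, NP]) is an assumption; it is chosen because it agrees with the labels quoted in the text, where interaction 8 is [P-, NP] and interaction 12 is [N, NP]. The names GROUPS, INTERACTIONS and interaction_number are illustrative and do not come from the published program.

```python
from itertools import product

# The four polarity groups named in the text, in the order P+, P-, N, NP.
GROUPS = ["P+", "P-", "N", "NP"]

# Assumed numbering: interactions 1..16 in row-major order over (first, second).
INTERACTIONS = dict(enumerate(product(GROUPS, repeat=2), start=1))


def interaction_number(first: str, second: str) -> int:
    """Return the assumed 1-based number of the interaction [first, second]."""
    return GROUPS.index(first) * 4 + GROUPS.index(second) + 1


if __name__ == "__main__":
    for number, (first, second) in INTERACTIONS.items():
        print(f"{number:2d}: [{first}, {second}]")
```

Under this assumed numbering, interaction_number("NP", "P-") gives 14, which matches the label [NP, P-] used for interaction 14 in the Results.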

Modifications
The P[i,j] matrix in the source program (Polanco et al., 2012) is substituted with the profile of incidences for the corresponding set of influenza A subtype H1N1 proteins. It is worth noting that in this case it was necessary to obtain nine P[i,j] matrices, because we obtained the same number of sub-classifications (Sections APD2 database preparation - SCAAP Database preparation). Once the P[i,j] matrix is completed for each subgroup, it is normalized to unity. In the same way, the Q[i,j] matrix contains the profile of incidences for the sequence under study.
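The normalization step can be illustrated with a short sketch. It assumes that "normalized to unity" means dividing every entry of the 4 x 4 incidence matrix by the total count, so that the 16 entries sum to 1; the text does not spell out the exact convention, and the function name and the example counts are hypothetical.

```python
def normalize_to_unity(counts):
    """Scale a 4x4 count matrix so that its 16 entries sum to 1.

    Assumption: "normalized to unity" refers to the sum of all entries.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("the count matrix is empty")
    return [[value / total for value in row] for row in counts]


# Hypothetical incidence counts for one subgroup (not data from the paper).
example_counts = [
    [2, 1, 3, 4],
    [1, 0, 2, 5],
    [3, 2, 6, 8],
    [4, 5, 8, 10],
]
P = normalize_to_unity(example_counts)
```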
The Polarity Index Method selected as influenza A subtype H1N1 protein candidates those sequences whose P[i,j] + Q[i,j] vector space complied with a set of rules. These rules (Table 1) result from observing that some polar interactions are more frequent than others; this step now runs in a fully automated version, so it no longer has to be carried out manually. Peptides that meet 4 or 5 of the rules listed in Table 1 are regarded by the polarity index method as peptides associated with influenza A subtype H1N1. For example, rule 1, "Polar interaction 8 is not present in the 12th position", means that polar interaction 8 [P-, NP] can occur in any of the 16 possible positions except the 12th. Rule 3, "Polar interaction 12 is present in the 1st position", means that interaction 12 [N, NP] must be present in the first position only.
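A hedged sketch of how such positional rules might be checked is given below. It assumes that the "position" of an interaction is its rank when the 16 entries of P + Q are sorted in descending order, and that interactions are numbered row-major from 1 to 16; neither convention is stated explicitly in the text. Only the three rules quoted in this paper are encoded, the full content of Table 1 is not reproduced, and all function names and matrices are illustrative.

```python
def ranked_interactions(p, q):
    """Return interaction numbers (1..16) ordered by descending P+Q value.

    Assumption: position 1 holds the most frequent interaction. Ties keep
    the lower interaction number first because sorted() is stable.
    """
    combined = [p[i][j] + q[i][j] for i in range(4) for j in range(4)]
    return sorted(range(1, 17), key=lambda k: -combined[k - 1])


def check_rule(ranking, kind, interaction, position):
    """Evaluate one rule of the form 'interaction is (not) present in position'."""
    at_position = ranking[position - 1] == interaction
    return at_position if kind == "present" else not at_position


# The three rules quoted in the text (a subset of Table 1).
RULES = [
    ("absent", 8, 12),   # Polar interaction 8 is not present in the 12th position
    ("present", 12, 1),  # Polar interaction 12 is present in the 1st position
    ("absent", 10, 1),   # Polar interaction 10 is not present in the 1st position
]

# Hypothetical matrices: P concentrated on [N, NP], Q uniform.
p = [[0.0] * 4 for _ in range(4)]
p[2][3] = 1.0
q = [[1 / 16] * 4 for _ in range(4)]

ranking = ranked_interactions(p, q)
rules_met = sum(check_rule(ranking, *rule) for rule in RULES)
```

With the full rule set of Table 1, a sequence would be accepted when rules_met reaches the threshold of 4 or 5 stated in the text.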

Multiple and unique action
Peptide sets with unique toxic action comprise peptides with experimentally verified action against a single pathogenic agent, whereas multiple-action peptide sets comprise peptides with toxic action against two or more pathogens that are over-represented.

CPPsite database data preparation
A total of 115 cell-penetrating peptides from the CPPsite database (Gautam et al., 2012) were classified by their uptake mechanism as follows: 93 non-endocytic pathway and 22 endocytic pathway. Peptides in the CPPsite database with other penetration mechanisms were not considered.

Natively unfolded and folded proteins data preparation
A total of 148 proteins, of which 51 are natively unfolded and 97 are natively folded, were selected from the Supplementary information of Oldfield et al. (2005).

UniProt database preparation
Proteins extracted from the UniProt database (Magrane, 2011): (i) a set of proteins associated with influenza A subtype H1N1 in nine subgroups: 33 HA, 33 M1, 16 M2, 27 NA, 58 NP, 29 NS1, 24 PA, 49 PB1, and 1 PB2 proteins, and (ii) 3616 proteins expressed in neurons, drawn from every living organism studied. In that set we found 755 reviewed human proteins expressed in neurons and 2879 reviewed non-human proteins expressed in neurons.

AmyPDB database data preparation
We analyzed 15 of the 1705 proteins originally classified in several amyloid protein families stored in the AmyPDB database (Pawlicki et al., 2008), restricted to: (i) amyloid formed in vivo (the precursor protein, or a specific sub-segment, forms fibrils in humans), and (ii) amyloid formed in vitro (the polypeptide forms fibrils under experimental conditions).

SCAAP Database preparation
Thirty Selective Cationic Amphipathic Antibacterial Peptides (SCAAP) were taken from Table 2 and Table 2A of del Rio and coworkers (2001).

Test plan
The discriminative efficiency of the polarity index method is measured through three aspects: (i) the number of hits in the identification of the specific group; (ii) the percentage of errors in the identification of the other groups, since the method must be efficient in identifying the group while simultaneously rejecting those peptides or proteins that are not part of it; and (iii) a graph of the relative frequency of each polar interaction for all subgroups of proteins associated with influenza A subtype H1N1 (Sections APD2 database preparation - SCAAP Database preparation) extracted from the UniProt database (Magrane, 2011).
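Aspects (i) and (ii) of the test plan amount to two percentages, which can be sketched as follows. The helper name and the example flags are hypothetical; each flag simply records whether the filter accepted one sequence. The integer part of hits/total is taken, as in the legend of the percentages table.

```python
def integer_percentage(flags):
    """Integer part of 100 * accepted / total for a list of booleans."""
    if not flags:
        raise ValueError("no sequences were evaluated")
    return 100 * sum(flags) // len(flags)


# Hypothetical outcomes (not data from the paper).
# Target group: 9 of 10 influenza proteins accepted by the filter.
target_flags = [True] * 9 + [False]
# Other group: 1 of 20 unrelated peptides wrongly accepted.
other_flags = [True] + [False] * 19

hit_rate = integer_percentage(target_flags)   # aspect (i): hits in the group
error_rate = integer_percentage(other_flags)  # aspect (ii): errors on others
```

A good filter in the sense of the test plan shows a high hit_rate on the target group together with a low error_rate on every other group.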

Example
Although this method has already been published (Polanco et al., 2012), we provide here a detailed description of an illustrative example in order to clarify the algorithm used. Our aim is to determine whether the protein MSLLTEVETYVLSIIPSGPLKAEIAQRLEDVFAGKNTDLEVLMEWLKTRPILSPLTKGILGFVFTLTVPSERGLQRRRFVQNALNGNGDPNNMDKAVKLYRKLKREITFHGAKEISLSYSAGALASCMGLIYNRMGAVTTEVAFGLVCATCEQIADSQHRSHRQMVTTTNPLIRHENRMVLASTTAKAMEQMAGSSEQAAEAMEVASQARQMVQAMRTIGTHPSSSAGLKNDLLENQAYQKRMGVQMQRFK is in accordance with the polarity index method. To answer this question it is necessary to follow five steps:
The above sequence is converted to its numeric equivalent according to the following rule of equivalence: the amino acids H, K, and R are replaced by the number "1"; the amino acids D and E are replaced by the number "2"; the amino acids C, G, N, Q, S, T, and Y are replaced by the number "3"; finally, the amino acids A, F, I, L, M, P, V, and W are replaced by the number "4". Note that the four numerical equivalents {1, 2, 3, and 4} correspond to the four polar groups [P+], [P-], [N], and [NP].
Read the resulting numerical sequence from left to right, moving one position at a time. Each pair of consecutive numbers is considered as an element (i,j): in this case the first pair is (i,j) = (4,3), the second pair is (i,j) = (3,4), and so on until the last pair, (i,j) = (4,1). Please note that the pairs (i,j) index a square matrix of order 4, which we named the forward Q[i,j] matrix, where i represents the row and j represents the column.
Count the occurrences of every (i,j) pair in the Q[i,j] matrix. In this way the Q[i,j] matrix represents the occurrences of the pairs in the numerical sequence.
Polar interaction 8 is not present in the 12th position
Polar interaction 10 is not present in the 1st position
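The conversion and counting steps of the example can be sketched in a few lines. The amino acid classes follow the rule of equivalence given in the text; the function names are illustrative and are not taken from the published program. The sketch is run on the first nine residues of the example protein, MSLLTEVET, to keep the output small.

```python
# Rule of equivalence from the text: 1 = P+, 2 = P-, 3 = N, 4 = NP.
POLAR_CLASS = {}
for letters, code in (("HKR", 1), ("DE", 2), ("CGNQSTY", 3), ("AFILMPVW", 4)):
    for amino_acid in letters:
        POLAR_CLASS[amino_acid] = code


def to_numeric(sequence):
    """Convert a one-letter amino acid sequence to its polarity codes."""
    return [POLAR_CLASS[amino_acid] for amino_acid in sequence]


def count_pairs(sequence):
    """Build the 4x4 Q matrix of consecutive (i, j) pair counts."""
    codes = to_numeric(sequence)
    q = [[0] * 4 for _ in range(4)]
    # Slide over the sequence one position at a time.
    for i, j in zip(codes, codes[1:]):
        q[i - 1][j - 1] += 1
    return q


fragment = "MSLLTEVET"  # first nine residues of the example protein
codes = to_numeric(fragment)
Q = count_pairs(fragment)
```

For this fragment the first two pairs are (4,3) and (3,4), as stated in the example, and the pair (4,3) occurs twice, so Q[3][2] holds 2 when rows and columns are indexed from zero.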

RESULTS
The Polarity Index Method made a discriminating and positive identification of the nine subgroups of proteins associated with influenza A subtype H1N1 extracted from the UniProt database (90%, double-blind test) and shows a nearly discriminative score against the remaining eight sub-classifications, which comprise the APD2, AmyPDB, UniProt (human and non-human proteins), and CPPsite databases and the sets from del Rio et al. and Oldfield et al. (Table 3).
The smooth curves (Fig. 1) corresponding to the nine subgroups of proteins associated with influenza A virus subtype H1N1 (Sections APD2 database preparation - SCAAP Database preparation) have no coincidences in their maximum and minimum points for the polar interactions 3 and 4 ([P+,N] and [P+,NP]), 6 to 8 ([P-,P-], [P-,N], and [P-,NP]), 11 and 12 ([N,N] and [N,NP]), and 14 ([NP,P-]). The method is sensitive to the number of differences: the greater the number of differences, the greater its efficiency. In this case the number is very high, whereas in this kind of evaluation the number of differences is usually two or less.

DISCUSSION
Polarity is a measure of the electromagnetic stability of matter, and electronegativity (Matsunaga et al., 2003) is its numeric equivalent, a metric involved in more than 84% (Thakur et al., 2012) of the bioinformatics algorithms aimed at understanding the toxic action of proteins. We think this metric is not sufficient if it is represented only by a single number. Instead, we have shown that the count of the polar incidences, i.e. 16 in the case of the Polarity Index Method, is much more comprehensive. We believe that this characteristic explains why this method provides an effective discriminative measure of the influenza A subtype H1N1 protein group and of the SCAAP (Polanco et al., 2012; Polanco et al., 2013). In addition, it is also important to mention that the metric considers only one measure. This means that the algorithm is not complex, which allows its implementation for cluster computing under parallel programming, i.e. a collaborative form of programming that runs multiple instructions simultaneously. This could help analyze peptide spaces in order to better understand the selection mechanisms of biological systems concerning the amino acid subgroups.
The method described here has been defined as a QSAR method (González-Díaz & Uriarte, 2005), although, because of its polarity matrix, we consider it closer to a Markov model (Rabiner, 1989), a kind of model that has already been used in a more comprehensive version called the hidden Markov model (Rabiner, 1989). However, the main obstacle to considering it a Markov model is that its polarity matrix does not conform exactly to a Markov matrix, because it is not stochastic. A stochastic Markov matrix is one in which the rows or columns each add up to 1. Nevertheless, we believe that rendering the matrix stochastic will enhance the efficiency of the method. Therefore, by using multiple Markov matrices in a Markov model, called a Hierarchical Hidden Markov Model (HHMM) (Wang et al., 2013), the new method will have different profiles of the same phenomenon, each of them represented by a Markov matrix and interacting under a hierarchical weighting. Such Markov models have been used extensively in speech recognition (Lee, 2008). Bioinformatics arose thanks to this kind of algorithm (Hagen, 2000), which made it easier to identify similarities in protein sequences. For that reason, we are developing a new version of our model with such a Markov structure.
The effectiveness of the polarity index method in the double-blind test reaches 90% on eight different databases of proteins associated with influenza A virus subtype H1N1. This level of success is too high to be considered a lucky coincidence, although the reason at the molecular level remains unknown to us so far. We believe that polarity is a fundamental property of matter that characterizes how a protein adapts to the lipid-aqueous space, so that the amino acid sequence (primary structure) expresses this conformational structure. Until now, we have verified this conjecture in all the groups of peptides and proteins cited in this work. We have even used this property for modeling prebiotic scenarios. Nevertheless, the underlying biochemical reason still remains unknown. However, the present lack of biochemical insight stands in contrast to the efficiency of the Polarity Index Method in the identification of peptides and/or proteins and to its usefulness for prospective drug design from a more macroscopic modeling approach, or as a first filter in the bioinformatics identification of peptides/proteins.

Legend to Table 3. Percentages (hits/total, rounded to the integer part) found by the polarity index method for the nine subgroups of proteins associated with influenza A subtype H1N1: HA, NA, NP, M1, M2, NS1, PA, PB1, and PB2 from the UniProt database (Magrane, 2011). B+: Gram+ bacteria, B-: Gram- bacteria, B+/-: Gram+ and Gram- bacteria, Fu: Fungi, Pa: Parasites, Ca: Cancer cells, Ma: Mammalian cells, and In: Insects from the APD2 database (Wang & Wang, 2009). Am: Amyloidosis proteins from the AmyPDB database (Pawlicki et al., 2008). Sc: Selective antibacterial peptides from del Rio et al. (2001). Ce: Cell-penetrating peptides, endocytic pathway, and Cne: Cell-penetrating peptides, non-endocytic pathway, from the CPPsite database (Gautam et al., 2012). Un: Natively unfolded proteins, and Fo: Natively folded proteins studied by Oldfield et al. (2005). Hu: Human neuronal proteins, and Nh: Non-human neuronal proteins from the UniProt database (Magrane, 2011). U: Unique action: peptides with pathogenic action against only one group. M: Multiple action: peptides with pathogenic action against two or more groups (Section Test plan).
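The stochastic matrix mentioned in the Discussion can be illustrated with a small sketch that turns a 4 x 4 matrix of pair counts into a row-stochastic matrix, i.e. one in which every row adds up to 1. This is not the authors' planned implementation, only an illustration of the definition; the choice of a uniform row for a polarity class that never occurs is an assumption, and the example counts are hypothetical.

```python
def row_stochastic(counts):
    """Normalize each row of a count matrix so that it sums to 1."""
    result = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a class never seen as a first element gets a
            # uniform row so that the matrix stays stochastic.
            result.append([1 / len(row)] * len(row))
        else:
            result.append([value / total for value in row])
    return result


# Hypothetical pair counts (rows: first class, columns: second class).
pair_counts = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 2, 1],
]
T = row_stochastic(pair_counts)
```

Each entry T[i][j] can then be read as the estimated probability that a residue of class i + 1 is followed by a residue of class j + 1.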

CONCLUSIONS
The polarity index method was adapted to identify the nine subgroups of influenza A subtype H1N1 proteins and to reject eight groups of peptides not associated with influenza, scattered across different databases. It has proven to be an efficient algorithm that measures the polarity of the protein from its linear sequence.

Figure 1. Comparison of the polar profiles of the nine subgroups of influenza proteins: HA, NA, NP, M1, M2, NS1, PA, PB1, and PB2. The 18 columns on the x-axis correspond to 16 amino acids of vector incidences (Section Test plan).

Table 1. Number of incidences of proteins expressed in terms of their relative frequencies (see example Section 2.10).

Table 3. Percentages of polarity matches.