Detection of selective cationic amphipatic antibacterial peptides by Hidden Markov models

Antibacterial peptides are studied mainly for their potential benefit in a variety of socially relevant diseases; hosts use them to protect themselves from different types of pathogenic bacteria. We used the mathematical-computational method known as Hidden Markov models (HMMs) to target a subset of antibacterial peptides named Selective Cationic Amphipatic Antibacterial Peptides (SCAAPs). The main difference in our implementation of HMMs was that the detection of SCAAPs relied principally on five physical-chemical properties of each candidate SCAAP, instead of on the statistical information about the amino acids which form a peptide. With this method a cluster of antibacterial peptides was detected, comprising 9 SCAAPs, 6 synthetic antibacterial peptides that belong to a subregion of Cecropin A and Magainin 2, and 19 peptides from the Cecropin A family. A scoring function was developed with HMMs at its core, employing only information accessible from the databases.


BACKGROUND
The increasing number of pathogens resistant to conventional antibiotics and the rising cost of producing the latter have led to the search for new drugs. One option for the development of these drugs is the production of antibacterial peptides found in nature, since these are the first line of defence of living beings.
Antibacterial peptides have a wide variety of applications, from their use as antimicrobials to their use, after adaptations, as anticarcinogens (Ellerby et al., 1999; Del Río et al., 2001) and as human obesity control aids (Kolonin et al., 2004). It has also been observed that antibacterial peptides do not necessarily act exclusively against bacteria. An example of a large, non-specific antibacterial peptide of 85 residues is gambicin: MKQQTVFVLLALLLVSASCVDALVYVYAKTCSTCRSLGARNCGYGSLGSKKYVSCDGATAIRNCDDCRRRFGTCQDRYITECFIG-NH2, which shows activity against bacteria and fungi (Vizioli et al., 2001).
The Selective Cationic Amphipatic Antibacterial Peptides (SCAAPs) are a recent and promising alternative for discovering new drugs effective in treating bacterial infections. They are characterized by being less than 60 amino acids in length, not adopting an α-helical structure in neutral pH water solution, and having a therapeutic index higher than 75 (Del Río et al., 2001). The therapeutic index of a peptide is defined (Ellerby et al., 1999; Del Río et al., 2001) as the ratio between the minimum inhibitory concentrations observed against mammalian and bacterial cells: the higher the value, the more specific the peptide is for bacterial-like membranes. In other words, SCAAPs display strong lytic activity against bacteria, but have no toxicity against normal eukaryotic cells such as erythrocytes (Shin et al., 2000).
Computer-based approaches may accelerate the discovery of new SCAAPs. However, detection of SCAAPs among every possible antibacterial peptide is not feasible either computationally or by biological assays. The number of possible sequences is $20^{n}$, where $n \in \mathbb{N}$ is the length of the peptide. For instance, an improved version of our program APAP (Del Río et al., 2001), named APAP-I and executed on a cluster of 100 CPUs, cannot evaluate more than $20^{13}$ sequences of length 13 aa; it takes more than 10 months of processing time on a single PC (not shown). APAP-I, like APAP, evaluates the following physical-chemical properties for each peptide: isoelectric point (IP), average helical hydrophobic moment (HM), mean hydrophobicity (MH), mean net charge (MC) and AGADIR (helix/coil transition algorithm). APAP-I is 396 000 times more efficient than the program APAP because it was designed to run on a high performance computing platform and is oriented to evaluating short peptides (8-11 aa). Thus, identification of new SCAAPs by searching the full space of peptide sequences may not be practical.
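To make the scale of the search space concrete, the following minimal Python sketch counts the sequences of a given length and converts an evaluation rate into wall-clock time. The throughput value is a hypothetical placeholder chosen for illustration; it is not a measured APAP or APAP-I figure.

```python
# Size of the peptide search space: 20 amino acids, length n -> 20**n sequences.
AMINO_ACID_COUNT = 20
SECONDS_PER_YEAR = 365 * 24 * 3600


def search_space(n: int) -> int:
    """Number of distinct peptide sequences of length n."""
    if n < 0:
        raise ValueError("peptide length must be non-negative")
    return AMINO_ACID_COUNT ** n


def years_to_enumerate(n: int, seqs_per_second: float) -> float:
    """Years needed to evaluate every sequence of length n.

    seqs_per_second is an ASSUMED throughput, not a value from the article.
    """
    return search_space(n) / seqs_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    assumed_rate = 1e6  # hypothetical: one million sequences per second
    for length in (8, 11, 13):
        years = years_to_enumerate(length, assumed_rate)
        print(f"n={length}: {search_space(length)} sequences, {years:.3g} years")
```

Because the count grows by a factor of 20 with every added residue, any fixed throughput is overtaken after a few more positions, which is the motivation for restricting the search to likely candidates.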
An alternative approach would be to search for new SCAAPs in sequences likely to have antibacterial activity. In this regard, it is possible to search for SCAAPs in peptides obtained from venoms (Conde et al., 2000) or to identify sequence patterns present in known antibacterial peptides. To identify such patterns, Hidden Markov Models (HMMs) provide a theory for profile methods (Resch, 2004; Prado-Prado et al., 2007a; 2007b). These HMMs may be used to predict new antibacterial peptides based on numeric indices of the peptide.
The idea has also been extended to include Quantitative Proteome-Property Relationship (QPPR) models that personalize predictions of drug cardiotoxicity (González-Díaz et al., 2008a; 2008b; 2008c) or human prostate cancer (Ferino et al., 2008; González-Díaz et al., 2009), based on the protein composition of blood proteomes. These Markov methods use different types of transition probabilities described by atom-atom, nucleotide-nucleotide, amino acid-amino acid, or even protein-protein matrices. Two in-depth reviews of the field were published recently (González-Díaz et al., 2008a; 2008c).
This article presents an approximation by Hidden Markov Models to detect SCAAPs based on physical-chemical similarity. As previously described (Del Río et al., 2001), the advantage of HMMs for this purpose is that they may identify patterns not obvious from iterative approaches such as APAP. This in turn may accelerate the discovery of new SCAAPs.
HMMs were implemented by using four sets of antibacterial peptides and one set of proteins. Sets A, B and C are described here; sets D and E are described in the section HMMs. Tests.
Set A: 59 natural and synthetic antibacterial peptides extracted from set C, which act exclusively against bacteria, fungi, viruses and mammalian cancer cells, with 3D structure determined by NMR spectroscopy or X-ray diffraction (NCBI, September, 2007).
Set B: 28 natural and synthetic antibacterial peptides extracted from set C, which act exclusively against bacteria, with 3D structure determined by NMR spectroscopy or X-ray diffraction (NCBI, September, 2007).
Set C: 500 natural and synthetic antibacterial peptides which have a non-specific action against bacteria. The method used to predict the 3D structure is not relevant (NCBI, September, 2007).
A stochastic process is a mathematical model for any phenomenon evolving or varying in time (or space etc.) subject to random influences (e.g., the stock market price of a commodity observed in time, the distribution of colors or shades in a noisy picture observed in an unordered two-dimensional lattice etc.).

Markov Models. Introduction
Predicting the condition H at time $t \in \mathbb{N}$ is concerned with hypothesizing what the condition H will be at time $t + 1$, based on the observations of the condition H in the past (Resch, 2004).
We collected the relative frequency of the condition $h_i$ (at time $i$) depending on what the condition H was one time step earlier, $h_{i-1}$, the step before that, $h_{i-2}$, and so forth.
The conditional probability is

$$P(h_i \mid h_{i-1}, h_{i-2}, \ldots, h_1).$$

However, the larger the value of $i$ is, the more observations we must collect. For $n$ states of the condition H the number of past histories will be $|H|^{n-1}$.
If we take the Markov assumption, the probability of an observation at time $i$ depends only on $h_{i-1}$. So we can express the probability of a sequence $\{h_1, \ldots, h_n\}$ using this assumption:

$$P(h_1, \ldots, h_n) = \prod_{i=1}^{n} P(h_i \mid h_{i-1}),$$

where the first factor, $P(h_1 \mid h_0)$, is read as the initial probability $P(h_1)$. As a consequence of the Markov assumption, the number of past histories is reduced to $h_n \times h_{n-1}$.
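The product form that follows from the Markov assumption can be sketched in a few lines of Python. This is a minimal illustration only: the two states and all probability values below are hypothetical and are not taken from the peptide sets.

```python
def markov_sequence_probability(seq, initial, transition):
    """P(h_1..h_n) = P(h_1) * product over i >= 2 of P(h_i | h_{i-1}).

    initial[s]        : probability that the sequence starts in state s
    transition[r][s]  : probability of moving from state r to state s
    """
    if not seq:
        return 1.0  # the empty sequence is certain by convention
    probability = initial[seq[0]]
    for previous, current in zip(seq, seq[1:]):
        probability *= transition[previous][current]
    return probability


# Hypothetical two-state chain used only to exercise the function.
initial = {"x": 0.6, "y": 0.4}
transition = {
    "x": {"x": 0.7, "y": 0.3},
    "y": {"x": 0.2, "y": 0.8},
}

if __name__ == "__main__":
    # 0.6 * 0.7 * 0.3
    print(markov_sequence_probability(["x", "x", "y"], initial, transition))
```

Only the initial vector and one transition table are needed, regardless of the sequence length, which is the practical benefit of the assumption.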

HMMs. Mathematical description
If A, B are two events with $P(B) > 0$, then we define the probability of A given B as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

One can work in the mathematical ideal world with the probability P to achieve various mathematical objectives, and then reinterpret these results back in the real world with a measure change back to P via the inverse Radon-Nikodym derivative.
If circumstances only allow us to obtain the condition H based on another condition O, the condition H is hidden from us. We evaluate the conditional probability $P(h_i \mid o_i)$ according to Eqn. (2).
If we assume that, for all $i$, the $H_i$, $O_i$ are independent of all $o_j$, $h_j$ for all $i \neq j$, Eqn. (1) can be rewritten in the form of Eqn. (3), which is known as a measure of the probability and is referred to as the likelihood function L.
The expectation maximization (EM) algorithm reestimates the parameters of the model.
Many of the density functions are exponential in nature; it is therefore easier to maximize a likelihood function by finding the maximum of the natural logarithm of L, known as the ln-likelihood function. Both functions attain their maximum at the same parameter values due to the monotonicity of the ln function.

HMMs. Terminology
HMMs are specified by the set of states $s = \{s_1, s_2, \ldots, s_n\}$, corresponding to the possible values of the condition H, and the parameter set $\Omega = \{\pi, A, B\}$:
The initial probabilities $\pi_i = P(h_1 = s_i)$ are the probabilities of $s_i$ being the first state of a state sequence. They are collected in the vector $P_0$.
The transition probabilities are the probabilities of going from state $i$ to state $j$: $a_{i,j} = P(h_n = s_j \mid h_{n-1} = s_i)$. They are collected in matrix A.
The emission probabilities characterize the likelihood of an observation $o_i$ given the state $h_i$. They are collected in matrix B.
The likelihood of $O = \{o_1, \ldots, o_n\}$ along the path $H = \{h_1, \ldots, h_n\}$, determined from HMMs with parameters $\Omega$, is given by:

$$P(O, H \mid \Omega) = P(O \mid H, \Omega)\, P(H \mid \Omega) \qquad (4)$$

where the probabilities $P(O \mid H, \Omega)$ and $P(H \mid \Omega)$ are expressed in terms of matrices A, B (Eqns. 5 and 6) and the vector $P_0$. $P(O, H \mid \Omega)$ (Eqn. 4) is known as the joint likelihood of an observation sequence and it is equivalent to Eqn. (1).
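As an illustration of the joint likelihood along a fixed path, the Python sketch below evaluates it in ln space, which avoids numerical underflow when many probabilities are multiplied. The states, the parameter values and the dictionary layout are hypothetical choices made for this example; they are not the matrices built from sets A, B and C.

```python
import math


def joint_log_likelihood(observations, path, pi, A, B):
    """ln P(O, H | Omega) = ln P(O | H, Omega) + ln P(H | Omega).

    pi[s]    : initial probability of state s
    A[r][s]  : transition probability from state r to state s
    B[s][o]  : emission probability of observation o in state s
    """
    if len(observations) != len(path) or not path:
        raise ValueError("observations and path must be non-empty and equal in length")

    # ln P(H | Omega): initial probability times the chain of transitions.
    log_path = math.log(pi[path[0]])
    for previous, current in zip(path, path[1:]):
        log_path += math.log(A[previous][current])

    # ln P(O | H, Omega): each observation depends only on its own state.
    log_emissions = sum(math.log(B[state][obs]) for state, obs in zip(path, observations))

    return log_emissions + log_path


# Hypothetical parameters, for illustration only.
pi = {"s1": 0.5, "s2": 0.5}
A = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {True: 0.95, False: 0.05}, "s2": {True: 0.05, False: 0.95}}

if __name__ == "__main__":
    value = joint_log_likelihood([True, True, False], ["s1", "s1", "s2"], pi, A, B)
    print(value, math.exp(value))
```

Since ln is monotonic, ranking candidate paths or parameter sets by this ln value gives the same order as ranking them by the likelihood itself.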

HMMs. Implementation
The set of states s corresponds to the twenty different amino acids from which every antibacterial peptide is formed: s = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}, and the parameter set was formed by $\Omega = \{P_0, A, B\}$.
The vector $P_0$ contains the elements $p_{0i}$, $i = 1, \ldots, n$, where $n$ is the length of the peptide to be tested, and $p_{0i}$ is the relative frequency distribution of the amino acids of that peptide, derived from the absolute frequency distribution of the natural and synthetic antibacterial peptides of set A (Table 1). Their 3D structure was determined by NMR spectroscopy or X-ray diffraction.
Table 1. Absolute frequency of amino acids in the natural and synthetic antibacterial peptides which act exclusively against bacteria, fungi, viruses and mammalian cancer cells (set A), used for vector $P_0$. The letters in the table refer to the 20 amino acids (one-letter code), and the numbers represent the corresponding frequency of that amino acid in the set.
The matrix A represents the relative frequency of all 400 possible pairs of amino acids. These pairs were taken in two directions: $(a_{i,j}, a_{i+1,j})$ and $(a_{i-1,j}, a_{i,j})$, for specific $j$. The matrix was built from natural and synthetic antibacterial peptides which have non-specific action against bacteria (set C); the method used to predict the 3D structure is not relevant (Table 2). These peptides were taken from the database BBCM (NCBI, September, 2007).
Every pair of amino acids from the peptide to be tested was extracted from matrix A.
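The construction of a pair-frequency matrix and the lookup for a test peptide can be sketched as follows. This is a minimal illustration under stated assumptions: the two peptides are hypothetical stand-ins for set C, and the counts are normalized over all pairs, since the text does not state whether normalization was global or per row.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_counts(peptides):
    """Absolute frequency of each contiguous amino acid pair in a set of peptides."""
    counts = Counter()
    for peptide in peptides:
        for first, second in zip(peptide, peptide[1:]):
            # Skip pairs containing non-standard residue codes.
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[(first, second)] += 1
    return counts


def relative_pair_matrix(peptides):
    """Relative frequency of all 400 ordered pairs (ASSUMED global normalization)."""
    counts = pair_counts(peptides)
    total = sum(counts.values())
    return {
        (first, second): (counts[(first, second)] / total if total else 0.0)
        for first in AMINO_ACIDS
        for second in AMINO_ACIDS
    }


def peptide_pair_values(peptide, matrix):
    """Matrix entry for every contiguous pair of the peptide to be tested."""
    return [matrix.get(pair, 0.0) for pair in zip(peptide, peptide[1:])]


if __name__ == "__main__":
    reference_set = ["KWKL", "KLKK"]  # hypothetical peptides, not from set C
    matrix = relative_pair_matrix(reference_set)
    print(peptide_pair_values("KLK", matrix))
```

Counting pairs with `zip(peptide, peptide[1:])` visits each residue once as the first member and once as the second member of a pair, which covers both reading directions mentioned in the text for interior positions.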
The matrix B exhibits the conditional probability of the peptide to be tested as the result of two conditions: first, the evaluation of each natural and synthetic antibacterial peptide by the program APAP-I (which decides whether or not the peptide is a candidate SCAAP); second, whether Index A ≥ 0.08. Index A (Eqn. 7) is formed by the relative frequency distribution of amino acid $A_i$ of the peptide to be tested, derived from the absolute frequency distribution of the natural and synthetic antibacterial peptides which act exclusively against bacteria (set B) (Table 3; NCBI, September, 2007).
The program APAP-I was used to evaluate whether or not a peptide to be tested from set C was a candidate SCAAP, by evaluating different physical-chemical properties. APAP-I is formed by two subprograms:
APAP-IA, which evaluated the isoelectric point IP, helical hydrophobic moment HM and AGADIR.
APAP-IB, which evaluated the isoelectric point IP, helical hydrophobic moment HM, mean hydrophobicity MH and mean net charge MC.
The physical-chemical properties and their acceptable ranges were:
Isoelectric point (IP) (Del Río et al., 2001). This is the pH at which a particular peptide carries no net electrical charge. The value range considered was from 10.8 to 11.8.
Helical hydrophobic moment (HM) (Eisenberg et al., 1982). This is a vector sum of the hydrophobicities of the side chains of a helix of n amino acids. The length of each vector is the numerical hydrophobicity associated with the kind of side chain, and its direction is determined by the orientation of the side chain about the helix axis. A large value of HM means that the helix is amphiphilic perpendicular to its axis. The value range considered was from 0.4 to 0.6.
Mean hydrophobicity (MH) (Del Río et al., 2001). This is the mean of the hydrophobicities of the amino acids normalized to 1 over all amino acids of the peptide. The algorithm was given by the technical department of the Swiss Institute of Bioinformatics (Swiss). The value range considered was from 0.35 to 0.55.
Mean net charge (MC). The variables $R_i$, $K_i$, $D_i$ and $E_i$ represent the number of times the amino acids arginine (R), lysine (K), aspartic acid (D) and glutamic acid (E) appeared. Peptides were accepted when their MC(R,K,D,E), evaluated with Eqn. (8), was above or equal to the number obtained by Eqn. (9) for the same mean hydrophobicity (MH).
AGADIR (Lacroix et al., 1997; Del Río et al., 2001). Predicts the helical behaviour of a peptide. The value range considered was from 0.00 to 10.00.
Table 2. Absolute frequency distribution of all contiguous pairs of amino acids from set C. Every letter corresponds to one amino acid; the occurrence of a pair of amino acids $(A_{ci}; A_{cj})$ is built with the amino acid from row $i$ and the amino acid from column $j$.
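The helical hydrophobic moment and mean hydrophobicity listed among the properties above can be sketched as follows. This is an illustration, not the APAP-I implementation: the hydrophobicity values in `TOY_SCALE` are hypothetical, the 100 degree rotation per residue is the usual value for an ideal α-helix, and dividing the moment by the peptide length is an assumption based on the "average" hydrophobic moment mentioned in the Background. The ranges are the ones quoted in the text and only become meaningful with the scale actually used by the authors.

```python
import math

# Acceptance ranges quoted in the text for each property.
RANGES = {"IP": (10.8, 11.8), "HM": (0.4, 0.6), "MH": (0.35, 0.55), "AGADIR": (0.0, 10.0)}

# HYPOTHETICAL hydrophobicity values, for illustration only.
TOY_SCALE = {"K": -1.0, "L": 1.0, "A": 0.5, "G": 0.0}


def mean_hydrophobicity(peptide, scale):
    """Arithmetic mean of the per-residue hydrophobicities."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(scale[residue] for residue in peptide) / len(peptide)


def helical_hydrophobic_moment(peptide, scale, angle_deg=100.0):
    """Length of the vector sum of side-chain hydrophobicities, per residue.

    Residue i contributes a vector of length scale[residue] pointing at
    angle i * angle_deg around the helix axis.
    """
    if not peptide:
        raise ValueError("peptide must not be empty")
    delta = math.radians(angle_deg)
    sin_sum = sum(scale[r] * math.sin(i * delta) for i, r in enumerate(peptide))
    cos_sum = sum(scale[r] * math.cos(i * delta) for i, r in enumerate(peptide))
    return math.hypot(sin_sum, cos_sum) / len(peptide)


def in_range(name, value):
    """True when value lies inside the acceptance range of the named property."""
    low, high = RANGES[name]
    return low <= value <= high


if __name__ == "__main__":
    peptide = "KLAKLLKA"  # hypothetical sequence
    hm = helical_hydrophobic_moment(peptide, TOY_SCALE)
    mh = mean_hydrophobicity(peptide, TOY_SCALE)
    print(hm, in_range("HM", hm), mh, in_range("MH", mh))
```

A peptide whose hydrophobic residues line up on one face of the helix yields vectors that reinforce each other, so its moment is large; residues scattered evenly around the axis cancel out.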
The matrix B gives the conditional probability $P(o_i \mid h_i, \mathrm{Index}_A)$ of being a candidate SCAAP: for $o_i = \text{true}$, $P(o_i = \text{true} \mid h_i, \mathrm{Index}_A) = 0.95$, and for its complement, $o_i = \text{false}$, $P(o_i = \text{false} \mid h_i, \mathrm{Index}_A) = 0.05$. These values were obtained as a result of many computational assays.

HMMs. Tests
As negative tests, the validation of HMMs for detecting candidate SCAAPs consisted of testing the complete set of natural and synthetic antibacterial peptides which had a non-specific action and whose structure could not be determined by either method, i.e. NMR spectroscopy or X-ray diffraction (set C), against two sets:
Set D: three natural and synthetic antibacterial peptides: Gambicin, characterized by non-specific action and not a SCAAP (according to the program APAP-I); Melittin, characterized by toxicity against erythrocytes; and Temporin H [XXA, frog], whose structure was determined by circular dichroism (CD).
Set E: the complete set of natural and synthetic proteins detected in nature, which was used to build the matrices A and B and then to test set C.

HMMs. Statistical analysis
A two-sample rank test by Wilcoxon, Mann and Whitney (Kreyszig, 1979) was applied to two pairs of populations:
Natural and synthetic antibacterial peptides (set C) versus natural and synthetic antibacterial peptides which act exclusively against bacteria (set B).
Natural and synthetic antibacterial peptides with an exclusive action against bacteria (set B) versus natural and synthetic antibacterial peptides detected by the program APAP-I.
These statistical tests were used to verify the hypothesis that two populations have the same distribution with respect to being candidate SCAAPs or not. The assumption was that the populations tested correspond to continuous distributions. The critical values $c_1$ and $c_2$ were obtained using the fact that, if the hypothesis is true, the random variable W over the populations described is approximately normal with mean $\mu_W$ and variance $\sigma_W^2$ (Eqns. 10 and 11). Hence $c_1$ and $c_2$ were obtained by substituting $\mu_W$ and $\sigma_W$ in Eqns. (12) and (13). The test was conducted only on sets B and C because this pair is more similar than the other sets involved (A, D and E).
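The normal approximation behind the critical values can be sketched with the standard textbook form of the Mann-Whitney statistic. This is an assumption: the statistic W and Eqns. (10) to (13) of the original may follow a different but equivalent convention in Kreyszig (1979). The sample sizes 28 and 500 are the sizes of sets B and C given in the text; the function names are introduced here for illustration.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Number of pairs (x_i, y_j) with x_i < y_j; ties count one half."""
    u = 0.0
    for x_value in x:
        for y_value in y:
            if x_value < y_value:
                u += 1.0
            elif x_value == y_value:
                u += 0.5
    return u


def normal_approx_critical_values(n1, n2, alpha=0.05):
    """Two-sided critical values (c1, c2) under the null hypothesis.

    Uses the standard large-sample result: mean n1*n2/2 and
    variance n1*n2*(n1+n2+1)/12 for the statistic.
    """
    mean = n1 * n2 / 2
    std_dev = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mean - z * std_dev, mean + z * std_dev


def reject_same_distribution(x, y, alpha=0.05):
    """True when the statistic falls outside [c1, c2]."""
    c1, c2 = normal_approx_critical_values(len(x), len(y), alpha)
    u = mann_whitney_u(x, y)
    return u < c1 or u > c2


if __name__ == "__main__":
    # Sizes of set B (28 peptides) and set C (500 peptides).
    print(normal_approx_critical_values(28, 500))
```

The hypothesis of equal distributions is rejected when the observed statistic falls below $c_1$ or above $c_2$.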

RESULTS

Objective
The use of HMMs for prediction and understanding of antimicrobial peptides has been reported for the last three decades (Andrés & Dimarcq, 2007), particularly the detection of antimicrobial peptides by multivariate linear regression and physical-chemical properties (Hilpert et al., 2008).
In this article we use HMMs for the prediction of candidate SCAAPs based on five physical-chemical properties: isoelectric point (IP), helical hydrophobic moment (HM), mean hydrophobicity (MH), mean net charge (MC), and AGADIR; and on the relative frequency distribution of single amino acids and amino acid pairs over the sequence of the peptide.
The entire cluster was further analyzed by a search against Swiss-Prot and Translated EMBL protein databases by Smith-Waterman algorithm on GCG/SeqWeb to ensure the identification of these peptides.They are described in Table 4.
Note that the peptide number 32 (position 20 in Table 4) was not accepted by the programs APAP-IA and APAP-IB, but it was accepted by HMMs because of its score.

Negative tests of HMMs
HMMs were tested with three peptides: Gambicin, characterized by non-specific action against bacteria, fungi, viruses and mammalian cancer cells; Melittin, characterized by toxicity against erythrocytes; and Temporin H [XXA, frog], whose structure was determined by circular dichroism (CD). All peptides were accepted by HMMs.
As a full test, we retrieved the complete set of proteins (391 836) from the Uniprot protein database and a new HMM profile was built from these sequences. After calibration, the new HMMs were used with the same set of 500 natural and synthetic antibacterial peptides (set C) that we refer to in the identification of SCAAPs in Table 4: no candidate SCAAPs or SCAAP family was detected.

Statistical verification of HMMs
In order to verify if a statistical similarity exists between the referred set of peptides involved in the tests, we decided to compare only the more biologically similar sets: the set of 500 natural and synthetic antibacterial peptides which have non-specific action against bacteria (set C), and the set of 28 natural and synthetic antibacterial peptides which act exclusively against bacteria, with their 3D structure detected by NMR spectroscopy or X-ray diffraction (set B).
We ran a Wilcoxon, Mann and Whitney nonparametric test (with p-value < 0.05): the test found no correlation between those sets, and consequently it was concluded that the two sets had no statistical relation.

DISCUSSION
In this article, we have described the detection of nine SCAAPs by applying a mathematical-computational tool, the HMM search, on a predicted peptide database. Compared with the experimental assay search, the HMM is much more sensitive due to its summarizing nature. The key point for a successful HMM search lies in constructing the HMMs profile (a combination of physical-chemical properties and the relative frequency distribution of amino acids over the sequence of the peptide). The inclusion of the complete set of proteins from the Uniprot protein database in order to reconfigure HMMs, and the inclusion of three wrong sequences, provide more reliability and robustness for this HMM profile.
We recognize some bias with this approach. The major issue is related to the incompleteness of the existing databases. The degree to which the current database is complete is not known, even though our studies are designed to be exhaustive.
While this manuscript was being prepared, a paper was in press that described the detection of short linear cationic antimicrobial peptides using, principally, the nonlinear techniques of support vector machines and artificial neural networks (Hilpert et al., 2008). Their methods are more selective and less comprehensive than the HMMs described here. Thus, these two approaches could be used as complementary tools in identifying novel candidate members of a specific protein family.

Comparative studies
Our HMMs profile was compared with three stochastic methods named HMMER (HMMER), MAST (Bailey & Gribskov, 1998;2000) and GLAM (Frith et al., 2004).These comparisons were concerned with the number of hits each method offers, and the results show that GLAM was superior to the other methods but that HMMs, MAST and HMMER were equally effective.

Table 3. Elements of vector Index A.
Absolute frequency of amino acids in the natural and synthetic antibacterial peptides which act exclusively against bacteria (set B), used for vector Index A. The letters in the table refer to the 20 amino acids (one-letter code), and the numbers represent the corresponding frequency of that amino acid in the set.

Table 4. Cluster of antibacterial peptides predicted by HMMs and listed in descending order (set C).
NL: Position of the antibacterial peptide on the list. NP: Number which corresponds to the antibacterial peptide according to HMMs. F: Family. If natural SCAAPs were a part of set B, [s]. If Brevinin, [B]. If Cathelin, [Ca]. If Cecropin, [C]. If Moricin, [M]. AP-A: Peptide which was accepted by the program APAP-IA (Section HMMs. Implementation). AP-B: Peptide which was accepted by the program APAP-IB (Section HMMs. Implementation).
. Polanco and J.L. Samaniego