Structure and Function Relationships of Proteins Based on Polar Profile: a Review

Proteins in the post-genome era pose diverse research challenges, chief among them understanding their structure-function mechanism and meeting the growing need for new pharmaceutical drugs, particularly antibiotics that help clinicians treat the ever-increasing number of Multidrug-Resistant Organisms (MDROs). A wide range of mathematical-computational algorithms exists to satisfy this demand, among them the Quantitative Structure-Activity Relationship (QSAR) algorithms, which have shown better performance when trained on data characteristic of the property searched; nevertheless, their performance has stagnated regardless of the number of metrics they evaluate and their complexity. This article reviews the characteristics of these metrics and the need to reconsider the mathematical structure that expresses them, directing their design toward a more comprehensive algebraic structure. It also shows how the main function of a protein can be determined with a high level of accuracy by measuring the polarity of its linear sequence, how such an exhaustive metric stands as a "fingerprint" that can be applied to scan protein regions to obtain new pharmaceutical drugs, and how it can help establish the way singularities led to the specialization of the protein groups known today.


INTRODUCTION
In Proteomics, supervised learning (Larrañaga et al., 2006) essentially seeks to identify a regularity (Oestreicher, 2007) among a group of proteins with a particular characteristic, the "training data". Once this regularity is isolated, a mathematical-computational algorithm is built (Kitaev, 1997) to find the same regularity, or its absence, in other groups of proteins. The best scenario is when the desired regularity is evident in a group of proteins; however, that is not usually the case, as the efficiency of the algorithms is frequently low, particularly when trying to identify the primary function of a protein. Proteins do not normally have a unique action associated with them; for example, only 1% of the peptides in the APD2 database (Wang et al., 2009) have a unique pathogenic action. The search for a non-evident regularity, such as the main function associated with a protein, cannot be done by finding similarities in the protein linear sequence, as sequence alignment algorithms such as BLAST (Madden et al., 1996) or FASTA (GenBank, 2011) do. It requires strategies that can identify minimal regularities. Within the group of supervised learning algorithms, some focus on relating chemical structure to biological activity, evaluating only one physico-chemical property and obtaining the best results in identifying the main action or function of a protein; these algorithms are called Quantitative Structure-Activity Relationship models (QSAR models) (Putz et al., 2011). The more than 80 known QSAR algorithms (Qureshi et al., 2014) use physico-chemical metrics involving the linear representation and/or the 3D structure of the protein, and evaluate one or more properties simultaneously. What differentiates QSAR models is the metrics they use; however, all of them produce a real value from a predetermined range, e.g. the isoelectric point (Kosmulski, 2009) at 25°C for tungsten(VI) oxide (WO3) in water: [0.2-0.5]. At first glance, the greater the number of physico-chemical properties, the lower the number of "false positives", but this is not true here: there are QSAR models that include all known physico-chemical properties in their metrics (Yap, 2011), and yet false positives still occur, with efficiency not exceeding 80% in most of the models (Brendel et al., 1992). The probable cause is that when the result comes from a predetermined range, the completeness property of the real numbers is not considered; therefore, the combination of physico-chemical properties does not add effectiveness to the algorithm, but it does add complexity to the computational implementation. A minimalist approach to the assessment of the physico-chemical properties can significantly improve the performance of a QSAR model. This approach consists of identifying the fundamental physico-chemical property influencing the studied phenomenon and building a metric that expresses its dynamic and static behavior.
An example of this new family of QSAR models is the polarity index method (Polanco et al., 2012), which assumes that the three-dimensional conformation of a protein defines its specific function and is the result of its electromagnetic balance. It also conjectures that this 3D conformation is expressed in the linear sequence formed by its amino acids and that this balance can be measured through their polarity. For this purpose, amino acids are classified into four groups: polar positively charged, polar negatively charged, polar neutral, and non-polar. If the amino acid sequence is read from the N-terminal to the C-terminal (left to right), moving one amino acid at a time, and the 16 possible incidents (the 4 × 4 ordered pairs of polar groups formed by consecutive amino acids) are registered in an array, a comprehensive metric of the polar behavior of the protein is obtained from this linear sequence. If this procedure is carried out on a training data set, an array of polar incidents representative of that particular set is generated. This array can be considered a "fingerprint" of the protein group studied. Since the algorithm can evaluate multiple proteins simultaneously, it can be used for the polar classification of the existing protein groups (Boman, 1995), the exploration of peptide regions of a given length, the construction of new pharmaceutical drugs from fully synthetic proteins, or, in basic science, the discovery of the profile of the first proteins from four billion years ago (Gaucher et al., 2010; Polanco et al., 2013; 2014; 2014b).
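The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the assignment of individual amino acids to the four polar groups, and the names `POLAR_GROUPS`, `polarity_matrix`, and `group_profile`, are assumptions introduced here for the example.

```python
# Assumed four-group polar classification (one-letter amino acid codes).
# The exact assignment is an illustrative assumption, not taken from the text.
POLAR_GROUPS = {
    "P+": "HKR",       # polar, positively charged
    "P-": "DE",        # polar, negatively charged
    "N": "CGNQSTY",    # polar, neutral
    "NP": "AFILMPVW",  # non-polar
}
GROUP_ORDER = ["P+", "P-", "N", "NP"]
GROUP_INDEX = {
    aa: GROUP_ORDER.index(group)
    for group, residues in POLAR_GROUPS.items()
    for aa in residues
}


def polarity_matrix(sequence):
    """Count the 16 ordered group pairs formed by consecutive residues.

    Raises KeyError for symbols outside the 20 standard amino acids.
    """
    counts = [[0] * 4 for _ in range(4)]
    groups = [GROUP_INDEX[aa] for aa in sequence.upper()]
    # Slide one residue at a time from the N-terminal to the C-terminal.
    for left, right in zip(groups, groups[1:]):
        counts[left][right] += 1
    return counts


def group_profile(sequences):
    """Sum the matrices of a training set and convert to relative frequencies."""
    total = [[0] * 4 for _ in range(4)]
    for seq in sequences:
        matrix = polarity_matrix(seq)
        for i in range(4):
            for j in range(4):
                total[i][j] += matrix[i][j]
    n_pairs = sum(map(sum, total))
    if n_pairs == 0:
        return total
    return [[count / n_pairs for count in row] for row in total]
```

For the toy sequence `"KDSA"`, the three consecutive pairs fall in the cells (P+, P-), (P-, N), and (N, NP); summing such matrices over a training set and normalizing gives one possible form of the group "fingerprint".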

FOUNDATION
The mathematical-computational algorithm called the polarity index method (Polanco et al., 2012) has its foundations in the early studies this team carried out on the polymerization of the prebiotic proteins that had to be present 4 billion years ago (Gaucher et al., 2010). Since it was not possible to use the current genetic code (Sharp, 1985), with its 20 amino acids, amino acids were generated at random from nucleotide triplets that were also produced at random. It was observed that although the first amino acids did not correspond to the 20 amino acids known today, neither in number nor in type, it was possible to use the polar profile resulting from the electromagnetic balance reached by each of these amino acids, as this property was defined for all of them. This led to the construction of a polar equivalence (injective mapping) (Vinogradov, 1985) that allowed the computationally built prebiotic proteins to be compared with those known today.
These groups of peptides and proteins (Figs. 1-3) do not coincide in their minimum, maximum, or turning points. The curves of the intrinsically disordered protein group (Fig. 2) are similar; however, there is a translation (shift) between them. The SCAAP group (Fig. 3) is substantially different from the others. The characteristics of the curves are typical of each group, and the proteins within each group are similar. This is why the polarity profile is an effective discriminant for the functional groups (bacteria, fungi, virus, etc.) and structural groups (disordered proteins) studied.
The graphical representations used for different groups of peptides and proteins showed that the polarity matrix is neither symmetric nor antisymmetric (Munkres, 2000). We could verify that the inflection points (Munkres, 2000) located on the X-axis of the smooth curves characterize the group studied, i.e. the location of these points was an effective discriminant (80-90% in a double-blind statistical test); this was tested in more than 14 protein and peptide groups (Polanco & Samaniego, 2009; Polanco et al., 2012; 2013a; 2013b; 2014a; 2014c; 2014d; 2014e). However, we decided to choose the most accurate computational interpretation of the matrix, and the analytical construction of the smooth curve presented a problem, as the characteristic polynomial of the curve differed for each of the techniques used to obtain it.
If we consider that these 16 polar interactions are the characteristics of the main action of a particular group of proteins, then the space where the protein is defined would be either ℝ^16 (real field) or ℂ^8 (complex field) (Munkres, 2000), and as the number of inflection points will always be smaller than this number of incidents, the space where the discriminant property is defined will be a subspace of ℂ^8. Furthermore, the vector x whose 16 components are the elements of the polarity matrix has a norm ||x|| (Munkres, 2000); therefore, every protein can be assigned to a group not only according to the similarities of its 16 components but also according to the length of the vector. From these considerations we can emphasize one aspect of the attributes of the discriminant: the feature that discriminates is identified by the singularities or inflection points (degenerate singularities), and not by the regularities or maximum-minimum points (non-degenerate singularities) observed in the smooth curve. This is how the physico-chemical property studied was identified (Polanco et al., 2012).
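As a rough illustration of treating the polarity matrix as a vector in ℝ^16, the sketch below flattens a 4 × 4 matrix, computes its length, and assigns a protein to the group whose reference vector is closest. The choice of the Euclidean norm and of a nearest-reference rule are assumptions made for this example, since the text does not specify them, and the function names are hypothetical.

```python
import math


def flatten(matrix):
    """Turn a 4 x 4 polarity matrix into a 16-component vector."""
    return [value for row in matrix for value in row]


def norm(x):
    """Euclidean length ||x|| (assumed norm; the text does not name one)."""
    return math.sqrt(sum(value * value for value in x))


def distance(x, y):
    """Euclidean distance between two 16-component vectors."""
    return norm([a - b for a, b in zip(x, y)])


def nearest_group(x, fingerprints):
    """Return the name of the group whose reference vector is closest to x.

    `fingerprints` maps a group name to its 16-component reference vector.
    """
    return min(fingerprints, key=lambda name: distance(x, fingerprints[name]))
```

A usage example would pass the flattened matrix of a query protein together with a dictionary of group fingerprints; comparing `norm(x)` across groups gives the second, length-based criterion mentioned above.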
As mentioned before, the electromagnetic balance of the peptide or protein is classified into four polar groups (Pauling, 1960) closely related to the nature of the elements in all living matter, which is mainly formed by carbon (C), hydrogen (H), oxygen (O), and nitrogen (N). Therefore, the electromagnetic balance should be defined as a quantum electromagnetic balance, i.e. the nature of the balance is not Newtonian, since this balance is the result of the energy exchange between the atoms and the particles in conjunction with the nucleus of the elements. At this level it can be explained as the polarity profile or electronegativity of the amino acid.

MEDICAL IMPLICATIONS
The medical implications of Bioinformatics in the manufacture of new pharmaceutical drugs are incipient (Khan, 2011), mainly because the mathematical-computational algorithms known today do not include an exhaustive computational verification: they focus on the assessment of a property that is presumed to be an effective discriminant, and exclude the virtual recreation of the environment where the synthetic peptide acts. The complexity involved in simulating this virtual scenario is certainly high; currently it is replaced by the synthesis of peptides and their subsequent experimental testing in a laboratory. However, the new generation of Bioinformatics algorithms will have to provide a virtual scenario as well as a drastic reduction in the lab testing of the synthetic proteins they produce. Furthermore, we think that the construction of such a virtual scenario to test peptides should be a global initiative (Goodman, 2011) involving Bioinformatics research groups from different countries, for two reasons: the complexity of constructing the virtual scenario, and the need to standardize the factors and variables considered so that synthetic peptides are always evaluated under similar conditions. This initiative is important as it would prevent bioinformatics algorithms from being used merely as "filter algorithms", improve their efficiency, and bridge the gap between academic and industry institutions and regulatory agencies (Lesko, 2012).
In this work, we presented the results obtained with the polarity index method for three groups of proteins that are a current topic in medicine: SCAAPs, intrinsically disordered proteins, and lipoproteins related to atherosclerosis. The efficiency of the SCAAPs found in nature is high; however, there are two problems: the increasing difficulty of finding them in other organisms and the high costs involved in their synthesis and experimental verification. Therefore, it is imperative to encourage the identification of SCAAPs, given the resurgence of MDROs and the epidemic outbreaks that turned pandemic during the last decade. The intrinsically disordered proteins have been associated with the neurodegenerative diseases known as amyloidosis, which will have a high impact on the world population during the next decades. The proteins related to atherosclerosis are associated with coronary artery disease, which is the leading cause of death in the USA and has for decades been a problem affecting the health of workers in Europe.

PERSPECTIVES
In humans, 25 000-30 000 genes encode proteins, so it is reasonable to consider the existence of 500 thousand to one million different proteins. This is the result of two factors: a gene may express different proteins, and proteins undergo post-translational changes (Crawford et al., 2004). Assuming that a computational algorithm takes only one second to analyze the linear sequence of a protein, analyzing one million proteins would take about eleven days of continuous processing on a uniprocessor computer, or on the order of an hour on a 200-processor cluster (Niiler, 2001). The problem lies not in processing but in the effectiveness of the algorithm and, as noted before, applying all known algorithms to the same protein does not provide more effectiveness but makes the analysis impractical because of the processing time. Hardware and software are not, and will never be, an impediment to the bioinformatics processes applied to Proteomics, but efforts should be aimed at doubling mass storage capacity and simultaneously reducing data processing time.
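The processing-time estimate above can be checked with a short calculation. The sketch assumes the upper bound of one million proteins, one second per protein, and an ideal linear speed-up across processors (no communication or scheduling overhead); the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def processing_seconds(n_proteins, seconds_per_protein=1.0, n_processors=1):
    """Wall-clock seconds, assuming ideal linear speed-up across processors."""
    return n_proteins * seconds_per_protein / n_processors


# Upper bound of the protein count quoted in the text.
n_proteins = 1_000_000

days_single = processing_seconds(n_proteins) / SECONDS_PER_DAY
hours_cluster = processing_seconds(n_proteins, n_processors=200) / SECONDS_PER_HOUR

print(f"uniprocessor: {days_single:.1f} days")
print(f"200 processors: {hours_cluster:.1f} hours")
```

Under these assumptions the figures come out at roughly 11.6 days and 1.4 hours for one million proteins, and half of that for 500 thousand, which matches the orders of magnitude quoted in the text.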
We think that in the near future the approach to the metrics in new algorithms should be reconsidered to improve their effectiveness, using the known physico-chemical properties but changing their algebraic structure so that they fully capture the dynamic and static aspects of the property studied. As already mentioned, the physico-chemical property polarity has been considered in many Bioinformatics algorithms (Qureshi et al., 2014); however, it was its comprehensive assessment, which considers the 16 possible polar interactions, that made the difference. Reconsidering the approach does not mean starting from scratch, but examining the most frequently evaluated physico-chemical properties and studying them separately to avoid the over-expression of a property. In a minimalist approach, this means not only expressing the physico-chemical property in the broadest sense, but also isolating it, i.e. if a property defines what is sought, it should not coexist with another property, as this will distort the algorithm. Future algorithms should aim to be exhaustive but minimalist at the same time. A final aspect to consider during the design of these algorithms is that they should be embarrassingly parallel (Snir, 1998); that is, their tasks should be independent enough to be processed simultaneously, and computer programs should be written with the same outlook. It is worth mentioning that this technique is not new; its history goes back to 1950 (Wolinsky, 2007).
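The embarrassingly parallel design mentioned above fits naturally when each protein is analyzed independently of all others. The following sketch, using Python's standard `concurrent.futures` module, illustrates the pattern; the per-protein task `count_charged` is a placeholder assumption standing in for a real metric, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_charged(sequence):
    """Toy per-protein task: count charged residues (placeholder metric)."""
    return sum(sequence.upper().count(aa) for aa in "HKRDE")


def analyze_all(sequences, max_workers=4):
    """Apply the per-protein task to every sequence independently.

    No task depends on another, so the work can be split freely
    among workers; results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_charged, sequences))


if __name__ == "__main__":
    print(analyze_all(["KDSA", "AAAA", "HHRE"]))
```

A thread pool is used here only to keep the sketch short. For CPU-bound metrics in CPython, `ProcessPoolExecutor` from the same module offers the same `map` interface and is usually the better fit.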
Finally, in our opinion it is essential to continue exploring the polar profile of the first proteins and the effect that the bombardment of minimally biased amino acids had on them billions of years ago, as current knowledge of proteins is negligible compared with the information this span of time can provide, particularly about the role that biases played during the formation of amino acids. On this topic it will be essential to implement broad prebiotic scenarios that allow the recreation of multiple variables from stochastic processes (Rabiner, 1989). In a few decades, the design of new drugs will face a drastic reduction in experimental tests on animals (European Commission, 2014). This will involve designing new algorithms not only according to the guidelines mentioned above but also consistently with computational biological scenarios that minimize the number of synthetic proteins tested. The challenges are great and the financial implications considerable, but with the emergence of Multidrug-Resistant Organisms it is evident that it is the human race that is at stake, and we have to be prepared to spare no effort in that endeavor (Zuckerman et al., 2009).

Figure 2. Relative frequency distribution for the unfolded and folded proteins (Polanco et al., 2015a). The X-axis represents the 16 polar interactions.