Mapping of the Influenza-A Hemagglutinin Serotypes Evolution by the ISSCOR Method

Analyses and visualizations by the ISSCOR method of influenza virus hemagglutinin genes of different A-subtypes revealed some rather striking temporal relationships between groups of individual gene subsets. Based on these findings we consider application of the ISSCOR-PCA method for analyses of large sets of homologous genes to be a worthwhile addition to a toolbox of genomics - allowing for a rapid diagnostics of trends, and ultimately even aiding an early warning of newly emerging epidemiological threats.


Introduction
Living organisms have very often quite biased preferences for synonymous encoding of the same amino acids. These differences and their variation have been extensively studied, however, no decisive governing rules have yet been discovered. Frequencies of codons for many species are in close correlation with their genome's GC contents, but the underlying forces governing this are not clear -it might be possible, that it is the GC content which is determining a genome's amino acids predilection for the specific codons being used and their bias [1]. On the other hand, it might be that reverse causative relationships are in operation: codons-specific amino acids usage is a driving factor for observed GC contents. Possible factors and forces driving synonymous codons usage postulated so far include, among many others: translational optimization [2][3][4][5][6], mRNA structural effects [7], protein composition [8], and protein structure [9], gene expression levels [2,10], the tRNA abundance differences between different genomes, and tRNA optimization [11][12][13], different mutation rates and patterns [14]. Also, some other possibilities were hypothesized, like local compositional bias [15], and even gene lengths might play a role too [16].
It is clear, that many interesting biological mechanisms underlie the basic phenomenon of genetic code degeneracy. One of its aspects, however, has not been studied until recently [17] -the question dealing with the sequential order of occurrence of synonymous codons. Obviously, an order of elements in a linear set is a different property, than the frequency of elements in the set. The amino acid composition of a protein carries much less information than the amino acid sequence of such protein, which in turn is less information intensive than a corresponding nucleotide sequence coding the same protein. This question can be formulated more precisely if we consider a given frequency of synonymous codon usage characteristic for a gene. There is a very large number of different orders in which the synonymous codons can appear sequentially along the gene without changing either the amino acid sequence of the encoded protein, or the codon usage of the gene.
Influenza viruses are antigenically variable pathogens, capable to continuously evading immune response. Influenza epidemics in humans cause an estimated 500,000 deaths worldwide per year. The genome of influenza A viruses consists of eight RNA segments that code for 10 viral proteins. Based on the antigenic specificities of the hemagglutinin (HA), or neuraminidase (NA) proteins the influenza A viruses have been divided respectively into 16 HA (H1-H16), and nine neuraminidase (N1-N9) subtypes. Accumulation of mutations in the antigenic sites of the HA and NA, altering viral antigenicity, is called the ''antigenic drift''. In circulating influenza viruses this antigenic drift is a major process, accumulating mutations at the antibody binding sites of receptor proteins, and enabling the virus to evade recognition by hosts' antibodies. The HA protein consists of two domains, HA-1 and HA-2 -the HA-1 domain, the major antigenic protein of influenza A viruses, contains all of the antigenic sites of HA, and it is under constant immune-driven selection. The segmented nature of influenza genome allows also for exchange of gene segments -a process of genetic reassortment, involving type A influenza viruses of different subtypes, and may result in the so called ''antigenic shift'', which occurs when progeny viruses that possess a novel HA, or a novel HA and NA, emerge [18], [19].
Recently we have proposed an in silico method [17] to tackle the problem of the sequential order of synonymous codons, called ISSCOR (Intragenic, Stochastic Synonymous Codon Occurrence Replacement) -synonymous codons, which occur at different positions of an ORF are replaced randomly by a Monte Carlo routine with their equivalents -the method generates nucleotide sequences of non-original ORFs, which have identical codon usages, and would encode identical amino acid sequences. The ISSCOR method was then used to analyze temporal and spatial aspects of the three sets of orthologous gene sequences isolated from various strains of hemagglutinin of the influenza A virus subtypes: A/H3N2, A/H1N1 (of both the seasonal, and the 2009 pandemic variants), and A/H5N1 [20] in an alignment-free manner.
The rich collection of the data gathered recently during the last bouts of the pandemic H1N1 flu gave possibility to give a fresh look on the perennial questions of influenza epidemiology. The role of founder effects is important for epidemiological scenarios [21,22] assuming that a genetic variability common to a small founder population will then also be found in most descendants. In viral outbreaks, such effects can be at play when specific mutations are enriched in samples coming from the same region, and/or the same time. Considering phylogenetic relations it is useful to identify such viral lineage founder events. The global strain sequencing efforts, combined with robust statistics allow novel insights into phylogeny, and especially variability of this highly changeable RNA virus. The variability problem is of especially high interest in view of recent results concerning the switching of receptor selection by the hemagglutinin [23,24] leading to a possible acquiring of airborne infection transmissibility for mammalian hosts. Russell et al. [25], based on the combined results of Imai et al. [23] and Herfst et al. [24], proposed a mathematical model of within-host H5N1 virus evolution to study some aspects influencing increase or decrease in probability of subsequent substitutions leading to aforementioned switch. They stressed that more data are needed for assessing calculated evolution rates based on the assumed mutation rates, which are of high interest for weighting out the speed of evolution of HA in switching receptor selection. In the current work we have postulated that rates of mutation frequencies in HA commonly accepted are routinely overestimated for at least one order, and proposed an enhanced method of finding evolutionary correlations between multiple strains of the H1N1 2009 pandemic virus [26].
The goal of the current work is to explore in some detail relationships between different orthologous hemagglutinin sequences of various influenza-A serotypes, using their ISSCOR descriptors. As these descriptors are easy to calculate, and yet encompass a rich spatial information involving also long range interactions, together with immediate neighborhoods of constituting residues, it should be possible to correlate stochastic genes' representation based upon the deviates, and their possible 3D epitopic functional interactions -either through theoretical 3D modeling, or perhaps through building associations with appropriate (in vitro?) in vivo data.

Materials and Methods
The A-influenza hemagglutinin full length gene sequence data, isolated mostly from avian, human and swine hosts, available for serotypes: H1N1 (seasonal), H1N1 (2009-2010 pandemic), H1N2, H2Nx (all neuraminidases), H3N2, H5N1, and H7Nx (all neuraminidases) were obtained on Dec. 9 th 2012 from the NCBI influenza resource. From this collection unique sequences were selected -such that from each subset of identical genes, the ones with the earliest dates of sample isolation were chosen as representatives.  Table I Overview of the ISSCOR approach: Previously [27,28], we have described alignment free approaches to the problem of comparison and analysis of complete genomes, and some techniques enabling to cope with the sparseness of the n-gram type of genomic information representations. The problem of sparse occurrence matrices is not only present, but even more pronounced when dealing with the number of permutations of the possible synonymous codons. Calculating the set of n-grams for such occurrences will lead to a vector representations, which are severely sparse, especially for higher n-grams lengths, and hence to very poor statistics. To alleviate this problem, we proposed [17] a hybrid approach. Namely, when computing counts of codon-pair patternsseparated by codon sub-sequences of differing length -the actual composition of these spacer sub-sequences will be neglected. However, when such partial counts are used as a composite set, poor statistics is no longer a hindering obstacle, and the complete information about particular n-gram frequencies profile is preserved, albeit in a distributed and convoluted form.
For every protein coding gene, with its original nucleotide sequence j 0 , a set of equivalent nucleotide strings (j 1, j 2, j 3,…, j N ) is created by a Monte Carlo approach. These artificial sequences have the following properties: • they are all of the same nucleotide lengths as the j 0 ; • they have exactly the same amino acid sequence as the j 0 (i.e., the proteins translated from the j 1, j 2, j 3,…, j N are identical to j 0 ); • they have in the vast majority of cases a synonymous codon order different from the original sequence j 0. The last is an essential point, which merits a commentary. The probability that a given string j i generated stochastically has the same synonymous codon order as the original j 0 decreases with the product of its length, with a probability limit tending rapidly to zero. Therefore, the ISSCOR method allows comparing the original codon sequence with an ensemble of different synonymous sequences -yet all of them coding for the same sequence of amino acids.

The Computational Procedure:
Full description of the method is given in [17], however, mathematical steps are briefly outlined here for convenience. First, the codon usage frequencies are determined, and on that basis the probabilities of replacement are calculated, separately for each codon-degeneracy equivalence group E: where: the -probability that any other codon from the same degeneracy equivalence group E -will be randomly replaced by the codon k; and u k P k is the synonymous codon k triplet frequency for a given amino acid in a whole gene. Therefore, obviously for any given degeneracy equivalence group E, the sum of such probabilities will always be equal to 1.
Then, successively for each codon in a gene the procedure of it's synonymous random replacement is performed based on probabilities according to the equation (1). Finally, the resulting shuffled sequences are determined, and compared to the original sequence of the gene.
For each protein coding sequence, we need to determine a complete matrix of all codon-pair patterns. Obviously, in a protein coding sequence, there are at most 3904 (61x61 + 61x3) unique codon-pair patterns. For a given sequence V, and the all codon-spacer lengths, in order to calculate observed values of a particular codon-pair pattern (c k , c l ), first we need to construct a series of matrices O λ (occurrence matrices). Each element of every matrix O λ contains the counted sum of all specific codon-pair patterns (c k , c l ), separated by a string of other codons present in this sequence, where the λ denotes the number of other codons separating the given codon-pair pattern (c k , c l ). Using a sliding window of the length 3*(λ+2) nucleotides, and starting at the position m, we would scan the whole sequence V, calculating elements of the matrix by the formula: where M is the sequence's length, and Comparisons involve matches between the predefined codon-pair patterns, of the first codon c k , always taken together with the second codon c l . That is, a particular positional comparison p involves only one nucleotide from the first codon c k , and one nucleotide from the second codon c l , ignoring all four remaining nucleotides, which corresponds to a pattern (for convenience we name each such pattern a hexon). Thus, there are e.g., nine patterns containing the adenine (A) at any position in a first codon, together with the cytosine (C) at any position in a second codon, etc. Obviously, when λ = 0, one has an adjacent codon-pair pattern (hexanucleotide), for λ = 1 it is a nonanucleotide, and so on. Note, that since these are ordered counts, each starting at the sequence's 5'-terminus, the matrices O i λ are not symmetrical, that is the count of the pair (c k , c l ) is different from the count of the pair (c j , c k ).
To make the results independent of a particular sequence size (or a set of sequences, as described already in [17] on the example of the complete genome of Helicobacter pylori), we need to calculate how much the number of actually observed hexons in the original sequence, differs from the mean number of the corresponding hexons, observed after performing N number of random ISSCOR permutations, divided by the standard deviation observed in the corresponding shuffled sequences: are the numbers of occurrences for any given hexon pattern counted after codons of the whole sequence have been shuffled randomly (as described above), thus the / N is a mean number of such occurrences after the N such random shuffles; n shuffled R STD shuffled is a standard deviation for all occurrences of a given hexon pattern, after N random shufflings of the whole sequence (we have determined previously [17] that 500 shuffles are sufficient to obtain systematic and highly repetitive results).
Phylogenetic analyses were performed using the classic Neighbor Joining [29,30], and the modified NJ algorithm: QPF [31]. Also the results from the II-iteration trees after multiple alignment, available through the MUSCLE package [32], were checked for consistency with distance-based methods. For tree manipulations (in the Newick format) and their visualization the Dendroscope package of Huson et al. was used [33].

Results and Discussion
The ISSCOR deviates (equation 3) for all the 9131 hemagglutinin sequences collected were calculated as described earlier, using. codon spacer values from λ=0 to λ=16, and creating the matrix MA of 9131 rows by 2448 columns. The results of principal component analysis (PCA) for the matrix MA showing 9131 data points (marked in light gray), will be presented on a series of figures (Figs. 1 to 4), showing maps of PC-1 values (45.4% of a total variance explained) -plotted on the abscissas, and the PC-2 values (further 18.3% of a total variance) -plotted on the ordinate axes respectively. The intent is that the 9131 points present in all the maps will provide a common frame of reference. On such a background the points corresponding to the different serotypic subsets of genes will be color-coded, and grouped also by the hosts infected (avian, human, swine, or in few cases also ferrets).

Chronology of evolution
The presumed trend in a chain of infectivity from avian, through porcine, to human hosts can be well observed e.g. in the case of H3N2 serotype on the right hand column panels in Figs. 2A and 2B. However, it is interesting that for the 2009 pandemic H1N1 strains, isolated from avian hosts, there was most probably a reversal of influenza infectivity chain -although it is usually considered that the avian hosts viruses form a primeval reservoir of infections; and the process of the propagation follows from avian, through mammals (mostly porcine) to human hosts. Such a conclusion, stemming purely just from the analyses of the maps here, was subsequently confirmed by the search in original literature: there are only four such isolates present in the NCBI database to date (accessions: HM370960, HM370967, HM370975, HM450134) -all quite late, isolated in Canada from turkeys [34] between Oct. and Dec. 2009. These sequences group on the map together with the cluster of the isolates from human and swine hosts (the left column on Fig. 1).
In contrast to the H3N2 strains, which all form together one large, elongated cluster, with a clearly visible evolutionary trend (right side of Figs. 2A and 2B; c.f. also [20]), the behavior of all other isolates (H1N1, H5N1, H1N2, H2Nx, and H7Nx) is much more complex, as they intermingle, forming a network of possible evolutionary paths. It is possible to roughly trace an early chronology of e.g. human and porcine of the seasonal H1N1 isolates (Fig. 1)  Assuming that a hypothetical, primeval ancient influenza hemagglutinin might have occupied a position somewhere in the middle of this triangle, we can follow the chronological spreading of more recent strains towards more and more remote locations (c.f. Fig. 1). Therefore, it seems reasonable to assume that HA orthologs occupying positions in apexes of this triangular distribution can be considered to mark the extents of this gene changeability in influenza A-type viruses, although of course further genetic drift might expand these boundaries even more. The HA sequences of the three most extended positions on the map are all obtained from isolates of human hosts: the topmost -A/Hamburg/1/2005 (H1N1 seasonal; FJ231765), bottom-left -A/Sydney/DD3_17/2010 (H1N1 pandemic, CY092550); and the bottom-right -A/Thailand/Siriraj-06/2002 (H3N2, JN617982), These observations are consistent with the hypothesis that all influenza viruses originated from their ancient stock harbored in wild, migratory aquatic birds -accordingly the avian host isolates are, on average, closest to the center of the triangular spread of HA genes observed here, followed by porcine isolates, and only then isolates from humans.

Putative origins of the XX c. pandemics
The question arises then whether it might be possible to trace back an origin of genetic shifts in HA leading to known major pandemic outbreaks. There is clearly a complete dearth of data of sequences prior to the H1N1 1918 strain. Also, for the next major pandemic emergence in 1957 the H2N2 viruses that caused the Asian flu, but disappeared from human population a decade later. There are only six complete HA H2N2 genes present in this set, which means not sufficient information to draw valid conclusions. The Table II contains the distance matrix, showing the respective numbers of nucleotide differences between these strains (c.f. also Fig. 3B, top-right panel). Schafer et al. [35] determined the most probable avian origin of the Asian flu pandemic, and shown antigenicity of H2 HAs from representative human and avian viruses, as well as their evolutionary characteristics in respective hosts.  Supposedly [35], the human H2 HAs, that circulated subsequently in the 1957-1968 period, formed a separate phylogenetic lineage, most closely related to the Eurasian avian H2 Has, and the antigenically conserved counterparts of the human Asian pandemic strains of 1957 might still continue to circulate in the avian reservoir, continuously coming into a close proximity with susceptible human populations. There was also an increased prevalence of H2 influenza viruses among wild ducks in 1988 in North America [35], preceding the appearance of H2N2 viruses in domestic fowl. As the prevalence of avian H2N2 influenza viruses increased on turkey farms and in live bird markets in New York City and elsewhere, greater numbers of these viruses have come into direct contact with susceptible humans. Unfortunately, the earliest avian host full HA sequence in our 9131 set was isolated only in the 1969, so we can't correlate their findings on the ISSCOR maps. , while the third one is distant from both by about 1200 nucleotide differences, and clearly it was not likely to be a pandemic precursor. Scholtissek et al. [37] concluded that the H3N2 subtype was presumably derived from a H2N2, by retaining seven segments of the H2N2, while the gene coding for the HA was recombined from Ukrainian duck or another highly related avian strain (not present on the ISSCOR map here).

The swine-like virus 2009-2010 H1N1 pandemic
In contrast to the previous outbreaks, during the 2009 "swine flu" H1N1 pandemic an abundant collection of sequential data was gathered in concert all across the globe. Interestingly, among the 178 pandemic strains there are two: the A/Malaysia/2142295/2009 (CY119330), and the A/Malaysia/2142299/2009 (CY119338), both collected on Jan. 9 th , that is much earlier than the commonly acknowledged "start" of the 2009 pandemic in March.
The Figure S1 (suppl. on-line materials) shows the NJ+ phylogram of the 2009 pandemic H1N1 HA all early 178 sequences (isolated between Jan. 1 st and April 30 th ). The tree was calculated together with the 380 putative precursor sequences of other serotypes (all collected during the same period of Jan. 1 st and April 30 th ). The two most probable swine precursors: the H1N1 A/swine/Missouri/46519-5/2009 (HQ378741), and the H1N2 A/swine/Hong_Kong/NS252/ 2009 (CY085998), form a small sub-clade, adjacent to the newly emerged pandemic strains (which form together one tight cluster); with all the other possible ancestors markedly distant. The CY085998 one is closer to the pandemic genes by few mutations than the HQ378741 strain. The CY085998 is a triple reassortant swine strains, present in a large study of phylogenetic evolution relationships of putative precursors to the 2009 H1N1 pandemic, examined in great details in [38] -and although it wasn't indicated as a most probable human pandemic precursor by the authors, its position on their NJ tree (c.f. Fig. S2a in suppl. materials of [38]) marks it as a very likely candidate 1 .
As the incubation and infectivity period of the virus lasts about one week, it is clear that the precursors ought to be extant at a time of possible genetic shift event to occur, however, we did repeat the analysis including also as an additional candidates all the 569 strains isolated during 2008. Partial results are shown on Figure S2 (suppl. materials), besides two sequences described, there were also three other porcine H1N1 strains, preceding the HQ378741, in the same small sub-clade: the A/swine/North Carolina/3793/2008 (JQ624667), the A/swine/Illinois/02064/2008 (CY099095), and the A/swine/Ohio/02026/2008 (CY099159), confirming validity of the former analysis. The relationships between these five putative precursor sequences on the ISSCOR map are shown on Figure S3 (suppl. materials).
The pandemic 2009 H1N1 cluster contains also five other porcine-host HA genes, but they are all more recent than May [39] have found that molecular markers predictive of adaptation to humans were not present in the early (as of May 2009) pandemic H1N1 viruses, and that antigenically the viruses were homogeneous -similar to North American swine H1N1 viruses, but distinct from seasonal human H1N1 isolates. Similarly, Smith et al. [40] concluded that the initial transmission to humans must have occurred several months before recognition of the outbreak Moreover, the unsampled history prior to pandemic means that the nature and location of the genetically closest swine viruses revealed little about the immediate origin of the epidemic.
Surprisingly if we compare the long dominance of over 40 years for the H3N2 serotype, already at two years since the 2009 pandemic outbreak, the new H1N1 variant displays now rather low abundance. Of the 190 full length HAs isolated during eleven months of 2012 till Dec. 9 th there were -from porcine hosts: 115 of the H1N1 seasonal serotype, 2 of H1N2, 34 of the H3N2; from avian hosts: 10 of the H5N1, and 1 of the H7N3; from human hosts: 1 of the H7N3, 3 of the H5N1, 14 of the H3N2; however, there were only 10 of the recently dominant H1N1 2009-2010 pandemic type strains. It might be of a high interest to observe the behavior of swine-like H1N1 strains during the forthcoming 2012-2013 flu season, and its prevalence ratios in comparison especially to the H3N2 isolates.

Laboratory-induced transition to the droplet-transmission infectivity
The Table ST1 (supplementary on-line materials) contains the list of accession numbers for all H5N1 sequences carrying the full pattern of the seven nucleotides (C355, C510, A514, A713, G718, A720, C992) required to be present in order to undergo the transition [24] from avian transmissible (the "before" state) to mammals airborne-transmissible (the "after" state, i.e. carrying the pattern: T355, A510, G514, T713, A718, C720, T992). The distributions of genes carrying each of the seven individual nucleotides, specific to this transition (c.f. [24] and their table S1 on page 15 of supplementary materials -pls. note that the numeration of nucleotides in their table differs from the shown here -our numbering starts always from the UAG codon of each gene) are mapped on Figs. 4A "before", and 4B "after" transition. The distribution of the single-individual "after" positions is quite strikingly different from the distribution of the single-individual "before" positions -too numerous to list here, except for the T355 position (present only in 17 strains; top-left panel on Fig. 4B), and the T713 position (present only in 23 sequences; second top-right panel on Fig.  4B).
It is also noteworthy, that as this transition can take place only among H5N1 strains, as all other strains are already capable of droplet borne infectivity in mammals, the three mutations: C355T, A713T, and A720C are indeed crucial for the emerging strain -among all the 9131 alleles they are present together in just a single pair of sequences: CY116646 and CY116654, indicative of their artificial, laboratory origin. Indeed, the visualization which of the "before" seven mutated positions were present with just one exception (that is which genes were carrying six out of the seven "before" nucleotides, c.f. Fig. 4S in suppl. materials) is quite striking. In contrast to a very broad dispersal for the single "before" positions as seen on Fig. 4A encompassing each of HA serotypes; all of the six-out-of-seven cases were present exclusively among H5N1 isolates (mostly avian, but also 8 porcine, and 54 human, c.f. Table ST1); ranging in numbers from 220 (positions: C355, G718, C992; and also the all seven positions) to 404 (for the position A514) sequences. Our results differ from these of Russell et al. [25] who performed a phylogenetic analysis of some H5N1 strains to reveal temporal, and to some extent also spatial, distributions of the between two to five of these seven mutations. On the one hand, it might seem that 220 orthologs, each carrying already all seven of the prerequisite "before" positions indicate quite substantial presence within all H5N1 strains in the whole data set. However, the comparison to just one sequence in the mutated "after" state clearly indicates rather very small probability of such an event occurrence.

Conclusions
The ISSCOR-PCA method enables fast and efficient visualization of evolutionary relations present in a very large, complex set of homologous sequences -otherwise not an easy and rather tedious task for other phylogenetic analysis algorithms available, when it involves many thousands of sequences. Our approach significantly simplifies the effort, by producing two-dimensional projections from the multidimensional hyperspace of descriptors characterizing each of individual strains, allowing a clear evaluation of the genetic diversity ranging inside such large set of homologs.
Based on the size of the sequences 9131 set, we have examined the odds of putatively tracking an origins of all flu pandemics in XX c, however, their mostly unsampled history and especially severe paucity of zoonotic data for all major genetic shifts prior to the H1N1 pandemic of 2009-2010 deemed the task not possible. In contrast, for the 2009 "swine flu" H1N1 pandemic an abundant collection of sequential data was gathered from human hosts, but again not so many from the porcine or avian ones. The ISSCOR maps confirmed the close affinity of the earliest HAs of human isolates to their tentative precursors from swine, and yet even for this well documented epidemic the nature and location of the genetically closest swine viruses reveal little about the immediate origin of the infection. Clearly, much higher ratio of porcine and avian isolates needs to be routinely monitored in future for analyses to feasibly pinpoint inter-species acts of transmission, even if only in ex post descriptive a manner.
In an interesting study Renzette et al. [41] examined ab initio passaging of the A/Brisbane/59/2007 (H1N1), and the A/Brisbane/10/2007 (H3N2) in MCDK cell cultures, followed by a deep sequencing study, and demonstrated that some surprises might await there, as they have shown rather unexpected increase in both serotypes' viral diversity -occurring concurrently at the same time, albeit in two separate cell lines. As a deep sequencing ab initio experiments produce large amounts of reliable evolutionary data it would be of interest to study them in much more detail. Therefore, it is of high interest to apply the ISSCOR-PCA method presented here to such an ab initio dataset, optimally also in conjunction with ancestry tracking based on disentangled neighbor joining trees [26].
Tracing the chronology of individual strains isolation times on the maps specific to different serotypes revealed that oldest strains occupy mostly positions in the middle of the roughly triangular shape (c.f. Fig. 1) distribution, whereas newer strains spread gradually towards apexes of that triangle. This point is of importance -our analysis shows that the most distant sequences of hemagglutinin were all isolated from human hosts: A/Hamburg/1/2005 (H1N1 seasonal), A/Sydney/DD3_17/2010 (H1N1 pandemic 2009-2010); and A/Thailand/Siriraj-06/2002 (H3N2). The ISSCOR analysis, unexpectedly for us, showed that the hemagglutinin variability is largest in case of strains invading humans, and seems to be less pronounced in case of strains detected in birds. However, as sampling is much skewed towards human strains (c.f. Table I), the statistics of the collected set do not allow for any far-reaching hypotheses concerning species' specific virus-host interactions. Nevertheless, taking into account the sheer volume of the data analysed we propose that the edges of this assembly delimit the extent of genetic diversification of the influenza virus hemagglutinin. This bears on the immunological variability of this gene, allowing for broader look on influenza epidemiology.

Conflicts of interests
Authors declare no conflict of interests. Figure S4 -The enlarged fragment of the PC-1 vs. PC-2 map ot of principal component analysis of the ISSCOR descriptors for the 9131 full-length hemagglutinin sequences (light gray points), with the sequences corresponding to the region of the H5N1 for the "before" state of avian transmissible to mammals airborne-transmissible. The sequences carrying all seven positions (as described in [24]), as well as carrying combinations of the six positions with exclusion the one indicated, are color-coded as shown.