Applications of Correspondence Analysis to Genome data

Use of Correspondence Analysis in Genome Exploration

Fredj Tekaia

Text and figures are in http://www-alt.pasteur.fr/~tekaia/caongenomes.html

Introduction

The growing number of completely sequenced organisms offers the opportunity to investigate systematically, and as a whole, their predicted ORF products, their codons and their amino-acid content. In genome exploration it is of interest to identify organisms, or subsets of their ORF products, that are characterized by preferred amino-acids or codons. These investigations generally involve large and complex datasets. An appropriate method for handling such data as a whole is Correspondence Analysis* (CA). This method has recently been applied in the analysis of B. subtilis (Kunst et al.), in comparing the amino-acid composition of M. tuberculosis (Cole et al.) with that of the completely sequenced organisms available at the time, in a study of prokaryotic genome evolution assessed by comparing codon usage patterns in three completely sequenced organisms (McInerney 1997), and in a study of replicational and transcriptional selection on codon usage in B. burgdorferi (McInerney 1998).

In this work we use this method to present a systematic analysis of all available completely sequenced organisms, comparing them according to their amino-acid and codon compositions.

Materials and Methods

Correspondence Analysis

A brief description of the method is presented here; detailed explanations with worked examples can be found in Benzecri 1973 and in Greenacre 1984. Correspondence analysis is a multivariate method that applies to tables of positive numerical data. The lines (rows) of such tables are the "observations" or "cases" and the columns are the "variables". The method constructs an orthogonal system of axes (called factors and denoted F1, F2, etc.) on which observations and variables can be displayed jointly. The factors are ranked by the amount of information they represent and are therefore presented in decreasing order of importance. A maximum of n-1 such factors can be determined, where n is the lower of the number of observations and the number of variables. The information included in a subspace of dimension p (p <= n-1) equals the sum of the information carried by its p factors. The average proportion of the total information represented by one factor is 100/(n-1) percent. This value serves as a guide in judging the relative importance of a given factor. In this system, proximity between observations, or between variables, is interpreted as strong similarity. Proximity between an observation and a variable is interpreted as a strong relationship. The ability to display observations and variables simultaneously in the same factorial space makes it easy to discover the salient information contained in a given data table.
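The construction described above can be sketched numerically. The following is a minimal illustration, assuming the standard formulation of correspondence analysis as a singular value decomposition of standardized residuals (see Greenacre 1984); it is not the software used for this work. The function name correspondence_analysis and the small table of base counts are hypothetical and serve only to show the shape of the computation.

```python
import numpy as np

def correspondence_analysis(K):
    """Minimal CA sketch: K is a table of counts (lines = observations, columns = variables)."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()                       # table of relative frequencies
    r = P.sum(axis=1)                     # masses of the observations
    c = P.sum(axis=0)                     # masses of the variables
    expected = np.outer(r, c)             # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)          # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    n_factors = min(K.shape) - 1          # at most n-1 factors
    U, sv, Vt = U[:, :n_factors], sv[:n_factors], Vt[:n_factors]
    inertia = sv ** 2                     # information carried by each factor
    percent = 100 * inertia / inertia.sum()
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # coordinates of observations
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # coordinates of variables
    return row_coords, col_coords, percent

# Made-up base counts for three hypothetical organisms (columns: A, C, G, T).
K = [[30, 20, 20, 30],
     [20, 30, 30, 20],
     [40, 10, 12, 38]]
row_coords, col_coords, percent = correspondence_analysis(K)
print(percent)  # percentage of the total information carried by F1 and F2
```

Observations and variables share the same axes, so the first two columns of row_coords and col_coords can be plotted together to obtain the usual F1/F2 plane.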

A very simple example showing the interest and strength of the method is the distribution of completely sequenced organisms according to their base (A, C, G and T) composition. Considering only the A, C, G and T columns, this table can be analysed by eye to see which organisms are G+C rich, which are A+T rich, and which have equivalent (G+C) and (A+T) values. All of this can be seen in the following Figure obtained by correspondence analysis. Organisms are plotted near the bases for which they have high values. As expected, G and C have similar profiles, and so do A and T. Harder to notice by eye (unless one is already aware of this particularity) is that in M. pneumoniae (MP) the proportions of A and T are markedly different. MP therefore appears on the figure as an outlier, situated in the direction of A, meaning that MP has a relatively high proportion of A.

This simple example shows how information included in a given data table can be synthesized on a graph in a very efficient way.

If K denotes the table describing the ORFs of a given organism, each value Kij equals the number of occurrences of codon j in ORF i. If K describes the predicted ORF products of a given organism, Kij is the number of occurrences of amino-acid j in predicted ORF product i. If K describes the organisms as defined by their amino-acid compositions, Kij is the proportion of amino-acid j in organism i.
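As an illustration of the first kind of table, the sketch below builds the codon counts Kij from ORF nucleotide sequences. It assumes that each ORF is read in frame from its first base and that codons containing ambiguous bases are ignored; the names CODONS and codon_counts and the two short sequences are hypothetical.

```python
from itertools import product

# The 64 codons in a fixed order: these are the columns j of the table K.
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]

def codon_counts(orf):
    """Return one line of K: the number of occurrences of each codon in the ORF."""
    counts = dict.fromkeys(CODONS, 0)
    usable = len(orf) - len(orf) % 3          # ignore an incomplete trailing codon
    for pos in range(0, usable, 3):
        codon = orf[pos:pos + 3].upper()
        if codon in counts:                   # skip codons with ambiguous bases (e.g. N)
            counts[codon] += 1
    return [counts[codon] for codon in CODONS]

# Two made-up sequences, for illustration only.
orfs = {"orf1": "ATGGCTGCTTAA", "orf2": "ATGAAAAAAGGTTGA"}
K = [codon_counts(seq) for seq in orfs.values()]
print(K[0][CODONS.index("GCT")])  # occurrences of GCT in orf1
```

The table of amino-acid counts can be built in the same way from the translated ORF products, with 20 columns instead of 64.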

It should be noted that the average percentage of the information per factor, 100/(n-1), where n denotes the lower of the number of observations and the number of variables, is:

5.3, if n-1 = 19 (n = the number of amino-acids);

5.9, if n-1 = 17 (n = the number of organisms considered);

1.6, if n-1 = 63 (n = the number of codons).
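These three values follow directly from the formula 100/(n-1). A short check, in which the function name average_percent_per_factor and the ORF count of 500 are hypothetical (any number of ORFs larger than the number of columns gives the same result):

```python
def average_percent_per_factor(n_observations, n_variables):
    """Average percentage of the total information carried by one factor."""
    n = min(n_observations, n_variables)
    return 100 / (n - 1)

# 500 is an arbitrary stand-in for the number of ORFs of an organism.
per_amino_acid = average_percent_per_factor(500, 20)   # ORFs x amino-acids
per_organism = average_percent_per_factor(18, 20)      # organisms x amino-acids
per_codon = average_percent_per_factor(500, 64)        # ORFs x codons
print(round(per_amino_acid, 1), round(per_organism, 1), round(per_codon, 1))
```

A factor carrying clearly more than this average can be regarded as relatively important for the table at hand.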

Observations and variables are defined by their coordinates in the factorial space obtained by CA. They can then be classified according to their neighbourhood (distances) which allows the determination of homogeneous clusters of observations or of variables. A tree can also be constructed to represent the degree of homogeneity between these clusters. Thus when observations represent ORF products or organisms and variables represent the 20 amino-acids or the 64 codons, it is possible to display the ORF products according to their composition and to define clusters of interest.
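The classification step can be sketched as follows. The text does not state which clustering criterion was used, so this example assumes agglomerative clustering with average linkage on Euclidean distances between factorial coordinates; the function name agglomerate, the labels o1 to o5 and their coordinates are hypothetical.

```python
from math import dist

def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[name] for name in points]

    def linkage(a, b):
        # Average Euclidean distance between the members of two clusters.
        return sum(dist(points[x], points[y]) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up (F1, F2) coordinates for five hypothetical observations.
coords = {"o1": (-1.0, 0.1), "o2": (-0.9, 0.0),
          "o3": (0.8, 0.5), "o4": (0.9, 0.4),
          "o5": (0.0, -1.2)}
clusters = agglomerate(coords, 3)
print(clusters)
```

Recording the linkage value at each merge would give the node heights needed to draw a tree of the kind described above.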

Usually observations and variables are displayed jointly on the same factorial plane defined by the first (F1) and the second (F2) factors.

Correspondence analysis also allows subsets of variables or of observations to be represented as "illustrative elements", so that they can be situated with regard to all the other "active" variables or observations. For example, the charged, polar (uncharged) and hydrophobic subsets of amino-acids can be represented as "illustrative variables". Such variables are simply the barycenters of the subsets of amino-acids they represent.
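A barycenter of this kind can be computed as below. The sketch assumes that each amino-acid is weighted by its mass (its share of the table total), which is the usual convention in correspondence analysis but is not stated explicitly in the text; the function name barycenter, the coordinates and the masses are made up for illustration.

```python
def barycenter(coords, masses, subset):
    """Mass-weighted mean position of the variables listed in subset."""
    total = sum(masses[a] for a in subset)
    n_dims = len(coords[subset[0]])
    return tuple(sum(masses[a] * coords[a][d] for a in subset) / total
                 for d in range(n_dims))

# Hypothetical (F1, F2) coordinates and masses for the charged residues DEKRH.
coords = {"D": (1.0, 0.0), "E": (2.0, 1.0), "K": (3.0, -1.0),
          "R": (0.0, 2.0), "H": (-1.0, 0.5)}
masses = {"D": 0.05, "E": 0.07, "K": 0.06, "R": 0.05, "H": 0.02}
charged = barycenter(coords, masses, "DEKRH")
print(charged)
```

The resulting point is plotted on the factorial plane alongside the active amino-acids without having contributed to the construction of the factors.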

List of the considered organisms

The following 18 currently available organisms+ (with their abbreviations) were considered in this work: S. cerevisiae (Y), C. elegans (CE), M. jannaschii (MJ), M. thermoautotrophicum (MTH), A. fulgidus (AF), H. influenzae (HI), M. genitalium (MG), M. pneumoniae (MP), Synechocystis sp. (Ssp.), E. coli (EC), H. pylori (HP), B. subtilis (BS), B. burgdorferi (BB), M. tuberculosis (MT), A. aeolicus (AE), T. pallidum (TP), Pyrococcus horikoshii OT3 (PH), Chlamydia trachomatis (CT).

Results

We have systematically applied CA in the investigation of these organisms by comparing:

a) all organisms according to their amino-acid composition and according to their codon usage;

b) all predicted ORF products of each organism according to their amino-acid composition and according to their codon composition.

Organisms versus amino-acid composition

We first applied correspondence analysis to the data table including all available completely sequenced organisms, each defined by its amino-acid composition. Results are displayed in Figure 1a. This figure shows the distribution of the organisms in the factorial space defined by the first (F1) and the second (F2) factors. As indicated on the figure, F1 and F2 represent respectively 52.4% and 25.5% of the total information included in the analysed data table. The organisms are distributed along axis F1 roughly in increasing order of their G+C content, from B. burgdorferi (BB), which has the lowest G+C content (28.5%), to M. tuberculosis (MT), which has the highest (65.5%).

In this analysis, charged, polar (uncharged) and hydrophobic denote subsets of amino-acids corresponding respectively to the charged residues (DEKRH), the polar/uncharged residues (GSTNQYC) and the hydrophobic residues (LMIVWPAF). One apparent trend in this figure is that the hyperthermophilic organisms M. jannaschii (MJ), A. aeolicus (AE), A. fulgidus (AF), M. thermoautotrophicum (MTH), P. horikoshii (PH), P. abyssi (PA) and T. maritima (TM) are situated near the charged amino-acids (charged), which suggests that these organisms encode, on average, higher levels of charged amino-acids. These organisms are mainly characterized by relatively high values of Glutamic acid and low values of Glutamine. Conversely, the uncharged amino-acids (uncharged) are more specific to the bacterial and eukaryotic organisms, including H. pylori (HP), H. influenzae (HI), C. elegans (CE), S. cerevisiae (Y), M. genitalium (MG) and M. pneumoniae (MP), which encode, on average, higher levels of polar residues. This is mainly due to relatively high values of Glutamine and low values of Glutamic acid. This result is confirmed by the classification of the considered organisms according to their neighbourhood in the whole factorial space, as shown in Figure 1b. This tree shows two main clusters, each including three distinct clusters. More precisely, this tree shows that the considered organisms can be partitioned, according to their amino-acid composition, into 7 main clusters as follows:

  1. (MJ, BB, RP, CJ, HP);
  2. (MG, MP);
  3. (AE, PH, PA, TM, AF, MTH);
  4. (MT);
  5. (TP, Ssp, EC);
  6. (BS, HI, CT, CP);
  7. (CE, SP, SC).

Each of these clusters shows more or less variability between its members. Since the vertical lines between nodes are proportional to their similarity, a more stringent similarity criterion between organisms results in more homogeneous and distinct clusters, whereas a less stringent one may result in only 4 main clusters.

This example is typical for what can be expected from correspondence analysis when applied to a data table: a graphical representation allowing the discovery of the main variation trends between its observations and/or variables.

Organisms versus codon usage

Similarly, correspondence analysis was applied to the table including the completely sequenced organisms, each defined by its codon usage values. The resulting distribution of organisms and codons is shown in Figure 2a. Organisms are clearly distributed according to their G+C content, which increases from left to right.

The classification of the organisms according to their neighbourhood in the factorial space is shown in Figure 2b. This tree shows two main clusters comprising the following seven homogeneous subclusters:

  1. (MT);
  2. (TP, EC, BS);
  3. (MTH, AF, AE, PH);
  4. (MJ, BB);
  5. (Ssp., MP, HP);
  6. (MG, HI);
  7. (CT, CE, Y).

As noted for the previous tree, a more or less stringent similarity criterion between organisms leads to more or fewer subclusters.

The classifications of the organisms shown by the two previous trees differ markedly. Only Y and CE cluster together both for their amino-acid composition and for their codon composition. AE, PH, AF and MTH also cluster together in both cases, but with a lesser degree of homogeneity.

Organisms' ORF products versus their amino-acid or codon composition

We have systematically applied this multidimensional method in genome analysis to detect salient relationships between the predicted ORF products of a given organism according to their amino-acid or codon contents. For each completely sequenced organism, two data tables were constructed. The first has lines (i) representing predicted ORF products and columns (j) representing amino-acids. The value Kij equals the number of occurrences of amino-acid j in predicted ORF product i. The second table was constructed with lines representing ORFs and columns representing codons.

For example, the table of Mycoplasma genitalium predicted ORF products versus their amino-acid composition is shown in Table 1, and the table of its ORFs versus their codon composition is shown in Table 2. Similar tables were constructed for each completely sequenced organism. These tables were analysed using correspondence analysis, and the results are shown in the following figures, which represent, for each organism, the plane defined by the first and second factorial axes F1 and F2. On these figures, predicted ORFs are represented by points, and amino-acids and codons are written with their usual abbreviations.

  1. Yeast Saccharomyces cerevisiae predicted ORFs versus amino-acid composition (Figure 3a) and codon composition (Figure 3b).

  2. C. elegans predicted ORFs versus amino-acid composition (Figure 4a) and codon composition (Figure 4b).

  3. M. jannaschii predicted ORFs versus amino-acid composition (Figure 5a) and codon composition (Figure 5b).

  4. M. thermoautotrophicum predicted ORFs versus amino-acid composition (Figure 6a) and codon composition (Figure 6b).

  5. A. fulgidus predicted ORFs versus amino-acid composition (Figure 7a) and codon composition (Figure 7b).

  6. P. horikoshii predicted ORFs versus amino-acid composition (Figure 8a) and codon composition (Figure 8b).

  7. P. abyssi predicted ORFs versus amino-acid composition.

  8. H. influenzae predicted ORFs versus amino-acid composition (Figure 9a) and codon composition (Figure 9b).

  9. M. genitalium predicted ORFs versus amino-acid composition (Figure 10a) and codon composition (Figure 10b).

  10. M. pneumoniae predicted ORFs versus amino-acid composition (Figure 11a) and codon composition (Figure 11b).

  11. Synechocystis sp. predicted ORFs versus amino-acid composition (Figure 12a) and codon composition (Figure 12b).

  12. E. coli predicted ORFs versus amino-acid composition (Figure 13a) and codon composition (Figure 13b).

  13. B. subtilis predicted ORFs versus amino-acid composition (Figure 14a) and codon composition (Figure 14b).

  14. H. pylori predicted ORFs versus amino-acid composition (Figure 15a) and codon composition (Figure 15b).

  15. B. burgdorferi predicted ORFs versus amino-acid composition (Figure 16a) and codon composition (Figure 16b).

  16. A. aeolicus predicted ORFs versus amino-acid composition (Figure 17a) and codon composition (Figure 17b).

  17. M. tuberculosis predicted ORFs versus amino-acid composition (Figure 18a) and codon composition (Figure 18b).

  18. T. pallidum predicted ORFs versus amino-acid composition (Figure 19a) and codon composition (Figure 19b).

  19. C. trachomatis predicted ORFs versus amino-acid composition (Figure 20a) and codon composition (Figure 20b).

  20. T. maritima predicted ORFs versus amino-acid composition or codon composition.

Discussion

Each of these figures shows the distribution of the predicted ORFs of a given organism according to their amino-acid or codon composition. Points represent predicted ORFs, whereas amino-acids and codons are represented by their usual abbreviations. Points situated near the origin of the axes represent ORF products with roughly average values for all the amino-acids or codons considered. Such ORFs cannot be discriminated by their amino-acid or codon contents. The existence of homogeneous clusters of points distant from the origin of the axes is indicative of their specificity. In such clusters, neighbouring points represent ORFs exhibiting similar preferences for the amino-acid(s) or codon(s) situated in their neighbourhood.

Among salient clusters of predicted ORF products according to their amino-acid composition:

The distribution of yeast predicted ORF products shows sets of ORFs that have preferences for a) Cysteine (C), Tryptophan (W) and Phenylalanine (F), b) Asparagine (N), Glutamine (Q), Lysine (K) and Glutamate (E) and c) Serine (S) and Threonine (T).

M. thermoautotrophicum and A. fulgidus show clusters of ORF products specific to Cysteine (C) and others specific to Tryptophan (W) and Phenylalanine (F).

H. influenzae, E. coli, B. subtilis and A. aeolicus ORF products are distributed in 2 main clusters, one of which is defined by Tryptophan (W) and Phenylalanine (F). H. pylori also shows a cluster with W and F preferences. C. trachomatis shows a cluster with W, F and C preferences.

In C. elegans, the left part of the first axis corresponds essentially to the Phenylalanine (F) rich ORF products, whereas the upper right part corresponds to ORF products rich in Glutamic acid (E), Lysine (K), Arginine (R) and Aspartic acid (D), and the lower right part to ORF products rich in Glycine (G) and Proline (P).

The P. horikoshii distribution shows a large set of Serine (S) rich predicted ORF products situated at the left side of the first factorial axis F1, and a scattered set of Cysteine (C) rich predicted ORF products along the second factorial axis F2.

In M. tuberculosis, two subsets of ORF products can be seen, corresponding to the PE Glycine (G) rich and the PPE Asparagine (N) rich ORF product families.

Among salient clusters of predicted ORF products according to their codon composition:

Y: 2 clusters can be observed. One is mainly defined by {GCT, GGT, GCC and GTC}, the other by {TGC, CTC, TGT and TGG}.

CE: 2 clusters can be observed. One is mainly defined by {GGA, CCA}, the other by {GCC}.

MJ: No clear clustering is observed; nevertheless some ORFs have preferences for {GCG, TCA, ACG, CTC, CTG, TCC}.

MTH: No clear clustering can be observed; nevertheless some ORFs have preferences for {CAA, TTA, TTG, CTA, ACT, TTT, TAT, AAR}.

AF: No clear clustering can be observed; nevertheless some ORFs have preferences for {CAA, TTA, CGT, CGA, CGG, AAT, AGT}, others for {GCC, CTC, TTC, CTG, GCA, TCC, TGG}.

PH: 2 clusters can be observed. One is mainly defined by {TCC, TCT, TCG, TCA}, the other by {CGA, TGC, CGT, CGC, CGG}.

HI: No clear clustering is observed.

MG: No clear clustering; nevertheless many ORFs have preferences for some codons, for example {CGG, ACC, CAC, CAG, AGG,...}.

MP: No clear clustering, but some ORFs have clear preferences for a) {TGG, TCG, CCC, ACG, CCG, GCG, TCC} and b) {AGA, TCT, ATA, CTT, ACT, AAT, AAA}.

Ssp.: One cluster defined by {AGA, ATA, TCA}.

EC: Two clusters can be defined: one by {AGA, ATA, AGG}, the other by {GAC, GGT, CGT, ATC}, with {TCT, AGT, GCT, AAA, GTT} belonging to both clusters.

BS: Two clusters defined by a) {TTT, TGG, TTG, CCC, TCG, ACC, CTC} and b) {ACT, AAT, AGT, CTA, TTA}.

HP: No clear clustering, but some ORFs have preferences for some codons, for example {TCG, GCG, CCG, GTG, ACA, AAC, ....}.

BB: The only organism presenting a distribution of ORFs into 2 distinct clusters. One is mainly defined by {ATA, TGC, CTG, ACA, GAC, TAC, ATC, CTC, CTA,..}, the other mainly by {TGG, TAT, CTT, CCT,...}.

AE: 2 clusters can be observed. One is mainly defined by {TTC, TGG, TCC, TTT, TCT} and the other by {CGA, AAT, TAT, TTA}.

MT: One cluster mainly defined by {GGC, AAC, GGT}.

TP: No clear clustering can be observed, but some ORFs have preferences.

CT: 2 clusters can be observed. One is mainly defined by {AAG, GAG, TAG, CGG, AGC, TTG, GTG, GTT, GGG,...} and the other by {ACC, TCC, CTC, CCC, CTA, CAC, CGC, ATC,...}.

As already noted in the global analysis, the predicted ORFs of every organism are distributed differently according to their amino-acid composition and according to their codon composition.

Conclusion

Correspondence analysis proved to be a useful method for analysing whole organisms, or the ORF products of an organism, according to their amino-acid or codon compositions. Planar graphical representations of ORF products, their amino-acids and their codons allow salient relationships to be detected easily.


* Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.

* Benzecri, J-P. (1973). L'Analyse des Donnees. Vol 2: L'Analyse des Correspondances. Paris: Dunod.

+ References to these organisms are shown in the TIGR Microbial DataBase.

McInerney, J. O. (1997) Microb. Compar. Genomics 2, 1-10.

