| Journal of Molecular Evolution |
| © Springer-Verlag 2004 |
| 10.1007/s00239-004-2591-1 |
| (1) | Department of Zoology, University of Hong Kong, Pokfulam, Hong Kong SAR, China |
| (2) | Unite GGB, URA 2171, Institut Pasteur, 28 rue Dr. Roux, 75015 Paris, France |
| (3) | HKU-Pasteur Research Centre, 8 Sassoon Road, PokFulam Hong Kong SAR, China |
| Antoine Danchin Email: antoine.danchin.64[at]normalesup[dot]org |
Received: 14 June 2003 Accepted: 10 December 2003
Keywords: CpG deficiency - C5-specific methylation - C5 methyltransferase - Recognition sites - Bacterial genomes - GC content
CpG deficiency was first observed in vertebrates (Josse et al. 1961; Swartz et al. 1962), then in some species of archaea, bacteria, and fungi, as well as in mitochondria belonging to many organisms (Cardon et al. 1994; Karlin et al. 1998). CpG dinucleotides play an important role in cell differentiation and in the regulation of gene expression in vertebrates (Bestor 1990). CpG deficiency can also influence codon usage bias (De Amicis and Marchetti 2000) and the relative abundance of oligonucleotides, thereby indirectly affecting a variety of cell functions. This triggered many studies aiming at understanding genome base composition biases (Karlin et al. 1998). Several hypotheses have been put forward to explain CpG deficiency, including counter-selection at the translation level (Subak-Sharpe et al. 1966), DNA methylation (Bird 1980), DNA structural constraints (Antri et al. 1993), DNA–protein interaction, and stressful environments (Karlin et al. 1994b). Among them, DNA methylation is the most popular hypothesis.
Cytosine deamination is a major cause of mutation in living organisms, especially in open DNA structures (for recent references and discussion see Lobry and Sueoka 2002). It is, however, readily repaired, since deamination leads to uracil, which is subject to proofreading in DNA. It is widely documented that methylated cytosine is even more prone to spontaneous deamination, which induces transition mutations to the natural base thymine (Coulondre et al. 1978). Such mutations are hard to repair (Coulondre et al. 1978). Since methylated cytosines were predominantly found within CpG dinucleotides in vertebrates, CpG deficiency was naturally linked to CpG methylation (Bird 1980). The presence of highly methylated CpG dinucleotides in both male and female germ cells provided strong evidence for the relationship between DNA methylation and CpG deficiency in the human genome (El-Maarri et al. 1998). However, cytosine methylation may not be the ultimate or only explanation for CpG deficiency. For example, CpG deficiency in most mitochondrial genomes is unlikely to be related to DNA methylation, because no DNA methylase has yet been discovered in these organelles. One of the few reports on methylation in mitochondria identified an RNA methylation by a nucleus-encoded RNA adenine methyltransferase (McCulloch et al. 2002). CpG deficiency was also found in many bacterial species and their phages (Karlin et al. 1994a, 1997), where cytosine methylation is not widespread (see below).
This prompted us to revisit the association between DNA methylation and CpG deficiency in bacterial genomes. In bacteria, DNA methylation is generally associated with restriction-modification systems (RM systems) (Wilson 1988). These elements may prevent the invasion of the cell by bacteriophages. So far, more than 2000 different RM systems have been identified and over 700 methyltransferases are known to recognize at least 300 different DNA sites (http://www.neb.com/rebase) (Roberts and Macelis 2001). Three kinds of DNA methylation systems were found in bacteria: N6-adenine methylation, N4-cytosine methylation, and C5-cytosine methylation (Bestor 1990). In this report, we focus our attention on C5 cytosine-specific methylation, the same DNA methylation process that is assumed to induce CpG deficiency in eukaryotes. Because the functions and recognition sites of DNA methylation are far more varied in bacteria than in vertebrates, DNA methylation is unlikely to share a common role in all bacterial genomes. This was previously suggested in a study on the Mycoplasma genitalium genome in which CpG deficiency was suspected to be unrelated to DNA methylation (Goto et al. 2000). The suspicion was based on the finding that the high substitution rate from C to T was not specific to CpG and TpG dinucleotides and the fact that there was no reported methylation activity in mycoplasmas (Goto et al. 2000). In the present study, we further document that deamination of methylated cytosine is probably not the reason for the CpG deficiency in bacteria.
First, the fully sequenced bacterial genomes were surveyed, after being retrieved from the NCBI (http://www.ncbi.nlm.nih.gov). We searched for potential C5 methyltransferase genes using the annotation files. When such a cytosine methyltransferase was identified, the bacterial identification was used to search for the corresponding enzyme in the REBASE database (http://rebase.neb.com) (Roberts and Macelis 2001). Almost all the cytosine methyltransferases were C5 methyltransferases, and the one case of N4 cytosine methylation was discarded. When more than one C5 methyltransferase was found in a genome, only the one including a CpG dinucleotide at the restriction site was included.
Cytosine-specific methyltransferase genes are labeled as "putative" in some bacteria. This makes in-depth analysis difficult because the biochemical properties of their products are not substantiated in REBASE. Therefore, the latter approach is only feasible for well-studied bacteria in which the presence of cytosine methylation has been studied. As a complement to explicit identification, we used the BLASTP tool provided by REBASE to ascertain that a CDS putatively coding for a C5 methyltransferase is highly similar to a known C5 methyltransferase CDS.
Second, utilizing REBASE, we also identified C5 methyltransferases in the unfinished genomes of several bacterial species. When such a gene was found, we collected the available DNA sequences from NCBI, extending our study to the corresponding organisms. By exploring REBASE in addition to two other protein databases, Pfam at the Sanger Centre (http://www.sanger.ac.uk/Software/Pfam/) and TIGRFAMs at TIGR (http://www.tigr.org/TIGRFAMs/), we collected DNA sequences from all the bacteria that are likely to express C5 methyltransferases. Finally, only bacterial species for which more than 20 nonredundant sequences (excluding ribosomal DNA) could be retrieved from GenBank were included in the analysis.
To measure the frequency of dinucleotides in a long genomic sequence, the value of relative abundance was calculated by computing the relevant odds ratio (Burge et al. 1992). In the case of the CpG dinucleotide, the formula is $\rho_{\mathrm{CpG}} = F_{\mathrm{CpG}} / (F_{\mathrm{C}} F_{\mathrm{G}})$, where $\rho_{\mathrm{CpG}}$ denotes the relative abundance of CpG, $F_{\mathrm{CpG}}$ denotes the frequency of the CpG dinucleotide, and $F_{\mathrm{C}}$ and $F_{\mathrm{G}}$ denote the frequencies of C and G. If $\rho_{\mathrm{CpG}}$ falls between 0.81 and 1.20, the CpG dinucleotide is considered to be at a normal level. If it is lower than 0.81, the CpG relative abundance is classified as being deficient. Deficient values can be further classified as follows: 0.78–0.81 is marginally low, 0.70–0.78 is significantly low, 0.50–0.70 is very low, and $\leq 0.50$ is extremely low (Burge et al. 1992). In this study, the bacteria with CpG relative abundances lower than 0.78 were considered to be CpG deficient.
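The odds ratio and the classification thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, frequencies are taken from a single strand exactly as the formula is written (Burge et al. 1992 may use a strand-symmetrized variant), and the handling of values that fall exactly on a boundary or above 1.20 is an assumption.

```python
from collections import Counter


def cpg_relative_abundance(seq: str) -> float:
    """Return F_CpG / (F_C * F_G) for one strand of a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    counts = Counter(seq)
    f_c = counts["C"] / n
    f_g = counts["G"] / n
    if f_c == 0 or f_g == 0:
        raise ValueError("sequence contains no C or no G")
    # Count overlapping dinucleotide positions where C is followed by G.
    cpg = sum(1 for i in range(n - 1) if seq[i] == "C" and seq[i + 1] == "G")
    f_cpg = cpg / (n - 1)
    return f_cpg / (f_c * f_g)


def classify(rho: float) -> str:
    """Map a relative abundance to the categories quoted in the text.

    Boundary handling and the label for values above 1.20 are assumptions.
    """
    if rho <= 0.50:
        return "extremely low"
    if rho < 0.70:
        return "very low"
    if rho < 0.78:
        return "significantly low"
    if rho < 0.81:
        return "marginally low"
    if rho <= 1.20:
        return "normal"
    return "above normal range"
```

For example, `classify(0.45)` returns `"extremely low"`, which matches the (- - -) label given to Clostridium acetobutylicum in the tables.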
Bacterial CDSs are generally short, so the variance of CpG relative abundance among CDSs with the same GC content is very large. In low-GC CDSs especially, the values deviate strongly from the trend line when they are plotted against GC content, and this deviation could mask the trend in CpG relative abundance. Since the CpG relative abundance calculated for longer sequences deviates less from the actual value than that calculated for shorter sequences (i.e., the magnitude of the deviation decreases as CDS length increases), we first sorted all the CDSs by GC content. We then concatenated every 40 CDSs (every 20 CDSs for some small bacterial genomes, such as the C. trachomatis genome) to generate long coding sequences for this study. The third position of a codon is under less selective pressure because of the redundancy of the genetic code; we therefore chose C3pG1 (C in the third position of a codon; G in the first position of the following codon) to study the mutation pattern of CpG dinucleotides. The relative abundances of C3pG1 and T3pG1 in each sequence were calculated and then plotted against the GC content of the CDS.
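The sort-and-concatenate step described above might look like the following Python sketch. The function names are hypothetical, and two details are assumptions not stated in the text: ties in GC content keep their input order, and a final group with fewer than `group_size` CDSs is kept rather than discarded.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def concatenate_by_gc(cds_list, group_size=40):
    """Sort CDSs by GC content and join each consecutive run of `group_size`.

    Returns a list of (gc_content, concatenated_sequence) tuples.
    Assumes every CDS length is a multiple of 3, so joining keeps the frame.
    """
    ordered = sorted(cds_list, key=gc_content)
    groups = []
    for start in range(0, len(ordered), group_size):
        joined = "".join(ordered[start:start + group_size])
        groups.append((gc_content(joined), joined))
    return groups
```

Passing `group_size=20` corresponds to the setting the text mentions for small genomes.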
Table 1

| Bacteria | GC content (%) | CpG relative abundance | C5 methyltransferase | Recognition site |
|---|---|---|---|---|
| Acetobacter pasteurianus* | 55.4 | 0.93 | M. ApaLI | GTGCAC |
| Anabaena variabilis* | 42.4 | 0.83 | M. AvaIX | RᵐCCGGY |
| Bacillus brevis* | 44.1 | 1.06 | M. BbVI | GᵐCAGC |
| Bacillus cereus* | 36.8 | 0.95 | M. HaeIII | GGᵐCC |
| Bacillus firmus* | 41.1 | 0.86 | M. BfiIB | AᵐCTGGG |
| Bacillus halodurans (NC_002570) | 51.7 | 1.31 | M. BhaII | GGCC |
| Bacillus pumilus* | 40.2 | 0.86 | M. Bpu10IA | CCTNAGC |
| Bacillus sphaericus* | 35.9 | 0.89 | M. BspRI | GGᵐCC |
| Bacillus subtilis (AL009126) | 43.5 | 1.04 | M. BsuFI | ᵐCCGG |
| Citrobacter freundii* | 59.3 | 1.14 | M. Cfr10I | RᵐCCGGY |
| Clostridium acetobutylicum (NC_003030) | 30.9 | 0.45 (- - -) | M. Cac824I | GCNGC |
| Corynebacterium glutamicum (NC_003450) | 53.7 | 0.97 | M. CglI | GCSGC |
| Enterobacter aerogenes* | 53.6 | 1.19 | M. EaeI | YGGᵐCCR |
| Enterobacter cloaceae* | 54.6 | 1.1 | M. Ecl18kI | CᵐCNGG |
| Escherichia coli K12 (U00096) | 50.8 | 1.16 | M. EcoKDcm | CᵐCWGG |
| Escherichia coli O157:H7 Sakai (NC_002695) | 50.4 | 1.12 | M. EcoKO157DcmP | CCWGG |
| Lactococcus lactis subsp. Cremoris | 35.5 | 0.83 | M. ScrFIA | CCNGG |
| Neisseria gonorrhoeae* | 52.6 | 1.33 | M. NgoPII | GGᵐCC |
| Neisseria lactamica* | 49.7 | 1.33 | M. NlaIV | GGNNCC |
| Neisseria meningitidis (NC_003116) | 51.7 | 1.31 | M. NmeAORF191P | CCWGG |
| Neisseria meningitidis MC58 (NC_003112) | 51.4 | 1.31 | M. NmeBIA | GGNNCC |
| Nostoc sp. PCC 7120 (NC_003272) | 41.3 | 0.79 | M. AvaIX | RᵐCCGGY |
| Salmonella enteritidis* | 48.8 | 1.07 | M. SenPI | CᵐCNGG |
| Salmonella typhi CT18 (NC_003198) | 52.0 | 1.24 | M. StyCDcmP | CCWGG |
| Salmonella typhimurium (NC_003197) | 52.2 | 1.24 | M. StyLT2DcmP | CCWGG |
| Shigella sonnei* | 43.1 | 1.02 | M. SsoII | CᵐCNGG |
| Yersinia pestis (NC_004088) | 47.6 | 0.99 | M. YpeORF391P | CCWGG |

Note: A CpG relative abundance followed by (-), (- -), or (- - -) is significantly low, very low, or extremely low, respectively. Methylated cytosines are preceded by a superscript m. W denotes A or T; S denotes G or C; N denotes any nucleotide.

Table 2

| Bacteria | GC content (%) | CpG relative abundance | C5 methyltransferase | Recognition site |
|---|---|---|---|---|
| Bacillus stearothermophilus* | 47.4 | 1.33 | M. BsrFI | RCCGGY |
| Caulobacter crescentus CB15 (NC_002696) | 67.1 | 1.16 | M. CcrMORF1033P | Unknown |
| Haemophilus influenzae (NC_000907) | 38.1 | 1.09 | M. HindV | GRCGYC |
| Herpetosiphon giganteus* | 41.8 | 1.01 | M. HgiGI | GRCGYC |
| Listeria monocytogenes EGD (NC_003210) | 38.0 | 1.11 | M. LmoEORF2316P | Unknown |
| Nostoc punctiforme* | 41.6 | 0.84 | M. NpuORFC230P | RCCGGY |
| Ralstonia solanacearum* | 66.6 | 1.2 | M. RsoORF3438P | Unknown |
| Sinorhizobium meliloti (AE006469) | 62.6 | 1.29 | M. SmeORF3763P | Unknown |
| Streptococcus pneumoniae (AE000514) | 39.6 | 0.69 (- -) | M. SpnORF1336P | Unknown |
| Streptococcus pyogenes M1 (AE009949) | 38.5 | 0.71 (-) | M. SpyORF1077P | Unknown |
| Ureaplasma urealyticum (AF222894) | 25.4 | 0.88 | M. UurORF528P | Unknown |
| Vibrio cholerae (AE003852) | 47.5 | 1.04 | M. VchAORF198P | Unknown |
| Xylella fastidiosa (NC_002488) | 52.6 | 1.01 | M. XfaORF1774P | Unknown |

Table 3

| Bacteria | GC content (%) | CpG relative abundance | C5 methyltransferase | Recognition site |
|---|---|---|---|---|
| Escherichia coli O157:H7 EDL933 (AE005174) | 50.2 | 1.12 | M. EcoO157ORF2389P | CGATCG |
| Haemophilus parainfluenzae* | 39.2 | 1.06 | M. HpaII | CᵐCGG |
| Helicobacter pylori 26695 (AE000511) | 38.8 | 0.93 | M. HpyAVIII | GᵐCGC |
| Helicobacter pylori J99 (AE001439) | 39 | 0.94 | M. Hpy99XI | AᵐCGT |
| Mycoplasma pulmonis (NC_002771) | 26.6 | 0.28 (- - -) | M. MpuCORF430P | ᵐCG |
| Synechocystis sp. PCC 6803 (AB001339) | 47.6 | 0.75 (-) | M. Ssp6803I | CGATCG |
| Xanthomonas oryzae* | 62.4 | 1.13 | M. XorII | CGATCG |

Table 4

| Bacteria | GC content (%) | CpG relative abundance |
|---|---|---|
| Brucella melitensis (AE008917) | 57 | 1.20 |
| Buchnera aphidicola AP (NC_002528) | 26.2 | 0.87 |
| Campylobacter jejuni (NC_002163) | 30.5 | 0.62 (- -) |
| Chlamydia muridarum (AE002160) | 40.3 | 0.75 (-) |
| Chlamydia trachomatis (AE001273) | 41.2 | 0.79 |
| Chlamydophila pneumoniae (AE002161) | 40.5 | 0.73 (-) |
| Clostridium perfringens (BA000016) | 29.4 | 0.21 (- - -) |
| Fusobacterium nucleatum (AE009951) | 27 | 0.16 (- - -) |
| Lactococcus lactis IL1403 (AE005176) | 35.3 | 0.77 (-) |
| Listeria innocua (NC_003212) | 37.3 | 1.11 |
| Mycoplasma genitalium (NC_000908) | 31.6 | 0.39 (- - -) |
| Mycobacterium leprae (NC_002677) | 57.7 | 1.12 |
| Mycoplasma pneumoniae (NC_000912) | 39.9 | 0.82 |
| Mycobacterium tuberculosis (NC_002755) | 65.5 | 1.18 |
| Pasteurella multocida (AE004439) | 40.3 | 1.07 |
| Rickettsia conorii (NC_003103) | 32.4 | 1.03 |
| Rickettsia prowazekii (NC_000963) | 28.9 | 0.77 (-) |
| Ralstonia solanacearum (AL646052) | 66.8 | 1.19 |
| Staphylococcus aureus Mu50 (BA000017) | 32.7 | 0.94 |
| Treponema pallidum (AE000520) | 52.7 | 1.08 |

Note: A CpG relative abundance at a level of significantly low, very low, or extremely low is labeled with (-), (- -), or (- - -), respectively.

RM systems in free-living bacteria are often horizontally transferred by means of linkage with mobility-related elements such as phages and plasmids (Kobayashi 2001 and references therein). RM systems act like an infectious agent, rendering the bacteria dependent on the functioning of the methylase to avoid chromosome degradation by the nuclease. These bacteria thus experience selective pressure to avoid restriction sites (Rocha et al. 2001). Since most of the underrepresented sites are not recognition sites for the known RM systems of a given bacterium, the avoidance of these sites indicates the impact of RM systems in the bacteria's evolutionary history (Rocha et al. 2001). Therefore the current status of DNA methylation does not allow us to investigate the avoidance of sites that may have been methylated in the past by RM systems that have since been lost. Because free-living bacteria often come into contact with other bacteria living in the surrounding environment, they can easily obtain a new RM system through horizontal transfer. Obligatory intracellular parasites and symbionts cannot do so because of their isolated living environment. Such bacteria are currently devoid of such systems, and are generally thought to lack horizontal transfer. Thus, one may suppose that they have not been in contact with such systems for a long period of their recent evolution. We therefore made a comparative analysis of obligatory intracellular bacteria with the free-living bacteria holding at least one RM system. We observed that only two free-living bacterial species, Streptococcus pneumoniae and Streptococcus pyogenes, are CpG deficient. In contrast, 6 of 12 intracellular pathogens or symbionts show CpG deficiency. Thus, CpG dinucleotides are more significantly depleted in intracellular pathogens or symbionts than in proteobacteria ($\chi^2$ test, p < 0.01). This is the opposite of what was expected under the cytosine deamination theory via the spread of RM systems.
Among the 34 recognition sites identified in bacterial genomes (Tables 1 and 3), only seven methylated CpG dinucleotides were found within the recognition sites. Therefore, cytosine methylation in bacteria is not generally associated with CpG dinucleotide methylation.
Surprisingly, we find CpG deficiency in eight bacterial species (Campylobacter jejuni, Chlamydia muridarum, Chlamydophila pneumoniae, Clostridium perfringens, Fusobacterium nucleatum, Lactococcus lactis IL1403, Mycoplasma genitalium, and Rickettsia prowazekii) that are devoid of C5 methyltransferase (Table 4), in contrast to only five species (Clostridium acetobutylicum, Mycoplasma pulmonis, S. pneumoniae, S. pyogenes, and Synechocystis sp. 6803) that contain C5 methyltransferase and are significantly CpG deficient (Tables 1, 2, and 3). This suggests that CpG dinucleotide deficiency is more frequent in bacteria lacking cytosine methylation ($\chi^2$ test, p < 0.01). We cannot exclude, however, that this is due to a genome sampling effect, since genome programs did not select the bacteria of interest in a random way.
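As a rough check of this comparison, the Python sketch below runs a 2×2 chi-square test of independence. The counts are an assumption about how the species were tallied: 8 CpG-deficient species out of the 20 rows of Table 4 (no C5 methyltransferase) against 5 out of the 47 rows of Tables 1, 2, and 3 combined. The function name is hypothetical, no continuity correction is applied, and the text does not say whether the authors applied one.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: without / with C5 methyltransferase; columns: deficient / not deficient.
chi2, p = chi_square_2x2(8, 12, 5, 42)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under these assumed counts the statistic is about 7.73 and p is roughly 0.005, in line with the reported p < 0.01.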
Finally, a t-test shows that the CpG relative abundances in bacteria containing RM systems methylating CpG dinucleotides (Table 3) are not significantly lower than those of other bacteria (Tables 1 and 4; p > 0.1), indicating that the presence of methylated CpG dinucleotides in recognition sites does not give rise to CpG deficiency.
The above analyses do not support the idea that cytosine methylation is responsible for CpG deficiency. Therefore, we have performed a set of analyses to further explore potential reasons behind CpG deficiency in bacteria.
The regression of TpG on CpG relative abundance results in a nearly horizontal line ($R^2 = 0.0002$, slope = –0.005, p < 0.001), indicating that the change in CpG relative abundance is not correlated with that of TpG relative abundance. In sharp contrast, a negative correlation of the two values was found in the human genome (addressed below). The regression of CpA on CpG (Figure 1B) also results in a nearly horizontal line ($R^2 = 0.006$, slope = 0.023, p < 0.001). These findings indicate that CpG variation is not significantly negatively correlated with TpG or CpA abundances. As such, it seems unlikely that CpG variation in bacteria can be attributed to different rates of methylated cytosine deamination.
| Bacteria | CpG | TpG | ApG | GpG | CpA | CpT | CpC |
|---|---|---|---|---|---|---|---|
| C. acetobutylicum | 0.45 | 1.02 | 1.13 | 1.22 | 1.02 | 1.13 | 1.21 |
| C. jejuni | 0.62 | 1.03 | 1.09 | 1.11 | 1.03 | 1.09 | 1.1 |
| C. muridarum | 0.75 | 0.96 | 1.14 | 1.09 | 0.97 | 1.15 | 1.07 |
| C. perfringens | 0.21 | 0.94 | 1.24 | 1.34 | 0.93 | 1.23 | 1.29 |
| C. pneumoniae | 0.73 | 0.96 | 1.19 | 1.05 | 0.96 | 1.18 | 1.06 |
| F. nucleatum | 0.16 | 1.05 | 1.18 | 1.27 | 1.03 | 1.17 | 1.27 |
| L. lactis IL1403 | 0.77 | 1.13 | 0.96 | 1.05 | 1.13 | 0.97 | 1.05 |
| M. genitalium | 0.39 | 1.17 | 1.06 | 1.12 | 1.15 | 1.06 | 1.15 |
| M. pulmonis | 0.28 | 1.12 | 1.12 | 1.04 | 1.11 | 1.13 | 1.07 |
| R. prowazekii | 0.77 | 1.02 | 1.06 | 1.03 | 1.03 | 1.06 | 1.03 |
| S. pneumoniae | 0.69 | 1.11 | 1.07 | 1.03 | 1.09 | 1.09 | 1.03 |
| S. pyogenes | 0.71 | 1.12 | 1.04 | 1.03 | 1.11 | 1.05 | 1.04 |
| Synechocystis sp. 6803 | 0.75 | 1.05 | 0.85 | 1.36 | 1.05 | 0.85 | 1.36 |

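A regression of one relative abundance on another can be sketched in Python as an ordinary least-squares fit. This is an illustration only: the function name is hypothetical, and the data are the species-level CpG and TpG columns of the table above, which are not necessarily the data points behind the regression statistics reported in the text, so the fitted values need not match them.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# CpG and TpG relative abundances, row by row from the table above.
cpg = [0.45, 0.62, 0.75, 0.21, 0.73, 0.16, 0.77, 0.39, 0.28, 0.77, 0.69, 0.71, 0.75]
tpg = [1.02, 1.03, 0.96, 0.94, 0.96, 1.05, 1.13, 1.17, 1.12, 1.02, 1.11, 1.12, 1.05]
slope, intercept, r2 = linear_regression(cpg, tpg)
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

A strongly negative slope with a high R² would be the signature expected if CpG were being converted to TpG; a flat line argues against it.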
It has been pointed out that the negative correlation between CpG and TpG at different GC contents is an artifact ascribed to deamination of methylated cytosine in the human genome (Duret and Galtier 2000). In order to further test the hypothetical relationship between cytosine methylation and CpG deficiency in bacteria, we analyzed the covariation among the dinucleotides CpG, TpG, and CpA at different GC contents.
In the bacteria studied here, CpG relative abundance is found to be higher in the DNA sequences with a high GC content. No bacterial species showing overall CpG deficiency has more than a 50% GC content (Tables 1, 2, 3, 4). We then analyzed the correlation between CpG relative abundance and GC content at the intragenome level. The GC content within a genome is not uniform, so we might expect CpG relative abundances in different genomic regions to correlate with the GC content. Because a bacterial genome is largely composed of CDSs, the effect of codon usage bias on CpG dinucleotide must not be ignored. For example, a study in plants showed that the negative correlation between C3pG1 and T3pG1 relative abundances was significant (De Amicis and Marchetti 2000). This was considered to be a consequence of heavy DNA methylation in plants. Therefore, we compared the relative abundance of the neutral dinucleotide sites, C3pG1 and T3pG1, in a CDS.
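One way to compute a codon-boundary relative abundance such as C3pG1 or T3pG1 is sketched below in Python. The definition used here is an assumption, since the text does not spell it out: the frequency of the base pair spanning a codon boundary is divided by the product of the position-specific base frequencies (third-position frequency of the first base, first-position frequency of the second). The function name is hypothetical.

```python
def codon_boundary_abundance(cds: str, third: str, first: str = "G") -> float:
    """Relative abundance of `third` at codon position 3 followed by `first`
    at position 1 of the next codon (third="C" gives C3pG1, "T" gives T3pG1)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if len(codons) < 2:
        raise ValueError("need at least two codons")
    # Count codon boundaries where the pair of interest occurs.
    hits = sum(
        1 for a, b in zip(codons, codons[1:]) if a[2] == third and b[0] == first
    )
    f_pair = hits / (len(codons) - 1)
    f_third = sum(1 for c in codons if c[2] == third) / len(codons)
    f_first = sum(1 for c in codons if c[0] == first) / len(codons)
    if f_third == 0 or f_first == 0:
        raise ValueError("base absent at the required codon position")
    return f_pair / (f_third * f_first)
```

Applied to each concatenated coding sequence, the two values `codon_boundary_abundance(seq, "C")` and `codon_boundary_abundance(seq, "T")` can then be plotted against GC content.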
| Organism | C3pG1 | T3pG1 |
|---|---|---|
| C. acetobutylicum | 0.97 | 3.57 |
| C. jejuni | 2.79 | 2.82 |
| C. muridarum | 1.79 | 3.37 |
| C. perfringens | 4.70 | 4.87 |
| C. pneumoniae | 1.02 | 1.46 |
| F. nucleatum | 0.69 | 5.76 |
| L. lactis IL1403 | 3.12 | 5.66 |
| M. genitalium | –0.96 | 0.71 |
| M. pulmonis | 3.27 | 1.62 |
| R. prowazekii | 0.53 | 3.82 |
| S. pneumoniae | 0.60 | 2.11 |
| S. pyogenes | 0.45 | 3.25 |
| Synechocystis sp. 6803 | 0.06 | –1.67 |

In vertebrates, it is widely accepted that CpG deficiency is a consequence of CpG methylation (Bird 1980; Jeltsch 2002). The DNA methylation pattern on CpG dinucleotides is largely maintained by DNA methyltransferase 1 (Dnmt1) (Lyko et al. 1999). Some essential differences in the properties of DNA methyltransferases in vertebrates and bacteria may explain the observed differences in CpG deficiency. First, bacteria vary widely in both the content and the size of their C5 methyltransferase recognition sites. Most of the recognition sites do not contain a methylated CpG dinucleotide, suggesting that cytosine methylation is not a determinant of CpG deficiency in bacteria. Although some RM systems have a methylated CpG dinucleotide, the large size of these recognition sites means that most CpG dinucleotides are not methylated, because such sites occur rarely in the genome (i.e., CpG methylation mediated by a single methyltransferase at a rare site such as CGATCG is too weak to induce CpG deficiency).
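A back-of-the-envelope estimate illustrates how rare such a site is. It assumes a random sequence with equal base frequencies, an idealization that is not part of the original analysis and that real genomes do not satisfy:

$$
P(\mathrm{CGATCG}) = 4^{-6} = \frac{1}{4096}, \qquad P(\mathrm{CpG}) = 4^{-2} = \frac{1}{16}
$$

Each CGATCG site contains two CpG dinucleotides, so under this assumption the fraction of all CpG dinucleotides that lie inside such a site is at most

$$
\frac{2 \times 4^{-6}}{4^{-2}} = \frac{2}{256} \approx 0.8\%.
$$

In other words, even if every CGATCG site were methylated, more than 99% of CpG dinucleotides would remain unmethylated in this idealized sequence.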
Second, DNA methylation in bacteria is a kind of de novo methylation (Bestor 1990). This is different from that in vertebrates, because Dnmt1 can only function on hemimethylated DNA (Lyko et al. 1999). De novo methylation mediated by Dnmt3a and Dnmt3b does occur in vertebrates, but it is restricted to the very early embryonic stage (Ramsahoye et al. 2000; Gowher and Jeltsch 2001). These differences between bacterial C5 methyltransferases and those of vertebrates reinforce the idea that C5 methylation is not the major source of CpG deficiency in bacteria. It is possible that a more fundamental mechanism, rather than cytosine methylation, is affecting dinucleotide relative abundance and distribution in bacterial genomes.
Third, RM systems are frequently gained and lost by horizontal transfer (Kobayashi 2001). As such, the presence of C5 methyltransferase is intermittent, and possibly rare, which necessarily implies a much weaker bias from methylated cytosine deamination than in genomes that contain C5 methyltransferase permanently, such as the human genome. Most free-living bacteria are not CpG deficient, in contrast to pathogens and symbionts. Therefore, the contribution of RM systems to CpG deficiency in bacteria appears doubtful in analyses involving either current or historical parameters. Interestingly, it was reported that free-living pathogens had a significantly higher GC content than intracellular pathogens and symbionts (Rocha and Danchin 2002). Here we show that CpG deficiency correlates with GC content and lifestyle.
In this study we find that C3pG1 relative abundance and GC content are generally positively correlated in those bacterial species that show CpG deficiency. We obtained qualitatively similar correlations using C1pG2 and C2pG3 in this analysis (results not shown). This strengthens the link between CpG dinucleotide relative abundance and GC content in bacteria. Identical correlations have been found in humans (Aissani and Bernardi 1991; Pesole et al. 1997) and RNA viruses (Rima and McFerran 1997). It was subsequently pointed out that this could be a mathematical artifact caused by the high mutation rate of methylated CpG dinucleotides (Duret and Galtier 2000). As methylated CpG deaminates to TpG or CpA dinucleotides, the number of C and G decreases in this process. This would lead to a lower expected number of CpG dinucleotides in the new sequence compared to the original sequence. This effect is found to be more evident when the GC content increases (Duret and Galtier 2000). However, the mutation process from methylated CpG to TpG dinucleotide is not present in most of the bacteria that show CpG deficiency. This is implied by the parallel patterns of change of CpG and TpG at different GC contents in bacteria. As a result, Duret and Galtier's artifact hypothesis does not satisfactorily explain the association of GC content and CpG deficiency in the bacterial context.
Two functions have been suggested for DNA methylation. A primary function is to defend a genome against the invasion of bacteriophages or transposable elements, and a secondary function, newly developed in evolutionary history, is connected with the regulation of gene expression (Yoder et al. 1997). We classify the organisms having DNA methylation into two groups according to these functions: the first group includes bacteria, fungi, and invertebrates; the second group includes vertebrates and plants. Only in the second group are CpG dinucleotides massively methylated or demethylated to regulate gene expression. In conclusion, only the DNA methylation playing the secondary function in vertebrates and plants can be persuasively linked to CpG deficiency.
In fact, within the animal kingdom the above boundary should be moved forward to the sea urchin, the only invertebrate species in which a Dnmt1-like methyltransferase has been identified (Aniello et al. 1996, 2003). As such, it should be distinguished from the other invertebrates. Dnmt1 is critical to the secondary function (Ramsahoye et al. 2000), so the presence of a Dnmt1-like protein in sea urchin probably reflects a strong requirement for developmental regulation. Therefore, the evolution of methyltransferase genes from bacteria to humans reflects the requirement for specialized functions in more complex organisms, making DNA methylation evolve from a protection mechanism to an epigenetic mechanism. This enables an organism to have an increased life span and to survive under more complex environmental conditions. This benefit comes at a cost. For one, vertebrate genomes confront a huge mutation pressure on the recognition sites for DNA methylation. Until now, no study has shown that vertebrates have found a strategy to compensate for the depleted CpG dinucleotides. Theoretically, continued CpG depletion will lead to a vertebrate genome crisis.
We studied the link between C5 methylation and CpG content in bacteria and found no significant correlation. Thus, C5 methylation is probably not the major factor inducing CpG deficiency in bacteria and more effort should be invested in looking for alternative explanations for this phenomenon. Finally, this study indicates that CpG dinucleotide deficiency is related to GC content. This can be taken as a clue in the search for factors that induce CpG deficiency in bacteria.