Tandem Duplication Detection in a genome

There many different ways to detect tandem duplications(TD) in a genome, although each is unique in the ability to detect TD. Here I will cover how to use a software that can identify more ancient tandem duplications in a genome, and one that can only identify recent and highly similar TD. Redtandem essentially chains smaller alignments together, so that more diverged and ancient TD can be identified. Mummer is essentially a self-alignment of the genome, that readily identifies recent TD, but misses ancient TD that have diverged in nucleotide similarity.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-83

Obtain your genome file

1
2
module load GIF2/redtandem/ReDtandem
ln -s ../08_LTRFinder/TAIR10_chr_all.fas

REDTANDEM – find ancient duplications

Redtandem seems to frequently get hung up on small satellite sequences, so I mask my genome with repeatmodeler/repeatmasker first (see RepeatModeler_RepeatMasker tutorial).

Format the fasta file headers so they match the format necessary to run redtandem.pl

1
2
3
4
5
6
7
8
9
10
11
12
1.  The  FASTA  header should  be formatted as ">ath1_1-9639975" where

    * ath : identifies the species considered
    * 1 :   is the chromosome number
    * 1-9639975 :  positions of the  first and last nucleotide  in the
      sequence.

2.  We also need to remove all lower-case letters from the sequence.  

#Here is how I changed the TAIR10 fasta headers to the correct format

sed -e 's/a/A/g' -e 's/c/C/g' -e 's/g/G/g' -e 's/t/T/g' TAIR10_chr_all.fas.masked | bioawk -c fastx '{print ">ath"$name"_1-"length($seq)" \n"$seq}' TAIR10_chr_all.fas.masked  |fold -79 >TAIR10Masked.reformatted.fasta

Now it is time to run redtandem. This took 33 minutes for this masked arabidopsis sequence

1
perl /shared/software/GIF/programs/redtandem/ReDtandem/ReDtandem.pl --species ath --dnafile TAIR10Masked.reformatted.fasta

Redtandem output

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#chrom  start   end     u_start u_end   numDupli        dupli
1       147972  156372  148332  149762  2       147972..150120,154224..156372,
1       428520  439980  437962  439554  3       428520..430966,434762..437208,437534..439980,
1       453020  458137  455693  458024  2       453020..455578,455579..458137,
1       494704  497486  494934  495847  2       494704..496076,496114..497486,
1       523538  535050  529802  534543  2       523538..529294,529294..535050,
1       540988  544488  541221  542507  2       540988..542738,542738..544488,
1       639609  641357  640662  640947  4       639609..639977,640103..640471,640620..640988,640989..641357,
1       657884  666072  660741  663549  3       657884..660502,660835..663453,663454..666072,
1       669140  673972  668973  671724  2       669140..671556,671556..673972,
1       779859  793763  780284  781977  2       779859..782401,791221..793763,
1       786444  791597  789191  791429  2       786444..789020,789021..791597,
1       1165235 1178464 1177142 1178199 2       1165235..1166823,1176876..1178464,
1       1248981 1289958 1287751 1289516 2       1248981..1251631,1287308..1289958,
1       1636170 1647433 1636511 1637868 2       1636170..1638208,1645395..1647433,
1       1689725 1696943 1690264 1692413 2       1689725..1692951,1693717..1696943,
1       1701014 1704107 1703183 1703922 2       1701014..1702124,1702997..1704107,
1       1701904 1704801 1703991 1704639 2       1701904..1702878,1703827..1704801,
1       1824178 1828590 1824530 1826034 2       1824178..1826384,1826384..1828590,
1       1843523 1858296 1847872 1849786 4       1843523..1846397,1847391..1850265,1850958..1853832,1855422..1858296,
1       1922068 1930673 1922412 1926028 2       1922068..1926370,1926371..1930673,
1       1935216 1940681 1935520 1936731 2       1935216..1937034,1938863..1940681,
1       2025534 2037336 2035801 2037200 4       2025534..2027206,2032166..2033838,2033992..2035664,2035664..2037336,
1       2291092 2299901 2291361 2295228 2       2291092..2295496,2295497..2299901,
1       2408858 2413171 2412187 2412974 2       2408858..2410040,2411989..2413171,
1       2454770 2465569 2455250 2457163 3       2454770..2457642,2458227..2461099,2462697..2465569,
1       2482344 2488116 2485315 2488032 2       2482344..2485230,2485230..2488116,
1       2522766 2530165 2523363 2525744 2       2522766..2526340,2526591..2530165,
1       2658263 2665830 2663067 2665277 2       2658263..2661581,2662512..2665830,
1       2848303 2877283 2848432 2852017 4       2848303..2852145,2852146..2855988,2867013..2870855,2873441..2877283,
1       2909853 2920925 2916063 2920253 2       2909853..2915389,2915389..2920925,
1       2914257 2921827 2920265 2921514 2       2914257..2916133,2919951..2921827,
1       3055446 3066439 3055291 3061098 2       3055446..3060942,3060943..3066439,
1       3788689 3821367 3797744 3799427 5       3788689..3791215,3793942..3796468,3797322..3799848,3810016..3812542,3818841..3821367,
1       3796791 3814695 3800306 3800859 4       3796791..3797623,3800166..3800998,3812843..3813675,3813863..3814695,
1       3822846 3825559 3824246 3825518 2       3822846..3824202,3824203..3825559,
1       3851875 3859448 3851899 3855638 2       3851875..3855661,3855662..3859448,
1       4196848 4222596 4197232 4198763 2       4196848..4199146,4220298..4222596,
1       4478291 4483976 4482018 4483584 2       4478291..4480643,4481624..4483976,
1       4604651 4607835 4606590 4607586 2       4604651..4606147,4606339..4607835,
1       4619722 4638888 4633655 4636956 5       4619722..4622110,4625656..4628044,4631357..4633745,4634111..4636499,4636500..4638888,
1       4818392 4832076 4822784 4824028 4       4818392..4820260,4822471..4824339,4827856..4829724,4830208..4832076,
1       4833456 4837371 4833729 4834814 2       4833456..4835086,4835741..4837371,
1       4849847 4854580 4852548 4854248 2       4849847..4852213,4852214..4854580,
1       4861119 4871779 4868866 4871223 3       4861119..4864589,4864839..4868309,4868309..4871779,
1       5022895 5031658 5028068 5030940 2       5022895..5027205,5027348..5031658,
1       5127690 5134376 5128088 5129673 2       5127690..5130070,5131996..5134376,
1       5152011 5160449 5154634 5155596 4       5152011..5153457,5154391..5155837,5157101..5158547,5159003..5160449,
1       5211861 5227594 5224483 5226924 4       5211861..5215643,5215655..5219437,5220029..5223811,5223812..5227594,
1       5366916 5381071 5370417 5374382 4       5366916..5370188,5370763..5374035,5374527..5377799,5377799..5381071,
1       5426471 5434100 5431523 5433519 2       5426471..5429631,5430940..5434100,
1       5517047 5562362 5532271 5534876 7       5517047..5519973,5522377..5525303,5525303..5528229,5528627..5531553,5532110..5535036,5535602..5538528,5559436..5562362,
#truncated for example purposes

reformat the redtandem.out to a bed file for comparative purposes

1
less redtandem.outlJHm |awk '{print $1,$6,$7}' | tr ".." " " |tr "," "\n" |sed '/^$/d' |awk -v I=0 '{if(NF>2) {I=$1" "$2;print $0} else {print I,$1,$2}}' |awk '{print $1,$3,$4,$2}' >redtandem.bed

Mummer – Find recent duplications or assembly artifacts

Run mummer (unmasked genome is ok here)

1
mummer -mum -l 500 -b -n -qthreads 16 TAIR10_chr_all.fas TAIR10_chr_all.fas >SelfAlignmentsMummer.out

Reformat the mummer output so that we have something for comparative purposes. Also remove the exact whole chromosome alignments to self.

1
2
3
less SelfAlignmentsMummer.out |awk -v source=0 '{if(substr($1,1,1)==">") {source=$0;} else {print source,$0}}' |awk '{if($3=="Reverse") {print $2,$6,$6+$7"\t"$4,$5,$5+$7} else {print $2,$5,$5+$6"\t"$3,$4,$4+$6}}' |sort -k1,1V -k2,3n |awk '{if($1==$4 && $2==$5 && $3==$6) {next} else {print $0}}' >DuplicatedRegions.txt

awk '{print $1,$2,$3,$4":"$5"-"$6}' DuplicatedRegions.txt >DuplicatedRegions.bed

How do these datasets correllate?

hint: only the recent ones will be seen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
bedtools intersect -wo -a <(less DuplicatedRegions.bed|tr  " " "\t") -b <(less ../10_RedTandem/redtandem.bed |tr " " "\t") |less

1       13386657        13387307        5:10440852-10441502     1       13386618        13387900        5       650
1       13388447        13389095        2:4883882-4884530       1       13387901        13389183        5       648
1       21412731        21413328        1:9677328-9677925       1       21407689        21413155        2       424
1       21412731        21413328        1:9677328-9677925       1       21413321        21418787        2       7
1       24230387        24236076        1:6198742-6204431       1       24233150        24236638        2       2926
1       26643151        26645519        3:22696765-22699133     1       26644607        26646275        2       912
1       26645451        26646587        1:21528398-21529534     1       26644607        26646275        2       824
1       29646499        29647005        1:889432-889938 1       29645831        29647847        4       506
2       10634309        10634911        3:7279349-7279951       2       10633847        10634447        2       138
2       10634309        10634911        3:7279349-7279951       2       10634447        10635047        2       464
2       16079514        16080346        2:3626656-3627488       2       16078489        16081707        2       832
2       16412777        16414454        mitochondria:245475-247152      2       16413740        16416842        2       714
2       16414795        16416555        mitochondria:247520-249280      2       16413740        16416842        2       1760
2       16416552        16417703        mitochondria:249278-250429      2       16413740        16416842        2       290
2       16418728        16419317        mitochondria:251452-252041      2       16418927        16422029        2       390
2       16419544        16420129        mitochondria:252269-252854      2       16418927        16422029        2       585
2       16420374        16422112        mitochondria:253100-254838      2       16418927        16422029        2       1655
3       7367650 7368298 3:16221068-16221716     3       7366465 7368387 8       648
3       17386970        17387523        3:6056936-6057489       3       17387437        17391893        2       86
3       17387524        17388804        3:6057490-6058770       3       17387437        17391893        2       1280
3       17619672        17620314        3:9471268-9471910       3       17617857        17622855        6       642
5       8502569 8503246 5:10203683-10204360     5       8499963 8503197 2       628
5       14858156        14858803        5:67559-68206   5       14858281        14860575        6       522
5       15132063        15132594        1:15049737-15050268     5       15131517        15134069        5       531
5       15157469        15158050        5:11938756-11939337     5       15157873        15160579        2       177
5       15158931        15159949        5:11940219-11941237     5       15157873        15160579        2       1018
5       15260088        15261003        5:11838684-11839599     5       15260164        15263320        3       839
5       18767018        18767554        3:14601781-14602317     5       18763504        18769106        2       536
5       23994417        23995938        4:15556173-15557694     5       23994143        23996001        2       1521

Woah, only 29 regions seem to overlap between the two outputs


Table of contents