The sequence of the repeat allele has been reported, but the 7-, 9-, and repeat ones have not previously been determined. We determined the genomic structure of the 7-, 9-, , and repeat alleles and identified nine types of repeat unit, A — I.
Each repeat unit is different from the other units and contains two to five nucleotide substitutions. We found the 7- and 9-repeat alleles to have a novel repeat unit type I , which has five nucleotide substitutions from type A Figure 1. Structure of human DAT1 gene. The genomic structure shown here is based on Kawarai et al.
PCR primers are shown under the scheme for exon 15 by arrowheads. There were four haplotypes and nine units of repeat, types A — H and I , according to our sequencing studies; the repeat numbers of the VNTR polymorphism are shown on the left. The gray squares indicate the novel repeat unit, type I. Unit types are shown on the left. Nucleotides that differ from their counterparts in the type A unit are underlined. And the luciferase expression of the positive control pSecDK-luc, which contains the bovine growth hormone polyadenylation signal, is much greater than that of any other fusion vector.
The relative luciferase activities are 0. We also observed the same tendency using human glioblastoma A cells which express DAT1 naturally data not shown. The second open square indicates the open reading frame of luciferase. The study of DAT knockout mice suggested that the expression of the DAT1 gene is important in controlling synaptic dopamine levels. Homozygous DAT knockout mice show many adaptive changes such as a decreased content of dopamine in presynaptic terminals, significant downregulations of preproenkephalin A, D1 and D2 receptor mRNA, and a loss of tyrosine hydroxylase, while heterozygotes display intermediate values.
This suggests that these effects observed in both COS and A cells are thought to be conserved among higher species. Our in vitro results do not necessarily reflect the in vivo situation. In this work, we investigated the effects of 9-repeat and repeat alleles on the expression of the DAT1 gene in addition to the newly-found 7-repeat allele.
The effects of the rarer repeat on the expression of the DAT1 gene should be tested in future. In addition, we could not confirm the recent finding of Ueno. Genomic DNA was extracted from peripheral blood leukocytes according to the standard method. The luciferase expression vectors were transiently transfected into COS-7 cells by the electroporation method.
Iversen LL. Role of transmitter uptake mechanisms in synaptic neurotransmission Br J Pharmacol 41 : — Neurotransmitter transporters: recent progress Annu Rev Neurosci 16 : 73— Giros B, Caron MG. Molecular characterization of the dopamine transporter Trends Pharmacol Sci 14 : 43— Cloning, pharmacological characterization, and chromosome assignment of the human dopamine transporter Mol Pharmacol 42 : — CAS Google Scholar.
A conceptually similar construct is the repeat graph 28 , with sequences repeated multiple times in a genome represented by the same vertex. Graph analysis has been used to encode the elementary duplication structure of a genome 29 and for multiple-sequence alignment of repetitive sequences with shuffled domains 30 , making them well-suited to represent VNTRs that differ in both repeat count and composition.
The most straight-forward approach that combines a pangenome graph and a repeat graph is a de Bruijn graph, and was the basis of one of the earliest representations of a pangenome by the Cortex method 31 , The de Bruijn graph has a vertex for every distinct sequence of length k in a genome k- mer , and an edge connecting every two consecutive k -mers, thus k -mers occurring in multiple genomes or in multiple times in the same genome are stored by the same vertex.
While the Cortex method stores entire genomes in a de Bruijn graph, we construct a separate locus-RPGG for each VNTR and store a genome as the collection of locus-RPGGs, which deviates from the definition of a de Bruijn graph because the same k -mer may be stored in multiple vertices. We find loci that demonstrate population structure with respect to the inferred lengths of VNTR sequences, and importantly to discover 8, loci that show differential motif usage between populations.
We used public LRS data for 19 individuals with diverse genetic backgrounds, including genomes from individual genome projects 33 , 34 , structural variation studies 14 , and diversity panel sequencing 22 Fig.
Each genome was sequenced by either PacBio contiguous long read, or high-fidelity sequencing between 16 and fold coverage along with matched 22—fold Illumina sequencing Table 1. This data reflects a wide range of technology revisions, sequencing depth, and data type, however subsequent steps were taken to ensure accuracy of RPGG through locus redundancy and SRS alignments. We developed a pipeline that partitions LRS reads by haplotype based on phased heterozygous SNVs and assembles haplotypes separately by chromosome.
For other datasets, long-read data were used to phase SNVs. Reads from each chromosome and haplotype were independently assembled using the Flye assembler 36 for a diploid of 0. Total graph size built from GRCh38 and an average genome are also shown.
The alignments may be either used to select which loci may be accurately mapped in the RPGG using SRS that match the assemblies red , or may be used to estimate lengths on sample datasets blue. Source data are provided as a Source Data file. During locus assignment, danbing-tk expands boundaries and merges loci to ensure boundaries of all VNTRs are well-defined and harmonized across genomes Methods Fig.
The RPGGs are a collection of independently constructed bi-directional de Bruijn graphs of each VNTR locus and flanking bases from the haplotype-resolved assemblies. The RPGG differs from a standard bi-directional de Bruijn graph because a k -mer may be repeated in multiple subgraphs. The alignment of a read to an RPGG may be defined by the path in the RPGG with a sequence label that has the minimum edit distance to the read among all possible paths.
While methods exist to find alignments that do not reuse cycles 38 , others allow alignment to cyclic graphs but with high computational costs when applied to RPGG 27 or are limited to alignment in STR regions 9.
Efficient alignment with cycles is a more challenging problem recently solved by GraphAligner 39 to map long reads to pangenome graphs.
Using danbing-tk, When reads from the entire genome are considered, for The graph pruning step is the primary cause of missed alignments, and affects on average 2, loci per assembly. On real data, danbing-tk required If SRS reads from a genome were sequenced without bias, sampled uniformly, and mapped without error to the RPGG, the count of a k -mer in a locus mapped by an SRS sample should scale by a factor of read depth with the sum of the count of the k -mer from the locus of both assembled haplotypes for the same genome.
The quality of alignment aln- r 2 and sequencing bias were measured by comparing the k -mer counts from the 19 matched Illumina and LRS genomes Fig. Specifically, Loci with false mapping but retained in the final set have lower but still decent length-prediction accuracy 0. The limited sequence diversity provided by repeat-GRCh38 at this locus failed to recruit reads that map to paths existing in the RPGG but missing or only partially represented in repeat-GRCh A linear fit between the k -mers from mapped reads and the ground truth assemblies shows that there is a fold gain in slope, or measured read depth, when using RPGG compared to repeat-GRCh38 Fig.
The alignment quality is measured by the r 2 of a linear fit between the k -mer counts from the ground truth assemblies and from the mapped reads Methods. Loci with alignment quality less than 0. To visualize paths with less mapped reads, k -mer counts are clipped at left , middle , and right , respectively, with the maximal k -mer count of each graph being , , and , respectively.
New genomes with arbitrary combinations of motifs and copy numbers in VNTRs should still align to an RPGG as long as the motifs are represented in the graph. SRS reads from the missing genome were mapped into the RPGG, and the estimated locus lengths were compared to the average diploid lengths of corresponding loci in the missing LRS assembly.
As the SRS datasets used in this study during pangenome construction were collected from a wide variety of studies with different biases, there was no consistent LSB in either repetitive or nonrepetitive regions for samples from different sequencing runs Supplementary Figs.
However, principal component analysis PCA of repetitive and nonrepetitive regions showed highly similar projection patterns Supplementary Fig. Leveraging this finding, a set of nonrepetitive control regions were used to estimate the LSB of an unseen SRS sample Methods , giving a median length-prediction accuracy of 0. The read depth of a repetitive region correlates to the locus length when aligning short reads to a linear reference genome. However, estimation of VNTR length from read depth has an accuracy of 0.
Boxes span from the lower quartile to the upper quartile, with horizontal lines indicating the median. Whiskers extend to points that are within 1.
Loci are ordered along the x -axis by genotyping accuracy in repeat-GRCh The red dotted line indicates the baseline without any improvement. The fraction of reads from these datasets that align to the RPGG ranges from 1.
Consistent with the finding in previous leave-one-out analysis, genomes from the same study cluster together in the PCA plot of LSB, and so within each dataset and locus, k -mer counts from SRS reads normalized by sequencing depth were used to compare VNTR content across genomes. The 1KGP samples contain individuals from African When comparing the average population length to the global average length, Each point is an individual. For each locus, the proportion of variance explained by the most informative k -mer in the EAS is shown for the EAS and AFR populations on the x - and y -axes, respectively.
Points are colored by the difference in normalized k -mer counts, with red and blue indicating k -mers more abundant in EAS and AFR populations, respectively. Edges are colored if the k -mer count is biased toward a certain population. VNTR loci that are unstable may undergo hyper-expansion and are implicated as a mechanism of multiple diseases 4.
Alignment to an RPGG provides information about motif usage in addition to estimates of VNTR length because genomes with different motif composition will align to different vertices in the graph. To detect differential motif usage, we searched for loci with a k -mer that was more frequent in one population than another and not simply explained by a difference in locus length, comparing African AFR and East Asian populations for maximal genetic diversity.
Lasso regression against locus length was used to find the k -mer with the most variance explained VEX in EAS genomes, denoted as the most informative k -mer mi-kmer.
A top example of these loci with r 2 at least 0. As the danbing-tk length estimates showed population genetic patterns expected for human diversity, we assessed whether danbing-tk alignments could detect VNTR variation with functional impact.
Genomes from the GTEx project were mapped into the RPGG to discover loci that have an effect on nearby gene expression in a length-dependent manner. Similar to the population analysis, the k -mer dosage was used as a proxy for locus length. The quantile-quantile plot shows the observed P- value of each association test two-sided t -test versus the P -value drawn from the expected uniform distribution. The eVNTRs have the potential to yield insight to disease.
The protein product of ERAP2 , or endoplasmic reticulum aminopeptidase 2, is a zinc metalloaminopeptidase involving in the process of Class I MHC mediated antigen presentation and innate immune response. Abnormal expansion of the VNTR might increase autoimmune disease risk through reducing ERAP2 expression, leaving longer and more antigenic peptides, yet potentially higher fitness against virus infection Deletion of this gene is linked to Koolen-de Vries syndrome 47 , and the locus is associated with Parkinson disease Previous commentaries have proposed that variation in VNTR loci may represent a component of undiagnosed disease and missing heritability 49 , which has remained difficult to profile even with whole-genome sequencing To address this, we have proposed an approach that combines multiple genomes into a pangenome graph that represents the repeat structure of a population.
This is supported by the software, danbing-tk and associated RPGG. We used danbing-tk to generate a pangenome from 19 haplotype-resolved assemblies, and applied it to detect VNTR variation across populations and to discover eQTL. For example, with our approach, we are able to detect loci with differential motif usage between populations, which could be difficult to characterize using an approach such as multiple-sequence alignment of VNTR sequences from assembled genomes.
There are several caveats to our approach. Datasets combined from disparate sequencing runs with batch effects will affect dosage estimates. In contrast to other pangenome approaches 27 , 38 , danbing-tk does not keep track of a reference e. Furthermore, because it is often not possible to reconstruct a unique path in an RPGG, only counts of mapped reads are reported rather than the order of traversal of the RPGG. An additional caveat of our approach is that genotype is calculated as a continuum of k -mer dosage rather than discrete units, prohibiting direct calculation of linkage-disequilibrium for fine-scale mapping The rich data Supplementary Data 7 — 9 provided by danbing-tk and pangenome analysis provide the basis for additional association studies.
While most analysis in this study focused on the diversity of VNTR length or association of length and expression, it is possible to query differential motif usage using the RPGG. The ability to detect motifs that have differential usage between populations brings the possibility of detecting differential motif usage between cases and controls in association studies.
This can help distinguish stabilizing versus fragile motifs 52 , or resolve some of the problem of missing heritability by discovering new associations between motif and disease Finally, this work is a part of ongoing pangenome graph analysis 53 , 54 , and represents an approach to generating pangenome graphs in loci that have difficult multiple-sequence alignments or degenerate graph topologies. Additional methods may be developed to harmonize danbing-tk RPGGs with genome-wide pangenome graphs constructed from other methods.
TRF 37 v4. The scope of this work focuses on VNTRs that cannot be resolved by typical short-read sequencing methods. To maintain data quality, VNTR loci that could not be assigned homology were removed from datasets. A misannotation of VNTR boundaries can cause erroneous length estimates. The algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share k -mers with the VNTR regions from all haplotypes.
Haplotypes with distance between adjacent loci inconsistent with the majority of haplotypes are removed. The second data structure enables graph threading by storing a bi-directional de Bruijn graph for each locus. The third data structure is used for counting k -mers originating from VNTRs. The read mapping algorithm maps each pair of Illumina paired-end reads to a unique VNTR locus in three phases: 1 In the k -mer set mapping phase, the read pair is converted to a pair of canonical k -mer multisets.
The VNTR locus with the highest count of intersected k -mers is detected with the first data structure. To account for sequencing and assembly errors, the algorithm is allowed to edit a limited number of nucleotides in a read if no matching k -mer is found in the graph. The read pair is determined feasible to map to a VNTR locus if the number of mapped k -mers is above an empirical threshold. Finally, the read mapping algorithm returns the k -mer counts for all loci as mapped by SRS reads.
Alignment timing was conducted on an Intel Xeon Ev2 2. Pan-genome representation provides a more thorough description of VNTR diversity and reduces reference allele bias, which effectively improves the quality of read mapping and downstream analysis. We ran the read mapping algorithm with error correction disabled so as to detect k -mers unsupported by SRS reads. The three data structures were updated by deleting all unsupported k -mers for each locus.
By referencing the mapping relation of VNTR loci across individuals, we encoded the variability of each VNTR locus by merging the three data structures across individuals. To evaluate the quality of the haplotype assemblies and the performance of the read mapping algorithm, VNTR k -mer counts in the original assemblies were regressed against those mapped from SRS reads.
The r 2 of the linear fit was used as the alignment quality score referred to as aln- r 2. To measure alignment quality in the pan-genome setting, only the k -mer set derived from the genotyped individual was retained as the input for regression. A final set of 32, VNTR regions was called by filtering based on aln- r 2. The quality of a locus was measured by the mean aln- r 2 across individuals.
Loci with mean aln- r 2 below 0. The final pan-genome graphs were used to genotype large Illumina datasets, measure length-prediction accuracy, analyze population structures and map eQTL.
Read depths at VNTR regions usually vary considerably from locus to locus. Furthermore, the sampling bias of different sequencing runs are also different, which limits our ability to genotype the accurate length of VNTRs.
We cross-validate our approach by leaving one sample out of the pan-genome database and evaluating the prediction accuracy on the excluded sample. For comparison, VNTR lengths were also estimated by a read depth method. A one-band pattern can occur if the size of the two parental bands are the same or nearly the same. For our simple example of three different alleles designated A, B, and C illustrated above, six unique DNA profiles are possible.
When profiles from a single VNTR locus from unrelated individuals are compared, the profiles are normally different. However it is possible for two individuals to have the same profile at one or two loci by chance.
0コメント