A short time ago, I mentioned this article. This study was the product of a collaboration between five laboratories – two plant poly(A) labs, a seed biology lab, and two bioinformatics groups. As the abstract indicated, this paper describes the results of a characterization of polyadenylation in plants using so-called Next Generation DNA sequencing technology; as such it is an addition to other recent studies, albeit the first (to my knowledge) that deals with plants.
I’m more than happy to answer questions about the paper in the comments. What I will do in this essay is describe one of the more perplexing findings, and “amend” the PNAS paper with a few illustrations that we couldn’t include in the paper (or even in the online Supplemental Files – we maxed out the print and SI page limits).
First, the curious finding. The main point of the paper was an accounting of all of the poly(A) sites “encoded” by the Arabidopsis genome (at least all of the sites seen in mRNAs present in the leaf and seed). As Fig. 1 of the paper shows, a bit more than 10% of all “sense”* sites fell within annotated protein-coding regions. (The study by Shepard et al. reported a similar finding.) The existence of so many of these sites was quite unexpected. This is because processing and polyadenylation within a protein-coding region will, except in rare cases, produce an RNA that lacks a translation termination codon. These so-called non-stop mRNAs are expected to be relatively unstable, and polypeptides produced by translating such RNAs should be degraded; these activities are necessary to promote proper recycling of ribosomes, which would otherwise pile up along such mRNAs owing to the absence of translation termination codons. I haven’t got a good explanation for these sites, other than that they exist (and they are not rare, judging from the numbers of tags that correspond to these sites). This goes to show, though, that there is probably much to be learned about non-stop mRNAs in plants, and about the interplay between mRNA surveillance and RNA processing.
A related observation we made was that these coding sequence-associated poly(A) sites seemed to possess different polyadenylation signals. This was determined by using a sort of genome-scale assay. The approach has been described before (for example, here and here); basically, what one does is line up all of the sequences adjacent to poly(A) sites and tabulate the relative frequencies of the four bases on a position-by-position basis. When this is done for “normal” plant poly(A) sites, one gets this (from our recent PNAS paper):
As I alluded to before, one can easily see the A-rich Near Upstream Element and U-rich Cleavage Element (that surrounds the actual poly(A) site at -1). Here is what the coding sequence-associated sites look like:
What is striking about these sites is the relative paucity of U’s around the poly(A) site and the relative abundance of G’s. Indeed, these sites are defined by their A+G richness. These sites thus would seem to be a class of non-canonical poly(A) site, the first described for plant nuclear genes. (This isn’t really surprising – in animals, non-canonical sites can be recognized by how far their hexanucleotide motif deviates from the canonical AAUAAA. Plants have no highly conserved signal, so it hasn’t really been possible to use this as a basis for poly(A) signal classification.)
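For readers who want to try the position-by-position tabulation described above, here is a minimal sketch in Python. It is not the code used for the paper; the function names (`position_frequencies`, `ag_richness`) and the four toy sequences are made up for illustration, and it assumes you have already extracted equal-length windows of sequence, all aligned the same way relative to the poly(A) site.

```python
from collections import Counter

BASES = "ACGU"


def position_frequencies(windows):
    """Tabulate relative base frequencies, column by column.

    `windows` is a list of equal-length sequences, each aligned the same
    way relative to its poly(A) site (an assumption of this sketch).
    DNA input is accepted; T is converted to U.
    """
    if not windows:
        raise ValueError("need at least one sequence window")
    seqs = [w.upper().replace("T", "U") for w in windows]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all windows must have the same length")

    profile = []
    for column in zip(*seqs):  # one tuple of bases per aligned position
        counts = Counter(column)
        total = sum(counts[b] for b in BASES)  # ignore N and other symbols
        profile.append(
            {b: (counts[b] / total if total else 0.0) for b in BASES}
        )
    return profile


def ag_richness(profile, start, end):
    """Mean A+G fraction over profile positions start..end-1."""
    cols = profile[start:end]
    if not cols:
        raise ValueError("empty position range")
    return sum(c["A"] + c["G"] for c in cols) / len(cols)


if __name__ == "__main__":
    # Toy windows, invented for this example (not data from the paper).
    toy = ["AAGU", "AGGU", "AUGU", "ACGA"]
    prof = position_frequencies(toy)
    for i, col in enumerate(prof):
        print(i, " ".join(f"{b}={col[b]:.2f}" for b in BASES))
    print("A+G over first three positions:", round(ag_richness(prof, 0, 3), 3))
```

Plotting the four frequencies against position gives the kind of profile discussed above; comparing `ag_richness` (or the U fraction) over the same range of positions for two sets of sites is one rough way to see the difference between the “normal” and coding sequence-associated classes.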
Second, what is neat (and addicting, and eventually exhausting and annoying) is the genome-wide picture of polyadenylation that can be put together using computational tools. We originally intended to provide several examples in the Supplemental Files, but the powers-that-be told us that each illustration would be considered a separate figure, and thus count against the 10-item limit. So we had to remove them. Here, for your viewing pleasure, is a small sampling to illustrate some of the interesting things that can be seen:
In case anyone is interested, these illustrations were made using CLC Genomics Workbench (to map tags to the genome and make the .sam and .bam files), SAMtools (to do the indexing of the .bam files), and the Integrative Genomics Viewer (to display the tags using the .bam files). I’m far from a computer geek, but even I can manage to use these tools with little frustration.
Enjoy the paper, and, as I said above, feel free to ask questions in the comments.
* – so-called “sense” sites are those sites that are oriented in the same direction as an annotated Arabidopsis gene or feature.