Abstract
Environmental microbes often form complex communities that live in soil, in water, or in symbiotic or parasitic relationships with plant or animal hosts. RNA-sequencing technologies provide a cost-effective means of generating short reads from the RNA expressed by these communities. Analysis of such data sets, however, requires two computational components: (a) the reconstruction of full-length transcripts from short reads; and (b) the taxonomic characterization of each transcript. Here, we propose to implement novel algorithms that address both challenges, combining efficient strategies that exploit error patterns specific to next-generation sequencing technologies with machine-learning methods that assign a taxon to each transcript. We will apply these methods to data sets generated by our collaborators: endophytic fungi in spruce and aspen, whose analysis can be developed into a monitoring tool for tree and forest health; and cyanobacterial communities in water, which frequently produce toxins and degrade water quality. The proposed project will significantly expand existing methods for assessing and monitoring forest health and environmental change.