We are particularly concerned with the problem of detecting low-abundance species in complex datasets. This includes detecting spurious bacterial pathogens in human or animal samples. In this task, Taxoner is at least as accurate as, and at times more accurate than, BLAST + MEGAN, and it requires considerably less CPU time. Taxoner is a program written in C that identifies taxa, primarily bacteria, by mapping NGS reads to a comprehensive sequence database such as the NCBI NT database or its predefined subsets. The program is designed to run on standard desktop or laptop computers under the Linux operating system.
The idea behind Taxoner comes from a technical problem. Running fast aligners such as Bowtie2 on a large number of microbial genomes is prohibitively time-consuming since, at least in principle, each of the small genomes has to be indexed separately. However, if we concatenate the small bacterial genomes into larger units, i.e. concatenated FASTA files that we term "artificial chromosomes", the problem becomes more manageable. In such an artificial chromosome, a genome is a segment that is annotated by various identifiers, including taxonomic name and GI identifier. As such, the number of reads matching a particular genome can be counted at various taxonomic levels, which corresponds to the well-known principle of taxonomic binning. The only prerequisite is to know the start and end points of the genomes and/or other segments incorporated into the "artificial chromosome", which is solved by pre-calculated index files. Importantly, this process is analogous to the mapping of reads to an annotated genome wherein the segments (i.e. the genes) are named according to such schemes as COG, GO etc. Namely, in both cases, we map a read to a large sequence consisting of annotated segments, and the segments are named according to various ontologies.
As a consequence, this algorithm can be used both for taxon identification and for function prediction based on NGS datasets. We highlight that mapping of counts to ontologies, sometimes also referred to as "ontology binning", is a problem known in other fields of medical informatics.
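The segment lookup behind this kind of binning can be illustrated with a minimal sketch. It is not Taxoner's actual C implementation. The toy genomes, identifiers, and function names (`build_artificial_chromosome`, `segment_for`, `bin_reads`) are hypothetical, and the alignment step itself is assumed to have already produced hit positions on the concatenated sequence. The sketch only shows how a pre-calculated list of segment start coordinates lets a hit position be traced back to its genome and counted per taxon.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical toy genomes: (identifier, taxon name, sequence).
GENOMES = [
    ("gi|0001", "Escherichia coli", "ACGTACGTAC"),
    ("gi|0002", "Bacillus anthracis", "TTGACCA"),
    ("gi|0003", "Escherichia coli", "GGGCCCAAAT"),
]


def build_artificial_chromosome(genomes):
    """Concatenate genomes and record the start offset of each segment."""
    starts, labels, parts = [], [], []
    offset = 0
    for identifier, taxon, sequence in genomes:
        starts.append(offset)              # 0-based start of this segment
        labels.append((identifier, taxon))
        parts.append(sequence)
        offset += len(sequence)
    return "".join(parts), starts, labels, offset


def segment_for(position, starts, labels, total_length):
    """Return the (identifier, taxon) of the segment containing a position."""
    if not 0 <= position < total_length:
        raise ValueError(f"position {position} is outside the chromosome")
    # Binary search: index of the last segment start that is <= position.
    return labels[bisect_right(starts, position) - 1]


def bin_reads(hit_positions, starts, labels, total_length):
    """Count hits per taxon (an assumed, simplified form of binning)."""
    counts = Counter()
    for position in hit_positions:
        _, taxon = segment_for(position, starts, labels, total_length)
        counts[taxon] += 1
    return counts


chromosome, starts, labels, total_length = build_artificial_chromosome(GENOMES)
counts = bin_reads([3, 12, 20, 25], starts, labels, total_length)
for taxon, n in sorted(counts.items()):
    print(taxon, n)
```

The same lookup would work if the labels were gene or function identifiers instead of taxon names, which is the analogy drawn above. A real tool would also have to handle reads that span a boundary between two segments, which this sketch ignores.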
As far as proteins are concerned its extension to more complex