Removing host sequences in microbiome datasets
Removing host sequences before assembly shortens the time-consuming assembly step and is possible whenever a host reference genome is available. The “dehosting” process consists of the following steps.
1. Bowtie2 mapping to the host
Mapping all reads against the host genome identifies the reads that need to be removed.
a. Create bowtie2 index database (host_DB) from host reference genome
bowtie2-build host_genome.fna host_DB
b. bowtie2 mapping against host sequence database, keep both mapped and unmapped reads (paired-end reads)
bowtie2 -x host_DB -1 SAMPLE_r1.fastq -2 SAMPLE_r2.fastq -S SAMPLE_mapped_and_unmapped.sam
c. Convert the .sam file to .bam
samtools view -bS SAMPLE_mapped_and_unmapped.sam > SAMPLE_mapped_and_unmapped.bam
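If the intermediate .sam file is not needed, steps b and c can be combined by piping the bowtie2 output straight into samtools. The following is only a minimal sketch: the wrapper function map_to_host and the THREADS setting are hypothetical additions for illustration, while the bowtie2 and samtools options are the same ones used above plus bowtie2's -p (number of threads).

```shell
# Sketch: map paired-end reads to the host index and write BAM directly,
# skipping the intermediate .sam file.
# map_to_host is a hypothetical helper name; THREADS is an assumed setting.
THREADS=${THREADS:-4}

map_to_host() {
  sample=$1   # sample prefix, e.g. SAMPLE
  index=$2    # bowtie2 index prefix, e.g. host_DB
  # Without -S, bowtie2 writes SAM to stdout; "-" makes samtools read stdin.
  bowtie2 -p "$THREADS" -x "$index" \
    -1 "${sample}_r1.fastq" -2 "${sample}_r2.fastq" \
  | samtools view -bS - > "${sample}_mapped_and_unmapped.bam"
}

# Usage (requires bowtie2 and samtools on PATH):
#   map_to_host SAMPLE host_DB
```

The resulting SAMPLE_mapped_and_unmapped.bam can be used in step 2 exactly like the file produced by the separate conversion.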
2. Filter required unmapped reads
a. SAMtools SAM-flag filter: get unmapped pairs (both ends unmapped)
samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothEndsUnmapped.bam
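-f 12 Extract only (-f) alignments with both reads unmapped (flag 4 = read unmapped, plus flag 8 = mate unmapped)
-F 256 Do not extract (-F) alignments that are not primary alignments (flag 256)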
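To see why this flag combination keeps only pairs in which neither mate mapped, the same bit test can be written out in shell arithmetic. This is an illustrative sketch only: passes_filter is a hypothetical helper, not part of samtools, and the bit values (4 = read unmapped, 8 = mate unmapped, 256 = not primary alignment) are taken from the SAM flag definition.

```shell
# Sketch of the test that `samtools view -f 12 -F 256` applies to each record.
# passes_filter is a hypothetical helper, not part of samtools.
passes_filter() {
  flag=$1
  # -f 12:  bits 4 (read unmapped) and 8 (mate unmapped) must both be set
  # -F 256: bit 256 (not primary alignment) must not be set
  [ $(( flag & 12 )) -eq 12 ] && [ $(( flag & 256 )) -eq 0 ]
}

# Try a few common paired-end flag values.
for flag in 77 141 73 99; do
  if passes_filter "$flag"; then
    echo "$flag kept"
  else
    echo "$flag dropped"
  fi
done
```

Flags 77 and 141 are the usual values for the first and second mate of a pair in which neither read aligned, so both are kept. Flag 73 marks a mapped read whose mate is unmapped, and 99 a properly paired mapped read, so both are dropped.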
3. Split paired-end reads into separate fastq files (.._r1, .._r2)
a. Sort the bam file by read name (-n) so that paired reads are next to each other, as required by bedtools (syntax for samtools 1.3 or later, which takes the output file via -o)
samtools sort -n -o SAMPLE_bothEndsUnmapped_sorted.bam SAMPLE_bothEndsUnmapped.bam
b. Convert bam to fastq
bedtools bamtofastq -i SAMPLE_bothEndsUnmapped_sorted.bam -fq SAMPLE_host_removed_r1.fastq -fq2 SAMPLE_host_removed_r2.fastq
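A quick sanity check after the conversion is to confirm that both output files contain the same number of reads. The sketch below assumes standard four-line FASTQ records; count_reads and check_pairs are hypothetical helper names introduced here for illustration.

```shell
# Sketch: count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}

# Compare the read counts of the two mate files; non-zero exit on mismatch.
check_pairs() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "Mismatch: $n1 vs $n2 reads" >&2
    return 1
  fi
}

# Usage:
#   check_pairs SAMPLE_host_removed_r1.fastq SAMPLE_host_removed_r2.fastq
```

A mismatch usually means the bam file was not sorted by read name before the conversion.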