In the previous post we had a look at some interesting features of Helicobacter pylori. Now it is time to finally get our hands on the data. This post will present the data preparation, how to get everything ready in a comparable way for easy analyses.
Now that we know more or less what we are looking for, let’s have a look at the data available. The first stop is the NCBI genome section for Helicobacter pylori. This page displays the very basic genome information available.
- The genome size is around 1,6Mb.
- The average GC content is around 38,9%.
- Most importantly, there are 683 available genomes (Damn… when I started this whole thing there were only 607. This is how quickly genomic data moves, people!)
To make working easier, I set up a directory called “Epsilon”, which will be the root of all evil our work. I will download the genomes in the NCBI subfolder:
mkdir Epsilon
cd Epsilon
mkdir NCBI
rsync --copy-links --recursive --include "*.gbff.gz" --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Helicobacter_pylori/latest_assembly_versions/ ./NCBI/
We now have all the available genomes from NCBI in multiple formats: statistics files (assembly report and stats), a Genbank annotated file (gbff), raw sequence files (fna) which sometimes includes a subfile with only the RNA or CDS sequences, only the annotations (gff), the proteins sequences only (faa), etc… A quick scan through the random folders showed some variation in file composition. For example, some genomes don’t have any annotations at all. So from this file dump, we will mainly focus on two types of files. Firstly the fna files, which are the raw unannotated data. We will re-annotate all of the different genomes giving us a complete dataset annotated in the same way (with all annotations suffering the same bias). In parallel, we will use the GenBank files to collect the metadata available in those files (see box on the side).
To have our data tidy and organised, let’s get the fna files out of the NCBI rsync’d folder. Here I created a separate directory called “fasta”. Among the different genomic files, we are only interested in the complete set of sequences, not the CDS or RNA subsets ones (which is why they were not selected – see script below).
The Meta data rage
When I started this project, one important point became obvious to me. Having all these metagenomes to play with is very nice, but a key piece of information is where do they come from? Having the contextual information for each data set is very important, if we want to be able to infer any kind of hypothesis or correlation from the analysis.
Gathering this information from more than 20 years of H. pylori genome research is actually quite challenging. Tracking the origin of the strain, the condition of the host or even just the country of origin is sometimes impossible. Just as an example: To find where the original H. pylori closed genome strain came from, I had to dig through three papers to at the end have a very vague idea that the strain was sampled from either one of the original researchers or came from their laboratory’s culture collections. Who knows? (No really, Who? That would be helpful).
But things are getting better, I understand that more than 10 years ago standards were different and the need for metadata wasn’t as strong as it is today. I manage to collect more than half of the basic metadata that I am using at the moment from the original Genbank files. The information I want is: the country of origin, ethnicity of the host and the disease associated with the host.
I have compiled all of this information in a publicly available list, accessible via my GitHub page here or the big button bellow. Please feel free to use it for your own research.
Now, for any person who can help me complete this list, either by pointing out a mistake or a filling a blank send me an e-mail I am willing to send you an H. pylori postcard (on this template) as a thank you.
ls Pylori/*/*genomic.fna.gz | grep -v "cds" | grep -v "rna" > Original.names.w.folder.tab
Oname=$(<Original.names.w.folder.tab)
for i in $Oname; do cp $i fasta/; done
cd fasta
gunzip -d *
The NCBI page also mentions the presence of plasmids associated with some of the genomes. Even if we could consider them as part of the genetic potential of our H. pylori, we will only compare the genomes for this exercise. We use the following script to remove the plasmid sequences and then change the file and genome sequence names to something shorter, which will make things easier for browsing/managing/scripting. Finally, we will place them in their own directory.
mkdir Renamed_wo_plasmid
a=1;
for i in Original/*.fna;
do old=$(printf $i);
new=$(printf "HP-%04d.wop.fna" "$a");
nname=$(printf "HPwop-%04d" "$a");
#keep track of the old name and the new name in a list
echo -e $old ' \t ' $new >> Old.and.new.wo.plasmid.tab;
#Transform a multiline fasta file in one line per sequence
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $i > Renamed_wo_plasmid/tmp.$new; let a=a+1;
#Found the (P/p)lasmid mention in the fasta header and then removes it along with the following line, then rename the sequence header with the name of the renamed file
sed '/lasmid/,+1 d' Renamed_wo_plasmid/tmp.$new | awk -v A=$nname '/^>/{print ">"A"_" ++i; next}{print}' > Renamed_wo_plasmid/$new;
rm Renamed_wo_plasmid/tmp.$new;
done
However, let’s not forget about the plasmids completely, we will have a look at them later. For now just run the following script to have only the plasmids in a specific directory.
mkdir Renamed_plasmid
a=1;
for i in Original/*.fna;
do old=$(printf $i);
new=$(printf "HP-%04d.p.fna" "$a");
nname=$(printf "HPp-%04d" "$a");
echo -e $old ' \t ' $new >> Old.and.new.plasmid.tab;
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $i > Renamed_plasmid/tmp.$new; let a=a+1;
grep -A 1 "lasmid" Renamed_plasmid/tmp.$new | awk -v A=$nname '/^>/{print ">"A"_" ++i; next}{print}' > Renamed_plasmid/$new;
gcount=$(grep -c ">" Renamed_plasmid/$new);
if [ $gcount == 0 ];
then rm Renamed_plasmid/$new;
fi;
rm Renamed_plasmid/tmp.$new;
done
Once this is done, we can run the following commands to extract the available metadata from the GenBank files (gbff) present in the rsync’d directory. We move the gbff files out of the original directoy then use the following scrapper:
mkdir Genbank.original
mv ./NCBI/*gbff ./Genbank.original
for i in *gbff; do printf "%s\t %s\t %s\t %s\t %s\t %s\n" "$i" "$(grep "TITLE" $i | head -n 1)" "$(grep "AUTHORS" $i | head -n 1)" "$(grep "BioProject" $i | head -n 1)" "$(grep "country" $i | head -n 1)" "$(grep "LOCUS" $i | head -n 1)"; done > Meta.data.tab
The Meta.data.tab tabular file can then be merged with the “Old.and.new.wo.plasmid.tab” and we have a table with the original gbff file name, the new name and some of the metadata when available. As mentioned in the box above I had a look at the literature and tried to fill in as much information about the contextual data of the genomes as I could find.
All table operations were performed in excel and exported as a .csv file (available on my GitHub page), which we will use in later analyses.
We have now our raw fasta and metadata files ready to be processed in the next blog post.