I recently (re)discovered the edirect software package published on NCBI. As described on the website this “software suite that enables users to use the UNIX command line to directly access NCBI databases, as well as to parse and format the data to create customized downloads“.
How does it work? Basically like the “Entrez” research system on the NCBI website but with some extra feature. For example, I have a protein accession number and I want to know to what it correspond. I will use the command as follow:
esearch -db protein -query "AAC43379"
And it will output the following:
<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>NCID_1_3752066_130.14.22.215_9001_1469541026_407945051_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>1</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
But then if coupled with the efetch script like this :
esearch -db protein -query "AAC43379" | efetch -format gb
This will output on the terminal the genbank format of the protein as you see it on the NCBI page.
What will be printed on the terminal (Example)
LOCUS AAC43379 347 aa linear BCT 23-JUN-1995 DEFINITION RecA [Helicobacter pylori]. ACCESSION AAC43379 VERSION AAC43379.1 GI:687249 DBSOURCE locus HPU13756 accession U13756.1 KEYWORDS . SOURCE Helicobacter pylori ORGANISM Helicobacter pylori Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter. REFERENCE 1 (residues 1 to 347) AUTHORS Thompson,S.A. and Blaser,M.J. TITLE Isolation of the Helicobacter pylori recA gene and involvement of the recA region in resistance to low pH JOURNAL Infect. Immun. 63 (6), 2185-2193 (1995) PUBMED 7768597 REFERENCE 2 (residues 1 to 347) AUTHORS Thompson,S.A. TITLE Direct Submission JOURNAL Submitted (17-AUG-1994) Stuart A. Thompson, Division of Infectious Diseases, Vanderbilt University School of Medicine, A3310 Medical Center North, Nashville, TN 37232, USA COMMENT Method: conceptual translation. FEATURES Location/Qualifiers source 1..347 /organism="Helicobacter pylori" /strain="84-183 (ATCC 53726)" /db_xref="taxon:210" /clone="pSAT105" /clone_lib="lambda ZAP II" Protein 1..347 /product="RecA" Region 1..347 /region_name="recA" /note="recombinase A; Provisional; PRK09354" /db_xref="CDD:236476" Region 6..331 /region_name="recA" /note="RecA is a bacterial enzyme which has roles in homologous recombination, DNA repair, and the induction of the SOS response. RecA couples ATP hydrolysis to DNA strand exchange; cd00983" /db_xref="CDD:238483" Site order(14..15,17..18,21,25..29,98..99,101..107,112..113, 215,219,267,313) /site_type="other" /note="hexamer interface [polypeptide binding]" /db_xref="CDD:238483" Site 67..74 /site_type="other" /note="Walker A motif" /db_xref="CDD:238483" Site order(69..75,97,101,104,145,195,229,242,264..267) /site_type="other" /note="ATP binding site [chemical binding]" /db_xref="CDD:238483" Site 141..145 /site_type="other" /note="Walker B motif" /db_xref="CDD:238483" CDS 1..347 /gene="recA" /coded_by="U13756.1:350..1393" /transl_table=11 ORIGIN 1 maidedkqka islaikqidk vfgkgalvrl gdkqvekida istgslgldl algiggvpkg 61 riieiygpes sgkttlslhi iaecqknggv cafidaehal dvyyakrlgv dtenllvsqp 121 stgeealeil etitrsggid lvvvdsvaal tpkaeidgdm gdqhvglqar lmshalrkit 181 gvlhkmnttl ifinqirmki gmtgygspet ttggnalkfy asvridirri aalkqneqhi 241 gnrakakvvk nkvappfrea efdimfgegi skegeiidyg vkldivdksg awlsyqdkkl 301 gqgrenakal lkedkalane itlkikesig sneeimplpd epleeme //
And from this, giving a good NCBI page the possibilities are endless!
This tool actually just made my day. I was working with a huge de novo transcriptome assembly composed of several thousands of contigs. One important step of the analysis is to know which contig code fore which gene. To have a quick and dirty idea of the annotation on function and taxonomy level, I checked if the contig blasted against something on the data base.
I just used diamond software against the ncbi non redundant (nr) database and outputted a tabular file looking like this :
TRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 62.7 220 75 3 799 1455 320 533 3.7e-73 284.6
TRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 48.5 206 94 6 52 651 1 200 5.5e-45 191.0
TRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 64.2 193 68 1 880 1455 357 549 9.0e-72 280.0
TRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 52.0 198 78 6 52 633 1 185 3.4e-47 198.4
TRINITY_DN177_c0_g1_i1 gi|918288237|gb|KOF67725.1| 50.8 264 88 4 805 1476 239 500 1.6e-65 259.2
TRINITY_DN177_c0_g1_i1 gi|443722686|gb|ELU11446.1| 54.0 213 89 6 814 1434 180 389 1.8e-56 229.2
TRINITY_DN177_c0_g1_i1 gi|926634111|ref|XP_013782890.1| 48.1 216 107 3 829 1464 193 407 1.5e-55 226.1
TRINITY_DN184_c0_g1_i1 gi|762125178|ref|XP_011445607.1| 92.4 92 7 0 281 6 1 92 6.8e-40 171.4
TRINITY_DN184_c0_g1_i1 gi|919047297|ref|XP_013406842.1| 87.1 93 12 0 281 3 1 93 1.1e-37 164.1
TRINITY_DN184_c0_g1_i1 gi|919047103|ref|XP_013406743.1| 87.8 90 11 0 281 12 1 90 2.7e-36 159.5
Then in order to match the Accession number to a protein product, I did the following command line jujitsu:
awk -F "|" '{print $4}' Diamond.result.tab > temp ##Extract the accession number in a temporary file
a=$(<temp) # put the accession name in a variable
for i in $a;
do esearch -db protein -query $i | efetch -format gb > source/$i.gbk;
cat source/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'| sed -n 's:.*DEFINITION\(.*\)\b].*\b:'$i'\t\1:p' | sed 's/[[].*$//' >> result/d.$SGE_TASK_ID.result; # extract definition name of the protein without the specie name between brackets and add it after the Accession number and a tabulation
cat source/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'| sed -n 's:.*ORGANISM\(.*\)\b[A-Z].*\b:\1:p' | sed 's/[.].*$//' | awk -F " " '{for (i=2; i<=(NF-1); i++) printf $i " "; print $NF}' >> result/t.resuĺt; # extract taxonomy name of the protein in one line
done
paste result/d.result result/t.result > annotation.tab
This output two file with some basic information about my contigs which can be used to have an idea of what is inside my transcriptome. The final file looks like this :
XP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria
XP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria
XP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia
XP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia
KOF67725.1 hypothetical protein OCBIM_22008945mg Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Cephalopoda; Coleoidea; Neocoleoidea; Octopodiformes; Octopoda; Incirrata; Octopodidae; Octopus
ELU11446.1 hypothetical protein CAPTEDRAFT_23984, partial Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Scolecida; Capitellida; Capitellidae; Capitella
XP_013782890.1 PREDICTED: headcase protein-like Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus
Hey Adrien,
nice post on the subject. You can even go further and let the edirect tools handle the extraction of the information. Something along the following lines (the RegEx could of course be tweaked further):
[code language=”bash”]
awk -F “|” ‘{print $4}’ diamond.tab > temp
for i in $(> my_annotation.tab
done
rm temp
[/code]
Furthermore, you could think about deduplicating the input accessions or using a conditional in the for loop to circumvent downloading GenBank files with the same accession over and over again.
Best Regards
Stefan
It still seems to filter the code. The relevant line is:
esearch -db protein -query $i | efetch -format gb -mode xml | xtract -pattern GBSeq -element GBSeq_accession-version GBSeq_definition GBSeq_taxonomy | sed “s:\(.*\) \[.*\]\(.*\):\1\2:g”
Hej Stefan,
Thank you very much to pass by and leave a comment,
I will look for including a way to leave code in the comment!
I will give a try to your suggestion seems a nice option !
Cheers!
Adrien
Hello again 😉
I tweaked the script a little bit. It should now avoid additional downloads by caching information in an array (should require bash 4). I don’t know how it will perform wit a larger data set. If you have time, feel free to test it out. Replace (lt) and (gt) with the “lowerthan” and “greaterthan” signs:
[code language=”bash”]
#!/bin/bash
declare -A accversion=() # declare associative array to cache information and avoid unnecessary downloads
awk -F “|” ‘{print $4}’ diamond.tab > temp # write accession.version to temporary file
for i in $((lt)temp); do
tr_i=$(echo $i | tr ‘.’ ‘-‘) # “dot” is not allowed in variable/key names, replace it by “dash”
if [[ -z “${accversion[$tr_i]}” ]]; then
accversion[$tr_i]=$(esearch -db protein -query $i | efetch -format gb -mode xml | xtract -pattern GBSeq -element GBSeq_accession-version GBSeq_definition GBSeq_taxonomy | sed “s:\(.*\) \[.*\]\(.*\):\1\2:g”) # fetch information
echo “${accversion[$tr_i]}” (gt)(gt) my_annotation.tab
else
echo “${accversion[$tr_i]}” (gt)(gt) my_annotation.tab
fi
done
unset accversion # delete array
rm temp # delete temporary file
[/code]
Stefan