
Download NCBI files and store relevant information in local MongoDB database




Collect, store, and retrieve records from NCBI using just the GI number. The program uses NCBI's E-Utilities interface to fetch records and MongoDB to store the most relevant information locally. Please check that the dependencies are installed locally before running.

The program connects to NCBI, downloads the file corresponding to each GI number provided, and extracts the following fields for storage in a MongoDB database: GI, accession, sequence, version, locus, organism, sequence length, gene, protein ID, translation.
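As a rough illustration of the download step, the sketch below builds an E-Utilities EFetch request URL for a list of GI numbers and shows the shape of a stored document. It is written in Python for illustration only and is not the Perl code in tango.pl. The function names, the default database (nucleotide), and the default record type (gb) are assumptions; the field names are those the program reports when updating a document.

```python
from urllib.parse import urlencode

# Public EFetch endpoint of NCBI's E-Utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(ids, db="nucleotide", rettype="gb"):
    """Return an EFetch URL requesting the given IDs as plain-text records.

    The defaults for db and rettype are assumptions, not tango.pl's defaults.
    """
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),  # EFetch accepts comma-separated IDs
        "rettype": rettype,
        "retmode": "text",
    }
    # Keep commas readable instead of percent-encoding them.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


def empty_document(gi):
    """Return a document skeleton with the fields listed above (hypothetical helper)."""
    fields = ["accession", "sequence", "version", "locus", "organism",
              "seqLength", "gene", "proteinID", "translation"]
    doc = {"_id": str(gi)}
    doc.update({name: None for name in fields})
    return doc


url = build_efetch_url([74960989, 4165050])
print(url)
```

No network request is made here; the URL could be passed to any HTTP client, and the parsed record would fill in the skeleton before insertion.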

This creates a local database that can be accessed downstream for many applications. Documents can be inserted, updated, read, and removed, so you can shape the database to your needs.


-ids            ID(s)
-file           File with ID(s) [CSV or TXT]
-db             Database (nucleotide, protein, etc.) [optional]
-type           Record format (gb, fasta, etc.) [optional]
-force          Force download? [optional]
-mongo          MongoDB database name
-collection     Collection name in MongoDB database
-insert         Insert into database [optional/default]
-update         Update database
-read           Read from database
-remove         Remove from database
-help           Shows help message
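
The -file option takes a CSV or TXT file of IDs. The exact layout tango.pl expects is not documented here, so the Python sketch below assumes IDs are separated by commas, semicolons, or whitespace, and that non-numeric tokens (such as a header) can be skipped. The function name parse_ids is hypothetical.

```python
import re


def parse_ids(text):
    """Extract numeric ID tokens from CSV or plain-text content.

    Assumes separators are commas, semicolons, or whitespace; duplicates
    are dropped while the original order is kept.
    """
    tokens = re.split(r"[,;\s]+", text.strip())
    ids = []
    for token in tokens:
        if token.isdigit() and token not in ids:
            ids.append(token)
    return ids


sample = "gi\n74960989,4165050\n74960989\n"
print(parse_ids(sample))
```

Dropping duplicates up front avoids requesting the same record from NCBI twice.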

Database Operations


To insert new data (documents) in the database, provide the GI number(s). Insertion is the default operation, so the -insert flag is optional. The following commands all insert documents:

tango.pl -file gis.csv
tango.pl -file gis.csv -insert
tango.pl -ids 74960989 4165050 -insert


To update data (documents) stored in the database, provide the -update flag followed by a query in field:value format that identifies the document. You will then be asked which field of that document to update.

The following looks for the document with _id field matching 34577062.

tango.pl -update _id:34577062

It will then tell you which document you are about to update and ask which field you would like to update.

UPDATING _id record [34577062] in database...
Available fields are:   _id accession sequence version locus organism seqLength gene proteinID translation

What field do you want? sequence
What is the NEW value for sequence field? NEWSEQUENCE
Document 34577062 updated, sequence field changed to NEWSEQUENCE.
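
One way to picture the update session above is as a filter plus a $set modifier, which is the shape MongoDB update operations take. The Python sketch below only builds those two dictionaries; it is an illustration, not the Perl code in tango.pl, and the function names are hypothetical. Values are kept as strings because the stored type of _id is not stated here.

```python
def parse_query(query):
    """Split a 'field:value' query on its first colon into (field, value)."""
    field, sep, value = query.partition(":")
    if not sep or not field or not value:
        raise ValueError(f"expected field:value, got {query!r}")
    return field, value


def build_update(query, field, new_value):
    """Return (filter, update) dictionaries for a single-field update."""
    key, value = parse_query(query)
    # $set changes only the named field and leaves the rest of the document intact.
    return {key: value}, {"$set": {field: new_value}}


flt, upd = build_update("_id:34577062", "sequence", "NEWSEQUENCE")
print(flt, upd)
```

With a Python driver such as pymongo, the pair could be passed to collection.update_one(flt, upd).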


To read data (documents) stored in the database, provide the -read flag followed by one or more queries in field:value format. You will be asked which field of each matching document to report.

The following reads documents with _id fields matching 34577062 and 74960989.

tango.pl -read _id:34577062 _id:74960989


To remove data (documents) stored in the database, provide the -remove flag followed by one or more queries in field:value format identifying the documents to remove.

The following removes documents with _id fields matching 34577062 and 74960989.

tango.pl -remove _id:34577062 _id:74960989
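
The read and remove examples above each take several field:value queries. A minimal Python sketch of how such a list could become a single MongoDB filter is shown below, assuming the queries are alternatives (any match counts). Whether tango.pl combines them this way or runs them one at a time is not stated, and the function name build_filter is hypothetical.

```python
def build_filter(queries):
    """Combine 'field:value' queries into one MongoDB filter using $or."""
    clauses = []
    for query in queries:
        field, sep, value = query.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"expected field:value, got {query!r}")
        clauses.append({field: value})  # values stay strings (assumption)
    if not clauses:
        raise ValueError("at least one query is required")
    # A single query needs no $or wrapper.
    return clauses[0] if len(clauses) == 1 else {"$or": clauses}


flt = build_filter(["_id:34577062", "_id:74960989"])
print(flt)
```

With pymongo, the same filter could be given to collection.find(flt) for a read or collection.delete_many(flt) for a removal.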


You need to have the following installed:

  1. BioPerl
    • Eutilities
    • GenBank
    • SeqFeatureI
  2. MongoDB with MongoDB Perl Driver
