topp (CCP4: Supported Program)

NAME

topp - an automatic topological and atomic comparison program for protein structures

SYNOPSIS

topp
[Keyworded input]

top3d foo_1.pdb foo_2.pdb

topsearch foo_1.pdb

AUTHOR

Author:
Guoguang Lu
Div. of Molecular Structural Biology
Dept. of Medical Biochemistry and Biophysics,
Karolinska Institute, Stockholm, 17 177, Sweden
E-mail:
guoguang@alfa.mbb.ki.se

NOTES ON CCP4 VERSION

Note: TOPP has been renamed from the original TOP to avoid a clash with the UNIX command of that name.

TOPP can be run directly using the command topp with Keyworded input, or via the script top3d which takes two file names as arguments and program parameters from the file $CLIBD/TOP.PARM (see examples section). A search with one file against a database of structures can be done using the script topsearch which takes one file name as argument and program parameters from the file $CLIBD/SEARCH.PARM (see examples section).

Use of the browser facility to search a Protein Data Bank site requires two commands to be on the user's path, namely wget and pdbhtf. The latter is part of the CCP4 suite and should have been compiled and installed. On the other hand, wget is not part of CCP4, but is a GNU program available via internet from the usual GNU sites.

DESCRIPTION

TOP is a protein TOPological comparison program which detects structural similarities between two proteins. It superimposes two protein structures automatically without any prior knowledge of a sequence alignment. The program can be used to find out whether a newly determined protein structure is similar to any structure in the Protein Data Bank (or the Protein Data Bank at RCSB), and to rank the homologous proteins according to their topological and structural diversities (similarities). The program (version 6 or higher) can browse data directly from the Protein Data Bank or its mirror sites via the internet, so that users can search the most recent data without regularly downloading the whole database to their local disks. The program has a 3DB browser interface, so it can perform a rapid structure similarity search if users limit the search range by sequence, keywords, resolution, date or other restraints. This makes TOP convenient for modelling homologous proteins and for automatically tracking newly released structures of special interest without reading the literature.

TOP is designed to be user friendly. For example, once the program is properly set up on a UNIX computer, users can type a simple command such as top3d file1 file2, and the coordinates in file2 will be automatically superimposed onto file1. The program recognizes Protein Data Bank (PDB) entry codes. For example, if the second molecule is PDB entry 2cnd, the user can just type top3d file1 2cnd@pdb; the program will fetch the coordinates of 2cnd to the local disk and perform the comparison. To find out whether a structure in a file is similar to any structure in the PDB, type topsearch file.pdb; the program will output a list of PDB codes ranked according to 3D structural similarity. The user can then type top3d file.pdb code@pdb to superimpose the coordinates of interest onto the probe model. The program can detect sequence permutations and can be used for special purposes such as motif searching.

The program runs two steps in each structure comparison. In the first step, the topology of the secondary structures in the two proteins is compared. The program uses two points to represent each secondary structure element (alpha helix or beta strand), then systematically searches all possible superpositions of these elements between the two protein structures. Once a couple of elements in the two structures fit each other in 3D space (that is, the rms, the angle between the two lines formed by the two points, and the line-line distance are all smaller than the given values), the program searches for further secondary structure elements that fit under the same superposition operation. If the number of fitting secondary structures exceeds a given number, the program declares the two structures similar, outputs the names of the corresponding secondary structures in the two proteins, and outputs the superimposed coordinates. It also outputs a matrix with which one molecule can be rotated and translated onto the other. The program outputs a comparison score called "Topological Diversity", which takes into account both the rate of matching SSEs and the structural difference of the representing points. In database searching, this parameter can be used to rank the topological similarity of SSEs.
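The fit test for a single pair of secondary structure elements can be sketched as follows. This is only an illustration, not the program's own code: the function names and thresholds are hypothetical, both elements are assumed to be already expressed in the same (superimposed) frame, and the "line-line distance" is approximated here by the distance between the segment midpoints, which may differ from the definition TOP actually uses.

```python
import math

def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def _norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))

def sse_fit_measures(sse_a, sse_b):
    """Return (rms, angle in degrees, midpoint distance) for two SSEs.

    Each SSE is given as a pair of 3D points (start, end), following the
    two-point representation described in the text.
    """
    (a1, a2), (b1, b2) = sse_a, sse_b
    # rms over the two pairs of representing points
    rms = math.sqrt((_norm(_sub(a1, b1)) ** 2 + _norm(_sub(a2, b2)) ** 2) / 2.0)
    # angle between the two lines formed by the representing points
    dir_a, dir_b = _sub(a2, a1), _sub(b2, b1)
    cos_angle = sum(x * y for x, y in zip(dir_a, dir_b)) / (_norm(dir_a) * _norm(dir_b))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    # ASSUMPTION: midpoint separation stands in for the line-line distance
    mid_a = tuple((p + q) / 2.0 for p, q in zip(a1, a2))
    mid_b = tuple((p + q) / 2.0 for p, q in zip(b1, b2))
    distance = _norm(_sub(mid_a, mid_b))
    return rms, angle, distance

def sse_pair_fits(sse_a, sse_b, max_rms, max_angle, max_distance):
    """True when all three measures are below the (hypothetical) thresholds."""
    rms, angle, distance = sse_fit_measures(sse_a, sse_b)
    return rms < max_rms and angle < max_angle and distance < max_distance
```

For instance, two parallel elements displaced by 1 Å pass a test with thresholds of 2 Å, 30 degrees and 3 Å, while two perpendicular elements fail on the angle criterion.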

When Ca atoms are available, the program can run the second step, which finds an alignment based on the Ca atoms of all residues starting from the initial comparison matrix, and then improves the comparison matrix based on the superposition of the newly aligned Ca atoms. The procedure is iterated until the number of matching residues converges. The program is able to handle sequence permutation in the superpositions. From both the r.m.s. deviation and the number of matching residues, the program calculates a score called "Structure Diversity", which can be used to rank the structural differences of homologous proteins.

Use of a SSE database

The optimized way of database searching in TOP is to use a library of Secondary Structure Elements (SSEs). This can be created from a set of PDB files with the command MAKEVEC (see below).

The compact SSE library is automatically updated at the Karolinska Institute every week. It includes not only the currently released structures in the Protein Data Bank, but also compact SSE databases of independent families, super-families and structures classified in the SCOP database, for efficient similarity searching. It can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . After you fetch this TAR file by FTP and save it to your local disk as, for example, /dir/sndlib.tar.Z, use the following commands:


cd $TOPHOME
zcat /dir/sndlib.tar.Z | tar -xvf -

This gives you the most recently updated SSE databases.

Keyworded Input

The parameters of the TOP program can be controlled by different lines of text, each of which is a "keyword command". Any command line which starts with "!" will be ignored. Available keywords are:

Keywords for location of protein coordinates

The TOP program can compare two structures or search similarities in database by comparing one structure with a group of other structures. The MOL1 command specifies the data location of one molecule (called Molecule 1) while the commands MOL2, LIBDIR, MOLVEC, PDBSITE or WEBSITE specify the data location of another molecule (called Molecule 2) or the other molecules (called database).

TOP can read 3D coordinates of protein structures in "Brookhaven" (PDB) format from the user's local disk, from CD ROM or via the internet. For structure similarity searching, there are many ways to read the data. The recommended setup is to search a secondary structure element (SSE) library which is updated automatically (see automatic updating of SSE library and MOLVEC). In this way the program can search the most recent database using the compact SSE library, and fetch the detailed coordinates of only those structures which are found to be similar to molecule 1. This is considerably faster and requires no regular database maintenance after setup.

    MOL1 coordinate_file_name [zone]
    Example: MOL1 /nfs/disk1/guoguang/examples/test2.pdb
    The coordinate file name of molecule 1, the structure for which similarities are searched. The coordinate file must be in PDB (Brookhaven) format.

    If you don't have the coordinates on your local disk and wish to read them directly from a Web site by giving a PDB entry code, give a file name of the form code@pdb in this command, for example: MOL1 2cnd@pdb. The program will use the code to fetch the coordinates from a PDB mirror site or another web site, whose URL address is specified in the PDBSITE or WEBSITE commands.

    MOL2 Coordinate_file_name or @List_file_name or @URL_address [zone]
    This command controls whether users wish to compare two structures or do a similarity search in Protein Data Bank. If the filename is something like 2cnd.pdb or 2cnd@pdb, the program will just superimpose two structures and give sequence comparisons.

    If the second text string in the command starts with @ and the remaining text does not start with http: or ftp:, the remaining text is taken as the name of a List_file, which lists the names of a number of coordinate files, such as:

    
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
    /nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
    ....
    
    
    This can be used for searching structure similarities in PROTEIN DATA BANK.

    If the command PDBSITE or WEBSITE is given before this command, or the LIBDIR command specifies a directory name which contains "current_release", the List_file can be a list of PDB entry codes, such as

    
    200d			|		|	pdb200d.ent
    200l			|	or 	|	pdb200l.ent
    300d			|		|	pdb300d.ent
    .... 			|		|	....
    
    
    The program will then fetch the coordinates of these PDB entries from a web site, local disk or CDs.

    This list of PDB codes can be obtained from the "3DB browser" in the Protein Data Bank or from other bioinformatics tools outside the program. This makes it possible for TOP to search a specific group of structures for a special purpose.

    LIBDIR directory_name
    If the program is searching a number of coordinate files (see MOL2) and those files are all in the same directory, the user can indicate in which directory the coordinate files are located. For example, if users have pdb200d.ent pdb3001.ent ... in the /nfs/pdb/all_entries/uncompressed_files/ directory, they can use the UNIX command ls -1 /nfs/pdb/all_entries/uncompressed_files/ > allpdb.lis. This file will look something like

    
    pdb100d.ent
    pdb101d.ent
    pdb101m.ent
    pdb102d.ent
    pdb102l.ent
    ...
    
    
    then use
    
    libdir /nfs/pdb/all_entries/uncompressed_files/
    mol2 @allpdb.lis
    
    
    The program will then compare all the files in the directory /nfs/pdb/all_entries/uncompressed_files/ whose names appear in allpdb.lis, and list those which are similar to the structure specified in the MOL1 command.

    Alternatively, one can use UNIX command

    
    find /directory_name/ -name "*.ent" -print > pdball.lis 
    
    
    instead of the ls command. The LIBDIR command is not necessary in this case, because find writes full path names to the list. This is usually used when users have the whole Protein Data Bank on their local disk or CD ROM.

    If the directory name in the LIBDIR command contains the substring ".../current_release/uncompressed_files", the program will assume this directory is organised like the "current_release" directory in the Protein Data Bank, i.e. PDB entries are distributed under subdirectories whose names correspond to the 2 middle characters of the PDB ID code, e.g.

    
    ...pub/pdb_data/current_release/uncompressed_files/00
    ...pub/pdb_data/current_release/uncompressed_files/zy
    
    
    and program will assume each line in List_file is a PDB entry code such as
    
    100d				pdb100d.ent
    100e		or 		pdb100e.ent
    .....                           .... 
    
    
    Please note that the local PDB copy must contain the coordinates of all structures whose ID codes are listed in the file.
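As an illustration of the "current_release" layout described above, the following sketch maps a PDB entry code to the file path it implies. The function name is hypothetical, and the pdbXXXX.ent naming is taken from the listings in this document; it is not the program's own code.

```python
import posixpath

def entry_path(libdir, code):
    """Build the path of a PDB entry in a 'current_release'-style tree.

    The subdirectory is named after the two middle characters of the
    four-character PDB ID code, e.g. 200d -> 00/pdb200d.ent.
    """
    code = code.strip().lower()
    if len(code) != 4:
        raise ValueError("a PDB ID code has four characters: %r" % code)
    return posixpath.join(libdir, code[1:3], "pdb" + code + ".ent")
```

With the directory used in the MOL2 example, entry_path("/nfs/protein/pdb/current_release/uncompressed_files", "200d") gives the first path of that listing.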

    If the text after the first character "@" starts with "http:", the program will assume there is a 3DB browser at this URL address and try to get a list of the currently released entries. (This form is not necessary if the PDBSITE command is present.)

    If the text after the first character "@" starts with "ftp:", the program will list all the files under the directory. This can be used for an anonymous FTP site in which one directory contains all the coordinate entries (such as the old PDB directory .../all_release/compressed_files/*.pdb). In this form, however, all the PDB files must be in one directory, not distributed over sub-directories.
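The different forms of the MOL2 argument described above can be summarised in a short sketch. The function and the returned labels are hypothetical, and the test is case-sensitive here, which the program itself may not be.

```python
def classify_mol2(argument):
    """Classify a MOL2 argument according to the rules in the text."""
    if not argument.startswith("@"):
        # a plain file name such as 2cnd.pdb, or an entry code such as 2cnd@pdb
        return "single structure"
    rest = argument[1:]
    if rest.startswith("http:"):
        return "entry list from a 3DB browser"
    if rest.startswith("ftp:"):
        return "all files in an FTP directory"
    return "list file"
```

For example, classify_mol2("@allpdb.lis") is treated as a list file, while classify_mol2("2cnd@pdb") is a single structure to be superimposed on molecule 1.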

    PDBsite URL_address
    [default: http://www.pdb.bnl.gov]
    This command specifies the URL address of one of the official mirror sites of the Protein Data Bank. Given a "recognized mirror site", the program can browse the most recent data in the PDB. A collection of URL addresses which have been tested with the program is listed at http://gamma.mbb.ki.se/~guoguang/webtop/pdb_url_collect.html. For efficient and fast data browsing, users should choose a site in or close to their own country.

    If this command is given, the WEBSITE and LIBDIR commands need not be present.

    WEBsite URL_address (or SITE or SERVER)
    Sometimes users prefer to read data from a Web site other than a standard PDB site (for example, a laboratory on the same campus or in the same city). In that case, WEBsite can be used instead of PDBSITE, for example:

    
    WEBSITE http://pdb.pdb.bnl.gov/	or http://www.rcsb.org/pdb/ 
    WEBSITE  ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files
    
    WEBSITE  ftp://gamma.mbb.ki.se/pub/pdb/current_release/uncompressed_files
    
    
    This command indicates the URL address of a Web server. If the address is given correctly, the program is able to browse coordinates from any site which provides Protein Data Bank data by either HTTP or FTP service, in compressed or uncompressed form. Each issue of the Protein Data Bank Quarterly Newsletter contains a list of laboratories which might provide this service (most likely in the form of an FTP server). A current collection of the URL addresses of these sites is listed at http://gamma.mbb.ki.se/~guoguang/webtop/url_collect.html

    If it is an FTP site and the directory name contains the sub-string "current_release", the program can automatically find the PDB entries in sub-directories. Otherwise, it will assume all the files are in the single directory given as the argument of this command.

    MOLVEC vector_file_name
    Instead of reading all the PDB files in the PROTEIN DATA BANK, the TOP program can use a compact database, which is a library of the secondary structures of each protein. This command indicates the file name of the database, so that the program can perform topological comparisons based on secondary structures. If the WEBSITE or LIBDIR commands are also present, the program will first perform a rapid topological search in the compact database. Once a similar structure with a PDB entry code is found in the database, TOP will fetch the PDB file from the internet or local disk and perform the comparison based on Ca atoms. If users repeatedly use the database searching function, this command is the fast and efficient way, because it saves the time otherwise spent repeatedly fetching files and assigning secondary structures.

    The MAKEVEC command can be used to update the compact database in order to follow the most recent changes in the Protein Data Bank. The updated database can also be obtained via the Web (see example 3).

    MAKEVEC output_database_filename pdb_list_file_name [format]
    If this command is present, the SSE library mentioned above is made. The program can read coordinates either from local disk/CD, as specified by LIBDIR, or via the internet, as specified by PDBSITE or WEBSITE. The first argument of this command is the name of the output SSE library file. The second argument is the name of a List_file (as in the MOL2 command), which can contain either a list of file names or PDB entry codes. If the third (format) argument is ZONE or SCOP, the program will assume that the second column in the pdb_list_file specifies the residue range (see ZONE1 and ZONE2 keywords), while the first column specifies the PDB code or file name of the structure.

    
    example:
    MAKEVEC sndnew.vec pdb.list
    
    
    If you have the PROTEIN DATA BANK on disk, the TOP program can make a compact database file which allows those who do not have the Protein Data Bank on disk to perform similarity searching. The pdb_list_file_name contains something like
    
    101l.pdb
    102l.pdb
    103l.pdb
    104l.pdb
    ....
    
    
    Using this list together with the LIBDIR command, one can make a compact SSE library, sndnew.vec.
    
    example: 
    PDBSITE http://www2.ebi.ac.uk
    MAKEVEC sndnew.vec
    
    example:
    MAKEVEC snd.vec ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files/
    
    
    
    If the second argument starts with "ftp://" and ends with "/", the program will check which PDB files are contained in that FTP directory and browse all the coordinates in that directory. In this case the files must all be in that one directory, not in sub-directories. If the second argument starts with "http://", the program will request a 3DB server at the URL address to provide a list of all the entries in the PDB.
    
    example:
    MAKEVEC snd.vec ftp://gamma.mbb.ki.se/pub/guoguang/scop_family.lis scop
    
    
    If the second argument starts with "ftp://" or "http://" and ends with a file name, the program will assume the URL address is a file which contains the PDB list. This example shows how to get an updated list for the SCOP database, which contains the PDB code and residue range of a representative structure in each family or super-family. The format of the TOP/SCOP list is the following:
    
    3sdh             a:  1.001.001.001.001.001   d3sdha_
    1phn             a:  1.001.001.001.002.001   d1phna_
    1grj           2-79  1.001.002.001.001.001   d1grj_1
    ....
    
    example: makevec.com
    # for PDB on local disk
    $LUEXE/top << 'end-top'
    LIBDIR /nfs/protein/pdb/current_release/
    MAKEVEC sndlib.vec pdblist.txt
    'end-top'
    #
    
    
    The file pdblist.txt could be made in this way:
    
    cd /nfs/pdb/full/
    ls -1 *.pdb > /nfs/ylgs/guoguang/pdblist.txt
    
    
    If LIBDIR is replaced by PDBSITE, the program will read updated data from the PDB via the web.

    In fact, the keywords 3DBBEFore and 3DBAFTer together with MAKEVEC make it possible to automatically build an SSE library of newly released structures, which can be appended to the old ones. This should be very quick.

Keyword Input for structure comparisons

    MATCH RATE rat1 rat2
    
    example: MATCH RATE 0.35 0.8
    	 MATCH auto 		[DEFAULT]
    	 MATCH 5
    
    
    If RATE appears as a subcommand, the program reads two more parameters, RAT1 and RAT2.

    RAT1 is the minimum matching rate of secondary structures. The program takes the smaller of the two numbers of secondary structures (comparing mode), or the number of secondary structures in mol1 (searching mode), and multiplies it by rat1. If the number of matching secondary structures in the two compared proteins reaches this value, the program considers the two structures similar. For example, if mol1 has 12 secondary structures, mol2 has 10, and rat1 is 0.5, the program considers the two structures similar when 5 secondary structures match each other in comparing mode (or 6 in searching mode).
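The arithmetic of the RAT1 example can be written out as a small sketch. The function name is hypothetical, and rounding up to the next integer is an assumption: the text does not say how a fractional product is treated.

```python
import math

def required_matches(n_sse_mol1, n_sse_mol2, rat1, searching=False):
    """Number of matching SSEs needed for two structures to count as similar.

    Comparing mode uses the smaller SSE count of the two molecules,
    searching mode uses the SSE count of mol1.
    """
    base = n_sse_mol1 if searching else min(n_sse_mol1, n_sse_mol2)
    # ASSUMPTION: fractional results are rounded up
    return math.ceil(base * rat1)
```

With 12 and 10 secondary structures and rat1 = 0.5, this gives 5 in comparing mode and 6 in searching mode, as in the example above.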

    AUTO is equivalent to RATE 0.35 0.8

    Alternatively, users can give this number directly, by estimating before running the program the minimum number of secondary structures that can match each other. It has to be lower than the real number. If the number is overestimated, the program will fail to superimpose two similar structures. Underestimating is usually OK. However, if the user gives too low a value (for example 3), the program might superimpose motifs instead of overall structures. This might give many possible superpositions, many of which are of no real interest to the user. In database searching, a severely underestimated value can also slow the search down unnecessarily.

    If users have no idea how to set this parameter, they can start either with 5 or with 30%-50% of the number of secondary structures in molecule 1 (using RATE). This is successful in 95% of cases. If the comparison fails, look at the Hint section to see how to fix the problem.

    RESIDUE lstres
    LSTRES is the minimum number of residues in a consecutive fragment of protein. Default is 3. If lstres is smaller than or equal to 0, the program only compares the structures based on SSEs. In this case, no superimposed coordinates will be output. If lstres is larger than 0, the program will improve the comparison based on Ca atoms. When all Ca atoms in a fragment of at least LSTRES (usually 3) consecutive residues in one protein are closest to a fragment in the other protein, and all the distances are smaller than DSTMIN, all the Ca atoms in these two corresponding fragments will be included in the superposition calculations. The rms and sequence comparison will be presented.

    DISTance dstmin
    [Default 3.8]
    This value represents the maximum distance between the Ca atoms of matched residues (see RESIDUE). If dstmin is more than 3.0, the exact value is not very important, because of the rule that the Ca atoms of matched residues must be closest to each other. A value between 3 and 7 usually does not change which residues are matched in the comparison.
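The interplay of RESIDUE and DISTANCE can be illustrated with the sketch below, which selects matched Ca pairs from two already superimposed structures. It is a simplified reading of the rules above, not the program's algorithm: "closest to each other" is taken to mean mutual nearest neighbours, a fragment is taken to be a run in which both residue indices increase by one, and the function name is hypothetical. It needs Python 3.8 or later for math.dist.

```python
import math

def matched_residues(ca1, ca2, dstmin=3.8, lstres=3):
    """Return index pairs (i, j) of matched Ca atoms.

    ca1, ca2: lists of (x, y, z) Ca coordinates, already superimposed.
    """
    def nearest(point, points):
        return min(range(len(points)), key=lambda k: math.dist(point, points[k]))

    # Step 1: mutual nearest neighbours that are closer than dstmin
    pairs = []
    for i, point in enumerate(ca1):
        j = nearest(point, ca2)
        if nearest(ca2[j], ca1) == i and math.dist(point, ca2[j]) < dstmin:
            pairs.append((i, j))

    # Step 2: keep only runs of at least lstres consecutive residue pairs
    kept, run = [], []
    for pair in pairs:
        if run and pair[0] == run[-1][0] + 1 and pair[1] == run[-1][1] + 1:
            run.append(pair)
        else:
            if len(run) >= lstres:
                kept.extend(run)
            run = [pair]
    if len(run) >= lstres:
        kept.extend(run)
    return kept
```

In this sketch, a single badly fitting residue in the middle of a five-residue stretch leaves two fragments of two residues each, which are both rejected with the default lstres of 3.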

    WRITE
    If this statement is present and Ca comparison is carried out, the program will write out superimposed coordinates from Mol2 to Mol1. The file name will be something like mol2_mol1.xxx. For example if name of mol1 is sfv.pdb, name of mol2 is sin.pdb, the output name will be sin_sfv.pdb

    APPEnd yes/no
    If the input is yes and there are no secondary structure assignments in the input coordinates file, the program will append the assignment at the end of the coordinate file. [Default: NO]

    HLXRMS hlxrms
    If rms between an alpha helix and standard helix is higher than this value, this helix will not be used for the comparisons.

    BTARMS btahlx
    If rms between a beta strand and a straight line formed by the two representing points is higher than this value, this strand will not be used for comparisons.

    ERRRMSA errrmsa
    If rms value of certain helix or sheet is higher than this value, this helix or sheet is not considered to be similar.

    ERRANG errang_alpha, errang_beta
    If the direction difference of a certain helix in the two structures is higher than errang_alpha, this helix is not considered to be similar.
    If the direction difference of a certain sheet in the two structures is higher than errang_beta, this sheet is not considered to be similar.

    ERRDLL errdll_alpha, errdll_beta
    If the line-line distance of a certain helix in the two structures is higher than errdll_alpha, this helix is not considered to be similar.
    If the line-line distance of a certain strand in the two structures is higher than errdll_beta, this strand is not considered to be similar.

    MAXANG angmax
    When expanding the search for similar secondary structures, if the maximum direction difference exceeds this angle, the last expansion is rejected.

    MAXDLL dllmax
    When expanding the search for similar secondary structures, if the maximum line-line distance exceeds this number, the last expansion is rejected.

    SINGLE/NOSIngle (or MULTiple)
    If the SINGLE statement appears, comparison is only carried out on one polypeptide chain. If NOSINGLE or MULTIPLE appears, the program can compare protein structures with multiple chains.

    FAST
    When this option is chosen, if a helix disturbs the match of a beta strand, the program will delete the first helix and re-search for the match.

    DIRWEIGHT dirweight
    Weight of the direction in the refinement.

    REFWEIGHT refwalpha, refwbeta
    In the least squares refinement, the weight of alpha helix and beta strand.

    SND1 Yes/No [CA]
    If the input is yes, the program will not read the secondary structure assignment in the coordinate file of Mol1, but will assign it itself using an algorithm defined by Smith/Laskowski (SECSTR program from PROCHECK). If the input is no, the program will first try to use the secondary structure assigned in the coordinate file. If this does not exist or does not work, the program will make the assignment itself. If CA is present in the second input column after the keyword, the program will assign the secondary structures based only on Ca atoms.

    SND2 Yes/No [CA]
    same as SND1 but for Mol2

    AMPLify ampl ampltop [default: 1.5 2.0]
    ampl is the amplification order for structural diversity
    ampltop is the amplification order for topological diversity
    The values of Structural Diversity and Topological Diversity are used in TOP to describe the structural difference between the two compared structures, based on both the r.m.s. deviation and the number of matched residues or SSEs. The "amplification order" controls the influence of the number of matched residues or SSEs (see the conventions for more details).

Keywords for 3DB interface

    TOP has an interface with the 3DB browser (developed by Dr. Jaime Prilusky and colleagues). This connection enables TOP to perform rapid similarity searching by defining a search range with sequence homology, keywords, date and/or other constraints, so that users can save a lot of time otherwise spent on interactive operations on the database. The following provides a simple description of the 3DB connections from TOP. A detailed description of the 3DB parameters can be found in the 3DB Browser's Help file. Now that the Protein Data Bank resides at RCSB, the browser is called SearchLite and/or SearchFields. A description of the latter can be found in the PDB SearchFields Help.

    3DBSITE Site_name
    Example: 3dbsite http://www.pdb.bnl.gov
    If users wish to read data from their local disk/CD or a nearby Web site, but use the 3DB browser to choose the search range, they can use LIBDIR or WEBsite to specify the location of the coordinates and use this command to specify the URL address of the 3DB server. The URL address of the 3DB server must be one of the mirror sites of the Protein Data Bank. This 3DB server site name does not have to be the same as in the WEBSITE command. The program can obtain the PDB entry list from the 3DB server and fetch the coordinates from another URL address. If the WEBSITE, LIBDIR and PDBSITE commands are not given, the program will use this 3DB address for fetching coordinates. If this address is not given, the default server is at BNL. However, I strongly recommend choosing a PDB mirror site close to the user's local lab.

    3DBKEYword word1 word2

    
    	example: 3DBKEYWORD FAD + FMN + FLAVIN
    		 3DBKEYWORD NITRATE REDUCTASE
    		 3DBKEYWORD FAD .or. FMN .or. FLAVIN
    
    
    Equivalent to the "Keyword" column in 3DB. If this command appears, the TOP program only searches those structures with the words appearing in the HEADER, TITLE, KEYWDS and COMPND fields. If two keywords are separated by a space, the relation between them is "AND". If they are separated by ".or." or "+", the relation between them is "OR".

    3DBTEXT Word

    
    		example:  3DBTEXT FAD + FMN + FLAVIN
    			  3DBTEXT REDUCTASE
    
    
    Equivalent to the "Text query" column in 3DB. If this command appears, the TOP program only searches those structures with the Word in the complete PDB text. If two keywords are separated by a space, the relation between them is "AND". If they are separated by "+" or ".or.", the relation between them is "OR".

    3DBSEQ (or 3DBFASTA) cutoff sequence (or cutoff @seq_file_name)

    
    Example: 3DBSEQ 0.02 GXGXTGGTX
         or	 3DBSEQ 0.02 @zm.seq
    
    
    Equivalent to the "FASTA" column in 3DB. If this command appears, the TOP program will request the 3DB server running the FASTA program to provide a list of structures with homology to the given sequence. It then searches for structural similarity only among those structures, and outputs superimposed coordinates if the WRITE command is present. The sequence must be in 1-letter code. It must be given either on 1 line or in a file, as in the following example:
    
    SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
    NCGFSAEGYARAKGAAAAVVTYSVGALSAFDAIGGAYAENLPVILISGAP
    NNNDHAAGHVLHHALGKTDYHYQLEMAKNITAAAEAIY
    
    
    The format is free, but the sequence cannot exceed 5000 residues. For a detailed description of the cutoff value, see the 3DB Browser Help File (for TOP, this value should be between 0.01 and 0.02). This command is good for searching for structures with a short sequence fingerprint, or for structures in a sequence family, and superimposing them together. This allows TOP to be used as a simple modelling program.
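Before writing a sequence file for 3DBSEQ, it can be convenient to check it against the constraints stated above. The helper below is a hypothetical convenience, not part of TOP: it treats "free format" as meaning that whitespace and line breaks are ignored, and it converts the sequence to upper case, which is an assumption based on the examples.

```python
def normalize_sequence(text, max_length=5000):
    """Join a free-format 1-letter sequence into one line and validate it."""
    sequence = "".join(text.split()).upper()
    if not sequence.isalpha():
        raise ValueError("sequence must contain only 1-letter residue codes")
    if len(sequence) > max_length:
        raise ValueError("sequence exceeds %d residues" % max_length)
    return sequence
```

The result is a single line that could be given directly after the cutoff value.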

    3DBRESOlution res1-res2 or RESO res1 res2

    
    example: 3DBRESOLUTION 0.1-3.0	 or 3DBRESOLUTION 0.1 3.0
    
    
    Equivalent to the "Resolution" column in 3DB. If this command appears, the TOP program only searches those structures whose resolution lies within the given range (between 0.1 A and 3.0 A in the example).

    3DBBEFore (or 3DBUPPer) date
    Example: 3dbbefore 12/3/1998
    Equivalent to the "Date (upper)" column in 3DB. If this command appears, the TOP program only searches those structures which were deposited before this date.

    3DBAFTer (or 3DBLOWer) date
    Example: 3dbafter 12/1/1998
    Equivalent to the "Date (lower)" column in 3DB. If this command appears, the TOP program only searches those structures which were deposited after this date. This allows users to track new structures which are similar to a certain family. The procedure can be made fully automatic with a simple UNIX script file.

    3DBHET compound_name
    Example: 3dbhet FMN
    Equivalent to the "Associated group" column in 3DB. If this command appears, the TOP program only searches those structures which contain this hetero compound.

Conventions of the Coordinate files

When comparing two protein structures, the program needs two coordinate files in Brookhaven format. It can read the secondary structure elements (SSEs, i.e. alpha helices and beta strands) which are pre-assigned in a PDB format file, as in the following example:

HELIX    1  F1 LEU     96  SER    103  
HELIX    2  N1 ILE    148  ARG    160 
HELIX    3  N2 ARG    184  GLU    193 
HELIX    4  N3 GLU    223  HIS    229 
HELIX    5 N4A PRO    245  GLN    249 
HELIX    6 N4B SER    253  GLU    257 
HELIX    7  N5 MET    263  SER    266 
SHEET    1  FB 6 LYS    58  TYR    64  0 
SHEET    2  FB 6 HIS    48  ILE    55 -1
SHEET    3  FB 6 TYR   109  LEU   116 -1
SHEET    4  FB 6 ILE    13  SER    24 -1
SHEET    5  FB 6 VAL    27  SER    33 -1
SHEET    6  FB 6 HIS    75  LYS    81 -1

If there are no SSE assignments in the coordinate file, the program will take some CPU time to calculate them. If the file contains the coordinates of all mainchain atoms, the program will use the "Smith-Laskowski method" as in the PROCHECK package. If the file only contains Ca coordinates, or many mainchain atoms are missing, the program can also assign the secondary structures automatically using another method, but some elements, especially beta strands, might not be as accurate as when all the mainchain atoms are provided. However, this does not influence the structure comparisons in most cases.
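A minimal sketch of reading the pre-assigned SSEs is given below. It assumes the whitespace-separated layout of the example records shown above; real PDB files use fixed columns and may carry chain identifiers, which this sketch does not handle. The function name and the returned tuple layout are hypothetical.

```python
def parse_sse_records(lines):
    """Extract (type, identifier, first residue, last residue) from HELIX/SHEET lines."""
    sses = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "HELIX" and len(fields) >= 7:
            # HELIX serial id resname first resname last
            sses.append(("helix", fields[2], int(fields[4]), int(fields[6])))
        elif fields[0] == "SHEET" and len(fields) >= 8:
            # SHEET strand sheet_id n_strands resname first resname last [sense]
            sses.append(("strand", fields[2], int(fields[5]), int(fields[7])))
    return sses
```

Applied to the first HELIX and SHEET records of the example, this yields ("helix", "F1", 96, 103) and ("strand", "FB", 58, 64).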

Conventions of some output parameters

  • Matching residues (number of matched residues)
    The program counts a pair of residues as matched residues when:
    1) They lie in a consecutive fragment of at least a certain number of residues whose Ca atoms in the two superimposed structures are closer than a certain distance. The distance is defined by the DISTANCE command (default 3.8 angstrom), while the number of consecutive residues is defined by the RESIDUE command (default 3).
    2) The Ca atoms of the matched residues in the two superimposed structures are closest to each other.
  • Identical residues and Identity
    Identical residues is the number of matched residues whose amino acid types are identical.
    Identity is (Identical residues)/(matched residues).
  • r.m.s. deviation
    $$\mathrm{r.m.s.} = \left( \frac{1}{N} \sum_{i=1}^{N} d_i^2 \right)^{1/2}$$

    where
    N is the number of the matchable Ca atoms
    d_i is the distance between the i'th atom of the 1st molecule and the i'th atom of the 2nd molecule
  • Mean distance:
    $$d_{\mathrm{mean}} = \frac{1}{N} \sum_{i=1}^{N} d_i$$

    where
    N is the number of the matchable Ca atoms
    d_i is the distance between the i'th atom of the 1st molecule and the i'th atom of the 2nd molecule

    Usually, if the differences are distributed evenly over the two structures, the values of dmean and r.m.s. are close. If some parts of the two structures differ much more than others, r.m.s. is usually significantly higher than dmean, because squaring gives the large deviations more weight. In my opinion, dmean reflects the distance between the two structures better than r.m.s.
  • Structural Diversity
    This value describes the difference between the two structures, based on the distances between matched Ca atoms and the number of matched residues. The definition is:
    $$\text{Structure Diversity} = \mathrm{r.m.s.} \times \left( \frac{N_{mol1}}{N_{fit}} \right)^{A}$$

    where
    N_fit is the number of matched residues (Ca atoms)
    N_mol1 is the total number of residues in the 1st molecule
    A is the amplification order for the number of matched residues (defined in the AMPLIFY command, default 2.0). The higher this value, the more the structure diversity is influenced by the number of matched residues rather than by the r.m.s. deviation.
  • Topological Diversity
    This value describes the topological difference in Secondary Structure Elements between the two molecules. The definition is:
      $$\text{Topological Diversity} = (\mathrm{Angle} + \mathrm{RMS}) \times \left( \frac{M_{mol1}}{M_{fit}} \right)^{A}$$

      where
      Angle is the average angle between the directions of the matched SSE pairs
      RMS is calculated from the two-point representation of the SSEs
      M_fit is the number of matched SSEs
      M_mol1 is the total number of SSEs in the 1st molecule
      A is the amplification order for the number of matched SSEs (defined in the AMPLIFY command, default 1.5). The higher this value, the more the topological diversity is influenced by the number of matched SSEs.
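The output parameters defined in the list above can be reproduced with a short Python sketch. It is an illustration of the formulas only, not the program's own code: the function names are invented here, the distances are made-up numbers, and the units TOP uses for Angle and RMS in the topological diversity are not stated in this document, so they are simply taken as given.

```python
import math


def rms_deviation(distances):
    """r.m.s. = sqrt(sum(d_i^2) / N) over the matched Ca pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def mean_distance(distances):
    """dmean = sum(d_i) / N over the matched Ca pairs."""
    return sum(distances) / len(distances)


def structure_diversity(rms, n_mol1, n_fit, amplify=2.0):
    """r.m.s. * (Nmol1 / Nfit)^A, with A as set by AMPLIFY (default 2.0)."""
    return rms * (n_mol1 / n_fit) ** amplify


def topological_diversity(angle, sse_rms, m_mol1, m_fit, amplify=1.5):
    """(Angle + RMS) * (Mmol1 / Mfit)^A, with A defaulting to 1.5."""
    return (angle + sse_rms) * (m_mol1 / m_fit) ** amplify


# Two made-up sets of Ca-Ca distances (angstrom) with the same mean:
uniform = [1.0, 1.0, 1.0, 1.0]   # differences spread evenly
uneven = [0.2, 0.2, 0.2, 3.4]    # one region fits much worse

for name, dist in (("uniform", uniform), ("uneven", uneven)):
    print(name, "dmean = %.2f" % mean_distance(dist),
          "r.m.s. = %.2f" % rms_deviation(dist))

# Matching only half of a 100-residue molecule with r.m.s. 1.0
print("structure diversity:", structure_diversity(1.0, 100, 50))
```

The two distance sets show the point made above about dmean and r.m.s.: both have the same mean distance, but the uneven set has the larger r.m.s. because one poorly fitting residue dominates the sum of squares.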

Examples

In many cases, users can quickly learn how to use the program just by looking at the corresponding examples. There are two ways to run TOP: simple commands or Unix script files. The simple commands are designed for the convenience of users who do not have the Protein Data Bank in their local lab and use TOP for ordinary purposes. The Unix script files are more flexible for special purposes.

Simple commands:

  • Comparing two structures: top3d

    When comparing two similar structures, the program can do two things:
    1. Superimpose the two structures so that the user can display them together on a graphics system.
    2. Output a sequence alignment and statistics about the differences, such as the r.m.s. deviation, the number of fitting residues and so on.

    For these purposes, just type top3d file1 file2, or type top3d and answer the questions. For example, if you type: top3d mol1.pdb mol2.pdb (and the two structures are similar), the program will output a sequence alignment of the two proteins and a coordinate file mol2_mol1.pdb in which mol2.pdb is superimposed onto mol1.pdb.

    If one or both molecules have been deposited in the Protein Data Bank and the entry code is known, you can tell the program by using the special format code@pdb. For example, to compare PDB entries 1KXD and 1VCP, just type top3d 1kxd@pdb 1vcp@pdb and the program will output a file 1vcp_1kxd.pdb in which 1VCP is superimposed onto 1KXD.

    To change the parameters for the TOP program, edit the file TOP.PARM in the directory.

  • Searching the database for proteins with similar 3D structure: topsearch

    If you have a protein structure (for example mol1.pdb) and wish to find out which proteins in the Protein Data Bank are similar to it, type topsearch mol1.pdb and the database search will start. When the procedure has finished, there will be a long output file topsearch_name.log and two shortened lists, strdiv_name.lis and topdiv_name.lis.

    The file strdiv_name.lis is a list of similar structures ranked by "Structure Diversity" (based on Ca atoms). The file topdiv_name.lis is a list of similar structures ranked by "Topological Diversity" (based on Secondary Structure Elements). For a detailed comparison, pick a code from one of these two lists and use the command top3d for further information.

Unix script file

There are several example files at http://gamma.mbb.ki.se/~guoguang/webtop/examples showing how to use the TOP program. Here is a summary:

Name             PDB data from            Function
top.com          local disk or internet   Superimposing two protein structures and comparing them
pdbscan.com      local disk               Searching similar structures in Protein Data Bank
topscan.com      internet                 Searching similar structures in Protein Data Bank
pdbsearch.com    local disk               Searching similar structures in a compact database
topsearch.com    internet                 Searching similar structures in a compact database
top3db.com       internet                 Searching similar structures with 3DB restraints
makevec.com      local disk               Making SSE library
makevec_web.com  internet                 Making SSE library

Example 1: Compare two structures

Two files, 1kxd.pdb and 1vcp.pdb, will be compared by the following script file ($TOPHOME/examples/top.com in the distribution package).

#
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$LUEXE/top << 'end-top'
MOL1 1kxd.pdb
MOL2 1vcp.pdb
RESIDUE 3
WRITE
'end-top'
#

Type "top.com > top.log". The program will report which secondary structure elements correspond to each other in the two structures. Optionally, the program also superimposes the two structures based on the Ca atoms and outputs the sequence comparison (see the instructions for the keyword RESIDUE), together with the r.m.s. deviation. When the WRITE statement is present, the program writes a file in which molecule 2 is superimposed onto molecule 1. In this case the output file name is 1vcp_1kxd.pdb.

Sometimes there is more than one way to superimpose the two structures (e.g. when both structures are dimers, the program can superimpose AB onto A'B' and AB onto B'A'). In this case the program will output several superimposed coordinate files, called 1vcp_1kxd.pdb, 1vcp_1kxd.pdb_2, 1vcp_1kxd.pdb_3, and so on. Any graphics program (such as O, Insight or Frodo) can be used to display the superimposed coordinates together with 1kxd.pdb. Look at top.log for more information.
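When TOP is driven from a script, it can help to predict the names of the superimposed coordinate files. The following Python sketch only encodes the naming pattern seen in the examples of this document (molecule 2, then molecule 1, then a numeric suffix for alternative superpositions). The helper name superposed_names is hypothetical, and input names that include a directory path are not handled.

```python
def superposed_names(mol1, mol2, n_solutions=1):
    """Expected output names when mol2 is superimposed onto mol1.

    Follows the pattern seen in the examples (1vcp_1kxd.pdb,
    1vcp_1kxd.pdb_2, ...); this is an assumption drawn from those
    examples, not a documented rule of the program.
    """
    def stem(name):
        # Strip the "@pdb" entry-code marker or the ".pdb" extension.
        for suffix in ("@pdb", ".pdb"):
            if name.endswith(suffix):
                return name[: -len(suffix)]
        return name

    base = "%s_%s.pdb" % (stem(mol2), stem(mol1))
    return [base] + ["%s_%d" % (base, k) for k in range(2, n_solutions + 1)]


print(superposed_names("1kxd.pdb", "1vcp.pdb", n_solutions=3))
print(superposed_names("1kxd@pdb", "1vcp@pdb"))
```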

There are other commands that set the parameters for different kinds of comparison. For details, please see "Keyworded Input".

The TOP software can browse coordinates directly from the Protein Data Bank (PDB) if the URL of a PDB mirror site is provided. In this example, if you know that the PDB entry code of one of the structures is 1vcp, you can do the following:
  1) add a command to indicate which site to browse:
     PDBSITE http://www.pdb.bnl.gov/
  2) use the form xxxx@pdb in the MOL2 command:
     MOL2 1vcp@pdb
The program will then read 1vcp directly from Brookhaven National Laboratory via the internet.


Example 2: Searching similar structures in Protein Data Bank

TOP can be used to see whether a protein is similar to any structures in the Protein Data Bank. Depending on how the data are obtained from the database, there are two ways to run a database search:
  1. Search a copy of the Protein Data Bank installed on the local disk. Example command files are pdbsearch.com and pdbscan.com in the directory $TOPHOME/examples/
  2. Search the Protein Data Bank via the internet (see topsearch.com and topscan.com).

The recommended way to run TOP is to search a compact library of Secondary Structure Elements (SSEs) first. If the SSE arrangements of some proteins are found to be similar to the structure under study, the program then carries out further comparisons based on Ca atoms (as shown in pdbsearch.com and topsearch.com). This approach requires a regularly updated SSE library, which can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z It can also be built and updated automatically (see the instructions for "Automatic updating of SSE library").

If you choose not to use the compact SSE library, use pdbscan.com or topscan.com instead of pdbsearch.com or topsearch.com to search PDB on the local disk or via the internet, respectively.

In pdbscan.com, it is assumed that all the Protein Data Bank files are under the directory /nfs/protein/pdb/current_release/uncompressed_files and that all the files are called *.ent. In this example file, the command
  find $pdbdir -name "*.ent" -print > current.lis
finds all the PDB entries and writes them into the file current.lis, which has contents like:


/nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb101d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb201d.ent
....

In this way all the file names are stored in current.lis, which is read by the MOL2 command in the TOP program:
  MOL2 @current.lis
In fact, one can search not only the whole Protein Data Bank but also a group of selected structures, for example structures that represent independent folds in the SCOP classification.
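Where the find command is not convenient, the same kind of list file can be written from Python. The sketch below is an equivalent under stated assumptions: the function name write_entry_list is made up, and the demonstration builds a small temporary directory tree instead of touching a real PDB installation.

```python
import tempfile
from pathlib import Path


def write_entry_list(pdb_dir, list_file):
    """Write every *.ent file found under pdb_dir to list_file, one path per line."""
    entries = sorted(str(p) for p in Path(pdb_dir).rglob("*.ent"))
    Path(list_file).write_text("".join(e + "\n" for e in entries))
    return entries


# Demonstration on a throw-away directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "uncompressed_files"
    for sub, name in (("00", "pdb100d.ent"), ("01", "pdb101d.ent")):
        (root / sub).mkdir(parents=True, exist_ok=True)
        (root / sub / name).touch()
    (root / "00" / "README").touch()  # not an entry file, so it is skipped

    entries = write_entry_list(root, Path(tmp) / "current.lis")
    listed = (Path(tmp) / "current.lis").read_text().splitlines()
    print(len(listed), "entries written")
```

To search only a selected group of structures, the list returned by the function can be filtered before it is written.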

Taking pdbscan.com as the example again: to run the database search, type "pdbscan.com &". After some hours, all the information will be in pdbscan.log, which users usually do not need to look at. Instead, look at the summary files "strdiv.lis" or "topdiv.log". (If the program crashes, you can still look at the intermediate results by typing "grep Str pdbscan.log | sort +3 -4" or "grep Top pdbscan.log | sort +3 -4". Where sort no longer accepts the obsolete +3 -4 key syntax, sort -k 4,4 is the equivalent.)

The content of strdiv.lis is the following:


 1692 structures are found to be similar under the given criteria
 Best Structure Diversity   7.67  with   52 matched residues to 2cnd
 Best Structure Diversity   7.68  with   56 matched residues to 1azz
 Best Structure Diversity   8.13  with   57 matched residues to 1epa
 Best Structure Diversity   8.33  with   48 matched residues to 1cnf
 Best Structure Diversity   8.48  with   54 matched residues to 1ave
 Best Structure Diversity   8.70  with   54 matched residues to 1hav
 Best Structure Diversity   8.70  with   54 matched residues to 2pia
 Best Structure Diversity   9.28  with   51 matched residues to 1avd
 ............


The structures listed here (2cnd, 1azz, 1epa and so on) were found to be similar to the search model; 2cnd is ranked by the program as the most similar. Users can pick up the coordinates of any of them and use the command file of Example 1 to run an individual comparison, which gives the superimposed structure and details such as the r.m.s. deviation and sequence alignment. (This information is also inside pdbscan.log; run nicelist.com or toplist.com to get a more readable output.)
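For further processing, the summary lines can be read into Python and ranked numerically. This sketch assumes the line layout shown in the strdiv.lis listing above; the regular expression and the function name parse_strdiv are illustrative and are not part of TOP.

```python
import re

# Layout assumed from the strdiv.lis listing shown above.
HIT = re.compile(
    r"Best Structure Diversity\s+([\d.]+)\s+with\s+(\d+)\s+"
    r"matched residues to\s+(\S+)"
)


def parse_strdiv(lines):
    """Return (diversity, matched_residues, code) tuples, most similar first."""
    hits = []
    for line in lines:
        m = HIT.search(line)
        if m:
            hits.append((float(m.group(1)), int(m.group(2)), m.group(3)))
    return sorted(hits)


sample = [
    " 1692 structures are found to be similar under the given criteria",
    " Best Structure Diversity   8.33  with   48 matched residues to 1cnf",
    " Best Structure Diversity   7.67  with   52 matched residues to 2cnd",
    " Best Structure Diversity   8.13  with   57 matched residues to 1epa",
    " Best Structure Diversity   7.68  with   56 matched residues to 1azz",
]

ranked = parse_strdiv(sample)
for diversity, matched, code in ranked:
    print(code, diversity, matched)
```

Sorting on the parsed number, rather than on the text of the field, keeps the ranking correct even if the column widths of the log lines vary.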


Example 3: Searching similar structures from a compact SSE library

As explained in the description section, in the first step TOP detects similarities based on the SSE topology of the two proteins. Besides coordinate files in PDB format, the program can also read a compact database which contains SSE topology derived from the Protein Data Bank. Using the SSE library is a fast and recommended way to search a database for similar structures. To make the library from PDB on the local disk, use $TOPHOME/examples/makevec.com. To make the library from PDB on the Web, use $TOPHOME/examples/makevec_web.com. The SSE library can be updated automatically to follow the most recent PDB data; please see the installation section.

The following example shows how to use the SSE library for similarity searching. It is similar to Example 2, but with one more command, MOLVEC.


rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
cat > topsearch.inp << EOF
MATCH auto
PDBSITE http://www2.ebi.ac.uk
!LIBDIR /nfs/pdb/current_release/uncompressed_files/
MOL1 kinA.pdb
MOLVEC $TOPHOME/lib/sndlib.vec
EOF
$TOPBIN/top < topsearch.inp  > topsearch.log
# sort -k 4,4 is the POSIX form of the obsolete "sort +3 -4"
grep Top topsearch.log | sort -k 4,4 >> topdiv.lis
grep similar topsearch.log > strdiv.lis
grep Str topsearch.log | sort -k 4,4 >> strdiv.lis

The running and analysis procedure is similar to that of Example 2.

In this example, if you use LIBDIR /nfs/pdb/current_release/uncompressed_files/ instead of PDBSITE http://www2.ebi.ac.uk, the program will read the coordinates from the local disk instead of the internet.

If you use another SSE database, for example
  MOLVEC $TOPHOME/lib/scop_structure.vec
you search only about 2000 independent domain structures selected in the SCOP database instead of about 8000 in the Protein Data Bank. The search is much faster (taking only 1/10 to 1/5 of the time). For the same reason, you could use $TOPHOME/lib/scop_family.vec (about 900 domain structures) or $TOPHOME/lib/scop_superfamily.vec (about 600 domain structures) for an even shorter search. The SCOP database is not updated as frequently as PDB (so far about once a year). The SSE database for the most recent SCOP release is always kept at our FTP distribution site.

On the Web server of TOP, there is another way to search all the structures: the program searches one classification unit of SCOP (independent domain structures, families or super-families). Once it finds a similarity, it can optionally go on to search the other structures in the same classification unit. Searching in this way is very fast, although it does not cover the most recent data in the Protein Data Bank. Please have a look at: http://alfa.mbb.ki.se:8000/TOP/search_SCOP_new.html


Example 4: Superimpose all the sequence-homologous proteins in PDB

If you wish to compare all the structures in PDB which have sequence homology to a particular structure, the following simple procedure makes all the superimposed structures.

#!/bin/csh
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$TOPBIN/top << 'end-top' 
MOL1 zmA.pdb
MOLVEC snd1.vec
pdbsite http://www2.ebi.ac.uk
3dbseq 0.02 @zm.seq
MATCH auto
WRITE yes
'end-top'

In this example zmA.pdb contains the PDB coordinates of the probe structure, and zm.seq is the file which contains its sequence in 1-letter code:

SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
TLKFIANRDKVAVLVGSKLRAAGAEEAAVKFTDALGGAVATMAAAKSFFP
EENALYIGTSWGEVSYPGVEKTMKEADAVIALAPVFN
....

The superimposed coordinate files will be named 1pyd_zmA.pdb, 1pvd_zmA.pdb, 1pox_zmA.pdb, and so on.

Some hints about the program

  1. Database searching. If you find that structures in the Protein Data Bank are similar to your new structure, the program cannot directly tell you which family it belongs to. However, there are Web sites where you can get this information and classify your new protein according to the results from the TOP program. Some of these sites are listed below.

    Name  URL address                            Function                               Group
    SCOP  http://scop.mrc-lmb.cam.ac.uk/scop     Structure Classification of Proteins   Chothia, Murzin...
    CATH  http://www.biochem.ucl.ac.uk/bsm/cath  Class Architecture Topology Homology   Thornton...

    When searching the whole Protein Data Bank for similar structures, a lot of time is usually wasted on tens of lysozyme mutants or other closely related homologous proteins. It is possible to make a file list in which only structures with independent folds or super-families are present (see Example 2), if such information can be obtained from other sources. So far, no such effort has been made by the author.

  2. Speed. When you have a huge structure with many domains, it is much faster if you divide your protein into several independent domains and search each domain individually. The results will be much easier to understand too.

  3. Parameter of MATCH. Over-estimation: If the program fails to compare two similar structures, it may be because the parameter value in the MATCH command is too high. Users can check this in the following way. For example, if the MATCH number should be 4 or less but you use 7, the program will write something like this at the end of the output:
     ... No way to align in 12ca.pdb Maxminun match : 4 Minumun Align: 7
    You can then change MATCH from 7 to 4 and the program will run successfully.

    In a database search, too high a value in this command will cause no or too few similar structures to be found. Users can find the proper value by typing: grep "Maxminun match" pdbscan.log | sort +10 -11 (it is assumed that the log file is called pdbscan.log; sort -k 11,11 is the current POSIX form of the same sort key). For example, if you give MATCH number 5 and get no hits, you will see something like

     ......
     ... No way to align in 1abj.pdb Maxminun match :  3 Minumun Align:  5
     ... No way to align in 1abn.pdb Maxminun match :  3 Minumun Align:  5
     ... No way to align in 1abo.pdb Maxminun match :  3 Minumun Align:  5
     ... No way to align in 12ca.pdb Maxminun match :  4 Minumun Align:  5
     ... No way to align in 1aag.pdb Maxminun match :  4 Minumun Align:  5
     ... No way to align in 1aao.pdb Maxminun match :  4 Minumun Align:  5
    
    
    In this example, you would get 3 more similar structures if you used 4 in the MATCH command.

    Under-estimation: Under-estimating this number is usually harmless. The program will find many structures which are of no interest to you, but you can always rank the similarity by "Structure Diversity" or "Topological Diversity" and look only at the structures at the top of the rankings. If the search is too slow because the value of this parameter is too low, there is a way to find a suitable number long before the search has finished. For example, suppose you give 5 in the MATCH command. After the program has run for a while, you can type grep "Max Align" pdbscan.log | sort +3 -4 (sort -k 4,4 in current POSIX syntax) and get

    .......
    ...(too many hits)...
    ......
     1cax.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwa.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwb.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cxf.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cyn.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1dlc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
     1cne.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
     1cnf.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
    
    
    If only the last 3 structures meet your "similarity" criterion, you can give "MATCH 6" (or 7) when you re-scan the database.
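The two grep checks described in hint 3 can also be done in one pass over the log with Python. This is a sketch that assumes the log lines look exactly like the listings above (including the program's own spelling "Maxminun" and "Minumun"); the function names and regular expressions are illustrative and are not part of TOP.

```python
import re
from collections import Counter

# Line layouts assumed from the log excerpts shown above.
NO_ALIGN = re.compile(
    r"No way to align in (\S+)\s+Maxminun match\s*:\s*(\d+)"
    r"\s+Minumun Align\s*:\s*(\d+)"
)
ALIGNED = re.compile(r"(\S+)<->(\S+)\s+Max Align:\s*(\d+)\s+Max Match:\s*(\d+)")


def rejected_by_match(lines):
    """Count rejected entries by the largest match they could have reached."""
    counts = Counter()
    for line in lines:
        m = NO_ALIGN.search(line)
        if m:
            counts[int(m.group(2))] += 1
    return counts


def accepted_by_align(lines):
    """Count accepted entries by their 'Max Align' value."""
    counts = Counter()
    for line in lines:
        m = ALIGNED.search(line)
        if m:
            counts[int(m.group(3))] += 1
    return counts


log = [
    " ... No way to align in 1abj.pdb Maxminun match :  3 Minumun Align:  5",
    " ... No way to align in 1abn.pdb Maxminun match :  3 Minumun Align:  5",
    " ... No way to align in 12ca.pdb Maxminun match :  4 Minumun Align:  5",
    " 1cax.pdb<->mol1.pdb  Max Align:  5  Max Match:  5",
    " 1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7",
    " 1cne.pdb<->mol1.pdb  Max Align:  7  Max Match:  7",
]

rejected = rejected_by_match(log)
accepted = accepted_by_align(log)
print("rejected, by best possible match:", dict(rejected))
print("accepted, by Max Align:", dict(accepted))
```

If MATCH was over-estimated, the largest key in the first table is the highest MATCH value that would let any of the rejected entries through. If it was under-estimated, the second table shows how many hits would remain at each higher MATCH value.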

Reference

  1. Lu G. (1996). A WWW service system for automatic comparison of protein structures. Protein Data Bank Quarterly Newsletter, #78, 10-11.
  2. Lu G. An automatic topological and atomic comparison program for protein structures (in manuscript, or see http://gamma.mbb.ki.se/~guoguang/top.html).

Acknowledgment

The author is grateful to Doc. Ylva Lindqvist and Prof. Gunter Schneider for encouraging me to write this program and for contributing important ideas. I also thank Dr. Roman Laskowski for permission to use his secondary structure assignment program and Dr. Jaime Prilusky for suggestions about the 3DB interface. Thanks also to a number of colleagues for suggestions and bug reports.