FASTA PARSABLE OUTPUT


Introduction

This document may be useful for programmers and script writers, but can be skipped by most users of FastA and TFastA.

FastA's standard alignment formats are difficult to parse, which has made it hard to extract alignment information from a FastA output file for further processing. A new command-line parameter, -MARKx=10, saves the alignments in a format that is easy to parse. This document describes that parsable output file.

Records

The output file has three types of records. The header record starts with >>>. It contains information about the search as a whole: which version of the program was used, which analysis parameters were used, and so on. There is only one header record per output file.

An alignment record contains information pertaining to a pairwise alignment, such as the scores for the alignment. It starts with >>. There will be one alignment record for each alignment that was saved.

Following each alignment record are two aligned sequence records, each of which starts with >. Each of these records contains the information for one of the sequences in the alignment: the length of the sequence, the beginning and end of the alignment in that sequence's coordinates, and so on.
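
Because all three markers begin with >, a script has to test for the longest marker first. The following Python sketch shows one way to do this. The function name record_type is hypothetical, and the sketch assumes that each record marker begins its own line of the file.

```python
def record_type(line):
    """Classify one line of parsable output by its leading '>' characters."""
    # Test the longest marker first: a '>>>' line also starts with '>>' and '>'.
    if line.startswith(">>>"):
        return "header"
    if line.startswith(">>"):
        return "alignment"
    if line.startswith(">"):
        return "sequence"
    # Anything else is a parameter line or a line of sequence data.
    return None


example_lines = [
    ">>>496 Gtr3_Chick vs SW:GTR5* library",
    ">>Sw:Gtr5_Rat",
    ">Gtr3_Chick ..",
    "; sq_len: 496",
]
for line in example_lines:
    print(record_type(line), "<-", line)
```

Checking the markers in the opposite order would classify every record as an aligned sequence record.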

Record Parameters

Each record stores its information as parameters and their values in a fixed format. A parameter is identified by a tag, followed by an underscore, followed by the parameter's name; a colon separates this identifier from the value or values. The complete format is:


; tag_name: value(s)

Parameters originating in William Pearson's FASTA package always have a two-character tag. Current FASTA tags are:

pg - program related: name, version, matrix used, etc.
fa - FastA results: scores, expect values, etc.
sw - Smith-Waterman results: scores, overlap values, etc.
sq - sequence information: length, type, etc.
al - alignment information: start, stop, display offset, etc.

Redistributors of the FASTA package may create their own parameters. If they do, they must use a tag with more than two characters, for example:


; ebi_access: M61687
; gcg_ver: 9.0

GCG currently has no Wisconsin Package-specific parameters.
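
As a minimal sketch of how a script might split a parameter into its parts, the Python function below applies the "; tag_name: value(s)" format and uses the tag length to tell FASTA parameters from redistributor parameters. The function name parse_parameter, the dictionary keys, and the regular expression are illustrative assumptions, not part of the FASTA package; the sketch also assumes each parameter sits on its own line.

```python
import re

# Assumed reading of "; tag_name: value(s)": the tag ends at the first
# underscore and the name ends at the first colon, so names such as
# "display_start" or "gap-pen" and values such as "GenRunData:Blosum50.Cmp"
# are kept whole.
PARAM_RE = re.compile(r"^;\s*([^_\s:]+)_([^:]+):\s*(.*)$")


def parse_parameter(line):
    """Return the parts of one parameter line, or None if it is not one."""
    match = PARAM_RE.match(line.strip())
    if match is None:
        return None
    tag, name, value = match.groups()
    return {
        "tag": tag,
        "name": name.strip(),
        "value": value.strip(),
        # Two-character tags come from the FASTA package itself;
        # longer tags are reserved for redistributors.
        "fasta_tag": len(tag) == 2,
    }


print(parse_parameter("; pg_gap-pen: -12 -2"))
print(parse_parameter("; ebi_access: M61687"))
```

Values are returned as strings; converting them to numbers is left to the caller, because some parameters (for example pg_gap-pen) carry more than one value.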

Interpreting Aligned Sequence Records

All of the parameters specified by two-character tags correspond to values that are presented in other FastA output formats, with the exception of parameters with the al tag:

al_start gives the location of the alignment start in the original sequence

al_stop gives the location of the end of the alignment in the original sequence

al_display_start gives the location of the first displayed residue in the original sequence. (This may not be the same as the first residue in the aligned region, because FastA provides some context for an alignment; even if the -SHOWall parameter is not used, FastA will try to provide about 30 residues on either side of the actual aligned region if the alignment is in the middle of one or the other sequence.)

Sequences may be padded with leading hyphens if necessary. For example, if the beginning of the query sequence aligns with the tenth residue of the library sequence, the query sequence is padded with nine leading hyphens (-) so that its first residue falls in the tenth column. The leading hyphens are a formatting convenience only; they are not counted in the numbering for al_display_start, al_start, or al_stop.

As an example, here is a pair of aligned sequence records:


>GT8.7 ..
; sq_len: 217
; sq_type: p
; al_start: 3
; al_stop: 180
; al_display_start: 1
---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN
EKFKLGLDFPNLPYLIDGSHKITQSNAILRYLARKHH---LDGETEEERI
RADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKR
PWFAGDKVTYVDFLAYDILDQYRMFEPKCLDA------FPNLRDFLARFE
GLKKISAYMKSSRYIATPIFSKMAHWSNK
>ARP2_TOBAC ..
; sq_len: 223
; sq_type: p
; al_start: 6
; al_stop: 181
; al_display_start: 1
MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRD--NKSSLLL
QSNPV---YKKVPVLIHNGKPIVESMIILEYIDETFEGPSILPKDPYDRA
LARFWAKFLDDKVAAVVNTFFRKGEEQEKGK--EEVYEMLKVLDNELKDK
KFFAGDKFGFADIAANLVGFWLGVFEEGYGDVLVKSEKFPNFSKWRDEYI
NCSQVNESLPPRDELLAFFRARFQAVVASRSAPK

To properly display this alignment, the first P of GT8.7 must line up with the first V in ARP2_TOBAC, and the actual aligned region (the region that scores as the best local alignment) starts with the first I in GT8.7 (amino acid 3) and the first L (amino acid 6) in ARP2_TOBAC.
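
The relationship between residue numbers and display columns can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name residue_column is hypothetical, the sequence is passed as one string with line breaks removed, and every hyphen (leading padding or internal gap) is treated as occupying a column without consuming a residue number.

```python
def residue_column(padded_seq, residue_number, display_start=1):
    """Return the 1-based display column holding the given residue.

    display_start plays the role of al_display_start: the number, in the
    original sequence, of the first residue that is displayed.
    """
    current = display_start - 1
    for column, char in enumerate(padded_seq, start=1):
        if char == "-":
            continue  # hyphens take up a column but are not numbered
        current += 1
        if current == residue_number:
            return column
    raise ValueError("residue %d is not displayed" % residue_number)


# First part of each sequence from the records above.
gt87 = "---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN"
arp2 = "MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRD--NKSSLLL"

# The first P of GT8.7 and the first V of ARP2_TOBAC share a column.
print(residue_column(gt87, 1), residue_column(arp2, 4))
# al_start values 3 and 6 land in the same column as well.
print(residue_column(gt87, 3), residue_column(arp2, 6))
```

For this pair, amino acid 3 of GT8.7 and amino acid 6 of ARP2_TOBAC both fall in column 6, which is where the aligned region begins.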

An Example

Here is a printout of a complete parsable output file containing three alignment records, followed by a printout of the first alignment as it is printed by FastA when the default parameter -MARKx=3 is used.


>>>496 Gtr3_Chick vs SW:GTR5* library
; pg_name: FASTA
; pg_ver: Wisconsin Package 9.0 implementation of FASTA version 2.0u4...
; pg_matrix: GenRunData:Blosum50.Cmp
; pg_gap-pen: -12 -2
; pg_ktup: 2
; pg_optcut: 25
; pg_cgap: 37
>>Sw:Gtr5_Rat
; fa_initn: 1135
; fa_init1: 809
; fa_opt: 1299
; fa_z-score: 1299.0
; fa_expect: 0
; sw_score: 1299
; sw_ident: 0.420
; sw_overlap: 506
>Gtr3_Chick ..
; sq_len: 496
; sq_type: p
; al_start: 6
; al_stop: 491
; al_display_start: 1
-----MADKKKITASLIYAVSVAAIGS-LQFGYNTGVINAPEKIIQAFYN
RTLSQRSGETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFFNRFGRRNS
MLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCGLCTGFVPMYIS
EVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEALWPLLLGFTIV
PAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGTQDVSQDISEMK
EESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQLSGINAVFYYST
GIFERAGI-TQPV-YATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGLGG
MAVCAAVMTIALALKE--KWIRYISIVATFGFVALFEIGPGPIPWFIVAE
LFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVFLIFLVFLLI
FFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKNPMVEMNSIE
PDKEVA
>Gtr5_Rat ..
; sq_len: 502
; sq_type: p
; al_start: 11
; al_stop: 497
; al_display_start: 1
MEKEDQEKTGKLTLVLALATFLAAFGSSFQYGYNVAAVNSPSEFMQQFYN
DTYYDRNKENIESFTLTLLWSLTVSMFPFGGFIGSLMVGFLVNNLGRKGA
LLFNNIFSILPAILMGCSKIAKSFEIIIASRLLVGICAGISSNVVPMYLG
ELAPKNLRGALGVVPQLFITVGILVAQLFGLRSVLASEEGWPILLGLTGV
PAGLQLLLLPFFPESPRYLLIQKKNESAAEKALQTLRGWKDVDMEMEEIR
KEDEAEKAAGFISVWKLFRMQSLRWQLISTIVLMAGQQLSGVNAIYYYAD
QIYLSAGVKSNDVQYVTAGTGAVNVFMTMVTVFVVELWGRRNLLLIGFST
CLTACIVLTVALALQNTISWMPYVSIVCVIVYVIGHAVGPSPIPALFITE
IFLQSSRPSAYMIGGSVHWLSNFIVGLIFPFIQVGLGPYSFIIFAIICLL
TTIYIFMVVPETKGRTFVEINQIFAKKNKVSDVYPEKEEK----ELNDLP
PATREQ
>>Sw:Gtr5_Human
; fa_initn: 1105
; fa_init1: 797
; fa_opt: 1266
; fa_z-score: 1266.0
; fa_expect: 0
; sw_score: 1266
; sw_ident: 0.409
; sw_overlap: 509
>Gtr3_Chick ..
; sq_len: 496
; sq_type: p
; al_start: 6
; al_stop: 484
; al_display_start: 1
------MADKKKITASLIYAVSVAAIGS-LQFGYNTGVINAPEKIIQAFY
NRTLSQRSGETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFFNRFGRRN
SMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCGLCTGFVPMYI
SEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEALWPLLLGFTI
VPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGTQDVSQDISEM
KEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQLSGINAVFYYS
TGIFERAGITQP--VYATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGLG
GMAVCAAVMTIALALKE--KWIRYISIVATFGFVALFEIGPGPIPWFIVA
ELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVFLIFLVFLL
IFFIFTYFKVPETKGRTFEDISRGFEEQVETSS--PSSPPIEKNPMVEMN
SIEPDKEVA
>Gtr5_Human ..
; sq_len: 501
; sq_type: p
; al_start: 12
; al_stop: 497
; al_display_start: 1
MEQQDQSMKEGRLTLVLALATLIAAFGSSFQYGYNVAAVNSPALLMQQFY
NETYYGRTGEFMEDFPLTLLWSVTVSMFPFGGFIGSLLVGPLVNKFGRKG
ALLFNNIFSIVPAILMGCSRVATSFELIIISRLLVGICAGVSSNVVPMYL
GELAPKNLRGALGVVPQLFITVGILVAQIFGLRNLLANVDGWPILLGLTG
VPAALQLLLLPFFPESPRYLLIQKKDEAAAKKALQTLRGWDSVDREVAEI
RQEDEAEKAAGFISVLKLFRMRSLRWQLLSIIVLMGGQQLSGVNAIYYYA
DQIYLSAGVPEEHVQYVTAGTGAVNVVMTFCAVFVVELLGRRLLLLLGFS
ICLIACCVLTAALALQDTVSWMPYISIVCVISYVIGHALGPSPIPALLIT
EIFLQSSRPSAFMVGGSVHWLSNFTVGLIFPFIQEGLGPYSFIVFAVICL
LTTIYIFLIVPETKAKTFIEINQIFTKMNKVSEVYPEKEELKELPPVTSE
Q
>>Sw:Gtr5_Rabit
; fa_initn: 754
; fa_init1: 454
; fa_opt: 847
; fa_z-score: 847.0
; fa_expect: 0
; sw_score: 996
; sw_ident: 0.385
; sw_overlap: 510
>Gtr3_Chick ..
; sq_len: 496
; sq_type: p
; al_start: 4
; al_stop: 482
; al_display_start: 1
-----MADKKKITASLIYAVS--VAAIGS-LQFGYNTGVINAPEKIIQAF
YNRTLSQRSGETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFFNRFGRR
NSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCGLCTGFVPMY
ISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEALWPLLLGFT
IVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGTQDVSQDISE
MK--EESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQLSGINAVF
YYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGL
GGMAVCAAVMTIALALKE--KWIRYISIVATFGFVALFEIGPGPIPWFIV
AELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVFLIFLVFL
LIFFIFTYFKVPETKGRTFEDISRGF--EEQVETSSPSSPPIEKNPMVEM
NSIEPDKEVA
>Gtr5_Rabit ..
; sq_len: 486
; sq_type: p
; al_start: 9
; al_stop: 481
; al_display_start: 1
MEQEGQEKKKEGRLTLVLALRTLIAAFGSSFQYAYNVSVCNSPSELMTEF
YNDTYYDRTGELIDEFPLTLLWSVTVSMFPSGGFAGSLLVGPLVNKFGRK
GALLFNNIFSIVPAILMGCSKVARSFELIIISRLLVGICAGVSSNVVPMY
LGELAPKNLRGALGVESQLFITLGILVAQIFGLRSIRQQKG-WPILLGLT
GGPAAAACPP--FFPESPRYLLIGQ-EPRCRQKALQSLRGWDSVDRELEE
IRREDEAARAAGLVSVRALCAMRGLAWQ---LISVVPLMWQQLSGVNAIY
YYDQ-IYLSPLDTDTQYYTAATGAVNVLMTVCTVFVVESWARLLL-LLGF
SPLAPTCCVLTAALALQDTVSWMPYISIVCIIVYVIGHAIGPAIRSLY--
TEIFLQSGRPPTW--WGQVHWLSNFTVGLVFPLIQ-WAGLYSFIIFGVAC
LSTTVYTFLIVPETKGKSFIEIIRRFIRMNKVEVS-PDREELKDFPPDVS
E

------------------------------------------------------------------------------

SCORES Init1: 809 Initn: 1135 Opt: 1299
Smith-Waterman score: 1299; 42.0% identity in 491 aa overlap

10 20 30 40 50
Gtr3_Chick MADKKKITASLIYAVSVAAIGS-LQFGYNTGVINAPEKIIQAFYNRTLSQRSGET
|:| | |: :||:|| :|:|||::::|:| :::| ||| | :|: |:
Gtr5_Rat MEKEDQEKTGKLTLVLALATFLAAFGSSFQYGYNVAAVNSPSEFMQQFYNDTYYDRNKEN
10 20 30 40 50 60

60 70 80 90 100 110
Gtr3_Chick ISPELLTSLWSLSVAIFSVGGMIGSFSVSLFFNRFGRRNSMLLVNVLAFAGGALMALSKI
| || ||||:|::| ||:|||: |::: | :||::::|: |:::: : ||: |||
Gtr5_Rat IESFTLTLLWSLTVSMFPFGGFIGSLMVGFLVNNLGRKGALLFNNIFSILPAILMGCSKI
70 80 90 100 110 120

120 130 140 150 160 170
Gtr3_Chick AKAVEMLIIGRFIIGLFCGLCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFG
||: |::| :|:::|: |: :: ||||::|::| :||||:|:: || |:|||||||:||
Gtr5_Rat AKSFEIIIASRLLVGICAGISSNVVPMYLGELAPKNLRGALGVVPQLFITVGILVAQLFG
130 140 150 160 170 180

180 190 200 210 220 230
Gtr3_Chick LEGIMGTEALWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGTQ
|::::::| ||:|||:| ||| || : | | |||||:|||:| :| |: :|| ||| :
Gtr5_Rat LRSVLASEEGWPILLGLTGVPAGLQLLLLPFFPESPRYLLIQKKNESAAEKALQTLRGWK
190 200 210 220 230 240

240 250 260 270 280 290
Gtr3_Chick DVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQLSGINAVFYYST
||:::: |:::|: : :| :||| : | :| :|:|: :|||||:||::||:
Gtr5_Rat DVDMEMEEIRKEDEAEKAAGFISVWKLFRMQSLRWQLISTIVLMAGQQLSGVNAIYYYAD
250 260 270 280 290 300

300 310 320 330 340 350
Gtr3_Chick GIFERAGI-TQPV-YATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGLGGMAVCAAVMTI
|: ||: :: | |:| |:|:||: :|:|::|:|| |||:| |:|:: : |:|:
Gtr5_Rat QIYLSAGVKSNDVQYVTAGTGAVNVFMTMVTVFVVELWGRRNLLLIGFSTCLTACIVLTV
310 320 330 340 350 360

360 370 380 390 400 410
Gtr3_Chick ALALKE--KWIRYISIVATFGFVALFEIGPGPIPWFIVAELFSQGPRPAAMAVAGCSNWT
||||:: :|: |:||| :: :| :||:||| ::::|:| |: ||:|: ::| :|
Gtr5_Rat ALALQNTISWMPYVSIVCVIVYVIGHAVGPSPIPALFITEIFLQSSRPSAYMIGGSVHWL
370 380 390 400 410 420

420 430 440 450 460 470
Gtr3_Chick SNFLVGMLFPYAEKLCGPYVFLIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVET
|||:||::||: : ||| |:|| :: |: |: :: ||||||||| :|:: | :: ::
Gtr5_Rat SNFIVGLIFPFIQVGLGPYSFIIFAIICLLTTIYIFMVVPETKGRTFVEINQIFAKKNKV
430 440 450 460 470 480

480 490
Gtr3_Chick SSPSSPPIEKNPMVEMNSIEPDKEVA
|: || |:|:: |
Gtr5_Rat SDVYPEKEEK----ELNDLPPATREQ
490 500
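
Putting the record and parameter rules together, the parsable output shown first in this example can be read into nested records. The Python function below is a minimal sketch, not a validated parser: parse_parsable and the dictionary keys are hypothetical names, and the code assumes that each record marker, each parameter, and each line of sequence sits on its own line of the file.

```python
def parse_parsable(text):
    """Read parsable FastA output into a nested dictionary (a sketch)."""
    result = {"title": None, "params": {}, "alignments": []}
    target = None  # the record currently receiving parameters
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>>"):            # header record
            result["title"] = line[3:].strip()
            target = result
        elif line.startswith(">>"):           # alignment record
            target = {"name": line[2:].strip(), "params": {}, "sequences": []}
            result["alignments"].append(target)
        elif line.startswith(">"):            # aligned sequence record
            name = line[1:].strip()
            if name.endswith(".."):
                name = name[:-2].strip()
            target = {"name": name, "params": {}, "residues": ""}
            result["alignments"][-1]["sequences"].append(target)
        elif line.startswith(";") and target is not None:
            # Split at the first colon only; values may contain colons.
            key, _, value = line[1:].partition(":")
            target["params"][key.strip()] = value.strip()
        elif target is not None and "residues" in target:
            target["residues"] += line        # sequence data, hyphens kept
    return result


sample = "\n".join([
    ">>>496 Gtr3_Chick vs SW:GTR5* library",
    "; pg_name: FASTA",
    "; pg_matrix: GenRunData:Blosum50.Cmp",
    ">>Sw:Gtr5_Rat",
    "; fa_opt: 1299",
    "; sw_ident: 0.420",
    ">Gtr3_Chick ..",
    "; sq_len: 496",
    "; al_start: 6",
    "-----MADKKKITAS",
    "LIYAVS",
    ">Gtr5_Rat ..",
    "; sq_len: 502",
    "; al_start: 11",
    "MEKEDQEKTGKLTLV",
])
parsed = parse_parsable(sample)
for alignment in parsed["alignments"]:
    print(alignment["name"], alignment["params"]["fa_opt"])
```

The sample passed to the function is an abbreviated excerpt of the first alignment, shortened only to keep the sketch compact. Parameter values are kept as strings, and hyphens are kept in the residues so that display columns can still be computed from them.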

Printed: November 18, 1996 13:05 (1162)



Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com