FACTOR

version 3.5c

FACTOR - Program to factor multistate characters.
(c) Copyright 1986-1993 by Christopher A. Meacham.  Permission  is  granted  to
copy  this  document  provided  that  no  fee  is  charged for it and that this
copyright notice is not removed.


      Programmed by C. Meacham, Botany, Univ. of Georgia, Athens, Georgia
   (current address: University of California, Berkeley, California  94720)
             additional code and documentation by Joe Felsenstein


     This program factors a  data  set  that  contains  multistate  characters,
creating  a  data  set  consisting entirely of binary (0,1) characters that, in
turn, can be used as input to any of the other discrete character  programs  in
this  package.  Besides this primary function, FACTOR also provides an easy way
of deleting characters from a data set.  The input format for  FACTOR  is  very
similar  to  the  input format for the other discrete character programs except
for the addition of character-state tree descriptions.

     Note that this program has no way of converting  an  unordered  multistate
character  into  binary  characters.   This  is  a  weakness  of  the  discrete
characters programs in this package.  For  the  time  being,  the  best  I  can
suggest  is  to  code  them  as  A,  C,  G, and T and use the DNA parsimony and
compatibility programs.  That is not a very good alternative, admittedly.

     The first line of the input file should contain the number of species  and
the  number of multistate characters.  This first line is followed by the lines
describing the character-state trees, one description per  line.   The  species
information  constitutes the last part of the file.  Any number of lines may be
used for a single species.


                                  FIRST LINE

     The first line is free format with the number of species first,  separated
by  at  least one blank (space) from the number of multistate characters, which
in turn is separated by at least one blank from the options, if present.


                                    OPTIONS

The options are selected from a menu that looks like this:


Factor -- multistate to binary recoding program, version 3.5c

Settings for this run:
  A      put ancestral states in output file?  No
  F   put factors information in output file?  No
  0       Terminal type (IBM PC, VT52, ANSI)?  ANSI

Are these settings correct? (type Y or the letter for one to change)

  A      Choosing the A (Ancestors) options toggles on and off the setting that
     causes a line to be written in the output that describes the states of the
     ancestor as  indicated  by  the  character-state  tree  descriptions  (see



     below).   If  the  ancestral  state  is  not  specified  by  a  particular
     character-state tree, a '?' signifying an unknown character state will  be
     written.   The  multistate  characters are factored in such a way that the
     ancestral state in the factored data set will always be '0'.  The ancestor
     line does not get counted as a species.

  F      Choosing the F (Factors) option toggles on and off a setting that will
     cause  a  'FACTORS'  line  to  be  written  in the output.  This line will
     indicate to other programs which factors came  from  the  same  multistate
     character.   Of the  programs currently in the package only SEQBOOT, MOVE,
     and DOLMOVE use this information.


                       CHARACTER-STATE TREE DESCRIPTIONS

     The character-state trees are described in  free  format.   The  character
number  of  the multistate character is given first followed by the description
of the tree itself.  Each description must be completed on a single line.  Each
character  that  is  to be factored must have a description, and the characters
must be described in the order that they  occur  in  the  input,  that  is,  in
numerical order.

     The tree is described by listing the pairs of character  states  that  are
adjacent  to  each other in the character-state tree.  The two character states
in each adjacent pair are separated by a colon (':').  If character fifteen has
this character state tree for possible states

                         A ---- B ---- C
                                !
                                !
                                !
                                D

then the character-state tree description would be

                        15  A:B B:C D:B

Note that either symbol may appear first.  The ancestral state  is  identified,
if  desired,  by  putting  it  "adjacent"  to  a  period.  If we wanted to root
character fifteen at state C:

                         A <--- B <--- C
                                !
                                !
                                V
                                D

we could write

                      15  B:D A:B C:B .:C

Both the order in which the pairs are listed and the order of  the  symbols  in
each  pair are arbitrary.  However, each pair may only appear once in the list.
Any symbols may be used for a character state in the input except the character
that  signals  the connection between two states (in the distribution copy this
is set to ':'), '.', and, of course, a blank.  Blanks are ignored completely in
the tree description so that even  B:DA:BC:B.:C  or B : DA : BC : B. : C  would
be equivalent to the above example.  However, at least one blank must  separate
the character number from the tree description.





                      DELETING CHARACTERS FROM A DATA SET

     If no description line appears in the input for  a  particular  character,
then  that  character will be omitted from the output.  If the character number
is given on the line, but no character-state tree is provided, then the  symbol
for  the  character  in the input will be copied directly to the output without
change.  This is useful for characters that are  already  coded  '0'  and  '1'.
Characters can be deleted from a data set simply by listing only those that are
to appear in the output.


                   TERMINATING THE LIST OF TREE DESCRIPTIONS

     The last character-state tree description should be  followed  by  a  line
containing  the  number  '999'.   This  terminates  processing of the trees and
indicates the beginning of the species information.


                              SPECIES INFORMATION

     The format for the species information is basically identical to the other
discrete character programs.  The first ten character positions are allotted to
the species name (this value may be  changed  by  altering  the  value  of  the
constant nmlngth at the beginning of the program).  The character states follow
and may be continued to as many lines as desired.  There is no  current  method
for  indicating  polymorphisms.   It  is  possible to either put blanks between
characters or not.

     There is a method for indicating uncertainty about states.  There  is  one
character  value  that stands for 'unknown'.  If this appears in the input data
then '?' is written out in all the corresponding positions in the output  file.
The  character  value  that  designates program, and can be changed by changing
that constant.  It is set to


                                    OUTPUT

     The first line of output will contain the number of species and the number
of binary characters in the factored data set followed by the letter 'A' if the
A option was specified in the input.  If option F was specified, the next  line
will  begin  'FACTORS'.   If  option  A  was specified, the line describing the
ancestor will follow next.  Finally, the factored characters  will  be  written
for  each  species  in  the  format  required  for  input by the other discrete
programs in the package.   The  maximum  length  of  the  output  lines  is  80
characters, but this maximum length can be changed prior to compilation.


                                    ERRORS

     The output should be checked for error messages.  Errors will occur in the
character-state  tree  descriptions  if  the format is incorrect (colons in the
wrong place, etc.), if more than one root is specified, if  the  tree  contains
loops (and hence is not a tree), and if the tree is not connected, e.g.

                             A:B B:C D:E

describes

                  A ---- B ---- C          D ---- E

This "tree" is in two unconnected pieces.  An error will also occur if a symbol



appears in the data set that is not in the tree description for that character.
Blanks at the end of lines when the species information is continued to  a  new
line will cause this kind of error.


                       CONSTANTS AVAILABLE TO BE CHANGED

     At the beginning of the program a number of are available to be changed to
accomodate  larger  data  sets.  These are "nmlngth", "maxstates", "maxoutput",
"sizearray", "factchar" and "unkchar".  The constant "nmlngth" is the length of
the  species  name.  The allowed in the input.  The CONSTant maxstates constant
"maxstates" gives the maximum number of states per character (set at 20 in  the
distribution copy).  The constant "maxoutput" gives the maximum width of a line
in the output file (80 in the distribution  copy).   The  constant  "sizearray"
must  be  less  than  the  sum  of  squares  of  the  numbers  of states in the
characters.  It is initially set to set to 2000, so that although 20 states are
allowed (at the initial setting of maxstates) per character, there cannot be 20
states in all of 100 characters.

     Particularly important constants are "factchar" and  "unkchar"  which  are
not  numerical  values  but  a  character.   Initially  set  to  the colon ':',
"factchar" is the character that will be used to separate states in  the  input
of  character  state trees.  It can be changed by changing this constant.   (We
could have used a hyphen ('-') but didn't because that would  make  the  minus-
sign  ('-')  unavailable as a character state in +/- characters).  The constant
"unkchar" is the character value in the input  data  that  indicates  that  the
state is unknown.  It is set to '?' in the distribution copy.  If your computer
is one that lacks the colon ':' in its character  set  or  uses  a  nonstandard
character code such as EBCDIC, you will want to change the constant "factchar".


                            INPUT AND OUTPUT FILES

     The input file for the program has the default file name "infile" and  the
output  file,  the  one  that has the binary character state data, has the name





























"outfile".


        ----SAMPLE INPUT-----  -----Comments (not part of input file) -----

           4   6   A           4 species; 6 characters; A option on
        1 A:B B:C              A ---- B ---- C
        2 A:B B:.              B ---> A
        4                      Character 3 deleted; 4 unchanged
        5 0:1 1:2 .:0          0 ---> 1 ---> 2
        6 .:# #:$ #:%          # ---> $ ---> %
        999                    Signals end of trees
        Alpha     CAW00#       Species information begins
        Beta      BBX01%
        Gamma     ABY12#
        Epsilon   CAZ01$


        ---SAMPLE OUTPUT-----  -----Comments (not part of input file) -----

            5    8    A        5 species (incl. anc.); 8 factors
        ANCESTOR  ??0?0000     Chars. 1 and 2 come from old number 1
        Alpha     11100000     Char. 3 comes from old number 2
        Beta      10001001     Char. 4 is old number 4
        Gamma     00011100     Chars. 5 and 6 come from old number 5
        Epsilon   11101010     Chars. 7 and 8 come from old number 6