Reading: Jambek ch13
A collection of files + programs to manage them
A full-featured Relational Database Management System
relational - eg oracle, mysql, sybase, postgres
file based - eg SRS
ACEDB
Other - object relational, object oriented, xml
A database for storing annotations on genes
| id | gene symbol | accession | annotation | |
|---|---|---|---|---|
| 1 | ORC1 | S0004530 | origin recognition complex | |
| 2 | ORC2 | S0000264 | origin recognition complex | |
| 3 | ORC3 | S0003927 | origin recognition complex | |
| 4 | ORC4 | S0006366 | origin recognition complex | |
| 5 | ORC5 | S0005205 | origin recognition complex | |
| 6 | ORC6 | S0001160 | origin recognition complex | |
| 7 | AGA1 | S0005327 | cell adhesion receptor | |
| 8 | AGA2 | S0003000 | cell adhesion receptor | |
| 9 | SAG1 | S0003764 | cell adhesion receptor | |
| 10 | AFR1 | S0002492 | receptor signaling protein | |
| 11 | MSN5 | S0002743 | importin alpha export receptor | |
| 12 | PEX5 | S0002652 | peroxisome targeting signal receptor | |
| 13 | PEX7 | S0002549 | peroxisome targeting signal receptor | |
| 14 | SEC63 | S0005780 | signal recognition particle receptor | |
| 15 | SEC66 | S0000375 | signal recognition particle receptor | |
| 16 | SEC72 | S0004283 | signal recognition particle receptor | |
| 17 | STE3 | S0001661 | mating-type a-factor pheromone receptor | |
| 18 | STE2 | S0001868 | mating-type alpha-factor pheromone receptor | |
| 19 | SEC63 | S0005780 | SRP-dependent, co-translational membrane targeting, docking | |
| 20 | SEC66 | S0000375 | SRP-dependent, co-translational membrane targeting, docking | |
| 21 | SEC72 | S0004283 | SRP-dependent, co-translational membrane targeting, docking | |
| 22 | LEM3 | S0005267 | cell surface receptor linked signal transduction | |
| 23 | AFR1 | S0002492 | regulation of G-protein coupled receptor protein signaling pathway |
|
CREATE TABLE gene (
id serial PRIMARY KEY,
symbol char (32),
dbxref char (32),
annotation char (200)
);
|
This is a definition of a database consisting of one table written in SQL
SQL - Structured Query Language; used by all modern relational databases
Each row in the database consists of an entry in each of the following columns : id, symbol, fullname, dbxref, annotation
each column has a type - generally numeric or string
the numbers indicate the amount of space to allocate for each field/column
Type "mysql" followed by the name of the database on the command line. The database called "annotations" has been preinstalled with the above data at Fiocruz.
bash-2.04$ mysql annotations Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 19561 to server version: 3.23.37 Type 'help;' or '\h' for help. Type '\c' to clear the buffer |
You are now using the MySQL interface. You can type in SQL commands here.
mysql> SELECT * FROM gene;
+----+--------+----------+--------------------------------------------------------------------+
| id | symbol | dbxref | annotation
+----+--------+----------+--------------------------------------------------------------------+
| 1 | ORC1 | S0004530 | origin recognition complex
| 2 | ORC2 | S0000264 | origin recognition complex
| 3 | ORC3 | S0003927 | origin recognition complex
| 4 | ORC4 | S0006366 | origin recognition complex
| 5 | ORC5 | S0005205 | origin recognition complex
| 6 | ORC6 | S0001160 | origin recognition complex
| 7 | AGA1 | S0005327 | cell adhesion receptor
| 8 | AGA2 | S0003000 | cell adhesion receptor
| 9 | SAG1 | S0003764 | cell adhesion receptor
| 10 | AFR1 | S0002492 | receptor signaling protein
| 11 | MSN5 | S0002743 | importin alpha export receptor
| 12 | PEX5 | S0002652 | peroxisome targeting signal receptor
| 13 | PEX7 | S0002549 | peroxisome targeting signal receptor
| 14 | SEC63 | S0005780 | signal recognition particle receptor
| 15 | SEC66 | S0000375 | signal recognition particle receptor
| 16 | SEC72 | S0004283 | signal recognition particle receptor
| 17 | STE3 | S0001661 | mating-type a-factor pheromone receptor
| 18 | STE2 | S0001868 | mating-type alpha-factor pheromone receptor
| 19 | SEC63 | S0005780 | SRP-dependent, co-translational membrane targeting, docking
| 20 | SEC66 | S0000375 | SRP-dependent, co-translational membrane targeting, docking
| 21 | SEC72 | S0004283 | SRP-dependent, co-translational membrane targeting, docking
| 22 | LEM3 | S0005267 | cell surface receptor linked signal transduction
| 23 | AFR1 | S0002492 | regulation of G-protein coupled receptor protein signaling pathway
+----+--------+----------+--------------------------------------------------------------------+
23 rows in set (0.00 sec)
|
mysql> SELECT * FROM gene WHERE annotation = 'cell adhesion receptor'; +----+--------+----------+------------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+------------------------+ | 7 | AGA1 | S0005327 | cell adhesion receptor | | 8 | AGA2 | S0003000 | cell adhesion receptor | | 9 | SAG1 | S0003764 | cell adhesion receptor | +----+--------+----------+------------------------+ 3 rows in set (0.01 sec) |
mysql> SELECT symbol FROM gene; +--------+ | symbol | +--------+ | ORC1 | | ORC2 | | ORC3 | | ORC4 | | ORC5 | | ORC6 | | AGA1 | | AGA2 | | SAG1 | | AFR1 | | MSN5 | | PEX5 | | PEX7 | | SEC63 | | SEC66 | | SEC72 | | STE3 | | STE2 | | SEC63 | | SEC66 | | SEC72 | | LEM3 | | AFR1 | +--------+ 23 rows in set (0.01 sec) |
mysql> SELECT * FROM gene WHERE annotation LIKE 'cell%'; +----+--------+----------+--------------------------------------------------+ | id | symbol | dbxref | annotation +----+--------+----------+--------------------------------------------------+ | 7 | AGA1 | S0005327 | cell adhesion receptor | 8 | AGA2 | S0003000 | cell adhesion receptor | 9 | SAG1 | S0003764 | cell adhesion receptor | 22 | LEM3 | S0005267 | cell surface receptor linked signal transduction +----+--------+----------+--------------------------------------------------+ 4 rows in set (0.01 sec) |
mysql> SELECT * FROM gene WHERE annotation LIKE '%receptor%' AND annotation LIKE '%signal%'; +----+--------+----------+--------------------------------------------------------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+--------------------------------------------------------------------+ | 10 | AFR1 | S0002492 | receptor signaling protein | | 12 | PEX5 | S0002652 | peroxisome targeting signal receptor | | 13 | PEX7 | S0002549 | peroxisome targeting signal receptor | | 14 | SEC63 | S0005780 | signal recognition particle receptor | | 15 | SEC66 | S0000375 | signal recognition particle receptor | | 16 | SEC72 | S0004283 | signal recognition particle receptor | | 19 | SEC63 | S0005780 | SRP-dependent, co-translational membrane targeting, docking | | 20 | SEC66 | S0000375 | SRP-dependent, co-translational membrane targeting, docking | | 21 | SEC72 | S0004283 | SRP-dependent, co-translational membrane targeting, docking | | 22 | LEM3 | S0005267 | cell surface receptor linked signal transduction | | 23 | AFR1 | S0002492 | regulation of G-protein coupled receptor protein signaling pathway | +----+--------+----------+--------------------------------------------------------------------+ 11 rows in set (0.01 sec) |
mysql> SELECT * FROM gene WHERE symbol = 'ORC1' OR symbol = 'ORC2'; +----+--------+----------+----------------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+----------------------------+ | 1 | ORC1 | S0004530 | origin recognition complex | | 2 | ORC2 | S0000264 | origin recognition complex | +----+--------+----------+----------------------------+ |
mysql> select distinct annotation from gene; +--------------------------------------------------------------------+ | annotation | +--------------------------------------------------------------------+ | origin recognition complex | | cell adhesion receptor | | receptor signaling protein | | importin alpha export receptor | | peroxisome targeting signal receptor | | signal recognition particle receptor | | mating-type a-factor pheromone receptor | | mating-type alpha-factor pheromone receptor | | SRP-dependent, co-translational membrane targeting, docking | | cell surface receptor linked signal transduction | | regulation of G-protein coupled receptor protein signaling pathway | +--------------------------------------------------------------------+ 11 rows in set (0.01 sec) |
mysql> SELECT COUNT(symbol) FROM gene; +----------+ | count(symbol) +----------+ | 23 | +----------+ 1 row in set (0.01 sec) |
mysql> SELECT * FROM gene ORDER BY symbol; +----+--------+----------+--------------------------------------------------------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+--------------------------------------------------------------------+ | 10 | AFR1 | S0002492 | receptor signaling protein | | 23 | AFR1 | S0002492 | regulation of G-protein coupled receptor protein signaling pathway | | 7 | AGA1 | S0005327 | cell adhesion receptor | | 8 | AGA2 | S0003000 | cell adhesion receptor | | 22 | LEM3 | S0005267 | cell surface receptor linked signal transduction | | 11 | MSN5 | S0002743 | importin alpha export receptor | | 1 | ORC1 | S0004530 | origin recognition complex | | 2 | ORC2 | S0000264 | origin recognition complex | | 3 | ORC3 | S0003927 | origin recognition complex | | 4 | ORC4 | S0006366 | origin recognition complex | | 5 | ORC5 | S0005205 | origin recognition complex | | 6 | ORC6 | S0001160 | origin recognition complex | | 12 | PEX5 | S0002652 | peroxisome targeting signal receptor | | 13 | PEX7 | S0002549 | peroxisome targeting signal receptor | | 9 | SAG1 | S0003764 | cell adhesion receptor | | 14 | SEC63 | S0005780 | signal recognition particle receptor | | 19 | SEC63 | S0005780 | SRP-dependent, co-translational membrane targeting, docking | | 15 | SEC66 | S0000375 | signal recognition particle receptor | | 20 | SEC66 | S0000375 | SRP-dependent, co-translational membrane targeting, docking | | 16 | SEC72 | S0004283 | signal recognition particle receptor | | 21 | SEC72 | S0004283 | SRP-dependent, co-translational membrane targeting, docking | | 18 | STE2 | S0001868 | mating-type alpha-factor pheromone receptor | | 17 | STE3 | S0001661 | mating-type a-factor pheromone receptor | +----+--------+----------+--------------------------------------------------------------------+ 23 rows in set (0.01 sec) |
mysql> INSERT INTO gene VALUES ('99', 'CDC1', 'S0002590', 'metal ion homeostasis');
Query OK, 1 row affected (0.00 sec)
|
mysql> UPDATE gene SET annotation = 'DNA recombination' WHERE id = 99; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> SELECT * FROM gene WHERE symbol = 'CDC1'; +----+--------+----------+-------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+-------------------+ | 99 | CDC1 | S0002590 | DNA recombination | +----+--------+----------+-------------------+ 1 row in set (0.01 sec) |
mysql> DELETE FROM gene WHERE id = 99; Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM gene WHERE symbol = 'CDC1'; Empty set (0.00 sec) |
In the previous database, there were some genes that had more than >1 annotation;
mysql> SELECT * FROM gene WHERE symbol = 'AFR1'; +----+--------+----------+--------------------------------------------------------------------+ | id | symbol | dbxref | annotation | +----+--------+----------+--------------------------------------------------------------------+ | 10 | AFR1 | S0002492 | receptor signaling protein | | 23 | AFR1 | S0002492 | regulation of G-protein coupled receptor protein signaling pathway | +----+--------+----------+--------------------------------------------------------------------+ 2 rows in set (0.01 sec) |
This is undesirable, as it means we have to represent other information about the gene such as it's dbxref twice.
In order to fix this, we can change the database to have two tables instead of one:
CREATE TABLE gene (
id serial PRIMARY KEY,
symbol char (32),
dbxref char (32)
);
|
The process of splitting tables in this way is known as normalisation
A database has been set up at Fiocruz with this table structure.
mysql> SELECT * FROM gene; +----+--------+----------+ | id | symbol | dbxref | +----+--------+----------+ | 1 | ORC1 | S0004530 | | 2 | ORC2 | S0000264 | | 3 | ORC3 | S0003927 | | 4 | ORC4 | S0006366 | | 5 | ORC5 | S0005205 | | 6 | ORC6 | S0001160 | | 7 | AGA1 | S0005327 | | 8 | AGA2 | S0003000 | | 9 | SAG1 | S0003764 | | 10 | AFR1 | S0002492 | | 11 | MSN5 | S0002743 | | 12 | PEX5 | S0002652 | | 13 | PEX7 | S0002549 | | 14 | SEC63 | S0005780 | | 15 | SEC66 | S0000375 | | 16 | SEC72 | S0004283 | | 17 | STE3 | S0001661 | | 18 | STE2 | S0001868 | | 19 | LEM3 | S0005267 | +----+--------+----------+ 19 rows in set (0.01 sec) | |
| mysql> SELECT * FROM gene_annotation; +---------+--------------------------------------------------------------------+----------------+ | gene_id | annotation | literature_ref | +---------+--------------------------------------------------------------------+----------------+ | 1 | origin recognition complex | ref002970 | | 2 | origin recognition complex | ref004288 | | 3 | origin recognition complex | ref001883 | | 4 | origin recognition complex | ref001790 | | 5 | origin recognition complex | ref001458 | | 6 | origin recognition complex | ref004255 | | 7 | cell adhesion receptor | ref004193 | | 8 | cell adhesion receptor | ref002247 | | 9 | cell adhesion receptor | ref003055 | | 10 | receptor signaling protein | ref001792 | | 11 | importin alpha export receptor | ref001272 | | 12 | peroxisome targeting signal receptor | ref000447 | | 13 | peroxisome targeting signal receptor | ref004651 | | 14 | signal recognition particle receptor | ref001948 | | 15 | signal recognition particle receptor | ref003035 | | 16 | signal recognition particle receptor | ref002865 | | 17 | mating-type a-factor pheromone receptor | ref004887 | | 18 | mating-type alpha-factor pheromone receptor | ref001281 | | 14 | SRP-dependent, co-translational membrane targeting, docking | ref004243 | | 15 | SRP-dependent, co-translational membrane targeting, docking | ref004955 | | 16 | SRP-dependent, co-translational membrane targeting, docking | ref000929 | | 19 | cell surface receptor linked signal transduction | ref003586 | | 10 | regulation of G-protein coupled receptor protein signaling pathway | ref002567 | +---------+--------------------------------------------------------------------+----------------+ 23 rows in set (0.00 sec) |
mysql> SELECT gene.symbol, gene_annotation.annotation FROM gene_annotation, gene WHERE gene.id = gene_annotation.gene_id; +--------+--------------------------------------------------------------------+ | symbol | annotation | +--------+--------------------------------------------------------------------+ | ORC1 | origin recognition complex | | ORC2 | origin recognition complex | | ORC3 | origin recognition complex | | ORC4 | origin recognition complex | | ORC5 | origin recognition complex | | ORC6 | origin recognition complex | | AGA1 | cell adhesion receptor | | AGA2 | cell adhesion receptor | | SAG1 | cell adhesion receptor | | AFR1 | receptor signaling protein | | MSN5 | importin alpha export receptor | | PEX5 | peroxisome targeting signal receptor | | PEX7 | peroxisome targeting signal receptor | | SEC63 | signal recognition particle receptor | | SEC66 | signal recognition particle receptor | | SEC72 | signal recognition particle receptor | | STE3 | mating-type a-factor pheromone receptor | | STE2 | mating-type alpha-factor pheromone receptor | | SEC63 | SRP-dependent, co-translational membrane targeting, docking | | SEC66 | SRP-dependent, co-translational membrane targeting, docking | | SEC72 | SRP-dependent, co-translational membrane targeting, docking | | LEM3 | cell surface receptor linked signal transduction | | AFR1 | regulation of G-protein coupled receptor protein signaling pathway | +--------+--------------------------------------------------------------------+ 23 rows in set (0.00 sec) |
You can access a relational database such as MySQL from perl by using the DBI module.
This example shows how to write a script to query a database:
program 1
#!/usr/local/bin/perl # query_annots.pl |
This example shows how to write a script to add data to the same database:
program 2
#!/usr/local/bin/perl # query_annots.pl |
type "man DBI"
1. Type "mysql annotations" to start a MySQL session with the first of the two databases.
2. Type some of the SQL commands you saw previously in the lecture.
3. Type an SQL query to find the gene symbols for any genes that have "origin recognition complex" as their annotation.
4. Do a query that returns a count of the number of genes that are annotated as "origin recognition complex"
5. Do the same for receptor signaling proteins
6. Download the perl program
described above, run it with a few different search terms. What happens when you just hit return?
7. Modify the program so that it prompts the user for a gene dbxref and returns the annotation for that gene.
8. Repeat exercises 1-5 for the database: annotations2
9. Modify the perl program query_annots.pl to work on the database annotations2
10. Extend the program to allow addition of new genes and addition of annotations for these genes.
Berkeley Drosophila Genome Project
Date: Fri Jun 1 12:09:28 2001