College: College of Nursing
Department: Department of Basic Medical Sciences
Stage: 2
Instructor: عماد هادي حميد الطائي
11/07/2018 07:24:40
Genomic Data
1. Sequence Data Formats
   A. FASTA Format
   B. PHYLIP Format
2. Conversion of Sequence Formats: Using Readseq
3. Primary Sequence Databases: GenBank, EMBL-Bank, and DDBJ
   A. Sequence Submission to the Databases
      1. Submission to NCBI/GenBank
      2. Submission to ENA/EMBL-Bank
      3. Submission to DDBJ
   B. Availability of the Submitted Sequence to the Public
   C. Sequence Flatfile Format
      1. GenBank Sequence Flatfile Format
      2. EMBL-Bank Sequence Flatfile Format
   D. Sequence Accession Numbers and Redundancy in Primary Databases
   E. Divisions of the NCBI Primary Sequence Database
GENOMIC DATA
A 2001 publication by Mark Gerstein and colleagues was entitled “Interrelating Different Types of Genomic Data, from Proteome to Secretome: ’Oming in on Function.” This title captures the scope of the different types of genomic data. In genomic parlance, the suffix “ome” denotes the entire collection of an entity. For example, a transcriptome is the entire collection of RNA transcripts in a cell or tissue at a given time point. Although the transcriptome includes all RNA molecules, such as mRNA, rRNA, tRNA, and other noncoding RNAs, the term is mostly used in the context of mRNAs. Similarly, the proteome is the entire collection of proteins and the miRNome is the entire collection of microRNAs (miRNAs) in a cell or tissue at a given time point, and the interactome is the collection of all possible molecular interactions (or a subset of them) in a cell.
In addition to sequence and expression data, other kinds of data are genomic data in a broader sense, such as:
• genome-wide monoallelic expression data
• proteome data
• metabolome data
• protein-protein interaction data
• protein structural data
• protein-DNA interaction data
• gene and protein network data
• small noncoding RNA (ncRNA) data
FASTA Format
FASTA (pronounced “fast-A”) stands for “fast all.” Many sequence-analysis programs, including many sequence-alignment programs, require input in FASTA format. The minimum information required in a typical FASTA file is as follows: the first line is the definition (or description) line, which starts with the “>” sign, a crucial element of the format. Programs that expect FASTA input will fail to read the sequence if the “>” sign is missing. The “>” sign is followed by a definition (identifier) of the sequence, with no space between the “>” sign and the first letter of the definition. The sequence itself follows on the subsequent lines. The definition line can carry additional information, as shown in the example below.
Example:
>Mouse Oatp-5 mRNA
atccattcac tgactaacac aaggacaagt ttggagtgat
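The FASTA rules described above (a definition line beginning with “>”, followed by sequence lines) can be illustrated with a minimal reader. This is only a sketch: the function name parse_fasta is hypothetical, and the decision to strip the spaces between the 10-letter blocks is an assumption about how a downstream program would want the sequence.

```python
def parse_fasta(text):
    """Return a list of (definition, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None   # definition line of the record being read
    chunks = []     # sequence lines collected for that record
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            if header is None:
                # Mirrors the failure described in the text: no ">" sign, no record.
                raise ValueError("sequence data found before a '>' definition line")
            # Assumption: spaces only group the letters visually and are dropped.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = "\n".join([
    ">Mouse Oatp-5 mRNA",
    "atccattcac tgactaacac aaggacaagt ttggagtgat",
])
records = parse_fasta(example)
```

Here records holds one pair: the definition “Mouse Oatp-5 mRNA” and the 40-letter sequence with the spaces removed. Input that lacks the “>” sign raises an error instead of being silently misread.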
PHYLIP Format
PHYLIP stands for “phylogeny inference package.” It was developed by Dr. Joe Felsenstein of the University of Washington, Seattle, in the mid-1980s. PHYLIP is a phylogenetic analysis package that can carry out many different analyses, such as parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include DNA and protein sequences, gene frequencies, restriction sites, and distance matrices.
The PHYLIP input file is laid out as follows. The first line gives the number of species (in this example, four) and the number of characters (in this example, 16 nucleotides) as plain text, separated by a space. The information for each species starts with a 10-character species name. If the species name is shorter than 10 characters, spaces are added to bring it to exactly 10 characters. In the example, “H. sapiens” has a space before “sapiens,” but the other species names do not have any such space.
4 16
M.musculusggtcgtgcgc aggccc
R.norvegicatcacgctcc tagaac
H. sapiensaccacgccct ccacgt
P.troglodyacgcctcccc caagtc
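The layout just described (a header line with the species and character counts, then a fixed 10-character name followed by the sequence) can be read with a few lines of code. The sketch below is not part of the PHYLIP package; the function name parse_phylip is hypothetical, and it assumes the simple one-line-per-species layout shown in the example.

```python
def parse_phylip(text, name_width=10):
    """Read a one-line-per-species PHYLIP file into a {name: sequence} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: number of species, then number of characters.
    n_species, n_chars = (int(field) for field in lines[0].split())
    records = {}
    for line in lines[1:1 + n_species]:
        # The name always occupies the first 10 columns, padded with spaces.
        name = line[:name_width].strip()
        seq = line[name_width:].replace(" ", "")
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        records[name] = seq
    if len(records) != n_species:
        raise ValueError("species count does not match the header line")
    return records


example = "\n".join([
    "4 16",
    "M.musculusggtcgtgcgc aggccc",
    "R.norvegicatcacgctcc tagaac",
    "H. sapiensaccacgccct ccacgt",
    "P.troglodyacgcctcccc caagtc",
])
species = parse_phylip(example)
```

Because the name is taken by column position rather than by splitting on whitespace, “H. sapiens” is read correctly even though it contains a space.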
CONVERSION OF SEQUENCE FORMATS USING READSEQ
1. To change a given sequence format into any of the common formats used in sequence analysis or phylogenetic analysis, the Readseq program can be used.
2. It is a free web-based sequence file format conversion tool that reads the input sequence data and converts it to the format chosen by the user from a drop-down menu.
3. A total of 19 different file formats are supported by Readseq. Some examples of common formats supported by Readseq are GENBANK, NBRF, EMBL, GCG, DNA Strider, FASTA, PHYLIP, PIR, MSF, and CLUSTAL.
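To make concrete what such a conversion involves, the sketch below turns FASTA text into the PHYLIP layout. It is a hand-written illustration, not Readseq and not its algorithm; the function name fasta_to_phylip is hypothetical, and it assumes the sequences are already aligned to equal length, which PHYLIP requires.

```python
def fasta_to_phylip(fasta_text, name_width=10):
    """Convert aligned FASTA records to one-line-per-species PHYLIP text."""
    names, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:].strip())
            seqs.append("")
        elif names:
            seqs[-1] += line.replace(" ", "")
    if not seqs:
        raise ValueError("no FASTA records found")
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("PHYLIP needs sequences of equal length")
    # Header line: number of species and number of characters.
    out = [f"{len(seqs)} {lengths.pop()}"]
    for name, seq in zip(names, seqs):
        # Truncate or pad the name to exactly 10 columns.
        out.append(name[:name_width].ljust(name_width) + seq)
    return "\n".join(out)


fasta = ">M.musculus\nggtcgtgcgcaggccc\n>H. sapiens\naccacgccctccacgt\n"
phylip = fasta_to_phylip(fasta)
```

One caveat of this simple approach: truncating long FASTA definitions to 10 characters can produce duplicate names, so the names may need to be shortened by hand before conversion.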
PRIMARY SEQUENCE DATABASES: GENBANK, EMBL-BANK, AND DDBJ
1. Primary sequence databases are archival in nature. They contain raw sequence data (experimental results) with some interpretation and explanation, but the data are not curated. There are also redundancies in the primary databases; that is, the same sequence might be submitted by different laboratories, sometimes under different names.
2. The great majority of protein sequences in the primary databases are derived from computational translation of the open reading frame (ORF); hence, for the most part, they have not been experimentally verified.
3. Three primary databases contain all the sequence data generated so far: GenBank, the EMBL database (also called EMBL-Bank), and DDBJ (DNA Databank of Japan).
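Point 2 above refers to computational translation of an ORF. The sketch below shows the idea under simplifying assumptions: it uses the standard genetic code, looks only at the forward strand, starts at the first ATG, and stops at the first in-frame stop codon. The function name translate_first_orf and the input sequence are hypothetical, chosen for illustration only; the databases' own annotation pipelines are more involved.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_first_orf(dna):
    """Translate from the first ATG to the first in-frame stop codon."""
    seq = dna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon on this strand
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break  # stop codon ends the ORF
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequence, not taken from any database record.
protein = translate_first_orf("ccATGGCCTAAgg")
```

For the toy input, the reading frame is ATG GCC TAA, so protein is the two-residue string "MA". A product predicted this way is only a putative translation, which is why the text notes that such protein sequences are mostly not experimentally verified.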
Sequence Submission to the Databases
During the early years of these databases, sequence data were obtained from the published literature and entered manually into the database. GenBank began accepting direct submissions in 1993. Sequence information can be submitted to the databases irrespective of whether it has been published in a journal. However, any author reporting the cloning of a gene or an mRNA (as cDNA) in a publication must first submit the sequence to one of the three primary databases, obtain an accession number, and provide that accession number with the publication.
1. Submission to NCBI/GenBank
Sequences can be submitted to the GenBank database using its web-based sequence submission tool called BankIt, which is available at http://www.ncbi.nlm.nih.gov/BankIt/oldbankit.html. Complex submissions containing long sequences, multiple annotations, gapped sequences, or phylogenetic and population studies should be submitted using the Sequin submission tool (http://www.ncbi.nlm.nih.gov/Sequin/). For best performance, a single Sequin file should contain fewer than 10,000 sequences.
2. Submission to ENA/EMBL-Bank
Sequences can be submitted to EMBL-Bank using its web-based sequence submission tool called Webin. Webin allows submission of single and multiple sequences as well as very large numbers of sequences (bulk submissions). EMBL-Bank maintains the Sequence Read Archive (SRA) and Trace Archive.
3. Submission to DDBJ
The web page for sequence submission in DDBJ has recently undergone a complete makeover (http://www.ddbj.nig.ac.jp/faq/datasub-e.html). DDBJ recommends using the new web-based submission tool called the Nucleotide Sequence Submission System (NSSS; http://www.ddbj.nig.ac.jp/sub/websub-e.html). The NSSS replaced Sakura in November 2012; Sakura had been used for sequence submission for about 17 years (from 1995). However, if the sequences are very long or a large number of sequences are to be submitted at the same time, DDBJ recommends using its Mass Submission System (MSS), which is available at http://www.ddbj.nig.ac.jp/sub/mss_flow-e.html. Like NCBI and EMBL-Bank, DDBJ also maintains a Sequence Read Archive (SRA) and the DDBJ Trace Archive (DTA).
Availability of the Submitted Sequence to the Public
During submission of a sequence, the submitter may instruct the database to release the sequence information to the public at a later date, which can be many months after the actual date of submission. This usually happens when multiple laboratories are working on the same gene or protein and the submitting scientist's work is not yet ready for publication at the time the sequence is submitted.
Sequence Flatfile Format
During sequence submission, the submitter has to provide some relevant information about the sequence, such as the name of the mRNA/gene, the source, annotation, open reading frame, and putative translation product. All this information is displayed, along with the sequence, in a flatfile. The websites where the respective flatfile formats are discussed are as follows:
GenBank: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
DDBJ: http://www.ddbj.nig.ac.jp/sub/ref10-e.html
EMBL-Bank: ftp://ftp.ebi.ac.uk/pub/databases/embl/release/usrman.txt (EMBL-Bank User Manual)