DBMODELING is a relational database of annotated comparative protein structure models and their metabolic pathway characterization. It focuses on enzymes identified in the genomes of Mycobacterium tuberculosis and Xylella fastidiosa. The main goal of the database is to provide structural models for use in docking simulations and drug design. However, because the accuracy of a structural model depends strongly on the sequence identity between template and target, users should be aware that only models of high structural quality are suitable for such efforts. Molecular modeling of these genomes generated a database in which all structural models were built from alignments with more than 30% sequence identity, yielding models of medium and high accuracy. All models in the database are publicly accessible at http://www.biocristalografia.df.ibilce.unesp.br/tools. The DBMODELING user interface provides user-friendly menus, so that all information can be printed in one stop from any web browser. Furthermore, DBMODELING provides a docking interface that allows the user to carry out geometric docking simulations against the molecular models available in the database. There are three other important homology model databases: MODBASE...
Protein database searches frequently can reveal biologically significant sequence relationships useful in understanding structure and function. Weak but meaningful sequence patterns can be obscured, however, by other similarities due only to chance. By searching a database for multiple as opposed to pairwise alignments, distant relationships are much more easily distinguished from background noise. Recent statistical results permit the power of this approach to be analyzed. Given a typical query sequence, an algorithm described here permits the current protein database to be searched for three-sequence alignments in less than 4 min. Such searches have revealed a variety of subtle relationships that pairwise search methods would be unable to detect.
PSI-BLAST is an iterative program to search a database for proteins
with distant similarity to a query sequence. We investigated over
a dozen modifications to the methods used in PSI-BLAST, with the goal
of improving accuracy in finding true positive matches. To evaluate
performance we used a set of 103 queries for which the true positives
in yeast had been annotated by human experts, and a popular measure
of retrieval accuracy (ROC) that can be normalized to take on values
between 0 (worst) and 1 (best). The modifications we consider novel
improve the ROC score from 0.758 ± 0.005
to 0.895 ± 0.003. This does not include
the benefits from four modifications we included in the ‘baseline’ version,
even though they were not implemented in PSI-BLAST version 2.0.
The improvement in accuracy was confirmed on a small second test
set. This test involved analyzing three protein families with curated
lists of true positives from the non-redundant protein database.
The modification that accounts for the majority of the improvement
is the use, for each database sequence, of a position-specific scoring
system tuned to that sequence’s amino acid composition.
The use of composition-based statistics is particularly beneficial
for large-scale automated applications of PSI-BLAST.
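The normalized ROC measure mentioned above can be illustrated with a short sketch. It assumes the truncated variant often written ROC_n, in which the ranked hit list is read until the n-th false positive; the abstract does not state which normalization was used, so the function name, the parameter n and the handling of short lists are illustrative assumptions.

```python
def roc_n(ranked_labels, n, total_true):
    """Truncated, normalized ROC score in [0, 1].

    ranked_labels: booleans ordered from best-scoring hit to worst,
                   True for a true positive, False for a false positive.
    n:             number of false positives to read before stopping (assumed).
    total_true:    number of true positives known to exist for the query.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")

    true_seen = 0    # true positives ranked above the current position
    false_seen = 0   # false positives encountered so far
    area = 0         # sum over false positives of true positives ranked above

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives were seen, the
    # missing false positives are treated as ranking below every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, n=2, total_true=3))
```

A perfect ranking (all true positives ahead of the first false positive) scores 1, and a ranking in which n false positives precede every true positive scores 0.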
The Yeast Protein Database (YPD) is a database for the proteins of the budding yeast, Saccharomyces cerevisiae. YPD is the first annotated database for the complete proteome of any organism. Now that the complete genome sequence of yeast is available, YPD contains entries for each of the characterized proteins and for each of the uncharacterized proteins predicted from the sequence. Contained in YPD are the calculated properties of each protein such as molecular weight and isoelectric point, experimentally determined properties such as subcellular localization and post-translational modifications, and extensive annotations from the yeast literature. YPD contains 25 000 lines of textual annotation that describe the known functions, mutant phenotypes, interactions, and other properties for the approximately 6000 proteins in the yeast proteome. The information in YPD is updated daily, and it is available on the World Wide Web at http://www.proteome.com/YPDhome.html.
The Yeast Protein Database (YPD) is a curated database for the proteome of Saccharomyces cerevisiae. It consists of approximately 6000 Yeast Protein Reports, one for each of the known or predicted yeast proteins. Each Yeast Protein Report is a one-page presentation of protein properties, annotation lines that summarize findings from the literature, and references. In the past year, the number of annotation lines has grown from 25 000 to approximately 35 000, and the number of articles curated has grown from approximately 3500 to >5000. Recently, new data types have been included in YPD: protein-protein interactions, genetic interactions, and regulators of gene expression. Finally, a new layer of information, the YPD Protein Minireviews, has recently been introduced. The Yeast Protein Database can be found on the Web at http://www.proteome.com/YPDhome.html
The Nuclear Protein Database (NPD) is a curated database that contains information on more than 1300 vertebrate proteins that are thought, or are known, to localise to the cell nucleus. Each entry is annotated with information on predicted protein size and isoelectric point, as well as any repeats, motifs or domains within the protein sequence. In addition, information on the sub-nuclear localisation of each protein is provided and the biological and molecular functions are described using Gene Ontology (GO) terms. The database is searchable by keyword, protein name, sub-nuclear compartment and protein domain/motif. Links to other databases are provided (e.g. Entrez, SWISS-PROT, OMIM, PubMed, PubMed Central). Thus, NPD provides a gateway through which the nuclear proteome may be explored. The database can be accessed at http://npd.hgu.mrc.ac.uk and is updated monthly.
The Arabidopsis Mitochondrial Protein Database is an Internet-accessible relational database containing information on the predicted and experimentally confirmed protein complement of mitochondria from the model plant Arabidopsis thaliana (http://www.ampdb.bcs.uwa.edu.au/). The database was formed using the total non-redundant nuclear and organelle encoded sets of protein sequences and allows relational searching of published proteomic analyses of Arabidopsis mitochondrial samples, a set of predictions from six independent subcellular-targeting prediction programs, and orthology predictions based on pairwise comparison of the Arabidopsis protein set with known yeast and human mitochondrial proteins and with the proteome of Rickettsia. A variety of precomputed physical–biochemical parameters are also searchable as well as a more detailed breakdown of mass spectral data produced from our proteomic analysis of Arabidopsis mitochondria. It contains hyperlinks to other Arabidopsis genomic resources (MIPS, TIGR and TAIR), which provide rapid access to changing gene models as well as hyperlinks to T-DNA insertion resources, Massively Parallel Signature Sequencing (MPSS) and Genome Tiling Array data and a variety of other Arabidopsis online resources. It also incorporates basic analysis tools built into the query structure such as a BLAST facility and tools for protein sequence alignments for convenient analysis of queried results.
The Arabidopsis Nucleolar Protein Database (http://bioinf.scri.sari.ac.uk/cgi-bin/atnopdb/home) provides information on 217 proteins identified in a proteomic analysis of nucleoli isolated from Arabidopsis cell culture. The database is organized on the basis of the Arabidopsis gene identifier number. The information provided includes protein description, protein class, whether or not the plant protein has a homologue in the most recent human nucleolar proteome and the results of reciprocal BLAST analysis of the human proteome. In addition, for one-third of the 217 Arabidopsis nucleolar proteins, localization images are available from analysis of full-length cDNA–green fluorescent protein (GFP) fusions and the strength of signal in different parts of the cell—nucleolus, nucleolus-associated structures, nucleoplasm, nuclear bodies and extra-nuclear—is provided. For each protein, the most likely human and yeast orthologues, where identifiable through BLASTX analysis, are given with links to relevant information sources.
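The reciprocal BLAST analysis mentioned above is commonly reduced to a reciprocal-best-hit test, which can be sketched as follows. The sketch assumes the best hit per query has already been extracted from the BLAST output in each direction; the identifiers are hypothetical placeholders, and the database's actual procedure is not described in the abstract.

```python
def reciprocal_best_hits(forward_best, reverse_best):
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b.

    forward_best: dict mapping each protein of proteome A to its best hit in B.
    reverse_best: dict mapping each protein of proteome B to its best hit in A.
    """
    return {
        a: b
        for a, b in forward_best.items()
        if reverse_best.get(b) == a
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    plant_to_human = {"plant_1": "human_x", "plant_2": "human_y"}
    human_to_plant = {"human_x": "plant_1", "human_y": "plant_9"}
    print(reciprocal_best_hits(plant_to_human, human_to_plant))
```

Only pairs that agree in both directions are kept, which is why such a test is stricter than a one-way best-hit search.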
A novel “Ion Accounting” algorithm has been developed for protein identification using time-resolved LC-MSE data from 1D and 2D LC-MS experiments. The data from a 1D LC-MS analysis generate a series of precursor-product tables that are initially queried against a protein database using the “Ion Accounting” algorithm, whereby each precursor and product is associated with only a single peptide identification. The database search is a hierarchical process containing three modules. With the first module, the data are matched to only correctly cleaved proteolytic peptides whose precursor and product ion mass tolerances are within 10 and 20 ppm, respectively. With the second module, precursor and product ions that have not yet been assigned are queried against a subset database of the identified proteins from the first module. The second module includes missed cleavages, in-source fragments, neutral losses, and variable modifications. With the last module, the remaining unidentified ions are considered against the complete database for additional protein identifications (including PMF) with improved selectivity and specificity from the elimination of those precursor and product ions from the first two modules.
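The mass tolerance test used in the first module can be illustrated with a minimal sketch. Only the 10 ppm and 20 ppm limits come from the abstract; the function names, the data layout and the rule that a precursor must pass before its products are considered are assumptions made for illustration, not the published algorithm.

```python
PRECURSOR_TOL_PPM = 10.0  # precursor tolerance quoted in the abstract
PRODUCT_TOL_PPM = 20.0    # product ion tolerance quoted in the abstract


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed, theoretical, tol_ppm):
    """True if the absolute ppm error does not exceed tol_ppm."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def first_pass_match(obs_precursor, theo_precursor, obs_products, theo_products):
    """Hypothetical first-pass test for one candidate peptide.

    Returns the theoretical product ions matched by at least one observed
    product ion, or an empty list if the precursor is out of tolerance.
    """
    if not within_tolerance(obs_precursor, theo_precursor, PRECURSOR_TOL_PPM):
        return []
    return [
        theo
        for theo in theo_products
        if any(within_tolerance(obs, theo, PRODUCT_TOL_PPM) for obs in obs_products)
    ]


if __name__ == "__main__":
    matched = first_pass_match(
        obs_precursor=1000.005,        # about 5 ppm from the theoretical value
        theo_precursor=1000.0,
        obs_products=[500.005, 600.02],
        theo_products=[500.0, 600.0],
    )
    print(matched)
```

Because the tolerance is relative, the same ppm limit corresponds to a wider absolute window at higher mass.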
Completion of human genome sequencing has greatly accelerated functional genomic research. Full-length cDNA clones are essential experimental tools for functional analysis of human genes. In one of the projects of the New Energy and Industrial Technology Development Organization (NEDO) in Japan, the full-length human cDNA sequencing project (FLJ project), nucleotide sequences of approximately 30 000 human cDNA clones have been analyzed. The Gateway system is a versatile framework to construct a variety of expression clones for various experiments. We have constructed 33 275 human Gateway entry clones from full-length cDNAs, representing to our knowledge the largest collection in the world. Utilizing these clones with a highly efficient cell-free protein synthesis system based on wheat germ extract, we have systematically and comprehensively produced and analyzed human proteins in vitro. Sequence information for both amino acids and nucleotides of open reading frames of cDNAs cloned into Gateway entry clones and in vitro expression data using those clones can be retrieved from the Human Gene and Protein Database (HGPD, http://www.HGPD.jp). HGPD is a unique database that stores the information of a set of human Gateway entry clones and protein expression data and helps the user to search the Gateway entry clones.
The Ciona intestinalis protein database (CIPRO) is an integrated protein database for the tunicate species C. intestinalis. The database is unique in two respects: first, because of its phylogenetic position, Ciona is a suitable model for understanding vertebrate evolution; and second, the database includes original large-scale transcriptomic and proteomic data. Ciona intestinalis has also been a favorite of developmental biologists. Therefore, large amounts of data exist on its development and morphology, along with a recent genome sequence and gene expression data. The CIPRO database is aimed at collecting those published data as well as providing unique information from unpublished experimental data, such as 3D expression profiling, 2D-PAGE and mass spectrometry-based large-scale analyses at various developmental stages, curated annotation data and various bioinformatic data, to facilitate research in diverse areas, including developmental, comparative and evolutionary biology. For medical and evolutionary research, homologs in humans and major model organisms are intentionally included. The current database is based on a recently developed KH model containing 36 034 unique sequences, but for higher usability it covers all 89 683 known and predicted proteins from all gene models for this species. Of these sequences...
The Viral Protein Database (VPDB) is an interactive database of three-dimensional viral protein structures. Our aim is to provide a comprehensive resource
to the structural virology community, with an emphasis on the description of data derived from structural biology. Currently, VPDB includes ~1,670 viral
protein structures from >277 viruses with more than 465 virus strains. The whole database can be easily accessed through a convenient text search.
Interactivity has been enhanced by using Jmol, WebMol and Strap to visualize the molecular structure of the viral proteins.
The Human Gene and Protein Database (HGPD; http://www.HGPD.jp/) is a unique database that stores information on a set of human Gateway entry clones in addition to protein expression and protein synthesis data. The HGPD was launched in November 2008, and 33 275 human Gateway entry clones have been constructed from the open reading frames (ORFs) of full-length cDNA, thus representing the largest collection in the world. Recently, research objectives have focused on the development of new medicines and the establishment of novel diagnostic methods and medical treatments, and studies using proteins and protein information, which are closely related to gene function, have been undertaken. For this update, we constructed an additional 9974 human Gateway entry clones, giving a total of 43 249. This set of human Gateway entry clones was named the Human Proteome Expression Resource, known as ‘HuPEX’. In addition, we classified the clones into 10 groups according to protein function. Moreover, in vivo cellular localization data of proteins for 32 651 human Gateway entry clones were included for retrieval from the HGPD. In ‘Information Overview’, which presents the search results, the ORF region of each cDNA is now displayed, allowing the Gateway entry clones to be searched more easily.
Soybean continues to serve as a rich and inexpensive source of protein for humans and animals. A substantial amount of
information has been reported on the genotypic variation and beneficial genetic manipulation of soybeans. For better
understanding of the consequences of genetic manipulation, elucidation of soybean protein composition is necessary, because of its
direct relationship to phenotype. We have conducted studies to determine the composition of storage, allergen and anti-nutritional
proteins in cultivated soybean using a combined proteomics approach. Two-dimensional polyacrylamide gel electrophoresis (2DPAGE)
was implemented for the separation of proteins along with matrix-assisted laser desorption/ionization time of flight mass
spectrometry (MALDI-TOF-MS) and liquid chromatography mass spectrometry (LC-MS/MS) for the identification of proteins. Our
analysis resulted in the identification of several proteins, and a web-based database named the soybean protein database (SoyProDB)
was subsequently built to house the data and allow scientists to search them. This database will be useful to scientists who wish to
genetically alter soybean to obtain higher-quality storage proteins, and will also help consumers better understand the
proteins that make up soy products available on the market. The database is freely accessible.
Halophilic archaea/bacteria adapt to different salt concentrations, namely extreme, moderate and low. These types of adaptation may occur as a result of modification of protein structure and other changes in different cell organelles. Thus proteins may play an important role in the adaptation of halophilic archaea/bacteria to saline conditions. The Halophile protein database (HProtDB) is a systematic attempt to document the biochemical and biophysical properties of proteins from halophilic archaea/bacteria which may be involved in adaptation of these organisms to saline conditions. In this database, various physicochemical properties such as molecular weight, theoretical pI, amino acid composition, atomic composition, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) have been listed. These physicochemical properties play an important role in identifying the structure, bonding pattern and function of specific proteins. This database is a comprehensive, manually curated, non-redundant catalogue of proteins. The database currently contains properties of 59 897 proteins extracted from 21 different strains of halophilic archaea/bacteria. The database can be accessed through link.
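Three of the listed properties can be computed directly from a sequence, as the following sketch shows. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and the usual aliphatic index formula based on the mole fractions of Ala, Val, Ile and Leu; the abstract does not say which tools or scales HProtDB used, so these choices and the function names are illustrative.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(sequence):
    """Mole fraction of each residue present in the sequence."""
    counts = Counter(sequence)
    return {aa: count / len(sequence) for aa, count in counts.items()}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Aliphatic index: 100 * (x_Ala + 2.9 * x_Val + 3.9 * (x_Ile + x_Leu))."""
    frac = composition(sequence)
    return 100.0 * (
        frac.get("A", 0.0)
        + 2.9 * frac.get("V", 0.0)
        + 3.9 * (frac.get("I", 0.0) + frac.get("L", 0.0))
    )


if __name__ == "__main__":
    seq = "AVIL"  # toy sequence, for illustration only
    print(composition(seq))
    print(gravy(seq))
    print(aliphatic_index(seq))
```

The sketch expects a non-empty sequence of standard one-letter codes; ambiguous or non-standard residues would need explicit handling.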
Mass spectrometric (MS) data of human cell secretomes are usually run through the conventional human database for identification. However, the search may result in false identifications due to contamination of the secretome with fetal bovine serum (FBS) proteins. To overcome this challenge, here we provide a composite protein database including human as well as 199 FBS protein sequences for MS data search of human cell secretomes. Searching against the human-FBS database returned more reliable results with fewer false-positive and false-negative identifications compared to using either a human only database or a human-bovine database. Furthermore, the improved results validated our strategy without complex experiments like SILAC. We expect our strategy to improve the accuracy of human secreted protein identification and to also add value for general use.
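Building such a composite database amounts to concatenating two FASTA collections while keeping the origin of each entry visible. The sketch below assumes plain FASTA text as input; the tag `FBS_CONTAMINANT`, the function names and the toy records are hypothetical and are not taken from the study.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_composite(human_fasta, fbs_fasta, fbs_tag="FBS_CONTAMINANT"):
    """Concatenate human and FBS entries, prefixing FBS headers with a tag."""
    records = list(parse_fasta(human_fasta))
    for header, sequence in parse_fasta(fbs_fasta):
        records.append((f"{fbs_tag}|{header}", sequence))
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


if __name__ == "__main__":
    # Toy records with hypothetical names and sequences.
    human = ">human_protein_1\nMKTAYIAK\n>human_protein_2\nMSDNE\nLLK\n"
    fbs = ">bovine_serum_protein_1\nMKWVTFIS\n"
    print(build_composite(human, fbs))
```

Tagging the FBS headers lets identifications assigned to serum proteins be filtered out after the search instead of being mistaken for secreted human proteins.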
Background: Recombinant DNA technology has been extensively employed to generate a variety of products from genetically modified organisms (GMOs) over the last decade, and the development of technologies capable of analyzing these products is crucial to understanding gene expression patterns. Liquid chromatography coupled with mass spectrometry is a powerful tool for analyzing protein contents and possible expression modifications in GMOs. Specifically, the NanoUPLC-MSE technique provides rapid protein analyses of complex mixtures with supported steps for high sample throughput, identification and quantification using low sample quantities with outstanding repeatability. Here, we present an assessment of peptide and protein identification and quantification in the seed contents of the soybean cultivar EMBRAPA BR16 using NanoUPLC-MSE and provide a comparison to the theoretical tryptic digestion of soybean sequences from the UniProt database. Results: The NanoUPLC-MSE peptide analysis resulted in 3,400 identified peptides, 58% of which were identified to have no miscleavages. The experiment revealed that 13% of the peptides underwent in-source fragmentation, and 82% of the peptides were identified with a mass measurement accuracy of less than 5 ppm. More than 75% of the identified proteins have at least 10 matched peptides...
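A theoretical tryptic digestion of the kind referred to above can be sketched in a few lines. The sketch assumes the common rule that trypsin cleaves after K or R unless the next residue is P; the abstract does not state which cleavage rule or software was used, and the function name and parameters are illustrative.

```python
def tryptic_peptides(sequence, missed_cleavages=0, min_length=1):
    """Return (peptide, number_of_missed_cleavages) pairs for one protein.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    """
    # Cleavage positions, expressed as slice boundaries in the sequence.
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        for missed in range(missed_cleavages + 1):
            end = start + 1 + missed
            if end >= len(sites):
                break
            peptide = sequence[sites[start]:sites[end]]
            if len(peptide) >= min_length:
                peptides.append((peptide, missed))
    return peptides


if __name__ == "__main__":
    # Toy sequence: the R at position 3 is followed by P and is not cleaved.
    for peptide, missed in tryptic_peptides("AKRPGKLR", missed_cleavages=1):
        print(peptide, missed)
```

Peptides reported with zero missed cleavages correspond to the fully cleaved set against which observed miscleavage rates can be compared.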
Experimentally determining the biological function of a protein is a process known as protein characterization. Establishing the role a specific protein plays is a vital step toward fully understanding the biochemical processes that drive life in all its forms. In order for researchers to efficiently locate and benefit from the results of protein characterization experiments, the relevant information is compiled into public databases. To populate such databases, curators, who are experts in the biomedical domain, must search the literature to obtain the relevant information, as the experiment results are typically published in scientific journals. The database curators identify relevant journal articles, read them, and then extract the required information into the database. In recent years the rate of biomedical research has greatly increased, and database curators are unable to keep pace with the number of articles being published. Consequently, maintaining an up-to-date database of characterized proteins, let alone populating a new database, has become a daunting task.
In this thesis, we report our work to reduce the effort required from database curators in order to create and maintain a database of characterized proteins. We describe a system we have designed for automatically identifying relevant articles that discuss the results of protein characterization experiments. Classifiers are trained and tested using a large dataset of abstracts...
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (http://www.ispider.manchester.ac.uk/cgi-bin/ProteomicSearch.pl), an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.