Proteomics-Introduction

Definition: "The analysis of complete complements of proteins. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function. Initially encompassing just two-dimensional (2D) gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes large sets of proteins. The explosive growth of this field is driven by multiple forces - genomics and its revelation of more and more new proteins; powerful protein technologies, such as newly developed mass spectrometry approaches, global [yeast] two-hybrid techniques, and spin-offs from DNA arrays; and innovative computational tools and methods to process, analyze, and interpret prodigious amounts of data."

The theme of molecular biology research, in the past, has been oriented around the gene rather than the protein. This is not to say that researchers have neglected to study proteins, but rather that the approaches and techniques most commonly used have looked primarily at the nucleic acids and then later at the protein(s) implicated.

The main reason for this has been that the available technologies, and the inherent characteristics of nucleic acids, have made genes the low-hanging fruit. This situation has changed recently and continues to change as larger-scale, higher-throughput methods are developed for both nucleic acids and proteins. Most of the processes that take place in a cell are performed not by the genes themselves, but by the proteins they code for.

A disease can arise when a gene/protein is over- or under-expressed, when a mutation in a gene results in a malformed protein, or when post-translational modifications alter a protein's function. Thus, to truly understand a biological process, the relevant proteins must be studied directly. Studying proteins is more challenging than studying genes, however, because each protein has a complex 3-D structure that is closely tied to its function, much as the shape of a machine's parts is tied to what the machine does.

Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed (stressed, diseased, and/or drugged) states, and generally involves the separation, identification, and characterization of all of the proteins in a cell or tissue sample. The meaning of the term has since expanded, and it is now used loosely to refer to the study of which proteins a particular type of cell synthesizes, how much of each it synthesizes, how cells modify proteins after synthesis, and how all of those proteins interact.

An organism contains orders of magnitude more distinct proteins than genes. Taking into account alternative splicing (several variants per gene) and post-translational modifications (over 100 known types), the number of distinct protein forms is estimated to be a million or more.

Fortunately, proteins share features such as folds and motifs, which allow them to be categorized into groups and families and make the task of studying them more tractable. A broad range of technologies is used in proteomics, but the central paradigm has been 2-D gel electrophoresis (2D-GE) followed by mass spectrometry (MS). 2D-GE separates the proteins first by isoelectric point and then by size.

The individual proteins are subsequently removed from the gel and prepared, then analyzed by MS to determine their identity and characteristics. Various types of mass analyzers are used in proteomics MS, including quadrupole, time-of-flight (TOF), and ion trap, and each has its own particular capabilities. Tandem arrangements, such as quadrupole-TOF, are often used to provide more analytical power. The recent development of soft ionization techniques, namely matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI), has allowed large biomolecules to be introduced into the mass analyzer without completely decomposing their structures, or even without breaking them at all, depending on the design of the experiment.
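The relationship between a molecule's mass and the signal a mass analyzer records can be illustrated with a short calculation. A mass analyzer measures the mass-to-charge ratio (m/z) rather than the mass itself, and ESI typically produces ions carrying several protons, so one protein appears as a series of peaks. The following is a minimal sketch, assuming positive-mode ions of the form [M + zH]z+; the function names and the example mass are illustrative assumptions, not part of any instrument's software.

```python
# Mass of a proton in daltons (physical constant).
PROTON_MASS = 1.007276466


def mz_for_charge(neutral_mass: float, charge: int) -> float:
    """Return m/z of an [M + zH]z+ ion for a molecule of the given neutral mass."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz: float, charge: int) -> float:
    """Invert mz_for_charge: recover the neutral mass from an observed m/z and charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    # Hypothetical protein mass in daltons, chosen only for illustration.
    protein_mass = 16950.0
    for z in range(10, 16):
        print(f"z = {z:2d}  m/z = {mz_for_charge(protein_mass, z):.3f}")
```

Because higher charge states give lower m/z values, a large protein can fall within the working range of an analyzer that could not measure its singly charged ion.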

Some techniques couple liquid chromatography (LC) with MS, and others use LC by itself. Robotics has been applied to automate several steps in the 2D-GE/MS process, such as spot excision and enzyme digests. To determine a protein's structure, X-ray diffraction (XRD) and nuclear magnetic resonance (NMR) techniques are being improved to reach higher throughput and better performance.

For example, automated high-throughput crystallization methods are being used upstream of XRD to alleviate the crystallization bottleneck. For NMR, cryo-probes and flow probes shorten analysis time and decrease sample volume requirements. The hope is that determining about 10,000 protein structures will be enough to characterize the estimated 5,000 or so folds, which will feed into more reliable in silico structural prediction methods.

Structure by itself does not provide all of the desired information, but is a major step in the right direction. Protein chips are being developed for many of the processes in proteomics. For example, researchers are developing protocols for protein microarrays at institutions such as Harvard and Stanford as well as at several companies. These chips - grids of attached peptide fragments, attached antibodies, or gel "pads" with proteins suspended inside - will be used for various experiments such as protein-protein interaction studies and differential expression analysis.

They can also be used to filter out high-abundance proteins before further experiments; one of the major challenges in proteomics is isolating and analyzing the low-abundance proteins, which are thought to be the most important. There are many other types of protein chips, and the number will continue to grow. For example, microfluidics chips can combine the sample preparation steps prior to MS, such as enzyme digests, with nanoelectrospray ionization, all on one chip. Alternatively, the samples can be ionized directly from the surface of the chip, similar to a MALDI target. Microfluidics chips are also being combined with NMR.

In the next few years, various protein chips will be used increasingly in diagnostic applications as well. The bioinformatics side of proteomics includes both databases and analysis software. There are many public and private databases containing protein data ranging from sequences, to functions, to post-translational modifications. Typically, a researcher will first perform 2D-GE followed by MS; this yields a peptide mass fingerprint, a molecular weight, or even a sequence for each protein of interest, which can then be used to query databases for similarities or other information.
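The fingerprint-based database query described above can be sketched in a few lines. The idea is to digest each candidate sequence in silico, compute the masses of the resulting peptides, and count how many observed masses they explain. This is a minimal sketch, not the algorithm of any particular search engine: it assumes trypsin as the enzyme (cleaving after K or R unless the next residue is P), monoisotopic neutral masses, no missed cleavages or modifications, and a simple absolute mass tolerance. All function names are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence: str) -> list[str]:
    """Split a sequence after each K or R, except when the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide: str) -> float:
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


def count_matches(observed: list[float], sequence: str, tol: float = 0.01) -> int:
    """Count observed masses lying within tol Da of any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        1 for mass in observed
        if any(abs(mass - t) <= tol for t in theoretical)
    )


if __name__ == "__main__":
    # Hypothetical two-entry "database"; the sequences are made up for illustration.
    database = {
        "candidate_1": "GASPKVTCLRNDQEK",
        "candidate_2": "MHFKRPYWAAK",
    }
    # Pretend the spectrum contained exactly the peptides of candidate_1.
    observed = [peptide_mass(p) for p in tryptic_digest(database["candidate_1"])]
    ranked = sorted(database, key=lambda name: count_matches(observed, database[name]),
                    reverse=True)
    for name in ranked:
        print(name, count_matches(observed, database[name]))
```

Real search tools score matches statistically and allow for missed cleavages, modifications, and instrument-specific mass accuracy; the simple match count here only conveys the principle.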

Swiss-Prot and TrEMBL, developed in a collaboration between the Swiss Institute of Bioinformatics and the European Bioinformatics Institute, are currently the major databases dedicated to cataloging protein data, but there are dozens of more specialized databases and tools. New bioinformatics approaches are constantly being introduced. Recent customized versions of PSI-BLAST can, for example, utilize not only the curated protein entries in Swiss-Prot but also linguistic analyses of biomedical journal articles to help determine protein family relationships. Publicly available databases and tools are popular, but there are also several companies offering subscriptions to proprietary databases, which often include protein-protein interaction maps generated using the yeast two-hybrid (Y2H) system.

The proteomics market comprises instrument manufacturers, bioinformatics companies, laboratory product suppliers, service providers, and other biotech-related companies that can defy categorization. A given company often spans more than one of these areas. Many of the companies involved in the proteomics market have drug discovery as their major focus, while partnering with, or providing services or subscriptions to, other companies to generate short-term revenues. The market for proteomics products and services was estimated at $1.0B in 2000, growing at a compound annual growth rate (CAGR) of 42% to about $5.8B in 2005.
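The market figures above are consistent with simple compounding: $1.0B growing at 42% per year for five years gives 1.0 x 1.42^5, or about $5.77B. The short sketch below reproduces that arithmetic; the function names are illustrative, and the only inputs are the figures quoted in the text.

```python
def project_value(start: float, cagr: float, years: int) -> float:
    """Compound a starting value at a constant annual growth rate."""
    return start * (1.0 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes start to end over the given years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Figures from the text: $1.0B in 2000, 42% CAGR, five years to 2005.
    print(f"Projected 2005 market: ${project_value(1.0, 0.42, 5):.2f}B")
    print(f"CAGR implied by $1.0B -> $5.8B: {implied_cagr(1.0, 5.8, 5):.1%}")
```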

The major drivers will continue to be the biopharmaceutical industry's pursuit of blockbuster drugs and the recent technological advances that have allowed large-scale studies of genes and proteins. Alliances are becoming increasingly important in this field, because it is challenging for any one company to assemble all of the expertise needed to cover the different activities involved in proteomics. Synergies must be created by combining forces. For example, many companies working with mass spectrometry, both manufacturers and end-user labs, are collaborating with companies that develop protein chips. The technologies are a natural fit for many applications, such as microfluidic chips that provide nanoelectrospray ionization into a mass spectrometer.

There are many combinations of diagnostics, instrumentation, chip, and bioinformatics companies which create effective partnerships. In general, proteomics appears to hold great promise in the pursuit of biological knowledge. There has been a general realization that the large-scale approach to biology, as opposed to the strictly hypothesis-driven approach, will rapidly generate much more useful information.

The two approaches are not mutually exclusive, and the happy medium seems to be the formation of broad hypotheses which are subsequently investigated by designing large-scale experiments and selecting the appropriate data. Proteomics and genomics, and other varieties of 'omics', will all continue to complement each other in providing the tools and information for this type of research.