Proteomics & Bioinformatics Protein Fractionation for Quantitative Plasma Proteomics by Semi-Selective Precipitation

Blood plasma is a highly complex mixture of proteins, metabolites and lipids, and a rich source of potential biomarkers for a range of diseases and conditions. The wide range in protein abundance poses a tremendous challenge for plasma proteomics. However, as a relatively small number of proteins make up most of the total protein pool, the concentration range can be compressed by depletion of abundant proteins, such as albumin. To reduce sample complexity and increase the protein coverage, we have developed a sample preparation method based on semi-selective precipitation with acetonitrile at different pH and built a data analysis pipeline, combining different search strategies. The method we propose is reproducible and easily parallelised (high throughput), and may be well suited to fractionate plasma for label-free quantitative proteomics in large clinical studies. Up to 90% of albumin and other abundant proteins were removed by adding an equal volume of acetonitrile to the samples adjusted to pH 5.


Introduction
Plasma contains carbohydrates, lipids, salts, vitamins, amino acids, nucleic acids, hormones and around 75 mg/mL protein [1]. Proteins from tissues leak into the extracellular fluid and are carried through the lymphatic system to end up in the plasma. The carrier protein albumin dominates with 45-50% of the total protein concentration, while immunoglobulin G and transferrin contribute 8-20% and 3-7%, respectively [2]. These and other highly abundant, large proteins mask less abundant ones by decreasing their relative concentration and through effects such as ion suppression in electrospray ionization mass spectrometry. Although changes in the abundant proteins may also be indicative of the physiological status of the organism [3], low-abundant proteins, for instance from tissue leakage, may mark an early state of a disease such as cancer [4,5]. Plasma is easily sampled, but the concentration range of its proteins, spanning from picograms to milligrams per millilitre, is a major challenge in clinical proteomics.
Numerous techniques have been suggested and employed to reduce the complexity of the plasma proteome, including depletion of abundant proteins [6], nonspecific enrichment of low-abundant proteins via combinatorial peptide libraries [7] and specific enrichment of targeted peptides after enzymatic digestion [8]. Complexity reduction can be performed by classical methods such as centrifugation or extraction with organic solvents [9] or by immunodepletion [10]. A range of depletion columns, spin cartridges and affinity capture beads for removal of albumin, IgG [11] and many other abundant proteins are commercially available. Several of these commercial kits have previously been compared by Chromy et al. [12] and Björhall et al. [13] for their utility in plasma proteomics. Immunoaffinity is efficient in depleting selected abundant proteins, but many different antibodies are needed to significantly reduce the concentration range of proteins in plasma. As immunoaffinity depletion is carried out under native conditions, less abundant proteins may remain bound to the abundant proteins being depleted, for instance albumin, and be removed with them. Typically, commercial affinity columns use avian IgY antibodies against the most abundant ("top") plasma proteins, and remove from 50% (anti-albumin only) to 99% (top-20) of total plasma protein. In theory, assuming 100% recovery, low-abundance proteins would then be enriched by a factor of 2 to 100, respectively. However, reproducibly manufacturing and applying columns with a large number of different antibodies is not trivial. For instance, we have previously observed significant column-to-column variation in commercial affinity depletion columns (unpublished results). Although this may not be a serious problem in a general exploration of the plasma proteome, or in studies where proteins have been isotopically (or otherwise) labelled prior to the depletion/enrichment step, poor reproducibility poses a serious problem for label-free studies.
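The enrichment factors quoted above follow from simple arithmetic: if a fraction f of the total protein is removed and the protein of interest is fully recovered, its relative concentration rises by 1/(1 - f). The sketch below is only an illustration of that relation; the function name and the optional `recovery` parameter are our own assumptions and not part of any depletion kit or published protocol.

```python
def enrichment_factor(fraction_removed: float, recovery: float = 1.0) -> float:
    """Fold-enrichment of a non-depleted protein relative to total protein.

    fraction_removed: fraction of total plasma protein removed (0 <= f < 1).
    recovery: fraction of the protein of interest that survives depletion
              (hypothetical parameter; 1.0 is the ideal case in the text).
    """
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    if not 0.0 <= recovery <= 1.0:
        raise ValueError("recovery must be in [0, 1]")
    return recovery / (1.0 - fraction_removed)


# The two cases mentioned in the text, assuming 100% recovery.
for label, f in [("anti-albumin only", 0.50), ("top-20", 0.99)]:
    print(f"{label}: {enrichment_factor(f):.0f}-fold")
```

With incomplete recovery the gain shrinks proportionally; for example, a recovery of 50% after a top-20 depletion would give about 50-fold rather than 100-fold enrichment.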
Many of the abundant proteins in plasma have molecular weights exceeding 60 kDa (e.g. albumin, transferrin, fibrinogen, IgA, α-2-antitrypsin, apolipoproteins, and acid-1-glycoprotein). A simple and semi-selective depletion of many of these large and highly abundant plasma proteins is possible by precipitation using organic solvents such as acetonitrile, and this has been demonstrated with reproducible results in plasma and serum from several species [14-18]. This procedure results in a separation in which most of the more soluble, low molecular weight proteins are left in the supernatant while the larger proteins precipitate. Acetonitrile has also been shown to release albumin-bound proteins, which could be potential biomarkers [5]. Protein solubility is also affected by pH, ionic strength and temperature [19]. By adjusting one or more of these parameters, the precipitation may be optimized to remove as much as possible of abundant proteins such as albumin in a single step, while keeping low-abundant proteins in solution. Alternatively, several precipitation steps can be combined for a more efficient depletion of abundant proteins and increased recovery of low-abundant proteins. Semi-selective precipitation may also be tuned to partition the proteome into two or more complementary fractions with limited overlap, for increased combined coverage of the proteome. In this work, we focused on the effect of pH on the depletion of plasma by acetonitrile and on the method's suitability for clinical applications. Such a simple precipitation procedure is attractive for large-scale studies, as it is inexpensive, scalable, easy to parallelize, potentially robust and reproducible, and not dependent on expensive affinity separations with their concomitant batch-to-batch or column-to-column variation, which is problematic for label-free methods.

Sample preparation and organic precipitation
Human plasma from healthy volunteers was collected into BD Vacutainer® tubes with 18.0 mg K2EDTA (K2E, REF 367525, BD Vacutainer Systems, Plymouth, UK) and immediately spun down at 1,300 × g for 10 minutes at 21°C. Aliquots of 50 µl were stored at -80°C until use. Samples were thawed at 4°C and then centrifuged at 16,100 × g at 4°C for 1 minute. The pH was adjusted in three identical aliquots to 5.0, 7.0 and 9.0 by adding acetic acid and ammonium hydroxide directly to the sample. Three other aliquots were diluted 1:10 (v:v) with 100 mM ammonium acetate buffer at the corresponding pH values, to investigate the effect of protein concentration. For protein precipitation, acetonitrile was mixed with the samples in a 1:1 (v:v) ratio, and the samples were vortexed three times at 1,000 rpm for 5 s and then incubated for 10 minutes in an ultrasonic bath at room temperature. The vortexing and sonication steps were repeated twice before the samples were centrifuged at 16,100 × g at 4°C for 10 minutes. The supernatants after precipitation were collected in fresh Eppendorf tubes, and both the pellets and the supernatants were lyophilized. The lyophilized samples were vigorously vortexed and sonicated in BugBuster Master Mix (Novagen, Merck KGaA, Germany), using 100 µl for pellets and 30 µl for supernatants. The pellets were resuspended in a Bullet Blender (Next Advance Inc., Averill Park, NY) with 0.1 mm glass beads, which were then removed by centrifugation through 30 µm pore size micro-spin columns (Thermo Fisher Scientific, Waltham, MA) at the lowest speed. The protein concentration was then determined using a Bicinchoninic Acid (BCA) protein assay kit (Thermo Fisher Scientific).
The BugBuster protein extraction reagent was developed for lysis and protein solubilisation from bacteria, but it is routinely used in our laboratory and is directly compatible with BCA analysis, SDS-PAGE and tryptic digestion, and the samples are easily cleaned up for analysis by liquid chromatography-mass spectrometry (LC-MS).

SDS-PAGE and in-solution digestion
Thirty micrograms of protein (as determined by BCA) per sample were loaded on a 1-mm, 10-well 4-12% NuPAGE® Bis-Tris gel (Invitrogen, Carlsbad, CA). All samples were diluted in 2X NuPAGE® Sample Buffer (Invitrogen). Proteins were separated in the gel for 1 h at 180 V. The gel was stained with NuPAGE® Colloidal Blue (Invitrogen) overnight at room temperature and destained with Milli-Q water until the background was transparent. For in-solution tryptic digestion, 20 µg of each sample was used. The digestion was performed after reduction with dithiothreitol (DTT; 10 mM, 56°C for 45 min) and alkylation with iodoacetamide (IAA; 25 mM, 1 h in the dark at room temperature), in 25 mM ammonium bicarbonate (ABC) with a protein-to-trypsin ratio of 20:1 for 12 h at 37°C. The reaction was then quenched with 5 µL of 10% TFA. The samples were stored at -35°C until analysis.

Liquid chromatography-mass spectrometry
Peptides derived from all protein digests were separated by splitless, parallel reversed-phase C18 ultra-high pressure liquid chromatography on a NanoLC-Ultra 2D plus (Eksigent, Dublin, CA) with PepMap C18 trap columns (5-mm, 300 µm i.d.; Dionex, Sunnyvale, CA) and ChromXP C18 analytical columns (15 cm, 300 µm i.d.; Eksigent), and with an additional loading pump for fast sample loading and desalting. Samples were analyzed for 120 min using a linear gradient from 4 to 33% acetonitrile in 0.05% formic acid at a flow rate of 2 µl/min. The MS and MS/MS (CID-only) spectra were recorded on an amaZon ETD high-capacity 3D ion trap with a CaptiveSpray source (Bruker Daltonics, Bremen, Germany). The ten most abundant multiply charged species in the m/z range 300-1300 were automatically selected for MS/MS, with one minute of dynamic exclusion after having been selected twice.
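As a small illustration of the linear gradient described above, the acetonitrile percentage at any time point can be interpolated between the start and end values. This sketch assumes that the gradient spans the full 120 min of the run and ignores any loading, wash or re-equilibration steps, which are not specified in the text; the function name is hypothetical.

```python
def percent_acetonitrile(t_min: float,
                         start: float = 4.0,
                         end: float = 33.0,
                         duration_min: float = 120.0) -> float:
    """Acetonitrile percentage at time t_min for a linear gradient.

    Defaults follow the values given in the text (4 to 33% in 120 min);
    applying them to the whole run is an assumption.
    """
    if not 0.0 <= t_min <= duration_min:
        raise ValueError("t_min must lie within the gradient")
    return start + (end - start) * t_min / duration_min


# Composition at the start, midpoint and end of the gradient.
for t in (0, 60, 120):
    print(t, "min:", percent_acetonitrile(t), "% acetonitrile")
```

Under these assumptions the slope is 29/120, or roughly 0.24% acetonitrile per minute.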

Data analysis
The complete experiment was analyzed in a single Taverna scientific workflow [20] (Figures 1 and 2), with all external software installed in its default location. For each sample, the raw LC-MS/MS files were first converted to mzXML [21] using compassXport 3.0.5 (Bruker). The mzXML files were then processed as in the open source Trans-Proteomic Pipeline (TPP) [22], using both the X!Tandem [22,23] database search engine and the SpectraST spectral library search. With X!Tandem, we used the UniProt human reference proteome set (2012-02-05, canonical sequences only), carbamidomethylation as the only, fixed modification, the k-score plug-in [22] and a monoisotopic mass error of ±0.5 Da, including the first and second isotopic peaks. For SpectraST, the NIST human spectral library from 2011-05-26 was searched with default settings, except for carbamidomethylation ("CAM") of cysteines. All search results (in pepXML [22]) were analyzed by PeptideProphet [24], then refined and combined by InterProphet. Peptide-spectrum matches with a PeptideProphet probability p ≥ 0.95, corresponding to approximately a 1% false discovery rate (FDR), were included in the analysis. For each protein sequence in the FASTA file, a BeanShell component in the comparison workflow (Figure 2) calculated the molecular weight using average masses of amino acids, the GRAVY score (using amino acid hydropathy information from Kyte and Doolittle [25]) and the pI (using pK values from Bjellqvist et al. [26]). The protein spectral counts (number of peptide-spectrum matches per protein) in the different fractions were then compared with respect to this information and visualized using an Rshell. The raw mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository [27] with the dataset identifier PXD000042. The workflow is freely available via www.myExperiment.org ("Plasma Precipitation Analysis").
Figure caption (fragment): ... Figure 1 and the two workflows may be combined into a single, complete workflow.
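The three sequence-derived properties computed in the workflow can be sketched in a few lines. The Python code below is not the BeanShell component itself, only a minimal illustration of the same kind of calculation. The hydropathy indices are those of Kyte and Doolittle and the residue masses are standard average masses, but the pK values are rounded, generic values chosen for illustration and are an assumption; they are not the Bjellqvist values used in the study, so the resulting pI will differ somewhat. All function names and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy index per residue.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153

# ASSUMPTION: rounded, generic pK values for illustration only;
# the study used the values of Bjellqvist et al. [26].
PK_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PK_N_TERM, PK_C_TERM = 8.0, 3.1


def molecular_weight(seq):
    """Average molecular weight (Da) of an unmodified linear chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle index."""
    return sum(HYDROPATHY[aa] for aa in seq) / len(seq)


def net_charge(seq, ph):
    """Henderson-Hasselbalch net charge of the chain at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PK_N_TERM))    # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERM - ph))   # C-terminus
    for aa, pk in PK_POSITIVE.items():
        charge += seq.count(aa) / (1.0 + 10 ** (ph - pk))
    for aa, pk in PK_NEGATIVE.items():
        charge -= seq.count(aa) / (1.0 + 10 ** (pk - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """pH of zero net charge, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid   # still positive: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


peptide = "ACDEFGHIK"  # hypothetical sequence, not from the dataset
print(round(molecular_weight(peptide), 2),
      round(gravy(peptide), 2),
      round(isoelectric_point(peptide), 2))
```

Bisection works here because the net charge decreases monotonically with pH, so there is exactly one zero crossing between pH 0 and 14.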

Results and Discussion
The method for protein fractionation explored here was designed to partition the proteins in the sample, reducing the relative abundance of the dominating proteins and, if possible, simultaneously removing contaminants that might interfere with protein quantitation and biomarker detection in body fluids such as plasma. However, at high protein concentrations, such as in plasma, there is always a high risk of co-precipitating otherwise soluble proteins. Experimentally, we indeed found the preparation of diluted samples to be more robust and less time-consuming, and the results were highly reproducible (Figure 3). This method could therefore be more easily applied in larger studies. The fractionation of proteins in plasma by acetonitrile is expected to be correlated with the molecular weight and hydrophobicity (at a given pH) of the proteins [28]. It was possible to influence the solubility of different plasma proteins by altering the pH of the buffer. For example, proteins with pI 5-6, such as albumin, could be expected to precipitate readily at a pH of 5 or 7. Pellets obtained at pH 5 or 7 were relatively easy to resuspend, but precipitates at pH 9 were very hard to dissolve and required additional ultrasonication. The reproducibility of protein extraction from pH 9 pellets was also poor, with notable changes in the abundance distribution of the proteins. The pH of the plasma samples usually varies between 7.5 and 8.5 and, not surprisingly, the precipitation profile of unadjusted plasma is most similar to that obtained at pH 9, where the pellet fraction is not much enriched in large proteins and the supernatant is still highly dominated by albumin (data not shown).
The combination of X!Tandem and SpectraST identified 8,418 spectra (672 unique peptides) in the LC-MS/MS analysis of raw plasma, 6,751 spectra (568 unique peptides) in the pellet fraction and 8,799 spectra (463 unique peptides) in the supernatant. As expected, the largest difference, or smallest overlap, was observed between the precipitate and the supernatant (Figure 4). The total proteome coverage in the combined pellet and supernatant fractions was 25% higher than in a single analysis of crude plasma. A few peptides and proteins were only identified in the raw plasma, and not in either the pellet or the supernatant fraction. However, relative spectral counts clearly show that most of the abundant proteins precipitate at pH 5 and remain in the pellet, while small proteins are enriched in the supernatant fraction (Figure 5a). Examples of such small proteins include several apolipoproteins (e.g. A1, A2, A4, C1 and C3), as previously shown by Anderson and Hunter [29]. Some mid-range (40-60 kDa) molecular weight proteins also increased in relative abundance in the supernatant. The spectral counts for proteins between 60 and 80 kDa are primarily due to albumin (90% in the raw plasma). The fraction of identified spectra assigned to albumin peptides in the entire raw plasma dataset was close to 60%. In the supernatant sample, only 5% of the identified spectra were from albumin peptides, indicating a depletion of ~90%. A number of other large and highly abundant proteins, such as α-2-macroglobulin and complement C3, were also found to be significantly depleted. The relative abundance of albumin in the pellet fraction was approximately the same as for crude plasma. For identifying or quantifying very low-abundant proteins, methods based on immunoaffinity depletion or enrichment, or on combinatorial peptide libraries for dynamic range compression, are probably superior.
However, increasing the relative concentration of already identified proteins tenfold may make it easier to quantify adducts or modifications to these proteins. The precipitation fractionation method could also be used as a first step, before depleting or enriching selected proteins or peptides. Since the pH for precipitation is easily controlled and can be used to target depletion of abundant proteins, the predicted protein pI was used to compare the protein content in supernatants and pellets, and to evaluate the method (Figure 5b). Proteins are known to precipitate at a pH close to their pI, and therefore most proteins, including albumin, were expected to precipitate at pH 5. However, more proteins with pI 5.0-5.5 were identified in the supernatant fraction than in the pellet. On the other hand, far fewer peptides from proteins with pI 6.0-6.5 were found in the supernatant than in the pellet. Interestingly, despite the peaks at pI 6.5-7.0, 8.0-8.5 and 9.0-9.5, there were only minor differences between the precipitates generated at different pH. The histogram for raw plasma showed a distribution similar to the sum of the pellet and supernatant fractions produced at the same pH (Figure 5b). Additional information, such as the isoelectric point of a protein or its molecular weight, can be used to filter out erroneous identifications in samples fractionated in a pI- or molecular weight-dependent manner. The Trans-Proteomic Pipeline already implements this for pI, at the level of the peptides.
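The albumin depletion discussed in this section is estimated from the share of identified spectra assigned to albumin before and after precipitation. The following sketch shows one way such a spectral-count share and the resulting relative depletion could be computed. The function names, the protein labels and the toy list of peptide-spectrum matches are hypothetical and serve only as an illustration; the probability threshold of 0.95 is the one stated in the data analysis section.

```python
from collections import Counter


def spectral_counts(psms, min_probability=0.95):
    """Count PSMs per protein, keeping matches with p >= min_probability.

    psms: iterable of (protein_id, probability) pairs.
    """
    return Counter(protein for protein, p in psms if p >= min_probability)


def spectral_share(counts, protein):
    """Fraction of all counted spectra assigned to one protein."""
    total = sum(counts.values())
    return counts[protein] / total if total else 0.0


def relative_depletion(share_before, share_after):
    """Relative drop in a protein's share of identified spectra."""
    if share_before <= 0:
        raise ValueError("share_before must be positive")
    return 1.0 - share_after / share_before


# Hypothetical PSM list, for illustration only (not data from the study).
toy_psms = [("ALBU", 0.99), ("ALBU", 0.97), ("ALBU", 0.60),
            ("APOA1", 0.98), ("TRFE", 0.96)]
counts = spectral_counts(toy_psms)
# ALBU accounts for 2 of the 4 spectra that pass the threshold.
print(counts["ALBU"], spectral_share(counts, "ALBU"))

# Shares reported in the text: close to 60% (raw plasma), 5% (supernatant).
print(f"{relative_depletion(0.60, 0.05):.0%}")
```

With the rounded shares from the text, the relative depletion comes out at about 92%, in line with the approximately 90% stated above. Note that a share of identified spectra is only a proxy for abundance, so this figure should not be read as an absolute concentration change.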
The workflow also calculated the protein hydrophobicity, or GRAVY score. When comparing protein abundance in the pellet and supernatant fractions with respect to GRAVY score and protein molecular weight, we see, somewhat surprisingly, that hydrophobicity has a very small effect on the precipitation in comparison with molecular weight (Figure 5c).
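The colour coding used in Figure 5c is based on the ratio of pellet to supernatant spectral counts, with thresholds of 2 and 0.5 as given in the figure caption. A minimal sketch of that classification is shown below. The function name is hypothetical, and the handling of proteins with zero counts in one or both fractions is our own assumption, since the caption does not specify it.

```python
def classify_partition(pellet_count, supernatant_count):
    """Colour class from the pellet/supernatant spectral count ratio.

    blue:  ratio >= 2 (enriched in the pellet)
    red:   ratio <= 0.5 (enriched in the supernatant)
    green: ratio between 0.5 and 2
    ASSUMPTION: a protein seen in only one fraction is assigned to that
    fraction, and a protein with no counts at all returns None.
    """
    if pellet_count < 0 or supernatant_count < 0:
        raise ValueError("spectral counts cannot be negative")
    if pellet_count == 0 and supernatant_count == 0:
        return None
    if supernatant_count == 0:
        return "blue"
    ratio = pellet_count / supernatant_count
    if ratio >= 2:
        return "blue"
    if ratio <= 0.5:
        return "red"
    return "green"


# Hypothetical (pellet, supernatant) counts, for illustration only.
for pellet, supernatant in [(40, 4), (3, 30), (12, 10), (5, 0)]:
    print(pellet, supernatant, classify_partition(pellet, supernatant))
```

In practice a minimum total count per protein might be required before a ratio is considered meaningful, since ratios based on one or two spectra are very noisy.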

Conclusion
Although blood plasma is one of the most popular sample sources in biomarker discovery, the large dynamic range of the protein concentration poses a serious challenge. As shown by Kay et al. [28], albumin can be precipitated by simply adding acetonitrile. We have shown that adjustment of the pH prior to precipitation and addition of an equal volume of acetonitrile was sufficient to remove approximately 90% of albumin and many other large proteins from the supernatant extracts. This increases the relative abundance of other proteins, which may be beneficial for quantitative precision, especially in label-free analyses. Moreover, the proteome coverage was increased by 25%, and 34% more peptides were identified. The procedure is simple, reproducible, can be quickly performed with common laboratory chemicals and equipment, and is compatible with standard techniques such as SDS-PAGE and LC-MS/MS. The method may be applicable in many types of proteomic analyses of plasma and other samples. For instance, optimised organic precipitation can be used not only to reduce sample complexity but also to concentrate target proteins, which might be an advantage in biomarker discovery. This method has been successfully implemented in urine proteomics [30]. It may also be adopted for the preparation of green plant material for mass spectrometry analysis, depleting the highly abundant RuBisCO (ribulose-1,5-bisphosphate carboxylase/oxygenase), as both subunits have isoelectric points near 6.
Figure 5: Histograms of the molecular weight (a) and predicted pI (b) distributions of proteins identified in the crude sample (orange), pellet (green) and supernatant (red), accompanied by a graph with calculated GRAVY score plotted against protein molecular weight (c). In the latter, proteins marked in blue have a pellet-to-supernatant spectral count ratio ≥2, in red ≤0.5, and in green more than 0.5 and less than 2.
Although the gain in protein coverage is lower than what can be achieved with immunoaffinity procedures, it should be emphasized that the present technique is robust and can easily be applied in large clinical studies. Further improvements or adaptations of the experimental protocols may focus on specific enrichment for protein modifications (sulfation, phosphorylation, glyco- or lipoproteins), as well as on providing some constraints for the peptide/protein identification algorithms, such as limits on pI, molecular weight or post-translational modifications. Further optimization may also aim at improving the quality and albumin depletion of the pellet fraction.
As an additional remark, the Taverna scientific workflow used in this study contains, in a single workflow and interface, all the steps from raw mass spectrometry data through format conversion, peptide identification, statistical evaluation and data mining to visualization in figures, essentially as they appear in this paper, completely automated and without any interactive manual input. The workflow and the data discussed here are available online, enabling anyone to repeat the analysis or adapt the workflow for any other experiment comparing two or more tandem mass spectrometry datasets with respect to physico-chemical protein properties.