\chapter[Basics in Polymer Chemistry]{Basics in\\ Polymer Chemistry}
\label{chap:basics-polymer-chemistry}

This chapter introduces the basics of polymer chemistry. The coverage is admittedly biased towards mass spectrometry and biological polymers. Its aim is to provide the reader with the specialized vocabulary that will later be used to describe and explain the (inner) workings of the \pxm\ program. This manual is not a ``crash course'' in biochemistry!

\renewcommand{\sectitle}{Polymers? Where? Everywhere!}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

Indeed, polymers are everywhere. If you ask somebody to show you something polymeric, they will point to the first plastic object in the vicinity. Right, plastic materials are made of hydrocarbon polymers. But we have many different polymers in our body. Proteins are polymers, complex sugars are polymers, DNA (the so-called ``molecule of heredity'') is a \emph{huge} polymer. There are polymers in wine, in wood\dots\ Where? Everywhere!

\bigskip

The \textsl{Oxford Advanced Learner's Dictionary of Current English} gives for \emph{polymer} the following definition: \textit{\textbf{natural or artificial compound made up of large molecules which are themselves made from combinations of small simple molecules}}.

\bigskip

A polymer is indeed made by covalently linking small simple molecules together. These small simple molecules are called \emph{monomer}s, and it follows immediately that a \emph{polymer} is made of a number of monomers. A general term to describe the process that leads to the formation of a polymer is \emph{polymerization}. It should be noted that there are many ways to polymerize monomers together. For example, a polymer might be either linear or branched. A polymer is linear if the monomers that are polymerized can be joined at most two times.
The first junction links the monomer to an elongating polymer (thus making it the new end of the elongating polymer which, by the way, is longer than before by one unit) and the second junction links the new elongating polymer's end to another monomer. This process goes on until the reaction is stopped, at which point the polymer reaches its \emph{finished state}. A branched polymer is a polymer in which at least one monomer is able to form more than two bonds. It is thus clear that a single monomer linked three times to other monomers will yield a ``T-structure'', which is nothing but a branched structure.

In the following sections we'll describe a number of different kinds of polymers. Each time, they will be described by first detailing the structure of their constitutive monomers; next the formation of the polymer is described. At each step we shall try to set forth each polymer's characteristics in such a manner as to introduce the way \pxm' ``thinks polymers'' and to introduce specialized terminologies.

Once the basic chemistries (of the different polymers) have all been described, we will enter a more complex subject that is of enormous importance to the mass spectrometry specialist: polymer chain disrupting chemistry. We shall see that this terminology actually involves two kinds of chemistries: cleavage on the one hand and fragmentation on the other hand. While \pxm\ is basically oriented to linear single stranded polymer chemistries, it also can be used to simulate highly complex polymer chemistries. Biological polymers are the main focus of this manual; however, all the concepts described here may be applied with little or no modification to synthetic polymer chemistries.

Well, the time has come to take a ``biochemical polymers'' tour. The reader who feels at home with biopolymers may happily skip the next sections.
However, the section pertaining to polymer lysis and fragmentation should be of interest even to the expert because it is the opportunity to introduce a ``funny'' terminology that is not encountered anywhere else (have you ever heard of \emph{``leftrightrules''} or of \emph{``fragrules''}?!).

\renewcommand{\sectitle}{Various Biopolymer Structures}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

Biopolymers are amongst the most sophisticated and complex polymers on earth and it certainly is not a mistake to take them as examples of how monomers (be these complex or not) can assemble covalently into life-enabling polymers. In this section we will visit three different polymers encountered in the living world: proteins, nucleic acids and polysaccharides. We shall be concerned with 1) the monomers' structure, 2) the polymerization reaction and 3) the final capping reaction responsible for putting the polymer in its \emph{finished state}.

\subsection*{Proteins}

These biopolymers are made of amino acids. There are twenty major amino acids in nature, and each protein is made of a number of these amino acids. The possible combinations are virtually unlimited, providing enormous diversity of proteins to the living world. A protein is a polar polymer: it has a left end and a right end. This means that the polymerization process is something ordered, from left to right. Figure~\ref{fig:peptbond-formation} shows that the chemical reaction at the basis of protein synthesis is a \emph{condensation}. A protein is the result of the condensation of amino acids with each other in an orderly polar fashion. A protein has a left end (called \emph{N terminus; amino terminal end}) and a right end (called \emph{C terminus; carboxyl terminal end}). The left end is an amino group ($\mathrm{H_2N}$--) corresponding to the unreacted amino group of the first amino acid.
Upon condensation of a new amino acid onto the first one, the carboxyl group of the first amino acid reacts with the amino group of the second amino acid. A water molecule is released, and the formation of a bond between the two amino acids yields a dipeptide. The right end of the dipeptide (and of a polypeptide --\textit{i.e.} of a protein-- also, of course) is a carboxyl group (--COOH) corresponding to the un-reacted carboxyl group of the last amino acid to have ``polymerized in''. The bond formed by condensation of two amino acids is an amide bond, also called --in protein chemistry-- a \emph{peptidic bond}. The elongation of the protein is a simple repetition of the condensation reaction shown in Figure~\ref{fig:peptbond-formation}, granted that the elongation \emph{always} proceeds in the described direction (a new monomer arrives to the right end of the elongating polymer, and elongation is done from left to right). \begin{figure} \begin{center} \includegraphics[scale=0.25]{figures/raster/peptbond-formation.png} \end{center} \caption[Peptidic bond formation]{\textbf{Peptidic bond formation by condensation.} The left end monomer $\mathrm{R_1}$ is condensed to the right end monomer $\mathrm{R_2}$ to yield a peptidic bond. A water molecule is lost during the process.} \label{fig:peptbond-formation} \end{figure} Now we should point at a protein chemistry-specific terminology issue: we have seen that a protein is a polymer made of a number of monomers, called amino acids. In protein chemistry, there is a subtlety: once a monomer is polymerized into a protein it is no more called a monomer, it is called a \emph{residue}. We could say that a residue is an amino acid less a water molecule. 
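
As a worked illustration of the two statements above, the condensation and the residue definition can be written as elemental balances. The choice of glycine and alanine is ours, made only for the sake of the example; the formulae on the left are those of the free amino acids:

\[ \mathrm{C_2H_5O_2N} + \mathrm{C_3H_7O_2N} \longrightarrow \mathrm{C_5H_{10}O_3N_2} + \mathrm{H_2O} \]

\[ \textrm{alanyl residue} = \mathrm{C_3H_7O_2N} - \mathrm{H_2O} = \mathrm{C_3H_5O_1N_1} \]

The first line is the formation of the glycyl-alanine dipeptide with loss of one water molecule; the second line is the ``amino acid less a water molecule'' rule applied to alanine.
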
From what we have seen until now, we could define a protein this way: ---\textsl{``A protein is a chain of residues linked together in an orderly polar fashion, with the residues being numbered starting from 1 and ending at n, from the first residue on the left end to the last one on the right end''}. This definition is still partly inexact, however. Indeed, from what is shown in Figure~\ref{fig:prot-polymer}, there is still a problem with the extremities of the polymer chain: what about the amino group on the left end of a protein (the amino group sits right onto the first amino acid of the protein), and what about the carboxyl group of the right end of a protein (the carboxyl group sits right onto the last amino acid of the protein)? These two groups are un-reacted, in fact. If we followed the new ``residue-based'' definition of a protein polymer, we would say that there is a proton in \emph{excess} on the left end and a hydroxyl in \emph{excess} on the right end. However, these two chemical groups are not actually in \emph{excess}, they are called (in \pxm) the \emph{cappings} or \emph{caps} of the polymer (this terminology is also used in polymer science). They ensure that the polymer is in a \emph{finished state}, which means that it cannot be elongated anymore, on whichever end. The proton is the \emph{left cap} of the protein polymer and the hydroxyl is the \emph{right cap} of the protein polymer. \begin{figure} \begin{center} \includegraphics[scale=0.25]{figures/raster/prot-polymer.png} \end{center} \caption[A protein is a capped residue chain]{\textbf{End capping chemistry of the protein polymer.} A protein is made of a chain of residues and of two caps. The left cap is the N-terminal proton and the right cap is the C-terminal hydroxyl. 
Altogether, the residual chain (enclosed here in the blue polygon) and both red-colored caps (H and OH) do form a complete protein polymer.} \label{fig:prot-polymer} \end{figure} Now comes the question of unambiguously defining the structure of a protein. It is commonly accepted that the simple ordered sequence of each residue code in the protein, from left to right, constitutes an unambiguous description of the protein's \emph{primary structure}. Of course, proteins have three-dimensional structures, but this is of no interest to a program like \pxm, which is aimed at calculating masses of polymers. To enunciate unambiguously the \emph{sequence} of a protein, you would use a symbology like this: \begin{mynoindent} {\footnotesize using the 3-letter code of the amino acids:}\\ Ala Gly Trp Tyr Glu Gly Lys\\ {\footnotesize or, using the 1-letter code of the amino acids:}\\ A G W Y E G K\\ Alanine is thus the residue 1 and Lysine is the last residue ($\mathrm{n = 7}$). \end{mynoindent} This primer in protein chemistry should be sufficient for the moment. Let us now go to see how nucleic acids differ from the proteins (and they do no little). \subsection*{Nucleic Acids} These biopolymers are more complex than the proteins are. This is mainly due to the fact that nucleic acids are composed of monomers that have three different parts, and because those parts differ in DNA and RNA. Nucleic acids are made of \emph{nucleotide}s. A nucleotide is the nucleic acid's brick: \emph{a nucleotide consists of a nitrogenous base combined with a ribose/deoxyribose sugar and with a phosphate group}. There are two different kinds of nucleic acids: deoxyribonucleic acid, also known as DNA (the sugar is a deoxyribose) and ribonucleic acid, also known as RNA (the sugar is a ribose). DNA is most often found in its double stranded form, while RNA is most often found in single strand form. 
There are four nitrogenous bases for each: Adenine, Thymine, Guanine, Cytosine for DNA; in RNA only one of these bases changes: Thymine is replaced by Uracil. A nucleic acid is a polar polymer: it has a left end and a right end (same as for proteins, remember?). This means that the polymerization process is something ordered, from left to right (sometimes left is up and right is down in certain vertical representations found mainly in textbooks).

This manual is not meant to teach biochemistry, which is why I am not going to describe the structure of the monomers in atomic detail. However, since it is important to understand how the polymerization occurs, I drew Figure~\ref{fig:nucacbond-formation}, which shows the polymerization reaction mechanism between a nucleotide and another one, to yield a dinucleotide. Figure~\ref{fig:nucacbond-formation} shows that the chemical reaction that is at the basis of nucleic acid synthesis is an \emph{esterification}. A nucleic acid has a left end (called \emph{5' end; often this end is phosphorylated}) and a right end (called \emph{3' end; hydroxyl end}\/). The reaction is the attack of the phosphorus of the new (deoxy)nucleotide triphosphate by the 3'OH of the right end of the elongating nucleotidic chain. Upon esterification, an \emph{inorganic pyrophosphate} (PP$\mathrm{_i}$) is released, and the formation of a phosphodiester bond between the two nucleotides yields a dinucleotide. The elongation of the nucleic acid polymer is a simple repetition of this esterification reaction so that the chain growth is always in the 5'$\Longrightarrow$3' direction. This is achieved in the living cells by what is called the \emph{5'$\Longrightarrow$3' polymerase enzymatic activity}. The conventional representation of a nucleic acid involves showing the 5' end on the left, and the 3' end on the right, horizontally.
Sometimes, to clearly indicate that the left end is phosphorylated, while the right end is not, the ends are indicated as ``5'P'' and ``3'OH''.

\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{figures/raster/nucacbond-formation.png}
\end{center}
\caption[Phosphodiester bond formation]{\textbf{Phosphodiester bond formation by esterification.} The arriving monomer (on the right) has its triphosphate on the 5' carbon of the sugar esterified by nucleophilic attack of the first phosphorus by the alcohol function borne by the 3' carbon of the (deoxy)ribose sugar ring of the left monomer. The bond that is formed is a phosphodiester bond, with release of a pyrophosphate group (PP$\mathrm{_i}$). Note that the sugar and nitrogenous bases are schematically represented in this figure.}
\label{fig:nucacbond-formation}
\end{figure}

Figure~\ref{fig:nucac-polymer} shows a simple way to formalize what a nucleic acid polymer is. The molecule represented on the left is the representation of the ``monomer'' in the sense that the polymer is made of a number of these monomers (if you put in the presented formula the proper nitrogenous base and the proper sugar --ribose or deoxyribose--, you will get the nucleotide of your choice). We have seen previously that, in the specific case of the protein polymer chemistry, the monomer is called a residue once it is polymerized into the polymer chain. In the case of the nucleic acids, there is no such specific term; we just call the monomeric unit a nucleotide. The formula represented on the left of Figure~\ref{fig:nucac-polymer} shows the repetitive element in a nucleic acid polymer, exactly the same way as we had shown the residue formula in the protein polymer chemistry section. Indeed, as we had explained earlier with proteins, the formula shown on the right of Figure~\ref{fig:nucac-polymer} illustrates that the nucleic acid polymer needs to be set to a \emph{finished state}.
The atoms shown in red (outside the boxed repetitive elements) are the nucleic acid \emph{caps}. Thus, we see clearly that in the case of the nucleic acid polymers, the left cap is a hydroxyl and the right cap is a proton. Incidentally, this happens to be the exact converse of what we described earlier for proteins.

\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{figures/raster/nucac-polymer.png}
\end{center}
\caption[A nucleic acid is a capped nucleotide chain]{\textbf{End capping chemistry of the nucleic acid polymer.} A nucleic acid is made of a chain of nucleotides (left formula) and of two caps. The left cap is the hydroxyl group that belongs to the terminal phosphate of the 5' carbon of the sugar. The right cap is the proton that belongs to the hydroxyl group of the 3' carbon of the sugar ring (right formula). Altogether, a finished nucleic acid polymer is made of the nucleotidic chain (enclosed here in the blue polygon), made of the repetitive elements (one of which is shown on the left), and of the two caps (red-colored OH and H, out of the box on the right).}
\label{fig:nucac-polymer}
\end{figure}

Now comes the question of unambiguously defining the structure of a nucleic acid. It is commonly accepted that the simple ordered sequence of the named nitrogenous bases in the nucleic acid, from left (5' end) to right (3' end), constitutes an unambiguous description of the nucleic acid sequence. To enunciate the sequence of a gene, you would use a symbology like this:
\begin{mynoindent}
{\footnotesize for a DNA, using the 1-letter code of the nitrogenous bases:} A T G C A G T C\\
{\footnotesize for an RNA, using the 1-letter code of the nitrogenous bases:} A U G C A G U C\\
Adenine is thus the base 1 and Cytosine is the last base ($\mathrm{n = 8}$).
\end{mynoindent}

\subsection*{Polysaccharides}

These biopolymers are almost certainly amongst the most complex in the living world.
This is mainly due to the fact that saccharides are usually heavily modified in living cells. There is a huge variety of chemical modifications occurring on these biopolymers. Furthermore, branching of the polymer structure is the rule rather than the exception. Interestingly, these molecules are first thought of as the ``fuel'' for the cell, which is certainly far from being total nonsense, but it is clear that their structural role is extremely important. Their ability to form complex structures has been exploited in living systems for identification processes. There are a number of complex sugars on the cell walls\dots

Nonetheless, the general picture is not that complex, if we only think of the way monomers are polymerized together. As far as we are concerned, in fact, the polymerization mechanism is a simple condensation. In this respect, everything looks much like with proteins; some people do use the same terminology: a monomer sugar becomes a residue once polymerized in the saccharidic chain. There are two main kinds of sugars: \emph{pentoses} (in $\mathrm{C_5}$) and \emph{hexoses} (in $\mathrm{C_6}$); it should be noted, however, that there is a variety of other common molecules, like \emph{sialic acids}, \emph{heptose}\dots

A saccharidic polymer is polar: it has a left end and a right end (same as for proteins and nucleic acids, remember?). This means that the polymerization process is something ordered, from left to right. The terminology regarding the ends of a saccharidic polymer is rather unexpected at first sight: the left end is said to be the \emph{non-reducing end} while the right end is said to be the \emph{reducing end}. Historically this was observed with monosaccharides (also called \emph{monoses}), which reduced cupric ($\mathrm{Cu^{2+}}$) ions, thus getting oxidized themselves on the carbonyl (when in the open ring aldehydic form).

Figure~\ref{fig:sacchbond-formation} shows the polymerization reaction between a sugar and another one (2 glucose monomers, actually), to yield a maltose disaccharide. The polymerization mechanism is a simple condensation. The elongation of the polysaccharidic polymer is a simple repetition of this condensation reaction so that the chain growth is always in the same orientation, from non-reducing end to reducing end. The conventional representation of a polysaccharide involves showing the non-reducing end on the left, and the reducing end on the right, horizontally.

\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{figures/raster/sacchbond-formation.png}
\end{center}
\caption[Osidic bond formation]{\textbf{Osidic bond formation by condensation.} The two monomers are subject to condensation with loss of one molecule of water.}
\label{fig:sacchbond-formation}
\end{figure}

Figure~\ref{fig:sacch-polymer} shows a simple way to formalize what a saccharidic polymer is. The top formula is the representation of the ``monomer'' in the sense that the polymer is made of a number of these monomers. The bottom formula represents a polysaccharide, with the repetitive elements boxed (there are n monomers polymerized). The atoms shown in red (outside the boxed repetitive elements) are the saccharidic polymer \emph{caps}. Thus, we see clearly that in the case of polysaccharides, the left cap is a proton and the right cap is a hydroxyl. Incidentally, this happens to be identical to the protein case and the exact converse of what we described previously for nucleic acids.

\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{figures/raster/sacch-polymer.png}
\end{center}
\caption[A saccharidic polymer is a capped osidic residue chain]{\textbf{End capping chemistry of the polysaccharidic polymer.} A polysaccharide is made of a chain of osidic residues (blue-boxed formula) and of two caps (red-colored atoms).
The left cap is the proton that belongs to the non-reducing end of the polymer. The right cap is the hydroxyl group that belongs to the reducing end of the polymer.}
\label{fig:sacch-polymer}
\end{figure}

Now comes the question of unambiguously defining the structure of a saccharidic polymer. It is commonly accepted that the simple ordered sequence of the named monoses in the saccharidic polymer, from left (non-reducing end) to right (reducing end), constitutes an unambiguous description of the glycan sequence. To enunciate the sequence of a glycan, you would use a symbology like this:
\begin{mynoindent}
{\footnotesize using a 3-letter code:}\\
Ara Gal Xyl Glc Hep Man Fru\\
Arabinose is thus the monose 1 and Fructose is the last monose ($\mathrm{n = 7}$).
\end{mynoindent}
Incidentally, this is where the ability of \pxm\ to handle monomer codes of non-limited length comes in handy!

\renewcommand{\sectitle}{To Sum Up}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

We have rapidly surveyed the three major polymers in the living world. A great many other polymers exist around us. Table~\ref{tab:three-biopolym-exples} on page~\pageref{tab:three-biopolym-exples} tries to sum up all the information gathered so far. Note that the formulae given for the monomers are the ``residual'' ones. For example, the formula of the glycyl residue corresponds to the formula of the Glycine monomer less one molecule of water.
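
To see how the residual formulae and the caps of Table~\ref{tab:three-biopolym-exples} combine into a finished polymer, here are two small worked sums. The two-monomer chains (Glycine followed by Alanine, and Adenine followed by Cytosine) are hypothetical examples chosen by us; each sum is simply left cap, plus residual formulae, plus right cap:

\[ \mathrm{H} + \mathrm{C_2H_3O_1N_1} + \mathrm{C_3H_5O_1N_1} + \mathrm{OH} = \mathrm{C_5H_{10}O_3N_2} \]

\[ \mathrm{OH} + \mathrm{C_{10}H_{12}O_5N_5P_1} + \mathrm{C_9H_{12}O_6N_3P_1} + \mathrm{H} = \mathrm{C_{19}H_{26}O_{12}N_8P_2} \]

In both cases the two caps together amount to one water molecule, which is why a finished chain of $n$ monomers has the composition of the $n$ free monomers less $n-1$ water molecules.
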

\begin{table}
\begin{small}
\begin{tabular}{c|ccccc}\hline
polymer & name & code & formula & left cap (LC) & right cap (RC) \\ \hline
protein & & & & H & OH \\
 & Glycine & G & $\mathrm{C_2H_3O_1N_1}$ & & \\
 & Alanine & A & $\mathrm{C_3H_5O_1N_1}$ & & \\
 & Tyrosine & Y & $\mathrm{C_9H_9O_2N_1}$ & & \\
nucleic acid& & & & OH & H \\
 & Adenine & A & $\mathrm{C_{10}H_{12}O_5N_5P_1}$ & & \\
 & Cytosine & C & $\mathrm{C_9H_{12}O_6N_3P_1}$ & & \\
saccharide & & & & H & OH \\
 & Arabinose & Ara & $\mathrm{C_5H_8O_4}$ & & \\
 & Heptose & Hep & $\mathrm{C_7H_{12}O_6}$ & & \\ \hline
\multicolumn{6}{c}{Note: LC = left cap; RC = right cap}\\ \hline
\end{tabular}
\caption[Comparison of three common biopolymers]{\textbf{Quick comparison of three biopolymers with examples of monomers}}\label{tab:three-biopolym-exples}
\end{small}
\end{table}

Many synthetic polymers are much simpler than the ones we have rapidly reviewed, and it should be clear that, if \pxm\ can deal with the complex biopolymers described so far, it certainly will be very proficient with less complex synthetic polymers.

Describing the formation of polymers is one thing, but we also have to describe how to disrupt polymers. This is what we shall do in the next section.

\renewcommand{\sectitle}{Polymer Chain Disrupting Chemistry}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}
\label{sect:pol-chain-disrupt-chem}

When we first mentioned ``polymer chain disrupting chemistry'', we said that this was a complex subject, and that it was of \emph{enormous} importance to the mass spectrometrist. This is why we will treat this subject in a pretty thorough manner. First of all we should insist on the fact that chemically modifying a polymer does not necessarily mean that the chain structure of the polymer is perturbed.
Here, however, we are concerned specifically with the chemical modifications that yield a polymer chain perturbation; \emph{cleavage} and \emph{fragmentation}: \begin{itemize} \item \textsc{A cleavage is a chemical process} by which a molecule will act directly on the polymer making it fall into at least two separated pieces (the \emph{oligomers}). As a result of the cleavage reaction, groups originating in the cleaving molecule remain attached to the polymer at the precise cleavage location; \item \textsc{A fragmentation is a chemical process} by which the polymer structure is disrupted into separated pieces (the \emph{fragments}) mainly because of energy-dependent electron doublet rearrangements leading to bond breakage. \end{itemize} Here are the details pertaining to each one of these two very different processes: \subsection*{Polymer Cleavage} We said above that, upon cleavage of a polymer, the cleaving molecule reacts with it, and by doing so directly or indirectly ``\emph{dissolves}'' an inter-monomer bond. A polymer cleavage always occurs in such a way as to generate a set of \emph{true} polymers (smaller in size than the parent polymer, evidently, which is why they are called \emph{oligomers}). Indeed, let us take the example shown in Figure~\ref{fig:prot-cleavage}, where a tripeptide (a very little protein, containing a methionyl residue at position 2) is submitted either to a water-mediated cleavage (hydrolysis, upper panel) or to a cyanogen bromide-mediated cleavage (lower panel). 
The two cases presented in this figure are similar in some respects but different in other respects: \begin{itemize} \item in both cases the bond that is cleaved is the inter-monomer bond (in protein chemistry this is a peptidic bond); \item in both cases the Oligomer 2 has the same structure; \item in the first case the molecule that is responsible for the cleavage is water, while in the second case it is cyanogen bromide; \item the structures of the Oligomer 1 species differ when produced using water or cyanogen bromide as the cleaving molecule. \end{itemize} \begin{figure} \begin{center} \includegraphics[scale=0.3]{figures/raster/prot-cleavage.png} \end{center} \caption[Protein cleavage by water and cyanogen bromide]{\textbf{Protein cleavage by water and cyanogen bromide.} A tripeptide (pretty small protein) is cleaved at position 1 either by hydrolysis (top) or by cyanogen bromide (bottom). Cyanogen bromide cleaves specifically on the right of a methionine monomer.} \label{fig:prot-cleavage} \end{figure} The difference between hydrolysis and cyanogen bromide cleavage is the Oligomer 1 species: the cyanogen bromide cleavage has a side effect of generating a homoserine as the right end monomer of Oligomer 1, while hydrolysis generates a genuine methionine monomer. This is because water reverses in a very symmetrical manner what polymerization did (hydrolysis is the converse of condensation), while cyanogen bromide did some chemical modification onto the generated Oligomer 1 species. Nonetheless, the reader might have noted that --interestingly-- all the four oligomers do effectively have their left cap (a proton) and their right cap (the hydroxyl). This means that in both water and cyanogen bromide-mediated cleavage, all the generated oligomers are indeed true polymers in the sense that: 1) they are a chain of monomers (modified or not) and 2) they are correctly capped (\textit{i.e.} they are polymers in their finished state). 
This is important because it is the basis on which we shall make the difference between a cleavage process and a fragmentation process. Thus, the \pxm\ definition of an oligomer might be: \emph{an oligomer is a polymer (of at least one monomer) in its finished state that was generated upon cleavage of a longer polymer}. When the polymer cleavage reaction precisely reverses the reaction that was performed for the same polymer's synthesis, there is no special difficulty. But when the cleavage reaction modifies the substrate, then this should be carefully modelled. How? To answer this question we might start by comparing the two different Oligomer 1 species that were yielded upon the water-mediated and the cyanogen bromide-mediated cleavage reactions: ``the hydrolysis-generated Oligomer 1 is equal to the cyanogen bromide-generated Oligomer 1 +S1 +C1 +H2 -O1''; this is a big difference! The observations we did so far might be worded this way: \textsl{Whenever a protein undergoes a cyanogen bromide-mediated cleavage, the \[\textrm{``-C1H2S1+O1''}\] chemical reaction should be applied to the resulting oligomers \textit{if and only if} they have a methionine monomer at their right end}. This logical condition is called, in \pxm' jargon, a \emph{leftrightrule}, and will be described later (see page~\pageref{sect:cleavespecif}). Well, this sounds reasonable. But what about the ``normal'' case, when the cleavage is done using water? Nothing special: the mass of the oligomer is calculated by summing the mass of each monomer in the oligomer (since the monomers are not modified this is easily done) and the masses corresponding to both the left and right caps (these are defined in the polymer chemistry definition; in our present case it would be a proton on the left end, and a hydroxyl on the right end). In this way, the oligomer complies with its definition, which states that it is a faithful polymer made of monomers and that it is in its finished state. 
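
The two calculations just described can be sketched as follows. The notation is ours, not that of \pxm: $m_i$ is the mass of the $i$-th monomer of an oligomer of $k$ monomers, and $m_{\mathrm{LC}}$, $m_{\mathrm{RC}}$ are the masses of the left and right caps. The numerical values use the usual monoisotopic atomic masses (H 1.0078, C 12.0000, O 15.9949, S 31.9721), which are not taken from this manual:

\[ M_{\mathrm{oligomer}} = \sum_{i=1}^{k} m_i + m_{\mathrm{LC}} + m_{\mathrm{RC}} \]

For a protein, $m_{\mathrm{LC}} + m_{\mathrm{RC}} = m_{\mathrm{H}} + m_{\mathrm{OH}} \approx 18.0106$, that is, the mass of one water molecule. When the cyanogen bromide rule applies (methionine at the right end), the ``-C1H2S1+O1'' reaction turns the methionyl residue into a homoseryl residue and shifts the mass accordingly:

\[ \mathrm{C_5H_9O_1N_1S_1} - \mathrm{C_1H_2S_1} + \mathrm{O_1} = \mathrm{C_4H_7O_2N_1} \]

\[ \Delta m = m_{\mathrm{O}} - m_{\mathrm{C}} - 2\,m_{\mathrm{H}} - m_{\mathrm{S}} \approx 15.9949 - 12.0000 - 2.0156 - 31.9721 = -29.9928 \]
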

Yes, but then how will \pxm\ manage to calculate the mass of the modified oligomer, like our Oligomer 1 in the case of the cyanogen bromide-mediated cleavage? Simple enough: in a first step it proceeds exactly as for the unmodified oligomer. Next, each oligomer is checked for the presence or absence of a methionine residue on its right end. If a methionine is found, the mass corresponding to the ``-C1H2S1+O1'' chemical reaction is applied. And that's it!

In the previous cyanogen bromide example, the logical condition involved the identity of the oligomers' right end monomer, but other examples can involve the left end monomer instead, if some chemical modification were to occur on the monomer sitting right of the cleavage location. In this case the user would have to analyse the situation and provide \pxm\ with the proper chemical reaction by stating something analogous to: \textsl{\textit{if and only if} they have a Xyz monomer at their left end} (note the partial analogy with the case described above).

For the moment this is enough polymer cleavage abstraction, as the rest of the description pertaining to the cleavage specification definition is thoroughly detailed on page~\pageref{sect:cleavespecif}.

\subsection*{Polymer Fragmentation}
\label{sect:polymer-fragmentation}

In a fragmentation process, the bond that is broken is not necessarily the inter-monomer bond. Indeed, fragmentations are often high energy chemical processes that can affect bonds that belong to the monomers' internal structure. This is one of the reasons why fragmentations do differ from cleavages: they are specific to the polymer type in which they occur. Hydrolyzing a protein and an oligosaccharide is just the same process, from a chemical point of view.
But fragmenting a protein or an oligosaccharide are truly different processes because the way that the fragmentation happens in the polymer sequence depends so much on the nature of each monomer that makes it up. Another peculiarity of the fragmentations, compared with the cleavages that were described above, is the fact that there is no cleaving molecule starting the process. Instead, a fragmentation process is often initiated by an intramolecular electron doublet rearrangement that propagates more or less in the polymer structure to eventually break it. Fragmentations are mainly a gas phase process, not some reaction that happens in solution as a result of putting in contact the polymer and some reagent. It is precisely because no cleaving molecule is involved in the fragmentation process that the fragments are not necessarily capped like a normal polymer should be; and this is another really important difference between cleavage and fragmentation. Let us illustrate these concepts through two examples: proteins and nucleic acids.

\subsubsection*{Protein Fragmentation}

There is a fairly large number of different kinds of fragments that can be generated upon fragmentation of peptides. We are going to detail the most common ones; the user is invited to use the \pxm' fragmentation-specification grammar to add less frequent (or newly discovered) fragmentation types.

\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{figures/raster/prot-fragmentation.png}
\end{center}
\caption[Protein fragmentation]{\textbf{Protein fragmentation patterns most widely encountered.} A hexapeptide is fragmented in the seven most widely encountered manners, such as to generate a, b, c, x, y, z and immonium fragment ions.
The figure illustrates the position of the cleavage for each kind of fragment (exemplified using the case of the smallest fragment possible) and the mass calculation method is described for each fragment kind; consider that each fragment bears only \emph{one positive} charge.}
\label{fig:prot-fragmentation}
\end{figure}

As can be seen from Figure~\ref{fig:prot-fragmentation}, fragmentations generate fragments of three categories: the ones that include the left end of the precursor polymer (a, b, c), the ones that include the right end of the precursor polymer (x, y, z), and finally the special case in which the fragment is an \emph{internal fragment}, like the immonium ions.

When looking at the fragmentations described in the figure it becomes immediately clear why a fragmentation cannot be mistaken for a cleavage: the ionization of the fragment is not necessarily due to the capture of a proton by the fragment. Furthermore, we can also see that a fragmentation is not a cleavage because the fragment that is generated is \emph{not} necessarily what we call a polymer, in the sense that the fragment might not be capped the same way as the precursor polymer is (in its finished state).

The two observations above should make clear to the reader that calculating masses for fragments is a more difficult process than what was described above for the oligomers. Indeed, while it was simple to calculate the mass of an oligomer (by simply adding the masses of its constitutive monomer units, plus the left and right caps, plus ionization), here there is no chemical formalism generally applicable to all the fragment types. This is why the specification of the fragmentation is left to the user's responsibility.

By looking at Figure~\ref{fig:prot-fragmentation}, the reader should have noticed that the fragment naming scheme takes into consideration whether the fragment bears the left end or the right end of the precursor polymer (or neither).
Indeed, the numbering of fragments holding the left end of the precursor polymer sequence begins at the left end, and for fragments that hold the right end it begins at the right end. Thus the third fragment of series \emph{a} --\emph{a3}-- would involve monomers [1$\rightarrow$3]; and the third fragment of series \emph{y} --\emph{y3}-- would involve monomers [6$\rightarrow$4] (in the figure these left-to-right and right-to-left directions are symbolized using arrows). The reader can therefore see how important it is --when specifying a fragmentation-- to clearly indicate from which end of the precursor polymer the fragment is generated (in \pxm\ jargon this is ``LE'' for left end, ``RE'' for right end and ``NE'' for no end). \pxm\ knows what action it should take when it encounters one of these three specifications; for example, if a ``LE'' specification is found for a given fragmentation specification, \pxm\ adds to the fragment's mass the mass corresponding to the left cap of the precursor polymer.

Now that the stage is set we can start rationalizing fragment specifications, and thus mass calculations.

\paragraph{\emph{a} fragment series}
If we take the \emph{a} fragment series, Figure~\ref{fig:prot-fragmentation} indicates that the fragments include the left end and that their last monomer lacks its carbonyl group (see, at the top of Figure~\ref{fig:prot-fragmentation}, that the \emph{a1} arrow passes between the C$\alpha$H and the CO of monomer 1). So we would say that each fragment of the \emph{a} series should be subjected to the following chemical treatments: 1) addition of the mass corresponding to the left cap (proton), 2) removal of the mass corresponding to the lacking CO group. Applied to monomer 1 alone, this gives the mass of fragment \emph{a1}. If we were interested in the fragment \emph{a4} we would have summed the masses of monomers 1 to 4, added the mass of the left cap, and finally removed the mass of a CO; that's it.
The mass calculation is thus mathematically expressed
\[a_i = LC + \sum_{1}^{i} M_i - CO\]

\paragraph{\emph{b} fragment series}
Similarly, the mass calculation is mathematically expressed
\[b_i = LC + \sum_{1}^{i} M_i\]

\paragraph{\emph{c} fragment series}
The mass calculation is mathematically expressed
\[c_i = LC + \sum_{1}^{i} M_i + NH_3\]

\paragraph{\emph{x} fragment series}
For this series of fragments we do not add the left cap anymore, but replace it with the right cap, since the fragments hold the right end of the precursor polymer. Note also that the numbering of the monomers using the variable \emph{i} in the following mathematical expressions goes from right to left (contrary to what happened for the \emph{a, b, c} fragment series). All the fragments that hold the precursor polymer right end are numbered this way, so this applies to fragments \emph{x, y, z}. The mass calculation is mathematically expressed
\[x_i = RC + \sum_{1}^{i} M_i + CO\]

\paragraph{\emph{y} fragment series}
The calculation is mathematically expressed
\[y_i = RC + \sum_{1}^{i} M_i + H_2\]

\paragraph{\emph{z} fragment series}
In low energy CID, the \emph{z} fragments are expressed this way:
\[z_i = RC + \sum_{1}^{i} M_i - NH\]
which is equivalent to \emph{y-$NH_3$}; in high energy CID an additional proton is often measured:
\[z_i = RC + \sum_{1}^{i} M_i - NH + H\]

\paragraph{\emph{immonium} fragment series}
These fragments are internal fragments in the sense that they hold neither of the two precursor polymer's ends. \pxm\ understands that the user is speaking of this kind of fragment when the ``from which end'' piece of data --in the fragmentation specification-- states ``NE'' instead of ``LE'' or ``RE'' (see page~\pageref{sect:fragspecif}). The mass calculation for these fragments does not take into account the monomers surrounding the one for which the calculation is done.
The mass for an immonium ion --at position \emph{i} in the precursor polymer-- will be the mass of the monomer at position \emph{i}, less the mass of a CO, plus the mass of a proton. The mass calculation for these special internal fragments is expressed
\[imm_i = M_i + H - CO\]

\subsubsection*{Nucleic Acid Fragmentation}

The fragmentations that can be obtained with nucleic acids are numerous, and describing them fully is more complicated than with proteins. The main reason for this is that the loss of nitrogenous bases from the skeleton gives rise to a large number of fragmentation combinations. The mechanisms by which this loss happens are fairly complex, and I am not going to detail any of them. Figure~\ref{fig:dna-fragmentation} shows the most common fragmentations (without taking into consideration the potential loss of bases). An example of fragment is given for each fragment series (much the same way as we did before for proteins). Note that the fragment representations are aimed at helping the reader to figure out what the product ion is, not taking into account where the negative charge lies on the fragment, since this charge can float around at every de-protonatable group. All the fragments shown bear one, and only one, negative charge.

The reader might have noticed --at the bottom of the figure-- that a provision is made for the case in which the fragmented molecular species are not 5' end-phosphorylated but 5' end-hydroxylated. Indeed, the canonical monomer is such that, upon polymerization and left capping, the 5' end is phosphorylated. However, oft-times the oligonucleotides are synthesized chemically without the 5' end phosphate group, thus ending in hydroxyl. This special case should be accounted for by applying to all the fragments that bear the left end of the precursor polymer the following chemical reaction: $-HPO_3$. This chemical reaction should be applied \emph{in addition} to the chemical reaction that yields the fragment \emph{per se}.
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{figures/raster/dna-fragmentation.png}
\end{center}
\caption[DNA fragmentation]{\textbf{DNA fragmentation patterns most widely encountered.} A short DNA sequence is fragmented in the eight most widely encountered manners, so as to generate a, b, c, d, w, x, y, z fragment ions. The figure illustrates the position of the cleavage for each kind of fragment (exemplified using the case of the smallest fragment possible) and the mass calculation method is described for each fragment kind; considering that each fragment is protonated only once (+1).}
\label{fig:dna-fragmentation}
\end{figure}

Exactly as we did for the protein fragments, we are giving below the mathematical expressions used to calculate the mass of the different series of nucleic acid fragments; in these calculations we assume that the left end of the precursor polymer is phosphorylated (5' P) and the reader should bear in mind that this precise phosphate might itself be expelled by the fragmentation. The remark that we made above about the fragment naming scheme for protein fragments (left-to-right or, conversely, right-to-left) applies here also in an identical manner.

\paragraph{\emph{a} fragment series}
These fragments most often appear with base loss.
\[a_i = LC + \sum_{1}^{i} M_i - O\]

\paragraph{\emph{b} fragment series}
\[b_i = LC + \sum_{1}^{i} M_i\]

\paragraph{\emph{c} fragment series}
\[c_i = LC + \sum_{1}^{i} M_i - HPO_2\]

\paragraph{\emph{d} fragment series}
\[d_i = LC + \sum_{1}^{i} M_i - HPO_3\]

\paragraph{\emph{w} fragment series}
\[w_i = RC + \sum_{1}^{i} M_i + O\]

\paragraph{\emph{x} fragment series}
\[x_i = RC + \sum_{1}^{i} M_i\]

\paragraph{\emph{y} fragment series}
\[y_i = RC + \sum_{1}^{i} M_i - HPO_2\]

\paragraph{\emph{z} fragment series}
\[z_i = RC + \sum_{1}^{i} M_i - HPO_3\]

There are also a variety of fragments for which a base is lost. But we cannot describe them all!
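
As an illustration of the 5' end-hydroxylated case mentioned earlier, the expressions above can be combined with the additional $-HPO_3$ chemical reaction. What follows is only a sketch derived from the formulas of this section, under the assumption that the correction is applied once per fragment and only to the series that hold the left end of the precursor polymer. The superscript $OH$ is a notation introduced here for convenience; it is not part of the \pxm\ grammar.
\[a_i^{OH} = LC + \sum_{1}^{i} M_i - O - HPO_3\]
\[b_i^{OH} = LC + \sum_{1}^{i} M_i - HPO_3\]
\[c_i^{OH} = LC + \sum_{1}^{i} M_i - HPO_2 - HPO_3\]
\[d_i^{OH} = LC + \sum_{1}^{i} M_i - 2\,HPO_3\]
The \emph{w, x, y, z} series hold the right end of the precursor polymer and are thus left unchanged. For reference, $HPO_3$ has a monoisotopic mass of about 79.966~Da, so each left end fragment of a 5' end-hydroxylated precursor is expected to be lighter by that amount than its 5' end-phosphorylated counterpart.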
\subsubsection*{More Complex Patterns Of Fragmentation}

Before finishing with fragmentations, it is necessary to describe a powerful feature of the fragmentation specification grammar available in \pxm. This feature was required for the fragmentation of oligosaccharides and also sometimes for proteins. When the fragmentation (the bond breakage reaction itself) occurs at the level of certain monomers, it might be necessary to specify some particular chemistry that would arise on the monomer in question. We have seen in the cleavage documentation that, upon cleavage of a protein sequence with cyanogen bromide, for example, a particular chemical reaction had to be applied to the oligomers that were generated with a methionine monomer as their right end monomer. Well, in a fragmentation specification it is possible to apply comparable chemical reactions but in a more thorough manner. Indeed, while in the cleavage it was possible to say something like \textsl{``apply a given chemical reaction to the oligomer if the right end monomer is Xyz''}, in the fragmentation the logical condition can be bound not only to the identity of the currently fragmented monomer, but also (optionally) to the identity of the previous and/or next monomer in the precursor polymer sequence. For example: \textsl{``Apply a given chemical reaction if fragmentation occurs at the level of a ``Xyz'' monomer, but only if it is preceded by a ``Yxz'' monomer and followed by a ``Zyx'' monomer''}. These logical conditions are called \emph{fragrules}. A \emph{fragspecif} can hold as many \emph{fragrules} as necessary.

Thus we see that a fragmentation specification is a multi-part specification, with a \emph{fragspecif} optionally integrating \emph{fragrule} objects\dots All of this is described in great detail at page~\pageref{sect:fragspecif}.
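
The way fragrules modulate the mass of a fragment can be summarized, in a hedged manner, by the expression below. Let $m_0$ be the mass computed from the \emph{fragspecif} alone, and let each fragrule $r$ carry a mass difference $\Delta_r$ corresponding to its chemical reaction. These symbols are introduced here for illustration only and are not \pxm\ identifiers; that several matching fragrules accumulate is also an assumption of this sketch, not a documented behaviour.
\[m = m_0 + \sum_{r} \delta_r \, \Delta_r\]
Here $\delta_r = 1$ if the monomer at the fragmentation site and, when the fragrule specifies them, its previous and next neighbours in the precursor polymer sequence satisfy the conditions of fragrule $r$; otherwise $\delta_r = 0$. With the example given above, $\delta_r = 1$ only when the fragmentation occurs at a ``Xyz'' monomer that is preceded by a ``Yxz'' monomer and followed by a ``Zyx'' monomer.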
\subsubsection*{To Sum Up}

To sum up all that we have seen so far with polymer chain disrupting chemistries:

\begin{itemize}
\item A polymer sequence gets cleaved into oligomers when a chemical reaction occurs in it at the level of one or more inter-monomer bond(s); monomer-specific chemical reactions can be modelled into the cleavage specification using at most one leftrightrule;
\item A polymer sequence gets fragmented into fragments when a bond breakage occurs, without the help of any exterior molecule, at any level of the polymer structure, with no limitation to the inter-monomer bond; monomer-specific chemical reactions can be modelled into the fragmentation specification using any number of fragrules;
\item Oligomers are automatically capped --\emph{on both ends}-- using the rules described in the precursor polymer's definition;
\item Fragments are automatically capped only --\emph{on the end they hold, if any}-- using the rules described in the precursor polymer's definition;
\item Oligomers are automatically ionized (if required by the user) using the rules described in the precursor polymer's definition;
\item Fragments are never ionized automatically; ionization (gain/loss of a charged group) is necessarily integrated in the fragmentation specification.
\end{itemize}

\cleardoublepage

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "polyxmass"
%%% End: