strbio2022_23/pdb.qmd at main · StrBio/strbio2022_23 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
---
title: "Protein Structures"
author: "Modesto Redrejo Rodríguez"
date: "`r Sys.Date()`"
toc: true
toc_float: true
format:
  html:
    theme: simplex
    toc: true
    toc-location: right
    toc-depth: 4
    number-sections: true
    code-overflow: wrap
    link-external-icon: true
    link-external-newwindow: true
bibliography: references.bib
editor_options:
  markdown:
    wrap: 80
---

```{r wrap-hook, include=FALSE}
#Markdown options
library(knitr)
library(formatR)
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE,fig.cap = "", fig.path = "Plot")


#Determine the output format of the document
outputFormat   = opts_knit$get("rmarkdown.pandoc.to")


```

# Obtaining and working with protein structures

::: callout-warning
## Warning for current and future structural biologists
:::

![Ceci n'est pas une proteine. Source: [SwissModel
site](https://swissmodel.expasy.org/static/course/files/PartIII_quality_assesment.pdf).](pics/magritte.png "Ceci n'est pas une proteine"){#fig-magritte
.figure}

The surrealist Belgian painter René Magritte created a collection of
surrealistic paintings entitled [***La trahison des images***
(1928--1929)](https://en.wikipedia.org/wiki/The_Treachery_of_Images "Magritte").
The most renowned of those paintings show a smoking pipe and the following
caption underneath: *"Ceci n'est pas une pipe"* (This is not a pipe). Yes,
indeed! It is actually the painting of a pipe.

Similarly, a picture of a protein, or a PDB file with the coordinates of a
protein structure, is not a protein. It is a representation of ONE possible structure of that protein.
Even experimentally determined structures have two main limitations that we
should always keep in mind: (1) they are a fixed structure (except RMN-based) whereas proteins *in
vivo* are flexible and dynamic and (2) they are subjected to experimental error
and they often contain regions of low reliability (see below @sec-assess).
Moreover, even experimentally obtained macromolecular structures are to some
degree models, with a variable ratio between experimental data and computational
prediction to match the experimental data (X-ray diffraction, cryo-EM density
maps, NMR, SAXS, FRET...) with previously known structures or prediction models.
Obviously, that does not mean that protein structures are useless, they can be
very useful, but we must be aware of the limitations as well as the
applications.

# Experimental determination of protein structures

The structure helps to understand the molecular mechanism of protein function at
a higher level of detail. The 3D representation can help to orientate different
domains/motifs/residues of interest. This can be essential to understand the
analysis of population or pathogenic variants, drug design, and protein
engineering. Moreover, the structures can also help to predict protein function
and evolution, as they are more conserved than sequence, i.e., the protein
structure space is smaller than the sequence space. However, obtaining detailed
and reliable structural data can be technically difficult and time-consuming
and, as we will see, modeling protein structures can be often a good complement
or alternative. Experimentally-obtained structures usually rely on three
alternative techniques, X-ray crystallography, nuclear magnetic resonance (RMN),
or electron cryomicroscopy (CryoEM).

## X-Ray crystallography or single crystal X-ray diffraction

The X-Ray crystallography or single crystal X-ray diffraction is a method to
determine the atomic structure of molecules in regular, crystalline structures.
It requires the generation of a crystal of the molecule of interest that is then
mounted on a goniometer and illuminated or irradiated with a focused beam of
X-rays. The diffraction pattern of the X-rays at the other side of the crystal
allow the determination of the atoms positions, as well as their chemical bonds,
their crystallographic disorder, and various other information. However, the
link between the diffraction pattern and the electron density is not trivial and
requires some complex maths, called [Fourier
transforms](http://pruffle.mit.edu/atomiccontrol/education/xray/fourier.php).

![Schematic workflow of X-ray crystallography. From Creative Structure
[website](https://www.creative-biostructure.com/comparison-of-crystallography-nmr-and-em_6.htm).](images/paste-F430F5B1.png){#fig-xray
.figure}

X-ray is a powerful method that allows obtaining high-resolution structures at
the atomic level of soluble or membrane proteins, either as an apoenzyme or as a
holoenzyme bound to substrate, cofactor or drugs. However, the sample must be
crystallizable (i.e. homogeneous), which requires quite an amount of very pure
protein. Another con of X-ray structures is that, as mentioned, you only obtain
one (or very few) static forms of the protein and the location of hydrogen atoms
cannot be determined by conventional diffraction methods—the fact that they have only one electron—makes them very hard to detect with x-rays accurately because x-rays scatter from the electron density. They can be predicted, but that still hinders some chemical
analysis.

## Nuclear Magnetic Resonance

All atomic nuclei are charged, fast spinning particles, which gave rise to
resonance frequencies that are different for each atom. Therefore, if we apply a
magnetic field we can obtain an electromagnetic signal with a frequency
characteristic of the magnetic field at the nucleus. That is the basis of
nuclear magnetic resonance (NMR).

Also, we should consider that the movement of the nucleus is not isolated--it
interacts with the surrounding atoms both intra- and inter-molecularly. Thus,
through nuclear magnetic resonance spectroscopy, structural information of a
given molecule can be obtained. Taking protein as an example, its secondary
structure, such as α-helix, β-sheet, turn, circular, and curl, reflect the
different arrangement of the main chain atoms of protein molecules
three-dimensionally. The spacing of the atomic nuclei in different secondary
domains, the interaction between nuclei, and the dynamic characteristics of
polypeptide segments all directly reflect the three-dimensional structure of
proteins. These nuclear features all contribute to spectroscopic behaviors of
the analyzed sample, thus providing characteristic NMR signals. Interpretation
of these signals by computer-aided methods leads to deciphering of the
three-dimensional structure.

![Basis of nuclear magnetic resonance. From Creative Structure
[website](https://www.creative-biostructure.com/comparison-of-crystallography-nmr-and-em_6.htm).](images/paste-2013F0AC.png){#fig-rmn
.figure}

The most important feature of the NMR method is that the three-dimensional
structure of macromolecules in the natural state can be measured directly in
solution, and NMR may provide unique information about dynamics and
intermolecular interactions. The resolution of the macromolecular
three-dimensional structure can be as low as sub nanometer. However, the NMR
spectrum of biomolecules with large molecular weight is very complicated and
difficult to interpret, thereby limiting the application of NMR in analyzing
large biomolecules, often below 20-30 kDa (see @fig-experimental). Additionally,
this technique requires relatively large amounts of pure samples (on the order
of several mg) to achieve a reasonable signal to noise level.

[![Coverage of molecular weight by structural technique. From @hancock2022
.](images/paste-FA73AF1E.png){#fig-experimental
.figure}](https://doi.org/10.1016/j.jsb.2022.107841)

## Electron cryomicroscopy

The essential mechanism of Cryo-EM is the same of any electron microscopy
method, i. e., electron scattering. Samples are prepared through
cryopreservation prior to analysis. The coherent electrons are used as a light
source to measure the sample. After the electron beam passes through the sample,
a complex lens system converts the scattered signal into a magnified image
recorded on the detector. A key subsequent step is signal processing, that
transform thousands of images of the particles in any orientation into a
three-dimensional structure of the sample.

![The process of Cryo-EM single particle analysis technique. From Creative
Structure
[website](https://www.creative-biostructure.com/comparison-of-crystallography-nmr-and-em_6.htm).](images/paste-5E29F580.png){#fig-cryoEM
.figure}

The use of electron microscopy methods for structural biology was traditionally
limited to very large macromolecular complexes, like viral capsids, and only
recently it could be used for smaller particles (see @fig-experimental). The
number of protein structures being determined by cryo-electron microscopy is
growing at an explosive rate in the last 5-10 years. This is thanks to several
technical improvements in the technique, spanning sample preparation, analysis
and processing that allow obtaining pictures at the atomic level
[@callaway2020]. This advances were acknowledged by the [2017 Nobel Prize in
Chemistry](https://www.nobelprize.org/prizes/chemistry/2017/press-release/) to
Jacques Dubochet, Joachim Frank and Richard Henderson.

![Cryo-electron microscopy revolution. From Creative Structure
[website](https://www.creative-biostructure.com/comparison-of-crystallography-nmr-and-em_6.htm).](images/paste-E5E5AE75.png){#fig-cryoEM2
.figure}

::: {.callout-tip collapse="true"}
### Tip

Check the already classic article by @egelman2016 for more a detailed info. And
[here](https://www.chemistryworld.com/news/explainer-what-is-cryo-electron-microscopy/3008091.article)
for a great outreaching article after the Nobel Prize.
:::

CryoEM is widely use nowadays because, particularly for large molecular
complexes or viral particles. Structures can be generated quickly, as it does
not require a high amount of protein and it can generate good data even in the
presence of impurities. However, new generation microscopes are only affordable
by large institutions and small particles can have a high level of noise.
Moreover, processing a large amount of images can be limiting to obtain
high-quality structures.

# Structural quality assurance {#sec-assess}

As outlined at the beginning of this section, regardless of the origin or
determination method, any structure is subjected to error. Experimentally-based
structures are actually models that match with the data. The quality of the
original data and the care with which the experiments have been performed will
determine the reliability of the structural results. As in any other scientific
discipline, independently performed experiments can arrive to related models of
the same molecule but almost always there are differences; however both can be
good models.

::: callout-note
### Extra info

Check the detailed documentation about PDB validation report
[here](https://www.wwpdb.org/validation/XrayValidationReportHelp).
:::

## Global parameters in experimentally-based structures

There are a number of diverse parameters that help us to understand the quality
and reliability of an structure. First, the **resolution** is a good indicator
of the level of detail of the structure, as it can strongly affect affect how
the experimental data is modeled.

![The effect of resolution on the quality of the electron density. The Tyr100
residue from concanavalin A as found in the indicated PDB structures at 3 Å, 2 Å
and 1.2 Å. Picture inspired in Figure 14.5 from @structur and rendered with
Pymol (see [concanavalin.pse](../concanavalin.pse) and
[concanavalin.txt](../concanavalin.txt) in the Repo for details about the
picture display).](images/paste-C3031EBE.png){#fig-reso .figure}

Other important parameter is the ***R*****-factor**, which assess the difference between the structure factors calculated from the model and those obtained from
the experimental data. This is, the *R*-factor is the deviation of the
calculated diffraction pattern of the model and the original experimental
diffraction pattern. Typically, good structures with a resolution around 1-3 Å,
will have an *R-*factor of 0.2 (i.e., 20% of deviation). However, it should be
noted that this factor is usually reduced after iterative refinement, which
downplays its use as an indicative of reliability. A more reliable factor is the
***R~free~***. This is less susceptible to manipulation during refinement, as it
is based only in a small fraction of the experimental data (5-10%) that is not
used throughout the refinement stage.

A more intuitive, but only qualitative, way of understand the presicion of a
given atom's coordinates is the *B-*factor. The temperature value or *B-*factor
correlates with the positional errors, although its mathematical definition is
more complex. Normal values for a B-factor are within the range 14-30, whereas
values above 30 usually indicate that the atom is within a flexible or
disordered region, and atoms with a *B-*factor over 40 are often excluded as
being too unreliable.

The root-mean-squared deviation (**RMSD,** [see Structure alignment
section](ddbb.html#sec-alignment)) is a traditional estimator of the quality of
NMR-solved structures. Regions with high RMSD values are those that are less
defined by data. However, it should be also noted that this parameter can be
also misleading as it is highly dependent on the procedure used to generate the
and selection the data that is submitted to the PDB. An experimentalist could
reduce the RMSD by choosing the "best" few structures for deposition from a much
larger draft. Note that the RMSD has many other applications, like comparing
different structures or models from the same or related sequences.

In the last years, along with the boost in quantity and quality of EM
structures, new parameters have been proposed. One of them, the
***Q-*****factor** has been recently implemented for [validation of 3DEM/PDB
structures](https://www.rcsb.org/news/feature/62de9e5235ec5bb4ddb19a43).
Briefly, the Q-score calculates the resolvability of atoms by measuring
similarity of the map values around each atom relative to a Gaussian-like
function for a well resolved atom. Q-score of 1 indicates that the similarity is
perfect whilst closer to 0 indicates the similarity is low. If the atom is not
well placed in the map then a negative Q-score value may be reported. Therefore,
Q-score values in the reports will be in a range of -1 to +1.

## Stereochemical parameters

Given that all structural models contain some degree of error and some of the
modeling global parameters can be controversial, we can analyze the model's
geometry, stereochemistry, and other structural properties to assess structural
models (experimentally or computationally obtained). These parameters compare a
given structure (protein or nucleic acid) against what is already known about
this kind of molecule, based on our knowledge from high-res structures. That
means that the (best) structures in the current structural space define what is
"normal" in a protein structure. The advantage of these analyses and derived
parameters is that they do not consider the process that leads to the model,
they only consider the final product and its reliability. The main disadvantage
is that the current structural space is biased to proteins of known function and
with biomedical or biotechnological interest.

One of the most common and powerful methods to asses the stereochemistry of a
protein is the [Ramachandran plot](intro.html#sec-rama), defined in 1963 and
still in use.

Another widely used analysis (and available for all PDB structures) is the side
[chain torsion
angles](https://www.wwpdb.org/validation/XrayValidationReportHelp#torsion_angles),
usually measured as ***Side chain outliers**.* As outlined in the
[Introduction](intro.html#sec-str), amino acid side chains also have some
preferred conformations. Like the Ramachandran plot, a plot of the χ1-χ2 torsion
angles can indicate problems with a protein model when angle values are outside
of the high density values.

Bad contact or
[clashes](https://www.wwpdb.org/validation/XrayValidationReportHelp#close_contacts)
indicate a bad model. It is obvious that two atoms cannot be in the same (or
very close) spot. We can define this as a situation where two nonbonded atoms
have a center-to-center distance that is smaller than the sum of their van der
Walls radii.

# Protein structure display {#sec-apps}

## Protein structure file formats

Experimental structural data from different methods is stored in different file
formats. For instance, crystallographic raw data is usually saved as `*.ccp4`
files, but Cryo-EM or X-ray density maps can be stored in `*.mrc` or `*.mtz`
files. Other complex file formats, such as the Extensible Markup Language
`*.xml`, provide a framework for structure complex information and documents
like protein structures.

Along with the establishment of the PDB, a simple format was developed. The PDB
format consist of a collection of fixed format records that describe the atomic
coordinates, chemical and biochemical features, experimental details of the
structure determination, and some structural features such as secondary
structure assignments, hydrogen bonding, or active sites. The current version is
named PDBx/mmCIF)also incorporates the expanded crystallographic information
file format (mmCIF), allowing the representation of large structures, complex
chemistry, and new and hybrid experimental methods. Thus a `*.pdb` and `*.cif`
files can be considered as identical files.

::: {.callout-tip collapse="true"}
### Tip

Check PDB-101 course about PDBx/mmCIF format
[here](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/beginner%E2%80%99s-guide-to-pdb-structures-and-the-pdbx-mmcif-format).
:::

![Coordinates in the PDB file (6KI3)](images/paste-96FF761D.png){#fig-PDBfile
.figure}

### Occupancy and B-factor

Disregarding the repetition of atom type in the rightmost column, the last
columns in the PDB file are the **Occupancy** and the **temperature factor** or
the **B-factor**.

Macromolecular crystals are composed of many individual molecules packed into a
symmetrical arrangement. In some crystals, there are slight differences between
each of these molecules. For instance, a sidechain on the surface may wag back
and forth between several conformations, or a substrate may bind in two
orientations in an active site, or a metal ion may be bound to only a few of the
molecules. When researchers build the atomic model of these portions, they can
use the occupancy to estimate the amount of each conformation that is observed
in the crystal.

Thus, by definition, the sum of **occupancy** values for each atom must be 1.
Usually, we will see a single record for an atom, with an occupancy value of 1,
indicating that the atom is found in all of the molecules in the same place in
the crystal. However, if a metal ion binds to only half of the molecules in the
crystal, the researcher will see a weak image of the ion in the electron density
map, and can assign an occupancy of 0.5 in the PDB structure file for this atom.
Two (or more) atom records are included for each atom, with occupancies like 0.5
and 0.5, or 0.4 and 0.6, or other fractional occupancies that sum to a total of
1.

On the other hand, **temperature value or *B*-factor is** a measure of our
confidence in the location of each atom, as discussed above (@sec-assess). If
you find an atom on the surface of a protein with a high temperature factor,
keep in mind that this atom is probably moving a lot, and that the coordinates
specified in the PDB file are only one possible snapshot of its location. Thus,
an atom record with an occupancy \<1 can have a low B-factor if that position is
certain.

As you can imagine, this column is also used by computationally-derived models
to include a confidence value.

## Biological macromolecules display applications

### PyMOL

[PyMOL](https://en.wikipedia.org/wiki/PyMOL) is a very powerful molecular
visualization system written originally by [Warren
DeLano](https://en.wikipedia.org/wiki/Warren_Lyford_DeLano). It was released in
2000 and soon became very popular. It's currently commercialized under License
by [Schrödinger](https://pymol.org/) but a free license for teaching can be
requested. Also, open source code is available on
[GitHub](https://github.com/schrodinger/pymol-open-source) that can be installed
on Linux or MAC. More info on [Wikipedia](https://en.wikipedia.org/wiki/PyMOL).
You can also check this quick [Reference
guide](https://www.uml.edu/docs/PyMOL%20Quick%20Reference%20Guide_tcm18-230352.pdf)

PyMOL allows working with different structures representation, but also with raw
experimental data in different
[formats](https://pymol.org/dokuwiki/doku.php?id=format).

PyMOL is written in Python and can be used with interactive menus and also with
command line. There are a lot of resources that can help you with PyMOL, like a
[Documentation Reference Wiki](https://pymol.org/dokuwiki/) or a
community-supported [PyMOLWiki](https://pymolwiki.org/index.php/Main_Page).
Moreover, it allows the implementation of new functionalities as plugins, like
[PyMod](http://schubert.bio.uniroma1.it/pymod/index.html) or
[DockingPie](http://schubert.bio.uniroma1.it/dockingpie/index.html), among
others. [PyMod](https://pymolwiki.org/index.php/PyMod) [@janson2021] is designed
to act as simple and intuitive interface between PyMOL and several
bioinformatics tools (i.e., PSI-BLAST, Clustal Omega, HMMER, MUSCLE, CAMPO,
PSIPRED, and MODELLER). Starting from the amino acid sequence of the target
protein, PyMod is designed to carry out the main steps of the homology modeling
process (that is, template searching, target-template sequence alignment and
model building) in order to build a 3D atomic model of a target protein (or
protein complex). The integration with PyMOL facilitates a detailed analysis of
the modeling process.

Finally, as any Python-based program, it can be used within Jupyter notebooks
(see <https://www.computer.org/csdl/magazine/cs/2021/02/09354947/1rgCkrAJCko>).

### UCSF ChimeraX

[ChimeraX](https://www.rbvi.ucsf.edu/chimerax/) [@pettersen2021] is a fully open
source software, developed by the UCSF as a renovated version of the former
[Chimera](https://www.cgl.ucsf.edu/chimera/) software, with versions for Linux,
MacOS, and Windows. It aims to be a comprehensive structural biology tool, but
it is more widely known for its capacities for EM maps. As any other open source
software, it has gained new and exciting capacities in the last years, like
[Virtual Reality
capabilities](https://www.rbvi.ucsf.edu/chimerax/docs/user/vr.html) or
[Alphafold2](https://www.rbvi.ucsf.edu/chimerax/data/alphafold-nov2021/af_sbgrid.html)
modeling.

### LiteMol and others

[LiteMol Viewer](https://www.litemol.org/Viewer) is a powerful HTML5 web
application for 3D visualization of molecules and other related data. It is used
in a web browser, eliminating the need for external software and also allowing
the integration with third-party sites as an embedded plugin, as it is in the
[PDBe site](https://www.ebi.ac.uk/pdbe/litemol). More information about LiteMol
can be found on @sehnal2017, the
[wiki](https://webchem.ncbr.muni.cz/Wiki/LiteMol:UserManual), or [YouTube
tutorials](https://www.youtube.com/channel/UCRoyYUeP1hdH2r8XUW-WMoA).

The same phylosopy applies to other open-source viewers like [NGL
Viewer](https://nglviewer.org/) and [Mol\*](https://molstar.org/). However,
since LiteMol was developed by the EMBL-EBI and it is used in ePDB site, it has
become more popular.

Other applications that you may know or came into are:

-   SwissPDBViewer (aka DeepView), developed to work with SWISS-MODEL homology
    modeling app, is an application that provides a user-friendly interface
    allowing to analyze several proteins at the same time. It has currently
    fallen in disuse as the last version (4.1) is only a 32 bits application.

-   RasMol and OpenRasMol were developed initially in 1992 and its last release
    was in 2009. It was a pioneer as a simple molecular display open-source
    application, but it is outdated nowadays.

# [**PyMOL Practice**]{style="color:green"}

Our PyMOL Practice has two parts.

## [PART A: A 10-steps self-guided practice](https://www.evernote.com/shard/s62/sh/666c39a1-ac03-64c9-b8a0-cab9ff3959c7/9d20b6eadae8d130fdbe6652d60237f2)

This is a [Evernote](https://evernote.com/intl/es) note that you can consult
online and also copy into your Evenote account if you wish.

## PART B: PyMOL Challenge

Make a ready-to-publish picture of your favorite protein. As a suggestion, you
can reproduce the top panels in the Figure 1B of @gao2020, but any structure
involving more than one domain and/or with a substrate/cofactor molecule can be
a good challenge.