Seven quick tips for beginners in protein crystallography

The aim of this brief review is to provide a roadmap for beginning crystallographers who have little or no experience in structural biology and yet are keen to produce protein crystals and analyze their 3D structures to understand their biological roles. To achieve this goal it is crucial to perform crystallization, structure determination, visualization and analysis of the protein’s structural features related to its biological function. Keeping that objective in mind, tips presented herein cover the most important steps in a crystallographic endeavor and present a selection of databases and software which can aid and accelerate the whole process. We hope that this short overview will help novices coming from different disciplines to navigate a protein crystallography project and, hopefully, allow avoiding some costly mistakes, even though being a crystallographer means learning by trial and error.


INTRODUCTION
X-ray macromolecular crystallography serves a variety of scientific disciplines and significantly accelerates discoveries in many research areas, including studies on protein biological function, drug screening and design, and human health and disease. Through decades, biocrystallography has evolved together with developments in computer science allowing faster structure determination. As a result, a spectacular growth in the number of new software, advanced databases and bioinformatic servers can be observed. The scale of constantly growing interest in structural biology, and hence also biocrystallography, is proven by the traffic on the central database Protein Data Bank (Berman et al., 2000). Worldwide, more than 1 million users visit the Protein Data Bank every year, as judged by counting unique IP addresses. They perform more than 1.5 million downloads of structures every day, or more than 500 million per year (Bruno et al., 2017). Beyond the final outcome from the crystallographic experiment, one can feel lost in the diverse and thriving ecosystem of software applied during the process of structure determination and analysis. Here, we attempt to create a roadmap ( Fig. 1) for beginners in protein crystallography in a form of a set of tips comprising a modest selection of macromolecular crystallography software and bioinformatic tools essential for a crystallographer's journey. However, it is not a comprehensive review on freely available software packages, services, or commercial products. This subjective overview made by the authors comprises resources that are well-known among the community and that are currently available. It is hoped that the "seven tips" can serve as a starting point especially for young researchers and will act as a catalyst for the readers to deepen their crystallographic knowledge.

TIP 1: LEARN WHAT IS KNOWN ABOUT YOUR PROTEIN
A good starting point for gathering information about the chosen protein is the UniProt server (https://www.uniprot.org/) (Bateman, 2019). It offers an advanced search engine which accepts, inter alia, the name of the protein, its EC number or, via a BLAST (Zhang & Madden, 1997) search option, its amino acid sequence. A UniProt entry provides key information about the protein function, names and taxonomy, subcellular location, posttranslational modifications, interactions with itself or other proteins, similarity to other proteins, domains present in the protein and its amino acid sequence. Literature references to information sources are provided, as well as a rich selection of cross references to other databases. One very useful feature of the server is the "Add to basket" option available in the Sequence section. It allows one to gather a set of protein sequences, which can later be aligned using the Clustal Omega program (Sievers et al., 2011) available at the server.
Information on protein domains and their organization within a chosen protein, as well as on the whole protein family to which the protein belongs can be retrieved from the Pfam database (http://pfam.xfam.org/) (El-Gebali et al., 2019). That database has a user-friendly search engine that accepts, inter alia, UniProt ID and PDB IDs. Entry for each protein family provides very useful information on protein architectures, available structures deposited in PDB, species and phylogenetic trees and, importantly, it allows one to view or download stored sequence alignments for the family. An available profile logo for the family aids in identifying conserved residues and variable positions.
Protein solubility in E. coli can be predicted based on the protein amino acid sequence with the use of the SoluProt server (https://loschmidt.chemi.muni.cz/soluprot/) (Hon et al., 2020), which employs machine learning techniques trained on curated databases of experimental data.
Before ordering or cloning a gene for the protein to be crystallized, it is advisable to inspect the results of the XtalPred server (https://xtalpred.godziklab.org/Xtal-Pred-cgi/xtal.pl) (Slabinski et al., 2007). Using the amino acid sequence as input, the server predicts a range of physico-chemical properties and, based on them, predicts protein crystallizability using a machine learning (random forest) method. Detailed reports on values of computed target features vs. distributions of crystallization successes and failures allow one to judge which feature can potentially be a major obstacle to crystallization. The predicted sequence features, along with the amino acid sequence, can aid in construct design, e.g. suggest removing a signal peptide or a long disordered fragment at the protein's termini. The XtalPred results, combined with information retrieved from the Pfam database, can also aid in deciding if a multi-domain protein should be crystallized as a whole or as separate fragments. The "homologs" section of the results provides, among others, a list of homologs with known structure deposited in PDB, which is valuable information with respect to the feasibility of solving the structure by molecular replacement, currently the most common means of solving a crystal structure. To be useful for this purpose, the homolog should have an amino acid sequence that is at least 20-25% identical with the target protein. Homologs can also be searched directly at the PDB server (https://www.rcsb.org/) using an advanced search option and the amino acid sequence as input. If close-enough homologs with a known structure are not available, then experimental phasing needs to be considered, and here the single wavelength anomalous diffraction (SAD) method is the most popular. SAD uses the anomalous signal from either natural components of the investigated protein (e.g. Zn, Cu, Fe, Mn, Ni, or in favorable cases S) or from selenomethionine (Se-Met), introduced into the protein (through bacterial or yeast growth medium) during protein production.
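The 20-25% identity guideline for a molecular replacement search model can be checked quickly once the target and a candidate homolog have been aligned (for example with Clustal Omega). The following is only a minimal sketch: the function names and the example sequences are hypothetical, and identity is computed over columns where both sequences have a residue, which is one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # gap column: not counted (an assumption of this sketch)
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Upper end of the 20-25% guideline quoted in the text; adjust as needed.
MR_IDENTITY_THRESHOLD = 25.0


def is_plausible_mr_model(aligned_target: str, aligned_homolog: str,
                          threshold: float = MR_IDENTITY_THRESHOLD) -> bool:
    """True if the homolog passes the rough sequence-identity guideline."""
    return percent_identity(aligned_target, aligned_homolog) >= threshold


if __name__ == "__main__":
    target = "MKT-AYIAKQR"    # made-up aligned fragments, for illustration only
    homolog = "MKSLAYLA-QR"
    print(f"identity: {percent_identity(target, homolog):.1f}%")
    print("plausible MR model:", is_plausible_mr_model(target, homolog))
```

Sequence identity is only a first filter; coverage of the target by the homolog and the quality of the deposited model matter as well.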
Since the whole process of structure determination starts from a protein sample (Fig. 2), after gathering all available information about the macromolecule of interest the researcher should answer a few more questions regarding the protein source or its production. A good planning procedure at this step that includes decisions about working with multidomain versus single domains of the studied molecule and possible truncation of the flexible parts, as well as the awareness of a wide range of methods used for solving the structures based on intrinsic features of the macromolecule (i.e. metal content, SeMet derivatives etc.) can save a great amount of time at the next stages. It is good to remember that storage of the protein sample can be critical. Not all proteins tolerate freezing at -20°C, thus most samples are kept at 4°C or -80°C, but the activity and stability must be regularly checked. In addition, as a general rule, it is better to store proteins that are concentrated than diluted.

TIP 2: CRYSTALLIZATION IS AN ART, YET PLANNING IS ADVISABLE
Before planning the crystallization experiments, it is important to realize which factors influence crystal growth (Table 1). All of these variables, categorized as physical, chemical or biochemical, can heavily impact the crystal formation process. The details of these various parameters have been extensively described in the literature (Abdalla, 2016; Bhat et al., 2018). In practice, the purity, homogeneity and stability of the protein sample are the very first factors that should be considered. Protein concentration is always case dependent, but for the initial experiments a concentration of 1-25 mg/ml is recommended (10-15 mg/ml is typical). Eukaryotic proteins tend to be less soluble than bacterial proteins. All further approaches strongly depend on the amount of the protein sample, the equipment available and resources. As already mentioned, searching for optimal crystallization conditions is still a trial and error process, aided by the use of commercially available screens. However, there are a couple of evidence-based rational approaches that are very likely to improve the chance of obtaining protein crystals. Apart from the purity of the protein sample, the second very important aspect is the pH of the crystallization solution, which has a significant impact on crystal growth. It has been suggested that pH should deviate from the pI of the protein by up to 3 pH units and that the pH of the protein solution should be "as low, as high or as divergent from the pI as possible for basic, acidic or neutral proteins, respectively, within their stable pH range" (Zhang et al., 2013). Thus, initial screening for crystallization conditions should explore the widest set of pH/precipitants/buffers/additives, which can be easily conducted with the use of crystallization kits provided by many suppliers.
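The pH guideline quoted from Zhang et al. (2013) can be written down as a small decision rule. The sketch below is illustrative only: the function name is hypothetical, and the pI cut-offs used to call a protein "acidic" or "basic" (6.0 and 8.0) are assumptions made for this example, not values taken from the cited work. The stable pH range has to come from experiment.

```python
def suggest_screening_ph(pi: float, stable_min: float, stable_max: float,
                         acidic_below: float = 6.0,
                         basic_above: float = 8.0) -> float:
    """Pick the end of the stable pH range suggested by the quoted rule.

    basic protein   -> as low as possible
    acidic protein  -> as high as possible
    neutral protein -> as divergent from the pI as possible
    The acidic/basic cut-offs are illustrative assumptions.
    """
    if stable_min > stable_max:
        raise ValueError("stable_min must not exceed stable_max")
    if pi > basic_above:
        return stable_min
    if pi < acidic_below:
        return stable_max
    # Neutral protein: choose the end of the range further from the pI.
    if abs(pi - stable_min) >= abs(stable_max - pi):
        return stable_min
    return stable_max


if __name__ == "__main__":
    for pi, lo, hi in [(9.5, 5.0, 8.5), (4.8, 5.5, 9.0), (7.0, 5.5, 9.5)]:
        ph = suggest_screening_ph(pi, lo, hi)
        print(f"pI {pi}: stable pH {lo}-{hi} -> start screening near pH {ph}")
```

Such a rule only narrows the starting point; broad screening over pH, precipitants, buffers and additives remains necessary.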
The best way to increase the chance of success in macromolecular crystallization is to initiate a collaboration with a structural biology group or with dedicated core facilities equipped with crystallization robots, cold rooms and/or crystal hotels. The number of such initiatives supporting their users through the entire crystallization process has grown rapidly in recent years all over the world. An encouraging aspect of robotic handling of crystallization plates is the substantially smaller amount of sample used by crystallization robots in comparison to the traditional path with manually set up crystallization drops, where a sample volume between 1 μL and 5 μL is used. For example, to set up 10 screens (96 conditions each), 150-200 microliters of protein at the proper concentration should be prepared. Discussion of the theoretical principles behind crystallization (McPherson & Gavira, 2014; Russo Krauss et al., 2013), description of the strategies regarding the experiments (Cheraghian Radi et al., 2021) and how to proceed with optimization (He et al., 2020) is beyond the scope of this review. However, as an extension of this tip we would like to point to one more rational approach enhancing crystallization chances: protein surface entropy reduction, which can be planned with the use of the SERp server (http://services.mbi.ucla.edu/SER/) (Goldschmidt et al., 2007). The server identifies regions on the protein surface characterized by a high side chain conformational freedom (and hence, entropy) and, based on the secondary structure prediction results (coil regions are preferred) and sequence alignment to homologous proteins (amino acid conservation analysis), suggests the best candidates (up to three consecutive residues) for mutation. The resulting mutant is expected to have low-
Assuming that the protein crystals can be seen in the drop, the question "what's next?" should now be answered. How to handle the crystals?
How to prepare samples for their journey to the synchrotron and for data collection? Crystals that look good under the microscope are only a promising start. Working with protein crystals is not easy. Before the measurements, they need to be harvested and protected from destruction. Since crystals are formed from water-based solutions, a large part of their crystal lattice is composed of water (Chayen & Saridakis, 2008). The large amount of mother liquor in the crystals ensures that the protein molecules adopt a native conformation that is similar to that observed under physiological conditions. Furthermore, the presence of water channels makes it possible to easily introduce low molecular weight components into the protein crystals, e.g. heavy element ions, inhibitors or activators (Gnesi & Carugo, 2017). On the other hand, the presence of water in the crystals also has a negative side. During diffraction experiments with intense X-rays, free radicals are produced by ionizing radiation. Unfortunately, the water channels allow these highly reactive species to spread quickly, and when they reach protein molecules they cause destruction and degradation of the crystal. To mitigate this process, a cryoprotection method is applied (Pellegrini et al., 2011). This step is closely connected with the next one: crystal handling. Finding the right cryoprotecting agent and its concentration is a crucial step for preserving good crystal condition. Cryoprotectant selection remains a trial and error exercise, where the first combination that "works" is accepted. During this step, one should remember that an efficient cryoprotectant solution should first of all stabilize the crystal, but it should also prevent ice formation on the surface of the sample during flash-cooling. At this point we should mention that soaking in a cryoprotection solution is not the only method of protecting protein crystals from damage. Dehydration, high-pressure cryocooling or crystal annealing can also be applied (Huang & Szebenyi, 2016). Crystals should be handled one by one and as fast as possible, otherwise the crystal and the drop can dry out (as a result, other crystals in the same drop will be lost). Many tools for crystal handling and mounting can be found on the market: a wide variety of loops (different shapes and sizes), microtools, sets for room temperature measurements and capillaries. After being fished out with a loop that is a bit larger than the crystal, the crystal can be transferred stepwise to cryoprotection solutions with a gradually increasing concentration of the cryoprotectant or can be immediately soaked in the already established final cryoprotecting solution (Vera & Stura, 2014). In both cases, the next step requires transfer of the crystal into liquid nitrogen. After flash-cooling, crystals should be directly mounted on the X-ray diffractometer or placed inside a dewar, where samples can be kept for as long as necessary. Once frozen, crystals are transported under cryogenic conditions, usually with the use of dry-shipper dewars.

TIP 3: HAVE A PLAN FOR DATA COLLECTION
The most important part of this tip could be enclosed in one sentence: data collection is the last experiment in the course of a structure determination and it requires compromises (Fig. 3). Collecting bad data can unfortunately ruin all previous efforts and substantially influence the expected outcomes. To avoid this situation, the diffraction experiment should be prepared and conducted after careful planning. The vast majority of X-ray crystal structures in the Protein Data Bank are based on synchrotron data. State-of-the-art synchrotron sites dedicated to structural studies of biological samples offer small and focused beams, which allow routine diffraction measurements for microcrystal samples. Furthermore, X-ray diffraction data collection, including optimized anomalous dispersion element identification or phasing, experiments with crystals featuring large unit cells, as well as high resolution measurements, is now possible with shorter measuring times. Intense in-house laboratory sources also allow collection of single-wavelength diffraction data, even data suitable for effective S-SAD phasing; however, they are limited to the characteristic radiation of the X-ray anode material. The process of recording diffractograms relies on several principles that should be considered before data collection:
-The first important parameter is the wavelength of the X-rays that will hit the crystal. X-rays are of the same nature as visible light or radio waves; the only difference is their wavelength, which is very short (about 1 Å). The phenomena caused by the interaction of electromagnetic waves with the matter inside the crystal (particularly with the electrons) depend on the radiation wavelength. The choice of X-ray wavelength used during data collection depends on the strategy that will be used during structure determination, and here the most common approaches are molecular replacement and anomalous signal methods.
In the case of molecular replacement, which is viable when the structure of the model protein is known, single wavelength without special consideration of anomalous data suffices. Anomalous signal phasing methods require collection of single wavelength anomalous data (SAD) for selected marker elements (also possible for native sulfur (S-SAD) and native phosphorus (P-SAD)) or multiple anomalous diffraction data (MAD). In those cases, after spectrum determination of the absorption edge for anomalous marker(s), data sets are collected at single or various selected wavelengths in order to obtain the maximal anomalous signal. Also, when metal atoms are already present in the structure, it is advisable to collect the X-ray fluorescence spectrum which can be collected at most synchrotron beamlines. Recording X-ray fluorescence spectra, and collecting diffraction datasets above and below the corresponding metal absorption edges, in most cases allow to gather sufficient evidence to unambiguously determine the identity and location of the metal of interest, as well as to accurately characterize the coordinating ligands in the metal binding environment within the protein.
-Keeping in mind that the crystal structure is encoded in the diffracted X-rays, where crystal orientation, shape and symmetry of the unit cell define the directions of the diffracted beams, whereas the positions of all atoms in the unit cell define their intensities, a few more important aspects should be considered for successful data collection, including inspection of the first diffraction image and strategy determination. By visual examination of the first diffractogram, salt and protein samples can be easily distinguished. Furthermore, for cryocooled samples, the inspection of the collected image for presence and strength of diffraction rings caused by ice, will reveal whether the choice of the cryoprotective agent was appropriate. Modern software can deal with regions that should be excluded from data processing in cases where such "ice rings" are present on the images. Observation of strong, well-shaped and resolved spots up to a high resolution region suggests that collection of good quality data is possible. Nevertheless, it is not uncommon that the diffraction is anisotropic. To check whether the diffraction intensity does not vary too much with the orientation of crystal lattice, a second image for the crystal rotated by 45 or 90 degrees should be also recorded and inspected. The preliminary experiment tests the crystal in terms of its quality and allows us to decide on the strategy of XRD data collection hinging on the fact that the crystal symmetry influences the symmetry of spots' distribution on the images. Thus, it is crucial at this point to determine the space group and unit cell dimensions, this will help to get the information on how many diffraction images should be recorded. Moreover, evaluation of the maximum resolution will support the decision regarding the detector distance from the sample. 
Major advances in the field of automated data processing in terms of indexing, integration and scaling have been made in the last decades, but understanding the foundations of the applied protocols implemented in the chosen software is highly recommended (Powell, 2017).
-Strategy determination will also bring the information about the oscillation range and the time of exposure. To collect data of high quality, one should also consider the expected lifetime of the crystal, since radiation damage limits achievable resolution and data quality. This can be done for example with the BEST (Bourenkov & Popov, 2010) or RADDOSE (http://www.raddo.se/) (Garman, 2014) software packages. The final strategy applied in data collection also depends on the available geometry of the goniostat. The higher degree of freedom of crystal orientation the better. The most common synchrotron setups allow to rotate crystals around a single axis (phi), while 3-or 4-axes goniostats can be found as part of the in-house diffractometers, but increasing number of macromolecular crystallography beamlines also allows to rotate the sample around more than one axis. By using large area detectors, rotation around a single axis in most cases allows one to obtain complete data, regardless of the initial orientation of the crystal. The latest software available at synchrotron sites and in-house machines greatly helps to predict and collect data and it supports the most popular phasing methods, nevertheless the decisions need to be made by the crystallographer according to all available and previously gathered information. The data collection experiment should be conducted properly in order to obtain complete data. If the strategy was planned in a wrong way or a rapid decay of diffraction power occurred, some reflections may not be measured at all, and the data may not be complete. A number of synchrotron sites for macromolecular crystallography in Europe operates with MXCuBE (Gabadinho et al., 2010) and the latest version MXCuBE3 (Mueller et al., 2017) (https://mxcube.github.io/mxcube/), which supports the users in making reasonable decisions during data collection. 
Another important aspect of making the most of the beam time is the opportunity to process your data during or just after data collection. Quick examination of the final statistics will be beneficial in situations when for some reason the measurements went wrong and data collection needs to be repeated.
-As a final remark to this tip, remember that making a good plan for data collection is an effort that will pay off at the structure determination step. Losing a chance of obtaining good data for crystals that were not easy to obtain or cannot be easily reproduced can be fatal for the scientific project.
Additionally, a good practice is to save the raw images and to keep the copy at least till the work with structural results has been accepted for publication. The processes of structure validation and reviewing the manuscript can require repetition of data inspection or even data reprocessing. Moreover, it is highly advisable to deposit raw images at some open data repository once the publication has been accepted (vide infra).
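Some of the planning quantities mentioned in this tip (number of images, total exposure time, and the relation between detector distance and maximum resolution) follow from simple geometry and Bragg's law. The sketch below assumes a flat detector perpendicular to the beam with the direct beam at its center; all function and parameter names are hypothetical, and dedicated strategy programs such as BEST should be preferred for real experiments.

```python
import math


def number_of_images(total_rotation_deg: float, oscillation_deg: float) -> int:
    """Images needed to cover a rotation range with a given oscillation width."""
    if total_rotation_deg <= 0 or oscillation_deg <= 0:
        raise ValueError("rotation range and oscillation width must be positive")
    # Small tolerance guards against floating-point noise such as 180 / 0.1.
    return math.ceil(total_rotation_deg / oscillation_deg - 1e-9)


def resolution_at_edge(wavelength_A: float, distance_mm: float,
                       radius_mm: float) -> float:
    """Bragg spacing d (in Å) recorded at a given radius on a flat detector."""
    two_theta = math.atan2(radius_mm, distance_mm)
    return wavelength_A / (2.0 * math.sin(two_theta / 2.0))


def distance_for_resolution(wavelength_A: float, d_min_A: float,
                            radius_mm: float) -> float:
    """Crystal-to-detector distance (mm) that places d_min at the given radius."""
    sin_theta = wavelength_A / (2.0 * d_min_A)
    if not 0.0 < sin_theta < math.sin(math.pi / 4.0):
        raise ValueError("2-theta must stay below 90 degrees for a flat detector")
    two_theta = 2.0 * math.asin(sin_theta)
    return radius_mm / math.tan(two_theta)


if __name__ == "__main__":
    n = number_of_images(180.0, 0.1)
    print("images:", n, "total exposure (s):", n * 0.05)
    print("edge resolution (Å):", round(resolution_at_edge(1.0, 200.0, 150.0), 3))
    print("distance for 2.0 Å (mm):",
          round(distance_for_resolution(1.0, 2.0, 150.0), 1))
```

Moving the detector closer records higher-resolution reflections but packs the spots more densely, which matters for crystals with large unit cells.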
Finally, most European synchrotron beamlines dedicated to macromolecular crystallography offer some useful tips, access to a management system (e.g. ISPyB) (Delagenière et al., 2011) and guidelines for data collection that can be found on the respective web sites.
Before we discuss the most important issues of data processing, a minimal portion of theory regarding the diffraction experiment should be recalled. Each ray behind the reflections that can be seen on the collected images is characterized by its amplitude and phase. However, only the reflection amplitudes can be measured. They are proportional to the modulus of the structure factor F, which in turn is a sum of contributions of all atoms in the unit cell:
$$F(hkl) = \sum_{j=1}^{N} f_j \exp[2\pi i (h x_j + k y_j + l z_j)]$$
where f_j is the scattering factor of atom j and (x_j, y_j, z_j) are its fractional coordinates. The amplitudes can be obtained from the measured intensities:
$$I(hkl) \propto |F(hkl)|^2$$
but no direct information about reflection phases is provided by the diffraction experiment. The electron density function, defined at every point in the unit cell, has to be calculated from the measured structure factor amplitudes and their phases φ(hkl):
$$\rho(x, y, z) = \frac{1}{V} \sum_{hkl} |F(hkl)| \exp[i\varphi(hkl)] \exp[-2\pi i (h x + k y + l z)]$$
where V is the unit cell volume. Therefore, data processing that is aimed at extracting the relative intensities of the diffracted X-ray beams is a very important step in protein crystallography projects after diffraction data collection. First, recorded diffraction spots have to be indexed; next, the respective raw pixel intensities must be properly integrated and scaled after noise and background subtraction. Several different computer programs exist and can be used for this purpose. Among these are:
• XDS (http://xds.mpimf-heidelberg.mpg.de/) (Kabsch, 2010)
• HKL (https://www.hkl-xray.com/) (Otwinowski & Minor, 1997)
• DIALS (https://dials.github.io/) (Winter et al., 2018)
• XIA2 (https://xia2.github.io/) (Winter, 2010)
• Mosflm (https://www.mrc-lmb.cam.ac.uk/mosflm/mosflm/) (Battye et al., 2011).
Special attention should be paid to the step of space group assignment.
Wrong choice of the symmetry can lead to problems in finding the correct position of the model during molecular replacement, as well as can result in difficulties in phasing performed with the use of other methods. When refinement seems to be problematic, it is not an unusual procedure to search the solution after data reprocessing and select a different space group. If needed, this procedure can be performed with tools implemented in crystallographic software packages mentioned above.
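The relation between atomic positions, structure factors and measured intensities recalled earlier in this section can be illustrated with a toy calculation. The sketch below is a simplification under stated assumptions: each atom has a constant scattering factor (no dependence on resolution, no temperature factor), and the two-atom "structure" is invented. It shows that shifting the origin changes the phase of a reflection but leaves its intensity unchanged, so the intensity alone carries no record of the phase.

```python
import cmath
import math


def structure_factor(atoms, h, k, l):
    """F(hkl) as a sum over atoms given as (f, x, y, z), fractional coordinates."""
    return sum(f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
               for f, x, y, z in atoms)


def shift_origin(atoms, dx, dy, dz):
    """Translate all atoms by a fractional vector (wrapping into the unit cell)."""
    return [(f, (x + dx) % 1.0, (y + dy) % 1.0, (z + dz) % 1.0)
            for f, x, y, z in atoms]


# Invented two-atom structure; scattering factors are plain constants here.
atoms = [(6.0, 0.10, 0.20, 0.30), (8.0, 0.40, 0.15, 0.70)]

F = structure_factor(atoms, 1, 2, 3)
F_shifted = structure_factor(shift_origin(atoms, 0.25, 0.0, 0.0), 1, 2, 3)

if __name__ == "__main__":
    print("intensity:        ", abs(F) ** 2)
    print("intensity shifted:", abs(F_shifted) ** 2)
    print("phase (deg):        ", math.degrees(cmath.phase(F)))
    print("phase shifted (deg):", math.degrees(cmath.phase(F_shifted)))
```

The detector records only the first pair of numbers; recovering the second pair is the task of the phasing methods discussed in the following tip.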
As mentioned earlier, the collected data can be anisotropic. In case of anisotropic data it is now possible to address the statistical significance of the intensity data after merging with StarAniso (http://staraniso.globalphasing.org/cgi-bin/staraniso.cgi) (Tickle et al., 2018).
At this point we would like to encourage scientists who are new to protein crystallography to extend their knowledge about data collection statistics by reading dedicated literature. It is the author's responsibility to collect and provide accurate information about the data quality that fulfills the standards established by the crystallographic community. Here, the most valuable metrics pertinent to results of data processing are mentioned. The first parameter is resolution, which limits the overall achievable information about the structure. The second is the signal-to-noise ratio, which addresses data quality. The ratio I/σ(I) is the most recognizable parameter describing the signal strength, but a particularly informative indicator of the internal data consistency, apart from the popular R merge, R meas and R p.i.m. (Evans & Murshudov, 2013), is the correlation coefficient between randomly chosen half data sets, CC 1/2 (Karplus & Diederichs, 2012). Also, Isa, an asymptotic I/σ(I), the parameter used for identification of random and systematic errors associated with each dataset, should be evaluated (Diederichs, 2010). In order to estimate the useful "resolution" of the data, CC 1/2 is a better measure than R merge or R meas (Evans and Murshudov, 2013). Another important issue is data completeness, defined as the coverage of all theoretically possible unique reflections within the measured data set. Data completeness remarkably influences the process of structure determination and should not be lower than 95% (Dauter, 2017). Keep in mind that the completeness can, and often does, depend on the resolution range and can be lower in the highest resolution shell. If lower values are observed in the middle resolution ranges, the data should be carefully inspected.
The last parameter to be mentioned in our roadmap is redundancy (multiplicity), which refers to the fact that every reflection is measured with a certain degree of random error (Bourenkov and Popov, 2006), therefore the higher the redundancy, the more precise the final estimation of the averaged reflection intensity.
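To make the merging statistics discussed above more concrete, the sketch below computes multiplicity, R merge, R meas and R p.i.m. from a handful of invented observations. It assumes that symmetry-equivalent reflections have already been mapped onto the same unique index and that reflections measured only once are left out of the R sums; real programs may differ in such conventions, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def merging_statistics(observations):
    """observations: iterable of ((h, k, l), intensity) for unmerged data."""
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    n_obs = 0
    sum_intensity = 0.0
    num_merge = num_meas = num_pim = 0.0
    for values in groups.values():
        n = len(values)
        n_obs += n
        if n < 2:
            continue  # a single measurement carries no agreement information
        mean = sum(values) / n
        deviation = sum(abs(v - mean) for v in values)
        num_merge += deviation
        num_meas += math.sqrt(n / (n - 1)) * deviation
        num_pim += math.sqrt(1.0 / (n - 1)) * deviation
        sum_intensity += sum(values)

    if sum_intensity == 0:
        raise ValueError("no multiply measured reflections")
    return {
        "multiplicity": n_obs / len(groups),
        "r_merge": num_merge / sum_intensity,
        "r_meas": num_meas / sum_intensity,
        "r_pim": num_pim / sum_intensity,
    }


def completeness(n_unique_measured: int, n_unique_possible: int) -> float:
    """Fraction of theoretically possible unique reflections that were measured."""
    return n_unique_measured / n_unique_possible


# Invented example data: three unique reflections, six observations.
example = [((1, 0, 0), 100.0), ((1, 0, 0), 110.0), ((1, 0, 0), 90.0),
           ((0, 1, 0), 50.0), ((0, 1, 0), 54.0),
           ((0, 0, 1), 20.0)]

if __name__ == "__main__":
    for name, value in merging_statistics(example).items():
        print(f"{name}: {value:.4f}")
```

Because R merge does not account for multiplicity, it tends to rise as more observations are added even though the merged data improve, which is why R meas and R p.i.m. (and CC 1/2) are preferred.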

TIP 5: PHASING MEANS THINKING, REFINEMENT NEEDS TIME, VALIDATION IS A MUST, DEPOSITION IS A GOLD STANDARD
Several programs have evolved from the original concept of molecular replacement to allow faster and more sophisticated searches. The most popular, MOLREP (Vagin & Teplyakov, 1997) and Phaser (McCoy et al., 2007), are included in MrBUMP (Keegan and Winn, 2007) and BALBES (Long et al., 2007), two automated molecular-replacement pipelines. MoRDa is another interesting pipeline for automated molecular replacement; it solves protein structures using its own domain database derived from the PDB (Vagin & Lebedev, 2015). Very distant models or even secondary structure elements can also lead to a successful ab initio solution of macromolecular structures with Arcimboldo (Rodríguez et al., 2012). Besides MR, several experimental phasing methods are available (MIR, MAD and SAD), and they all rely on the premise that phase information can be obtained if the positions of marker atoms in the unknown crystal structure are known. The SHELXD (Sheldrick, 2010) module of the SHELX suite (http://shelx.uni-ac.gwdg.de/SHELX) and SOLVE (Terwilliger & Berendzen, 1999) are widely used for locating the heavy-atom sites. Direct methods are a class of solution techniques that generate good starting phases using only experimental intensities as a source of phase information, and here SnB (Miller et al., 1994), SHELXD and phenix.hyss implemented in PHENIX (Adams et al., 2002; Adams et al., 2010) can be applied. Often, starting phases can be improved by modifying them in light of all available phase information that arises from a combination of the known structure factor magnitudes, the current phase estimates, and stereochemical information. For this purpose a wide range of software can be used: DM (Cowtan, 2010), SOLOMON (Abrahams and Leslie, 1996), RESOLVE (Terwilliger, 2004) and PIRATE (Cowtan, 2010).

Important crystallographic terms and parameters
Unit cell* The unit cell is the parallelepiped built on the vectors, a, b, c, of a crystallographic basis of the direct lattice. Its volume is given by the scalar triple product, V = (a, b, c) and corresponds to the square root of the determinant of the metric tensor.

Space group*
The symmetry group of a three-dimensional crystal pattern is called its space group. For (chiral) macromolecules there are 65 possible space group symmetries.
Phase problem* Waves diffracted by a periodic distribution of simple scatterers obey Bragg's law, which allows ready determination of interplanar distances and thus the easy recovery of a description of the crystal structure. Where the scattering objects are complex (e.g. in molecular crystals) the diffracted radiation suffers a phase shift arising from the spatial distribution of individual scatterers. The amplitudes of the resulting structure factors are directly derivable from the experimental measured intensities of the diffracted beams, but the phases are not. Without a knowledge of the phases, it is not possible to reconstruct the individual atomic positions. Estimating the phases is an essential step in successful structure determination.
Structure factor* The structure factor F hkl is a mathematical function describing the amplitude and phase of a wave diffracted from crystal lattice planes characterised by Miller indices h,k,l.
MAD* An approach to solving the phase problem in protein structure determination by comparing structure factors collected at different wavelengths, including the absorption edge of a heavy-atom scatterer.

MR*
An approach to solving the phase problem by concentrating on phase relationships that arise through X-ray diffraction from similar molecular components. The components can be molecular fragments related through noncrystallographic symmetry (e.g. icosahedral subunits of a virus) or a similar molecule such as a homologous protein with high sequence identity.

SAD
The method of single-wavelength anomalous dispersion used for solving the phase problem, makes use of data collected at just one wavelength, typically at the absorption peak or high-energy remote. It minimizes problems of radiation damage and nonisomorphism, but requires very accurate measurements.

MIR
In the method of multiple isomorphous replacement the interference effects on the intensities of the diffracted beams caused by the addition of heavy atoms to the protein provide the estimates of the phase angles.

Resolution*
In crystal structure determination, the term resolution is used to describe the ability to distinguish between neighboring features in an electron density map. By convention, it is defined as the minimum plane spacing given by Bragg's law for a particular set of X-ray diffraction intensities. The resolution improves with an increase in the maximum value of (sinθ)/λ at which reflections are measured.
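R merge
R merge is a measure of the uncertainty for unmerged reflections:
$$R_{\mathrm{merge}} = \frac{\sum_{hkl}\sum_{i}\left|I_i(hkl) - \langle I(hkl)\rangle\right|}{\sum_{hkl}\sum_{i} I_i(hkl)}$$
where $I_i(hkl)$ = intensity of an individual reflection with indices (hkl), and $\langle I(hkl)\rangle$ = mean value of the intensity for all reflections with indices (hkl), including those that are equivalent by symmetry.

R meas
R meas is a multiplicity-corrected measure of the uncertainty for unmerged reflections:
$$R_{\mathrm{meas}} = \frac{\sum_{hkl}\sqrt{\frac{n}{n-1}}\sum_{i}\left|I_i(hkl) - \langle I(hkl)\rangle\right|}{\sum_{hkl}\sum_{i} I_i(hkl)}$$
where $n$ = number of observations of the reflection (hkl), and $I_i(hkl)$ and $\langle I(hkl)\rangle$ are defined as for R merge.

R p.i.m.
R p.i.m. provides an estimate of data quality after merging multiple observations:
$$R_{\mathrm{p.i.m.}} = \frac{\sum_{hkl}\sqrt{\frac{1}{n-1}}\sum_{i}\left|I_i(hkl) - \langle I(hkl)\rangle\right|}{\sum_{hkl}\sum_{i} I_i(hkl)}$$
with the same symbols as for R meas.

CC 1/2
The CC 1/2 is a special case of Pearson's correlation coefficient (CC):
$$CC = \frac{\sum (x - \langle x\rangle)(y - \langle y\rangle)}{\sqrt{\sum (x - \langle x\rangle)^2 \sum (y - \langle y\rangle)^2}}$$
A single dataset is divided randomly into two subsets (half the unmerged reflections with indices (hkl) are put into subset x, and half into subset y in the above formulation) and CC is calculated between these.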

R (R work )*
The term R factor in crystallography is commonly taken to refer to the 'conventional' R factor, a measure of agreement between the amplitudes of the structure factors calculated from a crystallographic model (F calc) and those from the original X-ray diffraction data (F obs). The R factor is calculated during each cycle of least-squares structure refinement to assess progress. The final R factor is one measure of model quality.
R free * A residual function calculated during structure refinement in the same way as the conventional R factor (see above), but applied to a small subset of reflections that are not used in the refinement of the structural model. The purpose is to monitor the progress of refinement and to check that the R factor is not being artificially reduced by the introduction of too many parameters.
*From Online Dictionary of Crystallography (International Union of Crystallography)

K. Kurpiewska and others

Irrespective of the phasing method, the aim of crystallographic model building is to construct a model that explains the experimental data, on the condition that it makes physical and chemical sense. The latest trend in computational tools in protein crystallography is the development of all-integrated pipelines. Examples of the latter are ARP/wARP (Macromolecular Model Building for Crystallography and Cryo-EM; http://www.embl-hamburg.de/ARP/) (Chojnowski et al., 2019), RESOLVE (Terwilliger, 2001) and BUCCANEER (Potterton et al., 2004; Cowtan, 2006).
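Before turning to refinement, the merging statistics listed in the glossary (Table 2) can be made concrete with a small calculation. The following is a minimal sketch, assuming the commonly used definitions of R merge, R meas and R p.i.m., in which n is the number of observations of a given reflection. The function name and the input layout (symmetry-equivalent observations already mapped onto one index) are hypothetical choices for illustration, and the intensities are made up; data processing programs report these values automatically.

```python
import math
from collections import defaultdict


def merging_statistics(observations):
    """Return R merge, R meas and R p.i.m. for unmerged observations.

    observations: iterable of ((h, k, l), intensity) pairs in which
    symmetry-equivalent reflections already share the same index.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    merge_num = meas_num = pim_num = denominator = 0.0
    for intensities in groups.values():
        n = len(intensities)
        if n < 2:
            continue  # a single observation carries no information on spread
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        merge_num += abs_dev
        meas_num += math.sqrt(n / (n - 1)) * abs_dev
        pim_num += math.sqrt(1 / (n - 1)) * abs_dev
        denominator += sum(intensities)

    return {
        "r_merge": merge_num / denominator,
        "r_meas": meas_num / denominator,
        "r_pim": pim_num / denominator,
    }


# Made-up intensities for two reflections, for illustration only.
example_observations = [
    ((1, 0, 0), 100.0), ((1, 0, 0), 110.0),
    ((0, 1, 0), 50.0), ((0, 1, 0), 50.0), ((0, 1, 0), 56.0),
]
stats = merging_statistics(example_observations)
print({name: round(value, 4) for name, value in stats.items()})
```

With these definitions R meas is never smaller than R merge, while R p.i.m. decreases as the multiplicity grows, which is why it is used to describe the precision of the merged data.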
The model building is usually performed simultaneously with the process of refinement. In other words, after solving the crystallographic phase problem, the initial model is refined and accordingly the parameters of the model (geometry and B-factor values) are optimized to fit the observations using a refinement function. Different programs, provided by such crystallographic packages as CCP4 (Winn et al., 2011), SHELX (Sheldrick, 2008) or PHENIX (Adams et al., 2010), can be utilized for this purpose. Model refinement programs are coupled with graphics display programs, for example the most popular one, COOT, that allow model rebuilding and interpretation of regions of the difference Fourier map (unexplained by the model). The model is refined to the point when it is complete and further improvements to the structure are not possible. This is done in an iterative way until convergence is reached, monitored by the values of the R and R free factors (Table 2). The R factors measure how well the simulated diffraction pattern matches the experimentally observed diffraction pattern. R free is based on a test set consisting of a small percentage (usually ~5-10%) of reflections excluded from structure refinement. Another important aspect that should be kept in mind is that the appearance of Fourier maps depends more on the phases than on the amplitudes. Consequently, even if the correct amplitudes are known from a well-conducted diffraction experiment, inaccurate phases may introduce map bias, which may be difficult to eliminate during the refinement and modeling process.
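To illustrate how R and R free are related, the sketch below computes both from a list of observed and calculated structure factor amplitudes. It assumes the conventional definition R = Σ| |F obs| − k|F calc| | / Σ|F obs|, with the scale factor k simply set to 1; the function names, the record layout and the amplitudes are hypothetical and serve only as an example. Refinement programs apply more elaborate scaling and report these values themselves.

```python
def r_factor(reflections, scale=1.0):
    """Conventional R factor over (f_obs, f_calc) amplitude pairs."""
    numerator = sum(abs(f_obs - scale * f_calc) for f_obs, f_calc in reflections)
    denominator = sum(abs(f_obs) for f_obs, _ in reflections)
    return numerator / denominator


def r_work_and_r_free(reflections):
    """Split (f_obs, f_calc, is_free) records into working and test sets.

    Reflections flagged as free are assumed to have been excluded from
    refinement, so their R factor is an unbiased check on the model.
    """
    work = [(fo, fc) for fo, fc, is_free in reflections if not is_free]
    free = [(fo, fc) for fo, fc, is_free in reflections if is_free]
    return r_factor(work), r_factor(free)


# Made-up amplitudes for illustration only: (F obs, F calc, free flag).
example_reflections = [
    (100.0, 90.0, False),
    (50.0, 55.0, False),
    (200.0, 190.0, False),
    (80.0, 100.0, True),
    (120.0, 100.0, True),
]
r_work, r_free = r_work_and_r_free(example_reflections)
print(f"R work = {r_work:.3f}, R free = {r_free:.3f}")
```

A large gap between the two values, as in this artificial example, is the kind of signal that suggests the working-set R factor is being lowered by over-fitting rather than by a genuinely better model.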
To perform automated crystal structure determination, sophisticated platforms can be used. By cascading execution of a number of macromolecular crystallographic programs, efficient pipelines are produced. A new version of HKL, HKL3000 (Minor et al., 2006), includes all the steps from data collection, through processing, to structure determination within a single interface with the traditional graphical features of HKL. Similar functionality is offered by Auto-Rickshaw (Panjikar et al., 2005). Recent years have brought more systems that facilitate the process of structure determination; for example, XChemExplorer (XCE) provides an intuitive graphical user interface which guides the user from data processing, initial map calculation, ligand identification and refinement up to data dissemination. Furthermore, the demand from a growing number of fragment screening experiments led to the development of PanDDA (https://pandda.bitbucket.io/), which allows analysis of such data. Small molecules and ligands are abundantly represented in the PDB: nearly 80% of deposits contain chemicals that do not belong to proteins or nucleic acids. The quality of small molecule models can be improved by the use of geometrical restraints. This common technique for the refinement and validation of small molecule binding sites in protein-small molecule complexes benefits from geometrical parameters derived from the very high-resolution structures in the Cambridge Structural Database (CSD) (https://www.ccdc.cam.ac.uk/) (Groom et al., 2016). The ligand binding-site identification, ligand description and conformer generation, ligand fitting, refinement and subsequent validation can be successfully performed with a set of dedicated software: eLBOW (part of the PHENIX suite) (Moriarty et al., 2009), JLigand (implemented in the CCP4 project) (Lebedev et al., 2012), and Grade (part of BUSTER) (http://grade.globalphasing.org).
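The idea behind geometrical restraints can be sketched with bond lengths: each modelled distance is compared with an ideal value, and the deviation is weighted by the expected standard deviation. The example below is a simplified illustration, assuming a harmonic penalty of the form Σ((d − d ideal)/σ)². The atom names, ideal lengths and σ values are invented for the example and are not taken from the CSD or from any restraint dictionary, and the function is not the implementation used by eLBOW, JLigand or Grade.

```python
import math


def bond_restraint_report(coords, restraints):
    """Compare modelled bond lengths with ideal values.

    coords: dict mapping atom name -> (x, y, z) in angstroms.
    restraints: list of (atom1, atom2, ideal_length, sigma) tuples.
    Returns per-bond (atom1, atom2, observed, z_score) records, the
    summed harmonic penalty and the root-mean-square Z score.
    """
    records = []
    for atom1, atom2, ideal, sigma in restraints:
        observed = math.dist(coords[atom1], coords[atom2])
        z_score = (observed - ideal) / sigma
        records.append((atom1, atom2, observed, z_score))
    penalty = sum(z * z for _, _, _, z in records)
    rms_z = math.sqrt(penalty / len(records))
    return records, penalty, rms_z


# Hypothetical ligand fragment and hypothetical target values.
ligand_coords = {
    "C1": (0.00, 0.00, 0.00),
    "C2": (1.50, 0.00, 0.00),
    "O1": (1.50, 1.25, 0.00),
}
ligand_restraints = [
    ("C1", "C2", 1.52, 0.02),
    ("C2", "O1", 1.23, 0.02),
]
records, penalty, rms_z = bond_restraint_report(ligand_coords, ligand_restraints)
for atom1, atom2, observed, z_score in records:
    print(f"{atom1}-{atom2}: {observed:.2f} A, Z = {z_score:+.1f}")
```

In refinement such a term is minimized together with the fit to the diffraction data; in validation the same Z scores are used to flag bonds that deviate strongly from the expected geometry.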
It is the primary goal of structural databases to provide highly reliable data, where "reliability" is defined by rigorous validation strategies and quality indicators. Thus, for instance, the PDB actively works with journals and depositors to provide feedback at an early stage, often actually improving the quality of the data that is to be deposited. The latter was a motivation for an independent initiative, now running for many years, which is the PDB REDO project (https://pdb-redo.eu/) (Joosten et al., 2009). It provides a re-refined structure with suggested improvements, i.e. a new set of coordinates, for each and every PDB deposit. It also offers a useful server that allows depositors, before they deposit, to look at the PDB REDO version of their current cycle of model refinement.
Model validation of the protein polypeptide chain can be performed with several programs that provide a statistical evaluation of the geometrical parameters of the structure. For the purpose of validation, scientists can refer to MolProbity, PROCHECK (Laskowski et al., 1993), WHAT_IF (Vriend, 1990) and SFCHECK (Vaguine et al., 1999). After careful inspection of the validation results, which can also be obtained with the wwPDB OneDep System (https://validate-rcsb-2.wwpdb.org/), and after resolving the pinpointed issues, the authors can deposit their structures in the PDB. This last step, leading to the release of data via the public repository, is a prerequisite for publishing structural reports and, by revealing experimental details, it also supports the idea of reproducible science.
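One of the geometrical parameters evaluated during validation is the pair of backbone torsion angles φ and ψ, which underlie the Ramachandran plot. As a minimal sketch of how such an angle is obtained from atomic coordinates, the function below computes a signed dihedral angle from four points; φ is defined by the atoms C(i−1), N(i), CA(i), C(i), and ψ by N(i), CA(i), C(i), N(i+1). The helper names and the test coordinates are hypothetical, and the code is not taken from any of the validation programs mentioned above.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    The angle is measured about the p1-p2 bond; it is 0 for a cis
    arrangement and 180 (or -180) for a trans arrangement.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates chosen so that the expected angle is easy to see.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(f"dihedral = {angle:.1f} degrees")
```

Applying the function to the appropriate backbone atoms of every residue gives the (φ, ψ) pairs that validation programs compare with the distributions observed in well-determined structures.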
Even though validation is nowadays an efficient procedure, one should remember that a careful and critical evaluation of macromolecular structures, in terms of quality and reliability, remains crucial both before relying on existing deposits (MR models, homologues, orthologs) and during submission of one's own structure (Dauter et al., 2014).
Furthermore, the deposition and annotation tools implemented in the PDB require depositors to submit atomic coordinates and primary experimental data together with the associated metadata. The ease of archiving raw diffraction data sets is a remarkable development of recent years. In addition, the desire to maximize the availability of research data in accordance with the so-called FAIR principles (Findable, Accessible, Interoperable, and Re-usable; https://www.force11.org/group/fairgroup/fairprinciples) (Wilkinson et al., 2016) encourages crystallographers to deposit and share the raw data. The Integrated Resource for Reproducibility in Macromolecular Crystallography (https://proteindiffraction.org/) (Grabowski et al., 2019) and the Macromolecular Xtallography Raw Data Repository (https://mxrdr.icm.edu.pl/) are good examples of such initiatives that include a repository system.

TIP 6: ANALYZE AND VISUALIZE WITH THE USE OF GRAPHICAL TOOLS
A 3D protein structure model is a very rich information source which is best analyzed with the help of advanced visualization software. There are currently many graphics programs suitable for displaying and analyzing protein structures, most of them with the capability to: display various representations at once (cartoon, ribbon, ball-and-sticks, sticks, etc.), apply different coloring schemes (by atom type, B-factor value, secondary structure, etc.), measure geometrical parameters of the model, identify steric clashes, display electron density maps, and save high quality graphics. The majority of the programs also have a scripting interface, which is very useful for automating routine procedures and for saving and restoring the work. A comprehensive review of the available graphical software packages is far beyond the scope of this brief review, hence here we just list some popular, freely available packages with links to their websites.

TIP 7: ANALYSIS OF STRUCTURAL FEATURES WILL PUT YOUR STRUCTURE IN A BROADER CONTEXT
Analysis of protein structures and their interactions with other molecules is often very helpful in elucidating their cellular functions and mechanisms of action. Thus, XRD structural methods belong to the leading scientific strategies for identifying a protein's biological and biochemical relevance.
Analysis of macromolecular interfaces, including prediction of the likely oligomeric state and generation of its coordinates, calculation of interface area and estimation of the free energy of assembly dissociation, are only selected capabilities offered by the PDBePISA server (https://www.ebi.ac.uk/msd-srv/prot_int/pistart.html) (Krissinel & Henrick, 2007). The server also lists the amino acids making up the interfaces, evaluates the significance of individual residues for macromolecular contacts and offers an advanced search engine for biological interfaces among structures deposited in the PDB.
The DALI server (http://ekhidna2.biocenter.helsinki.fi/dali/) (Holm, 2020) allows one to perform protein comparison based on the 3D structure. The server offers several options, including searching the PDB for similar 3D structures, pairwise comparison between selected structures (or individual chains) and "all against all" structural comparison for up to 64 structures. Several modes of results visualization, including structural trees, structurally aligned sequence logos and 3D models with mapped structural or sequence variation, aid in the analysis of the results.
Structural studies on multi-domain or multi-chain proteins may yield structures corresponding to different conformational states of the macromolecule, e.g. closed vs open conformation. In such a case, the DynDom program or server (http://dyndom.cmp.uea.ac.uk/dyndom/) may turn out to be very useful to identify hinge residues and moving domains, as well as the axes about which the (components of) movement take place (Poornam et al., 2009). The DynDom website also hosts several browsable databases with results of protein domain movement analysis.
Structures of protein-ligand complexes provide valuable insights into interactions between a small molecule, which can be e.g. an inhibitor, a drug or a reactant, and the host macromolecule. Classification of these interactions is greatly facilitated by the Arpeggio program or server (http://biosig.unimelb.edu.au/arpeggioweb/) (Jubb et al., 2017), which identifies the type of interaction between the ligand-protein atom pairs (more than a dozen different types) and generates a PyMOL session file, which can be used to visualize the results in 3D.
For an example of such analysis see Fig. 4, where the overall structure of hyoscyamine 6β-hydroxylase (H6H, PDB: 6ttm) (Kluza et al., 2020) is depicted (panel A), the secondary structure is highlighted (panel B) and key interatomic interactions engaged in the enzyme-substrate recognition are presented (panel C).
The PDBsum server is also worth noting here, as it provides a succinct yet richly illustrated summary of protein structure (3D, secondary and primary), its interactions with ligands and metal ions analyzed and illustrated by LIGPLOT (Wallace et al., 1995), as well as 3D visualization of clefts and cavities within the protein molecule. A quality assessment report generated by PROCHECK is also available for each PDB entry. Such reports, which are compiled and stored on the server for PDB entries, can also be generated for PDB files uploaded by the user.

Figure 4. Visualization of the structure and key interatomic interactions for hyoscyamine 6β-hydroxylase (H6H) in complex with its substrate, hyoscyamine (PDB: 6ttm). (A) Overall structure of H6H complexed with Ni 2+ (cyan sphere), hyoscyamine (sticks, C atoms in salmon) and the co-substrate mimic, N-oxalylglycine (sticks, C atoms in red). Two histidine residues and one aspartate that coordinate the metal are also shown (sticks, C atoms in green). (B) An overview of H6H with secondary structure highlighted: α-helices in red, β-strands in yellow, loops and coil regions in green. (C) Close-up view of the hyoscyamine (sticks with C atoms in green) binding pocket with depicted key interactions between the substrate and protein that were identified by the Arpeggio server. Hydrogen bonds as red discs, C-H…π interactions as white discs, donor…π interactions as blue discs, weak polar interactions as orange discs. Graphics were generated with PyMOL.

CONCLUDING REMARKS
Protein crystallography, together with cryo-EM and NMR, is among the most powerful techniques for the structure determination of macromolecules, as well as for the analysis of mechanisms of protein actions and interactions at the atomic level. The algorithms and methods for structure determination initially formulated decades ago are now becoming more and more elaborate, but thankfully the computational tools wrapping around these advanced methods have evolved toward simpler and more user-friendly packages and web interfaces. This, combined with amply available tutorials, YouTube channels, manuals, data deposited at open repositories and other educational materials freely available on the internet, lowers the "activation barrier" for a novice in the field eager to learn protein crystallography methods. We hope this short review will be a useful aid in this fascinating journey.