Get started at National Supercomputer Centre (NSC)

A pilot MX 3D structure determination project

In 2013 we started a pilot project aiming at installing protein crystallography software in an HPC environment, to investigate the performance of such a setup with respect to remote graphics, speed of calculation and the potential to run supercomputer-adapted software in a suitable environment. For 2016 we share 15 000 core hours/month, each user has 20 GB of disk space in /home/x_user, and the entire project has 2000 GB under /proj/xray/users/x_user. Below you will find how to join the pilot and brief instructions on how to run MX software in an HPC environment. The pilot was recently presented in a poster at the joint SBnet/SFBBM meeting in Tällberg 2016. See also the NSC Triolith getting started guide.

Join and Access project

Swedish structural biologists are welcome to join the project. We limit ourselves to MX calculations, since 15 000 core hours/month is not enough to support molecular dynamics (MD) for all research groups. For MD simulations, please apply for a SNAC medium project, i.e. 20 000 core hours per month, to support your research: https://supr.snic.se/round/2016medium/

  1. Register yourself in SUPR
  2. Perform "Request membership in project"
  3. Project info
  4. Please edit the Create New NSC User form

Access NSC Triolith

(A) First-time users should log in to NSC via ssh to reset their initial password: ssh -X NSCusername@triolith-thinlinc.nsc.liu.se

(B) Download the ThinLinc client for any Windows, Linux or Mac computer from Cendio and use it to access NSC Triolith

(C) When working on the PSF Linux network, the ThinLinc client is started with the command "thinlinc"

 

sbatch vs interactive

A typical NSC Triolith session looks like this:
1. Open a terminal
2. If possible, generate a parameter file (PHENIX software)
3. Edit an sbatch script (e.g. phaser.script/buster.script/sharp.script/mr_rosetta.script/rosetta_refine.script)
4. sbatch phaser.script (or buster.script etc.)

Many key MX software packages can be started using an sbatch script, for example:

  • BUSTER - uses 8 processors via the BUSTER parameter "nthreads 8" and (SBATCH -n 8), see the BUSTER docs
  • autoSHARP - sharp/shelxd use 16 processors, i.e. (SBATCH -N 1)
  • PHENIX MRage or rosetta_refine - use 16 processors, i.e. (SBATCH -N 1) and a saved parameter file
  • PHENIX MR rosetta or Autobuild - use 16 processors, i.e. either (SBATCH -N 1) and a saved parameter file, or the GUI "submit queue job"
  • CCP4 PHASER - parallel, uses 16 processors, i.e. (SBATCH -N 1)
  • Arcimboldo_lite - parallel, uses 16 processors, i.e. (SBATCH -N 1)

The graphical software coot, pymol, adxv, albula and the phenix GUI can be run directly at the login node. If your software cannot be started using an sbatch command, you have to request an interactive (or development) node - see Basic HPC Commands below.
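All of the sbatch scripts in this guide follow the same basic pattern; as a minimal sketch (the walltime, core request and module names below are placeholders to adapt to your own job):

#!/bin/bash
#SBATCH -t 1:00:00            # wall-time limit (hh:mm:ss) - example value
#SBATCH -N 1                  # one full node (16 cores); use e.g. -n 8 to request 8 cores instead
#SBATCH --mail-type=ALL       # e-mail notification when the job starts/ends/fails
module load proj/xray         # make the project specific modules visible
module load ccp4/7.0-module   # load the software your job needs (example module)
# ...the command line of your MX program goes here...

Submit it with "sbatch myjob.script" and follow it with "squeue -u x_user".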

 

MX software dependencies

Examples of MX software dependencies to consider when writing sbatch scripts:

Software                Depends on
XDSAPP                  xds/xds-viewer/ccp4/phenix
autoSHARP               shelx/ccp4
hkl2map                 shelx/ccp4
phenix.mr_rosetta       rosetta (1)
phenix.rosetta_refine   rosetta (1)

(1) The rosetta module is loaded using "module load phenix/1.11.1-2575"
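For example, before starting XDSAPP (at an interactive or development node, see below) all four of its dependencies have to be loaded first; a sketch using the module versions listed further down (adapt to whatever versions are currently installed):

module load proj/xray            # gives access to the project specific modules
module load xds/2016-05-01       # XDS itself
module load xds-viewer/0.6       # XDS-Viewer
module load ccp4/7.0-module      # CCP4 suite
module load phenix/1.11.1-2575   # PHENIX
module load xdsapp/2.0           # XDSAPP can now find xds, ccp4 and phenix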

MX module commands

Modules can be loaded by .bashrc, by .modules, or by "module load xxx" in the terminal window. At NSC, using .modules for frequently used modules is preferred over .bashrc; examples of a .modules file and a .bashrc file are shown below, where modules relevant to MX are commented out with "#" in .bashrc. When opening a terminal window, many modules are global and available in the directory specified by $MODULEPATH, while project specific modules become available after "module load proj/xray", which changes the value of $MODULEPATH. Project specific modules are used for frequently changing software.

Example of .modules file for frequently used MX software
$ more .modules
fasta/v36
usf/March2011

$
Please note
A) the empty line at the end of the .modules file, created by a carriage return at the end of the file.
B) Project specific modules ("module load proj/xray") have to be loaded manually.
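Whichever method you choose, the standard module commands can be used in any terminal to check what is available and what is loaded, for example:

module load proj/xray        # adds the project specific modules to $MODULEPATH
module avail                 # list all modules that can now be loaded
module load buster/20160324
module list                  # show the modules currently loaded in this shell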
Global MX modules at NSC updated 2016-09-07
# For frequently used global modules please edit your $HOME/.modules file instead of .bashrc

# Modules to be loaded must not have the comment character "#" in front
# module add fasta/v36
# module load usf/March2011

# Data Processing OBS XDSAPP requires ccp4/phenix/xds/xds-viewer modules
# module load proj/xray
# module load xds/2016-05-01
# module load xds-viewer/0.6
# module load xdsgui/2014-12-10  
# module load xdsapp/2.0
# module load adxv/1.9.9 
# module load albula/3.0

# Working with CCP4 and coot,
# module load ccp4/6.5.0-module (ccp4 v7.0 available - see Project specific modules)

# DIALS contains its own xia2 and need ccp4 to run
# module load dials/dev-291 (ccp4 v7.0 include dials via xia2)

# Phasing modules
# module load shelx/expires-2017-01-01
# module load hkl2map/0.4.c-beta
# module load snb/2.3

# Work with graphics
# module add pymol/1.6.0.0
# module load vasco/1.0.2

# Storage module
# module load irods/3.3

# Project specific modules
# module load proj/xray (influences your $MODULEPATH to find the project specific modules)
# module load phenix/1.11.1-2575 - includes rosetta 3.7 (2016.32.58837 bundle)
# module load ccp4/7.0-module
# module load arcimboldo_lite/nov2016
# module load sharp
# module load dials/dev20160202
# module load shelx/2016-02-01
# module load hkl2map/0.4 
# module load buster/20160324
# module load xdsme/20160406
# module load cns/1.3
# module load morda/20160531
# module load xds/2016-05-01
# module load xdsgui/2016-05-20
# module load adxv/1.9.11
# module load chimera/11.1 (start with "vglrun chimera")
# module load O/14.1 (start with "ono")
 

 

Basic HPC commands

Sharing HPC resources such as computing time and disk space demands some special commands, exemplified below.

HPC command                                       Consequence
interactive -N1 -t 1:00:00                        Reserve an interactive node for computing
exit                                              Exit interactive-node terminals to save quota
interactive -N1 -t 0:59:00 --reservation=devel    Reserve 59 min of a development node
set ccp4i TEMPORARY to $SNIC_TMP                  MOLREP might otherwise fill all tmp space on compute nodes
snicquota                                         Show disk space available at /home and /proj/xray
squeue -u x_user                                  Check my running JOBIDs
scancel JOBID                                     Cancel a running job using its JOBID
jobsh -j JOBID n148                               Access compute node n148 using JOBID
#SBATCH -A snic2016-1-XXX                         Charge sbatch script compute time to project XXX
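For illustration, a short interactive session built from the commands above could look like this (the walltime and modules are examples only):

interactive -N1 -t 1:00:00    # wait for an interactive compute node
module load proj/xray
module load ccp4/7.0-module
ccp4i2                        # run graphical work on the compute node
exit                          # leave the node as soon as you are done, to save quota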

 

BUSTER and SHARP

Buster at compute nodes

At NSC Triolith, BUSTER by default uses 4 processors, i.e. use (#SBATCH -n 4) in buster.script, or, if using "-nthreads 8", use (#SBATCH -n 8) in the buster.script file. "sbatch buster.script" is efficient and saves compute time compared to requesting an interactive node (16 processors) and then running a standard BUSTER job on the command line that uses only 4 out of 16 processors, i.e. wasting compute time on 12 processors.

From the login node simply perform:
sbatch buster.script

where buster.script is:

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -n 8
#SBATCH --mail-type=ALL
module load proj/xray
module load buster/20160324
refine -p model.pdb \
-m data.mtz \
-l chiral.dat \
-l grade-AME.cif \
-Gelly tweak4.gelly \
-nthreads 8 \
-autoncs \
-TLS \
AutomaticFormfactorCorrection=yes \
StopOnGellySanityCheckError=no \
-d run1 > run1.log

This will submit your job to the NSC compute nodes and no time or resources are wasted.

For help with the refine command, please visit the BUSTER wiki or run:

module load proj/xray
module load buster/20160324
refine -hhh

to read the help files for the refine command.

SHARP at compute nodes

The command-line variant of SHARP is suitable for use with the sbatch command at NSC Triolith. Please find below an example of how to submit a three-wavelength zinc MAD phasing job to the NSC Triolith compute nodes:

sbatch sharp.script

where sharp.script is:

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load ccp4/7.0-module
module load shelx/2016-02-01
module load sharp
run_autoSHARP.sh \
-seq sequence.pir -ha "Zn" \
-nsit 20 \
-wvl 1.28334 peak -mtz pk.mtz \
-wvl 1.27838 infl -mtz ip.mtz \
-wvl 0.91841 hrem -mtz rm.mtz \
-id MADjob1 | tee MADjob1.lis

The purist should note that this example lacks measured zinc scattering factor (f' and f") values at the three wavelengths; additional examples for SAD/MAD/MIRAS etc. are found at the GlobalPhasing homepage. The script above will allocate an entire compute node for 8 hours, and since shelxd and sharp are parallel programs, and arp/warp and buccaneer are partially parallel, the fast execution time is suitable for synchrotron usage.

 

Parallel PHASER MR

When trying many different search models and ensembles for molecular replacement, running the PHASER software in parallel mode is a time saver. PHASER is available both in CCP4 and in PHENIX, and here are two ways to generate an sbatch script for parallel execution at the compute nodes, followed by two GUI-based alternatives (Examples 3 and 4).

Example 1, sbatch script for CCP4 Phaser

1. Edit phaser.script below: an ensemble of two aligned search models in MR_AUTO mode and 16 processors

#!/bin/bash
#SBATCH -t 00:30:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load ccp4/7.0-module
phaser << eof
MODE MR_AUTO
HKLIN data.mtz
JOBS 16
LABIN F = F_XDSdataset SIGF = SIGF_XDSdataset
ENSEMBLE ens1_1 PDBFILE search_model1.pdb IDENTITY 0.99 &
    PDBFILE search_model2.pdb IDENTITY 0.99
COMPOSITION PROTEIN MW 48334 NUM 2
SEARCH ENSEMBLE ens1_1
ROOT ./ens1_1
eof

2. Execute phaser.script as:

sbatch phaser.script

 

Example 2, PHENIX Phaser MRage

1. Start and edit the "MRage-automated pipeline" wizard; since "submit queue job" is not available there, save a parameter file called mr_pipeline_1.eff

2. Edit MRage.script as:

#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load phenix/1.11.1-2575
phenix.mrage mr_pipeline_1.eff

3. Then finally execute

sbatch MRage.script

which will submit your job to a compute node. In this particular example an entire node is allocated for 1 hour. More information on sbatch scripting is available in the NSC Triolith documentation.

 

Example 3, Phaser using ccp4i2 GUI at interactive node

  1. Get X hours of compute time at an interactive node with "interactive -N1 -t X:00:00"
  2. module load proj/xray
  3. module load ccp4/7.0-module
  4. ccp4i2
  5. Read in the appropriate data.mtz, sequence.seq and model.pdb files and start one of the Phaser modules of ccp4i2
  6. In the Phaser input module, open "Keywords" and change "Number of parallel threads" from 1 to 16
  7. Run Phaser

Example 4. Submit Parallel phaser job to compute node from Phenix GUI

  1. module load proj/xray
  2. module load phenix/1.11.1-2575
  3. Start the Phenix GUI with "phenix"
  4. Edit the phenix preferences - 1 hour with 16 processors is enough for most MR jobs
  5. Edit "Phaser MR (simple one component interface)" and use "Run" - "Submit queue job"

arpWarp

ARP/wARP is part of many autobuilding pipelines, such as autoSHARP, and some of its steps run in parallel (1), making it suitable for command-line sbatch scripting.

  1. Automated macromolecular model building for X-ray crystallography using ARP/wARP version 7.
    Langer G, Cohen S, Lamzin V, Perrakis A
    Nat Protoc 2008 ;3(7):1171-9

Example: arpWarp.script for rebuild of MR solution

#!/bin/bash
#SBATCH -t 2:00:00        # example walltime, adjust to your case
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load ccp4/7.0-module
auto_tracing.sh \
datafile ./data.mtz \
fp F_XDSdataset \
sigfp SIGF_XDSdataset \
modelin ./MR_solution.pdb \
seqin ./model_sequence.pir

Then submit arpWarp.script to compute nodes by:

sbatch arpWarp.script

More parameters for auto_tracing.sh can be added to arpWarp.script according to the authors' instructions here

Phenix and Rosetta

Phenix and Rosetta are two large software packages under rapid development. Currently phenix version 1.11.1-2575 (26 Oct 2016) and rosetta 3.7 (2016.32.58837) are available to NSC users. The PHENIX software package comes with a SLURM scheduler interface in the graphical user interface (GUI) that enables submission of jobs to the compute nodes directly from the phenix GUI. To enable SLURM scheduling from the Phenix GUI:

1. Edit processes in phenix preferences

2. Press the Run symbol (gear wheel) and select "submit queue job"

In the Autobuild wizard, under "Other options", one can select "Use queueing system to distribute tasks"; however, this is NOT required since 15 processors is enough for Autobuild. Many phenix GUI wizards do not enable "submit queue job", for instance "MRage-automated pipeline" and "Rosetta refinement (alpha)". In these cases you can save a parameter file and run a simple sbatch command to submit the job to the Triolith queue.

 

MR rosetta rebuild of MR solution

At NSC, molecular replacement is best performed using the parallel phaser software MRage; however, MR rosetta rebuilding of MR solutions can save a lot of time and effort.

Below we use the sbatch option, although "submit queue job" is another possibility for starting MR rosetta from the phenix GUI.

Example: MR rosetta rebuild of an MR solution using the sbatch command
1. Open a terminal
2. Start the phenix GUI to edit the parameter file:
module load proj/xray
module load phenix/1.11.1-2575
phenix
3. Save an MR rosetta parameter file from the phenix GUI (mr_rosetta_1.eff)
4. "diff mr_rosetta_1.eff mr_rosetta_2.eff" is useful when comparing parameter files
5. Edit a script for MR rosetta called mr_rosetta.script:
#!/bin/bash
#SBATCH -t 96:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load phenix/1.11.1-2575
phenix.mr_rosetta mr_rosetta_1.eff
6. sbatch mr_rosetta.script
and the MR rosetta job now runs at the compute nodes

Skip the MR step (use MRage instead) by ticking "Model is already placed" and save a parameter file for the sbatch command. The fragment files aat000_03_05.200_v1_3/aat000_09_05.200_v1_3 are generated by submitting your protein sequence to the Robetta server, as described on the phenix homepage.

Example script for phenix rosetta refine

1. Enable alpha-test programs and features

2. Open "Rosetta refinement (alpha)" wizard and note that "submit queue job" is not available. Instead save a parameter file called rosetta_refine_1.eff.

3. Edit rosetta_refine.script as:

#!/bin/bash
#SBATCH -t 96:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL

module load proj/xray
module load phenix/1.11.1-2575
phenix.rosetta_refine rosetta_refine_1.eff

4. Execute "sbatch rosetta_refine.script" to send your job to the compute node

 

Arcimboldo_lite

Arcimboldo_lite (1) is "ab initio" phasing software for high-resolution native datasets; the 4eto dataset was used for benchmarking (2).

  1. ARCIMBOLDO_LITE: single-workstation implementation and use.
    Sammito M, Millán C, Frieske D, Rodríguez-Freire E, Borges R, Usón I
    Acta Crystallogr. D Biol. Crystallogr. 2015 Sep;71(Pt 9):1921-30
  2. Macromolecular ab initio phasing enforcing secondary and tertiary structure.
    Millán C, Sammito M, Usón I
    IUCrJ 2015 Jan;2(Pt 1):95-105

The 4eto data files can be downloaded, and the job is executed via "sbatch arcimboldo.script":

#!/bin/bash
#SBATCH -t 6:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load ccp4/7.0-module
module load arcimboldo_lite/nov2016
job=4eto
ARCIMBOLDO_LITE ${job}.bor > ${job}.log

To run your own job, please modify the arcimboldo.script above and the 4eto.bor file according to the ARCIMBOLDO manual

################# 4eto.bor - start ##################

[CONNECTION]:
distribute_computing: multiprocessing

[GENERAL]
working_directory = /proj/xray/users/x_user/test_directory
mtz_path = ./4eto_2.mtz
hkl_path = ./4eto_4.hkl

[ARCIMBOLDO]
shelxe_line = -m30 -v0 -a3 -t20 -q -s0.55
helix_length = 14
sigf_label = SIGF
molecular_weight = 22000
name_job = 4eto_def
f_label = F
fragment_to_search = 2
number_of_component = 1
identity = 0.2

[LOCAL]
# Third party software paths
path_local_phaser: /proj/xray/software/ccp4_v7.0/destination/ccp4-7.0/bin/phaser
path_local_shelxe: /software/apps/shelx/expires-2017-01-01/bdist/shelxe

################# 4eto.bor - end ##################

DIALS and XDS at NSC Triolith

xia2, for autoprocessing of X-ray diffraction data with XDS and DIALS, is developed by Diamond Light Source and CCP4 and runs in the background when visiting a Diamond beamline.

xia2 is suitable for NSC Triolith using "sbatch xia2.script", here exemplified by the XDS -3dii option (upper) and the -dials option (lower):

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load xds/2016-05-01
module load ccp4/7.0-module
xia2 -3dii -failover \
image=/proj/xray/users/x_marmo/TEST/data/x001/TEST-x001_1_0001.cbf:1:900 \
mode=parallel njob=1 nproc=16 \
trust_beam_centre=True read_all_image_headers=False

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load xds/2016-05-01
module load ccp4/7.0-module
xia2 -dials -failover \
image=/proj/xray/users/x_marmo/TEST/data/x001/TEST-x001_1_0001.cbf:1:900 \
mode=parallel njob=1 nproc=16 \
trust_beam_centre=True read_all_image_headers=False

For more xia2 guidance, visit the xia2 homepage and the xia2 multi-crystal tutorial

XDS autoprocessing is also possible with xdsme (XDS Made Easy) and autoPROC from GlobalPhasing; fully automated scripts are presented below:

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load xdsme/20160406
xdsme --brute datafiles_*.cbf

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH --mail-type=ALL
module load proj/xray
module load autoPROC/20160501
process \
-Id id1,/proj/xray/users/x_marmo/targets/PROTA/data/x001,PROTA-x001_1_####.cbf,1,1800 \
-noANO -B \
-nthreads 8 \
-d proc_1 > proc_1.log

Finally, XDS can also be run from a graphical user interface at an interactive node (or development node), in the form of XDSGUI and XDSAPP, as sketched below.
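A minimal sketch of such an interactive session, assuming the GUI is launched with the command "xdsgui" (check the module documentation if the launch command differs):

interactive -N1 -t 0:59:00 --reservation=devel   # grab a development node
module load proj/xray
module load xds/2016-05-01
module load xdsgui/2016-05-20
xdsgui                                           # assumed launch command; XDSAPP is started analogously after loading its dependencies
exit                                             # release the node as soon as you are done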

Single processor software, CNS and MoRDa

Running single-processor software at NSC Triolith does not save any wall time per job; however, many refinement protocols in CNS and molecular replacement attempts in MoRDa can be fired off quickly.

CNS - Crystallography and NMR System

Crystallography and NMR System (CNS) comes with an online GUI for editing input files. At present the konqueror browser at NSC Triolith does not allow re-opening of saved CNS input files such as generate.inp or refine.inp; hopefully we can solve this issue soon. CNS sbatch scripts should allocate a single processor, i.e. "#SBATCH -n 1", as exemplified below:

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -n 1
#SBATCH --mail-type=ALL
module load proj/xray
module load cns/1.3
cns_solve < refine.inp > refine.log

MoRDa - automatic molecular replacement

MoRDa is a new program for automatic molecular replacement; "morda -h" lists its command-line options. A MoRDa sbatch script should allocate a single processor, i.e. "#SBATCH -n 1", as exemplified below:

#!/bin/bash
#SBATCH -t 10:00:00
#SBATCH -n 1
#SBATCH --mail-type=ALL
module load proj/xray
module load ccp4/7.0-module
module load morda/20160531
morda -s target.seq  -f data.mtz -alt

Software conflicts

 

In the .bashrc example above, the majority of the module commands are commented out (#) to avoid conflicts between software packages. For instance, if you want to run the CCP4 autoprocessing software xia2 but have previously loaded the phenix module, xia2 will simply not run, for reasons we have not identified. Sometimes, however, correct functionality requires both the ccp4 and phenix modules to be loaded; XDSAPP is such an example. The VASCo module has its own Python virtual environment that might conflict with other Python-dependent modules such as XDSAPP/PHENIX etc.

 

NSC diskspace

Every user has 20 GB of disk space in /home/x_user and the entire project has 2000 GB under /proj/xray/users/x_user. Data transfer to Windows machines is most easily done with WinSCP; for Linux machines, use scp and rsync as exemplified below.

rsync -rvplt ./data x_user@triolith-thinlinc.nsc.liu.se:/home/x_user
(transfers the directory "data" to the /home/x_user directory at Triolith)

scp x_user@triolith-thinlinc.nsc.liu.se:/home/x_user/data/xdsapp/XDS_ASCII.HKL ./
(transfers the single file XDS_ASCII.HKL from Triolith to the current directory)

iRODS is unstable

The enterprise edition of iRODS does not work as intended and data transfers are frequently interrupted. At the initial stage of this project the community edition of iRODS was used, with stable performance. The current enterprise edition of iRODS is not recommended.

Members of the pilot project can join the SweStore iRODS project to share another 10 TB of long-term data storage. Data placed in iRODS is safe, but cannot be used for running calculations at NSC or elsewhere. Data files in iRODS are organised in a database that allows very fast searches. Data is manipulated in iRODS using icommands. Using icommands is similar to using sftp, i.e. standard Linux commands, except for the database-specific ones - some are exemplified below:

icommand example                                  Consequence
iinit --ttl 72                                    Perform Yubikey login to iRODS for 72 hours (max time)
imkdir /snicZone/proj/psf/Diamond_150125          Create a new "common" iRODS directory
ipwd or ils                                       Like the standard pwd and ls commands
icd Diamond_140125                                Change iRODS directory
irsync -rv --link mx8492-36 i:mx8492-36           Transfer the mx8492-36 directory to iRODS - alt. 1
iput -rv mx8492-36 mx8492-36                      Transfer the mx8492-36 directory to iRODS - alt. 2
iquest "select sum(DATA_SIZE),count(DATA_NAME),RESC_NAME where COLL_NAME like '/snicZone/proj/psf/Diamond_140125/mx8492-36%'"
                                                  Check the amount of data transferred (like du -sh)
ilocate XDS_ASCII.HKL                             Find all XDS_ASCII.HKL files in your iRODS zone
imkdir /snicZone/proj/psf/Anna                    Create a personal directory Anna for user s_anna
ils -A /snicZone/proj/psf/Anna                    Check ownership and access rights of the Anna directory
ichmod -r own s_anna /snicZone/proj/psf/Anna      Make s_anna an owner of Anna
ichmod -r null s_admin /snicZone/proj/psf/Anna    Remove user s_admin's access to Anna
ichmod -r null psf /snicZone/proj/psf/Anna        Remove the psf group's access to Anna
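Put together, a typical upload session built from the commands above could look like this (the collection path and directory name are just the examples from the table):

iinit --ttl 72                             # Yubikey login, valid for 72 hours
imkdir /snicZone/proj/psf/Diamond_150125   # create a collection for the visit
icd /snicZone/proj/psf/Diamond_150125
iput -rv mx8492-36 mx8492-36               # upload the data directory
ils                                        # verify that the data arrived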

 

VASCo at NSC

1. module load vasco/1.0.2
2. Place a file.pdb in a fresh directory and execute:
vasco -in_dir ./ -filename file

The VASCo installation has its own Python virtualenv, so do not run other Python programs (phenix/xdsapp) in the same shell.

Links

Bioinformatics (Computational Biology) | Protein Crystallography | Structural Biology