Technical Information
Metagenomics Workflow
* Data sharing processes are removed for all NHS-external distributions.
Feature descriptions:
| Number | Field | Description |
|---|---|---|
| 1 | Load existing sample sheet | Load a pre-existing sample sheet. This will populate the fields below with the data from the TSV file. |
| 2 | Number of samples | The number of samples to be analysed. This will create the number of rows in the table below. |
| 3 | Experiment ID | An optional free-text string appended to the sample sheet filename to help identify runs. Not to be confused with the ONT experiment ID. |
| 4 | ONT experiment ID | The exact name matching the experiment name on MinKNOW entered by the user when initiating a sequencing run. This is populated automatically from the /data directory. |
| 5 | ONT sample ID | The exact name matching the Sample name on MinKNOW entered by the user when initiating a sequencing run. This is populated automatically from the /data/{experiment_id}/ directory. |
| 6 | ONT barcode | The ONT library index/barcode used. Green colour indicates the barcode directory has been validated. |
| 7 | Lab/Sample ID | The unique lab accession number for the sample. This data is encrypted before transmission. If repeating a sample, append with _n |
| 8 | Sample accession | The lab's sample ID - identifying a specific patient specimen (Anonymised). |
| 9 | Hospital number | A value identifying the individual providing the sample (Anonymised). |
| 10 | Collection date | The date the specimen was collected. For positive and negative controls, this would be the day of library preparation. |
| 11 | Sample Class | The category of the sample loaded. |
| 12 | Sample type | The type of specimen. |
| 13 | Operator | Identifier for user operating the sequencer. |
| 14 | Notes | An open field for notes that will appear on all reports. |
| 15 | Anonymise | Anonymises the 'Sample accession' and 'Hospital number' values using an encryption cypher. |
| 16 | Deanonymise | Deanonymises the 'Sample accession' and 'Hospital number' values present in the launcher fields to their original values. The deanonymisation tool can be used to access previous runs. |
| 17 | Generate Gourami sample sheet | Only for Q-line >= v1.1. Generates a Gourami-compatible sample sheet for starting a sequencing experiment. The output can be found in the ./NHS_RMg_platform/sample_Sheet/gourami directory. |
| 18 | Force overwrite | Checking this box moves results and reports for all timepoints matching the 'Lab/Sample ID' field in the launcher to the ./NHS_RMg_platform/recycle_bin directory and 'unlocks' all directories. If you have aborted a run, or the terminal is reporting failures, try using this feature. |
| 19 | mSCAPE prompt | After the sequencing and analysis run has completed, opens the mSCAPE uploader for user input. No data is uploaded without express per-sample authorisation. |
| 20 | Select timepoints | Select the timepoints you would like to be generated. If you encounter errors generating a timepoint, visit the FAQ section. |
| 21 | Refresh directories | This button refreshes the contents of the MinKNOW experiment ID and MinKNOW sample ID columns. Useful if you have started the launcher before commencing the sequencing experiment. |
| 22 | Launch pipeline | Launches metagenomics analysis, saving the sample sheet to the ./NHS_RMg_platform/sample_sheets. |
| 23 | Force legacy data ingest | Relaunches the interface using the legacy data ingest method. The ONT Experiment and Sample dropdowns will be populated directly from /data rather than from the Gourami SQL database. |
: Feature description table for metagenomics workflow
Logging
The outputs seen in the terminal while running the workflow are saved in the NHS_RMg_platform/logs directory with the date and time of the run. Opening this file in the terminal with a cat command preserves the colouration of the original outputs. When the file is read in a conventional text editor, the special characters that delimit colouration will be visible.
Additionally, the debugging data from Auto Query subprocesses are saved. See the 'File outputs' section for Auto Query tabulated data or further debugging.
A log containing a list of the FASTQ files ingested at each timepoint is available in the NHS_RMg_platform/results/{sample_id}/{timepoint}/logs directory.
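For reading a log in a text editor without the colour codes, the escape sequences can be stripped first. The following is a minimal sketch, assuming the colouration uses standard ANSI escape sequences; the function name is illustrative and not part of the workflow.

```python
import re

# Matches ANSI CSI sequences such as "\x1b[32m" (colour) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Return text with ANSI colour/control sequences removed."""
    return ANSI_CSI.sub("", text)


if __name__ == "__main__":
    coloured = "\x1b[32mPASS\x1b[0m barcode01 validated"
    print(strip_ansi(coloured))
```

The cleaned text can then be written to a new file, leaving the original log untouched for viewing with cat.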
File outputs
Analysis files can be found in the NHS_RMg_platform/results directory. Briefly, the folders contain intermediate outputs of the constituent workflow tools for each sample and timepoint.
- amr - TSV files from the abricate and scagaire AMR detection tools.
- centrifuge - Intermediate files for Centrifuge classification. read_assignments.tsv contains the per-read classification after threshold filtering.
- files - Contains the sample sheet matching the data input on launch, which can be used to repeat the analysis of a dataset. Also contains summarised data from Auto Query analysis, used by the Summary report to obtain Auto Query results.
- host - Lists of reads removed during human depletion.
- log - A list of files moved in the data acquisition step of the workflow.
- microbial - Human-scrubbed FASTQ file containing classified and unclassified read data. Suitable for sharing with third parties for analysis.
- qc - NanoStat outputs parsed for the report QC section.
Analysis thresholds
Three thresholds are applied throughout Metagenomics analysis. They appear here in the order of application:
- Centrifuge threshold: A scoring metric used to assess the quality of a match in the k-mer classification (Centrifuge) phase. Any reads falling below the threshold are not considered in further analyses and are not visible to the user. The count of reads failing this threshold is shown on the Metagenomics Report in the 'Quality Control' section under 'Below Threshold', not to be confused with the 'Unclassified' metric. This is currently set at 250 for virus taxa and 5000 for everything else. See the 'Variable thresholds' section below for configuration.
- Relative abundance: A taxon's relative abundance (read count relative to other non-viral taxa in the metagenome) must exceed this value to be included in the 'Organisms above threshold' section of the report. Taxa falling below this threshold are shown in the 'Centrifuge Full' section at the bottom of the report. This threshold is set at 1% for non-virus taxa. Viruses are not subject to an abundance threshold. This can be configured to operate as a read-count threshold. See 'Technical information -> Metagenomics Workflow -> Metagenomics config parameters' for info on how to set it.
- Auto Query support: See the 'Secondary (orthogonal) classification analysis -> Auto Query' section for more info on Auto Query operation and interpretation. The traffic light system in the Auto Query indicates green if > 50% of the top alignments in the query read subset match the primary k-mer classification step.
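The three thresholds above can be sketched as follows. This is an illustrative sketch only, not the workflow's code: the `Read` record and function names are hypothetical, while the numeric values (250, 5000, 1%, 50%) are the defaults quoted in the text.

```python
from collections import Counter
from typing import NamedTuple

VIRUS_SCORE = 250      # Centrifuge score threshold for virus taxa
DEFAULT_SCORE = 5000   # Centrifuge score threshold for everything else
ABUNDANCE_PCT = 1.0    # relative abundance threshold for non-viral taxa
AUTO_QUERY_PCT = 50.0  # Auto Query 'green light' threshold


class Read(NamedTuple):  # hypothetical record, not the workflow's data model
    taxon: str
    score: int
    is_virus: bool


def passes_score(read: Read) -> bool:
    """Reads scoring below the relevant threshold are dropped."""
    return read.score >= (VIRUS_SCORE if read.is_virus else DEFAULT_SCORE)


def above_threshold(reads):
    """Return (taxa reported above threshold, count of reads below the score threshold)."""
    kept = [r for r in reads if passes_score(r)]
    below = len(reads) - len(kept)
    counts = Counter(r.taxon for r in kept)
    viral = {r.taxon for r in kept if r.is_virus}
    non_viral_total = sum(n for taxon, n in counts.items() if taxon not in viral)
    reported = []
    for taxon, n in counts.items():
        if taxon in viral:
            reported.append(taxon)  # viruses have no abundance threshold
        elif 100.0 * n / non_viral_total > ABUNDANCE_PCT:
            reported.append(taxon)  # abundance relative to non-viral reads only
    return sorted(reported), below


def green_light(top_alignment_taxa, primary_taxon) -> bool:
    """True if more than 50% of top alignments agree with the primary classification."""
    if not top_alignment_taxa:
        return False
    matches = sum(1 for taxon in top_alignment_taxa if taxon == primary_taxon)
    return 100.0 * matches / len(top_alignment_taxa) > AUTO_QUERY_PCT
```

Note that a non-viral taxon at exactly 1% is not reported in this sketch, because the text states that the abundance must exceed the threshold.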
Threshold exceptions
A subset of taxa will always be shown 'above threshold' regardless of read count and relative abundance. These are user configurable (see 'Technical information -> Metagenomics Workflow -> Metagenomics config parameters').
Current exceptions are:
- Aspergillus spp.
- Candida spp.
- Chlamydia spp.
- Pneumocystis spp.
- Mycoplasmoides spp.
- Neosartorya spp.
- Nakaseomyces spp.
- Mycobacterium spp.
- Mycobacteroides spp.
- Any eukaryotic rRNA/mtDNA filter detection
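The config table describes these exceptions as partial match strings. A minimal sketch of such a check is shown below, assuming a case-insensitive substring match; the actual matching rules used by the workflow may differ, and the eukaryotic rRNA/mtDNA filter exception is not covered here.

```python
# Genus names taken from the list above; the function name is hypothetical.
EXCEPTIONS = [
    "Aspergillus", "Candida", "Chlamydia", "Pneumocystis", "Mycoplasmoides",
    "Neosartorya", "Nakaseomyces", "Mycobacterium", "Mycobacteroides",
]


def is_exception(taxon_name: str, exceptions=EXCEPTIONS) -> bool:
    """True if any exception string occurs within the taxon name (assumed case-insensitive)."""
    name = taxon_name.lower()
    return any(entry.lower() in name for entry in exceptions)
```

For example, `is_exception("Candida albicans")` is true, while `is_exception("Escherichia coli")` is false.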
Variable thresholds
The workflow loads a dynamic score threshold configuration file when applying the centrifuge threshold filter. This is located in the NHS_RMg_platform/db/ref/thresholds directory.
To add a new threshold, create a JSON file in the same format as the existing threshold configs. Provide a name, a score and a list of taxids.
ranking.txt specifies the order in which lists with overlapping taxids are read and applied. For example, viruses_10239.json contains all virus taxids and sets their threshold to 250. To set a new threshold for Enterovirus spp. only, overriding the 250 set in viruses_10239.json, create a new JSON file containing all Enterovirus taxids, then add its filename to ranking.txt above viruses_10239.json and above any other existing lists in which Enteroviruses might appear.
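The precedence rule described above can be sketched as follows. This is an assumption-laden illustration rather than the workflow's implementation: the JSON keys `name`, `score` and `taxids`, the one-filename-per-line layout of ranking.txt, and both function names are hypothetical, so check an existing threshold config for the real format.

```python
import json
from pathlib import Path


def build_score_table(threshold_dir):
    """Map each taxid to the score from the highest-ranked file that lists it."""
    threshold_dir = Path(threshold_dir)
    ranking = [
        line.strip()
        for line in (threshold_dir / "ranking.txt").read_text().splitlines()
        if line.strip()
    ]
    table = {}
    for filename in ranking:  # files listed earlier take precedence
        config = json.loads((threshold_dir / filename).read_text())
        for taxid in config["taxids"]:
            # setdefault keeps the score from the first (highest-ranked) file
            table.setdefault(int(taxid), config["score"])
    return table


def score_threshold(table, taxid, default=5000):
    """Fall back to the general Centrifuge threshold when no list covers the taxid."""
    return table.get(int(taxid), default)
```

With an Enterovirus file ranked above viruses_10239.json, an Enterovirus taxid resolves to the new score while every other virus taxid still resolves to 250.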
Force overwrite and recycle bin
The 'Force Overwrite' function on the Metagenomics Launcher moves any existing analysis data in the reports or results directories that matches the Lab/Sample ID entered in the launcher window (that is, the data to be overwritten) into NHS_RMg_platform/recycle_bin. This directory should be emptied periodically.
Query data outputs
Both Auto Query and Organism Query store intermediate analysis data for downstream usage. They are stored in the reports/{sample_id} directory. For Auto Query, this is in the automated_queries directory, with a subdirectory for each queried taxon. For manual Organism Queries, this will be in a directory named with the query organism with the date and time.
The Query subdirectories contain:
* A compressed FASTA file containing all query reads.
* A list of all read IDs matching the queried taxon.
* The portable Organism Report HTML file.
* An HTML file containing the BLAST plot only.
* A TSV file containing the raw BLAST alignments. The column headers are 'qseqid sseqid pident staxids sscinames length mismatch gapopen qstart qend sstart send evalue bitscore qseq qlen'. See this website for a full explainer.
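The raw alignment TSV can be read for downstream use with a short script. The sketch below assumes a tab-delimited file whose fields follow the column order listed above, and it skips a header row if one is present. The function names are illustrative, and the tally is a simple per-read best-hit count, not a reproduction of Auto Query's own support metric.

```python
import csv
from collections import Counter

BLAST_COLUMNS = (
    "qseqid sseqid pident staxids sscinames length mismatch gapopen "
    "qstart qend sstart send evalue bitscore qseq qlen"
).split()


def read_blast_tsv(path):
    """Return one dict per alignment, keyed by the column names above."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields or fields[0] == "qseqid":  # blank line or header row
                continue
            rows.append(dict(zip(BLAST_COLUMNS, fields)))
    return rows


def best_hit_species(rows):
    """Count the species name of the highest-bitscore alignment for each read."""
    best = {}
    for row in rows:
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return Counter(row["sscinames"] for row in best.values())
```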
Q-line raw data storage and 'Force legacy ingest'
Newer 'Gourami' Q-line (v1.1) devices do not store data in the conventional /data/{experiment}/{sample}/{metadata}/fastq_pass/{barcodeXX} structure. Instead, metadata is stored in an SQL database at /data/gourami/data.sqlite, and the raw sequencing data is stored in a less transparent structure at /data/output/sequences/no_group/no_sample, which makes it difficult to identify which datasets belong to which experiments. The workflow solves this by storing the linkage in the NHS_RMg_platform/results/{sample_id}/{timepoint}/files/sample_sheet.tsv file. The 'FullPath' column contains the path to the corresponding FASTQ dataset.
The workflow automatically detects 'Gourami' devices and will load metadata from the SQL database by default. To force the workflow to ingest data from the /data directory in the typical structure, use the 'Force legacy ingest' function on the main launcher.
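To recover which FASTQ dataset belongs to a sample, the 'FullPath' value can be read back from the stored sample sheet. This is a minimal sketch, assuming the TSV has a header row that includes a 'FullPath' column; the function name is hypothetical and the other columns are not relied upon.

```python
import csv


def fastq_paths(sample_sheet_path):
    """Return the non-empty 'FullPath' values from a sample sheet TSV."""
    with open(sample_sheet_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["FullPath"] for row in reader if row.get("FullPath")]
```

Pointing this at NHS_RMg_platform/results/{sample_id}/{timepoint}/files/sample_sheet.tsv would list the dataset paths recorded for that sample and timepoint.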
Eukaryote binning
To reduce occurrences of misclassified fungi, eukaryotic rRNA and mtDNA sequence data are processed separately and shown as Eukaryotic rRNA/mitochondrial material on the Metagenomics Report. These detections are elevated above threshold and subject to Auto Query analysis. Users should always inspect every read in the Organism Report generated by Auto Query when Eukaryotic rRNA/mitochondrial material is detected.
Warning
Changing values in the configuration file can lead to system instability and the publication of misleading results. NHS sites should not edit the configuration without consultation.
See the table below for explanations on the function of configuration fields:
Sample sheets - loading and structure
A sample sheet is generated every time the Metagenomics Launcher is run. They are stored in NHS_RMg_platform/sample_sheets with the time and date of initialisation. Users can append the filename with any string by using the Experiment ID (feature #3) on the launcher. This will make it easier to identify runs for audit or downstream application use.
Sample sheets are also copied to NHS_RMg_platform/results/{sample_id}/{timepoint}/files/sample_sheet.tsv for all samples and timepoints.
For repeat or failed runs, sample sheets can be loaded into the launcher, which will automatically populate the launcher with the preconfigured fields.
Metagenomics config parameters
| Parameter | Description |
|---|---|
| device | ONT sequencing device name (e.g., "GridION") - appears on reports and summaries |
| site | Site identifier for the installation (e.g., "GSTT") |
| timezone | Timezone setting for the system - must match between host and container (e.g., "UTC") |
| parallel_instances | Number of parallel workflow instances (fed to snakemake 'cores' variable) |
| mem_slots | Number of parallel samples for Centrifuge analysis - increase if more RAM available |
| blast_slots | Number of parallel samples for HTML report generation (per sample) |
| abundance_threshold | Minimum abundance percentage for reporting taxa (default: 1.0%) |
| count_threshold | Minimum read count for reporting taxa (this is superseded by abundance) |
| cfg_score | Centrifuge score threshold - can be superseded by variable threshold function (default: 5000) |
| metagenomics_version | Workflow version number - appears on reports and summaries |
| data_dir | MinKNOW output data directory to search in (default: "/data") |
| targets | Deprecated: Not implemented |
| parameters.hg38.index | Deprecated: Human genome (hg38) minimap2 index path |
| parameters.centrifuge.index.cmg | Centrifuge database index path |
| taxonomy.exceptions | List of organism (partial match strings) exempt from standard thresholds |
| taxonomy.viral_taxid | Deprecated: Path to viral taxonomy ID list |
| taxonomy.threshold_config | Directory containing organism-specific threshold configurations |
| taxonomy.skip_taxid | Deprecated: Path to taxonomy IDs to skip in analysis |
| taxonomy.replacement_list | CSV file for organism name replacements in reports |
| taxonomy.dictFile | ARGOS taxonomy dictionary file |
| taxonomy.taxdump | NCBI taxonomy dump directory |
| taxonomy.names | NCBI taxonomy names.dmp file |
| taxonomy.nodes | NCBI taxonomy nodes.dmp file |
| taxonomy.refseqDir | RefSeq genomes directory |
| taxonomy.speciesFileNames | File containing paths to species reference files |
| taxonomy.speciesTaxMeta | Combined assembly summary metadata file |
| amr.scagaire | Path to Scagaire AMR genes reference file |
| viral.targets | TSV file containing viral targets information |
| mlst.directory | Directory containing PubMLST database files |
| mlst.list | Not implemented: TSV file listing organisms with MLST schemes |
| pdf.html | HTML template file for PDF reports |
| pdf.css | CSS stylesheet for report styling |
| pdf.bootstrap | Bootstrap CSS file for report styling |
| pdf.version | PDF template version (use "null" if not applicable) |
| blast_read_count | Number of reads to use for BLAST Auto Query analysis (default: 25) |
| blast.db | Path to BLAST database for Auto Query |
| blast.threshold | Auto Query: Score threshold for BLAST hit inclusion (default: 500) |
| blast.blast_instances | Auto Query: Number of parallel Organism Query processes per report instance |
| blast.blast_threads | Auto Query: Number of threads per BLAST instance |
| blast.html | Auto Query: HTML template for Auto Query reports |
| blast.inclusion_threshold | Auto Query: Disabling this will result in all non-viral taxa being analysed by Auto Query, regardless of threshold. |
| blast.quiet | Auto Query: Suppress verbose BLAST output (default: True) |
| blast.max_target_seqs | Auto Query: Maximum number of aligned sequences to keep |
| blast.word_size | Auto Query: Word size for BLAST alignment (larger = faster but less sensitive) |
| blast.perc_identity | Auto Query: Minimum percent identity for BLAST hits (default: 70.0%) |
| blast.dust | Auto Query: Enable low-complexity region filtering ("yes"/"no") |
| blast.culling_limit | Auto Query: Delete hits that are enveloped by at least this many higher-scoring hits |
| blast.max_hsps | Auto Query: Maximum number of HSPs (alignments) per subject sequence |
| blast.evalue | Auto Query: Expectation value threshold for reporting alignments (default: 1e-5) |
| sylph.min_ani | Not implemented: Minimum Average Nucleotide Identity for Sylph matches (default: 90%) |
| sylph.index1 | Not implemented: Path to primary Sylph database index (GTDB) |
| sylph.index2 | Not implemented: Path to secondary Sylph database index (IMG/VR) |
| sylph.tax1 | Not implemented: Taxonomy metadata for primary Sylph database |
| sylph.tax2 | Not implemented: Taxonomy metadata for secondary Sylph database |
| force_anonymisation | Require anonymisation before launching analysis (default: false) |
| SQL_DB_PATH | Path to Gourami SQLite database |
| SYS_CONF_PATH | Path to Gourami system configuration file |
| OLD_MINKNOW_DATA_DIR | Legacy MinKNOW data directory path |
| version_strings | Term to search for inside of the target indicator file |
| app_icon_path | Path to application icon image |
| header_image_path | Path to header logo image for launcher |
| instructions_link_text | Text for instructions hyperlink |
| instructions_url | URL for workflow instructions/documentation |
| columns[].name | Display name of the column in the launcher UI |
| columns[].output_column_name | Column header name in the output sample sheet |
| columns[].autofill | Enable automatic filling with default/previous values |
| columns[].check | Enable validation checking for this field |
| columns[].width | Column width in the UI |
| columns[].check_func | Name of validation function to apply (e.g., "barcode_path_check") |
| columns[].dropdown | Enable dropdown selection for this field |
| columns[].values | List of allowed values for dropdown (or empty array/null if not applicable) |
| columns[].tooltip | Help text displayed on hover |
| columns[].anonymise | Mark field for anonymisation |
| columns[].dir_exists_check.enabled | Enable directory existence checking |
| columns[].dir_exists_check.prefix | Directory path prefix to check |
| columns[].dir_exists_check.should_exist | Whether directory should exist (true) or not exist (false) |
| columns[].dir_exists_check.fail_tooltip | Message shown if directory check fails |
| move | Deprecated: Enable moving of files (legacy function, default: true) |
| samples | Deprecated: Default sample table filename (default: "sample_table.tsv") |
| viral_score | Deprecated: Legacy viral score threshold (default: 100) |
: Config YAML fields
Taxon name replacement function
We provide a pre-configured list for adding a 'common name' to a taxon, or for replacing how a taxon is represented on the report altogether. This is especially useful where the globally accepted taxonomy differs from the names used in common practice, as with the recent latinisation of the virus taxonomy.
Modified taxa will appear in the format {old scientific name} - {acronyms} - {common name} {(Accepted scientific name)}
| Internationally accepted taxonomy | Replacement |
|---|---|
| Lymphocryptovirus humangamma4 | Human herpesvirus 4 - HHV-4 - Epstein–Barr virus (Lymphocryptovirus humangamma4) |
| Nakaseomyces glabratus | Candida glabrata (Nakaseomyces glabratus) |
| Alphainfluenzavirus influenzae | Influenza A virus (Alphainfluenzavirus influenzae) |
: Taxon name replacement examples
The configuration file for this function can be found at:
NHS_RMg_platform/db/ref/reporting_name_replacement_list.csv
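The replacement behaves like a lookup from accepted name to display name. The sketch below illustrates the idea only: it assumes a two-column CSV (accepted name, replacement) with no header and exact-name matching, neither of which is confirmed here, so inspect the shipped CSV before relying on this layout. The function names are hypothetical.

```python
import csv
import io


def load_replacements(handle):
    """Build an accepted-name -> display-name mapping from two-column CSV rows."""
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def display_name(taxon, replacements):
    """Return the replacement if one is configured, otherwise the taxon unchanged."""
    return replacements.get(taxon, taxon)


if __name__ == "__main__":
    # Rows taken from the examples table above.
    example = io.StringIO(
        "Nakaseomyces glabratus,Candida glabrata (Nakaseomyces glabratus)\n"
        "Alphainfluenzavirus influenzae,Influenza A virus (Alphainfluenzavirus influenzae)\n"
    )
    table = load_replacements(example)
    print(display_name("Nakaseomyces glabratus", table))
    print(display_name("Escherichia coli", table))
```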
Summary report
Description of features
See the diagram below and the associated table for explanations on Summary Report features.
| Number | Feature | Description |
|---|---|---|
| 1 | Results directory | Path selection for the Metagenomics Workflow's results directory. Useful if users move/organise outputs for archiving.* |
| 2 | Output file | The output file path for Summary Report. |
| 3 | Sample search bar | Enter key phrases to subset available samples. |
| 4 | Available Samples panel | Populated automatically from the provided results directory. |
| 5 | Sample selection controls | Move selected or all samples to and from the Selected Samples panel. Clear and refresh to start again. |
| 6 | Selected Samples panel | A list of the samples included for summary |
| 7 | Sample paste bin | Paste a list (newline delimited) of sample names from an external source to quickly load on to Selected Samples.** |
| 8 | Relative abundance | Set a threshold of relative abundance for the non-viral taxa. Taxa falling below will be excluded entirely from the report. |
| 9 | Format output | Change the default XLSX output to CSV. Change the newline delimitation of nested lists to ';' for easier parsing. |
| 10 | Generate summary | Run the script |
| 11 | Status indicator | Data on available samples and selections |
: Summary report diagram legend
\* The path selector cannot see outside the directories mounted in the container (default: the NHS RMg platform SSD). Modify the launch script to mount additional host directories.

\*\* No path checking is performed on pasted samples. They will be excluded from the Summary Report if missing. See the terminal output for a list of missing samples.
Description of outputs
| Column Name | Explanation |
|---|---|
| Sample | LabID provided on launching the Metagenomics Workflow |
| Experiment | The exact name matching the experiment name on MinKNOW entered by the user when initiating a sequencing run. This is populated automatically from the /data directory. |
| SampleID | The exact name matching the Sample name on MinKNOW entered by the user when initiating a sequencing run. This is populated automatically from the /data/{experiment_id}/ directory. |
| Barcode | The ONT library index/barcode used. Green colour indicates the barcode directory has been validated. |
| LabID | LabID provided on launching the Metagenomics Workflow |
| biosample_id | The Sample Accession (if provided). May also be an anonymised study number derived from the Sample Accession. |
| biosample_source_id | The Hospital number provided to the Metagenomics Launcher. May also be an anonymised study number derived from the Hospital Number. |
| Collection Date | Provided to the Metagenomics Launcher |
| SampleClass | Specimen, positive control, standard etc. Provided to the launcher. |
| SampleType | Sampling site: BAL, SPT, NDL, ETT, NPA, PFL etc. |
| Operator | Operator initials. Sourced from Launcher. |
| Notes | Additional notes. Sourced from Launcher, shown on reports. |
| RunID | Identifier assigned to the experiment by the sequencing device. Derived from FASTQ. |
| Flow_Cell_ID | Flow cell ID - derived from FASTQ |
| Total reads X hrs | Total reads - pre-human scrubbing |
| Human reads X hrs | Human reads removed |
| Human reads (%) X hrs | Proportion of total reads identified as human and removed |
| Total classified reads X hrs | Total reads post-human scrubbing |
| Sequencing N50 (bp) X hrs | Post human-scrubbing (microbial reads) read length metric |
| Proportion >Q15 quality (%) X hrs | Proportion of microbial reads with a PHRED score >15 |
| Median read quality (PHRED score) N | Median PHRED score for microbial reads |
| Total bases (bp) X hrs | Total bases sequenced including human |
| Organisms (excluding viruses) X hrs | A list of organisms identified, except for viruses |
| Organisms (excluding viruses) read counts X hrs | Counts of reads for each non-viral taxon identified |
| Organism (excluding viruses) percentage abundance X hrs | Percentage abundance of each non-viral taxon identified |
| Viral organisms X hrs | Virus taxa list |
| Viral read counts X hrs | Counts of viruses identified |
| Auto Query top taxon X hrs | Auto Query's most supported taxon. Note that by default, taxa below threshold are not subject to Auto Query and will therefore be shown as 'missing' here. See the 'configuration' section for more info. |
| Auto Query top percent X hrs | The percentage of top alignments supporting the top taxon |
| Auto Query 2nd taxon X hrs | Auto Query's second most supported taxon |
| Auto Query 2nd percent X hrs | The percentage of top alignments supporting the second taxon |
| AvgLength 0.5 hrs | Average length of Auto Query alignments for the top hit |
| AvgPID 0.5 hrs | Average percent identity of Auto Query alignments for the top hit |
| IsMatched50 0.5 hrs | Whether a 'green light' would be shown on the report |
: Summary report output fields
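When parsing the Summary Report downstream, cells holding nested lists need to be split on either a newline or ';', depending on the 'Format output' option. The sketch below shows one way to do that. It assumes that the organism list and the read-count list in the same row are in matching order and that counts are plain integers; neither is guaranteed by the table above, and the function names are hypothetical.

```python
def split_nested(cell: str):
    """Split a nested-list cell delimited by ';' or by newlines."""
    separator = ";" if ";" in cell else "\n"
    return [item.strip() for item in cell.split(separator) if item.strip()]


def pair_counts(organisms_cell: str, counts_cell: str):
    """Pair each organism with its read count (assumes matching order)."""
    names = split_nested(organisms_cell)
    counts = [int(value) for value in split_nested(counts_cell)]
    if len(names) != len(counts):
        raise ValueError("organism and count lists differ in length")
    return dict(zip(names, counts))
```

The length check is deliberate: if the two cells ever disagree, failing loudly is safer than silently pairing names with the wrong counts.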
Network hub app
The platform has a cached version of the Network Hub website for users who need access without an internet connection. Double click on the desktop icon and connectivity should be determined automatically.
Customising workflow outputs
The logo on output reports is embedded as base64 in the HTML template, which ensures portability. For custom logos, resize an image appropriately, then encode it using the tool linked below. Insert the encoded image into the HTML templates at the paths given below:
- NHS_RMg_platform/db/ref/Template/report_template.html - PDF reports
- NHS_RMg_platform/db/ref/Template/report_template_html.html - HTML Metagenomics reports
- NHS_RMg_platform/db/ref/Template/organism_query/report_template.html - Organism Query reports
https://codebeautify.org/image-to-base64-converter
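As an offline alternative to the online converter, the encoding can be done locally. This is a minimal sketch using only the Python standard library. The function name is hypothetical, and exactly where the resulting string goes in each template is not specified here, so compare it against the existing embedded logo before editing.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(image_path):
    """Encode an image file as a base64 data URI suitable for embedding in HTML."""
    path = Path(image_path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Encoding locally also avoids uploading a site logo to a third-party website.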


