Summary report generator
Purpose
We have found it useful to be able to generate summaries of multiple runs for downstream analysis. This takes the form of a spreadsheet or CSV, with rows corresponding to each sample and columns containing information derived from the sample sheet and the metagenomic analyses across timepoints. The tool features an end-to-end GUI to select samples and build the sheet.
Important note
We ask that for the NHS RMg service evaluation, you set the Summary Report threshold cut-off to the lowest value (currently 0.1) so that all classifications are included.
A description of the fields featured in the spreadsheet is available in the Output fields section below.
Instructions for use
Check out the video at the bottom of the page for an end-to-end demonstration.
- Double click the Launch Summary Report icon on the desktop. The window should appear with a loading bar. Please wait until all of the samples are sourced and loaded.
Note
In the above example, we have a list of samples, some of which have red indicators at specific time points. In both instances, it is likely that the reports were not generated because there were no reads present in the dataset. This is especially common in NTC samples, given that few reads should be detected in these samples.
- The program reads all Metagenomics Workflow runs from the 'results' folder and populates a list. Each entry has a checkbox to include or exclude the sample from the report, and coloured boxes indicating whether each time point is present in the dataset. Select the samples for reporting using one of the two methods below.
a. Select samples using the check boxes on the interface.
b. Produce a simple list (newline delimited) of sample names, matching exactly (case sensitive) the Sample ID/Accession number used in the Metagenomics run. Save the list in the 'Sample Sheet' directory on the Metagenomics SSD. Select the 'Load list of sample names' button.
- Choose whether you'd like to export as a spreadsheet (xlsx) or a CSV using the checkboxes at the bottom of the interface.
Note
Both report formats (xlsx/CSV) contain lists of taxa in single cells. In the xlsx, the list items are delimited by a newline (\n). In the CSV, the newline is swapped for a semicolon (';') to avoid parsing errors.
- Specify the output location by clicking the 'Browse' button in the 'Output File' section at the top of the interface. Fill out the 'Save As' prompt. Please be sure to add the '.xlsx' or '.csv' file extension to the filename if it is not added automatically.
- Specify the 'Relative Abundance Threshold'. The default is 1.0%, which means that no organism below 1% relative abundance will feature in the output report.
Important note
We ask that for the NHS RMg service evaluation, you set the Summary Report threshold cut-off to the lowest value (currently 0.1) so that all classifications are included.
- Click 'Generate Summary'. The output will appear in the 'summary_report' directory.
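If you use option b in the sample selection step, the list of sample names can be produced with a short script rather than by hand. The following is a minimal sketch only: the function name `write_sample_list`, the example sample IDs and the file name are hypothetical, and the tool itself may be stricter or more lenient about blank lines and duplicates than assumed here. The sketch keeps case exactly as given, because names must match the Sample ID/Accession number used in the Metagenomics run.

```python
from pathlib import Path


def write_sample_list(sample_ids, destination):
    """Write one sample name per line (newline delimited).

    Case is preserved because matching is case sensitive. Surrounding
    whitespace, blank entries and exact duplicates are dropped; this
    clean-up is an assumption of the sketch, not a documented
    requirement of the tool.
    """
    seen = set()
    cleaned = []
    for sample_id in sample_ids:
        name = sample_id.strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    Path(destination).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Example call (hypothetical IDs and file name):
# write_sample_list(["SAMPLE_001", "SAMPLE_002", "NTC_01"], "sample_names.txt")
```

Save the resulting file in the 'Sample Sheet' directory on the Metagenomics SSD before selecting the 'Load list of sample names' button.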
Video tutorial - Manually selecting samples
Video tutorial - Providing a list of sample names
Output fields
Column | Description |
---|---|
Sample | The Lab/sample ID. |
Experiment | The specific MinKNOW experiment. |
SampleID | The specific MinKNOW sample. |
Barcode | The barcode used in the library. |
AnonymisedIdentifier | A de-identified hospital number. |
CollectionDate | The date on which the sample was collected. |
SampleClass | The classification of the sample: PC, NC, NTC, or specimen. |
SampleType | The sample site, e.g. BAL, SPT, NPS. |
Operator | The name or identifier of the individual who processed or sequenced the sample. |
Notes | Any additional notes or remarks related to the sample or its processing. |
Total reads XX hrs | The total number of sequencing reads generated after XX hours of sequencing. |
Human reads XX hrs | The number of reads identified as being from human DNA after XX hours of sequencing. |
Human reads (%) XX hrs | The percentage of total reads that are identified as human DNA after XX hours. |
Total classified reads XX hrs | The number of reads that have been classified (assigned to an organism or category) after XX hours. |
Sequencing N50 (bp) XX hrs | The N50 statistic for the reads generated after XX hours, indicating read length distribution. |
Proportion >Q15 quality (%) XX hrs | The percentage of reads with a quality score greater than Q15 after XX hours. |
Median read quality (PHRED score) XX hrs | The median PHRED quality score of the reads after XX hours, indicating overall data quality. |
Total bases (bp) XX hrs | The total number of base pairs generated by the sequencing run after XX hours. |
Organisms (excluding viruses) XX hrs | The list of organisms (excluding viruses) identified from the reads after XX hours. |
Organisms (excluding viruses) read counts XX hrs | The read counts associated with organisms (excluding viruses) after XX hours. |
Organism (excluding viruses) percentage abundance XX hrs | The percentage abundance of each organism (excluding viruses) in the sample after XX hours. |
Viral organisms XX hrs | The list of viral organisms identified from the reads after XX hours. |
Viral read counts XX hrs | The read counts associated with viral organisms after XX hours. |
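
For downstream analysis, the multi-value cells in the CSV export need to be split on the semicolon delimiter before use. The sketch below shows one way to do this with Python's standard `csv` module. It rests on several assumptions that are not confirmed by this page: that `XX` in the column names is replaced by a number such as `2`, that the organism, read count and percentage abundance cells list their items in the same order, and that the values are plain numbers. The helper names and the example data are made up for illustration.

```python
import csv
import io


def split_cell(cell, delimiter=";"):
    """Split a multi-value cell into a list of stripped, non-empty items."""
    return [item.strip() for item in cell.split(delimiter) if item.strip()]


def organisms_above(row, hours, threshold_pct):
    """Return (organism, read count, % abundance) tuples at or above threshold_pct.

    Assumes the three columns hold parallel lists in the same order.
    """
    names = split_cell(row[f"Organisms (excluding viruses) {hours} hrs"])
    counts = split_cell(row[f"Organisms (excluding viruses) read counts {hours} hrs"])
    percents = split_cell(
        row[f"Organism (excluding viruses) percentage abundance {hours} hrs"]
    )
    return [
        (name, int(count), float(pct))
        for name, count, pct in zip(names, counts, percents)
        if float(pct) >= threshold_pct
    ]


# Build a tiny in-memory CSV with made-up data so the sketch is self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([
    "Sample",
    "Organisms (excluding viruses) 2 hrs",
    "Organisms (excluding viruses) read counts 2 hrs",
    "Organism (excluding viruses) percentage abundance 2 hrs",
])
writer.writerow(["SAMPLE_001", "Organism A;Organism B", "950;5", "95.0;0.5"])

rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(organisms_above(rows[0], hours=2, threshold_pct=1.0))
# prints [('Organism A', 950, 95.0)]
```

For an xlsx export read through a spreadsheet library, the same `split_cell` helper could be called with `delimiter="\n"`, since that format separates list items with a newline.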