[UGENE-6099] Add "Assemble Transcripts with StringTie" workflow element

Details

Type: New Feature
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: virogenesis
Fix Version/s: 1.31
Component/s: NGS, Workflow
Labels:
- transcriptomics

Story Points:
1
Epic Link:
VEME-2018
Sprint:
DEV-31-1, DEV-31-2
Affect Type:
Userdefined

Description

Element name and description

Name of the element: "Assemble Transcripts with StringTie".
Description of the element on the Scene: "Uses a BAM file with RNA-Seq read mappings to assemble transcripts.".
Description of the element in the Property Editor:
"StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus."

Input data

There is one input port.

Item	Value
Port name in GUI	Input BAM file(s)
Port description	URL(s) to sorted BAM file(s) with RNA-Seq read mappings. Note that every spliced read alignment (i.e. an alignment across at least one junction) in the input file must contain the tag XS to indicate the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and HISAT2 (when run with --dta option) already include this tag, but if you use a different read mapper you should check that this XS tag is included for spliced alignments.
Port ID in UWL	in
Number of slots	1
Slot #1 name in GUI	Input URL
Slot #1 ID in UWL	url
Slot #1 data type	String
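
The XS requirement in the port description can be checked before a BAM file is passed to the element. The following is a minimal sketch, not part of UGENE: it assumes the alignments are available as SAM text lines (for example, from `samtools view`), treats an alignment as spliced when its CIGAR string contains an `N` operation, and looks for the strand tag in the `XS:A:` form. The function name `spliced_reads_missing_xs` is hypothetical.

```python
def spliced_reads_missing_xs(sam_lines):
    """Return names of spliced alignments that carry no XS strand tag.

    Assumption: input is an iterable of SAM text lines, not a binary BAM.
    """
    missing = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, cigar, tags = fields[0], fields[5], fields[11:]
        # 'N' in the CIGAR marks a skipped reference region, i.e. a junction.
        is_spliced = "N" in cigar
        has_xs = any(tag.startswith("XS:A:") for tag in tags)
        if is_spliced and not has_xs:
            missing.append(qname)
    return missing
```

An empty result suggests that every spliced alignment in the checked lines carries the tag; any returned read names point to alignments that may need to be re-mapped or re-tagged.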

Output data

There is one output port.

Item	Value
Port name in GUI	StringTie output data
Port description	For each input BAM file, the port outputs a URL to a GTF file with assembled transcripts produced by StringTie. If "Enable gene abundance output" is "True", the port also outputs a URL to a text file with gene abundances (in a tab-delimited format).
Port ID in UWL	out
Number of slots	1 or 2, depending on the value of "Enable gene abundance output"
Slot #1 name in GUI	Output URL Transcripts
Slot #1 ID in UWL	url-transcripts
Slot #1 data type	String
Slot #2 name in GUI	Output URL Gene Abundance
Slot #2 ID in UWL	url-gene-abund
Slot #2 data type	String
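
The slot behaviour above, combined with the "Auto" file names from the Parameters table, can be sketched as follows. This is only an illustration: the function name `stringtie_output_slots` is hypothetical, and the sketch returns bare file names because the issue does not state which folder the "Auto" files are written to.

```python
from pathlib import PurePosixPath


def stringtie_output_slots(bam_url, gene_abundance_enabled=False):
    """Map output slot IDs to the "Auto" output file names for one BAM file."""
    # "sample.bam" -> "sample"
    stem = PurePosixPath(bam_url).stem
    # The transcripts slot is always present.
    slots = {"url-transcripts": f"{stem}_transcripts.gtf"}
    # The second slot exists only when gene abundance output is enabled.
    if gene_abundance_enabled:
        slots["url-gene-abund"] = f"{stem}_gene_abund.tab"
    return slots
```

For "sample.bam" with gene abundance output disabled, the sketch yields a single slot, `url-transcripts`, with the value "sample_transcripts.gtf".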

Parameters

#	Parameter	Description	Value in GUI	Default value
1	Reference annotations	Use the reference annotation file (in GTF or GFF3 format) to guide the assembly process (-G). The output will include expressed reference transcripts as well as any novel transcripts that are assembled.	A line edit with the browse button nearby that opens a file browse dialog.	There is no default value.
2	Reads orientation	Select the NGS library type: unstranded, stranded fr-secondstrand (--fr), or stranded fr-firststrand (--rf).	A combo box with values: "Unstranded", "Forward (FR)", "Reverse (RF)".	"Unstranded"
3	Label	Use the specified string as the prefix for the name of the output transcripts (-l).	A line edit.	"STRG"
4	Min isoform fraction	Specify the minimum isoform abundance of the predicted transcripts as a fraction of the most abundant transcript assembled at a given locus (-f). Lower abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts.	A spin box with values from 0.0 to 1.0.	0.1
5	Min assembled transcript length	Specify the minimum length for the predicted transcripts (-m).	A spin box with INT values >=30.	200
6	Min anchor length for junctions	Junctions that don't have spliced reads that align across them with at least this number of bases on both sides are filtered out (-a).	For now make it a spin box with INT values >= 0.	10
7	Min junction coverage	There should be at least this many spliced reads that align across a junction (-j). This number can be fractional, since some reads align in more than one place. A read that aligns in n places will contribute 1/n to the junction coverage.	For now make it a spin box with FLOAT values >= 0.	1.0
8	Trim transcripts based on coverage	By default StringTie adjusts the predicted transcript's start and/or stop coordinates based on sudden drops in coverage of the assembled transcript. Set this parameter to "False" to disable the trimming at the ends of the assembled transcripts (-t).	A combo box with values "True" and "False".	"True"
9	Min coverage for assembled transcripts	Specify the minimum read coverage allowed for the predicted transcripts (-c). A transcript with a lower coverage than this value is not shown in the output. This number can be fractional, since some reads align in more than one place. A read that aligns in n places will contribute 1/n to the coverage.	Make it a spin box with FLOAT values >= 0.001.	2.5
10	Min locus gap separation	Reads that are mapped closer than this distance are merged together in the same processing bundle (-g).	A spin box, INT>=0. "bp" is written nearby.	"50 bp"
11	Fraction covered by multi-hit reads	Specify the maximum fraction of multiple-location-mapped reads that are allowed to be present at a given locus (-M). A read that aligns in n places will contribute 1/n to the coverage.	For now make it a spin box with FLOAT values >= 0.	0.95
12	Skip assembling for sequences	Ignore all read alignments (and thus do not attempt to perform transcript assembly) on the specified reference sequences (-x). The value can be a single reference sequence name (e.g. "chrM") or a comma-delimited list of sequence names (e.g. "chrM,chrX,chrY"). This can speed up StringTie, especially when excluding the mitochondrial genome, whose genes may have very high coverage in some cases, even though they may be of no interest for a particular RNA-Seq analysis. The reference sequence names are case sensitive: they must exactly match the names of chromosomes/contigs of the target genome against which the RNA-Seq reads were aligned in the first place.	A line edit. When the value is empty, the parameter is not specified in the command.	By default, the value is not specified.
13	Abundance for reference transcripts only	Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts (-e). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example. The parameter is only available if the "Reference annotations" file is specified. It is recommended to use it when Ballgown table files are produced.	Hide the parameter if "Reference annotations" is not set. The value should be a combo box with values "True" and "False".	"False"
14	Multi-mapping correction	Enables or disables (-u) multi-mapping correction.	A combo box with values "Enabled" and "Disabled". Note: the parameter is not described in the StringTie documentation, but it is present in the command line.	"Enabled"
15	Verbose log	Enable detailed logging, if required (-v). The messages will be written to the UGENE log (enabling of "DETAILS" and "TRACE" logging may be required) and to the dashboard.	A combo box with values "True" and "False".	"False"
16	Number of threads	Specify the number of processing threads (CPUs) to use for transcript assembly (-p).	A spin box with values from 1 to the number of available cores.	Use the value from the Application Settings (the "Optimize for CPU count" setting).
17	Output transcripts file	StringTie's primary output GTF file with assembled transcripts.	A line edit with the browse button.	Auto (this equals "input file name_transcripts.gtf", for example, for "sample.bam" it will be "sample_transcripts.gtf").
18	Enable gene abundance output	Select "True" to generate gene abundances output (-A). The output is written to a tab-delimited text file. Also, the file URL is passed to an output slot of the workflow element.	A combo box with values "True" and "False".	"False"
19	Output gene abundances file	Specify the name of the output file with gene abundances (-A).	Hide the parameter if "Enable gene abundance output" is "False". Otherwise, it should be a line edit with the browse button.	Auto (this equals "input file name_gene_abund.tab", for example, "sample_gene_abund.tab").
20	Enable covered reference transcripts output	Select "True" to generate a file with reference transcripts that are fully covered by reads (-C). The parameter is only available if the "Reference annotations" file is specified.	Hide the parameter if "Reference annotations" is not set. Otherwise, it should be a combo box with values "True" and "False".	"False"
21	Output covered reference transcripts file	Specify the name of the output file with reference transcripts that are fully covered by reads (-C).	Hide the parameter if "Enable covered reference transcripts output" is "False". Otherwise, it should be a line edit with the browse button.	Auto (this equals "input file name_cov_refs.gtf", e.g. "sample_cov_refs.gtf").
22	Enable output for Ballgown	Select "True" to generate table files (*.ctab) that can be used as input to Ballgown (-b). The files contain coverage data for the reference transcripts. The parameter is only available if the "Reference annotations" file is specified. It is also recommended to set "Abundance for reference transcripts only" to "True".	Hide the parameter if "Reference annotations" is not set. Otherwise, it should be a combo box with values "True" and "False".	"False"
23	Output folder for Ballgown	Specify a folder for table files (*.ctab) that can be used as input to Ballgown.	A line edit with the browse button. Using the browse dialog one should be able to select a folder, not a file.	Auto (this equals the "ballgown_input" folder in the workflow output folder).
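
The mapping from the parameters above to a StringTie command line could look roughly like the sketch below. It is an illustration under stated assumptions, not the UGENE implementation: the function name `build_stringtie_args` and all dictionary keys are hypothetical, the thread count falls back to 1 instead of reading the Application Settings, and the `-o` flag for the output transcripts file comes from StringTie's usage rather than from this issue.

```python
def build_stringtie_args(bam, params):
    """Build a StringTie argument list from a dict of element parameters.

    All dictionary keys are hypothetical names chosen for this sketch.
    """
    p = params
    args = ["stringtie", bam]

    ref = p.get("reference_annotations")
    if ref:
        args += ["-G", ref]

    # "Unstranded" adds no flag.
    orientation = p.get("reads_orientation", "Unstranded")
    if orientation == "Forward (FR)":
        args.append("--fr")
    elif orientation == "Reverse (RF)":
        args.append("--rf")

    # Value parameters, with the defaults from the table.
    args += ["-l", p.get("label", "STRG")]
    args += ["-f", str(p.get("min_isoform_fraction", 0.1))]
    args += ["-m", str(p.get("min_transcript_length", 200))]
    args += ["-a", str(p.get("min_anchor_length", 10))]
    args += ["-j", str(p.get("min_junction_coverage", 1.0))]
    args += ["-c", str(p.get("min_coverage", 2.5))]
    args += ["-g", str(p.get("min_locus_gap", 50))]
    args += ["-M", str(p.get("multi_hit_fraction", 0.95))]

    # -t and -u switch a default behaviour off, so they are added on "False"/"Disabled".
    if not p.get("trim_transcripts", True):
        args.append("-t")
    if not p.get("multi_mapping_correction", True):
        args.append("-u")

    skip = p.get("skip_sequences", "")
    if skip:
        args += ["-x", skip]
    if p.get("verbose", False):
        args.append("-v")

    # Assumption: 1 thread stands in for the Application Settings value.
    args += ["-p", str(p.get("threads", 1))]
    args += ["-o", p["output_transcripts"]]

    if p.get("gene_abundance", False):
        args += ["-A", p["gene_abundance_file"]]

    # Options that are only available with a reference annotation file.
    if ref and p.get("ref_only", False):
        args.append("-e")
    if ref and p.get("covered_refs", False):
        args += ["-C", p["covered_refs_file"]]
    if ref and p.get("ballgown", False):
        args += ["-b", p["ballgown_dir"]]
    return args
```

Returning a list rather than a single string keeps file paths with spaces intact when the list is handed to a process launcher.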

Attachments

Issue Links

relates to

UGENE-6098 Integrate StringTie as an external tool

Closed
