  1. UGENE
  2. UGENE-6099

Add "Assemble Transcripts with StringTie" workflow element

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: virogenesis
    • Fix Version/s: 1.31
    • Component/s: NGS, Workflow
    • Labels:
    • Story Points: 1
    • Epic Link:
    • Sprint: DEV-31-1, DEV-31-2
    • Affect Type: Userdefined

      Description

      Element name and description

      • Name of the element: "Assemble Transcripts with StringTie".
      • Description of the element on the Scene: "Uses a BAM file with RNA-Seq read mappings to assemble transcripts.".
      • Description of the element in the Property Editor:
        "StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus."

      Input data

      There is one input port.

      Item Value
      Port name in GUI Input BAM file(s)
      Port description URL(s) to sorted BAM file(s) with RNA-Seq read mappings.
      Note that every spliced read alignment (i.e. an alignment across at least one junction) in the input file must contain the tag XS to indicate the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and HISAT2 (when run with --dta option) already include this tag, but if you use a different read mapper you should check that this XS tag is included for spliced alignments.
      Port ID in UWL in
      Number of slots 1
      Slot #1 name in GUI Input URL
      Slot #1 ID in UWL url
      Slot #1 data type String
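
      The XS requirement above can be checked before a file is passed to the element. The following is only a minimal sketch, not part of the element itself: it assumes the alignments are available as SAM text lines (for example, the output of "samtools view -h"), and the function name spliced_reads_missing_xs is hypothetical. A read is treated as spliced when its CIGAR string contains an N operation.

```python
def spliced_reads_missing_xs(sam_lines):
    """Return names of spliced reads (CIGAR contains 'N') that lack an XS:A tag.

    Hypothetical helper: operates on SAM text lines, not on a binary BAM file.
    """
    missing = []
    for line in sam_lines:
        # Skip empty lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        cigar = fields[5]
        if "N" not in cigar:
            continue  # unspliced alignment: the XS tag is not required
        optional_tags = fields[11:]
        if not any(tag.startswith("XS:A:") for tag in optional_tags):
            missing.append(fields[0])
    return missing


if __name__ == "__main__":
    seq, qual = "A" * 50, "I" * 50
    example = [
        "@HD\tVN:1.6\tSO:coordinate",
        f"r1\t0\tchr1\t100\t60\t20M500N30M\t*\t0\t0\t{seq}\t{qual}\tXS:A:+",
        f"r2\t0\tchr1\t200\t60\t25M300N25M\t*\t0\t0\t{seq}\t{qual}\tNH:i:1",
    ]
    print(spliced_reads_missing_xs(example))
```

      An empty result means every spliced alignment in the sample carries the tag; any names returned indicate that the read mapper output may need to be regenerated or fixed before assembly.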

      Output data

      There is one output port.

      Item Value
      Port name in GUI StringTie output data
      Port description For each input BAM file, the port outputs a URL to a GTF file with the assembled transcripts produced by StringTie.
      If "Enable gene abundance output" is "True", the port also outputs a URL to a text file with gene abundances (in a tab-delimited format).
      Port ID in UWL out
      Number of slots 1 or 2 depending on the value of "Enable gene abundance output"
      Slot #1 name in GUI Output URL Transcripts
      Slot #1 ID in UWL url-transcripts
      Slot #1 data type String
      Slot #2 name in GUI Output URL Gene Abundance
      Slot #2 ID in UWL url-gene-abund
      Slot #2 data type String
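
      The slot behaviour above, combined with the "Auto" file names from the parameters table, can be illustrated as follows. This is a sketch under assumptions: the function name output_slots and the out_dir argument are hypothetical, and only the slot IDs and file name suffixes are taken from this specification.

```python
from pathlib import Path


def output_slots(bam_url, out_dir, gene_abundance=False):
    """Map output slot IDs to file URLs for one input BAM file.

    Hypothetical helper: 'out_dir' stands in for the workflow output folder.
    """
    stem = Path(bam_url).stem  # "sample.bam" -> "sample"
    out = Path(out_dir)
    # Slot #1 is always present.
    slots = {"url-transcripts": str(out / f"{stem}_transcripts.gtf")}
    # Slot #2 only exists when gene abundance output is enabled.
    if gene_abundance:
        slots["url-gene-abund"] = str(out / f"{stem}_gene_abund.tab")
    return slots


if __name__ == "__main__":
    print(output_slots("reads/sample.bam", "out", gene_abundance=True))
```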

      Parameters

      # Parameter Description Value in GUI Default value
      1 Reference annotations Use the reference annotation file (in GTF or GFF3 format) to guide the assembly process (-G). The output will include expressed reference transcripts as well as any novel transcripts that are assembled. A line edit with the browse button nearby that opens a file browse dialog. There is no default value.
      2 Reads orientation Select the NGS library type: unstranded, stranded fr-secondstrand (--fr), or stranded fr-firststrand (--rf). A combo box with values: "Unstranded", "Forward (FR)", "Reverse (RF)". "Unstranded"
      3 Label Use the specified string as the prefix for the name of the output transcripts (-l). A line edit. "STRG"
      4 Min isoform fraction Specify the minimum isoform abundance of the predicted transcripts as a fraction of the most abundant transcript assembled at a given locus (-f). Lower abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts. A spin box with values from 0.0 to 1.0. 0.1
      5 Min assembled transcript length Specify the minimum length for the predicted transcripts (-m). A spin box with INT values >=30. 200
      6 Min anchor length for junctions Junctions that don't have spliced reads that align across them with at least this number of bases on both sides are filtered out (-a). For now make it a spin box with INT values >= 0. 10
      7 Min junction coverage There should be at least this many spliced reads that align across a junction (-j). This number can be fractional, since some reads align in more than one place. A read that aligns in n places will contribute 1/n to the junction coverage. For now make it a spin box with FLOAT values >= 0. 1.0
      8 Trim transcripts based on coverage By default StringTie adjusts the predicted transcript's start and/or stop coordinates based on sudden drops in coverage of the assembled transcript. Set this parameter to "False" to disable the trimming at the ends of the assembled transcripts (-t). A combo box with values "True" and "False". "True"
      9 Min coverage for assembled transcripts Specify the minimum read coverage allowed for the predicted transcripts (-c). A transcript with a lower coverage than this value is not shown in the output. This number can be fractional, since some reads align in more than one place. A read that aligns in n places will contribute 1/n to the coverage. Make it a spin box with FLOAT values >= 0.001. 2.5
      10 Min locus gap separation Reads that are mapped closer than this distance are merged together in the same processing bundle (-g). A spin box, INT>=0. "bp" is written nearby. "50 bp"
      11 Fraction covered by multi-hit reads Specify the maximum fraction of multiple-location-mapped reads that are allowed to be present at a given locus (-M). A read that aligns in n places will contribute 1/n to the coverage. For now make it a spin box with FLOAT values >= 0. 0.95
      12 Skip assembling for sequences Ignore all read alignments (and thus do not attempt to perform transcript assembly) on the specified reference sequences (-x). The value can be a single reference sequence name (e.g. "chrM") or a comma-delimited list of sequence names (e.g. "chrM,chrX,chrY").
      This can speed up StringTie especially in the case of excluding the mitochondrial genome, whose genes may have very high coverage in some cases, even though they may be of no interest for a particular RNA-Seq analysis.
      The reference sequence names are case sensitive: they must exactly match the names of the chromosomes/contigs of the target genome against which the RNA-Seq reads were aligned.
      A line edit. When the value is empty, the parameter is not specified in the command. By default, the value is not specified.
      13 Abundance for reference transcripts only Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts (-e). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example.
      The parameter is only available if the "Reference annotations" file is specified. It is recommended to use it when Ballgown table files are produced.
      Hide the parameter if "Reference annotations" is not set. The value should be a combo box with values "True" and "False". "False"
      14 Multi-mapping correction Enables or disables (-u) multi-mapping correction. A combo box with values "Enabled" and "Disabled". Note: the parameter is not described in the StringTie documentation, but it is present in the command line. "Enabled"
      15 Verbose log Enable detailed logging, if required (-v). The messages will be written to the UGENE log (enabling of "DETAILS" and "TRACE" logging may be required) and to the dashboard. A combo box with values "True" and "False". "False"
      16 Number of threads Specify the number of processing threads (CPUs) to use for transcript assembly (-p). A spin box with values from 1 to the number of available cores. Use the value from the Application Settings (the "Optimize for CPU count" setting).
      17 Output transcripts file StringTie's primary output GTF file with assembled transcripts. A line edit with the browse button. Auto (this equals "input file name_transcripts.gtf"; for example, for "sample.bam" it will be "sample_transcripts.gtf").
      18 Enable gene abundance output Select "True" to generate gene abundances output (-A). The output is written to a tab-delimited text file. Also, the file URL is passed to an output slot of the workflow element. A combo box with values "True" and "False". "False"
      19 Output gene abundances file Specify the name of the output file with gene abundances (-A). The parameter is only available if "Enable gene abundance output" is set to "True"; hide it otherwise. When available, it should be a line edit with the browse button. Auto (this equals "input file name_gene_abund.tab", for example, "sample_gene_abund.tab").
      20 Enable covered reference transcripts output Select "True" to generate a file with reference transcripts that are fully covered by reads (-C).
      The parameter is only available if the "Reference annotations" file is specified.
      Hide the parameter if "Reference annotations" is not set. Otherwise, it should be a combo box with values "True" and "False". "False"
      21 Output covered reference transcripts file Specify the name of the output file with reference transcripts that are fully covered by reads (-C). Hide the parameter if "Enable covered reference transcripts output" is "False". Otherwise, it should be a line edit with the browse button. Auto (this equals "input file name_cov_refs.gtf", e.g. "sample_cov_refs.gtf").
      22 Enable output for Ballgown Select "True" to generate table files (*.ctab) that can be used as input to Ballgown (-b). The files contain coverage data for the reference transcripts.
      The parameter is only available if the "Reference annotations" file is specified.
      It is also recommended to set "Abundance for reference transcripts only" to "True".
      Hide the parameter if "Reference annotations" is not set. Otherwise, it should be a combo box with values "True" and "False". "False"
      23 Output folder for Ballgown Specify a folder for table files (*.ctab) that can be used as input to Ballgown. A line edit with the browse button. Using the browse dialog one should be able to select a folder, not a file. Auto (this equals to "ballgown_input" folder in the workflow output folder).
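
      The mapping from the parameters above to a StringTie command line can be sketched as follows. This is an illustration under stated assumptions, not the UGENE implementation: the function name and the dictionary keys are hypothetical, the default of 1 thread stands in for the Application Settings value, and the -o flag for the primary GTF output comes from the StringTie command line rather than from the table. Note that -t and -u are passed only to disable trimming and multi-mapping correction, and that -e, -C and -b are emitted only when a reference annotation file is set.

```python
def build_stringtie_command(bam, p):
    """Translate element parameter values (dict 'p') into an argument list.

    Hypothetical sketch: key names are illustrative, defaults follow the table.
    """
    cmd = ["stringtie", bam]
    ref = p.get("reference_annotations")
    if ref:
        cmd += ["-G", ref]
    orientation = p.get("reads_orientation", "Unstranded")
    if orientation == "Forward (FR)":
        cmd.append("--fr")
    elif orientation == "Reverse (RF)":
        cmd.append("--rf")
    cmd += ["-l", p.get("label", "STRG")]
    cmd += ["-f", str(p.get("min_isoform_fraction", 0.1))]
    cmd += ["-m", str(p.get("min_transcript_length", 200))]
    cmd += ["-a", str(p.get("min_anchor_length", 10))]
    cmd += ["-j", str(p.get("min_junction_coverage", 1.0))]
    cmd += ["-c", str(p.get("min_coverage", 2.5))]
    cmd += ["-g", str(p.get("min_locus_gap", 50))]
    cmd += ["-M", str(p.get("multi_hit_fraction", 0.95))]
    # -t and -u switch features OFF, so they are added only for "False"/"Disabled".
    if not p.get("trim_transcripts", True):
        cmd.append("-t")
    if not p.get("multi_mapping_correction", True):
        cmd.append("-u")
    # An empty "Skip assembling for sequences" value means the flag is omitted.
    if p.get("skip_sequences"):
        cmd += ["-x", p["skip_sequences"]]
    if p.get("verbose", False):
        cmd.append("-v")
    cmd += ["-p", str(p.get("threads", 1))]  # assumption: 1 if not configured
    cmd += ["-o", p["transcripts_file"]]
    if p.get("gene_abundance_file"):
        cmd += ["-A", p["gene_abundance_file"]]
    # The remaining options are only meaningful with reference annotations.
    if ref:
        if p.get("reference_transcripts_only", False):
            cmd.append("-e")
        if p.get("covered_refs_file"):
            cmd += ["-C", p["covered_refs_file"]]
        if p.get("ballgown_dir"):
            cmd += ["-b", p["ballgown_dir"]]
    return cmd


if __name__ == "__main__":
    params = {
        "reference_annotations": "genes.gtf",
        "reference_transcripts_only": True,
        "transcripts_file": "sample_transcripts.gtf",
        "ballgown_dir": "ballgown_input",
    }
    print(" ".join(build_stringtie_command("sample.bam", params)))
```

      Returning a list of arguments rather than a single string avoids quoting problems when file names contain spaces.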

              People

              Assignee:
              atiunov Aleksey Tiunov [X] (Inactive)
              Reporter:
              oigl Olga Golosova
              Assigned Tester:
              Dmitrii Sukhomlinov