  1. UGENE
  2. UGENE-6035

Add "Ensemble Classification Data" workflow element

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: virogenesis
    • Fix Version/s: 1.31
    • Component/s: NGS, Workflow
    • Story Points: 3
    • Sprint: DEV-30-4, DEV-30-5, DEV-30-6
    • Affect Type: Userdefined

      Description

      Element name and description

      • Name of the element: "Ensemble Classification Data"
      • Description of the element on the Scene: "Ensemble classification data from other elements into unset."
        The "unset" value corresponds to the name of the output file.
      • Description of the element in the Property Editor:
        "The element ensembles data, produced by classification tools (Kraken, CLARK, DIAMOND), into a single file in CSV format. This file can be used as input for the WEVOTE classifier."

      Input data

      There is one input port:

      Item                   Value
      ---------------------  ------------------------------------------
      Port name in GUI       Input taxonomy data
      Port description       Three input slots are available for taxonomy classification data. At least first and second slots should be connected to classification data slots.
      Port ID in UWL         in
      Number of slots        3
      Slot #1 name in GUI    Input tax data 1
      Slot #1 ID in UWL      tax_data1
      Slot #1 data type      Taxonomy classification
      Slot #2 name in GUI    Input tax data 2
      Slot #2 ID in UWL      tax_data2
      Slot #2 data type      Taxonomy classification
      Slot #3 name in GUI    Input tax data 3
      Slot #3 ID in UWL      tax_data3
      Slot #3 data type      Taxonomy classification

      Output data

      There is one output port:

      Item                   Value
      ---------------------  ------------------------------------------
      Port name in GUI       Ensembled classification
      Port description       URL to the CSV file with ensembled classification data.
      Port ID in UWL         out
      Number of slots        1
      Slot #1 name in GUI    Output URL
      Slot #1 ID in UWL      url
      Slot #1 data type      string

      Parameters

      There is one parameter, "Output file". In the GUI it is a line edit with a browse button. The value is mandatory ("Required"). The default value is "ensemble.csv". The parameter description is the following:

      Specify the output file. The classification data are stored in CSV format with the following columns:
          1) a sequence name
          2) taxID from the first tool
          3) taxID from the second tool
          4) optionally, taxID from the third tool
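
The column layout above can be illustrated with a small reader. This is a minimal sketch, not UGENE code: the function name `read_ensemble_csv` and the sample rows are hypothetical, and it assumes the file has no header row and uses commas as delimiters, which the description does not state explicitly.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_ensemble_csv(handle: TextIO) -> Dict[str, List[str]]:
    """Map each sequence name to the taxIDs that follow it in its row."""
    rows: Dict[str, List[str]] = {}
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        name, tax_ids = record[0], record[1:]
        # Two taxID columns are mandatory, the third one is optional.
        if len(tax_ids) not in (2, 3):
            raise ValueError(
                f"expected 2 or 3 taxID columns for {name!r}, got {len(tax_ids)}"
            )
        rows[name] = tax_ids
    return rows


# Hypothetical sample content, not taken from a real classification run.
sample = io.StringIO("read_1,562,562,561\nread_2,0,1280,1280\n")
parsed = read_ensemble_csv(sample)
```

TaxIDs are kept as strings here so that the reader makes no assumption about how a tool encodes an unclassified sequence.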
      

      Data processing by the element

      • The element takes input taxonomy data (i.e. maps of sequence names to taxIDs) from two or three slots. Datasets are not taken into account. The data are processed per file.
      • It sorts all input sequence names alphabetically.
      • It creates a CSV file (using the name specified in the parameters) with the following column structure:
        • seq_name
        • taxID_of_seq_from_slot1
        • taxID_of_seq_from_slot2
        • taxID_of_seq_from_slot3 (if specified)
      • It shows the CSV file as an output on the WD dashboard and passes the file URL to the output port.
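
The processing steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual UGENE implementation: the function name `ensemble_classification` and the input values are hypothetical, sequence names found in only some of the maps are assumed to be skipped (as the warning text in this ticket suggests), and "alphabetical" order is approximated with Python's default string ordering.

```python
import csv
import os
import tempfile
from typing import Dict, Optional

TaxonomyMap = Dict[str, int]  # sequence name -> taxID


def ensemble_classification(
    slot1: TaxonomyMap,
    slot2: TaxonomyMap,
    slot3: Optional[TaxonomyMap] = None,
    output_path: str = "ensemble.csv",
) -> str:
    """Write one CSV row per sequence name and return the file path."""
    maps = [slot1, slot2] + ([slot3] if slot3 is not None else [])
    # Assumption: only names present in every connected slot are written.
    common = set(maps[0])
    for tax_map in maps[1:]:
        common &= set(tax_map)
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for name in sorted(common):
            writer.writerow([name] + [tax_map[name] for tax_map in maps])
    return output_path


# Hypothetical input maps for two connected slots.
tool_a = {"read_2": 1280, "read_1": 562, "read_3": 0}
tool_b = {"read_1": 562, "read_2": 1279}
out_dir = tempfile.mkdtemp()
url = ensemble_classification(
    tool_a, tool_b, output_path=os.path.join(out_dir, "ensemble.csv")
)
with open(url, newline="") as handle:
    rows = list(csv.reader(handle))
```

In this sketch the returned path plays the role of the URL passed to the output port, and "read_3" is left out because the second map has no entry for it.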

      Error messages

      In case the first or the second slot is not set:

      • Show an error in the WD Error list:
        It is required to input taxonomy data for at least the first and the second slot.
        

      In case a sequence is present in one of the maps, but not present in another one:

      • Generate a "TRACE" message like:
        Taxonomy data for "seq_name" is found in "file1", but not found in "file2" and "file3".
        
      • Generate an "INFO" message in the log and a warning message on the WD dashboard:
        Different taxonomy data do not match. Some sequence names were skipped. 
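
The two checks described in this section can be sketched as follows. The message texts are copied from this ticket; everything else (the helper names, the file labels, and the choice to join several file names with "and") is a hypothetical illustration and does not reflect the UGENE logging API.

```python
from typing import Dict, List, Optional

REQUIRED_SLOTS_ERROR = (
    "It is required to input taxonomy data for at least the first and the second slot."
)
MISMATCH_WARNING = (
    "Different taxonomy data do not match. Some sequence names were skipped."
)


def quoted_list(labels: List[str]) -> str:
    """Render labels as '"a" and "b"' (assumed joining style)."""
    return " and ".join(f'"{label}"' for label in labels)


def required_slots_error(slot1: Optional[dict], slot2: Optional[dict]) -> Optional[str]:
    """Return the error text if a mandatory slot is not set, else None."""
    if slot1 is None or slot2 is None:
        return REQUIRED_SLOTS_ERROR
    return None


def trace_messages(maps: Dict[str, Dict[str, int]]) -> List[str]:
    """Build one TRACE-style message per sequence name missing from some map."""
    all_names = sorted(set().union(*maps.values()))
    messages = []
    for name in all_names:
        found = [label for label, tax_map in maps.items() if name in tax_map]
        missing = [label for label, tax_map in maps.items() if name not in tax_map]
        if missing:
            messages.append(
                f'Taxonomy data for "{name}" is found in {quoted_list(found)}, '
                f"but not found in {quoted_list(missing)}."
            )
    return messages


# Hypothetical maps keyed by the file each one came from.
maps = {
    "file1": {"read_1": 562, "read_2": 1280},
    "file2": {"read_1": 562},
    "file3": {"read_1": 561},
}
messages = trace_messages(maps)
# The INFO/warning text would be emitted once if any mismatch was traced.
warning = MISMATCH_WARNING if messages else None
```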
        

      Sample data

      See, for example, the files "HC1.fasta" and "HC1_ensemble.csv" on the file server (in the ".../virogenesis/tools_testing/wevote_without_classifiers" folder). The second file was produced from the first one by running "run_WEVOTE_PIPELINE.sh" with:

      • CLARK-l with the "bacteria" database that goes with the tool.
      • Kraken with the "MiniKraken" database.

              People

              Assignee: Aleksey Tiunov (atiunov)
              Reporter: Olga Golosova (oigl)
              Assigned Tester: Eugenia Pushkova