
How to Parallelize Your Workflows Using PlantCV - Multiple Plants

  • Parag Bhatt
  • 14 steps
  • 3 minutes
  • PlantCV
  • Jupyter
1
Often you will start with a workflow from another tutorial that you have edited to work well for your experiment. Here we assume that you start with an `.ipynb` workflow that runs without errors on several representative test images.

NOTE: Make sure you have run your workflow and produced a JSON output file.
2
Convert your Jupyter Notebook (`.ipynb`) to an executable script (a Python script, `.py`) by selecting ***File > Save and Export Notebook As... > Executable Script***.

Save your Python script in the same directory as your Jupyter notebook for easy access. You may need to adjust the title of your script so it is concise (in the example, I am using a reference document and do not want "reference" in my final script's name).
3
Open your new Python script. We will have to make some edits to turn it into a functioning workflow script. The first thing to do is "comment out" anything we do not want to run in parallel. In the example image this means adding a `#` ahead of line 7: `get_ipython().run_line_magic('matplotlib', 'widget')`. For your workflow, you may also want to comment out commands that only print images or that look for Google Colab environments.
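For example, the matplotlib magic line in the exported script would be commented out like this (line 7 refers to this guide's example; the line number in your script may differ):

```python
# get_ipython().run_line_magic('matplotlib', 'widget')
```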
4
We need to change our import statement to say `from plantcv.parallel import workflow_inputs` in place of `from plantcv.parallel import WorkflowInputs`. This tells the parallel process to use `workflow_inputs()` to set parameters based on the configuration JSON file instead of using the hard-coded arguments from the notebook.
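In the import section of the script, the change looks like this:

```python
# Old (notebook) import, replaced for parallel execution:
# from plantcv.parallel import WorkflowInputs
from plantcv.parallel import workflow_inputs
```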
5
Next, we will change the argument definitions for this workflow by "commenting out" the `WorkflowInputs` parameters. Remember that you can comment out an entire selection with Ctrl (or Cmd) + / in Jupyter.

See the image below:
6
Above the commented-out `WorkflowInputs`, we are going to add a new line that will store workflow inputs in `args` to support parallel workflow execution: `args = workflow_inputs()`
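After this edit, the argument section looks something like the sketch below. The commented-out arguments are illustrative only; yours will match whatever your notebook used:

```python
from plantcv.parallel import workflow_inputs  # imported in the previous step

# Parse inputs supplied by the parallel run instead of hard-coded notebook values
args = workflow_inputs()

# Commented-out notebook arguments (hypothetical paths/names shown here):
# args = WorkflowInputs(images=["imgs/example.jpg"],
#                       names="image1",
#                       result="data_output.json",
#                       outdir=".",
#                       writeimg=True,
#                       debug="plot")
```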
7
Save the changes you made to your Python script.
NOTE: In the next step we will change the working directory in the terminal to the location of our scripts. There are a couple of ways to obtain the file path of your working directory:

1. In Jupyter, hover over the folder icon in the file browser to show the working directory.
2. In Jupyter, open a terminal and run `pwd` to print the working directory (see the example below).

You may want to copy the file path to a document so you can see how much of the path was copied (the full file path will not be copied).
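A minimal sketch of the second option; the path shown is hypothetical, and yours will differ:

```bash
pwd
# prints something like /home/username/plantcv-project
```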
8
In this guide, the current working directory is close to the root directory, so only a portion of the path was needed to point our conda environment at the current working directory. Remember that to change directories, you use the `cd` command (see image below, outlined in green). Once you have changed your working directory in conda, type `dir` (Windows) or `ls` (macOS/Linux) to view the contents of the directory and make sure that your Python script is in the correct location (see image below, outlined in orange).
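A minimal sketch of this step, assuming a hypothetical project folder named `plantcv-project`:

```bash
cd plantcv-project   # replace with the path you copied from Jupyter
ls                   # use `dir` on Windows; confirm your .py script is listed
```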
9
Now we need to create our parallel configuration file by typing: `plantcv-run-workflow --template config.json`. Since we have changed our working directory to our current folder, you should see **config.json** appear in the file explorer, as shown below.
10
We will need to edit `config.json` to configure the parallel analysis. Right-click `config.json`, hover over **Open With**, and select **Editor** to make changes to the file.
11
In Editor, make changes to `config.json`. Common settings to change are listed below (a filled-in example follows the list), but the full set of options is in the [documentation](https://plantcv.readthedocs.io/en/stable/parallel_config/) and is worth reviewing.

- "input_dir": "./imgs" *[file path/name of the input directory of images you want analyzed]*
- "json": "data_output.json" *[path/name of the data output file (located in the args container under results within your workflow)]*
- "filename_metadata": ["camera", "id", "timestamp"] *[list of metadata terms to collect. Supported metadata terms include: camera, imgtype, zoom, exposure, gain, frame, lifter, timestamp, id, plantbarcode, treatment, cartag, measurementlabel, and other]*
- "workflow": "multi-plant-workflow.py" *[path/name of your user-defined PlantCV workflow Python script]*
- "img_outdir": "./output_images" *[path/name of the output directory where measured images will be stored. Default is "./output_images"]*
- "imgformat": "jpg" *[image file format/extension. Default is "png"]*
- "timestampformat": "%Y-%m-%d-%H-%M" *[date format matching your file naming scheme. For an explanation of each symbol, see the Python [time format documentation](https://docs.python.org/3.7/library/datetime.html#strftime-and-strptime-behavior)]*
- "append": false *[bool, default = false. If true, results will be appended to an existing JSON file; if false, previous results stored in the specified JSON file will be deleted]*
- "cluster": "LocalCluster" *[there are several cluster types; the default, "LocalCluster", runs in parallel on the machine you run the workflow command from. The complete list of options is "LocalCluster", "HTCondorCluster", "LSFCluster", "MoabCluster", "OARCluster", "PBSCluster", "SGECluster", and "SLURMCluster", which you can read about in the [dask docs](https://jobqueue.dask.org/)]*
- "cluster_config": *["n_workers" is 1 in the example below, but you will increase it based on how many cores you have available/want to use; it controls the number of workers run in parallel. The "cores" argument is how many cores each worker needs, which will almost always stay at 1]*
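Pulling the settings above together, an edited `config.json` might look like the sketch below. Only the fields discussed here are shown; the generated template includes additional options, which you can leave at their defaults unless you need them:

```json
{
    "input_dir": "./imgs",
    "json": "data_output.json",
    "filename_metadata": ["camera", "id", "timestamp"],
    "workflow": "multi-plant-workflow.py",
    "img_outdir": "./output_images",
    "imgformat": "jpg",
    "timestampformat": "%Y-%m-%d-%H-%M",
    "append": false,
    "cluster": "LocalCluster",
    "cluster_config": {
        "n_workers": 1,
        "cores": 1
    }
}
```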
12
Save the changes you have made to `config.json`.
13
Now that we have made the necessary changes to our parallel configuration file, it is time for us to run our workflow. To execute your parallel analysis, return to your terminal and type `plantcv-run-workflow --config config.json` into the prompt.
NOTE: If you successfully set up your `config.json`, you should see the number of files found and a progress bar estimating how long it will take to analyze your dataset. You will also see that your job list includes X workflows. If you did not set up your `config.json` correctly, you will receive error messages detailing where PlantCV is having issues (finding an image directory, finding your workflow, incorrect date formats, etc.). After the job has completed, PlantCV will automatically convert your JSON file to CSV.
14
You should see two CSV files appear in your directory:

- `plantcv-results-single-value-traits.csv`
- `plantcv-results-multi-value-traits.csv`

The single-value-traits file will be in wide format, with a column per trait, whereas the multi-value-traits file will be in long format, with one row per value/label. The hierarchical organization of these files enables more efficient data processing downstream.
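As a quick way to inspect the two output formats, here is a minimal sketch assuming pandas is installed; the file names follow the example above and will track your configured output name:

```python
import pandas as pd

# Wide format: one row per image, one column per single-value trait
single = pd.read_csv("plantcv-results-single-value-traits.csv")

# Long format: one row per value/label for multi-value traits
multi = pd.read_csv("plantcv-results-multi-value-traits.csv")

print(single.head())
print(multi.head())
```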