Extract-Transform-Load

Performing ETL activities consume a significant part of the average data science workflow. With Akumen, we aim to streamline it somewhat by using PETL - a Python ETL library and makes transforms relatively simple. This model is simple, because there isn’t much ETL required for this dataset - it’s all basically in the format that we need. Hence, we just perform some minor transformations to show the workflow.

To create an ETL model, you can do the following:

  • Go to the App Manager, and select Create Application -> Python Model, named ETL - Breast Cancer.
  • To use PETL, we need to include it in the build environment. Create a new file called Dockerfile (which will be created from a template) and add the petl install line as below:
FROM quay.io/optika.solutions/akumen-python:3.7.2
RUN pip3 install petl
  • Clear out main.py and add the following, step-by-step:
import petl
from akumen_api import get_results

The first line imports petl, the Python ETL library. The second line imports a helper from Akumen’s API functions, get_results, which retrieves data from another model’s execution. We’ll use this to get the data from our data connector in the previous section.

def akumen(**kwargs):
    """
    Parameters:
        - Input: data [scenario]
        - Input: view [string]
        - Input: scope [string]

        - Output: results [tabular]
    """
    print('Running Akumen model...')

This section is our standard Akumen wrapper function and parameter definitions. In this model, we take three parameters:

  • data - a scenario input that points to our Connector model. We use this with get_results to retrieve the connector’s result data (which is our breast cancer data).
  • view - the name of the view that we want to retrieve results from. For our connector, this is data_vw.
  • scope - the scope level that we want to retrieve data for. We’re only retrieving data from a single scenario, so scenario scope is fine.
    # grab the whole resultset
    input = petl.fromdataframe(
        get_results(kwargs.get('data'), kwargs.get('view'), kwargs.get('scope'))
    )

Using get_results, we retrieve the dataset from the Connector model and pass it into petl. get_results returns a dataframe, which is easily imported into petl for modification.

    # reflow diagnosis into a 0/1 value, since autosklearn doesn't operate on labels
    input = petl.convert(input, 'diagnosis', lambda v: 0 if v == 'B' else 1)

As a simple modification, we reform the diagnosis column into a binary integer format. petl provides a large number of transformation functions that can be used, but other libraries or code can be used here too - fuzzy matching, integration, or others.

    # strip off the akumen columns so we don't duplicate them
    for col in ['studyname', 'scenarioname', 'id']:
        input = petl.cutout(input, col)

We also want to cut off some Akumen-standard columns, so they’re not duplicated by this model.

    # or we can return the df directly to the `return` of the `akumen()` function:
    return {
        'results': petl.todataframe(input)
    }

And we return the resulting dataframe for output.

The entire resulting file is below:

import petl
from akumen_api import get_results


def akumen(**kwargs):
    """
    Parameters:
        - Input: data [scenario]
        - Input: view [string]
        - Input: scope [string]

        - Output: results [tabular]
    """
    print('Running Akumen model...')
    
    # grab the whole resultset
    input = petl.fromdataframe(
        get_results(kwargs.get('data'), kwargs.get('view'), kwargs.get('scope'))
    )
    
    # reflow diagnosis into a 0/1 value, since autosklearn doesn't operate on labels
    input = petl.convert(input, 'diagnosis', lambda v: 0 if v == 'B' else 1)

    # strip off the akumen columns so we don't duplicate them
    for col in ['studyname', 'scenarioname', 'id']:
        input = petl.cutout(input, col)

    # or we can return the df directly to the `return` of the `akumen()` function:
    return {
        'results': petl.todataframe(input)
    }

For inputs, you should use:

  • data: the connector model we previously built.
  • view: data_vw
  • scope: scenario

Research Grid for ETL Research Grid for ETL

Execute the model and go to the data tab, and you should see that the diagnosis column is now 0/1 instead of B/M.

Data View for ETL Data View for ETL