Skip to content

Detect parameter collisions when using the @inherits and @requires decorators #2566

@KenyaDonDraper

Description

@KenyaDonDraper

At a project am working on, we are heavily utilizing the @requires decorator to ease parameter inheritance pain. This is a big project so we have split up our monolithic pipeline into smaller pipelines each being worked on by a different team. Some of these pipelines also derive from others using normal Python inheritance. Because of our use case (many interdependent pipelines and use of @requires) we have found ourselves in the situation where two or several tasks in the pipelines chain define the same parameter name in different contexts. Since the context is different the existing functionality of @requires (and @inherits) of silently masking a similarly named upstream parameter is undesirable. It would be better if Luigi could avoid this silent parameter masking or give a warning at the very least.

This problem that I have described, is what we are unofficially calling the "parameter collision" problem. There are two flavors of this problem that we have encountered;

Case A:

import luigi
from luigi.util import requires


###  Pipeline I
class TaskA(luigi.Task):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))

@requires(TaskA)
class TaskB(luigi.Task):
    param_b = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_b}'.format(t=self))


###  Pipeline II
@requires(TaskB)
class TaskC(luigi.Task):
    param_a = luigi.Parameter()

Case B:

import luigi
from luigi.util import requires

###  Pipeline I
class TaskX(luigi.Task):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))


###  Pipeline II
class TaskY(luigi.Task):
    param_a = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/log-{t.param_a}'.format(t=self))


###  Pipeline III
@requires(TaskA, TaskB)
class TaskZ(luigi.Task):
    param_b= luigi.Parameter()

In the first case (Case A), TaskC will silently mask param_a that had been passed down from the upstream TaskA. In the 2nd case (Case B), TaskZ will possibly also mask param_a from either TaskX or TaskY. Such a masking could have been unintentional and therefore it's dangerous for parameters to be silently overwritten (masked).

In a big project, with different teams working on different parts, it might be impossible to realize you have masked/overwritten a parameter that had been defined in an upstream task. Therefore, we need a mechanism for detecting and warning the developer when such a collision of parameter names occurs. We evaluated a few options but the most promising solution seems to be;

  • Modify the @requires and @inherits decorator to throw an exception when a parameter collision is detected. In addition, add a new argument, ignore_collisions=(), to allow excluding select parameters from the collision detection. The intention is that when a parameter collision is detected at a particular point in the task inheritance chain, the developer has the choice of either renaming one of the parameters or include it in the ignore_collisions list if the collision is desired. I have included a WIP pull request here.

Has anyone else encountered this problem and if so do you think the suggested solution is sufficient? If the answer to both of these questions is in the affirmative, I will rework the pull request to include tests and documentation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions