Behaviour of OneHotEncoder in a Pipeline #83

@matejklemen

Description

Describe the bug
Consider the code below, which creates a pipeline in which a one-hot encoder is first applied to the categorical feature, and the numerical feature is then standardized to zero mean and unit variance (followed by a softmax classifier, which is not important here).

To check that everything was working correctly, I put a println inside the fit method of Pipeline and noticed that this code does not behave as intended (see Expected behaviour and Actual behaviour).

To Reproduce

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._

// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
	List(0.0, 3.7),
	List(5.0, 2.4),
	List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
	pipe(OneHotEncoder(featureIndex)),
	pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)

Expected behaviour
The first feature (categorical, index 0) is removed and encoded as 6 new columns. The standard scaler is then applied to the numerical feature, which is now the FIRST (index 0) column.
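
To make the expected index shift concrete, here is a minimal sketch in plain Scala. It is not doddle-model's implementation: `Kind`, `OneHot` and `layoutAfterOneHot` are hypothetical names, and the layout rule (categorical columns dropped, one-hot columns appended at the end) is an assumption taken from the printed output under Additional context.

```scala
// Hypothetical model of the column layout; these names are illustrative
// and are not part of doddle-model's API.
object OneHotLayoutSketch {
  sealed trait Kind
  case object Categorical extends Kind
  case object Numerical extends Kind
  case object OneHot extends Kind // hypothetical tag for encoded columns

  // Assumed rule, inferred from the printed output: categorical columns are
  // dropped, the remaining columns keep their order, and the one-hot
  // columns are appended at the end.
  def layoutAfterOneHot(layout: List[Kind], numCategories: List[Int]): List[Kind] = {
    val kept = layout.filterNot(_ == Categorical)
    kept ++ List.fill(numCategories.sum)(OneHot)
  }

  def main(args: Array[String]): Unit = {
    val before = List(Categorical, Numerical)
    // The report shows 6 one-hot columns for the single categorical feature.
    val after = layoutAfterOneHot(before, List(6))
    println(s"numerical index before encoding: ${before.indexOf(Numerical)}")
    println(s"numerical index after encoding:  ${after.indexOf(Numerical)}")
  }
}
```

Under this assumed rule the numerical feature moves from index 1 to index 0, so a scaler that still uses the original layout would target the wrong column.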

Actual behaviour
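The first feature is removed and encoded as 6 new columns. The standard scaler is then applied to the SECOND (index 1) column, which is the first one-hot column, while the numerical feature (now at index 0) is left unscaled.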
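
As a sanity check on which column was scaled, the sketch below standardizes the first one-hot column (1, 0, 0) in plain Scala. The helper `standardize` is hypothetical and assumes the sample standard deviation (n - 1 in the denominator); the result is consistent with the values printed under Additional context, but this is not doddle-model's own code.

```scala
object ScaledColumnCheck {
  // Standardize a column using the sample standard deviation
  // (n - 1 in the denominator); an assumption that matches the report.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.length
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.length - 1)
    val std = math.sqrt(variance)
    xs.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    // Column at index 1 after encoding: the first one-hot column.
    val firstOneHotColumn = Seq(1.0, 0.0, 0.0)
    // Column at index 0 after encoding: the numerical feature.
    val numericalColumn = Seq(3.7, 2.4, -0.3)
    println(standardize(firstOneHotColumn))
    println(standardize(numericalColumn))
  }
}
```

The standardized one-hot column comes out at roughly (1.1547, -0.5774, -0.5774), which agrees with the second printed column, while the numerical values 3.7, 2.4 and -0.3 appear unchanged in the output.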

Versions
Scala: 2.13.0
doddle-model: 0.0.1

Additional context
Prints of partial results (i.e. transformed data):

  • after first transformer is applied:
3.7   1.0  0.0  0.0  0.0  0.0  0.0  
2.4   0.0  0.0  0.0  0.0  0.0  1.0  
-0.3  0.0  0.0  1.0  0.0  0.0  0.0 
  • after second transformer is applied:
3.7   1.1547005383792515   0.0  0.0  0.0  0.0  0.0  
2.4   -0.5773502691896258  0.0  0.0  0.0  0.0  1.0  
-0.3  -0.5773502691896258  0.0  1.0  0.0  0.0  0.0

Labels

bug (Something isn't working)