Description
Describe the bug
Consider the code below, which creates a pipeline in which a one-hot encoder is first applied to the categorical feature and the numerical feature is then standardized to zero mean and unit variance (followed by a softmax classifier, which is not important here).
To check that everything was working correctly, I put a println inside the fit method of Pipeline and noticed that the code does not work as intended (see Expected behavior and Actual behaviour).
To Reproduce
import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._
// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
  List(0.0, 3.7),
  List(5.0, 2.4),
  List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
  pipe(OneHotEncoder(featureIndex)),
  pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)
Expected behavior
The first feature is removed and encoded by 6 new columns. Then, the standard scaler is applied to the (now) FIRST (index 0) feature, i.e. the numerical one.
Actual behaviour
The first feature is removed and encoded by 6 new columns. Then, the standard scaler is applied to the SECOND (index 1) feature, which is now one of the one-hot columns.
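To make the index shift concrete, here is a minimal sketch in plain Scala. It is not doddle-model's implementation: `FeatureType`, `typesAfterOneHot` and `numericalIndices` are hypothetical names. It assumes the encoder keeps the non-categorical columns in order and appends the indicator columns at the end, and it arbitrarily labels the appended columns as categorical.

```scala
object IndexShiftSketch {
  sealed trait FeatureType
  case object Categorical extends FeatureType
  case object Numerical extends FeatureType

  // Hypothetical helper (not doddle-model API): feature types after one-hot encoding,
  // assuming non-categorical columns are kept in order and the indicator columns of
  // every categorical feature are appended at the end.
  def typesAfterOneHot(types: List[FeatureType], cardinalities: Map[Int, Int]): List[FeatureType] = {
    val kept: List[FeatureType] = types.filter(_ != Categorical)
    val appended: List[FeatureType] = types.zipWithIndex.flatMap {
      case (Categorical, i) => List.fill(cardinalities(i))(Categorical)
      case _                => Nil
    }
    kept ++ appended
  }

  // Column positions of the numerical features.
  def numericalIndices(types: List[FeatureType]): List[Int] =
    types.zipWithIndex.collect { case (Numerical, i) => i }

  def main(args: Array[String]): Unit = {
    val before = List[FeatureType](Categorical, Numerical)
    // The categorical feature is encoded by 6 new columns in this report.
    val after = typesAfterOneHot(before, Map(0 -> 6))
    println(numericalIndices(before)) // List(1)
    println(numericalIndices(after))  // List(0)
  }
}
```

A scaler that still looks up the numerical column in `before` would pick index 1, whereas the data it receives has the numerical column at index 0.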
Versions
Scala: 2.13.0
doddle-model: 0.0.1
Additional context
Prints of partial results (i.e. transformed data):
- after first transformer is applied:
3.7 1.0 0.0 0.0 0.0 0.0 0.0
2.4 0.0 0.0 0.0 0.0 0.0 1.0
-0.3 0.0 0.0 1.0 0.0 0.0 0.0
- after second transformer is applied:
3.7 1.1547005383792515 0.0 0.0 0.0 0.0 0.0
2.4 -0.5773502691896258 0.0 0.0 0.0 0.0 1.0
-0.3 -0.5773502691896258 0.0 1.0 0.0 0.0 0.0
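As a sanity check on the prints above, the following sketch standardizes the one-hot column (1.0, 0.0, 0.0) in plain Scala. It assumes the scaler divides by the sample standard deviation (n - 1 in the denominator). The printed values are consistent with that assumption, but the sketch is not taken from the library. `standardize` is a hypothetical helper.

```scala
object ScalingCheck {
  // Hypothetical helper: subtract the mean and divide by the sample standard deviation.
  def standardize(xs: Vector[Double]): Vector[Double] = {
    val mean = xs.sum / xs.size
    val sampleVariance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    val std = math.sqrt(sampleVariance)
    xs.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    // Column at index 1 after the first transformer (a one-hot indicator column).
    val oneHotColumn = Vector(1.0, 0.0, 0.0)
    // Column at index 0 after the first transformer (the numerical feature).
    val numericalColumn = Vector(3.7, 2.4, -0.3)

    // Roughly 1.1547, -0.5774, -0.5774: the values seen at index 1 in the second print.
    println(standardize(oneHotColumn))
    // What index 0 would contain if the numerical feature had been scaled instead.
    println(standardize(numericalColumn))
  }
}
```

The first result matches the second printed column, which supports the reading that the scaler was applied to a one-hot column, not to the numerical feature.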