StandardScaler
- class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
New in version 1.2.0.
- Parameters
- withMean : bool, optional
False by default. Centers the data with the mean before scaling. This builds a dense output, so take care when applying to sparse input.
- withStd : bool, optional
True by default. Scales the data to unit standard deviation.
Examples
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
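For reference, the values in this example follow from plain column-wise arithmetic: with both options enabled, each column is centered by its mean and divided by its corrected sample standard deviation. The lines below are only an illustrative NumPy sketch of that arithmetic, not part of the MLlib API (the variable names X, mean, std are illustrative only):
>>> import numpy as np
>>> X = np.array([[-2.0, 2.3, 0.0],
...               [3.8, 0.0, 1.9]])     # the two training vectors above
>>> mean = X.mean(axis=0)               # per-column means; mean[0] == 0.9
>>> std = X.std(axis=0, ddof=1)         # corrected sample standard deviation; std[0] is about 4.1
>>> scaled = (X - mean) / std           # rows match the DenseVector results above (+/-0.7071)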
Methods
fit(dataset)
Computes the mean and variance and stores them as a model to be used for later scaling.
Methods Documentation
- fit(dataset)
Computes the mean and variance and stores them as a model to be used for later scaling.
New in version 1.2.0.
- Parameters
- dataset : pyspark.RDD
The data used to compute the mean and variance to build the transformation model.
- Returns
- StandardScalerModel
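A minimal usage sketch of fit, assuming a running SparkContext named sc (the variable names are illustrative only):
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.feature import StandardScaler
>>> train = sc.parallelize([Vectors.dense([-2.0, 2.3, 0.0]),
...                         Vectors.dense([3.8, 0.0, 1.9])])
>>> model = StandardScaler(withMean=True, withStd=True).fit(train)  # computes column means and variances
>>> scaled = model.transform(train)  # the fitted model can also transform other RDDs of same-sized vectors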