jsonvectorizer.vectorizers.StringVectorizer

class jsonvectorizer.vectorizers.StringVectorizer(min_df=1, **kwargs)

Vectorizer for strings

Tokenizes strings using scikit-learn’s CountVectorizer.

Parameters:
    min_df : int or float, optional (default=1)
        When using tokenization, ignore terms that have a document frequency strictly lower than this threshold. An integer is taken as an absolute count, and a float indicates the proportion of n_total passed to the `fit()` method.
    **kwargs
        Passed to scikit-learn’s `CountVectorizer` class for initialization.
Raises:	ValueError If min_df is not a positive number.
Attributes:	feature_names_ : list of str
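
The two interpretations of min_df described above can be sketched in plain Python. This is only an illustration of the documented rule, not the library's code: `resolve_min_df` is a hypothetical helper, and the exact rounding and type checks used by jsonvectorizer are assumptions.

```python
from numbers import Integral, Real


def resolve_min_df(min_df, n_total):
    """Hypothetical helper: turn min_df into a document-count threshold."""
    # The documentation states that a non-positive min_df raises ValueError.
    if not isinstance(min_df, Real) or min_df <= 0:
        raise ValueError("min_df must be a positive number")
    if isinstance(min_df, Integral):
        # An integer is taken as an absolute document count.
        return int(min_df)
    # A float is a proportion of n_total (assumed: no rounding applied).
    return min_df * n_total


print(resolve_min_df(2, 200))     # 2
print(resolve_min_df(0.25, 200))  # 50.0
```

Under this reading, a term is dropped when the number of documents containing it is strictly lower than the returned threshold.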

Methods

`fit`(self, values[, n_total])	Fit vectorizer to the provided data
`fit_transform`(self, values, **fit_params)	Fit vectorizer to the provided data, then transform it
`get_params`(self[, deep])	Get parameters for this estimator.
`set_params`(self, **params)	Set the parameters of this estimator.
`transform`(self, values)	Transform values and return the resulting feature matrix

fit(self, values, n_total=None, **kwargs)

Fit vectorizer to the provided data

Parameters:
    values : array-like, [n_samples]
    n_total : int or None, optional (default=None)
        Total number of documents from which values are extracted. If None, defaults to `len(values)`.
    **kwargs
        Ignored keyword arguments.
Returns:	self or None Returns None if no features were generated, otherwise returns self.
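
To make the role of n_total and the None return value concrete, here is a rough stand-alone sketch. It does not use jsonvectorizer or scikit-learn: `fit_vocabulary` is a hypothetical function, and splitting on whitespace is a simplifying assumption that differs from CountVectorizer's real tokenization.

```python
from collections import Counter


def fit_vocabulary(values, min_df=1, n_total=None):
    """Hypothetical stand-in for fit(): return kept terms, or None."""
    if n_total is None:
        # Documented default: fall back to the number of values.
        n_total = len(values)
    # Integer: absolute count. Float: proportion of n_total.
    threshold = min_df if isinstance(min_df, int) else min_df * n_total
    doc_freq = Counter()
    for value in values:
        # Count each term once per document (assumed whitespace tokens).
        doc_freq.update(set(value.lower().split()))
    kept = sorted(term for term, df in doc_freq.items() if df >= threshold)
    # Mirror the documented behaviour: None when no features survive.
    return kept or None


values = ["red apple", "green apple", "red car"]
print(fit_vocabulary(values, min_df=0.5))              # ['apple', 'red']
print(fit_vocabulary(values, min_df=0.5, n_total=10))  # None
```

The second call models the case where the three strings were extracted from ten documents: the same proportional min_df now demands five documents per term, so nothing is kept.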

fit_transform(self, values, **fit_params)

Fit vectorizer to the provided data, then transform it

Parameters:
    values : array-like, [n_samples]
    **fit_params
        Keyword arguments, passed to the `fit()` method.
Returns:	X : ndarray, [n_samples, n_features]

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:	deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params : mapping of string to any Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:	**params : dict Estimator parameters.
Returns:	self : object Estimator instance.
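
The <component>__<parameter> convention can be illustrated with a small self-contained sketch. The `Component` class below is hypothetical and much simpler than scikit-learn's estimators (it performs no validation of parameter names); it only shows how a double-underscore key is routed to a nested object.

```python
class Component:
    """Hypothetical minimal object with scikit-learn-style set_params."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if sep:
                # Nested form: delegate to the named sub-component.
                self.params[name].set_params(**{sub_key: value})
            else:
                # Simple form: set the parameter on this object.
                self.params[name] = value
        return self


inner = Component(min_df=1)
outer = Component(strings=inner, verbose=False)
outer.set_params(strings__min_df=3, verbose=True)
print(inner.params["min_df"])  # 3
```

Returning self, as the method documented above does, allows calls to be chained.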

transform(self, values)

Transform values and return the resulting feature matrix

Parameters:	values : array-like, [n_samples]
Returns:	X : sparse matrix, shape [n_samples, n_features]
Raises:	NotFittedError If the vectorizer has not yet been fitted.