jsonvectorizer.vectorizers.StringVectorizer

class jsonvectorizer.vectorizers.StringVectorizer(min_df=1, **kwargs)

Vectorizer for strings

Tokenizes strings using scikit-learn’s CountVectorizer.

Parameters:
min_df : int or float, optional (default=1)

When using tokenization, ignore terms that have a document frequency strictly lower than this threshold. An integer is taken as an absolute count, and a float indicates the proportion of n_total passed to the fit() method.

**kwargs

Passed to scikit-learn’s CountVectorizer class for initialization.

Raises:
ValueError

If min_df is not a positive number.

Attributes:
feature_names_ : list of str
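
The min_df rule described above can be sketched in plain Python. This is only an illustration of the documented semantics, not the library's implementation: the helper names resolve_min_df and keep_term are hypothetical, and the sketch makes no claim about how the real class treats floats greater than 1.

```python
import numbers


def resolve_min_df(min_df, n_total):
    """Hypothetical helper: turn min_df into a document-count threshold."""
    if not isinstance(min_df, numbers.Real) or min_df <= 0:
        # Mirrors the documented ValueError for non-positive values.
        raise ValueError("min_df must be a positive number")
    if isinstance(min_df, numbers.Integral):
        # An integer is taken as an absolute count.
        return int(min_df)
    # A float is taken as a proportion of n_total.
    return min_df * n_total


def keep_term(doc_freq, min_df, n_total):
    """Keep a term unless its document frequency is strictly below the threshold."""
    return doc_freq >= resolve_min_df(min_df, n_total)
```

With n_total=10, min_df=0.5 resolves to a threshold of 5.0, so a term seen in 5 documents is kept and a term seen in 4 is ignored.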

Methods

fit(self, values[, n_total])               Fit vectorizer to the provided data
fit_transform(self, values, **fit_params)  Fit vectorizer to the provided data, then transform it
get_params(self[, deep])                   Get parameters for this estimator.
set_params(self, **params)                 Set the parameters of this estimator.
transform(self, values)                    Transform values and return the resulting feature matrix

fit(self, values, n_total=None, **kwargs)

Fit vectorizer to the provided data

Parameters:
values : array-like, [n_samples]
n_total : int or None, optional (default=None)

Total number of documents that the values were extracted from. If None, defaults to len(values).

**kwargs

Ignored keyword arguments.

Returns:
self or None

Returns None if no features were generated, otherwise returns self.
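
Because fit() may return None, chaining a call such as fit(values).transform(values) can fail when no features are generated. A minimal sketch of a guard follows. The helper fit_then_transform and the StubVectorizer class are hypothetical stand-ins that mimic only the documented return contract; they are not part of jsonvectorizer.

```python
def fit_then_transform(vectorizer, values, n_total=None):
    """Hypothetical helper: guard against fit() returning None."""
    if vectorizer.fit(values, n_total=n_total) is None:
        return None  # no features were generated, so there is nothing to transform
    return vectorizer.transform(values)


class StubVectorizer:
    """Stand-in that mimics only the documented fit() return contract."""

    def fit(self, values, n_total=None, **kwargs):
        # n_total is accepted for signature compatibility but unused in this stub.
        self.feature_names_ = sorted({tok for v in values for tok in v.split()})
        return self if self.feature_names_ else None

    def transform(self, values):
        # One row of token counts per input string.
        return [[v.split().count(name) for name in self.feature_names_]
                for v in values]
```

The same guard applies to any object whose fit() follows the "self or None" convention described above.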

fit_transform(self, values, **fit_params)

Fit vectorizer to the provided data, then transform it

Parameters:
values : array-like, [n_samples]
**fit_params

Keyword arguments, passed to the fit() method.

Returns:
X : ndarray, [n_samples, n_features]

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : object

Estimator instance.
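
The <component>__<parameter> naming convention can be illustrated with a small, self-contained sketch. The function split_nested_params is a hypothetical name used only for illustration; it shows how such keys can be grouped by component and is not scikit-learn's own implementation.

```python
def split_nested_params(params):
    """Group 'component__parameter' keys by component.

    Keys without a '__' delimiter belong to the estimator itself and are
    collected under the None key.
    """
    grouped = {}
    for key, value in params.items():
        # Split on the first '__' only, so deeper nesting stays in sub_key.
        name, delim, sub_key = key.partition("__")
        if delim:
            grouped.setdefault(name, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[name] = value
    return grouped
```

For example, a key such as pipe__step__param is assigned to the component pipe with the remaining step__param left for that component to resolve in turn.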

transform(self, values)

Transform values and return the resulting feature matrix

Parameters:
values : array-like, [n_samples]

Returns:
X : sparse matrix, shape [n_samples, n_features]

Raises:
NotFittedError

If the vectorizer has not yet been fitted.