jsonvectorizer.JsonVectorizer
class jsonvectorizer.JsonVectorizer
Class for extracting features from JSON documents.
Parameters: - schema : dict, optional (default={})
A valid JSON schema for initializing the object.
- path : tuple of str, optional (default=('root',))
Path from the top-most node (including the root) to this node.
- tuple_items : bool, optional (default=False)
If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.
Attributes: - path : tuple of str
Path from the top-most node (including the root) to this node.
- tuple_items : bool
If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.
- type : set
Valid data types for documents conforming to the current schema.
- required : set of str
Set of required properties for JSON objects (dictionaries).
- properties : dict
Mapping between property names and JsonVectorizer instances corresponding to different properties in JSON objects (Python dictionaries).
- items : list
JsonVectorizer instances corresponding to different items in JSON arrays.
- feature_names_ : list of str
Array mapping from feature integer indices to feature names.
Methods
- extend() : Extend the schema to conform to the provided documents
- find_nodes() : Find nodes that match any of the provided regular expressions
- fit() : Fit vectorizer to the provided data
- prune() : Prune the learned schema using the provided rules
- transform() : Transform JSON documents to feature matrix
-
extend()
Extend the schema to conform to the provided documents.
Parameters: - docs : iterable object
Iterable containing JSON documents.
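As a rough illustration of what extending a schema involves, the sketch below walks decoded JSON documents and records which JSON types appear at each node path. It is not the library's implementation: `json_type`, `collect_types`, and the `"items"` and index-based names used for array children are hypothetical, chosen only to contrast `tuple_items=False` with `tuple_items=True`.

```python
from collections import Counter, defaultdict


def json_type(value):
    """Return the JSON type name for a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError("not a JSON value: {!r}".format(value))


def collect_types(docs, tuple_items=False):
    """Count the JSON types observed at every node path (hypothetical helper)."""
    counts = defaultdict(Counter)

    def walk(value, path):
        counts[path][json_type(value)] += 1
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + (key,))
        elif isinstance(value, list):
            for index, child in enumerate(value):
                # Assumed naming: one shared child node, or one node per index.
                name = str(index) if tuple_items else "items"
                walk(child, path + (name,))

    for doc in docs:
        walk(doc, ("root",))
    return counts
```

With `tuple_items=False`, every element of an array contributes to a single shared child node, so mixed element types accumulate there. With `tuple_items=True`, each index gets its own node and its own type counts.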
-
find_nodes()
Find nodes that match any of the provided regular expressions.
Parameters: - patterns : str or list of str
Regular expression(s) to match nodes against.
Returns: - paths : list of tuple
List of paths for matching nodes. Each item is a tuple, containing the path from the top-most node (including the root) to a matching node.
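The following sketch shows one way such a lookup could work over a plain list of path tuples. It assumes that a node is matched by its colon-joined name with the root omitted (e.g., 'foo:bar'), and that patterns are applied with `re.search`; both are assumptions, and `find_matching_paths` is a hypothetical helper, not part of the library.

```python
import re


def find_matching_paths(paths, patterns):
    """Return the path tuples whose joined name matches any pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    compiled = [re.compile(pattern) for pattern in patterns]

    matches = []
    for path in paths:
        # Assumption: the leading 'root' element is not part of the name.
        name = ":".join(path[1:])
        if any(regex.search(name) for regex in compiled):
            matches.append(path)
    return matches
```

Accepting either a single string or a list mirrors the `str or list of str` parameter type documented above.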
-
fit()
Fit vectorizer to the provided data.
For each node, the first matching vectorizer is used to extract features from the node. Each item of vectorizers must be a dictionary containing the following fields:
- vectorizer : Class for extracting features. For currently supported classes, see jsonvectorizer.vectorizers.
- type (str or list of str, optional) : Data type(s) that can be used with vectorizer. If not provided, matches all supported data types: {'object', 'array', 'null', 'boolean', 'integer', 'number', 'string'}.
- patterns (str or list of str, optional) : When provided, nodes that do not match this (at least one of these) regular expression(s) will be skipped over.
- args (list, optional) : Positional arguments passed to vectorizer for initialization.
- kwargs (dict, optional) : Keyword arguments passed to vectorizer for initialization.
Parameters: - docs : iterable object, optional (default=[])
Iterable containing JSON documents for learning a schema and fitting vectorizers. Alternatively, the extend() method can be used for this step.
- vectorizers : list of dict, optional (default=[])
List of vectorizer definitions (see above for details) for extracting features from individual nodes.
- ignore_patterns : str or list of str, optional (default=[])
Node paths that match this (at least one of these) regular expression(s) will be ignored. Node names in a path are separated by colons, e.g., ‘foo:bar’.
Returns: - self
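The "first matching vectorizer" rule described above can be sketched as a simple ordered search over the definitions. This is an illustration under stated assumptions, not the library's code: `first_matching_definition` is a hypothetical helper, node names are assumed to be colon-joined strings tested with `re.search`, and the 'vectorizer' values in the usage example are placeholder strings standing in for classes from jsonvectorizer.vectorizers.

```python
import re

# Data types listed in the documentation above as the default for 'type'.
ALL_TYPES = {"object", "array", "null", "boolean", "integer", "number", "string"}


def first_matching_definition(node_name, node_type, definitions):
    """Return the first definition that accepts this node, or None."""
    for definition in definitions:
        types = definition.get("type", ALL_TYPES)
        if isinstance(types, str):
            types = [types]
        if node_type not in types:
            continue

        patterns = definition.get("patterns")
        if patterns is not None:
            if isinstance(patterns, str):
                patterns = [patterns]
            # Skip the node unless at least one pattern matches its name.
            if not any(re.search(pattern, node_name) for pattern in patterns):
                continue

        return definition
    return None


# Order matters: specific definitions go first, catch-alls last.
definitions = [
    {"vectorizer": "A", "type": "string", "patterns": "^name"},
    {"vectorizer": "B", "type": ["string", "integer"]},
    {"vectorizer": "C"},
]
```

Because the search stops at the first hit, a definition without `type` or `patterns` acts as a catch-all and is best placed at the end of the list.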
-
prune()
Prune the learned schema using the provided rules.
Parameters: - patterns : str or list of str, optional (default=[])
Drops node paths that match this (at least one of these) regular expression(s). Node names in a path are separated by colons, e.g., ‘foo:bar’.
- min_f : int or float, optional (default=1)
For all nodes in the learned schema, removes data types with fewer than this many collected samples. An integer is taken as an absolute count, and a float is taken as a proportion of all documents. If all data types in a node are removed, the node itself is dropped from the schema.
Returns: - paths : list of str
List of node paths that were dropped, e.g., ‘foo:bar’ if a node is dropped, or ‘foo:bar -> string’ if a specific data type is removed.
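To make the `min_f` rule concrete, the sketch below applies it to a plain dictionary of per-node type counts. It is a simplified stand-in rather than the library's implementation: `prune_types` and its input layout are hypothetical, and reporting only the node name when every type is removed is an assumption.

```python
def prune_types(type_counts, n_docs, min_f=1):
    """Drop rare data types; drop nodes left with no data types.

    type_counts maps a node name such as 'foo:bar' to {type: sample count}.
    """
    # A float is a proportion of all documents, an int an absolute count.
    threshold = min_f * n_docs if isinstance(min_f, float) else min_f

    kept = {}
    dropped = []
    for name, counts in type_counts.items():
        survivors = {t: c for t, c in counts.items() if c >= threshold}
        if not survivors:
            dropped.append(name)  # the whole node goes
            continue
        kept[name] = survivors
        for data_type in counts:
            if data_type not in survivors:
                dropped.append("{} -> {}".format(name, data_type))
    return kept, dropped
```

For example, with 10 documents, `min_f=2` and `min_f=0.2` give the same threshold of two samples.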
-
transform()
Transform JSON documents to feature matrix.
Parameters: - docs : iterable object
Iterable containing JSON documents.
Returns: - X : sparse LIL matrix, [n_samples, n_features]
Feature matrix.