jsonvectorizer.JsonVectorizer
class jsonvectorizer.JsonVectorizer
Class for extracting features from JSON documents.
Parameters:
- schema : dict, optional (default={})
A valid JSON schema for initializing the object.
- path : tuple of str, optional (default=('root',))
Path from the top-most node (including the root) to this node.
- tuple_items : bool, optional (default=False)
If True, JSON arrays are treated as tuples with a separate schema for each index; otherwise, all items are assumed to conform to the same schema.
Attributes:
- path : tuple of str
Path from the top-most node (including the root) to this node.
- tuple_items : bool
If True, JSON arrays are treated as tuples with a separate schema for each index; otherwise, all items are assumed to conform to the same schema.
- type : set
Valid data types for documents conforming to the current schema.
- required : set of str
Set of required properties for JSON objects (dictionaries).
- properties : dict
Mapping between property names and JsonVectorizer instances corresponding to different properties in JSON objects (Python dictionaries).
- items : list
JsonVectorizer instances corresponding to different items in JSON arrays.
- feature_names_ : list of str
Array mapping from feature integer indices to feature names.
Methods
- extend() : Extend the schema to conform to the provided documents.
- find_nodes() : Find nodes that match any of the provided regular expressions.
- fit() : Fit vectorizer to the provided data.
- prune() : Prune the learned schema using the provided rules.
- transform() : Transform JSON documents to feature matrix.
extend()
Extend the schema to conform to the provided documents.
Parameters:
- docs : iterable object
Iterable containing JSON documents.
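As a rough illustration of what extending a schema to fit documents involves, the sketch below walks a few documents and records the JSON data types seen at each node path. It is not the library's implementation: the helpers `json_type` and `collect_types` are hypothetical, the `'items'` path component is made up for this example, and array items are assumed to share one schema (the tuple_items=False behavior).

```python
from collections import defaultdict


def json_type(value):
    """Return the JSON type name of a Python value (hypothetical helper)."""
    if value is None:
        return 'null'
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return 'boolean'
    if isinstance(value, int):
        return 'integer'
    if isinstance(value, float):
        return 'number'
    if isinstance(value, str):
        return 'string'
    if isinstance(value, dict):
        return 'object'
    if isinstance(value, (list, tuple)):
        return 'array'
    raise TypeError('not a JSON value: %r' % (value,))


def collect_types(docs):
    """Map each node path (a tuple starting at 'root') to the types seen there."""
    types = defaultdict(set)

    def visit(value, path):
        kind = json_type(value)
        types[path].add(kind)
        if kind == 'object':
            for name, child in value.items():
                visit(child, path + (name,))
        elif kind == 'array':
            # Assumption: all items share one schema; 'items' is a made-up
            # path component used only in this sketch.
            for child in value:
                visit(child, path + ('items',))

    for doc in docs:
        visit(doc, ('root',))
    return dict(types)


docs = [{'foo': {'bar': 1}}, {'foo': {'bar': 'x'}, 'tags': ['a', 'b']}]
schema = collect_types(docs)
```

Here the node `('root', 'foo', 'bar')` ends up with two valid types because the two documents disagree, which is the kind of situation the `type` attribute (a set) can represent.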
find_nodes()
Find nodes that match any of the provided regular expressions.
Parameters:
- patterns : str or list of str
Returns:
- paths : list of tuple
List of paths for matching nodes. Each item is a tuple, containing the path from the top-most node (including the root) to a matching node.
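The matching step can be pictured with the standard `re` module. This is a minimal sketch under stated assumptions, not the library's code: the helper `match_paths` is hypothetical, node names are joined with colons as in the ‘foo:bar’ convention used elsewhere on this page, and the leading 'root' is assumed to be left out of the matched string.

```python
import re


def match_paths(paths, patterns):
    """Return the paths whose colon-joined name matches any pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    compiled = [re.compile(p) for p in patterns]
    matches = []
    for path in paths:
        # Assumption: the leading 'root' is not part of the matched string.
        name = ':'.join(path[1:])
        if any(regex.search(name) for regex in compiled):
            matches.append(path)
    return matches


paths = [('root',), ('root', 'foo'), ('root', 'foo', 'bar'), ('root', 'baz')]
found = match_paths(paths, ['^foo:', 'baz$'])
```

As in the documented return value, each result is a tuple that still includes the root, even though the pattern is tested against the colon-joined name.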
fit()
Fit vectorizer to the provided data.
For each node, the first matching vectorizer is used to extract features from the node. Each item of vectorizers must be a dictionary containing the following fields:
- vectorizer : Class for extracting features. For currently supported classes, see jsonvectorizer.vectorizers.
- type (str or list of str, optional) : Data type(s) that can be used with vectorizer. If not provided, matches all supported data types: {'object', 'array', 'null', 'boolean', 'integer', 'number', 'string'}.
- patterns (str or list of str, optional) : When provided, nodes that do not match at least one of these regular expressions are skipped.
- args (list, optional) : Positional arguments passed to vectorizer for initialization.
- kwargs (dict, optional) : Keyword arguments passed to vectorizer for initialization.
Parameters:
- docs : iterable object, optional (default=[])
Iterable containing JSON documents for learning a schema and fitting vectorizers. Alternatively, the extend() method can be used for this step.
- vectorizers : list of dict, optional (default=[])
List of vectorizer definitions (see above for details) for extracting features from individual nodes.
- ignore_patterns : str or list of str, optional (default=[])
Node paths that match at least one of these regular expressions are ignored. Node names in a path are separated by colons, e.g., ‘foo:bar’.
Returns:
- self
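The "first matching vectorizer" rule for the definitions described above can be sketched in plain Python. The code below is illustrative only: `BoolStub` and `StringStub` are hypothetical stand-ins for real vectorizer classes, `pick_vectorizer` is not part of the library, and patterns are assumed to be tested against the colon-joined node name.

```python
import re


class BoolStub:
    """Hypothetical stand-in for a vectorizer class."""


class StringStub:
    """Hypothetical stand-in that accepts one keyword argument."""

    def __init__(self, min_df=1):
        self.min_df = min_df


definitions = [
    {'vectorizer': BoolStub, 'type': 'boolean'},
    {'vectorizer': StringStub, 'type': 'string',
     'patterns': '^foo', 'kwargs': {'min_df': 2}},
    {'vectorizer': StringStub, 'type': ['string', 'null']},
]


def as_list(value):
    """Accept either a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def pick_vectorizer(node_name, node_type, definitions):
    """Instantiate the first definition that matches the node, else None."""
    for definition in definitions:
        types = definition.get('type')
        if types is not None and node_type not in as_list(types):
            continue  # data type not handled by this definition
        patterns = definition.get('patterns')
        if patterns is not None and not any(
                re.search(p, node_name) for p in as_list(patterns)):
            continue  # node name matches none of the patterns
        args = definition.get('args', [])
        kwargs = definition.get('kwargs', {})
        return definition['vectorizer'](*args, **kwargs)
    return None


first = pick_vectorizer('foo:bar', 'string', definitions)
second = pick_vectorizer('baz', 'string', definitions)
third = pick_vectorizer('baz', 'integer', definitions)
```

Because order decides which definition wins, more specific definitions (those with patterns) are placed before general fallbacks in this sketch.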
prune()
Prune the learned schema using the provided rules.
Parameters:
- patterns : str or list of str, optional (default=[])
Drops node paths that match at least one of these regular expressions. Node names in a path are separated by colons, e.g., ‘foo:bar’.
- min_f : int or float, optional (default=1)
For all nodes in the learned schema, removes data types with fewer than this many collected samples. An integer is taken as an absolute count, and a float indicates a proportion of all documents. If all data types in a node are removed, the node itself is dropped from the schema.
Returns:
- paths : list of str
List of node paths that were dropped, e.g., ‘foo:bar’ if a node is dropped, or ‘foo:bar -> string’ if a specific data type is removed.
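The min_f rule can be worked through on a small example. The following is a sketch of the described behavior, not the library's code: the helper names, the per-type sample counts, and the flat dictionary layout are all assumptions made for illustration.

```python
def min_count(min_f, n_docs):
    """Turn min_f into an absolute sample count (hypothetical helper)."""
    if isinstance(min_f, float):
        return min_f * n_docs  # float: proportion of all documents
    return min_f               # int: absolute count


def prune_types(type_counts, min_f, n_docs):
    """Drop data types seen fewer than the threshold number of times."""
    threshold = min_count(min_f, n_docs)
    kept, dropped = {}, []
    for path, counts in type_counts.items():
        surviving = {t: c for t, c in counts.items() if c >= threshold}
        if surviving:
            kept[path] = surviving
            dropped.extend('%s -> %s' % (path, t)
                           for t in counts if t not in surviving)
        else:
            # Every data type was removed, so the node itself is dropped.
            dropped.append(path)
    return kept, dropped


# Made-up sample counts per data type, collected from 100 documents.
type_counts = {
    'foo:bar': {'string': 95, 'null': 5},
    'foo:baz': {'integer': 2},
}
kept, dropped = prune_types(type_counts, min_f=0.1, n_docs=100)
```

With min_f=0.1 and 100 documents the threshold is 10 samples, so ‘foo:bar’ keeps only its string type and ‘foo:baz’ disappears entirely.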
transform()
Transform JSON documents to feature matrix.
Parameters:
- docs : iterable object
Iterable containing JSON documents.
Returns:
- X : sparse LIL matrix, [n_samples, n_features]
Feature matrix.