jsonvectorizer.JsonVectorizer

class jsonvectorizer.JsonVectorizer

Class for extracting features from JSON documents

Parameters:

schema : dict, optional (default={}): A valid JSON schema for initializing the object.
path : tuple of str, optional (default=('root',)): Path from the top-most node (including the root) to this node.
tuple_items : bool, optional (default=False): If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.

Attributes:

path : tuple of str: Path from the top-most node (including the root) to this node.
tuple_items : bool: If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.
type : set: Valid data types for documents conforming to the current schema.
required : set of str: Set of required properties for JSON objects (dictionaries).
properties : dict: Mapping from property names to the JsonVectorizer instances that correspond to those properties in JSON objects (Python dictionaries).
items : list: JsonVectorizer instances corresponding to different items in JSON arrays.
feature_names_ : list of str: Array mapping from feature integer indices to feature names.
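
The effect of tuple_items on the items attribute can be illustrated with a stand-alone sketch. It does not use jsonvectorizer: json_type and item_types are hypothetical helpers, and plain sets of type names stand in for the JsonVectorizer instances that the real items list holds.

```python
def json_type(value):
    """Map a decoded JSON value to its JSON type name."""
    if value is None:
        return 'null'
    if isinstance(value, bool):  # checked before int: bool subclasses int
        return 'boolean'
    if isinstance(value, int):
        return 'integer'
    if isinstance(value, float):
        return 'number'
    if isinstance(value, str):
        return 'string'
    if isinstance(value, dict):
        return 'object'
    if isinstance(value, (list, tuple)):
        return 'array'
    raise TypeError('not a JSON value: {!r}'.format(value))


def item_types(arrays, tuple_items=False):
    """Collect the item types seen in a collection of JSON arrays.

    Hypothetical helper. With tuple_items=False all items share one schema,
    so a single merged set is returned. With tuple_items=True every index
    gets its own set.
    """
    if not tuple_items:
        merged = set()
        for array in arrays:
            merged.update(json_type(item) for item in array)
        return [merged]
    per_index = []
    for array in arrays:
        for i, item in enumerate(array):
            if i == len(per_index):
                per_index.append(set())
            per_index[i].add(json_type(item))
    return per_index


arrays = [[1, 'a'], [2, 'b', True]]
shared = item_types(arrays)                     # one schema for all items
per_index = item_types(arrays, tuple_items=True)  # one schema per position
```

Here shared holds a single set with all three type names, while per_index holds one single-type set for each of the three positions.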

Methods

`extend`()	Extend the schema to conform to the provided documents
`find_nodes`()	Find nodes that match any of the provided regular expressions
`fit`()	Fit vectorizer to the provided data
`prune`()	Prune the learned schema using the provided rules
`transform`()	Transform JSON documents to a feature matrix

extend()

Extend the schema to conform to the provided documents

Parameters:

docs : iterable object: Iterable containing JSON documents.

find_nodes()

Find nodes that match any of the provided regular expressions

Parameters:

patterns : str or list of str: Regular expression(s) to match nodes against.

Returns:

paths : list of tuple: List of paths for matching nodes. Each item is a tuple containing the path from the top-most node (including the root) to a matching node.
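
As a rough illustration of the return format, the sketch below filters a list of path tuples with regular expressions. It does not call jsonvectorizer: path_to_str and find_matching are hypothetical helpers, and two details are assumptions rather than documented behavior, namely that patterns are applied with re.search to a colon-joined path string and that the leading 'root' entry is left out of that string.

```python
import re


def path_to_str(path):
    """Join node names with colons, leaving out the leading 'root' entry."""
    return ':'.join(path[1:])


def find_matching(paths, patterns):
    """Return the path tuples that match at least one pattern.

    Hypothetical helper, not the library's find_nodes().
    """
    if isinstance(patterns, str):
        patterns = [patterns]
    compiled = [re.compile(p) for p in patterns]
    return [
        path for path in paths
        if any(regex.search(path_to_str(path)) for regex in compiled)
    ]


paths = [
    ('root',),
    ('root', 'foo'),
    ('root', 'foo', 'bar'),
    ('root', 'baz'),
]
matches = find_matching(paths, ['^foo:', 'baz$'])
```

With these inputs, matches contains ('root', 'foo', 'bar') and ('root', 'baz'); each result is still a full tuple that starts at the root.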

fit()

Fit vectorizer to the provided data

For each node, the first matching vectorizer is used to extract features from the node. Each item of vectorizers must be a dictionary containing the following fields:

vectorizer : Class for extracting features. For currently supported classes, see jsonvectorizer.vectorizers.
type (str or list of str, optional) : Data type(s) that can be used with vectorizer. If not provided, matches all supported data types: {'object', 'array', 'null', 'boolean', 'integer', 'number', 'string'}.
patterns (str or list of str, optional) : When provided, nodes that match none of these regular expressions are skipped.
args (list, optional) : Positional arguments passed to vectorizer for initialization.
kwargs (dict, optional) : Keyword arguments passed to vectorizer for initialization.
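
The first-match rule can be illustrated with a small stand-alone sketch. It does not use jsonvectorizer: pick_definition and the two stand-in classes are hypothetical, and matching patterns with re.search against a colon-joined node path is an assumption, not the library's confirmed implementation.

```python
import re

ALL_TYPES = {'object', 'array', 'null', 'boolean', 'integer', 'number', 'string'}


def as_list(value):
    """Wrap a single string in a list; turn other iterables into a list."""
    return [value] if isinstance(value, str) else list(value)


def pick_definition(definitions, node_path, node_type):
    """Return the first definition whose type and patterns accept the node.

    Hypothetical helper that mimics the documented first-match rule only.
    """
    for definition in definitions:
        types = set(as_list(definition.get('type', ALL_TYPES)))
        if node_type not in types:
            continue
        patterns = as_list(definition.get('patterns', []))
        # Assumption: a pattern matches if it is found anywhere in the path.
        if patterns and not any(re.search(p, node_path) for p in patterns):
            continue
        return definition
    return None


class CategoryVectorizer:  # stand-in class, not part of jsonvectorizer
    pass


class TokenVectorizer:  # stand-in class, not part of jsonvectorizer
    pass


definitions = [
    {'vectorizer': CategoryVectorizer, 'type': 'string', 'patterns': '^country$'},
    {'vectorizer': TokenVectorizer, 'type': 'string'},
]

chosen = pick_definition(definitions, 'country', 'string')   # first entry wins
fallback = pick_definition(definitions, 'title', 'string')   # falls through
unmatched = pick_definition(definitions, 'age', 'integer')   # no definition
```

Because order decides which definition wins, narrow definitions (with patterns) would normally be listed before general catch-all ones.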

Parameters:

docs : iterable object, optional (default=[]): Iterable containing JSON documents for learning a schema and fitting vectorizers. Alternatively, the extend() method can be used for this step.
vectorizers : list of dict, optional (default=[]): List of vectorizer definitions (see above for details) for extracting features from individual nodes.
ignore_patterns : str or list of str, optional (default=[]): Node paths that match at least one of these regular expressions are ignored. Node names in a path are separated by colons, e.g., 'foo:bar'.

Returns:

self

prune()

Prune the learned schema using the provided rules

Parameters:

patterns : str or list of str, optional (default=[]): Node paths that match at least one of these regular expressions are dropped. Node names in a path are separated by colons, e.g., 'foo:bar'.
min_f : int or float, optional (default=1): For all nodes in the learned schema, removes data types with fewer than this many collected samples. An integer is taken as an absolute count, and a float as a proportion of all documents. If all data types in a node are removed, the node itself is dropped from the schema.

Returns:

paths : list of str: List of node paths that were dropped, e.g., 'foo:bar' if a node is dropped, or 'foo:bar -> string' if a specific data type is removed.
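
The min_f rule and the format of the returned entries can be sketched for a single node as follows. This is an illustration only: prune_node is a hypothetical helper, the sample counts are made up, and reporting just the node path when every data type is removed is an assumption about the output.

```python
def prune_node(path, type_counts, min_f, n_docs):
    """Apply the documented min_f rule to one node.

    Hypothetical helper: returns the surviving type counts and the report
    entries for whatever was removed.
    """
    # An integer is an absolute count; a float is a proportion of all documents.
    threshold = min_f * n_docs if isinstance(min_f, float) else min_f
    kept = {t: n for t, n in type_counts.items() if n >= threshold}
    removed = sorted(t for t in type_counts if t not in kept)
    if not kept:
        # Every data type was removed, so the node itself is dropped.
        # Assumption: only the node path is reported in that case.
        return kept, [path]
    return kept, ['{} -> {}'.format(path, t) for t in removed]


# Proportional threshold: 0.1 of 100 documents means at least 10 samples.
kept, dropped = prune_node(
    'foo:bar', {'string': 95, 'null': 5}, min_f=0.1, n_docs=100)

# Absolute threshold: the only data type has 1 sample, fewer than 2.
kept_none, dropped_node = prune_node(
    'foo:baz', {'integer': 1}, min_f=2, n_docs=100)
```

In the first call only the rare null type is removed, so the entry names the type; in the second call nothing survives, so the node path is reported on its own.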

transform()

Transform JSON documents to a feature matrix

Parameters:

docs : iterable object: Iterable containing JSON documents.

Returns:

X : sparse LIL matrix, [n_samples, n_features]: Feature matrix.