jsonvectorizer.JsonVectorizer

class jsonvectorizer.JsonVectorizer

Class for extracting features from JSON documents

Parameters:
schema : dict, optional (default={})

A valid JSON schema for initializing the object.

path : tuple of str, optional (default=('root',))

Path from the top-most node (including the root) to this node.

tuple_items : bool, optional (default=False)

If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.
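
The two interpretations of tuple_items can be pictured with a small sketch that does not use the library. The helper item_types is hypothetical (it is not part of jsonvectorizer) and records Python type names rather than JSON schema types, so it illustrates the idea only, not the actual schema-learning code.

```python
def item_types(arrays, tuple_items=False):
    """Collect the type names seen in a batch of JSON arrays.

    Hypothetical helper for illustration only, not the library's code.
    """
    if tuple_items:
        # One set of types per index: arrays are treated as tuples.
        per_index = []
        for array in arrays:
            for i, item in enumerate(array):
                if i == len(per_index):
                    per_index.append(set())
                per_index[i].add(type(item).__name__)
        return per_index
    # A single set of types shared by every item in every array.
    shared = set()
    for array in arrays:
        for item in array:
            shared.add(type(item).__name__)
    return [shared]


arrays = [["a", 1], ["b", 2.5]]
as_tuples = item_types(arrays, tuple_items=True)
as_list = item_types(arrays, tuple_items=False)
```

With tuple_items=True the sketch keeps one entry per array index, while the default merges every item into a single entry.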

Attributes:
path : tuple of str

Path from the top-most node (including the root) to this node.

tuple_items : bool

If True, JSON arrays are regarded as tuples with different schemas for each index, otherwise it is assumed that all items conform to the same schema.

type : set

Valid data types for documents conforming to the current schema.

required : set of str

Set of required properties for JSON objects (dictionaries).

properties : dict

Mapping between property names, and JsonVectorizer instances corresponding to different properties in JSON objects (Python dictionaries).

items : list

JsonVectorizer instances corresponding to different items in JSON arrays.

feature_names_ : list of str

Array mapping from feature integer indices to feature names.

Methods

extend()      Extend the schema to conform to the provided documents
find_nodes()  Find nodes that match any of the provided regular expressions
fit()         Fit vectorizer to the provided data
prune()       Prune the learned schema using the provided rules
transform()   Transform JSON documents to feature matrix

extend()

Extend the schema to conform to the provided documents

Parameters:
docs : iterable object

Iterable containing JSON documents.

find_nodes()

Find nodes that match any of the provided regular expressions

Parameters:
patterns : str or list of str

Regular expression(s) to match against node paths.

Returns:
paths : list of tuple

List of paths for matching nodes. Each item is a tuple, containing the path from the top-most node (including the root) to a matching node.
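
A rough sketch of this kind of lookup, written without the library, is shown below. The function find_paths and the nested-dict tree are hypothetical stand-ins for a fitted JsonVectorizer. The sketch assumes that names are joined with colons, that the root name is left out of the matched string, and that patterns are tried with re.search; none of these details is confirmed by this page.

```python
import re


def find_paths(tree, patterns, path=("root",)):
    """Return paths (as tuples) whose colon-joined names match a pattern.

    Hypothetical stand-in for illustration; `tree` is a plain nested dict
    of property names, not a JsonVectorizer instance.
    """
    if isinstance(patterns, str):
        patterns = [patterns]
    # Assumption: the root name is left out of the string that is matched.
    name = ":".join(path[1:])
    matches = []
    if any(re.search(p, name) for p in patterns):
        matches.append(path)
    # Recurse into child nodes, extending the path by one name each time.
    for key, child in tree.items():
        matches.extend(find_paths(child, patterns, path + (key,)))
    return matches


tree = {"foo": {"bar": {}, "baz": {}}, "qux": {}}
paths = find_paths(tree, r"^foo:ba")
```

Each returned item is a tuple that starts at the root, matching the return format described above.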

fit()

Fit vectorizer to the provided data

For each node, the first matching vectorizer is used to extract features from the node. Each item of vectorizers must be a dictionary containing the following fields:

  • vectorizer : Class for extracting features. For currently supported classes, see jsonvectorizer.vectorizers.
  • type (str or list of str, optional) : Data type(s) that can be used with vectorizer. If not provided, matches all supported data types: {'object', 'array', 'null', 'boolean', 'integer', 'number', 'string'}.
  • patterns (str or list of str, optional) : When provided, nodes that match none of these regular expressions are skipped.
  • args (list, optional) : Positional arguments passed to vectorizer for initialization.
  • kwargs (dict, optional) : Keyword arguments passed to vectorizer for initialization.

Parameters:
docs : iterable object, optional (default=[])

Iterable containing JSON documents for learning a schema and fitting vectorizers. Alternatively, the extend() method can be used for this step.

vectorizers : list of dict, optional (default=[])

List of vectorizer definitions (see above for details) for extracting features from individual nodes.

ignore_patterns : str or list of str, optional (default=[])

Node paths that match this (at least one of these) regular expression(s) will be ignored. Node names in a path are separated by colons, e.g., ‘foo:bar’.

Returns:
self
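
The "first matching vectorizer" rule and the layout of a vectorizer definition can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: PlaceholderVectorizer stands in for a class from jsonvectorizer.vectorizers, first_match and as_list are hypothetical helpers, and matching with re.search on a colon-separated path is assumed.

```python
import re

ALL_TYPES = {"object", "array", "null", "boolean", "integer", "number", "string"}


class PlaceholderVectorizer:
    """Stand-in class; real classes live in jsonvectorizer.vectorizers."""

    def __init__(self, *args, **kwargs):
        self.args, self.kwargs = args, kwargs


def as_list(value):
    # Accept either a single string or a list of strings.
    return [value] if isinstance(value, str) else list(value)


def first_match(path, dtype, definitions):
    """Build the vectorizer from the first definition that fits, else None."""
    for d in definitions:
        types = set(as_list(d["type"])) if "type" in d else ALL_TYPES
        if dtype not in types:
            continue
        patterns = as_list(d.get("patterns", []))
        # Assumption: patterns are tried with re.search on the joined path.
        if patterns and not any(re.search(p, path) for p in patterns):
            continue
        return d["vectorizer"](*d.get("args", []), **d.get("kwargs", {}))
    return None


definitions = [
    {"vectorizer": PlaceholderVectorizer, "type": "string",
     "patterns": r"name$", "kwargs": {"label": "names"}},
    {"vectorizer": PlaceholderVectorizer, "type": ["integer", "number"],
     "args": [5]},
    {"vectorizer": PlaceholderVectorizer, "kwargs": {"label": "fallback"}},
]

chosen = first_match("user:name", "string", definitions)
numeric = first_match("user:age", "integer", definitions)
other = first_match("user:bio", "string", definitions)
```

Because the first match wins, specific definitions belong before general ones; the last entry here has neither type nor patterns and so acts as a catch-all.
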

prune()

Prune the learned schema using the provided rules

Parameters:
patterns : str or list of str, optional (default=[])

Drops node paths that match this (at least one of these) regular expression(s). Node names in a path are separated by colons, e.g., ‘foo:bar’.

min_f : int or float, optional (default=1)

For all nodes in the learned schema, removes data types with fewer than this many collected samples. An integer is taken as an absolute count, and a float indicates the proportion of all documents. If all data types in a node are removed, the node itself will be dropped from the schema.

Returns:
paths : list of str

List of node paths that were dropped, e.g., ‘foo:bar’ if a node is dropped, or ‘foo:bar -> string’ if a specific data type is removed.
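
The min_f rule can be sketched in a few lines of plain Python. The function prune_types and its counts argument are hypothetical and are not part of jsonvectorizer. The sketch assumes a float threshold is compared as min_f times the number of documents with no rounding, and the order of the returned entries is likewise an assumption.

```python
def prune_types(counts, n_docs, min_f=1):
    """Drop rare data types and report what was removed.

    Hypothetical helper: `counts` maps a colon-separated node path to a
    dict of {data type: number of samples collected}.
    """
    # An int is an absolute count; a float is a share of all documents.
    threshold = min_f * n_docs if isinstance(min_f, float) else min_f
    dropped = []
    for path in list(counts):
        types = counts[path]
        rare = [t for t, n in types.items() if n < threshold]
        if len(rare) == len(types):
            # Every data type is rare, so the node itself is dropped.
            del counts[path]
            dropped.append(path)
            continue
        for t in rare:
            del types[t]
            dropped.append(f"{path} -> {t}")
    return dropped


counts = {
    "foo": {"object": 100},
    "foo:bar": {"string": 97, "null": 3},
    "foo:baz": {"integer": 2},
}
dropped = prune_types(counts, n_docs=100, min_f=0.05)
```

Here min_f=0.05 over 100 documents gives a threshold of 5 samples, so the rare null type is removed from foo:bar and the foo:baz node is dropped entirely.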

transform()

Transform JSON documents to feature matrix.

Parameters:
docs : iterable object

Iterable containing JSON documents.

Returns:
X : sparse LIL matrix, [n_samples, n_features]

Feature matrix.