Utils

class datatoolkit.utils.Group(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')

summarize()

Calculates summary statistics in each bin

Returns: Statistics summary
Return type: (pd.DataFrame)

class datatoolkit.utils.MostFrequent(data: collections.abc.Iterable, top_pct_obs: float = 0.8, top_pct_cat: float = 0.2)

Truncates categorical data to its most frequent categories, based on the proportion of observations or of categories to keep

Parameters

data (Iterable) – Array of categorical values
top_pct_obs (float, optional) – Proportion of observations to keep. Defaults to 0.8.
top_pct_cat (float, optional) – Proportion of categories to keep. Defaults to 0.2.

Returns

1-D array with the most frequent categories, and a data frame of summary statistics

Return type

(Iterable, pd.DataFrame)

References

[1] https://hsteinshiromoto.github.io/posts/2020/06/25/find_row_closest_value_to_input

Example

>>> import functools
>>> import operator
>>> x = [[i]*j for j,i in zip([50, 25, 12, 6, 3, 2, 2], range(1, 7))]
>>> x = functools.reduce(operator.iconcat, x, []) # Flatten the list
>>> mf = MostFrequent(x)
>>> output, stats = mf()
>>> print(stats[["category", "n_observations_proportions", "cum_n_observations_proportions"]])
           category  n_observations_proportions  cum_n_observations_proportions
0                 1                    0.510204                        0.510204
1                 2                    0.255102                        0.765306
2  other categories                    0.489796                        1.000000
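The per-category proportions in the table above can be approximated with plain pandas. The following is a minimal sketch, not the implementation used by MostFrequent: the variable names are illustrative, and it does not reproduce the "other categories" row.

```python
import pandas as pd

# Same data as the example above: category i appears counts[i - 1] times.
counts = [50, 25, 12, 6, 3, 2]
values = [cat for cat, n in zip(range(1, 7), counts) for _ in range(n)]

# Share of observations per category, most frequent first.
proportions = pd.Series(values).value_counts(normalize=True)
# Running total of those shares.
cumulative = proportions.cumsum()

summary = pd.DataFrame({
    "n_observations_proportions": proportions,
    "cum_n_observations_proportions": cumulative,
})
print(summary.head(2))
```

Sorting by frequency before taking the cumulative sum is what makes a cut-off such as top_pct_obs meaningful: the first rows whose cumulative share reaches the threshold are the "most frequent" categories.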

fit()

Computes a statistical summary of the data

Returns: (None)

transform()

Locates the category closest to the top_pct_cat proportion of categories, or the observation closest to the top_pct_obs proportion of observations

Raises: ValueError – Raised if top_pct_obs or top_pct_cat is not positive
Returns: Categories and summary data set
Return type: (pd.DataFrame)

class datatoolkit.utils.Quantize(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')

Quantizes a feature of a data frame into bins

Example

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(10), columns=["A"])
>>> quantized_data = Quantize(data=data, feature="A")
>>> _ = quantized_data()
>>> _ = quantized_data.summarize()
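To illustrate what binning a feature and summarizing each bin involves, here is a minimal sketch using only NumPy and pandas. It is not the Quantize implementation: interpreting bins='auto' as NumPy's "auto" histogram rule is an assumption, and the chosen statistics are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random(10), columns=["A"])

# Bin edges from NumPy's "auto" rule (assumed meaning of bins='auto').
edges = np.histogram_bin_edges(data["A"], bins="auto")

# Assign each value to an interval; include_lowest keeps the minimum in the first bin.
binned = pd.cut(data["A"], bins=edges, include_lowest=True)

# Summary statistics per non-empty bin.
summary = data.groupby(binned, observed=True)["A"].agg(["count", "mean", "min", "max"])
print(summary)
```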

class datatoolkit.utils.QuantizeDatetime(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')

Quantizes a datetime feature of a data frame into bins

Example

>>> from datetime import datetime, timedelta
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.arange(datetime(1985,7,1), datetime(2015,7,1), timedelta(days=1)).astype(datetime), columns=["A"])
>>> quantized_data = QuantizeDatetime(data=data, feature="A", bins="M")
>>> _ = quantized_data("count")
>>> _ = quantized_data.summarize()
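A monthly count like the one requested above can be sketched with plain pandas by mapping each timestamp to its calendar month. This is an illustration under the assumption that bins="M" means calendar months. It is not the QuantizeDatetime implementation.

```python
import pandas as pd

# Daily timestamps covering the same range as the example above.
data = pd.DataFrame({"A": pd.date_range("1985-07-01", "2015-06-30", freq="D")})

# Map every timestamp to its calendar month, then count rows per month.
months = data["A"].dt.to_period("M")
monthly_counts = data.groupby(months).size()
print(monthly_counts.head())
```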

datatoolkit.utils.make_distribution(distribution_name: str, params: dict)

Returns SciPy statistical distribution object

Parameters

distribution_name (str) – Name of the distribution.
params (dict) – Distribution parameters.

Returns

SciPy statistical distribution object

Example

>>> params = {"loc": 1, "scale": 0.05}
>>> half_norm = make_distribution("halfnorm", params)
>>> half_norm.stats(moments='mvsk')
(array(1.03989423), array(0.00090845), array(0.99527175), array(0.8691773))
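One common way to build such a factory is to look the distribution up by name in scipy.stats and freeze it with the given parameters. The sketch below assumes that approach. The helper name build_distribution is hypothetical, and make_distribution may be implemented differently.

```python
from scipy import stats


def build_distribution(distribution_name, params):
    """Hypothetical helper: look up a scipy.stats distribution by name and freeze it."""
    distribution = getattr(stats, distribution_name)
    # Calling the distribution with keyword arguments returns a frozen distribution.
    return distribution(**params)


half_norm = build_distribution("halfnorm", {"loc": 1, "scale": 0.05})
mean, var = half_norm.stats(moments="mv")
print(float(mean), float(var))
```

A frozen distribution carries its parameters, which is why methods such as stats, pdf, or rvs can be called without repeating loc and scale.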

datatoolkit.utils.make_graph(nodes: collections.abc.Iterable, M: numpy.ndarray, G: networkx.classes.digraph.DiGraph = <networkx.classes.digraph.DiGraph object>)

Build graph based on list of nodes and a weight matrix

Parameters

nodes (list) – Graph nodes
M (np.ndarray) – Weight matrix
G (nx.classes.digraph.DiGraph, optional) – Graph type. Defaults to nx.DiGraph().

Returns: Graph object
Return type: (nx.classes.digraph.DiGraph)

Example

>>> import numpy as np
>>> n_nodes = 4
>>> M = np.random.rand(n_nodes, n_nodes)
>>> nodes = range(M.shape[0])
>>> G = make_graph(nodes, M)
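For readers unfamiliar with building a graph from a weight matrix, the idea can be sketched directly with networkx. The helper name build_weighted_digraph is hypothetical, and adding an edge for every matrix entry (including self-loops and zero weights) is an assumption. make_graph may treat those cases differently.

```python
import networkx as nx
import numpy as np


def build_weighted_digraph(nodes, M):
    """Hypothetical helper: one directed edge per matrix entry, weighted by M[i, j]."""
    nodes = list(nodes)
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    for i, source in enumerate(nodes):
        for j, target in enumerate(nodes):
            G.add_edge(source, target, weight=float(M[i, j]))
    return G


M = np.random.default_rng(0).random((4, 4))
G = build_weighted_digraph(range(M.shape[0]), M)
print(G.number_of_nodes(), G.number_of_edges())
```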

datatoolkit.utils.make_pivot(feature: str, index: str, column: str, data: pandas.core.frame.DataFrame, groupby_args: list = None)

Create two types of pivot matrices: count and mean

Parameters

feature (str) – Feature used as the value of the pivot tables. Must be numeric
index (str) – Name of the column whose values form the rows of the pivot table
column (str) – Name of the column whose values form the columns of the pivot table
data (pd.DataFrame) – Data frame containing the data
groupby_args (list, optional) – Additional arguments passed to groupby. Defaults to None.

Returns

Pivot tables

Return type

(pd.DataFrame)
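The two pivot types described above, count and mean, can be illustrated with pandas' own pivot_table. This is a sketch on made-up data with hypothetical column names. It does not show how make_pivot is implemented or how it uses groupby_args.

```python
import pandas as pd

# Toy data: column names and values are illustrative only.
df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "day": ["mon", "tue", "mon", "mon"],
    "sales": [1.0, 2.0, 3.0, 5.0],
})

# Number of observations of the feature in each (index, column) cell.
count_pivot = df.pivot_table(values="sales", index="store", columns="day", aggfunc="count")
# Mean of the feature in each (index, column) cell.
mean_pivot = df.pivot_table(values="sales", index="store", columns="day", aggfunc="mean")

print(count_pivot)
print(mean_pivot)
```

Cells with no observations, such as store "b" on "tue" here, come out as NaN in both tables.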