Utils
- class datatoolkit.utils.Group(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')
- summarize()
Calculates summary statistics in each bin
- Returns
Statistics summary
- Return type
(pd.DataFrame)
- class datatoolkit.utils.MostFrequent(data: collections.abc.Iterable, top_pct_obs: float = 0.8, top_pct_cat: float = 0.2)
Truncates data according to the proportion of a categorical column
- Parameters
data (Iterable) – Array of categorical values
top_pct_obs (float, optional) – Percentage of observations to use. Defaults to 0.8.
top_pct_cat (float, optional) – Percentage of categories to use. Defaults to 0.2.
- Returns
1D array with the most frequent categories, and summary statistics
- Return type
(Iterable, pd.DataFrame)
References
[1] https://hsteinshiromoto.github.io/posts/2020/06/25/find_row_closest_value_to_input
Example
>>> import functools
>>> import operator
>>> from datatoolkit.utils import MostFrequent
>>> x = [[i]*j for j,i in zip([50, 25, 12, 6, 3, 2, 2], range(1, 7))]
>>> x = functools.reduce(operator.iconcat, x, []) # Flatten the list
>>> mf = MostFrequent(x)
>>> output, stats = mf()
>>> print(stats[["category", "n_observations_proportions", "cum_n_observations_proportions"]])
           category  n_observations_proportions  cum_n_observations_proportions
0                 1                    0.510204                        0.510204
1                 2                    0.255102                        0.765306
2  other categories                    0.489796                        1.000000
- fit()
Builds a statistical summary of the data
- Returns
(None)
- transform()
Locates the category closest to the top_pct_cat proportion, or the observation closest to the top_pct_obs proportion
- Raises
ValueError – Raised if the float arguments are not positive
- Returns
Categories and summary data set
- Return type
(pd.DataFrame)
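The proportions printed in the MostFrequent example can be reproduced with the standard library alone. The following is a minimal sketch, not the library's implementation: the helper names cumulative_proportions and keep_closest_to are hypothetical, only the top_pct_obs side is covered, and the cutoff rule (keep categories up to the one whose cumulative share is closest to top_pct_obs) and the range check are assumptions inferred from the description of transform().

```python
from collections import Counter
from itertools import accumulate


def cumulative_proportions(values):
    """Return (category, proportion, cumulative proportion) rows, most frequent first."""
    counts = Counter(values).most_common()
    total = sum(n for _, n in counts)
    proportions = [n / total for _, n in counts]
    cumulative = list(accumulate(proportions))
    return [(cat, p, c) for (cat, _), p, c in zip(counts, proportions, cumulative)]


def keep_closest_to(rows, top_pct_obs=0.8):
    """Keep categories up to the one whose cumulative share is closest to top_pct_obs.

    Assumption: this cutoff rule is inferred from the docstring, not taken from the source code.
    """
    if not 0 < top_pct_obs <= 1:
        raise ValueError("top_pct_obs must be in (0, 1]")
    cutoff = min(range(len(rows)), key=lambda i: abs(rows[i][2] - top_pct_obs))
    return [cat for cat, _, _ in rows[: cutoff + 1]]


# Same data as the documented example: 50 ones, 25 twos, 12 threes, 6 fours, 3 fives, 2 sixes
values = [cat for cat, n in zip(range(1, 7), [50, 25, 12, 6, 3, 2]) for _ in range(n)]
rows = cumulative_proportions(values)
kept = keep_closest_to(rows)
```

With this data the first two rows carry proportions of about 0.510204 and 0.255102, and the cumulative share after category 2 (about 0.765306) is the one closest to 0.8, so kept holds categories 1 and 2.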
- class datatoolkit.utils.Quantize(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')
Quantize data frame
Example
>>> import numpy as np
>>> import pandas as pd
>>> from datatoolkit.utils import Quantize
>>> data = pd.DataFrame(np.random.rand(10), columns=["A"])
>>> quantized_data = Quantize(data=data, feature="A")
>>> _ = quantized_data()
>>> _ = quantized_data.summarize()
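For readers who want to see what binning a numeric column and summarizing each bin can look like in plain pandas, here is a minimal sketch. It assumes that bins='auto' behaves like NumPy's automatic bin-edge selection, which the source does not confirm. The column name A_bin and the chosen statistics are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"A": rng.random(10)})

# Assumption: "auto" is resolved with NumPy's automatic bin-edge estimator
edges = np.histogram_bin_edges(data["A"].to_numpy(), bins="auto")

# Assign each value to a bin; include_lowest keeps the minimum in the first bin
data["A_bin"] = pd.cut(data["A"], bins=edges, include_lowest=True)

# Summary statistics per bin (the set of statistics here is illustrative)
summary = data.groupby("A_bin", observed=False)["A"].agg(["count", "mean", "min", "max"])
```

Every observation lands in exactly one bin, so the count column sums to the number of rows.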
- class datatoolkit.utils.QuantizeDatetime(feature: str, data: pandas.core.frame.DataFrame, secondary_feature: str = None, bins: Union[collections.abc.Sequence, str, int] = 'auto')
Quantize datetime data frame
Example
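>>> from datetime import datetime, timedelta
>>> import numpy as np
>>> import pandas as pd
>>> from datatoolkit.utils import QuantizeDatetime
>>> data = pd.DataFrame(np.arange(datetime(1985,7,1), datetime(2015,7,1), timedelta(days=1)).astype(datetime), columns=["A"])
>>> quantized_data = QuantizeDatetime(data=data, feature="A", bins="M")
>>> _ = quantized_data("count")
>>> _ = quantized_data.summarize()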
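A rough pandas-only equivalent of counting observations per month is sketched below. It assumes that bins="M" means calendar-month bins, which the example suggests but the source does not state, and it uses a shorter date range than the example to keep the result easy to check.

```python
import pandas as pd

# Three months of daily timestamps (shorter than the documented example)
data = pd.DataFrame({"A": pd.date_range("1985-07-01", "1985-09-30", freq="D")})

# Assumption: "M" stands for monthly bins; to_period("M") maps each date to its month
monthly_counts = data.groupby(data["A"].dt.to_period("M")).size()
```

The result has one row per month, with 31, 31 and 30 observations for July, August and September 1985.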
- datatoolkit.utils.make_distribution(distribution_name: str, params: dict)
Returns SciPy statistical distribution object
- Parameters
distribution_name (str) – Name of the distribution.
params (dict) – Distribution parameters.
- Returns
SciPy statistical distribution object
- Return type
(scipy.stats distribution object)
Example
>>> from datatoolkit.utils import make_distribution
>>> params = {"loc": 1, "scale": 0.05}
>>> half_norm = make_distribution("halfnorm", params)
>>> half_norm.stats(moments='mvsk')
(array(1.03989423), array(0.00090845), array(0.99527175), array(0.8691773))
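One plausible way to build such an object is to look the name up in scipy.stats and call it with the parameters, which yields a frozen distribution. This is a sketch under that assumption, not the library's code. The helper name build_distribution and its error handling are hypothetical.

```python
from scipy import stats


def build_distribution(distribution_name, params):
    """Hypothetical helper: look up a scipy.stats distribution by name and freeze it."""
    try:
        distribution = getattr(stats, distribution_name)
    except AttributeError as err:
        raise ValueError(f"Unknown distribution: {distribution_name}") from err
    # Calling the distribution with its parameters returns a frozen distribution
    return distribution(**params)


half_norm = build_distribution("halfnorm", {"loc": 1, "scale": 0.05})
mean, variance = half_norm.stats(moments="mv")
```

For a half-normal distribution the mean is loc + scale * sqrt(2 / pi), which gives about 1.03989 here and agrees with the first value in the documented output.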
- datatoolkit.utils.make_graph(nodes: collections.abc.Iterable, M: numpy.ndarray, G: networkx.classes.digraph.DiGraph = <networkx.classes.digraph.DiGraph object>)
Build graph based on list of nodes and a weight matrix
- Parameters
nodes (list) – Graph nodes
M (np.ndarray) – Weight matrix
G (nx.classes.digraph.DiGraph, optional) – Graph type. Defaults to nx.DiGraph().
- Returns
Graph object
- Return type
(nx.classes.digraph.DiGraph by default)
Example
>>> import numpy as np
>>> from datatoolkit.utils import make_graph
>>> n_nodes = 4
>>> M = np.random.rand(n_nodes, n_nodes)
>>> nodes = range(M.shape[0])
>>> G = make_graph(nodes, M)
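The idea of turning a node list and a weight matrix into a graph can be sketched directly with networkx. The helper graph_from_weights below is hypothetical. It assumes that M[i, j] is the weight of the edge from the i-th node to the j-th node and that a zero entry means no edge. Neither convention is confirmed by the source.

```python
import networkx as nx
import numpy as np


def graph_from_weights(nodes, M, G=None):
    """Hypothetical helper: add one weighted edge per non-zero entry of M."""
    # A fresh graph per call avoids sharing state through a mutable default argument
    G = nx.DiGraph() if G is None else G
    nodes = list(nodes)
    G.add_nodes_from(nodes)
    for i, source in enumerate(nodes):
        for j, target in enumerate(nodes):
            # Assumption: a zero weight means "no edge"
            if M[i, j] != 0:
                G.add_edge(source, target, weight=float(M[i, j]))
    return G


M = np.array([[0.0, 1.5], [0.0, 0.0]])
G = graph_from_weights(["a", "b"], M)
```

With a dense random matrix such as the one in the example, every ordered pair of nodes, including self-loops, would receive an edge under these assumptions.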
- datatoolkit.utils.make_pivot(feature: str, index: str, column: str, data: pandas.core.frame.DataFrame, groupby_args: list = None)
Create two types of pivot matrices: count and mean
- Parameters
feature (str) – Feature used as the values of the pivot tables. Must be numeric.
index (str) – Name of rows of the pivot table
column (str) – Name of columns of the pivot table
data (pd.DataFrame) – Data frame containing the data
groupby_args (list, optional) – Arguments passed to groupby. Defaults to None.
- Returns
Pivot tables
- Return type
(pd.DataFrame)
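The two pivot types named above, count and mean, can be illustrated with pandas pivot_table. This is a sketch of the concept rather than of make_pivot itself. The column names store, day and sales are made up for the example, and the role of groupby_args is not covered.

```python
import pandas as pd

# Illustrative data; the column names are not from the library
data = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b", "b"],
        "day": ["mon", "tue", "mon", "mon", "tue"],
        "sales": [10.0, 20.0, 30.0, 50.0, 70.0],
    }
)

# Number of observations per (index, column) cell
count_pivot = data.pivot_table(values="sales", index="store", columns="day", aggfunc="count")

# Mean of the numeric feature per (index, column) cell
mean_pivot = data.pivot_table(values="sales", index="store", columns="day", aggfunc="mean")
```

Store b has two Monday rows (30 and 50), so its Monday cell is 2 in the count pivot and 40.0 in the mean pivot.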