API

The two core classes needed to work with pandas-ml-utils are FeaturesAndLabels and a subclass of Model. After a model is fitted, you get back a pandas_ml_utils.model.fitting.fit.Fit object.

Methods available via pandas DataFrames

The general pattern is:

import pandas as pd
import pandas_ml_utils as pmu
from sklearn.neural_network import MLPClassifier
from hyperopt import hp, fmin

# df is your DataFrame; fAndL_v1 is assumed to be a FeaturesAndLabels instance defined beforehand
fit = df.fit(pmu.SkModel(MLPClassifier(activation='tanh', random_state=42), fAndL_v1),
             test_size=0.2,
             hyper_parameter_space={'learning_rate_init': hp.uniform('learning_rate_init', 0.0001, 0.01),
                                    'alpha': hp.uniform('alpha', 0.00001, 0.01),
                                    # each choice is a tuple of layer sizes; (110,) is a single hidden layer
                                    'hidden_layer_sizes': hp.choice('hidden_layer_sizes', [(60, 50), (110,), (30, 30, 30)]),
                                    'early_stopping': True,
                                    'max_iter': 50,
                                    '__max_evals': 100})

Where the available arguments are:

pandas_ml_utils.fit(df, model_provider, test_size=0.4, youngest_size=None, cross_validation=None, test_validate_split_seed=42, hyper_parameter_space=None)
Parameters
  • df (DataFrame) – the DataFrame you apply this function to

  • model_provider (Callable[[int], Model]) – a callable which provides a new Model instance, e.g. one for each set of hyper parameters if hyper parameter tuning is used. Usually all Model subclasses implement __call__, so each model is a provider of itself

  • test_size (float) – the fraction [0, 1] of random samples which are used for a test set

  • youngest_size (Optional[float]) – the fraction [0, 1] of the test samples which are not random but are the youngest

  • cross_validation (Optional[Tuple[int, Callable[[ndarray, ndarray], Tuple[ndarray, ndarray]]]]) – tuple of number of epochs for each fold provider and a cross validation provider

  • test_validate_split_seed – seed to use if the train/test split needs to be reproducible. A magic seed ‘youngest’ is available, which simply uses the youngest data as test data

  • hyper_parameter_space (Optional[Dict]) – space of hyper parameters passed as kwargs to your model provider
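How test_size and youngest_size interact can be pictured with a small index-splitting sketch. This is not the library's implementation: the helper name split_indices and the assumption that rows are ordered from oldest to newest are purely illustrative. The sketch reserves a fraction of the rows for testing, takes the youngest_size share of those from the end of the data, and draws the rest at random with a fixed seed.

```python
import random


def split_indices(n_rows, test_size=0.4, youngest_size=None, seed=42):
    """Hypothetical helper: return (train_idx, test_idx) as sorted lists of row positions."""
    n_test = int(round(n_rows * test_size))
    # Share of the test set that is taken from the newest rows instead of at random.
    n_youngest = int(round(n_test * youngest_size)) if youngest_size else 0
    youngest = list(range(n_rows - n_youngest, n_rows))

    # Draw the remaining test rows at random from all older rows.
    older = list(range(n_rows - n_youngest))
    rng = random.Random(seed)
    random_part = rng.sample(older, n_test - n_youngest)

    test_idx = sorted(random_part + youngest)
    test_set = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in test_set]
    return train_idx, test_idx


train_idx, test_idx = split_indices(100, test_size=0.2, youngest_size=0.5)
print(len(train_idx), len(test_idx))
```

With 100 rows, a test_size of 0.2 and a youngest_size of 0.5, ten of the twenty test rows are the last ten rows and the other ten are sampled from the remaining ninety.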

Return type

Fit

Returns

returns a pandas_ml_utils.model.fitting.fit.Fit object

Model

class pandas_ml_utils.Model(features_and_labels, summary_provider=<class 'pandas_ml_utils.summary.summary.Summary'>, **kwargs)

Represents a statistical or ML model and holds the necessary information on how to interpret the columns of a pandas DataFrame ( FeaturesAndLabels ). Currently available implementations are:

  • SkModel - provide any scikit-learn classifier or regressor

  • KerasModel - provide a function returning a compiled keras model

  • MultiModel - provide a model which will be copied (and fitted) for each provided target

__init__(features_and_labels, summary_provider=<class 'pandas_ml_utils.summary.summary.Summary'>, **kwargs)

All implementations of Model need to pass two arguments to super().__init__().

Parameters
  • features_and_labels (FeaturesAndLabels) – the FeaturesAndLabels object defining all the features, feature engineerings and labels

  • summary_provider (Callable[[DataFrame], Summary]) – a summary provider in the simplest case just holds a pd.DataFrame containing all the labels and all the predictions and optionally loss and target values. Since constructors are callables as well, it is usually enough to just pass the type, e.g. summary_provider=BinaryClassificationSummary

  • kwargs

fit(x, y, x_val, y_val, sample_weight_train, sample_weight_test)

Function called to fit the model.

Parameters
  • x (ndarray) – x

  • y (ndarray) – y

  • x_val (ndarray) – x validation

  • y_val (ndarray) – y validation

  • sample_weight_train (ndarray) – sample weights for loss penalisation (default np.ones)

  • sample_weight_test (ndarray) – sample weights for loss penalisation (default np.ones)

Return type

float

Returns

loss of the fit

static load(filename)

Loads a previously saved model from disk. By default dill is used to serialize / deserialize a model.

Parameters

filename (str) – filename of the serialized model inclusive file extension

Returns

returns a deserialized model

predict(x)

prediction of the model for each target

Parameters

x (ndarray) – x

Return type

ndarray

Returns

prediction of the model for each target

save(filename)

Saves the model to disk.

Parameters

filename (str) – filename inclusive file extension

Returns

None

Lambda Magic: Note that all places which accept a callable perform some argument magic. Usually the first argument is the data frame and the second argument is the target name. By choosing the argument names carefully you have full control over which data your lambda receives. It is possible to inject all members and kwargs of the FeaturesAndLabels class as well as of the Model class. For example, the labels can be injected by using lambda df, _labels: ...
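The name-based injection described above can be approximated with inspect.signature. This is only a sketch of the general idea, not the library's code: call_with_magic and the context keys are hypothetical, matching the target by the parameter name target is an assumption, and the leading-underscore convention is inferred from the lambda df, _labels example.

```python
import inspect


def call_with_magic(func, df, target, **context):
    """Hypothetical helper: call func with only the arguments its signature asks for.

    The first parameter always receives df. Every further parameter is looked up
    by name: 'target' gets the target name, and a name such as '_labels' is
    resolved against the context entry 'labels'.
    """
    params = list(inspect.signature(func).parameters)
    args = [df]
    for name in params[1:]:
        if name == "target":
            args.append(target)
        elif name.lstrip("_") in context:
            args.append(context[name.lstrip("_")])
        else:
            raise TypeError(f"cannot inject argument {name!r}")
    return func(*args)


# Stand-in for a data frame; a dict of columns is enough for this sketch.
frame = {"close": [1.0, 2.0, 3.0], "label": [0, 1, 1]}

# The lambda only asks for the frame and the labels, so nothing else is passed.
picked = call_with_magic(lambda df, _labels: [df[c] for c in _labels],
                         frame, "my_target", labels=["label"])
print(picked)
```

The design point is that the callable declares what it needs through its parameter names, so simple lambdas stay short while more demanding ones can still reach the wider context.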

SkModel

Simply provide the sklearn model, e.g. LogisticRegression, along with the features and labels.

KerasModel

NOTE: with the tensorflow backend, currently only tensorflow 1.* is supported. Supporting tensorflow 2.x requires a TfKeras Model, which still needs to be implemented.

FeaturesAndLabels

class pandas_ml_utils.FeaturesAndLabels(features, labels, label_type=None, gross_loss=None, targets=None, feature_lags=None, feature_rescaling=None, lag_smoothing=None, pre_processor=<function FeaturesAndLabels.<lambda>>, **kwargs)

FeaturesAndLabels is the main object used to hold the context of your problem. Here you define which columns of your DataFrame are features, labels or targets. This class also provides some functionality to generate autoregressive features. By default, lagging features results in an RNN-shaped 3D array (in the input format of keras RNN layers).

__init__(features, labels, label_type=None, gross_loss=None, targets=None, feature_lags=None, feature_rescaling=None, lag_smoothing=None, pre_processor=<function FeaturesAndLabels.<lambda>>, **kwargs)
Parameters
  • features (List[str]) – a list of column names which are used as features for your model

  • labels (Union[List[str], TargetLabelEncoder, Dict[str, Union[List[str], TargetLabelEncoder]], Callable[[Any], Union[List[str], TargetLabelEncoder, Dict[str, Union[List[str], TargetLabelEncoder]]]]]) – a list of column names which are used as labels for your model. You can specify one or more named targets for a set of labels by providing a dict. This is useful if you want to train a MultiModel or if you want to provide extra information about the label, e.g. you want to classify whether a stock price is below or above average and you want to provide what the average was. It is also possible to provide a Callable[[df, …magic], labels] which returns the expected data structure.

  • label_type (Optional[Type]) – whether to treat a label as int, float, bool

  • gross_loss (Optional[Callable[[str, DataFrame], Union[Series, DataFrame]]]) – expects a callable[[df, target, …magic], df] which receives the source data frame and a target (or None) and should return a series or data frame. Let’s say you want to classify whether a printer will jam on the next page or not. Halting and servicing the printer costs 5000 while a jam costs 15000. Your target will be 0 or empty, but in case of misclassification your gross loss will be -5000 for every false alarm (an unnecessary halt, a type I error) and -15000 for every missed jam (a type II error). Another example would be if you want to classify whether a stock price will be above the current price (buy) or not (do nothing). Your target is today’s price and your loss is tomorrow’s price minus today’s price.

  • targets (Optional[Callable[[str, DataFrame], Union[Series, DataFrame]]]) – expects a callable[[df, targets, …magic], df] which receives the source data frame and a target (or None) and should return a series or data frame. In case of multiple targets the series names need to be unique!

  • feature_lags (Optional[Iterable[int]]) – an iterable of integers specifying the lags of an AR model, e.g. [1] for AR(1). If the un-lagged feature is needed as well, also provide a lag of 0, e.g. range(2) for the lags 0 and 1

  • feature_rescaling (Optional[Dict[Tuple[str, …], Tuple[int, …]]]) – allows features to be rescaled. The dict maps a tuple of column names to a target range

  • lag_smoothing (Optional[Dict[int, Callable[[Series], Series]]]) – very long lags in an AR model can be a bit fuzzy, so it is possible to smooth lags, e.g. by using moving averages. The key is the lag length at which a smoothing function starts to be applied

  • pre_processor (Callable[[DataFrame, Dict], DataFrame]) – provide a callable[[df, …magic], df] returning a possibly augmented data frame from a given source data frame and self.kwargs. This is useful if you have e.g. data cleaning tasks. This way you can apply the model directly to the raw data.

  • kwargs – maybe you want to pass some extra parameters to a callable you have provided
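The stock example given for gross_loss above (tomorrow's price minus today's price) can be sketched in a few lines. The sketch is independent of pandas-ml-utils and works on a plain list; the function name next_day_change and the sample prices are made up for illustration.

```python
def next_day_change(prices):
    """Hypothetical gross-loss helper: tomorrow's price minus today's price.

    The last day has no successor, so its value is None.
    """
    changes = [later - today for today, later in zip(prices, prices[1:])]
    return changes + [None]


close = [100.0, 102.5, 101.0, 104.0]
print(next_day_change(close))
```

With a pandas DataFrame the same series would typically be written as df["close"].shift(-1) - df["close"], where the column name close is again only an assumption.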

len_features()

Returns the length of the defined features, the number of lags used and the total number of all features * lags

Return type

Tuple[int, …]

Returns

tuple of (#features, #lags, #features * #lags)

len_labels()

Returns the number of labels

Return type

int

Returns

number of labels

property shape

Returns the shape of the features and labels as they get passed to the Model. If lagging is used, the features shape is in Keras RNN form.

Return type

Tuple[Tuple[int, …], Tuple[int, …]]

Returns

a tuple of (features.shape, labels.shape)
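To illustrate what the Keras RNN form means for lagged features, the following sketch builds the nested (samples, timesteps, features) structure by hand. It is an illustration under stated assumptions rather than the library's algorithm: the function names lag_features and shape_of, the oldest-to-newest row order, and the choice to drop rows without a full history are all hypothetical.

```python
def lag_features(rows, lags):
    """Hypothetical helper: turn 2D rows into a 3D (samples, timesteps, features) list.

    rows is a list of feature vectors ordered oldest to newest; lags is an
    iterable of non-negative integers. Rows without enough history are dropped.
    """
    lags = list(lags)
    max_lag = max(lags)
    samples = []
    for t in range(max_lag, len(rows)):
        # One timestep per lag, each holding the feature vector from t - lag.
        samples.append([rows[t - lag] for lag in lags])
    return samples


def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]   # 5 rows, 2 features
lagged = lag_features(rows, lags=range(3))             # lags 0, 1 and 2
print(shape_of(lagged))
```

Here two features and three lags give 2 * 3 = 6 values per sample, which corresponds to the (#features, #lags, #features * #lags) tuple described for len_features().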

Fit

class pandas_ml_utils.model.fitting.fit.Fit(model, training_summary, test_summary, trails)

After a model is fitted it gets embedded into this class along with some Summary statistics. When a Fit is displayed in a notebook, the _repr_html_ methods of the Fit and Summary objects are called.

save_model(filename)

Save the fitted model.

Parameters

filename (str) – filename

Returns

None

trails()

In case of hyper parameter optimization, a Trials object as used by Hyperopt is available.

Returns

Trials object

values()
Returns

returns the fitted model, a Summary on the training data, a Summary on the test data

Summary

class pandas_ml_utils.summary.summary.Summary(df, **kwargs)

Summary objects are used to visually present the results of a model fitted with df.fit or of a df.backtest. All implementations of Summary need to:

  • pass a pd.DataFrame to super().__init__()

  • implement _repr_html_()
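The two requirements can be sketched without the library installed. In this sketch SummaryStandIn is a simplified, hypothetical stand-in for the real Summary base class, HitRateSummary and the column names label and prediction are invented, and a dict of columns takes the place of a pd.DataFrame.

```python
class SummaryStandIn:
    """Simplified stand-in for the Summary base class; hypothetical, for illustration."""

    def __init__(self, df, **kwargs):
        self.df = df
        self.kwargs = kwargs


class HitRateSummary(SummaryStandIn):
    """Reports the share of rows where prediction and label agree."""

    def __init__(self, df, **kwargs):
        # Requirement 1: hand the frame to the base class.
        super().__init__(df, **kwargs)

    def hit_rate(self):
        pairs = list(zip(self.df["label"], self.df["prediction"]))
        return sum(label == pred for label, pred in pairs) / len(pairs)

    def _repr_html_(self):
        # Requirement 2: notebooks call this method to render the object.
        return f"<p>hit rate: {self.hit_rate():.2f}</p>"


# A dict of columns stands in for a pd.DataFrame here.
frame = {"label": [1, 0, 1, 1], "prediction": [1, 0, 0, 1]}
print(HitRateSummary(frame)._repr_html_())
```

Because constructors are callables, a class like this could be handed over by type wherever a summary provider is expected, in the same way as summary_provider=BinaryClassificationSummary.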