API
The two core classes needed to work with pandas-ml-utils are a subclass of Model and FeaturesAndLabels. After a model is fitted you get back a pandas_ml_utils.model.fitting.fit.Fit object.
Methods available via pandas DataFrames
The general pattern is:
import pandas as pd
import pandas_ml_utils as pmu
from sklearn.neural_network import MLPClassifier
from hyperopt import hp

# df is a pandas DataFrame and fAndL_v1 a FeaturesAndLabels instance, both defined elsewhere
fit = df.fit(pmu.SkModel(MLPClassifier(activation='tanh', random_state=42), fAndL_v1),
             test_size=0.2,
             hyper_parameter_space={'learning_rate_init': hp.uniform('learning_rate_init', 0.0001, 0.01),
                                    'alpha': hp.uniform('alpha', 0.00001, 0.01),
                                    'hidden_layer_sizes': hp.choice('hidden_layer_sizes', [(60, 50), (110,), (30, 30, 30)]),
                                    'early_stopping': True,
                                    'max_iter': 50,
                                    '__max_evals': 100})
Where the available arguments are:
pandas_ml_utils.fit(df, model_provider, test_size=0.4, youngest_size=None, cross_validation=None, test_validate_split_seed=42, hyper_parameter_space=None)
- Parameters
df (DataFrame) – the DataFrame you apply this function to
model_provider (Callable[[int], Model]) – a callable which provides a new Model instance, e.g. one for each set of hyperparameters when hyperparameter tuning is used. Usually all Model subclasses implement __call__ and are thus providers of themselves
test_size (float) – the fraction [0, 1] of random samples which are used for a test set
youngest_size (Optional[float]) – the fraction [0, 1] of the test samples which are not random but are the youngest
cross_validation (Optional[Tuple[int, Callable[[ndarray, ndarray], Tuple[ndarray, ndarray]]]]) – tuple of the number of epochs for each fold provider and a cross validation provider
test_validate_split_seed – seed used if the train/test split needs to be reproducible. A magic seed 'youngest' is available, which just uses the youngest data as test data
hyper_parameter_space (Optional[Dict]) – space of hyperparameters passed as kwargs to your model provider
- Return type
Fit
- Returns
a pandas_ml_utils.model.fitting.fit.Fit object
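The interplay of test_size and youngest_size can be pictured with a small, library-independent sketch. The helper split_indices below is hypothetical and not part of pandas-ml-utils. It assumes that rows are ordered from oldest to youngest, and the library's actual splitting logic may differ.

```python
import random

def split_indices(n_samples, test_size=0.4, youngest_size=None, seed=42):
    """Hypothetical helper: return (train, test) index lists.

    Assumes rows are ordered from oldest (index 0) to youngest (last index).
    """
    n_test = int(n_samples * test_size)
    # part of the test set that is taken from the youngest rows instead of at random
    n_youngest = int(n_test * youngest_size) if youngest_size else 0
    youngest = list(range(n_samples - n_youngest, n_samples))

    # the remaining test rows are drawn at random from the older rows
    older = list(range(n_samples - n_youngest))
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    random_test = rng.sample(older, n_test - n_youngest)

    test = sorted(random_test) + youngest
    test_set = set(test)
    train = [i for i in range(n_samples) if i not in test_set]
    return train, test

train, test = split_indices(100, test_size=0.2, youngest_size=0.5)
print(len(train), len(test))
```

Calling the helper twice with the same seed returns the same split, which is the property the test_validate_split_seed argument is meant to provide.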
Model
class pandas_ml_utils.Model(features_and_labels, summary_provider=<class 'pandas_ml_utils.summary.summary.Summary'>, **kwargs)
Represents a statistical or ML model and holds the information necessary to interpret the columns of a pandas DataFrame (FeaturesAndLabels). Currently available implementations are:
SkModel - provide any scikit-learn classifier or regressor
KerasModel - provide a function returning a compiled keras model
MultiModel - provide a model which will be copied (and fitted) for each provided target
__init__(features_and_labels, summary_provider=<class 'pandas_ml_utils.summary.summary.Summary'>, **kwargs)
All implementations of Model need to pass two arguments to super().__init__().
- Parameters
features_and_labels (FeaturesAndLabels) – the FeaturesAndLabels object defining all the features, feature engineering steps and labels
summary_provider (Callable[[DataFrame], Summary]) – a summary provider, which in the simplest case just holds a pd.DataFrame containing all the labels and all the predictions and optionally loss and target values. Since constructors are callables as well, it is usually enough to just pass the type, e.g. summary_provider=BinaryClassificationSummary
kwargs –
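Because a class is itself a callable that returns an instance, passing a type satisfies a Callable[[DataFrame], Summary] contract. The sketch below illustrates this with a hypothetical MySummary class and a plain list standing in for a DataFrame. Neither MySummary nor make_summary is part of pandas-ml-utils.

```python
from typing import Any, Callable

class MySummary:
    """Hypothetical stand-in for a Summary subclass."""
    def __init__(self, df: Any):
        self.df = df  # labels and predictions would live here

def make_summary(provider: Callable[[Any], MySummary], df: Any) -> MySummary:
    # the caller does not care whether `provider` is a function or a class
    return provider(df)

rows = [{"label": 1, "prediction": 1}, {"label": 0, "prediction": 1}]
by_type = make_summary(MySummary, rows)                    # pass the type itself
by_lambda = make_summary(lambda df: MySummary(df), rows)   # equivalent explicit callable
```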
fit(x, y, x_val, y_val, sample_weight_train, sample_weight_test)
Function called to fit the model.
- Parameters
x (ndarray) – x
y (ndarray) – y
x_val (ndarray) – x validation
y_val (ndarray) – y validation
sample_weight_train (ndarray) – sample weights for loss penalisation (default np.ones)
sample_weight_test (ndarray) – sample weights for loss penalisation (default np.ones)
- Return type
float
- Returns
loss of the fit
static load(filename)
Loads a previously saved model from disk. By default dill is used to serialize / deserialize a model.
- Parameters
filename (str) – filename of the serialized model, including the file extension
- Returns
a deserialized model
predict(x)
Prediction of the model for each target.
- Parameters
x (ndarray) – x
- Return type
ndarray
- Returns
prediction of the model for each target
save(filename)
Save the model to disk.
- Parameters
filename (str) – filename, including the file extension
- Returns
None
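The save / load pair is a serialization round trip. The following sketch mirrors that pattern with a hypothetical TinyModel class. As a simplification it uses the standard library's pickle module in place of dill and stores only the model state, so it should be read as an illustration of the pattern, not as the library's implementation.

```python
import os
import pickle
import tempfile

class TinyModel:
    """Hypothetical model with the same save / load shape as the Model class."""
    def __init__(self, weights):
        self.weights = weights

    def save(self, filename):
        # pandas-ml-utils serializes the whole model with dill by default;
        # this sketch pickles only the state
        with open(filename, "wb") as f:
            pickle.dump({"weights": self.weights}, f)

    @staticmethod
    def load(filename):
        with open(filename, "rb") as f:
            state = pickle.load(f)
        return TinyModel(**state)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")  # filename including the extension
    TinyModel([0.1, 0.2]).save(path)
    restored = TinyModel.load(path)

print(restored.weights)
```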
Lambda Magic:
Note that all places which accept a callable perform some argument magic. Usually the first argument
is the data frame and the second argument is the target name. By choosing the argument names you have
full control over which data is passed to your lambda. It is possible to inject all members and kwargs
of the FeaturesAndLabels class as well as those of the Model class. For example, you can inject
the labels by using lambda df, _labels: ...
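One way to picture this argument magic is name-based injection built on inspect.signature. The helper call_with_magic and the context keys below are hypothetical and only illustrate the idea. How pandas-ml-utils resolves the names internally may differ.

```python
import inspect

def call_with_magic(func, df, context):
    """Hypothetical helper: pass `df` first, then fill remaining parameters by name."""
    params = list(inspect.signature(func).parameters)
    # every parameter after the first one is looked up by name in the context
    injected = {name: context[name] for name in params[1:] if name in context}
    return func(df, **injected)

context = {"_labels": ["is_above"], "target": "close", "window": 20}
df = {"close": [1.0, 2.0, 3.0]}  # plain dict standing in for a DataFrame

labels_only = call_with_magic(lambda df, _labels: _labels, df, context)
target_and_window = call_with_magic(lambda df, target, window: (target, window), df, context)
df_only = call_with_magic(lambda df: len(df["close"]), df, context)
```

Each lambda receives only the values it names, so a callable that needs just the data frame stays a one-argument function.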
SkModel
Simply provide the sklearn model, e.g. LogisticRegression, along with the features and labels.
KerasModel
NOTE: with the tensorflow backend, currently only tensorflow 1.* is supported. For tensorflow 2.x a TfKeras Model still needs to be implemented.
FeaturesAndLabels
class pandas_ml_utils.FeaturesAndLabels(features, labels, label_type=None, gross_loss=None, targets=None, feature_lags=None, feature_rescaling=None, lag_smoothing=None, pre_processor=<function FeaturesAndLabels.<lambda>>, **kwargs)
FeaturesAndLabels is the main object used to hold the context of your problem. Here you define which columns of your DataFrame are features, labels or targets. This class also provides some functionality to generate autoregressive features. By default, lagging features results in an RNN-shaped 3D array (in the input format of keras RNN layers).
__init__(features, labels, label_type=None, gross_loss=None, targets=None, feature_lags=None, feature_rescaling=None, lag_smoothing=None, pre_processor=<function FeaturesAndLabels.<lambda>>, **kwargs)
- Parameters
features (List[str]) – a list of column names which are used as features for your model
labels (Union[List[str], TargetLabelEncoder, Dict[str, Union[List[str], TargetLabelEncoder]], Callable[[Any], Union[List[str], TargetLabelEncoder, Dict[str, Union[List[str], TargetLabelEncoder]]]]]) – a list of column names which are used as labels for your model. You can specify one or more named targets for a set of labels by providing a dict. This is useful if you want to train a MultiModel or if you want to provide extra information about the label, e.g. you want to classify whether a stock price is below or above average and you want to provide what the average was. It is also possible to provide a Callable[[df, …magic], labels] which returns the expected data structure.
label_type (Optional[Type]) – whether to treat a label as int, float or bool
gross_loss (Optional[Callable[[str, DataFrame], Union[Series, DataFrame]]]) – expects a callable[[df, target, …magic], df] which receives the source data frame and a target (or None) and should return a series or data frame. Let's say you want to classify whether a printer is jamming the next page or not. Halting and servicing the printer costs 5'000 while a jam costs 15'000. Your target will be 0 or empty but your gross loss will be -5'000 for all your type II errors and -15'000 for all your type I errors in case of misclassification. Another example would be if you want to classify whether a stock price is above (buy) the current price or not (do nothing). Your target is today's price and your loss is tomorrow's price minus today's price.
targets (Optional[Callable[[str, DataFrame], Union[Series, DataFrame]]]) – expects a callable[[df, targets, …magic], df] which receives the source data frame and a target (or None) and should return a series or data frame. In case of multiple targets the series names need to be unique!
feature_lags (Optional[Iterable[int]]) – an iterable of integers specifying the lags of an AR model, e.g. [1] for AR(1). If the un-lagged feature is needed as well, also provide a lag of 0, e.g. [0, 1]
feature_rescaling (Optional[Dict[Tuple[str, …], Tuple[int, …]]]) – allows features to be rescaled. The dict maps a tuple of column names to a target range
lag_smoothing (Optional[Dict[int, Callable[[Series], Series]]]) – very long lags in an AR model can be a bit fuzzy; it is possible to smooth lags, e.g. by using moving averages. The key is the lag length at which a smoothing function starts to be applied
pre_processor (Callable[[DataFrame, Dict], DataFrame]) – provide a callable[[df, …magic], df] returning a possibly augmented data frame from a given source data frame and self.kwargs. This is useful if you have, e.g., data cleaning tasks. This way you can apply the model directly on the raw data.
kwargs – extra parameters you may want to pass to a callable you have provided
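To make the feature_rescaling mapping more concrete, here is a minimal min-max sketch. The rescale helper is hypothetical, and it assumes that all columns named in one tuple share a single scale; whether pandas-ml-utils rescales jointly or per column is not stated in this reference.

```python
def rescale(columns, target_range):
    """Hypothetical min-max rescaling of several columns into one target range."""
    lo, hi = target_range
    # assumption: the minimum and maximum are taken over all columns of the tuple
    values = [v for col in columns.values() for v in col]
    v_min, v_max = min(values), max(values)
    scale = (hi - lo) / (v_max - v_min)
    return {name: [lo + (v - v_min) * scale for v in col]
            for name, col in columns.items()}

prices = {"open": [10, 12], "close": [11, 14]}
scaled = rescale(prices, (0, 1))
print(scaled)
```

Sharing one scale keeps the relative distances between related columns (for example open and close prices) intact after rescaling.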
len_features()
Returns the length of the defined features, the number of lags used and the total number of all features * lags
- Return type
Tuple[int, …]
- Returns
tuple of (#features, #lags, #features * #lags)
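The relation between lags, the 3D feature array and the tuple returned by len_features can be sketched without the library. The helper lag_features below is hypothetical; the ordering of the time steps inside each sample is an assumption, and only the (samples, time steps, features) layout follows the keras RNN input convention.

```python
def lag_features(rows, lags):
    """Hypothetical helper: build a (samples, lags, features) nested list.

    `rows` is a list of feature vectors ordered from oldest to youngest.
    """
    max_lag = max(lags)
    samples = []
    for t in range(max_lag, len(rows)):
        # one time step per lag, each holding the feature vector observed `lag` rows earlier
        samples.append([rows[t - lag] for lag in lags])
    return samples

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # 4 rows, 2 features
lags = [0, 1, 2]
x = lag_features(rows, lags)

shape = (len(x), len(x[0]), len(x[0][0]))
# same layout as the tuple described for len_features()
len_features = (len(rows[0]), len(lags), len(rows[0]) * len(lags))
print(shape, len_features)
```

The first max(lags) rows cannot be lagged completely, which is why the number of samples is smaller than the number of input rows.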
len_labels()
Returns the number of labels
- Return type
int
- Returns
number of labels
Fit
class pandas_ml_utils.model.fitting.fit.Fit(model, training_summary, test_summary, trails)
After a model is fitted it gets embedded into this class along with some Summary statistics. When a Fit is displayed in a notebook, the _repr_html_ methods of the Fit and Summary objects are called.
save_model(filename)
Save the fitted model.
- Parameters
filename (str) – filename
- Returns
None
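Notebook front ends such as Jupyter render an object by calling its _repr_html_ method when one is defined. The sketch below shows how a container could delegate to the summaries it holds. TinyFit and TinySummary are hypothetical classes and do not reproduce the markup generated by pandas-ml-utils.

```python
class TinySummary:
    """Hypothetical summary exposing the notebook display hook."""
    def __init__(self, accuracy):
        self.accuracy = accuracy

    def _repr_html_(self):
        return f"<p>accuracy: {self.accuracy:.2f}</p>"

class TinyFit:
    """Hypothetical container that nests the HTML of both summaries."""
    def __init__(self, training_summary, test_summary):
        self.training_summary = training_summary
        self.test_summary = test_summary

    def _repr_html_(self):
        # delegate to the summaries and place them side by side
        return ("<table><tr><th>training</th><th>test</th></tr>"
                f"<tr><td>{self.training_summary._repr_html_()}</td>"
                f"<td>{self.test_summary._repr_html_()}</td></tr></table>")

html = TinyFit(TinySummary(0.91), TinySummary(0.87))._repr_html_()
print(html)
```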