Analyze the Feature Space

pandas_ml_utils.feature_selection(df, label_column=None, ignore=[], top_features=5, correlation_threshold=0.5, minimum_features=1, lags=range(0, 100), show_plots=True, figsize=(12, 10))

The feature_selection function helps you analyze your features, filter out highly correlated ones, and focus on the most important features. It also applies an autoregression to the selected features and embeds an ACF plot.

Parameters
  • df (DataFrame) – the DataFrame which you apply the function on

  • label_column (Optional[str]) – column name of your dependent variable

  • ignore (Union[List[str], str]) – columns you want to ignore

  • top_features (int) – number of most important features you want to select

  • correlation_threshold (float) – threshold at which correlated features drop out

  • minimum_features (int) – number of features you want to keep even if they are highly correlated

  • lags (Iterable[int]) – iterable of lags you want to analyze as an AR process

  • show_plots (bool) – whether to show plots or not

  • figsize (Tuple[int, int]) – size of the plots

Returns

None
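
The interplay of correlation_threshold and minimum_features can be illustrated with a small stand-alone sketch. This is not the implementation used by pandas_ml_utils; the helper name filter_correlated, the greedy keep-in-rank-order strategy, and the demo data are assumptions made purely for illustration, using plain pandas.

```python
import pandas as pd


def filter_correlated(df, ranked_features, correlation_threshold=0.5, minimum_features=1):
    """Hypothetical helper: walk the features in rank order and keep a feature
    only if its absolute correlation with every already-kept feature does not
    exceed the threshold."""
    corr = df[ranked_features].corr().abs()
    kept = []
    for feature in ranked_features:
        if all(corr.loc[feature, other] <= correlation_threshold for other in kept):
            kept.append(feature)

    # Top up with the best-ranked dropped features until the minimum is reached.
    for feature in ranked_features:
        if len(kept) >= minimum_features:
            break
        if feature not in kept:
            kept.append(feature)
    return kept


# Made-up demo data: "b" is almost a multiple of "a", "c" is only weakly related.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [1, -1, 1, -1, 1, -1],
})

print(filter_correlated(demo, ["a", "b", "c"]))  # ['a', 'c']
```

With a stricter threshold, minimum_features acts as a floor: features that would otherwise be dropped are re-added in rank order until that many remain.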

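import pandas as pd
import pandas_ml_utils as pmu  # the import makes df.feature_selection available

# Keep only the rating columns plus the label column "overall"
columns = ["Tortilla", "Temp", "Meat", "Fillings", "Meat:filling",
           "Uniformity", "Salsa", "Synergy", "Wrap", "overall"]
df = pd.read_csv('_static/burritos.csv')[columns]

# Rank the features against the label, filter correlated ones and analyze lags
df.feature_selection(label_column="overall")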
_images/feature_analysis_1_1.png
          Tortilla   overall   Synergy  Fillings      Temp     Salsa  \
Tortilla       1.0  0.403981  0.367575  0.345613  0.290702  0.267212   

              Meat  Uniformity  Meat:filling      Wrap  
Tortilla  0.260194    0.208666      0.207518  0.160831  
label is continuous: True
_images/feature_analysis_1_3.png
Feature ranking:
['Synergy', 'Meat', 'Fillings', 'Meat:filling', 'Wrap', 'Tortilla', 'Uniformity', 'Salsa', 'Temp']

TOP 5 features
         Synergy      Meat  Fillings  Meat:filling     Wrap
Synergy      1.0  0.601545  0.663328      0.428505  0.08685

filter feature Fillings with correlation 1.0 > 0.5

filter feature Meat with correlation 1.0 > 0.5
Features after correlation filer:
           Synergy  Meat:filling      Wrap
Tortilla  0.367575      0.207518  0.160831
_images/feature_analysis_1_5.png _images/feature_analysis_1_6.png
Synergy       1.000000
Synergy_0     1.000000
Synergy_1     0.147495
Synergy_56    0.128449
Synergy_78    0.119272
Synergy_55    0.111832
Synergy_79    0.086466
Synergy_47    0.085117
Synergy_53    0.084786
Synergy_37    0.084312
Name: Synergy, dtype: float64
_images/feature_analysis_1_8.png
Meat:filling       1.000000
Meat:filling_0     1.000000
Meat:filling_15    0.185946
Meat:filling_35    0.175837
Meat:filling_1     0.122546
Meat:filling_87    0.118597
Meat:filling_33    0.112875
Meat:filling_73    0.103090
Meat:filling_72    0.103054
Meat:filling_71    0.089437
Name: Meat:filling, dtype: float64
_images/feature_analysis_1_10.png
Wrap       1.000000
Wrap_0     1.000000
Wrap_63    0.210823
Wrap_88    0.189735
Wrap_1     0.169132
Wrap_87    0.166502
Wrap_66    0.146689
Wrap_89    0.141822
Wrap_74    0.120047
Wrap_11    0.115095
Name: Wrap, dtype: float64
best lags are
[(1, '-1.00'), (2, '-0.15'), (88, '-0.10'), (64, '-0.07'), (19, '-0.07'), (89, '-0.06'), (36, '-0.05'), (43, '-0.05'), (16, '-0.05'), (68, '-0.04'), (90, '-0.04'), (87, '-0.04'), (3, '-0.03'), (20, '-0.03'), (59, '-0.03'), (75, '-0.03'), (91, '-0.03'), (57, '-0.03'), (46, '-0.02'), (48, '-0.02'), (54, '-0.02'), (73, '-0.02'), (25, '-0.02'), (79, '-0.02'), (76, '-0.02'), (37, '-0.02'), (71, '-0.02'), (15, '-0.02'), (49, '-0.02'), (12, '-0.02'), (65, '-0.02'), (40, '-0.02'), (24, '-0.02'), (78, '-0.02'), (53, '-0.02'), (8, '-0.02'), (44, '-0.01'), (45, '0.01'), (56, '0.01'), (26, '0.01'), (82, '0.01'), (77, '0.02'), (22, '0.02'), (83, '0.02'), (11, '0.02'), (66, '0.02'), (31, '0.02'), (80, '0.02'), (92, '0.02'), (39, '0.03'), (27, '0.03'), (70, '0.04'), (41, '0.04'), (51, '0.04'), (4, '0.04'), (7, '0.05'), (13, '0.05'), (97, '0.06'), (60, '0.06'), (42, '0.06'), (96, '0.06'), (95, '0.06'), (30, '0.07'), (81, '0.07'), (52, '0.07'), (9, '0.07'), (61, '0.07'), (84, '0.07'), (29, '0.08'), (94, '0.08'), (28, '0.11')]
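
The per-feature tables above list each feature's correlation with lagged copies of itself, labelled name_lag. A rough way to reproduce that kind of table with plain pandas is sketched below. The helper name lag_correlations and the synthetic series are assumptions for illustration, and the library may compute its values differently.

```python
import pandas as pd


def lag_correlations(series, lags=range(0, 100)):
    """Hypothetical helper: absolute correlation of a series with each lagged
    copy of itself, labelled '<name>_<lag>' and sorted from strongest to weakest."""
    corr = pd.Series(
        {f"{series.name}_{lag}": series.autocorr(lag) for lag in lags},
        name=series.name,
    )
    return corr.abs().sort_values(ascending=False)


# Synthetic series that repeats every 4 steps, so lags 4 and 8 correlate perfectly.
wave = pd.Series([0, 1, 0, -1] * 10, name="wave", dtype=float)
print(lag_correlations(wave, lags=range(0, 10)))
```

Lag 0 always correlates perfectly with the original series, which is why the name_0 rows in the output above show 1.000000.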