• In this notebook we'll explore the models in sklearn's linear_model module and attempt to optimize the top-performing ones with hyperparameter tuning.


  • The Python tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.


  • Colab link: Open Colab


  • 
    import numpy as np
    import pandas as pd
    
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.preprocessing import StandardScaler
    
    from sklearn import linear_model
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    
    import warnings
    warnings.filterwarnings('ignore')
    
    np.random.seed(27)
     
    
  • setting up default plotting parameters


  • 
    %matplotlib inline
    
    plt.rcParams['figure.figsize'] = [20.0, 7.0]
    plt.rcParams.update({'font.size': 22,})
    
    sns.set_palette('viridis')
    sns.set_style('white')
    sns.set_context('talk', font_scale=0.8)
     
    
  • wget's -O option stores all downloaded content at the given filename; here we download the train and test CSVs and load them with pandas.


  • 
    !wget -O train.csv https://www.upscfever.com/datasets/train_ML.csv
    !wget -O test.csv https://www.upscfever.com/datasets/test_ML.csv
    
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    
    print('Train Shape: ', train.shape)
    print('Test Shape: ', test.shape)
    
    train.head()
     
    
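  • If wget isn't available (for example, when running locally on Windows), pandas can also read the CSVs straight from the URLs. A minimal sketch, using the same dataset URLs as above:


  • 
    # alternative: read directly from the URLs instead of downloading first
    train = pd.read_csv('https://www.upscfever.com/datasets/train_ML.csv')
    test = pd.read_csv('https://www.upscfever.com/datasets/test_ML.csv')

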
  • prepare for modeling


  • The axis argument of DataFrame.drop controls whether to drop labels from the index (0 or 'index') or columns (1 or 'columns'); here we drop the id and target columns.
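

  • For example, on a small hypothetical frame (not part of this dataset), axis controls the direction of the drop:


  • 
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df.drop(['b'], axis=1)   # drops column 'b'
    df.drop([0], axis=0)     # drops the row with index label 0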


  • 
    X_train = train.drop(['id', 'target'], axis=1)
    y_train = train['target']
    
    X_test = test.drop(['id'], axis=1)
    
     
    
  • scaling data


  • 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
     
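
  • As a quick sanity check, each feature in the scaled training set should now have mean approximately 0 and standard deviation approximately 1 (the test set is only transformed, not refit, so its statistics can differ slightly):


  • 
    # each training feature should now have mean ~0 and std ~1
    print(X_train.mean(axis=0).round(6))
    print(X_train.std(axis=0).round(6))
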
    
  • Baseline Models


  • 
    # regression estimators
    ridge = linear_model.Ridge()
    lasso = linear_model.Lasso()
    elastic = linear_model.ElasticNet()
    lasso_lars = linear_model.LassoLars()
    bayesian_ridge = linear_model.BayesianRidge()
    
    # classification estimators
    logistic = linear_model.LogisticRegression(solver='liblinear')
    sgd = linear_model.SGDClassifier()
    
    models = [ridge, lasso, elastic, lasso_lars, bayesian_ridge, logistic, sgd]
     
    
  • function to get cross validation scores


  • 
    def get_cv_scores(model):
        """Print the mean and standard deviation of 5-fold cross-validated ROC AUC scores."""
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
        print('CV Mean: ', np.mean(scores))
        print('STD: ', np.std(scores))
        print('\n')
     
    
  • loop through list of models


  • 
    for model in models:
        print(model)
        get_cv_scores(model)
     
    
  • From this we can see that our best-performing models out of the box are logistic regression and stochastic gradient descent. Let's see if we can optimize these models with hyperparameter tuning.


  • Logistic Regression and Grid Search


  • Grid search is an exhaustive search over specified parameter values.


  • Signature: class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)


  • n_jobs : int, default=None. Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors.
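

  • Exhaustive means every combination in the grid is fit once per CV fold. For the grid defined below, that's 2 penalties x 8 values of C x 4 class weights x 2 solvers = 128 candidates, and with the default 5-fold CV that comes to 640 fits:


  • 
    # number of fits GridSearchCV will run for the grid below (cv=None -> 5-fold)
    n_candidates = 2 * 8 * 4 * 2   # penalty x C x class_weight x solver
    print(n_candidates * 5)        # 640 fits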


  • 
    penalty = ['l1', 'l2']
    C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
    class_weight = [{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}]
    solver = ['liblinear', 'saga']
    
    param_grid = dict(penalty=penalty,
                      C=C,
                      class_weight=class_weight,
                      solver=solver)
    
    grid = GridSearchCV(estimator=logistic, param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=-1)
    grid_result = grid.fit(X_train, y_train)
    
    print('Best Score: ', grid_result.best_score_)
    print('Best Params: ', grid_result.best_params_)
    
    logistic = linear_model.LogisticRegression(C=1, class_weight={1:0.6, 0:0.4}, penalty='l1', solver='liblinear')
    get_cv_scores(logistic)
    
    predictions = logistic.fit(X_train, y_train).predict_proba(X_test)
     
    
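  • Note that because refit=True by default, the search object already holds the best model refit on the full training set; instead of retyping the best parameters as above, grid_result.best_estimator_ can be reused directly:


  • 
    # equivalent: reuse the refit best model from the search
    predictions = grid_result.best_estimator_.predict_proba(X_test)

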
  • Stochastic Gradient Descent and Random Search


  • Random search samples a fixed number of parameter settings (n_iter) from the specified parameter values, rather than exhaustively trying every combination.
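

  • Unlike a grid, the candidates need not come from fixed lists: param_distributions also accepts scipy.stats distributions, which are sampled once per iteration. A minimal sketch, where the loguniform range is an assumption rather than something tuned for this dataset:


  • 
    from scipy.stats import loguniform
    
    # hypothetical sketch: sample alpha continuously instead of from a fixed list
    param_dist = dict(alpha=loguniform(1e-4, 1e3),
                      penalty=['l1', 'l2', 'elasticnet'])
    sampled_search = RandomizedSearchCV(estimator=sgd, param_distributions=param_dist,
                                        scoring='roc_auc', n_iter=50)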


  • 
    loss = ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']
    penalty = ['l1', 'l2', 'elasticnet']
    alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
    learning_rate = ['constant', 'optimal', 'invscaling', 'adaptive']
    class_weight = [{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}]
    eta0 = [1, 10, 100]
    
    param_distributions = dict(loss=loss,
                               penalty=penalty,
                               alpha=alpha,
                               learning_rate=learning_rate,
                               class_weight=class_weight,
                               eta0=eta0)
    
    random = RandomizedSearchCV(estimator=sgd, param_distributions=param_distributions, scoring='roc_auc', verbose=1, n_jobs=-1, n_iter=1000)
    random_result = random.fit(X_train, y_train)
    
    print('Best Score: ', random_result.best_score_)
    print('Best Params: ', random_result.best_params_)
    
    sgd = linear_model.SGDClassifier(alpha=0.1,
                                     class_weight={1:0.7, 0:0.3},
                                     eta0=100,
                                     learning_rate='optimal',
                                     loss='log',
                                     penalty='elasticnet')
    get_cv_scores(sgd)
    
    predictions = sgd.fit(X_train, y_train).predict_proba(X_test)