I am running a logistic regression and would like to plot its learning curve to get a feel for the data. How can I do this? Here is my code so far:

  from sklearn import metrics,preprocessing,cross_validation
  from sklearn.feature_extraction.text import TfidfVectorizer
  import sklearn.linear_model as lm
  import numpy as np
  import pandas as p
  loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

  print "loading data.."
  traindata = list(np.array(p.read_table('train.tsv'))[:,2])
  testdata = list(np.array(p.read_table('test.tsv'))[:,2])
  y = np.array(p.read_table('train.tsv'))[:,-1]

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)

  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None)

  X_all = traindata + testdata
  lentrain = len(traindata)

  print "fitting pipeline"
  tfv.fit(X_all)
  print "transforming data"
  X_all = tfv.transform(X_all)

  X = X_all[:lentrain]
  X_test = X_all[lentrain:]

  print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

  print "training on full data"
  rd.fit(X,y)
  pred = rd.predict_proba(X_test)[:,1]
  testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
  pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
  print "submission file created.."

What I would like to create is something like this, so that I can better understand what is going on:

Image of expected output

Can anyone help me with this please?

Question author Simon-kiely



Not quite as general as it should be, but it'll do the job with a little fiddling on your end.

from matplotlib import pyplot as plt
from sklearn import metrics
import numpy as np

def data_size_response(model,trX,teX,trY,teY,score_func,prob=True,n_subsets=20):

    train_errs,test_errs = [],[]
    subset_sizes = np.exp(np.linspace(3,np.log(trX.shape[0]),n_subsets)).astype(int)

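    for m in subset_sizes:
        # refit on the first m training examples before scoring
        model.fit(trX[:m],trY[:m])
        if prob:
            # probability outputs; binary metrics such as roc_auc_score may need
            # only the positive-class column, e.g. predict_proba(...)[:,1]
            train_err = score_func(trY[:m],model.predict_proba(trX[:m]))
            test_err = score_func(teY,model.predict_proba(teX))
        else:
            train_err = score_func(trY[:m],model.predict(trX[:m]))
            test_err = score_func(teY,model.predict(teX))
        print("training error: %.3f test error: %.3f subset size: %d" % (train_err,test_err,m))
        train_errs.append(train_err)
        test_errs.append(test_err)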

    return subset_sizes,train_errs,test_errs

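def plot_response(subset_sizes,train_errs,test_errs):

    plt.plot(subset_sizes,train_errs,lw=2)
    plt.plot(subset_sizes,test_errs,lw=2)
    plt.legend(['Training Error','Test Error'])
    plt.xscale('log')  # subset sizes are exponentially spaced
    plt.xlabel('Dataset size')
    plt.ylabel('Error')
    plt.title('Model response to dataset size')
    plt.show()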

model = # put your model here
score_func = # put your scoring function here
response = data_size_response(model,trX,teX,trY,teY,score_func,prob=True)
plot_response(*response)

The data_size_response function takes a model (in your case an instantiated LR model), a pre-split dataset (train/test X and Y arrays; you can use sklearn's train_test_split function to generate these), and a scoring function. It trains the model on n exponentially spaced subsets of the training data and returns the subset sizes along with the training and test scores at each size, which together form the "learning curve". The plot_response function visualizes this response.
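To make the "exponentially spaced subsets" concrete, here is a minimal sketch of the size calculation on its own. The helper name exponential_subset_sizes and the sample count of 5000 are made up for illustration; the expression itself is the one used inside data_size_response.

```python
import numpy as np

def exponential_subset_sizes(n_samples, n_subsets=20, start=3.0):
    """Hypothetical helper: integer sizes spaced evenly in log space.

    Sizes run from about exp(start) (roughly 20 for start=3) up to
    n_samples, so small subsets are sampled more densely than large ones.
    """
    log_sizes = np.linspace(start, np.log(n_samples), n_subsets)
    # astype(int) truncates, so the last size can come out one below
    # n_samples because of floating-point rounding in exp(log(n)).
    return np.exp(log_sizes).astype(int)

sizes = exponential_subset_sizes(n_samples=5000, n_subsets=8)
print(sizes)
```

Because the sizes grow geometrically, a log-scaled x axis spreads them evenly when plotting.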

I would have liked to use cross_val_score as in your example, but that would require modifying the sklearn source to return training scores in addition to the test scores it already provides. The prob argument controls whether the model's predict_proba or predict method is used, which matters for certain model/scoring function combinations, e.g. roc_auc_score.
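For readers on a more recent scikit-learn than the one in the question, the library ships a learning_curve helper in sklearn.model_selection that returns cross-validated training and test scores together. The sketch below assumes such a version is installed and uses synthetic data from make_classification as a stand-in for the real TF-IDF features; the model settings are illustrative, not the asker's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (assumption: binary labels).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

model = LogisticRegression(C=1.0, max_iter=1000)

# train_sizes given as fractions of the largest training fold.
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="roc_auc",
)

# Each row is one training-set size, each column one CV fold.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, mean_train, mean_test):
    print("n=%d  train AUC=%.3f  CV AUC=%.3f" % (n, tr, te))
```

The averaged arrays can be passed to a plotting routine such as plot_response(train_sizes, mean_train, mean_test); note that they hold scores (higher is better) rather than errors.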

Example plot on a subset of the MNIST dataset: [image]

Let me know if you have any questions!

Answer author Newmu