Generalized Linear Classifiers in NLP
For the better part of a decade, machine learning methods like maximum entropy
and support vector machines have been a major part of many NLP applications
such as parsing, semantic role labeling, ontology induction, machine translation, and summarization.
Many of these models fall into the class of Generalized Linear Classifiers,
which are characterized by defining a prediction boundary as a linear combination of
input features and their weights. In this course we will cover many of the important
aspects of generalized linear classifiers, including training methods, min error vs. max likelihood,
distribution-free methods, online vs. batch, generative vs. discriminative, structured models, distributed algorithms, and extensions
beyond linear predictors through kernels. The course assumes familiarity with basic concepts from statistics, calculus,
and linear algebra.
Date: October 19th, 2009; Lecturer: Ryan McDonald
Outline (subject to change)
10-12
 Introduction, feature representations, loss functions, perceptron, margin, SVMs, logistic regression (Max Ent), stochastic gradient descent
13-15
 Parallelization, structured learning including conditional random fields, kernels
15-17
 Leftovers, practical
Practical
An implementation of the perceptron algorithm and some extensions (handout).
Starter code available here, data sets included.
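
For orientation, the basic perceptron update can be sketched as below. This is a minimal illustration and not the course starter code: the function names, the dense list representation of feature vectors, the {-1, +1} label encoding, and the toy data are all assumptions made for the example.

```python
def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, num_features, epochs=10):
    """Train a binary perceptron.

    examples: list of (feature_vector, label) pairs, label in {-1, +1}.
    Returns the learned weight vector.
    """
    w = [0.0] * num_features
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            # A mistake (or a zero score) triggers the additive update.
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            # The training data is separated; further passes change nothing.
            break
    return w


def predict(w, x):
    """Predict +1 or -1 from the sign of the linear score."""
    return 1 if dot(w, x) > 0 else -1


# Hypothetical toy data; the last feature is a constant bias term.
data = [
    ([2.0, 1.0, 1.0], 1),
    ([1.0, 3.0, 1.0], 1),
    ([-1.0, -2.0, 1.0], -1),
    ([-2.0, -1.0, 1.0], -1),
]
w = train_perceptron(data, num_features=3)
```

On linearly separable data such as this toy set, the loop stops as soon as a full pass makes no mistakes. Extensions such as weight averaging would build on the same update.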
Project suggestions
 Build a structured perceptron algorithm for entity or part-of-speech tagging
 Download and test various linear classifiers for standard NLP problems
(MALLET (logistic regression / maximum entropy),
libsvm, or SVMlight)
 Download and compare some structured learning algorithms (MALLET (CRF),
StructLearn (Perceptron, MIRA),
StructSVM)
 Create a kernelized version of the perceptron algorithm
 Implement a stochastic gradient ascent/descent algorithm and compare with perceptron
 Implement a parallel logistic regression (simulation)
 Investigate parallelization by weight averaging: when does it work and when does it not?
 Parsing: show how one can use perceptron and/or CRFs for structured learning of context-free parsing through the CKY and inside-outside algorithms
 Prove the equivalence of logistic regression and maximum entropy, i.e., write out both objective functions and show that they are maximized with precisely the same parameters
 Email me about possible data sets (see my webpage for an address).
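
As a starting point for the kernelized perceptron suggestion above, here is one possible sketch in the dual form, where a mistake count per training example replaces the explicit weight vector. The polynomial kernel, the function names, and the XOR-style toy data are assumptions chosen for illustration; a project would substitute its own kernel and data.

```python
def poly_kernel(a, b, degree=2):
    """Polynomial kernel (1 + a.b)^degree; an illustrative choice of kernel."""
    return (1.0 + sum(ai * bi for ai, bi in zip(a, b))) ** degree


def kernel_score(alpha, examples, kernel, x):
    """Dual-form score: sum over training examples of alpha_j * y_j * K(x_j, x)."""
    return sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alpha, examples))


def train_kernel_perceptron(examples, kernel, epochs=20):
    """Return one mistake count per training example (the dual variables)."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(examples):
            if y * kernel_score(alpha, examples, kernel, x) <= 0:
                alpha[i] += 1  # dual analogue of w <- w + y * x
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def kernel_predict(alpha, examples, kernel, x):
    """Predict +1 or -1 from the sign of the dual-form score."""
    return 1 if kernel_score(alpha, examples, kernel, x) > 0 else -1


# XOR-style toy data: not linearly separable in the original two features.
xor_data = [
    ([1.0, 1.0], -1),
    ([-1.0, -1.0], -1),
    ([1.0, -1.0], 1),
    ([-1.0, 1.0], 1),
]
alpha = train_kernel_perceptron(xor_data, poly_kernel)
```

The degree-2 kernel implicitly includes the product of the two features, which is what makes the XOR pattern separable here; a linear perceptron on the same two features would never converge on this data.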
Literature and resources
 Previous course from 2007 (html)
 Slides from ESSLLI lecture (pdf)
 A tutorial by John Blitzer (pdf)
 SVM tutorials (html)
 Tutorials by Dan Klein (html), at the bottom of the page. Check out "Introduction to Classification", "Max Margin Methods for NLP", and "Maxent Models ..."