**Class sessions:** Thursdays 11:15 – 12:30, Fridays 9:45 – 11:00, East Hall 2

**Course description.** Machine learning (ML) is all about algorithms which are fed with (large quantities of) real-world data and which return a compressed "model" of the data. An example is the "world model" of a robot: the input data are sensor data streams, from which the robot learns a model of its environment -- needed, for instance, for navigation. Another example is a spoken language model: the input data are speech recordings, from which ML methods build a model of spoken English -- useful, for instance, in automated speech recognition systems. There is a large number of formalisms in which such models can be cast, and an equally large diversity of learning algorithms. However, there is a relatively small number of fundamental challenges common to all of these formalisms and algorithms: most notably, the "curse of dimensionality" and the ever-dangerous problem of under- vs. overfitting. This lecture introduces these fundamental concepts and illustrates them with a selection of elementary model formalisms (linear classifiers and regressors, radial basis function networks, clustering, mixtures of Gaussians, Parzen windows). Furthermore, the course provides a refresher of the requisite concepts from probability theory, statistics, and linear algebra.

**Homeworks.** Students work in teams of two or three, each team submitting a single solution document. There are two kinds of homeworks: problem-based sheets and programming microprojects. Problem-based homework sheets are not graded in the usual sense. Instead, a HW sheet gets a full score if all problems have been visibly and convincingly worked on; correctness of solutions is not required for a full score. It is permissible (though unwise) to copy solutions from friends or other sources. In that case, the fact that the solutions are copied, as well as the source, must be stated on the HW sheet. TAs will annotate HW sheets that are original work (not copied) to provide useful feedback. If a student copies solutions and does not indicate that fact and the source, it will be treated as a cheating case (nulling of the sheet, notification of the Office of Academic Affairs). Problem-based HW solution sheets must be typeset and submitted electronically by email. Programming microprojects are programmed in Matlab or Python (student's choice) and likewise submitted by email, in the form of a typeset documentation plus the source code.

**Grading and exams:** The final course grade will be computed from homeworks (15%), presence sheets (10%), and quizzes/exams. There will be four miniquizzes (written in class, 20 minutes), the best three of which will each count for 15% of the final grade (the worst will be dropped), and one final exam, counting 30%. All quizzes and the final exam are open book. Each quiz or exam yields a maximum of 100 points.
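As a sanity check, the weights above add up to 100% (15% + 10% + 3 × 15% + 30%). A small Python sketch of the computation; the function name and the example scores are invented for illustration:

```python
def final_grade(hw, presence, quiz_scores, final_exam):
    """Combine course components according to the stated weights.

    hw, presence, final_exam: percentages in [0, 100].
    quiz_scores: the four miniquiz scores (max 100 each); the worst
    is dropped and the best three count 15% each.
    """
    best_three = sorted(quiz_scores)[1:]           # drop the worst of the four
    quiz_part = sum(0.15 * q for q in best_three)  # 3 x 15% = 45%
    return 0.15 * hw + 0.10 * presence + quiz_part + 0.30 * final_exam

# Full marks everywhere yield the maximum grade of 100.
print(round(final_grade(100, 100, [100, 100, 100, 100], 100), 2))
```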

Miniquiz makeup rules: if a miniquiz is missed without excuse, it will be graded with 0 points. An oral makeup will be offered for medically excused miniquizzes according to the Jacobs rules (in particular, the medical excuse must be announced to me on the day of the miniquiz). Non-medical excuses can be accepted and makeups arranged on a case-by-case basis.

**Fully self-contained lecture notes** are here (Version 0.16, May 8 -- corrected 1 typo page 81).

**TAs for this course** are Xu "Owen" He and Felix Schmoll.

**Schedule** (this will be filled in, in synchrony with reality, as we go along)

Feb 2 | Introduction |

Feb 3 | Probabilities over very large-dimensional data spaces. Discrete -- continuous data transformations. |

Feb 9 | The geometry of high-dimensional data clouds: manifolds. Reading: Section 2.3 of the lecture notes. Exercise sheet 1 |

Feb 10 | A global view on the field(s) of machine learning. Reading: Section 3 of the lecture notes. |

Feb 16 | Basics of pattern classification. A look in passing at decision trees. Reading: Section 4 of LNs. |

Feb 17 | Optimal decision boundaries. Dimension reduction by feature extraction. Reading: Section 5 up to and including 5.1 |

Feb 23 | K-means clustering. Principal component analysis: principle. Solutions to exercise sheet 1 |
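The k-means entry above can be illustrated with a minimal sketch (not course material; the function name and the toy data are invented, and numpy is assumed):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# two well-separated blobs around (0, 0) and (10, 10)
noise = np.random.default_rng(1).normal(scale=0.5, size=(40, 2))
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)]) + noise
centers, labels = kmeans(X, k=2)
```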

Feb 24 | Miniquiz 1 (venue: CNLH; time: normal class time). PCA: basic idea. Exercise sheet 2 (new submission deadline: March 3, 23:59; ignore the deadline stated on the problem sheet) |

Mar 2 | PCA: formal properties, algorithm, and usage for data compression. Reading: Sections 5.2 -- 5.4 in the LN |
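The PCA session above (formal properties, algorithm, compression) can be condensed into a short sketch: center the data, eigendecompose the covariance matrix, project onto the leading eigenvectors. The function name and toy data are invented for illustration, and numpy is assumed:

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the top-m principal directions."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(X) - 1)       # sample covariance matrix
    evals, evecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    order = evals.argsort()[::-1]      # re-sort descending
    U = evecs[:, order[:m]]            # top-m eigenvectors
    return Xc @ U, U, evals[order]

# toy data varying mostly along the direction (1, 1)
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 1.0]]) + 0.05 * rng.normal(size=(200, 2))
Z, U, evals = pca(X, m=1)   # Z: 1-D compressed representation
```

The first principal direction recovers (1, 1) up to sign, and the leading eigenvalue dominates, which is exactly what makes the 1-D compression nearly lossless here.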

Mar 3 | PCA completed; eigendigits. Exercise sheet 3 |

Mar 9 | Linear regression |
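For the linear regression session above, a minimal least-squares sketch (toy data invented for illustration; numpy assumed). The weights solve the normal equations w = (Φᵀ Φ)⁻¹ Φᵀ y; `lstsq` is used instead of an explicit inverse for numerical safety:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=50)   # true line: 3x + 1

Phi = np.hstack([X, np.ones((50, 1))])        # design matrix with bias column
w = np.linalg.lstsq(Phi, y, rcond=None)[0]    # least-squares weights (slope, bias)
```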

Mar 10 | No class. Exercise sheet 4 |

Mar 16 | No class |

Mar 17 | Miniquiz 2 (venue: CNLH). The ubiquitous phenomenon of overfitting. Reading: Sections 8.1, 8.2 |

Mar 23 | Refresher on expectation, variance and covariance. Reading: Appendix D and Section 8.3 |

Mar 24 | An abstract view on supervised learning. Reading: Section 8.3. Exercise sheet 5 |

Mar 30 | Model class inclusion sequences. Optimizing generalization of models by cross validation. Reading: Sections 8.4, 8.5 |
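The cross-validation session above can be sketched in a few lines: split the data into k folds, train on k−1 of them, measure error on the held-out fold, and average. The function name and the toy polynomial-degree comparison are invented for illustration; numpy is assumed:

```python
import numpy as np

def k_fold_cv_error(x, y, fit, predict, k=5):
    """Estimate test error (mean squared error) by k-fold cross validation."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errs.append(np.mean((predict(model, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

# choose a polynomial degree by CV; the true curve is quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + 0.5 * x**2 + 0.1 * rng.normal(size=60)

cv = {d: k_fold_cv_error(x, y,
                         fit=lambda xt, yt, d=d: np.polyfit(xt, yt, d),
                         predict=np.polyval)
      for d in (1, 2, 9)}   # underfit / right size / overfit candidates
```

The degree-1 model underfits (it cannot represent the quadratic term), so its cross-validation error stays above that of the degree-2 model, which is the model-selection signal cross validation provides.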

Mar 31 | Miniquiz 3 (venue: CNLH). Achieving good generalization by regularization. Solutions to sheet 5. Exercise sheets 6-8: miniproject |

Apr 6 | The art of writing Introductions in scientific texts. Ridge regression. Reading: Lecture notes Sections 8.7, 8.8 |
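The ridge regression topic above admits a compact sketch: the weights solve (Φᵀ Φ + α I) w = Φᵀ y, where the penalty α shrinks the weights toward zero and tames overfitting. The function name and toy data are invented for illustration; numpy is assumed:

```python
import numpy as np

def ridge(Phi, y, alpha):
    """Ridge regression weights: solve (Phi^T Phi + alpha I) w = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = Phi @ w_true + 0.1 * rng.normal(size=30)

w0 = ridge(Phi, y, alpha=0.0)     # alpha = 0 recovers ordinary least squares
w1 = ridge(Phi, y, alpha=100.0)   # heavy shrinkage -> much smaller weights
```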

Apr 7 | Why it is called the bias-variance tradeoff. RBF networks. Reading: exercise sheets 6-8 |

Apr 20 | Neural Networks: Introduction. Rosenblatt's Perceptron. The von Neumann bottleneck, biological modeling of neurons, challenges for future microchip design and all that. Reading: LN Section 9 up to (including) 9.1 |

Apr 21 | The spirit of the neural networks branch in machine learning. A quick look at energy based models. MLPs: local update rule of a neuron. Reading: LN Section 9 up to (excluding) 9.1 |

Apr 27 | Structure of MLPs. Universal approximation property. Use cases. |

Apr 28 | Training MLPs by gradient descent. The backprop algorithm. Exercise sheet 9 |
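The session above on training MLPs by gradient descent can be illustrated with a tiny one-hidden-layer network trained on XOR, the backprop gradients written out by hand. The toy setup (layer sizes, learning rate, seed) is invented for illustration; numpy is assumed:

```python
import numpy as np

# XOR training data
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer (8 tanh units)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # sigmoid output unit
lr = 0.5

for _ in range(10000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    # backward pass for squared-error loss: delta terms per layer
    d_out = (out - y) * out * (1 - out)      # output delta (sigmoid derivative)
    d_h = (d_out @ W2.T) * (1 - h**2)        # hidden delta (tanh derivative)
    # gradient-descent updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

preds = (out > 0.5).astype(int)   # should reproduce the XOR targets
```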

May 4 | Miniquiz 4 (venue: CNLH). The intricacies of gradient descent: tension between stability and convergence speed |

May 5 | Quick and steep tutorial on dynamical systems, part 1. Slides of the tutorial |

May 11 | (Online course evaluation - please bring your laptops or tablets or smartwatches) Quick and steep tutorial on dynamical systems, part 2 |

May 12 | Quick and steep tutorial on dynamical systems, part 3 |

May 22 | Final Exam: 9:00 - 11:00, Conference Hall IRC |

**References**

*The online lecture notes are self-contained, and no further literature is necessary for this course. However, if you want to study some topics in more depth, the following are recommended references.*

Bishop, Christopher M.: *Neural Networks for Pattern Recognition* (Oxford Univ. Press, 1995). IRC: QA76.87 .B574 1995. *A recommended basic reference (beyond the online lecture notes).*

Bishop, Christopher M.: *Pattern Recognition and Machine Learning* (Springer Verlag, 2006). *Much more up-to-date and comprehensive than the previously mentioned Bishop book, but I dare say too thick and advanced for an undergraduate course (730 pages) -- more like a handbook for practitioners. To find your way into ML, the older, slimmer Bishop book will work better.*

Michie, D., Spiegelhalter, D.J., Taylor, C.C.: *Machine Learning, Neural and Statistical Classification* (1994) Free and online at http://www.amsta.leeds.ac.uk/~charles/statlog/ and at the course resource repository. *A transparently written book, concentrating on classification. Good backup reading. Thanks to Mantas for pointing this out!*

Duda, R.O., Hart, P.E., Stork, D.G.: *Pattern Classification*, 2nd edition (John Wiley, 2001). IRC: Q327 .D83 2001. *Covers more than the Bishop book; more detailed and more mathematically oriented. Backup reference for the deep probers.*

T. Hastie, R. Tibshirani, J. Friedman: *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (Springer Verlag, 2001). IRC: Q325.75 .H37 2001. *I found this book only recently and haven't studied it in detail -- it looks extremely well written, combining (statistical) maths with applications and principal methods of machine learning, and is full of illuminating color graphics. May become my favourite.*

Farhang-Boroujeny, B.: *Adaptive Filters, Theory and Applications* (John Wiley, 1999). IRC: TK7872.F5 F37 1998. *Some initial portions of this book describe online linear filtering with the LMS algorithm, which will possibly be covered in the course.*

Mitchell, Tom M.: *Machine Learning* (McGraw-Hill, 1997). IRC: Q325.5 .M58 1997. *More general and more comprehensive than the course; covers many branches of ML that are not treated in the course. Gives a good overview of the larger picture of ML.*

Nabney, Ian T.: *NETLAB: Algorithms for Pattern Recognition* (Springer Verlag, 2001). IRC: TA1637 .N33 2002. *A companion book to the Bishop book, concentrating on Matlab implementations of the main techniques described in the Bishop book. Matlab code is public and can be downloaded from http://www.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/*