Course Details

Empirical Methods in Natural Language Processing: Course Details

Course number	04832710	Credits	3
English title	Empirical Methods in Natural Language Processing
Prerequisites	Linear Algebra, Programming, Mathematical Logic, Probability and Statistics
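Chinese description (translated)	Empirical Methods in Natural Language Processing is a specialized elective for upper-level undergraduates in information science and related majors. Building on prior courses in mathematical logic, probability and statistics, and programming, it introduces data-driven empirical methods for solving common problems in natural language processing (text data processing in particular) and develops students' hands-on ability to analyze and process large-scale data. The course also includes special-topic sessions on active research areas, such as statistical machine translation and large-scale information extraction, to introduce students to more recent research advances. The empirical methods covered here are approaches that are data-driven, operate on corpora, and use pattern recognition and machine learning as their tools. Through study and practice in this course, students should be able, when facing a real problem, to choose appropriate algorithms and optimization methods, program independently, and handle fairly large text datasets with ease. We hope that students benefit from this training whether they go on to further study or are about to enter the workforce.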
English description	This course is an introduction for undergraduate students interested in empirical methods applied to natural language processing. We will emphasize empirical methods, which here mainly means data-driven models with ingredients from pattern recognition and machine learning. We will also survey interesting NLP applications, e.g., word segmentation, tagging, and parsing, and introduce recent advances in statistical machine translation and information extraction. In this course, students will learn what data-driven methods are, how to use those models to build their own systems to analyze massive text data, and how to solve a real NLP problem in practice. The prerequisites for this course include: passion (sure!), some knowledge of Probability Theory and Statistics, a little Mathematical Logic, and Practice of Programming.
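Department	School of Information Science and Technology
General elective category
Counts toward arts and aesthetic education	No
Platform course nature
Platform course type
Language of instruction	English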
Textbooks	Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky and James H. Martin, Pearson Prentice Hall, 2008, 2nd edition, ISBN 9787115238924; Foundations of Statistical Natural Language Processing, Christopher Manning and Hinrich Schütze, MIT Press, 1999, 1st edition, ISBN 7505399217
Reference books
Syllabus	Empirical Methods in Natural Language Processing is an introduction for undergraduate students interested in empirical methods applied to natural language processing. It introduces data-driven empirical methods for solving common NLP problems and develops students' hands-on ability to analyze and process large-scale data. Students will learn what data-driven methods are, how to use those models to build their own systems to analyze massive text data, and how to solve a real NLP problem in practice.
Topics:
1. Overview and introduction (1 lecture):
* Corpora and the web
* Grammars and the Chomsky hierarchy
* Ambiguity
* Semantics
* Introduction to human language processing
* Overview of language technology
2. Lexicon and lexical processing (4-5 lectures):
* Language modeling
* Smoothing
* Hidden Markov models (HMMs)
* Sequence labeling
* Part-of-speech tagging to illustrate HMMs
* Word segmentation to illustrate sequence labeling
* Chunking and named entity recognition
* Viterbi algorithm
3. Syntax and syntactic processing (5-6 lectures):
* Linguistic intuitions (constituents, bilexical dependencies); formal grammars (context-free grammar, lexicalization, dependency grammar)
* Treebanks: lexicalized grammars and corpus annotation
* Statistical models: probabilistic context-free grammars (PCFGs), symbol-refined CFG parsing, lexicalized CFGs
* Inference (directional): LR, generalized LR, Earley algorithm
* Transition-based parsing (optional: dynamic programming techniques for transition-based parsing)
* Inference (non-directional): CYK, inside-outside algorithm, semiring
* Parameter estimation for PCFGs: MLE and EM
4. Semantics and semantic processing (2-3 lectures):
* Compositionality
* Word sense disambiguation
* Anaphora resolution
* Argument structure
5. Applications (2-3 lectures):
* Machine translation
* Information extraction
The course is taught mainly through lectures, with one or two tutorial sessions. Outside class, students complete written coursework and independently program their practical projects.
Lectures: 18-19; Tutorials: 1-2; Assessments: 2; Projects: 2 (mid-term/final).
There is no written final exam. Grades have two components: (1) written assignments during the term and (2) practical projects. The projects run as a formal evaluation: students are given a trial corpus when a project is released and independently build their own prototype systems for the task. During the evaluation period, normally 3-4 days, students first receive the real training data, train and tune their systems, and finally test them on the test data. All stages are timetabled. The prototype system, the task report, and the test results are the main basis for grading.
Teaching evaluation