Overview

This course introduces the technologies behind web search engines, including document indexing, searching, and ranking. You will also learn different performance metrics for evaluating search quality, methods for understanding user intent and document semantics, and advanced applications including recommendation systems and summarization. Real-life examples and case studies are provided to reinforce your understanding of search algorithms.

Syllabus

Introduction to Search Engines for Web and Enterprise Data

Welcome to the first module of this course! In this module, you will learn: (1) The major tasks involved in web search. (2) The history, evolution, impacts, and challenges of web search engines.

Search Engine Business Model

In this module, you will learn: (1) The different business models of web search engines.

TFxIDF

In this module, you will learn: (1) Different information retrieval models, including Boolean and statistical models. (2) How to determine the important words in a document using TFxIDF.
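
As a rough illustration of how TFxIDF picks out important words, the sketch below weights each term by its frequency in the document times the log of its rarity across the collection. The specific variant (term frequency normalized by the document's most frequent term, base-2 logarithm for IDF) and the function name `tf_idf` are assumptions made for this example; the module may use a different formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = (tf / max_tf) * log2(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values()) if tf else 1
        weights.append(
            {t: (f / max_tf) * math.log2(n / df[t]) for t, f in tf.items()}
        )
    return weights


docs = [
    ["web", "search", "engine"],
    ["web", "crawler"],
    ["search", "ranking", "search"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Under this variant, a term that occurs in every document gets weight zero, while a term unique to one document gets the highest IDF.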

Vector Space Model

In this module, you will learn: (1) How to represent a document or query as a vector of keywords. (2) How to determine the degree of similarity between a pair of vectors using different similarity measures, including Inner Product, Cosine Similarity, Jaccard Coefficient, and Dice Coefficient.
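
The four similarity measures named above can be sketched for weighted term vectors as follows. The function names are chosen for this example, and the Jaccard and Dice definitions are the usual weighted-vector generalizations, which is an assumption about the form used in the module.

```python
import math


def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    # Inner product normalized by the product of the vector lengths.
    denom = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / denom if denom else 0.0


def jaccard(x, y):
    dot = inner_product(x, y)
    denom = inner_product(x, x) + inner_product(y, y) - dot
    return dot / denom if denom else 0.0


def dice(x, y):
    denom = inner_product(x, x) + inner_product(y, y)
    return 2 * inner_product(x, y) / denom if denom else 0.0


doc = [1, 1, 0]    # weights for the keywords (web, search, ranking)
query = [1, 0, 1]
print(inner_product(doc, query), cosine(doc, query),
      jaccard(doc, query), dice(doc, query))
```

Unlike the inner product, the other three measures are normalized, so a long document does not score higher merely because it contains more terms.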

Inverted Files

In this module, you will learn: (1) How to index documents using inverted files. (2) How to perform updates and deletions on inverted files.
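
A minimal in-memory sketch of an inverted file with update and deletion is shown below. The class `InvertedIndex` and its method names are hypothetical, and keeping a forward index (document to terms) to make deletion cheap is a design choice assumed for this example; on-disk inverted files need further techniques that are not shown.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_terms = {}                # doc_id -> set of terms (forward index)

    def add(self, doc_id, tokens):
        """Index a document; re-adding an existing doc_id acts as an update."""
        if doc_id in self.doc_terms:
            self.delete(doc_id)
        for term in tokens:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.doc_terms[doc_id] = set(tokens)

    def delete(self, doc_id):
        """Remove the document from every posting list that mentions it."""
        for term in self.doc_terms.pop(doc_id, set()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def lookup(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term, {}))


index = InvertedIndex()
index.add(1, ["web", "search"])
index.add(2, ["web", "web", "index"])
print(index.lookup("web"))
index.delete(1)
print(index.lookup("web"))
```

Without the forward index, a deletion would have to scan every posting list to find the entries for the removed document.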

Extended Boolean Model

In this module, you will learn: (1) How to use the Extended Boolean Model to rank documents. (2) How to evaluate conjunctive and disjunctive queries using the Extended Boolean Model.
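
The sketch below scores a document against a disjunctive (OR) and a conjunctive (AND) query, assuming the module follows the p-norm formulation of the Extended Boolean Model with term weights in the range 0 to 1. The function names and the default p = 2 are assumptions for this example.

```python
def or_similarity(weights, p=2.0):
    """Disjunctive query: normalized distance from the all-zeros point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def and_similarity(weights, p=2.0):
    """Conjunctive query: one minus normalized distance from the all-ones point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Weights of the query terms in one document, e.g. normalized TFxIDF values.
weights = [0.6, 0.8]
print(or_similarity(weights), and_similarity(weights))
```

With p = 1 both functions reduce to the plain average of the weights, and as p grows they approach the strict Boolean maximum (OR) and minimum (AND).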

PageRank

In this module, you will learn: (1) The history and evolution of link-based ranking methods. (2) How to determine query/document similarities using HyPursuit, WISE, and PageRank. (3) Possible extensions that can be applied to PageRank.
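
A small power-iteration sketch of PageRank follows. The graph representation (a dict mapping each page to the list of pages it links to, with every link target also present as a key), the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions made for this example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph: {page: [pages it links to]}; every target must be a key."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in graph[v]:
                # Each page splits its rank equally among its out-links.
                new_rank[target] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

In this toy graph, page C receives links from both A and B and ends up with the highest rank, while B, which nothing links to, keeps only the baseline share.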

HITS Algorithm

In this module, you will learn: (1) How to calculate the hub and authority scores of web documents using the HITS algorithm. (2) How the re-ranking process in the HITS algorithm works.
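
The iterative hub and authority update can be sketched as below. The sketch assumes the link graph of the candidate pages is already available as a dict mapping each page to the pages it links to; how that set of pages is assembled from the query results is not shown, and the fixed iteration count and Euclidean normalization are choices made for this example.

```python
import math


def hits(graph, iterations=50):
    """graph: {page: [pages it links to]}; every target must be a key."""
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {v: 0.0 for v in nodes}
        for u in nodes:
            for v in graph[u]:
                auth[v] += hub[u]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
        # Normalize both score vectors so the values do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth


graph = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hub, auth = hits(graph)
print(hub)
print(auth)
```

Pages can then be re-ranked by authority score; in the toy graph, C is the strongest authority and A the strongest hub.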

Performance Evaluation of Information Retrieval System

In this module, you will learn: (1) How to evaluate the retrieval effectiveness of an information retrieval system using Precision, Recall, F-Measure, Average Precision, DCG, and NDCG. (2) Which subjective relevance measures can be used to evaluate an information retrieval system.
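
The measures listed above can be computed from a ranked result list and a set of relevance judgments, as in the sketch below. The function names are chosen for this example, the F-Measure shown is the balanced F1, and DCG uses the gain divided by log2(rank + 1), which is one common variant and an assumption about the form used in the module.

```python
import math


def precision_recall_f1(retrieved, relevant):
    found = len(set(retrieved) & set(relevant))
    p = found / len(retrieved) if retrieved else 0.0
    r = found / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    found, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            total += found / rank
    return total / len(relevant) if relevant else 0.0


def dcg(gains):
    """gains: graded relevance of the results, in ranked order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_recall_f1(ranked, relevant))
print(average_precision(ranked, relevant))
print(ndcg([3, 2, 0, 1]))
```

Precision and recall ignore result order, whereas Average Precision, DCG, and NDCG reward systems that place relevant documents near the top.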

Benchmarking

In this module, you will learn: (1) How to use the TREC collection for benchmarking. (2) The characteristics of the TREC collection.

Stopword removal and Stemming

In this module, you will learn: (1) What stemming is. (2) Different Context-Sensitive and Context-Free stemming algorithms. (3) How to calculate Successor Variety and Entropy for stemming.
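
To make Successor Variety and Entropy concrete, the sketch below counts the distinct letters that follow a prefix in a small vocabulary and computes the entropy of that letter distribution. The vocabulary, the function names, and the decision not to count end-of-word as a successor are assumptions made for this example.

```python
import math
from collections import Counter


def successors(prefix, vocabulary):
    """Count the letters that immediately follow the prefix in the vocabulary."""
    return Counter(
        word[len(prefix)]
        for word in vocabulary
        if word.startswith(prefix) and len(word) > len(prefix)
    )


def successor_variety(prefix, vocabulary):
    return len(successors(prefix, vocabulary))


def successor_entropy(prefix, vocabulary):
    counts = successors(prefix, vocabulary)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


vocabulary = ["able", "ape", "beatable", "fixable", "read", "readable",
              "reading", "reads", "red", "rope", "ripe"]
word = "readable"
for i in range(1, len(word)):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, vocabulary),
          round(successor_entropy(prefix, vocabulary), 3))
```

A sharp rise in successor variety after a prefix, as happens after "read" in this vocabulary, suggests a boundary between the stem and its suffix.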

Relevance Feedback

In this module, you will learn: (1) How to perform document space modification using relevance feedback. (2) How to perform query modification using relevance feedback.
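
Query modification can be illustrated with the Rocchio formula, a widely used formulation that moves the query vector toward the relevant documents and away from the non-relevant ones. Whether the module uses Rocchio specifically is an assumption; the function name, the parameter defaults, and the clipping of negative weights to zero are also choices made for this example.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector from user-judged document vectors."""

    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_centroid = centroid(relevant)
    non_centroid = centroid(non_relevant)
    # Negative weights are clipped to zero (a common convention, assumed here).
    return [
        max(0.0, alpha * q + beta * r - gamma * s)
        for q, r, s in zip(query, rel_centroid, non_centroid)
    ]


query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]
print(rocchio(query, relevant, non_relevant))
```

The modified query gains weight on the second term, which appears only in relevant documents, even though the user never typed it.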

Personalized Web Search

In this module, you will learn: (1) Why relative preference is more useful than absolute preference in personalization. (2) The importance of eye-tracking user studies in personalized web search. (3) How to model preferences as a weighted vector.

Index Term Selection

In this module, you will learn: (1) How to calculate the discrimination value for index term selection. (2) The importance of word usage in documents in search engine design.
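
One way to compute a discrimination value is to compare how similar the documents are to each other with and without a given term, as sketched below. The sketch assumes the space density is the average pairwise cosine similarity and that the discrimination value is the density without the term minus the density with it; the module may define density differently (for example, relative to a centroid), and the function names are chosen for this example.

```python
import math
from itertools import combinations


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    denom = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / denom if denom else 0.0


def density(docs):
    """Average pairwise cosine similarity of the document vectors."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(docs, k):
    """Density with term k removed minus density with all terms present."""
    without_k = [[w for i, w in enumerate(doc) if i != k] for doc in docs]
    return density(without_k) - density(docs)


# Term 0 occurs in both documents; terms 1 and 2 occur in one document each.
docs = [[1, 1, 0], [1, 0, 1]]
for k in range(3):
    print(k, round(discrimination_value(docs, k), 3))
```

A positive value means the term helps tell documents apart and is a good index term; a negative value, as for the term shared by every document, means it makes documents look more alike.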

Discovering Phrases and Correlated Terms

In this module, you will learn: (1) How to use collocated terms in lieu of strict phrases in search. (2) How to identify collocated terms using Pointwise Mutual Information (PMI). (3) How to utilize N-grams for search.
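
The sketch below scores adjacent word pairs (bigrams) by PMI, which compares how often two words occur together with how often they would co-occur if they were independent. Estimating the probabilities from raw counts in a single token list, restricting collocations to adjacent words, and the function name `pmi_scores` are assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(x, y): PMI} for every adjacent word pair in the token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "new york is a big city and new york is busy".split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

PMI tends to favor rare pairs, so on a corpus this small the top-scoring bigrams are simply those whose words occur once; in practice a minimum frequency threshold is usually applied before ranking.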