Classroom Contents
Reinforcement Learning in Recommender Systems - Some Challenges
- 1 Intro
- 2 RL in User-Facing/Interactive Systems RL has found tremendous success with deep models
- 3 Some Challenges in User-facing RL (RecSys) Scale: number of users (multi-user/MDPs) and actions (combinatorial, slates); idiosyncratic nature of actions
- 4 I. Stochastic Action Sets
- 5 SAS-MDPs: Constructing an MDP
- 6 SAS-MDPs: Solving Extended MDP
- 7 II. User-learning over Long Horizons Evidence of (very) slow user learning and adaptation
- 8 Advantage Amplification Temporal aggregation (e.g., fixed actions) can help amplify advantages
- 9 Advantage Amplification Temporal aggregation (e.g., fixed actions) can help amplify advantages
- 10 Advantage Amplification Key points
- 11 An MDP/RL Formulation Objective: max cumulative user engagement over session
- 12 The Problem: Item Interaction The presence of some items on the slate impacts user response, hence value of others
- 13 User Choice: Assumptions Two key, but reasonable, assumptions
- 14 Full Q-Learning Decomposition still holds, standard Q-learning update
- 15 Slate Optimization: Tractable Standard formulation: Fractional mixed-integer program
- 16 Slate Optimization: Tractable Standard formulation: Fractional mixed-integer program
- 17 Synthetic Experiments Synthetic environment
- 18 Robustness to User Choice Models Change user choice model to cascade (Joachims 2002)