Learning and Optimization with Seasonal Patterns

Key Question Being Explored in This Insight:

A standard assumption adopted in the multi-armed bandit (MAB) framework is that the mean rewards are constant over time. This assumption can be restrictive in the business world as decision-makers often face an evolving environment where the mean rewards are time-varying. Ningyuan Chen discusses a non-stationary MAB model with K arms whose mean rewards vary over time in a periodic manner.

View the page of: Ningyuan Chen

Institution: University of Toronto

Image courtesy of interviewee. April 9, 2024

This video primarily discusses the Multi-Armed Bandit (MAB) problem, a key concept in reinforcement learning and computer science. The MAB problem is illustrated using the metaphor of slot machines in a casino, where each machine (arm) offers different rewards, and a gambler must decide which machines to play in order to maximize their payout over time.

Key points include:

1. **MAB Framework**: The MAB problem involves exploration (trying each arm to learn their rewards) and exploitation (selecting the best-performing arm based on accumulated knowledge). A naive strategy might be to try each arm once, but due to the randomness of rewards, this is insufficient.

2. **Real-world Applications**: Two practical examples are given:
- **Parenting Decision**: The process of enrolling children in extracurricular activities can be thought of as a MAB problem, where parents explore different activities to find the ones their children enjoy and excel at.
- **Dynamic Pricing**: Businesses face a MAB scenario when determining optimal pricing for new products, as market demands can change.

3. **Dynamic Environment**: It is acknowledged that the rewards from arms may change over time due to factors like seasonality or competition. This adds complexity to the MAB problem, making it necessary to adapt strategies dynamically.

4. **Seasonality and Cycle Lengths**: The presenter suggests using structural patterns in the data, like seasonality, to improve decision-making within the MAB framework. By identifying these patterns, algorithms can better predict optimal strategies.

5. **Algorithm Development**: The discussed algorithm consists of two stages:
- The first stage focuses on learning the cycle lengths of rewards associated with various decisions (arms).
- The second stage uses this learned information to make informed decisions based on the cyclical nature of the environment.

6. **Performance Metrics**: The effectiveness of policies is evaluated using "regret," which measures how well a policy performs compared to an optimal "oracle" policy that knows the best option at all times.

7. **Future Work**: A gap remains in optimizing the regret in relation to the number of arms available, and the presenter encourages further research in this area.

Overall, the text emphasizes the relevance of the MAB framework in both academic and practical applications, touching on the importance of exploring new strategies and adapting to changing environments in order to maximize rewards.
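
The exploration and exploitation trade-off in point 1 can be made concrete with a small simulation. The sketch below uses UCB1, a standard textbook algorithm for the stationary (fixed mean reward) case; it is not the algorithm discussed in the video. The Bernoulli rewards, the arm means, and the names `pull` and `ucb1` are assumptions chosen for illustration.

```python
import math
import random


def pull(mean, rng):
    """Bernoulli reward (an assumption): pays 1 with probability `mean`, else 0."""
    return 1.0 if rng.random() < mean else 0.0


def ucb1(means, horizon, rng):
    """Run UCB1 for `horizon` rounds; return (total reward, pulls per arm).

    `means` holds the true mean reward of each arm. The learner never reads
    it directly; it is only used to simulate the random rewards.
    """
    k = len(means)
    counts = [0] * k   # how often each arm has been pulled
    sums = [0.0] * k   # total reward collected from each arm
    total = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # initial exploration: try every arm once
        else:
            # Optimism: empirical mean plus a bonus that shrinks as an arm
            # is pulled more often, so under-explored arms get revisited.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
            )
        reward = pull(means[arm], rng)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

A call such as `ucb1([0.2, 0.5, 0.8], 5000, random.Random(0))` should concentrate most pulls on the third arm. Committing after a single pull per arm, by contrast, can lock onto a worse arm, because one Bernoulli draw says very little about the mean.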

00:00.105 --> 00:02.125
So the multi-armed bandit framework is,

00:02.185 --> 00:04.965
is actually a very important framework

00:04.965 --> 00:07.165
or paradigm used in reinforcement learning

00:07.265 --> 00:08.925
and computer science nowadays.

00:09.265 --> 00:10.965
Um, the, the name is a bit funny, right?

00:11.325 --> 00:12.565
Multi-armed bandit, what does it even mean?

00:12.905 --> 00:15.245
So the bandit in this case is actually,

00:15.305 --> 00:17.645
if you in the audience have been to Las Vegas

00:17.785 --> 00:19.845
or any casino, it's a slot machine.

00:20.145 --> 00:21.645
So, you know, if you go to slot machine,

00:21.645 --> 00:22.845
there's an arm, right?

00:22.845 --> 00:25.245
And if you pull the arm, sometimes you lose the money.

00:25.245 --> 00:26.645
Sometimes you can win big,

00:26.865 --> 00:29.125
but it's, it's, it's a random reward in some sense.

00:29.305 --> 00:31.685
So that's what people call a one-armed bandit.

00:32.065 --> 00:33.805
Um, so what is a multi-armed bandit?

00:33.825 --> 00:37.005
So imagine it's a slot machine, but it has multiple arms.

00:37.345 --> 00:39.685
In fact, I think it should be multiple slot machines

00:39.685 --> 00:40.765
with like one arm each.

00:40.765 --> 00:43.445
But, you know, people in academia come up

00:43.445 --> 00:44.525
with strange names all the time.

00:44.665 --> 00:46.485
So this is a multi-armed bandit.

00:46.485 --> 00:48.685
You have a slot machine, but with multiple arms,

00:49.305 --> 00:51.845
and each arm will give you some reward,

00:51.845 --> 00:53.965
but the reward of the arms might be different, right?

00:53.965 --> 00:56.885
So let's say one of the arms could give you, say,

00:57.005 --> 01:00.005
a hundred dollars per 10 pulls on average, right?

01:00.075 --> 01:03.925
Another one could be giving you $30 per two pulls.

01:04.145 --> 01:06.325
So, so they give you different reward on average,

01:06.785 --> 01:08.005
and you are a gambler

01:08.265 --> 01:10.045
and you have faced, you know, you, you're in front

01:10.045 --> 01:12.445
of the machine and you wanna maximize your reward

01:12.695 --> 01:16.005
after, let's say, a hundred pulls or a hundred rounds.

01:16.385 --> 01:19.685
Uh, what's your strategy? So this is the multi-armed bandit framework.

01:20.065 --> 01:22.165
So, so if you think about the problem, right?

01:22.225 --> 01:23.765
Um, there's a few sort

01:23.765 --> 01:25.765
of salient features of, of this problem.

01:26.075 --> 01:28.325
Well, first, you know nothing about the arms.

01:28.515 --> 01:30.285
Some arms are good, some arms are bad,

01:30.545 --> 01:32.125
but you don't know a priori.

01:32.385 --> 01:35.645
So what, what the gambler wants to do is that it wants

01:35.645 --> 01:37.925
to pull arms sequentially so

01:37.925 --> 01:41.325
that over time it can identify the best arm

01:41.665 --> 01:43.805
and then play that arm more and more.

01:44.265 --> 01:47.765
So there is a bit of what people call an exploration

01:47.865 --> 01:49.805
and exploitation trade-off here.

01:49.945 --> 01:51.725
You want, you wanna learn initially

01:52.305 --> 01:54.445
and then try to, try to find the best one.

01:55.105 --> 01:57.045
But in this framework, there are a lot

01:57.045 --> 01:59.045
of nuances embedded in the framework.

01:59.265 --> 02:01.325
So one thing is that, I mean, one could think of

02:02.005 --> 02:05.405
a naive strategy, which is I'm going to pull each arm once

02:05.545 --> 02:06.725
and see which one is good,

02:06.745 --> 02:09.005
and then I'm going to settle on that arm, right?

02:09.005 --> 02:10.965
So that's, that's a sort of a,

02:11.205 --> 02:12.765
a naive way of thinking about the problem.

02:13.105 --> 02:16.965
But because the arms, they are volatile, meaning that,

02:17.075 --> 02:19.845
when you pull an arm, it may give you $0,

02:19.985 --> 02:22.445
it may give you $5, it may give you $10.

02:22.795 --> 02:25.205
It's uncertain. So one pull is not enough.

02:25.305 --> 02:27.405
You may, you may miss the, the best arm

02:27.405 --> 02:30.925
because it happened to give you say, $0 in that round.

02:31.305 --> 02:35.365
So you need to sort of

02:35.515 --> 02:39.525
estimate the arm reward in a reliable way

02:39.585 --> 02:41.485
so that over time you are more

02:41.485 --> 02:44.845
and more confident that you will get the best arm.

02:45.065 --> 02:46.125
So, so here, no.

02:46.125 --> 02:48.685
Okay, so maybe the audience are already getting a bit

02:48.685 --> 02:50.125
bored, because like,

02:50.125 --> 02:51.805
if I'm not gambling, why do I need it?

02:51.805 --> 02:52.845
Right? Why do I need to know this?

02:52.985 --> 02:55.485
So let me give you three, examples.

02:55.945 --> 02:58.765
Two, maybe two examples enough. So I'm, I'm a,

02:59.385 --> 03:01.965
I'm a parent of a 9-year-old and, and a 2-year-old.

03:01.985 --> 03:04.285
So my older daughter, Ariel, she's, she's nine.

03:04.625 --> 03:08.205
And as a parent of girls, of boys of that age, like,

03:08.225 --> 03:11.125
we constantly think about what kind of extracurriculars,

03:11.125 --> 03:13.765
like sports or music, we want to enroll them in, right?

03:14.225 --> 03:17.645
And as much as we want, like they have 24 hours a day.

03:17.745 --> 03:21.645
So let's say each day I can only enroll them in one

03:21.645 --> 03:24.045
of the classes, and then,

03:24.425 --> 03:27.445
but eventually I wanna find the hobby, the sport or music

03:27.585 --> 03:29.525
or whatever they are good at

03:29.585 --> 03:31.165
or they're interested in, right?

03:31.165 --> 03:32.245
So it's, it's a learning process,

03:32.825 --> 03:36.325
and you can think of it as on each day, I want

03:36.325 --> 03:38.205
to, say, pull an arm in this case,

03:38.205 --> 03:40.925
pulling an arm meaning enrolling in a specific activity.

03:41.345 --> 03:45.725
And then over time, I want, I want them to sort of drop out

03:45.725 --> 03:48.085
of the activities they don't like or they're not good at

03:48.145 --> 03:49.845
and sort of converge to the one

03:49.845 --> 03:51.165
that they are really passionate about.

03:51.545 --> 03:54.365
So that's essentially a multi-armed bandit problem.

03:54.905 --> 03:58.285
And if you apply the principle, into the, this kind

03:58.285 --> 04:01.925
of real world, application here, so I,

04:02.045 --> 04:03.405
I think there are two guiding principles.

04:03.445 --> 04:05.525
I mean, I'm not a parenting or education expert,

04:05.625 --> 04:07.045
but I think there are two,

04:07.165 --> 04:09.005
messages from the multi-armed bandit framework.

04:09.185 --> 04:12.565
One is that you need to do exploration early on, right?

04:12.665 --> 04:14.445
You want, you want to try different things,

04:14.865 --> 04:17.325
and then over time you want to, you want to like,

04:17.325 --> 04:20.085
leave those activities that, that they're not good at

04:20.105 --> 04:21.725
and try to converge to a few.

04:21.945 --> 04:24.445
So that's the first. The second message is that

04:24.865 --> 04:26.645
how much you explore versus

04:26.825 --> 04:28.685
how much you want to just optimize.

04:29.505 --> 04:31.805
It depends on age.

04:31.885 --> 04:33.405
So if you start from 5-year-old

04:33.405 --> 04:35.405
and you start from say, 13-year-old,

04:35.405 --> 04:36.605
the strategy could be different.

04:36.835 --> 04:40.045
That the proportion you should spend on exploration should

04:40.045 --> 04:43.045
depend on the, the time, how many rounds you're gonna play.

04:43.465 --> 04:46.245
So, so this is one example I usually like to, to use,

04:46.275 --> 04:47.445
to talk about the multi-armed bandit.

04:47.515 --> 04:50.405
Another one is closer to what I'm working on,

04:50.415 --> 04:51.645
which is say, pricing.

04:52.575 --> 04:54.755
So in the business world, it's,

04:54.755 --> 04:56.875
it's a very common problem that,

04:56.875 --> 05:00.355
when a new product gets released, the firm wants

05:00.415 --> 05:02.475
to find a good price for the product

05:02.475 --> 05:04.195
because they don't have past data, right?

05:04.475 --> 05:06.155
I could say, let's say that a firm is

05:06.155 --> 05:07.235
selling a winter jacket.

05:07.695 --> 05:10.395
Um, it could be a hundred dollars, it could be 150.

05:11.015 --> 05:12.715
It depends on the market demand,

05:12.815 --> 05:14.435
it depends on the market response.

05:14.855 --> 05:18.435
So as a firm, it's, it's a, it's a decision making problem.

05:18.505 --> 05:20.875
It's a dynamic decision making problem with a lot

05:20.875 --> 05:22.075
of unknown information.

05:22.375 --> 05:25.195
Is a hundred dollars the best or $150 best?

05:25.535 --> 05:28.195
So in this case, the firm could try different prices,

05:28.735 --> 05:30.635
and over time it can converge

05:30.655 --> 05:32.675
to the optimal price, let's say 120.

05:33.095 --> 05:36.395
So this kind of framework is also, used a lot

05:36.935 --> 05:40.275
in this type of, business decision making problem.

05:40.655 --> 05:44.675
So in the classic multi-armed bandit problem, we usually think

05:44.675 --> 05:47.635
of the arms

05:47.695 --> 05:50.755
and the reward of the arms as sort of fixed over time.

05:50.975 --> 05:53.235
So then the problem becomes a bit easier

05:53.235 --> 05:55.995
because let's say I'm a gambler, I just wanna find the arm,

05:56.095 --> 05:58.485
in, in the real world example, find a decision

05:58.485 --> 06:01.845
that maximizes my payoff or reward over time.

06:02.305 --> 06:05.365
But in, in many real world cases, it's not,

06:05.515 --> 06:06.765
it's not fixed, right?

06:06.785 --> 06:09.525
So that reward of an arm may change over time.

06:09.985 --> 06:12.325
So let's use the pricing example.

06:12.495 --> 06:14.925
Let's say I'm selling a winter jacket, right?

06:15.225 --> 06:16.605
Um, then a decision

06:16.825 --> 06:20.085
or an arm in this case is choosing a price, right?

06:20.345 --> 06:22.285
But, but the, the, the market demand

06:22.545 --> 06:25.725
for a certain price may change because of various reasons.

06:26.025 --> 06:29.045
One is seasonality. If you're selling winter jackets,

06:29.185 --> 06:32.485
you know, selling a winter jacket at $200 in the summer

06:33.065 --> 06:34.885
may give you very little reward, right?

06:35.025 --> 06:36.365
In this case, the firm's profit.

06:36.745 --> 06:38.125
Um, so it changes over time.

06:38.125 --> 06:39.805
Another factor could be competition.

06:39.805 --> 06:42.725
There might be another company that comes in selling very similar

06:42.965 --> 06:46.005
products, and then your, profit is gonna drop.

06:46.265 --> 06:50.445
So, so in the, in this case, it needs to, someone needs

06:50.445 --> 06:55.365
to revise the bandit framework to allow for the reward,

06:55.585 --> 06:57.805
the average reward to be changing over time.

06:58.575 --> 07:00.385
Okay? So then the problem

07:00.385 --> 07:03.185
becomes a lot harder because I'm not only learning

07:03.245 --> 07:04.665
how good this decision is,

07:04.765 --> 07:07.025
but I need to adapt to the changing environment

07:07.045 --> 07:10.185
and also need to learn how it changes, like over time,

07:10.365 --> 07:11.585
is it still good, right?

07:11.585 --> 07:13.785
Or has it become an inferior option?

07:14.245 --> 07:15.345
And it turns out,

07:15.345 --> 07:18.265
this problem has been studied in the literature,

07:18.265 --> 07:19.705
of computer science and AI,

07:20.045 --> 07:22.185
and it turns out to be a very hard problem.

07:22.645 --> 07:23.745
Uh, it's not hard

07:23.745 --> 07:27.065
because people cannot find a good algorithm for it.

07:27.415 --> 07:29.425
It's hard because people can show

07:29.695 --> 07:31.745
that no algorithm can perform well.

07:32.415 --> 07:35.385
Like you, there's no possibility you can find an algorithm

07:35.415 --> 07:39.185
that can learn it well, without losing something.

07:39.605 --> 07:41.785
Um, so this is where we start.

07:42.245 --> 07:43.745
Um, so we take this problem

07:43.805 --> 07:46.465
and think that, okay, if it's, if, if it's really hard

07:46.465 --> 07:49.465
to learn when it's a moving target, right?

07:49.485 --> 07:52.145
If it's really hard to learn the reward

07:52.145 --> 07:55.065
of arms over time when it's changing,

07:55.175 --> 07:57.025
what if there's some structure out there?

07:57.405 --> 08:01.025
So, so the structure we are identifying is the seasonality.

08:01.285 --> 08:03.955
So like not all changes are the same, right?

08:03.955 --> 08:06.395
So certain types of changes repeat themselves.

08:06.885 --> 08:09.715
There are cycles. So if I'm selling the winter jacket,

08:10.305 --> 08:13.475
it's probably a reasonable assumption to think

08:13.475 --> 08:16.755
that the demand repeats itself over a year, right?

08:17.055 --> 08:19.515
So the, the sales may peak in January,

08:19.655 --> 08:21.675
and then it slowly goes down and down and down

08:21.675 --> 08:23.475
and hit the bottom maybe in the summer,

08:23.655 --> 08:25.315
and then it picks up in the fall

08:25.435 --> 08:28.155
because some, you know, some forward looking customers start

08:28.155 --> 08:30.555
to prepare for their winter, apparel.

08:30.815 --> 08:34.235
So this type of seasonality can help,

08:34.345 --> 08:38.115
help our problem a lot, meaning that we can actually,

08:38.115 --> 08:40.115
learn the non-stationarity and

08:40.115 --> 08:44.475
try to find the optimal decision over time,

08:44.545 --> 08:47.435
without, you know, hitting the, the very challenging problem

08:47.465 --> 08:50.355
that, that the literature has identified.

08:50.815 --> 08:53.715
We find that if we impose a little bit of a structure,

08:53.725 --> 08:54.835
which is seasonality,

08:55.185 --> 08:58.005
and this structure is not really like impractical, right?

08:58.005 --> 08:59.765
We, we see this kind of thing over

08:59.785 --> 09:02.405
and over, in, in, in practice.

09:02.585 --> 09:05.525
And by imposing this slight, like, sort of,

09:05.925 --> 09:08.205
restricted structure, we can learn the non-stationarity

09:08.205 --> 09:10.405
really well. And this is the basis

09:10.505 --> 09:13.565
of the design of our algorithm, really.

09:14.025 --> 09:16.765
So, our algorithm consists of two stages.

09:17.465 --> 09:20.485
So in the, the, in the first stage, what we,

09:20.515 --> 09:22.205
what we basically do is that, okay,

09:22.445 --> 09:25.605
I know learning the reward of the arms

09:25.875 --> 09:28.005
that are changing over time is really hard.

09:28.585 --> 09:30.965
Um, but what can be learned is actually,

09:31.385 --> 09:32.525
the cycle length.

09:33.225 --> 09:36.165
So the arms may have the same

09:36.185 --> 09:38.845
or different cycles, yearly cycle,

09:38.845 --> 09:40.645
monthly cycle, weekly cycle.

09:40.985 --> 09:44.525
We can devise some algorithm, to learn the cycle length.

09:44.985 --> 09:48.205
And the key methodology we use is,

09:48.525 --> 09:50.485
the Fourier transform. So Fourier is

09:51.085 --> 09:52.325
a very famous mathematician.

09:52.355 --> 09:54.765
It's like, it's the Fourier transform.

09:54.845 --> 09:56.445
I, I think many of you,

09:56.785 --> 09:58.805
may have learned it in trigonometry.

09:58.825 --> 10:00.845
Uh, I know many of you hate trigonometry,

10:00.845 --> 10:03.165
but sometimes the things you hate still,

10:03.165 --> 10:04.325
are useful in certain areas.

10:04.665 --> 10:07.205
So what we show is that using this type of technique,

10:07.505 --> 10:11.045
we can identify the length of the cycle,

10:11.075 --> 10:12.845
with very high accuracy.

10:12.865 --> 10:15.565
So that's the first stage of the, of our algorithm.
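
The first stage described above, learning a cycle length with a Fourier transform, can be sketched as follows. This is a minimal illustration rather than the procedure from the study: it assumes a single arm observed in every round, a single dominant cycle, and a horizon that is a multiple of the cycle length. The function names, the sinusoidal reward model, and the noise level are all hypothetical.

```python
import cmath
import math
import random


def simulate_seasonal_rewards(period, n, rng, noise=0.1):
    """Hypothetical data: a sinusoidal mean reward plus Gaussian noise."""
    return [
        0.5 + 0.3 * math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise)
        for t in range(n)
    ]


def estimate_cycle_length(rewards):
    """Return the period (in rounds) of the strongest nonzero DFT frequency."""
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [x - mean for x in rewards]  # remove the zero-frequency term
    best_freq, best_power = 1, -1.0
    for f in range(1, n // 2 + 1):
        # Discrete Fourier coefficient at frequency f (f cycles per n rounds).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * f * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = f, power
    return round(n / best_freq)


if __name__ == "__main__":
    series = simulate_seasonal_rewards(12, 240, random.Random(1))
    print(estimate_cycle_length(series))
```

Picking only the single largest peak would miss a weaker secondary cycle, such as the monthly pattern mentioned later in the talk; inspecting the next-largest peaks in the same spectrum is one way such hidden cycles could surface.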

10:15.825 --> 10:18.885
And you may ask, well, is it really necessary, right?

10:18.885 --> 10:21.325
Like in the, in the case of selling winter apparel,

10:21.325 --> 10:23.485
like, we don't even need a graduate student

10:23.485 --> 10:26.925
to identify it; maybe a seventh grader can see

10:26.925 --> 10:29.325
that there's an annual cycle out there, right?

10:29.385 --> 10:31.565
Um, they don't have to be a business genius to, to,

10:31.565 --> 10:34.005
to identify this kind of cyclic pattern.

10:34.385 --> 10:37.205
But, but our, message here is that, well,

10:37.235 --> 10:39.445
it's not really always, it's not always the case

10:39.515 --> 10:41.085
that in this case might be,

10:41.085 --> 10:42.125
but it's not always the case

10:42.125 --> 10:44.845
that the cycle length is very easy to identify.

10:45.105 --> 10:48.165
And let me give you a slightly, different example.

10:48.545 --> 10:51.085
So I used to work with, with a hospital,

10:51.085 --> 10:53.965
it's an emergency department of, of American University.

10:54.785 --> 10:56.885
And what they want to know is that they, they want

10:56.885 --> 11:00.365
to learn the arrival pattern of patients, right?

11:00.385 --> 11:02.405
How many of them will arrive on Monday?

11:02.505 --> 11:04.925
How many of them in the morning, in the afternoon?

11:05.265 --> 11:08.645
And so this will help them, decide the staffing, right?

11:08.665 --> 11:10.805
How many people I need in, in the emergency room,

11:11.025 --> 11:12.645
and you know, in, in Canada.

11:12.785 --> 11:15.485
So in Canada, the healthcare system is really,

11:15.585 --> 11:16.645
really constrained.

11:16.665 --> 11:19.245
So this, this type of problem emerges everywhere in the

11:19.245 --> 11:20.885
world, I think especially after Covid.

11:21.585 --> 11:23.365
Um, so, so then,

11:23.505 --> 11:26.205
we can look at arrival pattern like every hour,

11:26.265 --> 11:28.645
how many patients arrive at the emergency department,

11:29.105 --> 11:32.365
and it's very clear that there's a weekly pattern, right?

11:32.585 --> 11:35.565
The weekly pattern here is that, I,

11:36.125 --> 11:38.805
I think it's over the weekend more people come in.

11:38.885 --> 11:41.205
I think those are the people who don't really have that

11:42.115 --> 11:43.125
serious an injury.

11:43.125 --> 11:44.165
So they choose the weekend.

11:44.585 --> 11:46.325
Um, so you, you see there's a peak.

11:46.665 --> 11:48.965
But what's interesting about this is that if you don't,

11:49.025 --> 11:51.165
if you don't, just look at the week, if,

11:51.185 --> 11:54.085
if you stretch out the data and look at longer horizon

11:54.385 --> 11:58.365
and apply the Fourier transform that, I described,

11:58.625 --> 12:00.925
you can identify some sort of monthly pattern.

12:01.425 --> 12:02.485
And that's kind of strange.

12:02.545 --> 12:05.445
And, and we initially, we didn't know how to explain it,

12:05.705 --> 12:07.245
but we still don't have a concrete,

12:07.325 --> 12:09.485
I think there's a few hypotheses there, like the,

12:09.505 --> 12:11.565
the moon cycle may affect people's

12:11.675 --> 12:13.165
mentality in some way, right?

12:13.345 --> 12:16.125
So there are different explanations

12:16.125 --> 12:19.445
for this pattern, but the pattern exists, it's not as strong

12:19.465 --> 12:21.685
as the weekly pattern, but there's a monthly pattern.

12:22.185 --> 12:25.325
So using this example, one can see

12:25.325 --> 12:28.245
that sometimes the cycle length or the,

12:28.245 --> 12:29.525
or the seasonality, right?

12:29.525 --> 12:32.805
It is not as straightforward as people could just see,

12:32.875 --> 12:34.485
like you are using their intuition.

12:34.945 --> 12:37.285
So this is why this type of analysis can

12:37.845 --> 12:41.045
identify those hidden cycles that are not as strong

12:41.105 --> 12:42.325
as the dominant cycle,

12:42.505 --> 12:44.725
or like overwhelmed by the dominant cycle.

12:44.825 --> 12:47.165
But we can, we still, we still, we can still learn that.

12:47.745 --> 12:51.725
So this is our, sort of first stage algorithm,

12:52.695 --> 12:53.955
and in the second stage,

12:54.095 --> 12:55.835
we would just use the information

12:55.835 --> 12:57.115
learned from the first stage.

12:57.495 --> 13:00.595
So we prove that, the cycle length you,

13:00.695 --> 13:04.435
we learn from the first stage is, is very accurate,

13:04.435 --> 13:06.755
with certain kind of, performance guarantee.

13:07.095 --> 13:11.515
And then we can just assume that those arms have the given,

13:11.625 --> 13:13.235
have the cycle that we've just learned.

13:13.735 --> 13:16.835
And why is this information important?

13:17.265 --> 13:18.595
Well, given the cycle length,

13:18.795 --> 13:20.755
I can just look at the different phases, right?

13:20.755 --> 13:24.395
So if I know the cycle repeats itself annually,

13:24.935 --> 13:27.835
and I'm, again, I'm selling the winter, jacket,

13:28.345 --> 13:30.715
then I could say, okay, in January,

13:31.175 --> 13:32.835
that's the phase of a cycle.

13:33.335 --> 13:34.915
Uh, what's the best decision?

13:35.185 --> 13:37.355
Well, is it a hundred dollars, $150?

13:37.835 --> 13:39.995
I learn the optimal decision in January,

13:40.455 --> 13:41.915
and then I learn it in February.

13:42.195 --> 13:45.475
I learn it in March; basically I learn the optimal decision

13:45.935 --> 13:48.755
in each phase of the cycle, right?

13:48.775 --> 13:50.075
So that solves the,

13:50.075 --> 13:52.995
basically solves the non-stationarity problem in the sense

13:52.995 --> 13:54.915
that given the structure,

13:54.935 --> 13:58.595
and given the learned cycle length, I can try to find the

13:59.275 --> 14:01.115
optimal decision for each phase.

14:01.615 --> 14:04.555
And I can design the, basically, it's a, it's a, the,

14:04.555 --> 14:06.795
the optimal decision changes over time, right?

14:06.815 --> 14:09.275
The January price is different from the February price,

14:09.275 --> 14:10.475
different from the March price,

14:10.855 --> 14:12.795
but we can learn it over time.
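
The second stage described above, learning the best decision separately for each phase of a known cycle, can be sketched as follows. This is an illustration under stated assumptions, not the policy from the study: the cycle length `period` is taken as already learned and shared by all arms, rewards are Bernoulli, and an independent UCB1 learner (a standard stationary-bandit algorithm) is run per phase. The names `phase_ucb` and `mean_fn` are hypothetical.

```python
import math
import random


def phase_ucb(mean_fn, k, period, horizon, rng):
    """Run one UCB1 learner per phase (t % period) of a known cycle.

    mean_fn(arm, t) gives the true mean reward of `arm` in round t. It is
    hidden from the learner and used only to simulate rewards and to score
    regret against an oracle that picks the best arm in every round.
    Returns (cumulative regret, pull counts indexed by [phase][arm]).
    """
    counts = [[0] * k for _ in range(period)]
    sums = [[0.0] * k for _ in range(period)]
    regret = 0.0
    for t in range(horizon):
        phase = t % period
        c, s = counts[phase], sums[phase]
        seen = sum(c)  # rounds observed so far in this phase
        if seen < k:
            arm = seen  # try every arm once within each phase
        else:
            arm = max(
                range(k),
                key=lambda a: s[a] / c[a]
                + math.sqrt(2.0 * math.log(seen + 1) / c[a]),
            )
        means = [mean_fn(a, t) for a in range(k)]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        c[arm] += 1
        s[arm] += reward
        regret += max(means) - means[arm]  # gap to the oracle in this round
    return regret, counts
```

With two arms whose means swap every other round, for example `mean_fn = lambda arm, t: 0.9 if arm == t % 2 else 0.1`, any fixed arm loses 0.8 in half the rounds, while the per-phase learner can settle on a different arm in each phase.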

14:13.255 --> 14:16.755
Uh, and that will give the firm a lot of, leverage,

14:16.755 --> 14:19.155
in terms of like how to implement the pricing policy

14:19.615 --> 14:21.075
and how to make more money,

14:21.425 --> 14:23.835
from the market. In this type of paper,

14:24.235 --> 14:27.715
although I try to make it as sort

14:27.715 --> 14:29.555
of practical as possible, but there's always a

14:29.555 --> 14:31.235
theoretical core to it.

14:31.615 --> 14:34.675
Uh, meaning what is good, right? What is a good policy?

14:34.975 --> 14:37.475
So usually in experimental science, the,

14:37.475 --> 14:40.475
the good policy should perform empirically well, right?

14:40.485 --> 14:44.035
Let's say I have this real problem, I implement the policy,

14:44.415 --> 14:46.635
and then let's say it's better than the benchmark

14:46.775 --> 14:48.635
by 50% or 10%.

14:48.855 --> 14:50.635
Um, that, that's one rule.

14:51.145 --> 14:53.805
But in the theoretical sort of computer science

14:53.825 --> 14:57.005
and AI literature, usually people wanna say, okay, so,

14:57.025 --> 15:01.565
so is there a quantifiable metric showing that, regardless of

15:01.595 --> 15:04.325
where the data is coming from, the policy is good?

15:04.505 --> 15:06.125
So in this case, the, the metric

15:06.125 --> 15:09.445
that people are looking at is, is called the regret, right?

15:09.585 --> 15:10.845
The regret is defined.

15:10.985 --> 15:13.445
Um, it's similar to regret as we usually interpret it.

15:13.665 --> 15:15.245
So, so you can think of this way,

15:15.295 --> 15:16.765
let's say you have an oracle.

15:17.265 --> 15:19.525
So this oracle, it knows everything.

15:19.745 --> 15:21.845
It doesn't need to learn. So in this case,

15:21.905 --> 15:24.405
the Oracle would just pick the best price from month

15:24.405 --> 15:26.125
to month because it knows everything.

15:26.265 --> 15:27.805
So it's, it's gotta be good, right?

15:28.005 --> 15:29.285
It's the best that one can do.

15:29.825 --> 15:31.405
And for any policy that

15:31.425 --> 15:34.725
or algorithm we want to develop, it's always the case

15:34.945 --> 15:36.845
how far we are from this Oracle, right?

15:36.845 --> 15:38.325
The smaller the gap is, the better

15:38.325 --> 15:40.165
because we really cannot beat it.

15:40.315 --> 15:42.325
It's, it's the best, it's, it's cheating, basically.

15:42.745 --> 15:44.725
Um, and this is what regret means.

15:45.145 --> 15:48.005
So regret in this case is that, okay,

15:48.185 --> 15:51.965
what's the gap from the policy that we developed

15:52.105 --> 15:53.405
and the Oracle policy?
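
One common way to write this gap down is sketched below. The notation is an assumption, not taken from the talk: $\mu_{k,t}$ is the mean reward of arm $k$ in round $t$, $\pi_t$ is the arm the policy pulls in round $t$, $K$ is the number of arms, and $T$ is the horizon. The oracle picks the best arm in every round, and the requirement that regret "go to zero" is read here as the per-round average vanishing.

```latex
R_T \;=\; \sum_{t=1}^{T} \Bigl( \max_{k \in \{1,\dots,K\}} \mu_{k,t} \;-\; \mu_{\pi_t,\,t} \Bigr),
\qquad
\frac{R_T}{T} \;\to\; 0 \quad \text{as } T \to \infty .
```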

15:53.905 --> 15:57.005
And there are a few sort of criteria we want to have.

15:57.345 --> 16:01.325
One is that the regret over time should go to zero, right?

16:01.465 --> 16:04.925
As you have more time to learn, I wanna be as close

16:05.505 --> 16:07.485
to the oracle as possible.

16:07.865 --> 16:11.085
So this is one, dimension to, to the regret.

16:11.085 --> 16:12.725
Like, it should diminish,

16:12.785 --> 16:14.325
it should diminish over T.

16:14.325 --> 16:17.165
When you have, when you have longer time horizon,

16:17.425 --> 16:19.405
you get better and better performance, getting

16:19.405 --> 16:20.645
closer and closer to the oracle.

16:21.345 --> 16:24.205
And there are other dimensions, for example, in,

16:24.205 --> 16:27.685
in the paper, in this study, what we explore is the dimension

16:27.755 --> 16:31.365
with respect to K, which is how many arms you have,

16:31.365 --> 16:32.405
how many decisions you have.

16:32.625 --> 16:37.445
So intuitively, the more decisions there are, the,

16:37.545 --> 16:39.645
the, the harder the problem is, right?

16:39.645 --> 16:41.125
Because you wanna learn more.

16:41.285 --> 16:44.485
I wanna know if like compare, I just want

16:44.485 --> 16:46.485
to compare price a hundred, 150.

16:46.835 --> 16:50.725
That seems to be an easier task than comparing prices a

16:50.725 --> 16:54.005
hundred, 101, 102, all the way to 150, right?

16:54.005 --> 16:55.765
That looks like a,

16:55.835 --> 16:57.885
that looks like a more daunting task.

16:58.465 --> 17:03.085
Um, so what we show is that we show the regret of our,

17:03.265 --> 17:06.365
algorithm and we show that it achieves the

17:06.875 --> 17:10.045
best possible regret in the time horizon,

17:10.545 --> 17:12.085
but still there's a little gap.

17:12.305 --> 17:14.405
And this gap is something we tried very hard,

17:14.425 --> 17:15.485
we were not able to close.

17:15.625 --> 17:18.685
The gap is that, that the regret seems to be,

17:19.205 --> 17:21.245
improvable in the number of arms.

17:21.505 --> 17:23.165
So we, we, we get some result

17:23.165 --> 17:26.005
and it's really good, as a function of the number

17:26.005 --> 17:27.685
of arms, but it's not the best.

17:28.185 --> 17:32.205
So in this case, if among the audience, if you are,

17:32.585 --> 17:33.805
if you like mathematics

17:33.945 --> 17:36.085
and if you like this type of problem, you can definitely,

17:36.145 --> 17:37.805
read our work

17:38.265 --> 17:41.325
and we explain like why this is really hard to achieve.

17:41.345 --> 17:43.965
We try different ways and we give like the, the,

17:43.965 --> 17:45.765
the results we tried that didn't work.

17:46.025 --> 17:47.405
So you can, you can follow that

17:47.465 --> 17:51.125
and maybe, if you were able to close the gap, I think

17:51.125 --> 17:52.525
that will be a very important

17:52.585 --> 17:54.885
and interesting contribution to this literature.

17:54.985 --> 17:57.805
So I would encourage the audience to, to try that out.

Understanding the Multi-Armed Bandit Framework

00:00:00 - So the multi-armed bandit framework is actually a very important framework, or paradigm, used in reinforcement learning and computer science nowadays. The name is a bit funny, right? Multi-armed bandit, what does that even mean? So the bandit in this case is actually, if the audience has been to Las Vegas or any casino, a slot machine. So, you know, if you go to a slot machine, there's an arm, right? And if you pull the arm, sometimes you lose money, sometimes you can win big, but it's a random reward in some sense. So that's what people call a one-armed bandit.
00:00:33 - So imagine it's a slot machine, but it has multiple arms. In fact, I think it should be multiple slot machines with one arm each. But, you know, people in academia come up with strange names all the time. So this is a multi-armed bandit. You have a slot machine with multiple arms, and each arm will give you some reward, but the rewards of the arms might be different, right? So let's say one of the arms could give you, say, a hundred dollars per 10 pulls on average, right? Another one could be giving you $30 per two pulls.
00:01:06 - And you are a gambler, you're in front of the machine, and you wanna maximize your reward after, let's say, a hundred pulls or a hundred rounds. What's your strategy? So this is the multi-armed bandit framework. So if you think about the problem, right, there are a few salient features of this problem. Well, first, you know nothing about the arms. Some arms are good, some arms are bad, but you don't know a priori. So what the gambler wants to do is that it wants
00:01:37 - that over time it can identify the best arm and then play that arm more and more. So there is what people call an exploration and exploitation trade-off here. You wanna learn initially and then try to find the best one. But there are a lot of nuances embedded in the framework. So one thing is that one could think of a naive strategy, which is I'm going to pull each arm once and see which one is good,
00:02:09 - So that's a sort of naive way of thinking about the problem. But because the arms are volatile, meaning that when you pull an arm, it may give you $0, it may give you $5, it may give you $10, it's uncertain. So one pull is not enough. You may miss the best arm because it happened to give you, say, $0 in that round. So you need to estimate the arm reward in a reliable way
00:02:41 - and more confident that you will get the best arm. Okay, so maybe the audience is already getting a bit bored, because, like, if I'm not gambling, why do I need to know this? Right? So let me give you three examples. Maybe two examples are enough. So I'm a parent of a 9-year-old and a 2-year-old. So my older daughter, Ariel, she's nine. And as parents of girls or boys of that age, we constantly think about what kind of extracurricular

Exploring Interests Through Multi-Armed Bandit

00:03:14 - And as much as we want, they have 24 hours a day. So let's say each day I can only enroll them in one of the classes, but eventually I wanna find the hobby, the sport or music or whatever they are good at or interested in, right? So it's a learning process, and you can think of it as: on each day, I pull an arm, and pulling an arm in this case means enrolling in a specific activity.
00:03:45 - of the activities they don't like or they're not good at, and sort of converge to the one that they are really passionate about. So that's essentially a multi-armed bandit problem. And if you apply the principle to this kind of real-world application, I think there are two guiding principles. I mean, I'm not a parenting or education expert, but I think there are two messages from the multi-armed bandit framework. One is that you need to do exploration early on, right?
00:04:14 - and then over time you want to leave those activities that they're not good at and try to converge to a few. So that's the first. The second message is that how much you explore versus how much you just optimize depends on age. So if you start from a 5-year-old or you start from, say, a 13-year-old, the strategy could be different. The proportion you should spend on exploration should depend on the time, how many rounds you're gonna play.
00:04:46 - talk about multi-armed bandits. Another one is closer to what I'm working on, which is, say, pricing. So in the business world, it's a very common problem that when a new product gets released, the firm wants to find a good price for the product, because they don't have past data, right? Let's say that a firm is selling a winter jacket. It could be a hundred dollars, it could be 150. It depends on the market demand, it depends on the market response.
00:05:18 - It's a dynamic decision-making problem with a lot of unknown information. Is a hundred dollars the best, or is $150 the best? So in this case, the firm could try different prices, and over time it can converge to the optimal price, let's say 120. So this kind of framework is also used a lot in this type of business decision-making problem. So in the classic multi-armed bandit problem, we usually think of the reward as the arm
00:05:50 - So then the problem becomes a bit easier, because, let's say I'm a gambler, I just wanna find the arm, or in the real-world example, find a decision, that maximizes my payoff or reward over time. But in many real-world cases, it's not fixed, right? So the reward of an arm may change over time. So let's use the pricing example. Let's say I'm selling a winter jacket, right? Then a decision, or an arm in this case, is choosing a price, right?
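
The classic fixed-reward setting described above can be sketched in a few lines. This is only an illustration, assuming rewards in [0, 1] and using UCB1, a standard textbook strategy that the talk itself does not name. The helper names `ucb1` and `make_bernoulli_arms` and the arm probabilities are hypothetical.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * n_arms    # how often each arm has been pulled
    means = [0.0] * n_arms   # running average reward of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initial exploration: pull every arm once
        else:
            # Pick the arm with the highest optimistic estimate:
            # average reward plus a bonus that shrinks with more pulls.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def make_bernoulli_arms(probs, seed=0):
    """Hypothetical test environment: arm i pays 1 with probability probs[i]."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probs[arm] else 0.0


if __name__ == "__main__":
    pull = make_bernoulli_arms([0.2, 0.5, 0.8])
    counts, means = ucb1(pull, n_arms=3, horizon=5000)
    print(counts)  # most pulls should go to the last (best) arm
```

The bonus term is what keeps a single unlucky pull from ruling out a good arm: rarely tried arms keep a large bonus and get revisited until their estimate is reliable.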

Challenges of Dynamic Pricing and Seasonality

00:06:22 - for a certain price may change because of various reasons. One is seasonality. If you're selling winter jackets, you know, selling a winter jacket at $200 in the summer may give you very little reward, right? In this case, the firm's profit. So it changes over time. Another factor could be competition. There might be another company that comes in selling very similar products, and then your profit is gonna drop. So in this case, someone needs
00:06:55 - the average reward to be changing over time. Okay? And then the problem becomes a lot harder, because I'm not only learning how good this decision is, but I need to adapt to the changing environment and also need to learn how it changes. Like, over time, is it still good, right? Or has it become an inferior option? And it turns out this problem has been studied in the literature of computer science and AI, and it turns out to be a very hard problem.
00:07:23 - because people cannot find a good algorithm for it. It's hard because people can show that no algorithm can perform well. There's no possibility you can find an algorithm that can learn it well without losing something. So this is where we start. So we take this problem and think that, okay, if it's really hard to learn when it's a moving target, right, if it's really hard to learn the reward of arms over time when it's changing,
00:07:57 - So the structure we are identifying is seasonality. So not all changes are the same, right? Certain types of changes repeat themselves. There are cycles. So if I'm selling the winter jacket, it's probably a reasonable assumption to think that the demand repeats itself over a year, right? So the sales may peak in January, and then they slowly go down and down and hit the bottom maybe in the summer, and then pick up in the fall
00:08:28 - to prepare for their winter apparel. So this type of seasonality can help our problem a lot, meaning that we can actually learn the non-stationarity and try to find the optimal decision over time without, you know, hitting the very challenging problem that the literature has identified. We find that if we impose a little bit of structure, which is seasonality, and this structure is not really impractical, right?
00:08:59 - and over, in practice. And by imposing this slightly restricted structure, we can learn the non-stationarity really well. And this is the basis of the design of our algorithm, really. So our algorithm consists of two stages. In the first stage, what we basically do is that, okay, I know learning the reward of the arms that are changing over time is really hard.

Identifying Cycle Length in Complex Patterns

00:09:31 - the cycle length. So the arms may have the same or different cycles: yearly cycle, monthly cycle, weekly cycle. We can devise some algorithm to learn the cycle length. And the key methodology we use is the Fourier transform. So Fourier is a very famous mathematician. It's the Fourier transform. I think many of you may have learned it in trigonometry.
00:10:00 - but sometimes the things you hate are still useful in certain areas. So what we show is that, using this type of technique, we can identify the length of the cycle with very high accuracy. So that's the first stage of our algorithm. And you may ask, well, is it really necessary, right? Like, in the case of selling winter apparel, we don't even need a graduate student to identify it; maybe a seventh grader can see that there's an annual cycle out there, right?
00:10:31 - to identify this kind of cyclic pattern. But our message here is that, well, in this case it might be, but it's not always the case that the cycle length is very easy to identify. And let me give you a slightly different example. So I used to work with a hospital. It's an emergency department of American University. And what they want to know is, they want to learn the arrival pattern of patients, right?
00:11:02 - How many of them in the morning, in the afternoon. And so this will help them decide the staffing, right? How many people do I need in the emergency room? And, you know, in Canada, the healthcare system is really, really constrained. So this type of problem emerges everywhere in the world, I think, especially after Covid. So then we can look at the arrival pattern, like, every hour, how many patients arrive at the emergency department, and it's very clear that there's a weekly pattern, right?
00:11:36 - I think it's over the weekend that more people come in. I think those are the people whose injuries are not really that serious, so they choose the weekend. So you see there's a peak. But what's interesting about this is that if you don't just look at the week, if you stretch out the data and look at a longer horizon and apply the Fourier transform that I described, you can identify some sort of monthly pattern. And that's kind of strange. And initially, we didn't know how to explain it,
00:12:07 - I think there are a few hypotheses there, like the moon cycle affecting people's mentality in some way, right? So there are different explanations for this pattern, but the pattern exists. It's not as strong as the weekly pattern, but there's a monthly pattern. So using this example, one can see that sometimes the cycle length, or the seasonality, right, is not as straightforward as something people could just see using their intuition. So this is why this type of analysis can
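
To make the first stage concrete, here is a rough sketch of finding a cycle length with a discrete Fourier transform. It is not the estimator from the paper, whose details are not given in the talk; it simply picks the frequency with the largest power in a plain periodogram. The function name `dominant_period` and the synthetic 7-day series are assumptions for illustration.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant component
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # k-th discrete Fourier coefficient of the centered series
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    # Frequency index k corresponds to a cycle of n / k samples.
    return n / best_k


if __name__ == "__main__":
    # Synthetic daily data with a weekly cycle, observed for 12 weeks.
    data = [10 + 3 * math.sin(2 * math.pi * t / 7) for t in range(84)]
    print(dominant_period(data))
```

Candidate periods in this sketch are limited to values of the form n / k, so the series should span several full cycles. A weaker secondary cycle, like the monthly pattern in the hospital example, would show up as a smaller peak in the same power values rather than as the maximum.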

Optimal Decision-Making Through Cycle Analysis

00:12:41 - as the dominant cycle, or like overwhelmed by the dominant cycle. But we can still learn that. So this is our sort of first-stage algorithm, and in the second stage, we would just use the information learned from the first stage. So we prove that the cycle length we learn from the first stage is very accurate, with a certain kind of performance guarantee. And then we can just assume that those arms have the given
00:13:13 - And why is this information important? Well, given the cycle length, I can just look at the different phases, right? So if I know the cycle repeats itself annually, and, again, I'm selling the winter jacket, then I could say, okay, January, that's a phase of the cycle. What's the best decision? Well, is it a hundred dollars or $150? I learn the optimal decision for January, and then I learn it for February.
00:13:45 - in each phase of the cycle, right? So that basically solves the non-stationarity problem, in the sense that given the structure and given the learned cycle length, I can try to find the optimal decision for each phase. And basically, the optimal decision changes over time, right? The January price is different from the February price, different from the March price, but we can learn it over time.
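
The second-stage idea of learning a separate best decision for each phase can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes every arm shares one known cycle length (the talk notes that arms may have different cycles) and reuses a textbook UCB1 rule inside each phase. The names `phase_ucb` and `make_seasonal_arms` are hypothetical.

```python
import math
import random


def phase_ucb(pull, n_arms, period, horizon):
    """Keep separate UCB1 statistics for each phase t % period of a known cycle.

    `pull(arm, t)` returns a reward in [0, 1] for playing `arm` in round `t`.
    Returns, for each phase, the arm that was played most often.
    """
    counts = [[0] * n_arms for _ in range(period)]
    means = [[0.0] * n_arms for _ in range(period)]
    for t in range(horizon):
        phase = t % period
        untried = [a for a in range(n_arms) if counts[phase][a] == 0]
        if untried:
            arm = untried[0]  # try every arm once within this phase
        else:
            visits = sum(counts[phase])  # rounds already spent in this phase
            arm = max(
                range(n_arms),
                key=lambda a: means[phase][a]
                + math.sqrt(2 * math.log(visits + 1) / counts[phase][a]),
            )
        reward = pull(arm, t)
        counts[phase][arm] += 1
        means[phase][arm] += (reward - means[phase][arm]) / counts[phase][arm]
    return [max(range(n_arms), key=lambda a: counts[p][a]) for p in range(period)]


def make_seasonal_arms(probs_by_phase, seed=0):
    """Hypothetical environment: in phase p, arm i pays 1 w.p. probs_by_phase[p][i]."""
    rng = random.Random(seed)
    period = len(probs_by_phase)
    return lambda arm, t: 1.0 if rng.random() < probs_by_phase[t % period][arm] else 0.0


if __name__ == "__main__":
    # Two "prices" whose ranking flips between the two phases of the cycle.
    pull = make_seasonal_arms([[0.9, 0.1], [0.1, 0.9]], seed=1)
    print(phase_ucb(pull, n_arms=2, period=2, horizon=4000))
```

Once the cycle length is fixed, each phase behaves like its own stationary bandit, which is why the per-phase statistics never mix January data with February data in this sketch.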
00:14:16 - in terms of, like, how to implement the pricing policy and how to make more money from the market in this type of paper. Although I try to make it as practical as possible, there's always a theoretical core to it, meaning: what is good, right? What is a good policy? So usually in experimental science, the good policy should perform well empirically, right? Let's say I have this real problem, I implement the policy,
00:14:46 - by 50% or 10%. That's one rule. But in the theoretical computer science and AI literature, usually people wanna say, okay, is there a quantifiable metric such that, regardless of where the data is coming from, the policy is good? So in this case, the metric that people are looking at is called the regret, right? The regret as defined here, it's similar to the regret as we interpret it. So you can think of it this way,
00:15:17 - So this oracle, it knows everything. It doesn't need to learn. So in this case, the oracle would just pick the best price from month to month because it knows everything. So it's gotta be good, right? It's the best that one can do. And for any policy or algorithm we want to develop, the question is always how far we are from this oracle, right? The smaller the gap is, the better, because we really cannot beat it. It's the best; it's cheating, basically. And this is what regret means. So regret in this case is that, okay,

Evaluating Regret in Decision-Making Algorithms

00:15:52 - and the oracle policy? And there are a few criteria we want to have. One is that the regret over time should go to zero, right? As you have more time to learn, I wanna be as close to the oracle as possible. So this is one dimension to the regret. It should diminish over T. When you have a longer time horizon, you get better and better performance, getting
00:16:21 - And there are other dimensions. For example, in the paper, in this study, we explore the dimension with respect to K, which is how many arms you have, how many decisions you have. So intuitively, the more decisions there are, the harder the problem is, right? Because you wanna learn more. If I just want to compare price a hundred and 150, that seems to be an easier task than comparing prices a
00:16:54 - That looks like a more daunting task. So what we show is the regret of our algorithm, and we show that it achieves the best possible regret in the time horizon, but still there's a little gap. And this gap is something we tried very hard on, and we were not able to close it. The gap is that the regret seems to be improvable in the number of arms.
00:17:23 - and it's really good as a function of the number of arms, but it's not the best. So in this case, if you are among the audience, if you like mathematics and if you like this type of problem, you can definitely read our work, and we explain why this is really hard to achieve. We tried different ways, and we give, like, the results we tried that didn't work. So you can follow that, and maybe, if you were able to close the gap, I think that would be a very important
00:17:54 - So I would encourage the audience to try that out.
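
One common way to write down the regret described in this section, offered as a sketch. The symbols are assumptions introduced here rather than notation quoted from the talk: $\mu_{k,t}$ is the mean reward of arm $k$ in round $t$, $\pi_t$ is the arm the policy pulls in round $t$, $K$ is the number of arms, and $T$ is the time horizon.

```latex
R(T) = \sum_{t=1}^{T} \Bigl( \max_{1 \le k \le K} \mu_{k,t} - \mathbb{E}\bigl[\mu_{\pi_t,t}\bigr] \Bigr),
\qquad
\frac{R(T)}{T} \to 0 \quad \text{as } T \to \infty .
```

Under this reading, the oracle collects the first term in every round, and "regret going to zero over time" corresponds to the average per-round regret vanishing as the horizon grows. How $R(T)$ scales with $K$ is the open gap mentioned at the end of the talk.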
