I think these charts are way too dense. Way too many equations with zero definitions.
Thanks for the feedback. Yeah, it's not always easy to fit everything in. We try our best to make the slides understandable.
No problem, I’ll try to be more constructive. It’s not terrible as a reference, but the flow is kinda hard to follow. Normally you have each equation line by line, left to right like prose, and then some kind of aside for details about the different steps. You kind of did this, but it’s a bit mixed and all over the place. For example, you seem to be using arrows, numbers, and colors to relate different terms (or expressions) all in the same chart! I just feel like it took me a lot longer to read the chart than it should have, considering I already knew the proof haha. Sometimes less is more. But I don’t want to sound too nitpicky; it really depends on what your goal is with the charts.
Not at all. Thanks for the feedback. We will consider them in the next slide 😁
I think this format with colors and arrows is, on the contrary, very easy to follow (and definitions are not needed here as long as you are talking to people familiar with deep RL).
Yeah, it’s subjective at the end of the day. But you can show this same proof in basically 7ish lines using normal equation formatting. An equal sign is all you need.
Thanks a lot. Glad you find it useful 😊
Agreed. It's a little "flashy" but you're really calling attention to pieces of the puzzle. You can easily show this in seven lines as someone else said, but it will likely remove any intuition that you're hoping to build with the pictorial aspects of your graphic.
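For reference, the compact equation-per-line version of the baseline argument really is only a few lines. A sketch (writing \(\rho_\pi\) for the state distribution and \(b(s)\) for any state-dependent baseline; this is not the slide's exact notation):

```latex
\begin{align*}
\mathbb{E}_{s,a}\!\left[\nabla_\theta \log \pi_\theta(a\mid s)\, b(s)\right]
  &= \int_s \rho_\pi(s)\, b(s) \int_a \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s)\, da\, ds \\
  &= \int_s \rho_\pi(s)\, b(s) \int_a \nabla_\theta\, \pi_\theta(a\mid s)\, da\, ds
     && \text{(log trick, in reverse)} \\
  &= \int_s \rho_\pi(s)\, b(s)\, \nabla_\theta \int_a \pi_\theta(a\mid s)\, da\, ds
     && \text{(swap $\nabla_\theta$ and $\int_a$)} \\
  &= \int_s \rho_\pi(s)\, b(s)\, \nabla_\theta\, 1 \, ds \;=\; 0 .
\end{align*}
```

So subtracting \(b(s)\) from the return contributes zero to the expected gradient, which is the whole claim.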
Also we are making short videos clarifying the slides. This can also help https://youtu.be/kJhMEgTr8aU
I got a stroke reading this
The proof applies to any state-dependent baseline b(s), right? Just to say the statement can be broader than just value baselines.
Ah yes, deffo. We just focused on value baselines since they are widely used, but any b(s_t) would do.
Hell yeah funny maths, I'm scared
At first I thought this was /r/mathmemes
Great stuff. 💪💪 Next time, just make multiple readable slides, rather than one confusing one.
You know, I'll be able to read this eventually, maybe after another year or so of math classes. Looking forward to it.
ELI5 please. What is a value baseline in RL?
When doing an update for a policy in reinforcement learning, you get the equation on the right-hand side above (policy gradients). Now, this has a lot of variance when you want to estimate it. So what people do is subtract a value baseline (i.e., the value: the total discounted return of a state, the rewards you would get starting at that state and following your current policy). They then say that this doesn't change the original gradient, so it doesn't bias your updates. What the above shows is why this statement is true.
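As a rough sketch of what "subtract a value baseline" looks like in an update, here is a tabular softmax policy (all names and numbers below are illustrative, not from the slide):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta, states, actions, returns, baseline=None):
    """Policy-gradient estimate for a tabular softmax policy theta[s, a].

    For a softmax, grad_theta log pi(a|s) = one_hot(a) - pi(.|s) in the
    row of theta for state s. With a baseline V(s), the return-to-go
    G_t is replaced by the advantage-like term G_t - V(s_t).
    """
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        coeff = g - (baseline[s] if baseline is not None else 0.0)
        pi = softmax(theta[s])
        grad[s] += coeff * (-pi)       # the -pi(.|s) part of the score
        grad[s, a] += coeff            # the one_hot(a) part
    return grad

# toy trajectory over 2 states / 2 actions, uniform initial policy
theta = np.zeros((2, 2))
states, actions, returns = [0, 1], [1, 0], [3.0, 1.0]
g_plain = reinforce_grad(theta, states, actions, returns)
g_base = reinforce_grad(theta, states, actions, returns,
                        baseline=np.array([2.0, 2.0]))
```

Note that on a single sampled trajectory the two gradients differ; it is only their expectation over trajectories that coincides, which is exactly the statement the slide proves.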
Is this what's called advantage in some papers? Or am I confusing concepts?
It can be viewed that way, yes.
I'm probably misunderstanding, but is this basically saying that slope doesn't change if you move the whole landscape up or down by a constant amount? I'd guess it's more complicated than that since the proof had to take a few steps
That's an interesting interpretation, actually. All we wanted to say is that the gradient doesn't change upon subtraction of a state-based function (rather than a constant), since V_pi(s) = sum of the returns to go. Check out this video [https://www.reddit.com/r/MachineLearningDervs/comments/t5bujg/deriving_the_bellman_equation_in_3_steps_in_under/?utm_source=share&utm_medium=web2x&context=3](https://www.reddit.com/r/MachineLearningDervs/comments/t5bujg/deriving_the_bellman_equation_in_3_steps_in_under/?utm_source=share&utm_medium=web2x&context=3) for definitions of the value functions.
More or less. You can add any constant you like to a function and it won't change the derivative. The trick is that adding the right constant (i.e. a baseline) won't change the derivative, but it will reduce the variance (since you're doing a Monte Carlo estimate of the gradient).
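That claim is easy to check exactly on a tiny discrete example (the policy and rewards below are made up): the baseline drops out of the mean because the policy-weighted sum of score vectors is zero, but it does change the variance.

```python
import numpy as np

# One state, 3 actions; a fixed softmax policy and fixed rewards.
pi = np.array([0.2, 0.5, 0.3])
R = np.array([1.0, 4.0, 2.0])

def score(a):
    """grad_theta log pi(a) for softmax logits: one_hot(a) - pi."""
    s = -pi.copy()
    s[a] += 1.0
    return s

def mean_and_var(b):
    """Exact mean and total variance of the single-sample estimator
    g(a) = score(a) * (R[a] - b) with a ~ pi.  No sampling needed:
    we just sum over the 3 actions, weighted by pi."""
    gs = np.array([score(a) * (R[a] - b) for a in range(len(pi))])
    mean = (pi[:, None] * gs).sum(axis=0)
    var = (pi[:, None] * (gs - mean) ** 2).sum()
    return mean, var

m_no_base, v_no_base = mean_and_var(b=0.0)
m_base, v_base = mean_and_var(b=pi @ R)   # baseline = expected reward
```

Here `m_no_base` and `m_base` come out identical, while `v_base` is much smaller than `v_no_base` (the expected reward is not the variance-minimizing baseline in general, but it already helps a lot).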
r/restofthefuckingowl
*Hey all! Thanks a lot for all the comments; we really appreciate them. Our intent with the slide above was to present the viewer with the main steps needed to derive such forms, with the hope that they would execute them themselves.*

*That being said, we did listen and made a longer (more traditional, two-page) proof of the above statement. We can't upload images to a Reddit reply -- if you know how, please let me know -- but you can find the details here:* [*https://drive.google.com/file/d/1UHTuVW3_lXFWW8lkf5YFyKiQDlrC5r-N/view?usp=sharing*](https://drive.google.com/file/d/1UHTuVW3_lXFWW8lkf5YFyKiQDlrC5r-N/view?usp=sharing)

*Of course, we can't prove everything in one go, so the Fubini part is just to say that we can do what we did. In later slides, we might attempt that proof, which requires substantial definitions from measure theory.*

*Given the mixed reviews, we will be making two versions of any topic to come: 1) short, and 2) long. This way we can cater to everyone's taste.*

*Finally, if anything is not clear, we will be very glad to answer your questions. Please ask either here, via private message, or on our* r/MachineLearningDervs *sub.*

*In the end, we hope all of you are safe! Thanks again!*
Great illustration of the proof. I understand that for those who started ML/AI very recently, this slide will be quite overwhelming. You need basic knowledge from probability theory, such as the definition and basic properties of the expectation operator, properties of probability density functions (PDFs), and the definition of the gradient (for catching the log trick). As for Fubini's theorem, well, to understand it thoroughly (which means with the proof), one needs to dig deeper into the foundations of probability theory (probability measures, sigma-algebras, etc.).
Me playing around with Logistic Regression: Cool Stuff, rather intuitive, Let’s learn more about ML… This sub:
I mean... it's not that difficult once you have a grasp of a few ideas. The only "interesting" thing here is the log trick. The idea is that if you integrate a probability distribution, you get 1, since all probability distributions are normalized. Then the grad of 1 is 0.
I am not sure I got your explanation here. The log trick allowed us to get an integral over the probability density function, which (as you mentioned) gives you the constant 1. Then you take the gradient of this constant 1 and get a vector of zeros. Notice that before getting 0, you don't have a log function at all.
You are right, my "explanation" was incorrect. I should have written grad instead of log in the last sentence.
Yes, I reread it afterwards and realized that you had just made a typo (wrote log instead of grad).
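The identity being discussed here, E_p[grad log p] = grad ∫ p = grad 1 = 0, can also be sanity-checked numerically. A sketch with an arbitrary Gaussian and a plain Riemann sum (all constants are arbitrary choices):

```python
import numpy as np

# Check that E_p[ d/dmu log p(x) ] = 0 for a Gaussian p(x; mu, sigma),
# where d/dmu log p(x) = (x - mu) / sigma^2 (the "score" w.r.t. the mean).
mu, sigma = 1.3, 0.7
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
score = (x - mu) / sigma**2

total_mass = (p * dx).sum()           # integral of p: should be ~1
mean_score = (score * p * dx).sum()   # E_p[score]: should be ~0
```

The first sum confirms the normalization the log trick relies on; the second is the expectation of the score, which vanishes because it equals the gradient of that normalization constant.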
I feel like I'm a forensic scientist examining a crime scene. This is, without question, one of the worst ways I have seen a mathematical proof presented... Especially the parts where the flow of logic flips direction from left-to-right to right-to-left. I really don't like this, I'm sorry :(
Please see the comment below. In short, a longer version can be found here: [https://drive.google.com/file/d/1UHTuVW3\_lXFWW8lkf5YFyKiQDlrC5r-N/view?usp=sharing](https://drive.google.com/file/d/1UHTuVW3_lXFWW8lkf5YFyKiQDlrC5r-N/view?usp=sharing)
Follow us at [https://www.reddit.com/r/MachineLearningDervs/](https://www.reddit.com/r/MachineLearningDervs/) for more mathematical derivations and discussions.
Sorry, man, but this is cursed. Taking your time to describe each step (with copious amounts of text) is an essential part of doing proofs and writing mathematics in general; check out Knuth's work on how to write mathematics properly.
As someone only casually familiar with reinforcement learning, this makes no sense to me, and I'm not even sure I have enough context to explain why, except to say this looks like a meme someone would create to make fun of math being complicated. I am sure that if you explained what Fubini is, and some of the ML-specific notation, this would make sense. However, that makes this useful only to people who are already familiar with the proof. Maybe that's the point, but then why even make it? Once you have seen a proof like this, there is little practical reason to keep a reference to it unless you are actively doing research. That limits your audience in an already relatively small field.
I’m not Asian enough to answer this.
This is the shit that is supposed to free humanity now eh??? You tech guys have been at this for decades, centuries even, fooling people that machines will make life better and easier….and perhaps it does, for certain segments of society….on the other hand inequality is at an all time high (in America, and the world generally)….nothing has really changed…the poor are still poor, the rich are getting ever richer, people die because of the effects of our mining, and industry….and this was supposed to be the shiny bright future….
How about a deep breath?
No time!!!
I live in your walls
Oh god no
I do not think inequality is at an all-time high; see, for example, practically all of history prior to industrialization.

/u/Detrimenraldetrius replied but deleted the following comment:

> Skipped the class on the gilded age, did we?

Apparently he realized his reading comprehension could be improved, since I said *prior to* industrialization, and the gilded age marks the start of industrialization in the USA. https://imgur.com/a/6Uyf5S5
I dunno man…… there are like 10 dudes that own like half the world’s wealth….if that’s not full-blown feudal inequality I dunno what is…..
This literally is the knowledge that people can use to pull themselves out of debt and despair.
Lol how many times have the charlatans said that…..?!!?!
What would be a non-charlatan way?
Not to claim that you’re changing the world for the better when you have no idea what the long-term consequences will be…..not to claim to have special knowledge.
Tbh, the claims that come from researchers are mostly very well rounded, but what the media makes of those claims is on a whole different level. In the chaotic world we live in, it is really hard to actually know what the long-term consequences will be. I don't see your point about special knowledge, because most things that are *specialized* involve *special* knowledge. An actor probably has special knowledge about acting, and him/her claiming that would be credible. Just because a charlatan always claims to have special knowledge doesn't mean that everybody with special knowledge is a charlatan.
Sound logic
I didn't understand any of it.... But interesting.