Hey folks,
We've been getting nailed with peak charges on the traffic coming from the WineHQ web server.
We think a large reason for this is that we were not sophisticated in how we flushed the Fastly cache, so we'd get slammed in a big rush when we flushed.
I believe that Jeremy Newman has fixed that now, so hopefully it won't happen again. But those overcharges are brutal (i.e. thousands of dollars) :-(.
So I'm planning to put in place some traffic shaping to provide a hard upper limit for winehq (I'm going to impose 8Mbit; we currently pay for 10Mbit committed; if I up that to 20, I may raise the cap accordingly).
This should not disrupt anything, but I thought I'd warn folks, so you'd know to blame me if things suddenly go dark <grin>.
Cheers,
Jeremy
Hi Jeremy,
I would like to let you know in advance that such a bandwidth limit could indeed cause a lot of trouble as soon as we push the next release. There shouldn't actually be much difference between doing a cache flush and pushing a new release.
With an average size of about 3.3GB for each release (counting only the development and staging versions) and a limit of 8Mbit, populating even a single CDN mirror will take about 56 minutes. In practice this is probably a significant underestimate, because multiple servers might make requests at the same time. May I ask what the previous peak bandwidth was exactly?
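For anyone who wants to check the arithmetic, here is a rough sketch of that estimate (the 3.3GB and 8Mbit figures are just the approximations above, and protocol overhead is ignored):

```python
# Rough estimate of how long one CDN mirror needs to pull a full release
# at the proposed cap. The figures are the approximations above, not measurements.
release_size_gb = 3.3        # development + staging packages, approximate
cap_mbit_per_s = 8           # proposed hard limit

release_size_mbit = release_size_gb * 8 * 1024   # GB -> Mbit, treating 1GB as 1024MB
minutes = release_size_mbit / cap_mbit_per_s / 60
print(f"about {minutes:.0f} minutes per mirror")  # about 56 minutes
```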
I would also like to offer to use our build servers for the CDN synchronization. I believe that we should have sufficient resources to deal with such traffic peaks, and we are already doing the builds anyway. Please let me know if this would be an option.
Best regards, Sebastian
Hey Sebastian,
On 04/11/2017 10:40 AM, Sebastian Lackner wrote:
With an average size of about 3.3GB for each release (counting only the development and staging versions) and a limit of 8Mbit, populating even a single CDN mirror will take about 56 minutes. In practice this is probably a significant underestimate, because multiple servers might make requests at the same time. May I ask what the previous peak bandwidth was exactly?
The previous bandwidth was 10Mbit, burstable up to 20Mbit; at 20Mbit, our provider imposed a hard throttle. I believe our contract is billed on a 95th percentile basis, so if we spend more than 36 hours a month above 10Mbit, the dollars start adding up. That's the pain I need to avoid.
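To spell out where that 36 hour figure comes from (a sketch; it assumes a 30-day billing month and the usual 95th-percentile scheme, which may not match our contract exactly):

```python
# With 95th-percentile billing, the top 5% of usage samples in the billing
# period are discarded, so bursting above the committed rate is free until
# it adds up to more than 5% of the month.
hours_in_month = 30 * 24                    # assuming a 30-day billing month
print(hours_in_month * 0.05)                # 36.0 hours of "free" bursting

# Roughly how the billed rate is computed from periodic samples (values in Mbit):
def billable_mbit(samples):
    ranked = sorted(samples)
    return ranked[int(len(ranked) * 0.95)]  # approximately the 95th percentile
```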
One of the issues we had was with the multiple mirrors; without a throttle of some kind in place, any time we would flush the cache, we'd be hammered by Fastly servers all over the world, and we'd pull 20Mbit for a sustained period.
We have now changed that so that only one Fastly server pulls from us, and the rest of the CDN mirrors from that one. We're also able to be more judicious about which files we request a refresh for.
It would not be all that hard, in theory, to remove the throttle only for the Fastly CDN and allow it to pull at the maximum. So long as it's not doing that for more than 5% of the time, we should be okay. I'm also likely to upgrade to 20/30 (20Mbit committed, burstable to 30Mbit), which should allow the numbers to go up a bit.
I would also like to offer to use our build servers for the CDN synchronization. I believe that we should have sufficient resources to deal with such traffic peaks, and we are already doing the builds anyway. Please let me know if this would be an option.
That is certainly an option; I'd rather wait until we see if there is a serious problem before we pursue an alternate course.
The further truth is that, historically, our co-location provider never charged us for bandwidth; we got away with murder for years and years. But they were bought, and then bought again, and the new owners have started actually charging us for bandwidth :-(.
That means this traffic shaping is long overdue; we should have had these controls in place all along, since otherwise anyone on the Internet can push us into a costly penalty zone.
We do have the shaping in place now; I'd like to see how it works. I'm hoping it's a nearly invisible change.
Please report problems or issues to me so I can investigate and, hopefully, tune things until they work well.
Cheers,
Jeremy
Jeremy,
So with the multi-level caching scheme, after the initial cache of all of the static content, any new release (let's round up from Sebastian's 3.3GB figure to 4GB) at your peak of 20Mbps would take 26.6 minutes to sync from your ISP to the single CDN cache. At 2 releases a month, that brings us to 53.3 minutes per month, well below your 5% threshold of 36 hours. I'm not sure I see the point in traffic shaping beyond the initial building of the cache (and total cache invalidation should almost never happen going forward). Additionally, that means that even with multiple CDNs making caching queries directly, after a full cache of the files, you could safely handle adding 10-20 CDN direct mirrors. All of that logic assumes that you bite the bullet and do a complete caching at the start.
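As a sanity check on that arithmetic (a rough sketch using the rounded figures above):

```python
# Time to push one release to the single CDN cache at the burst rate, and how
# that compares to the monthly 95th-percentile allowance. Figures are the
# rounded assumptions above, not measurements.
release_size_gb = 4.0          # rounded up from ~3.3GB
burst_mbit_per_s = 20
releases_per_month = 2

minutes_per_release = release_size_gb * 8 * 1000 / burst_mbit_per_s / 60
print(f"{minutes_per_release:.1f} min per release")                     # ~26.7 minutes
print(f"{minutes_per_release * releases_per_month:.1f} min per month")  # ~53.3 minutes
print(f"allowance: {30 * 24 * 0.05 * 60:.0f} min per month")            # 2160 minutes (36 hours)
```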
Does that logic seem correct to you?
Note that I think it was a lot more than 10-20 CDNs; that may have been a bug on Fastly's part, or they just may have a lot more mirrors than anyone appreciates.
There are a lot of good reasons to do traffic shaping; we should have been doing it all along. For example, when we exceed the 20Mbit limit, our provider just chops us off and drops arbitrary packets, which leads to a bad user experience.
With traffic shaping, we can prioritize different kinds of traffic, which lets us provide better service. For example, we can tune it so that when we're under load, the downloads to the CDNs go more slowly but all other functions stay responsive. Notably, we can now shape things so ssh sessions stay responsive, letting us get into the box and figure out the problem more easily (which we were unable to do during one particularly bad storm).
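To give a flavor of what that looks like, here is a rough sketch of a Linux HTB setup along those lines; the interface name and rates are invented for illustration and are not our actual configuration:

```python
# Illustrative sketch only: an HTB hierarchy that caps total egress at 8Mbit
# while giving ssh replies priority over bulk (web/git/CDN) traffic.
# The interface name and rates here are made up, not the real WineHQ config.
import subprocess

IFACE = "eth0"   # hypothetical interface name

COMMANDS = [
    # root HTB qdisc; unclassified traffic falls into class 1:20
    f"tc qdisc add dev {IFACE} root handle 1: htb default 20",
    # overall 8Mbit cap
    f"tc class add dev {IFACE} parent 1: classid 1:1 htb rate 8mbit ceil 8mbit",
    # high-priority class for ssh, allowed to borrow up to the full cap
    f"tc class add dev {IFACE} parent 1:1 classid 1:10 htb rate 1mbit ceil 8mbit prio 0",
    # bulk class for everything else (web, git, CDN pulls)
    f"tc class add dev {IFACE} parent 1:1 classid 1:20 htb rate 7mbit ceil 8mbit prio 1",
    # steer outbound replies of incoming ssh sessions (source port 22) to 1:10
    f"tc filter add dev {IFACE} parent 1: protocol ip prio 1 u32 "
    f"match ip sport 22 0xffff flowid 1:10",
]

for cmd in COMMANDS:
    subprocess.run(cmd.split(), check=True)   # needs root and iproute2's tc
```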
Also, we don't have great data on our actual traffic patterns this past fall and winter; we mostly have theories. We have now put ntop in place, so we should be able to monitor things better going forward.
For the curious, our 'steady state' seems to be about 3Mbit, with about two thirds of that being web traffic, and one third being git. We have many spikes, though; I need to puzzle out the cause of some of those. I also don't have our Fastly stats ready to hand, but I know it's a lot. We owe Fastly a great debt of gratitude.
I realize it's counterintuitive, but the bottom line is that it's better for us to decide how to manage our traffic than to have it imposed on us, whether by our provider or by an upstream CDN with a much bigger pipe than ours.
Again, please let me know of any problems that arise.
Cheers,
Jeremy