Using Amazon S3 and CloudFront to intelligently cache web pages

Question

java amazon-web-services amazon-s3 amazon-cloudfront

7

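I have a website (Tomcat running on Elastic Beanstalk) that generates artist discographies, one page per artist. Generating a page can be resource intensive, and since an artist page does not change within a month, I put a CloudFront distribution in front of the site.

I thought this meant my server would never have to serve a given artist request more than once, but that is not quite how it works. An article on the subject explains that every edge location (Europe, US, etc.) gets a miss the first time it looks up a resource, and that CloudFront keeps only a limited number of resources in its cache, so they can be evicted.

To counter this, I changed my server code to store a copy of each web page in an S3 bucket and to check for that copy first when a request comes in. If the artist page already exists in S3, the server retrieves it and returns its contents as the web page. This greatly reduces processing time, because the page for a particular artist is only ever constructed once.

However:

The request still has to go to the server to check whether the artist page exists.
If the artist page exists, the web page (which can sometimes be large, up to 20MB) is first downloaded to the server, and only then does the server return it.

So I would like to know whether I can improve on this. I know you can set up an S3 bucket as a redirect to another website. Is there a page-by-page way to have artist requests go to the S3 bucket, which returns the page if it exists and otherwise calls the server?

Alternatively, could I have the server check whether the page exists and then redirect to the S3 page, rather than downloading the page to the server first?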

- Paul Taylor

4 Answers

0

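I haven't done this before, but here is a technique I would look into.

Start by setting up the S3 bucket as a "redirect" for the website, the way you describe.
Look at S3 event notifications. They only fire when an S3 object is created, but you could try a GET request first and, if it fails, respond with a POST or PUT to the same path that drops a "marker" file there, or call an API that triggers the event.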

https://aws.amazon.com/blogs/aws/s3-event-notification/ http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

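Once the event has fired, you can have your server listen for it via SQS, or move your artist-page creation code into an AWS Lambda function that picks the information up from SNS.

My only concern is where the GET requests come from. You don't want just anyone hitting your S3 bucket with invalid POST requests, as that would cause problems. But I'll leave that as an exercise for the reader.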

- Spedge

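My concern is that this seems very complex and ties my code tightly to the AWS infrastructure, which is not something I particularly want. At the moment the only AWS-specific code is an S3 Put and an S3 Get. - Paul Taylor

I'd be wary of not committing to a solution. If you're going to use AWS, by all means make sure you have interfaces, but if you hold back you'll miss out on some really great functionality. The strength of AWS isn't in each individual solution but in the (I can't believe I'm using this word) synergy between them. I don't think it's a particularly complicated solution, but then I won't be the one implementing it :) - Spedge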

0

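Why not put a web server such as nginx or Apache in front of Tomcat? Tomcat would then run on another port (such as 8085) and the web server on port 80. The web server receives the requests and has its own cache. Then you don't need S3 at all and can go back to just your server plus CloudFront.

So CloudFront hits your web server, which returns the page directly if it is in the cache. Otherwise the request goes on to Tomcat.

The cache can live in the same process or in Redis, depending on the total size of the data you need to cache.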

- tgkprog

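I like the ease of deployment and easy scalability that EB gives me, and I think adding a cache into the mix would break that. The pages would also still be served by the same machine, and if they are cached on my server they are all lost whenever I redeploy, so I don't think this is a viable solution. - Paul Taylor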

0

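CloudFront caches redirects but does not follow them: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#ResponseCustomRedirects

You haven't given specific numbers, but would it work to pre-generate all of these pages, put them in S3, and point CloudFront directly at the bucket?

If that is feasible, there are a few benefits:

You decouple content generation from content serving, which makes the system as a whole more stable.
The performance requirements on the content generator drop considerably, because it can take its time working through the content to regenerate.

If you don't know which pages need to be pre-generated, this certainly won't work.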

- Alex Z

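Right, pre-generating all the pages isn't feasible: it would mean generating about 1 million pages, which would take a lot of CPU and time, and it would be wasteful because most of those pages will never be viewed. So I need to generate them on demand. The website is albunack.net. - Paul Taylor

Going back to the redirect question, it isn't clear to me what happens if the server sends a redirect to the page on S3 instead of returning the page itself, and what happens the next time a user requests the page from CloudFront. - Paul Taylor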

- Anshul Goyal · Accepted Answer

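The OP says:

They can sometimes be large, up to 20MB

Since the amount of data you serve can be quite large, I think you could split this into two requests instead of one, decoupling content generation from content serving. The reason for doing so is to minimize the server time and resources spent fetching data from S3 and serving it.

AWS supports pre-signed URLs with a short validity period; we can try using them here to sidestep security and similar concerns.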

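Right now your architecture looks like this: the client makes a request, you check whether the requested data exists in S3, and if it does you fetch and serve it; otherwise you generate the content and save it to S3: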

                           if exists on S3
client --------> server --------------------> fetch from s3 and serve
                    |
                    |else
                    |------> generate content -------> save to S3 and serve

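In terms of network resources, you always consume 2x the bandwidth and time here. If the data exists, the server has to pull it from S3 and then serve it to the client (so 2x). If the data doesn't exist, you send it both to the client and to S3 (again 2x).

Instead, you can try the two approaches below. Both assume that you have some basic template and that the remaining data can be fetched via AJAX calls, and both bring down that 2x factor in the overall architecture.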

Serve the content from S3 only. This calls for changes to the way your product is designed, and hence may not be that easily integrable.

Basically, for every incoming request, return the S3 URL for it if the data already exists; otherwise create a task for it in SQS, generate the data, and push it to S3. Based on your usage patterns for different artists, you should have an estimate of how long it takes on average to pull the data together, so return a URL that will be valid once the task's estimated time to completion, T, has passed.
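The server side of this idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ArtistPageLocator`, the key layout `artist/<id>.html`, the example bucket URL, and the in-memory sets and queue (standing in for S3 and SQS) are all hypothetical, and a real version would likely return a pre-signed URL rather than a plain one.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class ArtistPageLocator {
    /** Reply sent to the client: where the page will be, and how long to wait first. */
    static final class Location {
        final String url;
        final long waitSeconds;

        Location(String url, long waitSeconds) {
            this.url = url;
            this.waitSeconds = waitSeconds;
        }
    }

    // Hypothetical stand-ins: keys already in the bucket, keys being generated,
    // and an in-memory queue playing the role of SQS.
    private final Set<String> stored = ConcurrentHashMap.newKeySet();
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final Queue<String> generationQueue = new ConcurrentLinkedQueue<>();
    private final String bucketBaseUrl;
    private final long estimatedSeconds; // the estimate T from the answer

    ArtistPageLocator(String bucketBaseUrl, long estimatedSeconds) {
        this.bucketBaseUrl = bucketBaseUrl;
        this.estimatedSeconds = estimatedSeconds;
    }

    /** Handles the only request that reaches the server: no page body is transferred. */
    Location locate(String artistId) {
        String key = "artist/" + artistId + ".html";
        String url = bucketBaseUrl + "/" + key;
        if (stored.contains(key)) {
            return new Location(url, 0); // base case: T = 0
        }
        if (pending.add(key)) {          // enqueue each missing page only once
            generationQueue.add(key);
        }
        return new Location(url, estimatedSeconds);
    }

    /** Next key for a worker to generate, or null if nothing is queued. */
    String nextTask() {
        return generationQueue.poll();
    }

    /** Called by the worker once it has generated the page and uploaded it. */
    void markStored(String key) {
        stored.add(key);
        pending.remove(key);
    }

    int queuedTasks() {
        return generationQueue.size();
    }

    public static void main(String[] args) {
        ArtistPageLocator locator =
                new ArtistPageLocator("https://example-bucket.s3.amazonaws.com", 30);
        Location miss = locator.locate("1234");
        System.out.println(miss.url + " wait " + miss.waitSeconds + "s");
        locator.markStored(locator.nextTask());
        Location hit = locator.locate("1234");
        System.out.println(hit.url + " wait " + hit.waitSeconds + "s");
    }
}
```

The point of the sketch is that the server only ever answers with a small location record; the page body itself never passes through it on a hit.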

The client waits for time T and then requests the URL returned earlier. It makes up to, say, 3 attempts to fetch the data in case of failure. In fact, data that already exists on S3 can be thought of as the base case where T = 0.

In this case, the client makes 2-4 network requests, but only the first of them reaches your server. You transmit the data to S3 once, and only when it doesn't already exist, and the client always pulls it from S3.
```
                           if exists on S3, return URL
client --------> server --------------------------------> s3
                    |
                    |else SQS task
                    |---------------> generate content -------> save to S3 
                     return pre-computed url


           wait for time `T`
client  -------------------------> s3
```
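The client half of this flow (wait for T, then try up to 3 times) might look like the sketch below. In a browser this logic would normally be written in JavaScript; Java is used here only to match the question's stack. The class `DelayedFetch`, its parameters, and the fake fetcher in `main` are hypothetical, and the `Supplier` stands in for an HTTP GET against the returned S3 URL.

```java
import java.util.Optional;
import java.util.function.Supplier;

class DelayedFetch {
    /**
     * Waits initialWaitMillis (the server's estimate T), then calls fetch up to
     * maxAttempts times, sleeping retryDelayMillis between failed attempts.
     * Returns empty if every attempt fails or the thread is interrupted.
     */
    static Optional<String> fetchAfterWait(Supplier<Optional<String>> fetch,
                                           long initialWaitMillis,
                                           int maxAttempts,
                                           long retryDelayMillis) {
        try {
            Thread.sleep(initialWaitMillis);
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Optional<String> page = fetch.get();
                if (page.isPresent()) {
                    return page;
                }
                if (attempt < maxAttempts) {
                    Thread.sleep(retryDelayMillis);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake S3 GET (assumption for the demo): the object appears on the third attempt.
        Supplier<Optional<String>> fakeS3 = () ->
                ++calls[0] < 3 ? Optional.<String>empty()
                               : Optional.of("<html>artist page</html>");
        Optional<String> page = fetchAfterWait(fakeS3, 10, 3, 10);
        System.out.println(page.orElse("gave up") + " after " + calls[0] + " attempts");
    }
}
```

With T = 0 the initial sleep is skipped in effect, which matches the base case where the page is already on S3.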

Check if data already exists, and make second network call accordingly.

This is similar to what you currently do when the server has to serve data that doesn't already exist. Again we make 2 requests, but this time we serve the data synchronously from the server when it doesn't exist.

On the first hit, we check whether the content has ever been generated. The response is either a URL (success) or an error message. On success, the next hit goes to S3.

If the data doesn't exist on S3, the client makes a fresh request (to a different POST URL). On receiving it, the server computes the data and serves it, while adding an asynchronous task to push it to S3.
```
                           if exists on S3, return URL
client --------> server --------------------------------> s3

client --------> server ---------> generate content -------> serve it
                                       |
                                       |---> add SQS task to push to S3
```
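A minimal sketch of this second approach, under stated assumptions: `TwoStepArtistServer`, its method names, the key layout, and the example bucket URL are hypothetical; a `ConcurrentHashMap` stands in for the S3 bucket, and a `CompletableFuture` stands in for the SQS task that pushes the page to S3 in the background.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class TwoStepArtistServer {
    /** Result of the second hit: the page to serve now, plus the background upload. */
    static final class Served {
        final String page;
        final CompletableFuture<Void> upload;

        Served(String page, CompletableFuture<Void> upload) {
            this.page = page;
            this.upload = upload;
        }
    }

    // Hypothetical in-memory stand-in for the S3 bucket (key -> page body).
    private final Map<String, String> bucket = new ConcurrentHashMap<>();
    private final String bucketBaseUrl;
    private final Function<String, String> generator;

    TwoStepArtistServer(String bucketBaseUrl, Function<String, String> generator) {
        this.bucketBaseUrl = bucketBaseUrl;
        this.generator = generator;
    }

    private static String keyFor(String artistId) {
        return "artist/" + artistId + ".html";
    }

    /** First hit: the S3 URL if the page was generated before, else empty (an error reply). */
    Optional<String> lookup(String artistId) {
        String key = keyFor(artistId);
        return bucket.containsKey(key)
                ? Optional.of(bucketBaseUrl + "/" + key)
                : Optional.empty();
    }

    /** Second hit, only after a miss: build the page, serve it now, upload asynchronously. */
    Served generate(String artistId) {
        String page = generator.apply(artistId);
        CompletableFuture<Void> upload =
                CompletableFuture.runAsync(() -> bucket.put(keyFor(artistId), page));
        return new Served(page, upload);
    }

    public static void main(String[] args) {
        TwoStepArtistServer server = new TwoStepArtistServer(
                "https://example-bucket.s3.amazonaws.com",
                id -> "<html>discography for " + id + "</html>");
        System.out.println("first lookup found: " + server.lookup("1234").isPresent());
        Served served = server.generate("1234");
        System.out.println("served: " + served.page);
        served.upload.join(); // the demo waits; a real handler would not block on this
        System.out.println("second lookup: " + server.lookup("1234").orElse("missing"));
    }
}
```

Returning the page before the upload finishes is what removes the 2x cost from the request path on a miss; on a hit the server sends back only the URL.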