--- title: "Improving API client performance with httpcache" description: "This vignette presents the main features of the httpcache package, showing how to use the cache-aware HTTP functions, how to invalidate cache as needed, and how to use logging to assess the performance of the cache." output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Improving API client performance with httpcache} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `httpcache` provides the HTTP verb functions `GET()`, `PUT()`, `PATCH()`, `POST()`, and `DELETE()`, which are drop-in replacements for those in the [httr](https://httr.r-lib.org) package. `GET()` responses are added to the local query cache; `PUT()`, `PATCH()`, `POST()`, and `DELETE()` requests trigger cache invalidation on the associated resources. For APIs where a `POST` requests is used to send a command that returns content and doesn't modify state (and hence is semantically more of a `GET`), you can use `cachedPOST()`, which writes to the cache and doesn't invalidate. To take advantage of these cache-aware functions, all you need to do is load `httpcache` instead of `httr`, or in package development, import from `httpcache`. ```{r} library(httpcache) ``` ```{r, results='hide', echo=FALSE, message=FALSE} options(width=120) ``` ```{r} system.time(a <- GET("https://httpbin.org/get")) system.time(b <- GET("https://httpbin.org/get")) ``` Notice how the second request returns instantly. It's reading from cache---there's no communication with a remote server, so no network latency and no server processing time. Remember: the fastest API request is the one you don't have to make. Reading from the query cache yields exactly the same response as if we had contacted the server. We can confirm: ```{r} identical(a, b) ``` ## Query logging How do we know, other than by the faster response, that we're hitting cache and not making server requests? When designing API clients in R, logging is an invaluable tool for understanding and improving request patterns. As you build layers of abstraction on top of the direct HTTP requests, it can be easy to make inefficient or repetitive requests that degrade performance for your users and impose unnecessary load on your servers. You can't improve what you can't measure, and the logging tools included in `httpcache` can help you measure. Let's clear the cache and repeat that exercise, this time with logging enabled. Use `startLog()` to enable the request log. `startLog` takes a file or connection argument, which it passes to `cat` for log writing. The default, same as for `cat`, prints to the standard output---your display, in an interactive session. (See `?cat` for details.) ```{r} clearCache() startLog() a <- GET("http://httpbin.org/get") b <- GET("http://httpbin.org/get") ``` Notice how the first request results in an "HTTP GET" and a "CACHE SET", while the second one gets a "CACHE HIT" and does not touch "HTTP". From this log output, we can conclude that the query cache is working. ### Log analysis You can also pass a file name to `startLog`. This makes it easier to read the log output back in as a `data.frame` and analyze it quantitatively. ### Custom logging You may want to send other events to the log, interspersed with your HTTP requests, whether for their own sake or to see how work done in R outside of the HTTP layer maps onto your server traffic. The function `logMessage()` writes to the connection specified by `startLog`, and it is available for general use. For error logging, the `halt()` function wraps `stop` and sends a message to the log (it also makes the awkwardly named `call.` argument to `stop` default to `FALSE` for cleaner error messaging). ## Cache invalidation As the [saying (or joke, depending on the version)](https://martinfowler.com/bliki/TwoHardThings.html), cache invalidation is one of the two hard problems in computer science. The trouble with caching what the server serves is that the server is the source of truth, and if the state of data on the server changes, our local copy of the data is stale. In some applications and with some APIs, we have no idea when the server state changes, but in many cases, the source of change on the server is actions that we initiate ourselves. In these cases, a local query cache is more feasible, and cache invalidation more tractable. `httpcache` provides some functions to direct cache invalidation. We've seen one already, `clearCache()`, which wipes the entire cache. Other functions give more surgical control. `dropOnly()` invalidates cache only for the specified URL. `dropPattern()` uses regular expression matching to invalidate cache. `dropCache()` is a convenience wrapper around `dropPattern` that invalidates cache for any resources that start with the given URL. Depending on the API with which you're communicating, you may not need to use those cache-invalidation functions directly, or you may need them only infrequently. `httpcache` was designed with [RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) APIs in mind, particularly those that expose resources that contain collections of entities that can be created, replaced, updated, and deleted ("[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)") with POST, PUT, PATCH, and DELETE, respectively. Consequently, these four HTTP verb functions are built with default cache invalidation actions: `POST` invalidates cache only for the request URL (`dropOnly`), for the case where POST creates a new entity appearing as a subresource; while `PUT`, `PATCH`, and `DELETE` drop cache for the request URL and everything "below" it (`dropCache`). For example, if `GET http://api.example/projects/` returns a catalog of project entities, and POST to `http://api.example/projects/` creates a resource at `http://api.example/projects/new_id/`, we need to bust cache for the project catalog on POST, but our cached responses for resources such as `http://api.example/projects/old_id/` should still be valid. But, if we modify `http://api.example/projects/new_id/`, we should invalidate cache for that resource and for other resources appearing as subresources of it, such as `http://api.example/projects/new_id/users/`. These verb functions in `httpcache` (`POST()`, `PUT()`, `PATCH()`, and `DELETE()`) all take a `drop` argument, which defaults as described above. To override them, you can specify a different call other than `dropCache(url)` or `dropOnly(url)`, or you can pass `drop = NULL` and call the cache-invalidation functions directly outside of the request functions. Depending on your API and your usage of it, however, `httpcache`'s cache management may just work for you with no additional effort. ## Caching across sessions The query cache you build up in one R session doesn't have to end with it. Use `saveCache()` to write out the contents of your cache to a `.rds` file. Restore it later with `loadCache()`.