Reliability of Expo Updates server?

We currently have our app configured to download all new updates upon launch. We don’t want users running old versions of our app. As we’ve started to get more users we’ve gotten a fair number of complaints about the app hanging for a long time upon open or not working. I know internet connections on mobile devices can vary widely so we initially thought it was just this.

However, we added extra logging a month ago to our expo update calls (both checking for updates and fetching updates). We’re seeing quite a lot of errors with Expo Update and each time we see them we check status.expo.dev and almost always things are green. (Today was an exception in which we did see indication of a known issue.)

For example, Updates.checkForUpdateAsync() often throws “Manifest verification failed”. Most days we get a few of these, and we don’t have a lot of users yet, only dozens of visitors each day. Sometimes it also throws “The request timed out.”, but that’s less frequent. And then we also get an occasional “Failed to download new update” on the fetchUpdateAsync() line.

And everyone on our team internally has experienced the updates sometimes just taking a really long time. Normally, it takes just a few seconds. But every so often, even on strong Wi-Fi with a fast internet connection, it’ll hang for 20-30 seconds on our splash screen. We check all logs and find no exceptions, but our detailed logging shows it’s just taking a long time within our update code for no reason we can identify.

I wanted to check in and see whether there are known reliability issues with the update server? Maybe there’s an opportunity for better monitoring on your end so that the status page would reflect when there are issues? Or maybe you’re having some known issues scaling up right now?

We’d much rather expo update servers be so good that we don’t have to think about ever self-hosting and serving updates. I’m not excited about another server we have to monitor and scale, and I’m not confident I can do better than you guys. :slight_smile: But I am realizing that the update server is a critical piece of infrastructure for our uptime, unless we want to take on the complexity of supporting older clients, so I’m trying to better understand how we should approach this.

Thanks!

I just had one more thought about this: we are still using classic updates rather than EAS Updates. Are any of these reliability issues related to that?

I’ll break down this response to try and cover the several topics in your post.

Server reliability: the availability of the classic updates service has been at 99.999% over the last 24 hours according to our monitoring. Availability is defined as a server sending back a response that’s not an HTTP 5xx server error before our load balancers time out. There are sometimes short drops throughout the day, like a drop to 99.74% that we had for a few minutes, but at this moment the server reliability is looking good.

We have been recently changing some of our network infrastructure and autoscaling strategy and some of those changes have caused outages. It’s possible you got caught during one of those, which would be one of several possible causes of the “The request timed out” or “Failed to download new update” errors you mentioned.

That said, the failure rate you seem to be seeing makes me wonder if there is perhaps an issue that is specific to your app’s updates. Which app of yours is having trouble, and what release channel, SDK version/runtime version, and platform are you seeing the issues on?

Slowness when opening an app: the best way to make launching the app always fast is to set fallbackToCacheTimeout to zero. This tells the app to always launch right away and check for an update in the background. You can also periodically check for an update in JS like it appears you are doing already (recommendation: do the check when your app is foregrounded, rather than polling).
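To make the foreground check concrete, here is a minimal TypeScript sketch, not a definitive implementation. The helper name checkForUpdatesOnForeground and the two small interfaces are hypothetical; they only describe the parts of React Native's AppState and the expo-updates module that the sketch uses, so it can be read and tested without either library. The callbacks are placeholders for whatever your app does when an update is ready or a check fails.

```typescript
// Minimal shapes of what this sketch uses. In an app, you would pass
// `AppState` from "react-native" and `Updates` from "expo-updates"
// (assumption: your SDK's versions are compatible with these shapes).
interface AppStateLike {
  addEventListener(
    type: "change",
    handler: (state: string) => void
  ): { remove(): void };
}

interface UpdatesLike {
  checkForUpdateAsync(): Promise<{ isAvailable: boolean }>;
  fetchUpdateAsync(): Promise<{ isNew: boolean }>;
}

// Hypothetical helper: runs one update check each time the app becomes active.
function checkForUpdatesOnForeground(
  appState: AppStateLike,
  updates: UpdatesLike,
  onUpdateReady: () => void,
  onError: (error: unknown) => void
): { remove(): void } {
  let inFlight = false; // avoid overlapping checks on rapid state changes

  return appState.addEventListener("change", async (state) => {
    if (state !== "active" || inFlight) return;
    inFlight = true;
    try {
      const { isAvailable } = await updates.checkForUpdateAsync();
      if (isAvailable) {
        const { isNew } = await updates.fetchUpdateAsync();
        if (isNew) onUpdateReady(); // e.g. prompt the user or reload
      }
    } catch (error) {
      onError(error); // network and manifest errors land here
    } finally {
      inFlight = false;
    }
  });
}
```

Call it once when the app starts and keep the returned subscription so you can call remove() on it when the listener is no longer needed.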

Client reliability: slow or unreliable connections will make it slow to check for updates, which is why setting fallbackToCacheTimeout or being sure to catch errors from Updates.checkForUpdateAsync()/fetchUpdateAsync() are good ideas even if server reliability were perfect. You mentioned seeing issues with your development devices on strong Wi-Fi, though, which again makes me wonder if your update manifests are exceptionally large and causing slowness. One hypothesis behind “Manifest verification failed” is that the update manifest is too large for some clients to verify, which I haven’t heard of before but is just a guess.

EAS Update: the new EAS Update service is our main focus with regard to updates. Checking for updates is already faster for clients around the world with EAS Update. Future engineering work on updates is going to go into EAS Update since it is a paid service (we can do volume discounts). Speed, dashboards, integration with EAS Build, and rollouts/rollbacks are some of the things that we’ll continue building for the new service.

Hi James, this is interesting. Let me get really specific for a minute because this is definitely happening—not quite daily, but every day or two on our end.

Just last night, someone reported to us that their app would not update itself. It was getting an error (which we catch and display nicely on screen for the user). I opened the app on my device and I experienced the same thing. I checked our logs and the error was “Error: Failed to download new update”. I tried a half dozen times over a couple minutes and it kept getting that same error. One of these errors was at UTC Fri, Nov 11, 2022 2:56:42. This morning I re-opened the app and now it updated successfully. This is the kind of pattern we notice. To my unsophisticated eye, it feels like short periods of downtime with the app update servers.

Here’s the info on our setup:

login: explanation-co
fullName: @explanation-co/vizzable
app: Lava
ID: cedee92c-31dd-4bb8-aa1e-ac790c9ee26f

Latest build (which the above scenario could not download): 136d9014-a2fa-4f0a-819c-4acff8e68144
Expo SDK: 45.0.6
expo-updates: 0.13.4

And again, this is on the classic updates. We’re still pushing on one of our library vendors so that soon we can update to the latest Expo SDK and we’ll also switch to EAS Build then.

Regarding client speed & reliability: Our app can only be used when you are online. If people open it when they’re offline the app throws up a screen and lets you know you need to be connected to the internet. Because of this, we are requiring that users be on the latest update in order to use the app. As part of the app launch sequence, if we detect you are not using the latest update, we roadblock you with a spinner and quickly update the app. And even if you’re in the middle of using the app, if we detect a new update has been released we immediately update you to it. We’re treating it almost like the front-end for a website. Basically, we don’t support older clients. I’m not certain that this is a development strategy we’ll stick with forever, but it sure has enabled us to move fast, debug more easily, iterate on APIs without having to version them—so we’re pushing this as far as we can.

I took a look at your project’s manifest by simulating what an app would fetch. The manifest looks fine and the performance was good, ranging from 130ms to 270ms on my computer. So, that’s more evidence this is a transient issue.

Looking at logs from 2:56AM UTC, we did have a surge of traffic from one app. One way we try to maintain availability for apps is to rate limit surges on a per-app basis, but maybe your app got caught by the rate limiter despite not being the cause of the surge. I don’t think this is the case but we are looking into that. (The other reason for rate limits on Classic Updates is so we can focus mostly on EAS Update.)

Two things client-side that could provide more color are:
1.) Look at your device’s native logs to see what error messages there might be, especially if there are network errors or HTTP response errors.
2.) If there’s space on your roadmap, upgrade to SDK 47 and use the new Updates.readLogEntriesAsync(maxAge) API that provides logs related to updates.
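As a rough illustration of the second suggestion, the TypeScript sketch below pulls recent update log entries and keeps only the errors, so they can be sent to your own logging after a failed check. It makes several assumptions: the entry shape and the "error"/"fatal" level strings are what I would expect from expo-updates in SDK 47 and should be checked against the API reference, maxAge is assumed to be in milliseconds, and collectUpdateErrors is a hypothetical helper name. The reader function is passed in as a parameter so the sketch does not depend on the library directly.

```typescript
// Assumed shape of a log entry; verify the fields against the expo-updates docs.
interface UpdatesLogEntryLike {
  timestamp: number; // assumed: milliseconds since the Unix epoch
  message: string;
  code: string;
  level: string; // assumed values include "error" and "fatal"
  updateId?: string;
  assetId?: string;
}

// In an app you would pass Updates.readLogEntriesAsync here.
type ReadLogEntries = (maxAge?: number) => Promise<UpdatesLogEntryLike[]>;

// Hypothetical helper: returns one formatted line per error-level entry.
async function collectUpdateErrors(
  readLogEntries: ReadLogEntries,
  maxAgeMs: number = 60 * 60 * 1000 // look back one hour
): Promise<string[]> {
  const entries = await readLogEntries(maxAgeMs);
  return entries
    .filter((entry) => entry.level === "error" || entry.level === "fatal")
    .map(
      (entry) =>
        `${new Date(entry.timestamp).toISOString()} [${entry.code}] ${entry.message}`
    );
}
```

Calling this from the catch block around checkForUpdateAsync()/fetchUpdateAsync() would attach the native-side details to each reported failure, which helps with transient issues that are hard to reproduce on a tethered device.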

I would love to improve support for apps that are kept highly up to date. A variant of how you’re approaching things I might suggest is to optimistically fetch updates with a modest timeout (say, 1 to 3 seconds), and if you can get the whole update in that time, apply it. Otherwise, let the update get downloaded in the background. Speaking generally and not specifically about your app/use case, one thing I’ve found is that the expectation for native apps is that they launch right away and don’t need internet to show the UI (without the data that’s later fetched from the network).
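The optimistic fetch with a modest timeout could look roughly like the TypeScript sketch below. It is only a sketch under stated assumptions: updateWithDeadline, the Step and Outcome types, and the DeadlineUpdatesLike interface are hypothetical names, and the interface lists just the three expo-updates calls the sketch relies on (checkForUpdateAsync, fetchUpdateAsync, reloadAsync). It races the check-and-fetch sequence against a timer. If the work wins, the update is applied immediately; if the timer wins, the fetch is left running, on the assumption that a completed download is picked up on a later launch.

```typescript
interface DeadlineUpdatesLike {
  checkForUpdateAsync(): Promise<{ isAvailable: boolean }>;
  fetchUpdateAsync(): Promise<{ isNew: boolean }>;
  reloadAsync(): Promise<void>;
}

type Step = "up-to-date" | "fetched" | "failed" | "timeout";
type Outcome = "applied" | "up-to-date" | "deferred";

// Hypothetical helper: apply an update only if it arrives within timeoutMs.
async function updateWithDeadline(
  updates: DeadlineUpdatesLike,
  timeoutMs: number
): Promise<Outcome> {
  const work: Promise<Step> = (async (): Promise<Step> => {
    const { isAvailable } = await updates.checkForUpdateAsync();
    if (!isAvailable) return "up-to-date";
    await updates.fetchUpdateAsync();
    return "fetched";
  })().catch((): Step => "failed"); // also swallows errors that arrive late

  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<Step>((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  const step = await Promise.race([work, deadline]);
  clearTimeout(timer);

  if (step === "fetched") {
    await updates.reloadAsync(); // restart into the freshly downloaded update
    return "applied";
  }
  if (step === "up-to-date") return "up-to-date";
  return "deferred"; // timed out or failed: launch now with the cached update
}
```

An app that must block old clients could treat "deferred" differently, for example by showing a retry screen instead of launching, while still getting a bounded wait on the splash screen.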

Thanks for your reply, a few thoughts and questions.

On checking the native logs, do you have any suggestions on how to do this after the fact? Since these issues tend to be transient, it would be hard to get it reproduced on a device connected to a computer/Xcode.

We are definitely planning on moving to Expo 47 and EAS updates but have a third party dependency that we’re waiting on for updated React/React Native support. Would you say on the whole that EAS updates tends to be more reliable/performant?

Your advice on caching updates and playing them later is generally good advice, but in our case, we’re using forced updates to simplify keeping our back end in sync with the app. This approach means we simply don’t allow “old” clients. If we can’t get these updates to ultimately be highly reliable, we may need to revisit that approach.

Nick

Logs: I don’t recall exactly whether Console.app can show historical logs up to a point; if it can, that might prove useful. I know crash logs are stored and can be synced to macOS, but since there’s no crash here, those won’t be so useful. You could also try to simulate bad network conditions with Network Link Conditioner on iOS.

EAS Update: it’s more reliable for a few reasons. A big one is there’s currently less traffic on the service. Also EAS Update and the expo-updates library are where almost all our updates-related engineering energy is going.