Recently I was scheduled for a 1-hour architecture review video chat with a customer trying to solve two challenges: alert lag and cloud storage spend. It was a very interesting customer with a fascinating use case, and one of the coolest uses of AI technology I’ve seen.
Initially, I was invited because they believed Google’s Anthos might help solve one of their issues, and I’ve been working a lot in that space lately. Although Anthos may help them in the future, we concluded it wasn’t the right solution for their immediate needs.
This story is worth telling because it illustrates what I believe sets DoiT International apart from many cloud solution partners, beyond our Cloud Management Platform and services at zero cost to customers. We won’t recommend something we don’t believe is in our customer’s best interest, and our teams are free to make those calls; it’s also why our customers love us.
About the customer
This company’s SaaS solution is used by factories around the world to automate quality control. What I find admirable is they found a way to augment human workers and make them better at what they do instead of replacing them with a robot. This is the real promise of AI.
Factories connect the camera feeds that monitor their assembly lines, and the company’s software analyzes the video in real time to verify each step is performed. If it detects a missing step, it sends an alert so the mistake can be corrected.
Perhaps it’s the geek in me, but I was fascinated when our customer shared his screen and connected to a live feed and we watched as it checked off a list of tasks in an automotive radiator factory (rotate wheel counter-clockwise, pinch this valve, pressure test, etc.). It was really cool to see in action!
Alert Lag Issue
One issue they faced was a 4-second lag from the time an issue was detected until the worker could be notified. Their customers were demanding no more than a 2-second lag, so there would be time to halt the job and correct it before it was too late. They were given a 6-month mandate to solve the problem.
The company’s engineers had heard about Google’s Anthos, a platform that enables hybrid multi-cloud app modernization. Anthos empowers organizations to centralize and standardize their policy, security, configuration, and containerized application management across all major clouds, virtualized on-prem environments, and now bare metal (at the edge). They were convinced that if they shifted their processing closer to the customer it would solve this lag issue, plus it was a “shiny new thing.”
Cloud Storage Cost Issue
You can imagine how much storage can quickly add up when you’re saving video feeds (sometimes for years) for cameras in all these factories around the world. Each feed produced approximately 700MB per hour at 720p in H.264 format, even after they multiplexed and compressed it. They processed 130 hours per week, per factory.
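Those figures add up quickly. A back-of-the-envelope calculation using the numbers above (the storage price is an assumption for illustration; check current Google Cloud pricing for your region):

```python
# Back-of-the-envelope storage growth per factory, using the figures
# from the article: ~700 MB per stream-hour at 720p/H.264, and 130
# processed hours per week per factory.
MB_PER_HOUR = 700
HOURS_PER_WEEK = 130
WEEKS_PER_YEAR = 52

gb_per_week = MB_PER_HOUR * HOURS_PER_WEEK / 1024   # ~88.9 GB/week
gb_per_year = gb_per_week * WEEKS_PER_YEAR          # ~4,620 GB (~4.5 TB)

# Assumed list price for Standard-tier storage (~$0.020/GB-month);
# actual pricing varies by region and changes over time.
STANDARD_PRICE_PER_GB_MONTH = 0.020
monthly_cost_after_one_year = gb_per_year * STANDARD_PRICE_PER_GB_MONTH

print(f"{gb_per_week:.1f} GB/week, {gb_per_year / 1024:.2f} TB/year")
print(f"~${monthly_cost_after_one_year:.2f}/month per factory after one year")
```

And that is per factory, per year, retained sometimes for years — so the standard tier becomes expensive fast.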
The company was using standard-tier Cloud Storage buckets, and costs were steadily rising. They also sought guidance on how they might improve their efficiency.
The customer shared their screen and an architecture diagram while another member of our cloud architecture team and I observed.
“Live streams come in and we store in Google Cloud Storage buckets then process it. We built a custom deep-learning neural net with 1 GPU per stream that identifies the start and stop of given actions. We store the data produced in Google Cloud Bigtable and then business logic interprets it,” they explained. “There is a 4.5-second lag and of that, over 3 seconds occurs in the neural net. We break up the incoming video into frames to feed into the neural net and then re-encode the video.”
Wait, what? “Please explain why you re-encode the video.”
“Oh, this goes back to poor choices early on. When the video comes in, we strip out the timestamps. We need to re-encode it after inference in order to produce the annotated feed.”
“How long does the re-encoding take?” my colleague asked.
“Just over 2 seconds,” they replied.
“And how long does it take for you to detect a mistake?”
“About 1–1.5 seconds.”
It was quite clear at that moment that moving their workloads to the edge with Anthos wouldn’t solve their problem. It wasn’t network related; it was application related.
Proposed Lag Solution
After discussing their architecture, we concluded there were two options to explore. The first was to revisit why they removed timestamps during initial intake and potentially eliminate that step. The second was to separate alerting from re-encoding and send the alert immediately, without waiting for the encoding to complete.
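The second option can be sketched as a simple producer/consumer split: fire the alert as soon as inference flags a missed step, and hand the slow re-encoding off to a background worker so it no longer sits on the alert path. This is a minimal illustration, not the customer’s actual code; all function names here are hypothetical.

```python
import queue
import threading

encode_jobs: "queue.Queue" = queue.Queue()
alerts_sent = []

def send_alert(event):
    # Fast path: in production this would notify the factory floor
    # immediately (~1-1.5 s after the mistake, per the conversation).
    alerts_sent.append(f"ALERT: missed step '{event['step']}'")

def reencode_worker():
    # Slow path: the ~2-second re-encode/annotate step runs here,
    # off the critical alerting path.
    while True:
        event = encode_jobs.get()
        if event is None:          # sentinel to shut down
            break
        # ... re-encode and annotate the clip here ...
        encode_jobs.task_done()

def on_inference_result(event):
    if event.get("missed_step"):
        send_alert(event)          # alert first, no encoding wait
    encode_jobs.put(event)         # re-encoding happens asynchronously

worker = threading.Thread(target=reencode_worker, daemon=True)
worker.start()

on_inference_result({"step": "pressure test", "missed_step": True})
encode_jobs.join()                 # wait for background work to finish
encode_jobs.put(None)              # stop the worker
```

With this split, the alert latency is bounded by detection time alone, dropping the end-to-end lag from ~4.5 seconds to roughly the 1–1.5 seconds of inference.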
They agreed to task their engineers with exploring those avenues and to hold off on implementing Anthos, at least for now. We also learned that some of their clients, for compliance reasons, didn’t want their feeds to leave the factory, so there are still viable applications for Anthos on-prem in the near future; for now, though, we wanted to solve their immediate needs.
We also identified areas where they could save by processing multiple streams with a single GPU, and with inference and alerting split from re-encoding, they may not need such powerful machines.
Proposed Storage Solution
During the initial walkthrough, they explained that they were using standard-tier storage and not yet taking advantage of the discounted, less-frequently-accessed tiers. One easy win we agreed on was to use Google Cloud Storage object lifecycle management to automatically move objects to lower-priced tiers based on age or access frequency.
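A lifecycle policy for this looks something like the following. The 30/90/365-day thresholds are assumptions for this sketch; in practice you would tune them to how often archived footage is actually retrieved.

```python
import json

# Illustrative Google Cloud Storage lifecycle policy: transition
# objects to cheaper storage classes as they age. Thresholds below
# are example values, not the customer's actual retention policy.
# Apply with: gsutil lifecycle set lifecycle.json gs://BUCKET_NAME
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
    ]
}

with open("lifecycle.json", "w") as f:
    json.dump(lifecycle, f, indent=2)
```

Once set on a bucket, the transitions happen automatically with no application changes, which is what made this such an easy win.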
Although the customer was happy with that and already planned to explore it, I wanted to dig a little deeper to see if we could find additional wins. As they described the H.264 format, I recalled a recent talk featuring AI pioneers Ian Goodfellow and Andrew Ng, among others, on uses of generative adversarial networks (GANs) beyond “deep fakes”, one of which was improved video compression.
I proposed they consider leveraging GANs to further reduce the amount of data they must store when archiving these videos. They acknowledged the potential and were very pleased with our suggestions. Their engineers now had actionable next steps, and I look forward to checking in on their progress soon.
Another Architecture Review Success!
This recent example illustrates what to expect from a Senior Cloud Architect at DoiT International. We provide our customers’ engineering staff a personalized “Stack Overflow” layer of support, helping solve issues in half the time of most cloud vendors’ support responses. We also help onboard customer employees with cloud training. And, as in this example, we aid in cost optimization and architecture reviews (infrastructure, data, ML/AI, and software architecture).
A typical DoiT Cloud Architect’s week is roughly one-third solving support “tickets” (a.k.a. cases) and one-third customer calls like this one (although most are intro sales calls where we get to know cool companies doing amazing things). The remaining third is yours to decide: work on your next certification, learn something new, write blog posts about your experiments, or even tend to your garden and share pics on our victory-gardens Slack channel.
If this sounds interesting to you, I encourage you to visit our Careers Page or ping me (with a message) on Twitter or LinkedIn. We seek the best and brightest, so our screening is tough, but once you’re “in” you’ll quickly find that DoiT International is unique (somewhere between a product and pro-serv company). I look forward to working with you soon, either as a customer or colleague.