I read this system design interview question online and wanted to write about it. When I initially tried to answer this question, I embarrassingly didn’t have a good way to structure my thoughts. I had a couple of ideas floating around, but I didn’t know where to start. I later realized that it was probably because server monitoring was so prevalent in modern web development that I had been taking it for granted. So I seemed to know a good deal about it, but I couldn’t really connect the dots. I had to take the time to do a bit of research and flesh out my thoughts.
If you got stuck when you first saw this question, don’t worry. It’s normal.
Usually, the first step in a system design interview is to ask for use case clarification. I don’t think that helps much here since we all know what a server monitoring system looks like. Asking for confirmation and directional guidance as we progress through the discussion may be a better strategy. It shows our knowledge depth and enables us to ask more concrete and intelligent questions.
There are literally thousands of things we can measure and monitor on a server.
We can define application layer metrics. For example, how many POST requests on a certain path has the server received? How many concurrent requests is the server dealing with? What’s the latency distribution of a particular method? How long is the unprocessed message queue? What’s the cache hit rate? These metrics are defined by the application and also tracked by the application in in-memory counters.
We can monitor language-/execution-specific metrics. I like Golang, so I’ll use that for illustration. For example, we want to know:
- Which Go version it is.
- How many concurrent Goroutines the process has.
- When the last garbage collection was.
- How many bytes of memory the process has allocated and freed.
- What the heap size is.
Golang has a runtime package that provides methods for obtaining this kind of information.
We can also inspect the host system. For example, what’s the CPU/memory/disk usage? How many processes are running on the host? Are the filesystem, network, and system timer working properly? These kinds of measurements are surfaced by system calls and vary across platforms. We usually don’t collect them in the application process because they’re common to the host and we need higher privileges to access them. That’s when a daemon process comes in. It’s also worth noting that these kinds of metrics are quite expensive to obtain, so we should be careful about the polling frequency.
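To illustrate why this kind of collection tends to live in a separate daemon, here is a Linux-specific sketch that polls the `/proc` pseudo-filesystem (the path and format are Linux conventions and will not exist on other platforms):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readLoadAvg returns the host's 1-minute load average by reading the
// Linux /proc/loadavg file. A monitoring daemon would poll files like
// this at a configured interval.
func readLoadAvg() (string, error) {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		return "", err // not Linux, or insufficient privileges
	}
	fields := strings.Fields(string(data))
	if len(fields) == 0 {
		return "", fmt.Errorf("unexpected /proc/loadavg format")
	}
	return fields[0], nil // first field is the 1-minute load average
}

func main() {
	if load, err := readLoadAvg(); err == nil {
		fmt.Println("1-minute load average:", load)
	}
}
```

On other platforms the equivalent data comes from different system calls or APIs, which is another reason to isolate it from the application process.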
Now is a good time to pause and ask if there is anything in particular that the interviewer wants us to monitor. The answer is likely no, because the gist of the question is to test our system design ability. It’s unlikely that the interviewer wants us to talk about the details of, say, filesystem cache stats.
The metrics are now all available on the server through in-memory counters, language runtime, and system calls. How do we get them out of the server?
Before proceeding, we should confirm with the interviewer: Do we need to get the metrics out of the server? Maybe we don’t. In that case, the server may expose an endpoint that serves the metrics, or it may just save them to local disk for us to SSH in and inspect. The interviewer's answer, however, is very likely to be yes. We do want to get the metrics out. Otherwise, we’ll lose access to them when the server is slow or down, which is exactly when we need them the most. We’d also want to put the metrics in a centralized place for better global monitoring and alerting.
A fundamental design question we face now is whether we should use push or pull, meaning whether the server should proactively send the metrics out or it should just expose an endpoint and wait reactively for inquiry.
Prometheus (a famous monitoring system) chose to pull, and there is a good blog post about the design choice. It advocates for pulling because it is more convenient. Each server being monitored only needs to gather the metrics in memory and serve them through an endpoint. It doesn’t need to care where the monitoring system is, and it doesn’t need to worry about overloading the monitoring system by sending too much and/or too frequently. What to collect and how often can be tuned in one global configuration on the monitoring system side.
On the other hand, pushing may be useful in some scenarios (e.g. when a firewall prevents the monitoring system from accessing the server directly).
One disadvantage of pulling that the blog post didn’t mention is that it’s challenging to offer high availability and scalability with a pull-only model. If we’re using push, we can put a load balancer in front of a set of monitoring system replicas and have the servers being monitored send metrics through the load balancer. But with a pull-only model, the metrics are collected directly by a monitoring system instance. We’ll have to shard the metrics deliberately when pulling and deploy backup instances explicitly to support replication and failover. Depending on the environment, this may or may not be a big problem.
Another noteworthy point is that modern monitoring systems usually have more than one layer of push/pull to aggregate metrics in a hierarchy. Sometimes, a hybrid push/pull model will be used between the hierarchical layers to address the shortcomings of a homogeneous retrieval mechanism.
Now that we’ve managed to get the metrics from the servers being monitored into another system, the next thing is how to store them. Again, always ask for confirmation first to see if it’s necessary. Maybe the requirement is just to have a centralized in-memory metrics repository. In that case, we don’t need to worry about designing a time series database. Nevertheless, time series databases are interesting. Let’s assume that durability is a requirement. It often is, as a monitoring system without a history view isn’t that useful.
We want to store data samples — essentially a value with a timestamp — in chronological order for all time series. A time series is the full timeline view of a particular metric. For example, the P50 latency of `POST /myMethod` from server `instance1`, collected at 15-second intervals, is a time series.
Our first intuition should be that we absolutely cannot just write individual data samples to files as they arrive. That’ll be prohibitively inefficient because the monitoring system may end up collecting a million data samples or more every minute. So some form of batching is crucial. With batching in memory, there comes the risk of losing data. It may not be a big deal if we’re only batching for a short period of time, as typical time series use cases can tolerate the loss of a few data samples. But if we want to batch for a longer period of time, a durability safeguard should be put in place. After all, the most recent data samples are usually the most valuable. We don’t want to be in a position where we lose the last 30 minutes of metrics.
A write-ahead log (WAL) is the canonical way to complement in-memory batching with durability. For more details, you can check out the detailed article I wrote on the topic. At a high level, a WAL pipes changes to a log file on disk that can be used to restore the system state after a crash. It doesn’t incur a big IO penalty because sequential file access is relatively fast.
Now we can buffer the data samples in memory for a while, which opens the door to better write efficiency. The next thing to decide is how to structure the data in files. One time series per file sounds simple, but unfortunately, it won’t scale: there are just too many time series to create individual files for. We have to store multiple time series per file. On the other hand, we can’t just put everything in one monolithic file, which would be too big to operate on. We need to cut the data in some way, and the natural choice is to cut along the time dimension. We can create one file per day or per some other configurable time window. Let’s call the data in a time window a block. If the data volume in a block is too big for one file, we can shard it across a few files. We also need an index file in each block to tell us which file, and which offset within it, holds a particular time series. See Figure 1 for an illustration.
As data samples arrive in memory, we buffer them until we need to flush them to disk. When we flush them to disk, we organize them in blocks. The block for the most recent data samples typically represents a small time window. As the blocks grow older, we compact them to form longer time windows. This keeps the overall block number in check. What’s more, the old data samples are queried less frequently so we can put them in larger files for easy management. We can also down-sample the data as part of the compaction to reduce overall data volume. The idea of compacting young files into bigger old files is not new. It’s called a log-structured merge tree. LevelDB is probably the most famous implementation of it.
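Down-sampling during compaction can be as simple as replacing each fixed window of raw samples with one aggregate. The window size and the choice of average as the aggregate here are arbitrary, for illustration:

```go
package main

import "fmt"

// downsample shrinks a series by averaging each fixed-size window of
// raw values, as might happen when compacting old blocks.
func downsample(values []float64, window int) []float64 {
	var out []float64
	for i := 0; i < len(values); i += window {
		end := i + window
		if end > len(values) {
			end = len(values)
		}
		sum := 0.0
		for _, v := range values[i:end] {
			sum += v
		}
		out = append(out, sum/float64(end-i))
	}
	return out
}

func main() {
	raw := []float64{1, 3, 2, 4, 10}
	fmt.Println(downsample(raw, 2)) // [2 3 10]
}
```

Real systems usually keep several aggregates per window (min, max, sum, count) so that queries over old data remain meaningful.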
There are also other small but important details. Compression should be employed to reduce the overall data volume, and time series data is an excellent candidate for it: each data sample can be expressed as a delta from the preceding one. Facebook published a paper describing two particular tricks in time series compression that led to a 10x saving, both exploiting the fact that adjacent data points are close. In their paper, a timestamp is encoded as a delta of deltas, because the second-order difference tends to have more zeroes. A floating-point value is encoded as an XOR with the previous value and can be restored by (a XOR b) XOR a = b; when two floating-point values are close, their XOR result has a lot of zeroes.
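The XOR trick can be sketched in a few lines of Go. This shows only the XOR step; the real Gorilla format then bit-packs the runs of leading and trailing zeroes:

```go
package main

import (
	"fmt"
	"math"
)

// xorEncode stores a value as the XOR of its bits with the previous
// value's bits. Close values share many bits, so the result is mostly
// zeroes and compresses well.
func xorEncode(prev, cur float64) uint64 {
	return math.Float64bits(prev) ^ math.Float64bits(cur)
}

// xorDecode restores the value using (a XOR b) XOR a = b.
func xorDecode(prev float64, delta uint64) float64 {
	return math.Float64frombits(math.Float64bits(prev) ^ delta)
}

func main() {
	prev, cur := 12.5, 12.5625 // adjacent samples, close in value
	delta := xorEncode(prev, cur)
	fmt.Printf("%064b\n", delta)        // mostly zero bits
	fmt.Println(xorDecode(prev, delta)) // 12.5625
}
```

The encoding is lossless: decoding always reproduces the exact original bits, which matters for metrics that feed alerting.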
So far, we’ve only been concerned with data in local disks. To grow beyond a single node, we need to scale the storage out. We can either back the data by using a distributed file system (such as HDFS) or redesign the storage logic to adapt to a more natively distributed infrastructure. We won’t have time to go into details here.
If we have the storage laid out as in Figure 1, querying is easy to envision. Time is a universal filtering criterion in all metrics searches. We use a time range to narrow the search down to a set of continuous blocks. If we know the time series ID, we can retrieve it from the files in those blocks. Otherwise, we’d need to add a reversed index to go from the search criteria to time series IDs and then follow the index files in blocks to locate the time series files. Once we retrieve the time series data, we can graph them for display.
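The first step of that query path, narrowing a time range down to blocks, is just an interval-overlap test. The `Block` struct here is illustrative:

```go
package main

import "fmt"

// Block covers one contiguous time window of stored samples.
type Block struct {
	Start, End int64 // inclusive window bounds, unix seconds
}

// selectBlocks returns the blocks whose windows overlap [from, to],
// the universal first filter of any metrics query.
func selectBlocks(blocks []Block, from, to int64) []Block {
	var out []Block
	for _, b := range blocks {
		if b.End >= from && b.Start <= to { // windows overlap
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []Block{{0, 99}, {100, 199}, {200, 299}}
	fmt.Println(selectBlocks(blocks, 150, 250)) // [{100 199} {200 299}]
}
```

Since blocks are sorted by time, a real implementation would binary-search rather than scan, but the overlap condition is the same.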
Alerts can be registered inside the monitoring server. As it sniffs through the incoming data samples, it can alert on any abnormal pattern. We can also decouple alerting from the monitoring server and deploy it as downstream monitoring servers that collect and inspect only a narrower, curated set of metrics.
That’ll conclude this system design interview. We’ve definitely covered more than what a typical system design interview would have time for. I hope that you’ve gotten a little bit of inspiration from this article.
- Pull doesn't scale - or does it? | Prometheus
- Building an Append-only Log From Scratch | by Eileen Pangu | Medium
- Log Structured Merge Tree (LSM-tree) Implementations (a Demo and LevelDB) | by Eileen Pangu | Medium
- Tuomas Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database, VLDB 2015