Reading large JSON files with memory constraints
I wanted to try code golfing a problem I see raised occasionally: how to read/parse JSON files that are larger than available system memory. I went with mmap and file streams with filters.
As a contrived hypothetical, imagine having to calculate the average balance from a JSON object like:
>jq '.' bank_customers.json | head -n10
{
  "customers": [
    {
      "name": "Mack Donalds",
      "balance": 111457.33
    },
    {
      "name": "Jimmy John",
      "balance": -2032.33
    }
...
A basic reader in RapidJSON could look like:
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>
#include "rapidjson/document.h"
#include "rapidjson/error/en.h"
#include "rapidjson/filereadstream.h"

#define FATAL(...) do { fprintf(stderr, __VA_ARGS__); return -1; } while(0)

int main(int argc, char **argv) {
    FILE* fp = fopen(argv[1], "rb");
    if (fp == nullptr) {
        FATAL("error opening file: %s. Reason: %s\n",
              argv[1], strerror(errno));
    }
    char rbuf[65536];
    rapidjson::FileReadStream is(fp, rbuf, sizeof(rbuf));
    rapidjson::Document js;
    rapidjson::ParseResult ok = js.ParseStream(is);
    if (!ok) {
        FATAL("error parsing json. Reason[%d]: %s\n",
              (int)ok.Offset(), rapidjson::GetParseError_En(ok.Code()));
    }
    double num_cust = 0;
    double avg_balance = 0.0;
    for (auto && obj : js.GetObject()) {
        if (std::string(obj.name.GetString()) == "customers") {
            for (auto && cust : obj.value.GetArray()) {
                for (auto && prop : cust.GetObject()) {
                    if (std::string(prop.name.GetString()) == "balance") {
                        ++num_cust;
                        // balances in the sample data are fractional, so read as double
                        double balance = prop.value.GetDouble();
                        avg_balance = avg_balance + ((balance - avg_balance) / num_cust);
                    }
                }
            }
        }
    }
    fclose(fp);
    printf("average balance: %.2f\n", avg_balance);
    return 0;
}
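RapidJSON is header-only, so building the example is just a matter of pointing the compiler at its include directory. Something like the following should work (the source filename and include path here are placeholders for wherever the code and headers live):
>g++ -std=c++11 -O2 -o read_json read_json.cpp -I<path-to-rapidjson>/include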
Running with Memory Constraints
On Linux systems with systemd support, systemd-run is a one-shot way to run a command in a transient scope/service unit.
To limit the memory of a command:
systemd-run --user -t -G --wait -p MemoryMax=<max> <cmd+args>
Constraining the memory of the command that reads all of the JSON at once shows the process getting killed by the OOM killer (oom-kill):
>ls -l /tmp/big.json
-rw-rw-r-- 1 rmorrison rmorrison 1572998392 Apr 15 14:56 /tmp/big.json
...
>systemd-run --user -t -G --wait -p MemoryMax=64M \
/tmp/read_json /tmp/big.json
Running as unit: run-u8357.service
Press ^] three times within 1s to disconnect TTY.
Finished with result: oom-kill
Main processes terminated with: code=killed/status=KILL
Service runtime: 5.124s
CPU time consumed: 5.084s
Streaming w/ RapidJSON
Instead of pulling the entire file into memory, mmap the file and stream the data through a library that supports incremental (SAX-style) parsing.
#include <cstdio>
#include <sys/mman.h>
#include "rapidjson/error/en.h"
#include "rapidjson/reader.h"

// FATAL is the macro from the first example; mmap_file() and jshandler
// are shown below.
int main(int argc, char **argv) {
    char* buf = nullptr;
    size_t buf_len = 0;
    // map the whole file into the address space; pages are faulted in on demand
    if (mmap_file(argv[1], &buf, buf_len) != 0) {
        return -1;
    }
    rapidjson::StringStream ss(buf);
    jshandler handler;
    rapidjson::Reader reader;
    rapidjson::ParseResult ok = reader.Parse(ss, handler);
    if (!ok) {
        FATAL("error parsing json. Reason[%d]: %s\n",
              (int)ok.Offset(), rapidjson::GetParseError_En(ok.Code()));
    }
    printf("average balance: %.2f\n", handler.avg_balance);
    munmap(buf, buf_len);
    return 0;
}
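The mmap_file() helper isn't listed in the snippet above; a minimal sketch of what it could look like, given the signature used in main() (error reporting trimmed):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only into the process address space.
// Returns 0 on success; buf/len receive the mapping and its size.
static int mmap_file(const char* path, char** buf, size_t& len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after the fd is closed
    if (addr == MAP_FAILED) {
        return -1;
    }
    *buf = (char*)addr;
    return 0;
}
One subtlety: rapidjson::StringStream reads until a NUL byte, which works here because mmap zero-fills the mapped region past end-of-file; a file whose size is an exact multiple of the page size would need an extra terminator.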
This isn’t a robust approach, but the reader can perform aggregations based on filters or keys; in this case it watches for "balance" fields in individual objects and folds each value into a running mean, so memory use stays roughly constant regardless of file size.
struct jshandler : public rapidjson::BaseReaderHandler<rapidjson::UTF8<>, jshandler> {
    // numeric handlers: only fold a value into the average if it
    // immediately followed a "balance" key
    bool Int(int i) { if (balance_flag) { avg_more((double)i); } return true; }
    bool Uint(unsigned u) { if (balance_flag) { avg_more((double)u); } return true; }
    bool Double(double d) { if (balance_flag) { avg_more(d); } return true; }
    // mapping key balance -> avg calculation
    bool Key(const char* str, rapidjson::SizeType length, bool copy) {
        if (std::string(str, length) == "balance") {
            balance_flag = true;
        }
        return true;
    }
    // recalc running mean and unset balance flag
    void avg_more(double val) {
        ++num_cust;
        avg_balance = avg_balance + ((val - avg_balance) / num_cust);
        balance_flag = false;
    }
    bool balance_flag = false;
    double num_cust = 0;
    double avg_balance = 0.0;
};
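As a quick sanity check of the handler on its own, it can be driven from a small in-memory string before pointing it at the big file (the two records here are made up for illustration):
// feed a tiny inline document through the same handler
const char* sample =
    "{\"customers\":[{\"name\":\"A\",\"balance\":100},"
    "{\"name\":\"B\",\"balance\":300}]}";
rapidjson::StringStream ss(sample);
jshandler handler;
rapidjson::Reader reader;
reader.Parse(ss, handler);
printf("avg: %.2f\n", handler.avg_balance); // expect 200.00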
Results
Running the streaming version of the JSON reader w/ memory limits:
>systemd-run --user -t -G --wait -p MemoryMax=64M \
/tmp/read_json_stream /tmp/big.json
Running as unit: run-u8358.service
Press ^] three times within 1s to disconnect TTY.
average balance: 494901.69
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 3.352s
CPU time consumed: 3.344s
Just as a sanity check, verifying results against the original non-streaming reader w/o memory limits:
>./read_json /tmp/big.json
average balance: 494901.69
Link to code: https://github.com/tinselcity/experiments/tree/master/big_json
Notes
Throughput
This post was more about dealing with memory constraints than performance, but in my anecdotal testing, I found simdjson parses 4-10x faster than RapidJSON. simdjson also appears to support streaming an object as well as a stream of records.
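For a sense of what that looks like, here is a rough sketch of the same aggregation using simdjson's On-Demand API, based on my reading of its documentation (exception-based error handling, field names from the contrived example above). Note that padded_string::load() pulls the whole file into memory, so this is only an API/throughput illustration, not the memory-constrained path:
#include <cstdio>
#include "simdjson.h"

int main(int argc, char** argv) {
    simdjson::ondemand::parser parser;
    simdjson::padded_string json = simdjson::padded_string::load(argv[1]);
    simdjson::ondemand::document doc = parser.iterate(json);
    double num_cust = 0;
    double avg_balance = 0.0;
    // walk the customers array lazily; values are parsed on demand
    simdjson::ondemand::array customers = doc["customers"].get_array();
    for (simdjson::ondemand::object cust : customers) {
        double balance = cust["balance"];
        ++num_cust;
        avg_balance += (balance - avg_balance) / num_cust;
    }
    printf("average balance: %.2f\n", avg_balance);
    return 0;
}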
Largest Object
Both simdjson and RapidJSON appear to have a 4GB single object/file constraint, although RapidJSON can be customized to support larger sizes according to its author.
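For RapidJSON, the customization in question appears to be overriding SizeType before including any of its headers; a sketch of the pattern from the RapidJSON docs (I haven't tested it against a >4GB file):
#include <cstddef>

// tell RapidJSON not to define its default 32-bit SizeType,
// then provide a 64-bit one before pulling in any headers
#define RAPIDJSON_NO_SIZETYPEDEFINE
namespace rapidjson { typedef ::std::size_t SizeType; }

#include "rapidjson/document.h"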
References
- RapidJSON: https://rapidjson.org
- simdjson: https://github.com/simdjson/simdjson
- Linux Control Group v2: https://docs.kernel.org/admin-guide/cgroup-v2.html
- Linux Page Cache Tutorial: https://biriukov.dev/docs/page-cache/6-cgroup-v2-and-page-cache/