Node JS Streams: Understanding data concatenation





One of the first things you learn when you look at node's http module is this pattern for concatenating all of the data events coming from the request read stream:


let body = [];
request.on('data', chunk => {
    body.push(chunk);
}).on('end', () => {
    body = Buffer.concat(body).toString();
});



However, if you look at a lot of streaming library implementations, they seem to gloss over this entirely. Also, when I inspect the request.on('data',...) event, it almost always emits only once for a typical JSON payload with a few to a dozen properties.





You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.



Is this because the request stream, in handling POST and PUT bodies, pretty much only ever emits one data event, since their payload is well below the chunk partition size limit? In practice, how large would a JSON-encoded object need to be to be streamed in more than one data chunk?



It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object, it is almost always no larger than one emitted data chunk, which atomically transforms to one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).



I find this to be probably the most confusing aspect of really understanding the Node.js specifics of streams: there is a weird disconnect between streaming raw data and dealing with atomic chunks like objects. Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries? If someone could clarify this, it would be very much appreciated.




1 Answer



The job of the code you show is to collect all the data from the stream into one buffer so that, when the end event occurs, you have all the data.
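
For context, here's a minimal sketch of that pattern inside a plain http server (error handling is omitted, the port is a placeholder, and the JSON.parse assumes the client really sent JSON):

const http = require('http');

// Minimal sketch: collect every 'data' chunk, then parse once 'end' fires.
http.createServer((request, response) => {
    const chunks = [];
    request.on('data', chunk => {
        chunks.push(chunk);                          // each chunk is a Buffer
    }).on('end', () => {
        const body = Buffer.concat(chunks).toString();
        const parsed = JSON.parse(body);             // safe only after 'end'
        response.end(JSON.stringify({ keys: Object.keys(parsed).length }));
    });
}).listen(3000);                                     // hypothetical port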





request.on('data',...) may emit only once or it may emit hundreds of times. It depends upon the size of the data, the configuration of the stream object and the type of stream behind it. You cannot ever reliably assume it will only emit once.





You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.



You only use this concatenating pattern when you are trying to get the entire data from this stream into a single variable. The whole point of piping to another stream is that you don't need to fetch the entire data from one stream before sending it to the next stream. .pipe() will just send data as it arrives to the next stream for you. Same for transforms.
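
As a rough sketch of that point, piping the request body straight to a write stream needs no manual concatenation at all (the file name and port here are just placeholders):

const http = require('http');
const fs = require('fs');

// Sketch: chunks flow to the write stream as they arrive; nothing is
// accumulated in a variable first.
http.createServer((request, response) => {
    const out = fs.createWriteStream('upload.bin');    // hypothetical destination
    request.pipe(out);
    out.on('finish', () => response.end('stored\n'));
}).listen(3001);                                        // hypothetical port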





Is this because the request stream, in handling POST and PUT bodies, pretty much only ever emits one data event, since their payload is well below the chunk partition size limit?



It is likely because the payload is below some internal buffer size and the transport is sending all the data at once and you aren't running on a slow link and .... The point here is you cannot make assumptions about how many data events there will be. You must assume there can be more than one and that the first data event does not necessarily contain all the data or data separated on a nice boundary. Lots of things can cause the incoming data to get broken up differently.



Keep in mind that a readStream reads data until there's momentarily no more data to read (up to the size of the internal buffer) and then it emits a data event. It doesn't wait until the buffer fills before emitting a data event. So, since all data at the lower levels of the TCP stack is sent in packets, all it takes is a momentary delivery delay with some packet and the stream will find no more data available to read and will emit a data event. This can happen because of the way the data is sent, because of things that happen in the transport over which the data flows or even because of local TCP flow control if lots of stuff is going on with the TCP stack at the OS level.
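
If you want to observe this yourself, a small sketch like the following (reusing the request object from a handler such as the ones above) just counts the data events; the count will vary with payload size and network conditions:

// Sketch: count how many 'data' events actually arrive for one request.
// Real code should never depend on this being exactly 1.
let chunkCount = 0;
request.on('data', chunk => {
    chunkCount += 1;
    console.log(`chunk ${chunkCount}: ${chunk.length} bytes`);
}).on('end', () => {
    console.log(`total 'data' events: ${chunkCount}`);
});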





In practice, how large would a JSON-encoded object need to be to be streamed in more than one data chunk?



You really should not know or care because you HAVE to assume that any size object could be delivered in more than one data event. You can probably safely assume that a JSON object larger than the internal stream buffer size (which you could find out by studying the stream code or examining internals in the debugger) WILL be delivered in multiple data events, but you cannot assume the reverse because there are other variables such as transport-related things that can cause it to get split up into multiple events.
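
If you do want to see the buffer size for reference, a readable stream exposes it; this is informational only, and nothing in your code should branch on it:

// Sketch: readableHighWaterMark reports the stream's internal buffer size.
// For byte streams the default is 16384 bytes (16 KiB).
console.log(request.readableHighWaterMark);   // typically 16384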





It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object, it is almost always no larger than one emitted data chunk, which atomically transforms to one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).



Object mode streams must do their own internal buffering to find the boundaries of whatever objects they are parsing so that they can emit only whole objects. At some low level, they are concatenating data buffers and then examining them to see if they yet have a whole object.
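
As a rough illustration of that idea, here is what a hypothetical object-mode parser for newline-delimited JSON might look like: it buffers raw text until it has a complete line (the object boundary in this format) and only then pushes a whole parsed object downstream. This is a sketch of the buffering idea, not the code any particular library actually uses:

const { Transform } = require('stream');

class NdjsonParser extends Transform {
    constructor() {
        super({ readableObjectMode: true });   // input: Buffers, output: objects
        this.pending = '';
    }
    _transform(chunk, encoding, callback) {
        this.pending += chunk.toString();
        const lines = this.pending.split('\n');
        this.pending = lines.pop();            // keep the trailing partial line
        for (const line of lines) {
            if (line.trim()) this.push(JSON.parse(line));
        }
        callback();
    }
    _flush(callback) {
        if (this.pending.trim()) this.push(JSON.parse(this.pending));
        callback();
    }
}

// Usage sketch: request.pipe(new NdjsonParser()).on('data', obj => { /* whole object */ });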



Yes, you are correct that if you were using an object mode stream and the objects themselves were very large, they could consume a lot of memory. Likely this wouldn't be the optimal way of dealing with that type of data.



Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries?



Yes, they do.



FYI, the first thing I do when making http requests is to go use the request-promise library so I don't have to do my own concatenating. It handles all this for you. It also provides a promise-based interface and about 100 other features which I find helpful.
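
For example, a client-side request with request-promise might look roughly like this (the URL is a placeholder, and the options follow the request library's conventions):

const rp = require('request-promise');

// Sketch: the library collects the whole response body and resolves the
// promise with it; with json: true the body is parsed for you.
rp({
    uri: 'http://localhost:3000/',     // hypothetical endpoint
    method: 'POST',
    body: { hello: 'world' },
    json: true
})
    .then(responseBody => console.log(responseBody))
    .catch(err => console.error(err));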







Great explanation, +1. If you want to be specific, the max buffer for every chunk of the response stream is 16384 bytes, which can be obtained via res._readableState.highWaterMark. Anyway, as you stated, he shouldn't care in the code; he must assume that the JSON is larger than that.
– Marcos Casagrande
Jun 29 at 21:18








@MarcosCasagrande - And that value isn't documented or guaranteed not to change, and a subclass of a stream could change it, too. But, as I think you meant, they can't assume that a 10k piece of data that is smaller than the buffer size will always come in one data event. Transport-related things could cause it to be split into multiple events. The stream does not buffer until the buffer is full. It buffers until there's no more data to read and then emits the data event with whatever it has at the time.
– jfriend00
Jun 29 at 21:43







I only mentioned it so the OP knows the buffer limit; as I said, he shouldn't care at all in his code, as you perfectly explained. It is documented anyway.
– Marcos Casagrande
Jun 29 at 21:51





@MarcosCasagrande - Yeah. I added some more to the middle of my answer about how a stream emits a data event when either its buffer fills up or when it momentarily finds no more data to read. It does not wait for the buffer to fill before emitting the data event. So, any momentary slow-down in packet delivery (which can happen for lots of reasons) can trigger a data event to get sent before all the data has arrived. Hopefully this explains to the OP why they have to code for multiple data events or use higher level processing that does it for them.
– jfriend00
Jun 29 at 21:57







Yeah, it's very normal to have data triggered without reaching the highWaterMark threshold. A great read as I stated before.
– Marcos Casagrande
Jun 29 at 22:05







