Understanding Audio Media Flow


During a call, the media components are managed by PJSUA-LIB when PJSUA-LIB or PJSUA2 is used, or by the application itself when it uses the low-level PJSIP or PJMEDIA APIs directly. Typically, the media components for a (PJSUA-LIB) call are interconnected as follows:


The main building blocks in the diagram above are the following components:

The media interconnection above would be set up as follows:

The whole media flow is driven by the timing of the sound device, especially its playback callback.

Audio playback flow (the main flow)

  1. When pjmedia_aud_stream needs another frame to play to the speaker, it calls the play_cb callback that was specified in pjmedia_aud_stream_create().

  2. This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_get_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).

  3. The conference bridge calls pjmedia_port_get_frame() for all ports in the conference bridge.

    1. It then mixes the signals together according to the port connections in the bridge, and delivers the mixed signal by calling pjmedia_port_put_frame() for all ports in the bridge according to their connections.

    2. A pjmedia_port_get_frame() call by the conference bridge to a media stream (pjmedia_stream) causes the stream to pick one frame from the jitter buffer, decode the frame using the configured codec (or apply Packet Loss Concealment/PLC if the frame is lost), and return the PCM frame to the caller. Note that the jitter buffer is filled by another flow (the flow that polls the network sockets), which is described in a later section below.

    3. A pjmedia_port_put_frame() call by the conference bridge to a media stream causes the media stream to encode the PCM frame with the chosen codec, pack it into an RTP packet using its RTP session, update the RTCP session, schedule RTCP transmission, and deliver the RTP/RTCP packets to the underlying media transport that was previously attached to the stream. The media transport then sends the RTP/RTCP packets to the network.

    4. Once these processes finish, the conference bridge returns the mixed signal for slot zero to the original pjmedia_port_get_frame() call.

  4. The pjmedia_snd_port receives the audio frame and returns it to the audio device stream, which completes the play_cb callback.
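The pull-driven nature of this flow can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions and does not use the PJMEDIA API: the names toy_port, toy_bridge and toy_play_cb are hypothetical stand-ins for pjmedia_port, pjmedia_conf and the play_cb callback, the frame is only four samples long, and every slot is assumed to be connected to every other slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAMPLES_PER_FRAME 4   /* tiny frame, for illustration only */
#define MAX_PORTS 4

/* Hypothetical stand-in for a pjmedia_port such as a media stream. */
typedef struct toy_port {
    int16_t next_out[SAMPLES_PER_FRAME]; /* frame returned by get_frame */
    int16_t last_in[SAMPLES_PER_FRAME];  /* frame last given to put_frame */
    int put_count;                       /* number of put_frame calls */
} toy_port;

/* Hypothetical stand-in for pjmedia_conf. Slot 0 is the sound device. */
typedef struct toy_bridge {
    toy_port *ports[MAX_PORTS];     /* slots 1..port_count-1; [0] unused */
    unsigned port_count;            /* number of slots, including slot 0 */
    int16_t mic[SAMPLES_PER_FRAME]; /* frame stored by the recording flow */
} toy_bridge;

int16_t saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

void toy_port_get_frame(toy_port *p, int16_t *out)
{
    memcpy(out, p->next_out, sizeof p->next_out);
}

void toy_port_put_frame(toy_port *p, const int16_t *in)
{
    memcpy(p->last_in, in, sizeof p->last_in);
    p->put_count++;
}

/* Mix all input frames except the one from slot 'skip'. */
void mix_except(int16_t in[][SAMPLES_PER_FRAME], unsigned n,
                unsigned skip, int16_t *out)
{
    unsigned s, j;
    for (s = 0; s < SAMPLES_PER_FRAME; ++s) {
        int32_t sum = 0;
        for (j = 0; j < n; ++j)
            if (j != skip) sum += in[j][s];
        out[s] = saturate(sum);
    }
}

/* One bridge "tick": collect, mix, deliver, then return the slot 0 mix. */
void toy_bridge_get_frame(toy_bridge *b, int16_t *speaker)
{
    int16_t in[MAX_PORTS][SAMPLES_PER_FRAME];
    int16_t out[SAMPLES_PER_FRAME];
    unsigned i;

    assert(b->port_count >= 1 && b->port_count <= MAX_PORTS);
    memcpy(in[0], b->mic, sizeof b->mic);          /* slot 0 input */
    for (i = 1; i < b->port_count; ++i)            /* step 3: collect */
        toy_port_get_frame(b->ports[i], in[i]);
    for (i = 1; i < b->port_count; ++i) {          /* step 3.1: deliver */
        mix_except(in, b->port_count, i, out);
        toy_port_put_frame(b->ports[i], out);
    }
    mix_except(in, b->port_count, 0, speaker);     /* step 3.4 */
}

/* Stand-in for play_cb: the sound device asks for one frame to play. */
void toy_play_cb(toy_bridge *b, int16_t *device_buf)
{
    toy_bridge_get_frame(b, device_buf);
}
```

With one call port in slot 1, a single toy_play_cb() call returns that port's frame to the speaker and hands the stored microphone frame to the port, which mirrors how one playback callback drives both directions of the call.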

Audio recording flow

The above flow only describes the flow in one direction, i.e. to the speaker device. But what about the audio flow coming from the microphone?

  1. When the input sound (microphone) device has finished capturing one audio frame, it reports this event by calling the rec_cb callback that was specified in pjmedia_aud_stream_create().

  2. This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_put_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).

  3. When pjmedia_port_put_frame() is called on the conference bridge, the bridge simply stores the PCM frame in an internal buffer, to be picked up by the main flow (the pjmedia_port_get_frame() call to the bridge described above) when the bridge collects frames from all ports and mixes the signal.
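The hand-off between the two flows can be modelled as a small frame queue: the recording callback only stores, and the playback-driven flow only fetches. The following is a minimal single-threaded sketch, not the bridge's actual buffer; capture_buf, capture_put, capture_get and toy_rec_cb are hypothetical names, and the capacity and frame size are arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SAMPLES 4   /* tiny frame, for illustration only */
#define CAPTURE_SLOTS 8   /* hypothetical capacity, in frames */

typedef struct capture_buf {
    int16_t frames[CAPTURE_SLOTS][FRAME_SAMPLES];
    unsigned head;    /* index of the oldest stored frame */
    unsigned count;   /* number of frames currently stored */
} capture_buf;

/* Recording side: store one frame.
 * Returns 0 on success, -1 if the buffer is full and the frame is dropped. */
int capture_put(capture_buf *cb, const int16_t *frame)
{
    unsigned tail;
    assert(cb != NULL && frame != NULL);
    if (cb->count == CAPTURE_SLOTS)
        return -1;
    tail = (cb->head + cb->count) % CAPTURE_SLOTS;
    memcpy(cb->frames[tail], frame, sizeof cb->frames[tail]);
    cb->count++;
    return 0;
}

/* Playback side: fetch the oldest frame while the bridge collects input.
 * Returns 1 if a captured frame was copied, 0 if silence was substituted. */
int capture_get(capture_buf *cb, int16_t *frame)
{
    assert(cb != NULL && frame != NULL);
    if (cb->count == 0) {
        memset(frame, 0, FRAME_SAMPLES * sizeof *frame);
        return 0;
    }
    memcpy(frame, cb->frames[cb->head], sizeof cb->frames[cb->head]);
    cb->head = (cb->head + 1) % CAPTURE_SLOTS;
    cb->count--;
    return 1;
}

/* Stand-in for rec_cb: the device reports one captured frame. */
int toy_rec_cb(capture_buf *cb, const int16_t *mic_frame)
{
    return capture_put(cb, mic_frame);
}
```

If capture_get() runs before any frame has been recorded, it returns silence, and if toy_rec_cb() runs many times in a row the queue fills up. Both cases are symptoms of the timing issues discussed in the next section.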

Sound device timing problem

Ideally, the sound device should call rec_cb and play_cb alternately, one after the other. Unfortunately this is not always the case: with many low-end sound cards it is quite common to get a burst of several consecutive rec_cb calls followed by a burst of play_cb calls.

Another, less common, problem occurs when the total number of samples played and/or recorded by the sound device does not match the requested clock rate. This is what we call audio clock drift.

Both of these problems can be analyzed with Testing and optimizing audio device with pjsystest.
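To make clock drift concrete, the arithmetic can be written down as two small helpers. This is an illustrative sketch only; the function names are hypothetical and are not part of PJMEDIA or pjsystest. The first expresses drift in parts per million, the second estimates how long it takes for the drift to add up to one whole frame.

```c
#include <assert.h>
#include <stdint.h>

/* Drift in parts per million. A positive value means the device produced
 * or consumed more samples than the nominal clock rate predicts. */
double clock_drift_ppm(uint64_t samples_counted, double elapsed_sec,
                       unsigned clock_rate)
{
    double expected;
    assert(elapsed_sec > 0.0 && clock_rate > 0);
    expected = elapsed_sec * (double)clock_rate;
    return ((double)samples_counted - expected) / expected * 1e6;
}

/* Seconds until the drift accumulates to one full frame of samples. */
double seconds_per_frame_slip(double drift_ppm, unsigned clock_rate,
                              unsigned samples_per_frame)
{
    double excess_per_sec = (drift_ppm < 0.0) ? -drift_ppm : drift_ppm;
    excess_per_sec = excess_per_sec * 1e-6 * (double)clock_rate;
    assert(excess_per_sec > 0.0);
    return (double)samples_per_frame / excess_per_sec;
}
```

For example, a device opened at 16000 Hz that delivers 960096 samples in 60 seconds is about 100 ppm fast, and with 320-sample (20 ms) frames it gains one extra frame roughly every 200 seconds.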

The conference bridge handles these problems by using an adaptive delay buffer. The delay buffer continuously learns, at run-time, the optimal delay to apply to the audio flow, and may expand or shrink the buffer without distorting the audio quality. The drawback of using this buffer, however, is increased latency: the latency grows with the jitter/burst characteristics of the sound device.
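The idea of learning a delay from observed bursts can be sketched as follows. This is a deliberately simplified model and not PJMEDIA's actual delay buffer algorithm: delay_learner, LEARN_WINDOW and the expand/shrink policy are assumptions made for illustration, and the audio processing needed to shrink a buffer without distortion is left out entirely.

```c
#include <assert.h>

#define LEARN_WINDOW 50   /* hypothetical: operations per learning window */

typedef enum { OP_NONE, OP_PUT, OP_GET } op_type;

typedef struct delay_learner {
    op_type  last_op;        /* type of the previous operation */
    unsigned run_len;        /* length of the current run of that type */
    unsigned window_max;     /* longest run seen in the current window */
    unsigned window_ops;     /* operations seen in the current window */
    unsigned target_frames;  /* learned delay, in frames */
} delay_learner;

void learner_init(delay_learner *l)
{
    l->last_op = OP_NONE;
    l->run_len = 0;
    l->window_max = 0;
    l->window_ops = 0;
    l->target_frames = 1;
}

/* Record one put (rec_cb side) or get (play_cb side) operation. */
void learner_record(delay_learner *l, op_type op)
{
    assert(op == OP_PUT || op == OP_GET);
    if (op == l->last_op) {
        l->run_len++;
    } else {
        l->last_op = op;
        l->run_len = 1;
    }
    if (l->run_len > l->window_max)
        l->window_max = l->run_len;
    /* Expand at once when a longer burst shows up. */
    if (l->run_len > l->target_frames)
        l->target_frames = l->run_len;
    /* Shrink only after a full window without such a burst. */
    if (++l->window_ops == LEARN_WINDOW) {
        l->target_frames = l->window_max;
        l->window_max = 0;
        l->window_ops = 0;
    }
}

/* Added latency implied by the learned delay. */
unsigned learner_latency_ms(const delay_learner *l, unsigned ptime_ms)
{
    return l->target_frames * ptime_ms;
}
```

In this model a device that strictly alternates callbacks settles at one frame of delay, while a single burst of three rec_cb calls raises the delay to three frames (60 ms at 20 ms per frame) until a full window passes without another burst.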

The maximum sound device jitter that the conference bridge can accommodate is controlled by the PJMEDIA_SOUND_BUFFER_COUNT macro, whose default value corresponds to around 150 ms. A very bad sound device may still overrun this buffer, in which case it is necessary to increase PJMEDIA_SOUND_BUFFER_COUNT in your config_site.h.
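Such an override might look like the fragment below. The value 16 is only an illustration, not a recommendation, and the millisecond figure in the comment assumes that each buffer holds one 20 ms frame.

```c
/* config_site.h */

/* Illustrative value only. Assuming one 20 ms frame per buffer, 16 buffers
 * would tolerate roughly 320 ms of sound device jitter, at the cost of
 * more memory and potentially more latency. */
#define PJMEDIA_SOUND_BUFFER_COUNT  16
```

Since config_site.h is a compile-time configuration file, the library has to be rebuilt for the new value to take effect.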

The irregular timing of the sound device may also contribute to the overall jitter seen by the jitter buffer. See Jitter buffer features and operations for more information.

Incoming RTP/RTCP Packets

Incoming RTP/RTCP packets are not driven by any of the flows above, but by a different flow (“thread”): the one that polls the socket descriptors of the media transport.

The standard implementation of the UDP media transport in PJMEDIA registers the RTP and RTCP sockets to a pj_ioqueue_t (see IOQUEUE documentation). The application can choose between different strategies for placing the ioqueue instance:

  • The application can instruct the Media Endpoint to instantiate an internal IOQueue and start one or more worker threads to poll this ioqueue. This is probably the recommended strategy, since the media sockets are then polled by a separate thread (and this is the default setting in PJSUA-LIB).

  • Alternatively, the application can use a single ioqueue for both SIP and media sockets, and poll the whole thing from a single thread, possibly the main thread. To do this, the application specifies the ioqueue instance to be used when creating the media endpoint and disables the worker threads. This strategy is probably preferable on small-footprint devices, to reduce (or eliminate) threads in the system.

The flow of incoming RTP packets is as follows:

  1. An internal worker thread in the Media Endpoint polls the ioqueue.

  2. An incoming packet causes the ioqueue to call the on_rx_rtp() callback of the UDP media transport. This callback was previously registered with the ioqueue by the UDP media transport.

  3. The on_rx_rtp() callback reports the incoming RTP packet to the media stream. The media stream was attached to the UDP media transport with pjmedia_transport_attach().

  4. The media stream unpacks the RTP packet using its internal RTP session, updates the RX statistics, de-frames the payload according to the codec being used (there can be multiple audio frames in a single RTP packet), and puts the frames in the jitter buffer.

  5. The processing of incoming packets stops here, as the frames in the jitter buffer will be picked up by the main flow (a call to pjmedia_port_get_frame() on the media stream) described above.
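Steps 3 to 5 can be illustrated with a self-contained sketch that parses the fixed RTP header (as laid out in RFC 3550) and splits the payload into frames. It does not use the PJMEDIA RTP session or jitter buffer: rtp_info, toy_jbuf, deframe_to_jbuf and toy_on_rx_rtp are hypothetical names, the codec is assumed to produce fixed-size frames, and packets with padding or a header extension are simply rejected to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JB_SLOTS 16   /* hypothetical jitter buffer capacity, in frames */

typedef struct rtp_info {
    unsigned version, marker, payload_type;
    uint16_t seq;
    uint32_t timestamp, ssrc;
    const uint8_t *payload;
    size_t payload_len;
} rtp_info;

/* Toy jitter buffer: remembers only the timestamp of each queued frame. */
typedef struct toy_jbuf {
    uint32_t frame_ts[JB_SLOTS];
    unsigned count;
} toy_jbuf;

/* Parse the fixed 12-byte RTP header plus any CSRC entries.
 * Returns 0 on success, -1 if the packet is malformed or unsupported. */
int rtp_parse(const uint8_t *pkt, size_t len, rtp_info *out)
{
    size_t hdr_len;
    if (len < 12) return -1;
    out->version = pkt[0] >> 6;
    if (out->version != 2) return -1;
    if (pkt[0] & 0x30) return -1;   /* padding/extension not handled here */
    hdr_len = 12 + 4 * (size_t)(pkt[0] & 0x0F);   /* CSRC count */
    if (len < hdr_len) return -1;
    out->marker = pkt[1] >> 7;
    out->payload_type = pkt[1] & 0x7F;
    out->seq = (uint16_t)((pkt[2] << 8) | pkt[3]);
    out->timestamp = ((uint32_t)pkt[4] << 24) | ((uint32_t)pkt[5] << 16) |
                     ((uint32_t)pkt[6] << 8) | (uint32_t)pkt[7];
    out->ssrc = ((uint32_t)pkt[8] << 24) | ((uint32_t)pkt[9] << 16) |
                ((uint32_t)pkt[10] << 8) | (uint32_t)pkt[11];
    out->payload = pkt + hdr_len;
    out->payload_len = len - hdr_len;
    return 0;
}

/* Split the payload into fixed-size codec frames and queue each one.
 * Returns the number of frames queued. */
unsigned deframe_to_jbuf(const rtp_info *rtp, size_t frame_bytes,
                         uint32_t ts_per_frame, toy_jbuf *jb)
{
    unsigned n = 0;
    size_t off;
    assert(frame_bytes > 0);
    for (off = 0;
         off + frame_bytes <= rtp->payload_len && jb->count < JB_SLOTS;
         off += frame_bytes)
    {
        jb->frame_ts[jb->count++] = rtp->timestamp + n * ts_per_frame;
        n++;
    }
    return n;
}

/* Stand-in for the stream's RTP handler: parse, de-frame, queue, stop.
 * Returns the number of frames queued, or -1 if the packet was rejected. */
int toy_on_rx_rtp(const uint8_t *pkt, size_t len, size_t frame_bytes,
                  uint32_t ts_per_frame, toy_jbuf *jb)
{
    rtp_info rtp;
    if (rtp_parse(pkt, len, &rtp) != 0) return -1;
    return (int)deframe_to_jbuf(&rtp, frame_bytes, ts_per_frame, jb);
}
```

Note that toy_on_rx_rtp() never decodes anything. As in the flow above, the queued frames are left for the playback-driven flow to fetch and decode later.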