Understanding Audio Media Flow
Introduction
During a call, media components are managed by PJSUA-LIB (when PJSUA-LIB or PJSUA2 is used), or by the application if it uses the low-level PJSIP or PJMEDIA APIs directly. Typically the media components for a (PJSUA-LIB) call are interconnected as follows:

The main building blocks in the above diagram are the following components:

- an audio device stream (pjmedia_aud_stream), which represents a sound device,
- a Sound Device Port (pjmedia_snd_port), to translate sound device callbacks into calls to the downstream media port's pjmedia_port_put_frame()/pjmedia_port_get_frame(),
- a Media Stream (pjmedia_stream), to convert between PCM audio and encoded RTP/RTCP packets,
- a Media Transport (pjmedia_transport), to transmit and receive RTP/RTCP packets to/from the network.
The media interconnection above would be set up as follows:
- PJSUA-LIB (or the application) creates a conference bridge (pjmedia_conf) during initialization, and normally retains it throughout the lifetime of the application.
- When making an outgoing call or receiving an incoming call, PJSUA-LIB opens an audio device stream (pjmedia_aud_stream), creates a sound device port (pjmedia_snd_port), and creates a media transport instance such as the UDP media transport. The listening address and port number of the transport are put in the local SDP to be given to the INVITE session.
- Once the offer/answer session in the call is established, the pjsip_inv_callback::on_media_update callback is called and PJSUA-LIB creates a pjmedia_stream_info from both the local and remote SDP using pjmedia_stream_info_from_sdp().
- PJSUA-LIB creates a media stream (pjmedia_stream) with pjmedia_stream_create(), specifying the pjmedia_stream_info and the media transport created earlier. This creates the media stream according to the codec settings and other parameters in the media stream info, and also establishes the connection between the media stream and the media transport.
- The application registers this media stream to the conference bridge with pjmedia_conf_add_port().
- The application connects the media stream's slot in the bridge to other slots, such as slot zero which normally is connected to the sound device, with pjmedia_conf_connect_port(). A sketch of this whole sequence is shown below.
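As a rough illustration of these steps with the low-level PJMEDIA API (this is a minimal sketch, not PJSUA-LIB's actual code; error handling is omitted, and pool, med_endpt, conf, transport, local_sdp, and remote_sdp are assumed to have been created in the earlier steps):

```c
#include <pjmedia.h>

pjmedia_stream_info si;
pjmedia_stream *stream;
pjmedia_port *stream_port;
unsigned slot;

/* Build the stream info from the negotiated local and remote SDP. */
pjmedia_stream_info_from_sdp(&si, pool, med_endpt,
                             local_sdp, remote_sdp, 0);

/* Create the media stream, attaching it to the media transport. */
pjmedia_stream_create(med_endpt, pool, &si, transport, NULL, &stream);
pjmedia_stream_start(stream);

/* Register the stream's media port to the conference bridge... */
pjmedia_stream_get_port(stream, &stream_port);
pjmedia_conf_add_port(conf, pool, stream_port, NULL, &slot);

/* ...and connect it with slot zero (the sound device), both ways. */
pjmedia_conf_connect_port(conf, 0, slot, 0);
pjmedia_conf_connect_port(conf, slot, 0, 0);
```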
The whole media flow is driven by the timing of the sound device, especially the playback callback.
Audio playback flow (the main flow)
- When pjmedia_aud_stream needs another frame to be played to the speaker, it calls the play_cb callback that was specified in pjmedia_aud_stream_create().
- This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_get_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
- The conference bridge calls pjmedia_port_get_frame() for all ports in the bridge, then mixes the signals together according to the port connections in the bridge, and delivers the mixed signals by calling pjmedia_port_put_frame() for all ports in the bridge according to their connections.
- A pjmedia_port_get_frame() call by the conference bridge to the media stream (pjmedia_stream) causes it to pick one frame from the jitter buffer, decode the frame using the configured codec (or apply Packet Loss Concealment (PLC) if the frame is lost), and return the PCM frame to the caller. Note that the jitter buffer is filled in by another flow (the flow that polls the network sockets), which is described in a later section below.
- A pjmedia_port_put_frame() call by the conference bridge to the media stream causes the media stream to encode the PCM frame with the chosen codec, pack it into an RTP packet with its RTP session, update the RTCP session, schedule RTCP transmission, and deliver the RTP/RTCP packets to the underlying media transport that was previously attached to the stream. The media transport then sends the RTP/RTCP packets to the network.
- Once these processes finish, the conference bridge returns the mixed signal for slot zero back to the original pjmedia_port_get_frame() call.
- pjmedia_snd_port receives the audio frame and returns it to the audio device stream to finish the play_cb callback.
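The sound-device-to-bridge chain in the first two steps is established by connecting the sound device port to the bridge's master port, roughly as in the following sketch (pool and conf are assumed to exist; the audio parameters are illustrative):

```c
#include <pjmedia.h>

pjmedia_snd_port *snd_port;

/* Open the default capture and playback devices: 8 kHz, mono,
 * 160 samples (20 ms) per frame, 16 bits per sample. */
pjmedia_snd_port_create(pool, -1, -1, 8000, 1, 160, 16, 0, &snd_port);

/* Make the conference bridge the sound port's downstream port, so
 * that play_cb/rec_cb translate into pjmedia_port_get_frame()/
 * pjmedia_port_put_frame() calls on the bridge. */
pjmedia_snd_port_connect(snd_port, pjmedia_conf_get_master_port(conf));
```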
Audio recording flow
The above flow only describes the flow in one direction, i.e. to the speaker device. But what about the audio flow coming from the microphone?
- When the input sound (microphone) device has finished capturing one audio frame, it reports this event by calling the rec_cb callback that was specified in pjmedia_aud_stream_create().
- This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_put_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
- When pjmedia_port_put_frame() is called on the conference bridge, the bridge simply stores the PCM frame in an internal buffer, to be picked up by the main flow (the pjmedia_port_get_frame() call to the bridge above) when the bridge collects frames from all ports and mixes the signals.
Sound device timing problem
Ideally, rec_cb and play_cb should be called one after another, in turn and consecutively, by the sound device. Unfortunately this is not always the case; on many low-end sound cards, it is quite common to get a burst of several consecutive rec_cb callbacks followed by a burst of play_cb calls.
So with 20 ms ptime, for example, rather than delivering one frame every 20 ms, these devices would give PJMEDIA three or four frames every 60 ms or 80 ms. Since RTP packets are transmitted as soon as an audio frame is available from the sound card, this causes PJMEDIA to transmit RTP packets at (what look like) irregular intervals.
This should be fine, as the remote endpoint should be able to accommodate this with its jitter buffer. However, if you would rather have PJMEDIA transmit RTP packets at regular intervals, you can install a master clock port between the sound device and the conference bridge, so that the master port drives the media clock instead. A master clock port uses an internal thread to drive the media flow, so it should provide better timing on most platforms.
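For illustration, the sketch below shows a master clock port driving the bridge from its own thread. It follows the arrangement PJMEDIA uses for a null sound device (a pjmedia_null_port as the upstream port, the bridge's master port downstream); wiring an actual sound device into this arrangement needs extra plumbing that is omitted here. pool and conf are assumed.

```c
#include <pjmedia.h>

pjmedia_port *null_port;
pjmedia_master_port *master_port;

/* A null port produces/consumes silence at a steady nominal rate:
 * 8 kHz, mono, 160 samples per frame, 16 bits per sample. */
pjmedia_null_port_create(pool, 8000, 1, 160, 16, &null_port);

/* The master port's internal clock thread exchanges frames between
 * its two ports at a regular 20 ms interval, independently of the
 * sound device callbacks' timing. */
pjmedia_master_port_create(pool, null_port,
                           pjmedia_conf_get_master_port(conf),
                           0, &master_port);
pjmedia_master_port_start(master_port);
```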
Fortunately, with the PJSUA/PJSUA2 API this is made simple by instantiating an extra audio device, see #2077.
Note
Some sound device features will be unavailable on the extra audio device:
- auto close on idle
- stereo mode
*Update*: since 2.15, there is a simpler & better approach, see #4149.
Another less common problem with the sound device is when the total number of samples played and/or recorded by the sound device does not match the requested clock rate. This is what we call audio clock drift.
Both of these problems can be analyzed with Testing and optimizing audio device with pjsystest.
The conference bridge handles these problems by using an Adaptive Delay Buffer. The delay buffer continuously learns the optimal delay to be applied to the audio flow at run-time, and may expand or shrink the buffer without distorting the audio quality. The drawback of using this buffer, however, is increased latency: the latency grows with the jitter/burst characteristics of the sound device.
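The delay buffer is also available as a standalone API (pjmedia/delaybuf.h). The following is a rough sketch of the put/get pattern the bridge relies on internally; the parameter values are illustrative and pool is assumed:

```c
#include <pjmedia.h>

pjmedia_delay_buf *db;
pj_int16_t frame[160];              /* 20 ms at 8 kHz, mono */

/* Create a delay buffer able to absorb up to 150 ms of jitter. */
pjmedia_delay_buf_create(pool, "sndbuf", 8000, 160, 1,
                         150 /* max delay, ms */, 0, &db);

/* Capture side (rec_cb): store one captured frame. */
pjmedia_delay_buf_put(db, frame);

/* Playback side (get_frame): fetch one frame; the buffer adapts its
 * latency to the observed burst pattern of puts and gets. */
pjmedia_delay_buf_get(db, frame);
```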
The maximum (sound device) jitter that can be accommodated by the conference bridge's delay buffer is controlled by the PJMEDIA_SOUND_BUFFER_COUNT macro, whose default value corresponds to around 150 ms. It is possible that a very, very bad sound device may overrun this buffer, in which case it would be necessary to enlarge the PJMEDIA_SOUND_BUFFER_COUNT number in your config_site.h.
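Such an override is a one-line addition to config_site.h; the value below is an arbitrary illustration, to be sized according to the burstiness actually measured on the device:

```c
/* config_site.h: enlarge the bridge's sound buffer to accommodate a
 * very bursty sound device (the value 16 is illustrative). */
#define PJMEDIA_SOUND_BUFFER_COUNT  16
```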
The untimely nature of the sound device may also contribute to the overall jitter seen by the jitter buffer. See Jitter buffer features and operations for more information.
Incoming RTP/RTCP Packets
Incoming RTP/RTCP packets are not driven by any of the flows above, but by a different flow ("thread"): the flow/thread that polls the socket descriptors of the media transport.
The standard implementation of UDP media transport in PJMEDIA will register the RTP and RTCP sockets to a pj_ioqueue_t (see IOQUEUE documentation).
The application can choose between different strategies with regard to placing the ioqueue instance:

- The application can instruct the Media Endpoint to instantiate an internal ioqueue and start one or more worker threads to poll it. This is probably the recommended strategy, so that polling of the media sockets is done by a separate thread (and this is the default setting in PJSUA-LIB).
- Alternatively, the application can use a single ioqueue for both SIP and media sockets, and poll the whole thing from a single thread, possibly the main thread. To use this, the application specifies the ioqueue instance to be used when creating the media endpoint and disables the worker threads. This strategy is probably preferable on small-footprint devices to reduce (or eliminate) threads in the system. Both strategies are sketched below.
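Both strategies map to the arguments of pjmedia_endpt_create(). A minimal sketch, assuming cp is an initialized pj_caching_pool and sip_ioqueue is the ioqueue already used for SIP:

```c
#include <pjmedia.h>

pjmedia_endpt *med_endpt;

/* Strategy 1: let the media endpoint create its own internal ioqueue
 * and poll it with one worker thread (the PJSUA-LIB default). */
pjmedia_endpt_create(&cp.factory, NULL, 1, &med_endpt);

/* Strategy 2: share the SIP ioqueue and disable worker threads; the
 * application then polls that ioqueue itself, e.g. from the main
 * thread. */
pjmedia_endpt_create(&cp.factory, sip_ioqueue, 0, &med_endpt);
```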
The flow of incoming RTP packets is as follows:
- An internal worker thread in the Media Endpoint polls the ioqueue.
- An incoming packet causes the ioqueue to call the on_rx_rtp() callback of the UDP media transport. This callback was previously registered by the UDP media transport to the ioqueue.
- The on_rx_rtp() callback reports the incoming RTP packet to the media stream. The media stream was attached to the UDP media transport with pjmedia_transport_attach() (sketched below).
- The media stream unpacks the RTP packet using its internal RTP session, updates the RX statistics, de-frames the payload according to the codec being used (there can be multiple audio frames in a single RTP packet), and puts the frames in the jitter buffer.
- The processing of incoming packets stops here, as the frames in the jitter buffer will be picked up by the main flow (a call to pjmedia_port_get_frame() to the media stream) described above.
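For reference, the attachment that wires the transport's receive path to a stream looks roughly like the sketch below. pjmedia_stream_create() performs the equivalent attach internally; the callbacks and helper here are hypothetical, shown only to illustrate the shape of the hooks:

```c
#include <pjmedia.h>

/* Hypothetical receive callbacks, invoked by the ioqueue polling
 * thread whenever an RTP/RTCP packet arrives on the transport. */
static void my_rtp_cb(void *user_data, void *pkt, pj_ssize_t size)
{
    /* A real stream unpacks the RTP header here, updates RX
     * statistics, and puts the payload frame(s) into the jitter
     * buffer. */
}

static void my_rtcp_cb(void *user_data, void *pkt, pj_ssize_t size)
{
    /* A real stream feeds this packet to its RTCP session. */
}

static pj_status_t attach_stream(pjmedia_transport *tp, void *stream,
                                 const pj_sockaddr_in *rem_rtp,
                                 const pj_sockaddr_in *rem_rtcp)
{
    /* After this call, packets arriving on the transport's sockets
     * are delivered to the callbacks above. */
    return pjmedia_transport_attach(tp, stream, rem_rtp, rem_rtcp,
                                    sizeof(pj_sockaddr_in),
                                    &my_rtp_cb, &my_rtcp_cb);
}
```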