Understanding Audio Media Flow
During a call, media components are managed by PJSUA-LIB when PJSUA-LIB or PJSUA2 is used, or by the application if it uses the low-level PJSIP or PJMEDIA API directly. Typically the media components for a (PJSUA-LIB) call are interconnected as follows:
The main building blocks in the above diagram are the following components:
an Audio Device Stream (pjmedia_aud_stream), which represents a sound device,
a Sound Device Port (pjmedia_snd_port), to translate sound device callbacks into calls to the downstream media port,
a Media Stream (pjmedia_stream), to convert between PCM audio and encoded RTP/RTCP packets,
a Media Transport (pjmedia_transport), to transmit and receive RTP/RTCP packets to/from the network.
The media interconnection above would be set up as follows:
PJSUA-LIB (or the application) creates a conference bridge (pjmedia_conf) during initialization, and normally retains it throughout the lifetime of the application.
when making an outgoing call or receiving an incoming call, PJSUA-LIB opens an audio device stream (pjmedia_aud_stream) and creates a sound device port (pjmedia_snd_port) and a media transport instance such as a UDP media transport. The listening address and port number of the transport are put in the local SDP to be given to the INVITE session.
once the offer/answer session in the call is established, the pjsip_inv_callback::on_media_update callback is called and PJSUA-LIB creates a pjmedia_stream_info from both the local and remote SDP.
PJSUA-LIB creates a media stream with pjmedia_stream_create(), specifying the pjmedia_stream_info and the media transport created earlier. This creates the media stream according to the codec settings and other parameters in the media stream info, and also establishes the connection between the media stream and the media transport.
the application registers this media stream to the conference bridge.
the application connects the media stream slot in the bridge to other slots, such as slot zero, which is normally connected to the sound device.
The whole media flow is driven by timing of the sound device, especially the playback callback.
Audio playback flow (the main flow)
When pjmedia_aud_stream needs another frame to be played to the speaker, it calls the play_cb callback. This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_get_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
The conference bridge calls pjmedia_port_get_frame() for all ports in the conference bridge, then mixes the signals together according to the port connections in the bridge, and delivers the mixed signal by calling pjmedia_port_put_frame() for all ports in the bridge according to their connections.
A pjmedia_port_get_frame() call by the conference bridge to a media stream (pjmedia_stream) causes the stream to pick one frame from the jitter buffer, decode it using the configured codec (or apply Packet Loss Concealment/PLC if the frame is lost), and return the PCM frame to the caller. Note that the jitter buffer is filled by another flow (the one that polls the network sockets), which is described in a later section.
A pjmedia_port_put_frame() call by the conference bridge to a media stream causes the media stream to encode the PCM frame with the chosen codec, pack it into an RTP packet with its RTP session, update the RTCP session, schedule RTCP transmission, and deliver the RTP/RTCP packets to the underlying media transport that was previously attached to the stream. The media transport then sends the RTP/RTCP packets to the network.
Once these processes finish, the conference bridge returns the mixed signal for slot zero back to the original pjmedia_port_get_frame() call. pjmedia_snd_port gets the audio frame and returns it to the audio device stream to finish the play_cb callback.
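The pull-driven tick described above can be sketched as a small, self-contained simulation. This is only an illustration of the get/mix/put order, not PJMEDIA code: `sim_port`, `sim_bridge`, `sim_bridge_get_frame()` and the tiny 4-sample frame are hypothetical stand-ins, and a real bridge also handles resampling, level adjustment and locking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SAMPLES   4   /* samples per frame (tiny, for illustration only) */
#define MAX_PORTS 4

/* Hypothetical stand-in for a media port: a PCM source and a PCM sink. */
typedef struct sim_port {
    int16_t tx[SAMPLES];   /* frame this port yields when the bridge pulls */
    int16_t rx[SAMPLES];   /* last frame the bridge pushed to this port */
} sim_port;

/* Hypothetical stand-in for the conference bridge. */
typedef struct sim_bridge {
    sim_port *ports[MAX_PORTS];
    int       connected[MAX_PORTS][MAX_PORTS]; /* connected[src][sink] */
    unsigned  count;
} sim_bridge;

static int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One bridge tick, as triggered from the playback callback:
 * pull a frame from every port, mix per connection, push the mix
 * to every sink, then return the mix for slot 0 in `out`. */
static void sim_bridge_get_frame(sim_bridge *b, int16_t out[SAMPLES])
{
    int16_t in[MAX_PORTS][SAMPLES];
    unsigned src, sink, i;

    assert(b != NULL && b->count >= 1 && b->count <= MAX_PORTS);

    /* Pull: the "get frame" step for all ports. */
    for (src = 0; src < b->count; ++src)
        memcpy(in[src], b->ports[src]->tx, sizeof in[src]);

    /* Mix and push: the "put frame" step for all ports. */
    for (sink = 0; sink < b->count; ++sink) {
        for (i = 0; i < SAMPLES; ++i) {
            int32_t acc = 0;
            for (src = 0; src < b->count; ++src)
                if (b->connected[src][sink])
                    acc += in[src][i];
            b->ports[sink]->rx[i] = clamp16(acc);
        }
    }

    /* Slot 0 (the sound device) receives its mix as the return value. */
    memcpy(out, b->ports[0]->rx, SAMPLES * sizeof(int16_t));
}
```

With two streams connected to slot 0 and slot 0 connected to the first stream, one tick returns the sum of both streams to the caller and delivers the slot 0 frame to the first stream only.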
Audio recording flow
The above flow only describes the flow in one direction, i.e. to the speaker device. But what about the audio flow coming from the microphone?
When the input sound (microphone) device has finished capturing one audio frame, it reports this event by calling the rec_cb callback. This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_put_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
When pjmedia_port_put_frame() is called on the conference bridge, the bridge just stores the PCM frame in an internal buffer, to be picked up by the main flow (the pjmedia_port_get_frame() call to the bridge above) when the bridge collects frames from all ports and mixes the signals.
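The "store now, mix later" hand-off between the recording flow and the playback flow can be sketched as follows. This is a deliberately minimal model with a single-frame buffer; `cap_buf`, `cap_put_frame()` and `cap_get_frame()` are hypothetical names, and the actual bridge uses a deeper, adaptive buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SAMPLES 4   /* samples per frame (tiny, for illustration only) */

/* Hypothetical capture buffer the bridge keeps for its sound device slot. */
typedef struct cap_buf {
    int16_t frame[SAMPLES];
    int     has_frame;      /* 1 while a captured frame is waiting */
} cap_buf;

/* Recording flow: only store the frame, no mixing happens here. */
static void cap_put_frame(cap_buf *cb, const int16_t pcm[SAMPLES])
{
    assert(cb != NULL && pcm != NULL);
    memcpy(cb->frame, pcm, sizeof cb->frame);
    cb->has_frame = 1;
}

/* Playback-driven flow: take the stored frame, or silence if the
 * microphone delivered nothing since the previous tick.
 * Returns 1 for captured audio, 0 for silence. */
static int cap_get_frame(cap_buf *cb, int16_t out[SAMPLES])
{
    assert(cb != NULL && out != NULL);
    if (!cb->has_frame) {
        memset(out, 0, SAMPLES * sizeof(int16_t));
        return 0;
    }
    memcpy(out, cb->frame, SAMPLES * sizeof(int16_t));
    cb->has_frame = 0;
    return 1;
}
```

A single-frame buffer like this loses audio as soon as two recording callbacks arrive back to back, which is exactly the situation the next section deals with.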
Sound device timing problem
Ideally, rec_cb and play_cb should be called one after another, in turn and consecutively, by the sound device. Unfortunately this is not always the case; with many low-end sound cards, it is quite common to get a burst of several consecutive calls to one callback followed by a burst of calls to the other.
Another less common problem with the sound device is when the total number of samples played and/or recorded by the sound device does not match the requested clock rate. This is what we call audio clock drift.
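As a rough illustration of how clock drift could be quantified, the helper below compares the number of samples actually delivered with the number expected at the requested clock rate. The function name `drift_ppm()` and the choice of parts per million as the unit are assumptions for this sketch, not part of PJMEDIA.

```c
#include <assert.h>

/* Drift of the device clock relative to the requested rate, in parts
 * per million. Positive means the device ran faster than requested. */
static double drift_ppm(unsigned long samples_seen, double elapsed_sec,
                        unsigned clock_rate)
{
    double expected;

    assert(elapsed_sec > 0.0 && clock_rate > 0);
    expected = elapsed_sec * (double)clock_rate;
    return ((double)samples_seen - expected) / expected * 1e6;
}
```

For example, a device opened at 8000 Hz that delivers 480048 samples in 60 seconds is about 100 ppm fast, since 480000 samples were expected.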
Both of these problems can be analyzed with Testing and optimizing audio device with pjsystest.
The conference bridge handles these problems by using an Adaptive Delay Buffer. The delay buffer continuously learns the optimal delay to apply to the audio flow at run-time, and may expand or shrink the buffer without distorting the audio quality. The drawback of using this buffer, however, is increased latency, which grows with the jitter/burst characteristics of the sound device.
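To see why burstier devices cost more latency, the sketch below replays a trace of sound device callbacks and reports the largest number of captured frames that had to wait. It is a simplified model, not the adaptive algorithm itself; `peak_backlog()`, `backlog_ms()`, the `'R'`/`'P'` trace encoding and the 20 ms frame length used in the example are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Replay a callback trace ('R' = one captured frame arrives,
 * 'P' = one playback tick consumes a frame) and return the largest
 * number of captured frames waiting at any point. */
static unsigned peak_backlog(const char *trace)
{
    unsigned queued = 0, peak = 0;
    size_t i;

    assert(trace != NULL);
    for (i = 0; trace[i] != '\0'; ++i) {
        if (trace[i] == 'R') {
            if (++queued > peak)
                peak = queued;
        } else if (trace[i] == 'P' && queued > 0) {
            --queued;
        }
    }
    return peak;
}

/* Convert a backlog in frames to milliseconds of buffered audio. */
static unsigned backlog_ms(unsigned frames, unsigned ptime_ms)
{
    return frames * ptime_ms;
}
```

A well-behaved trace such as `"RPRPRP"` never queues more than one frame, while `"RRRPPP"` needs room for three, which at an assumed 20 ms per frame is 60 ms of buffered audio.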
The maximum (sound device) jitter that can be accommodated by the conference bridge’s delay buffer is controlled by the PJMEDIA_SOUND_BUFFER_COUNT macro, whose default value is around 150 ms. A very bad sound device may overrun this buffer, in which case it would be necessary to enlarge the PJMEDIA_SOUND_BUFFER_COUNT value in your config_site.h.
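Such an override might look like the fragment below. The value 16 is an arbitrary illustration, not a recommendation; check the default in your PJMEDIA version and increase it only as far as your device requires, since a larger buffer allows more worst-case latency.

```c
/* config_site.h */

/* Illustrative value only: allow the bridge to absorb larger bursts
 * from the sound device. */
#define PJMEDIA_SOUND_BUFFER_COUNT  16
```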
The untimely nature of the sound device may also contribute to the overall jitter seen by the jitter buffer. See Jitter buffer features and operations for more information.
Incoming RTP/RTCP Packets
Incoming RTP/RTCP packets are not driven by any of the flows above, but by a different flow (“thread”): the one that polls the socket descriptors of the media transport.
The standard implementation of the UDP media transport in PJMEDIA registers the RTP and RTCP sockets to a pj_ioqueue_t (see IOQUEUE documentation).
Applications can choose between different strategies with regard to placing the ioqueue instance:
The application can instruct the Media Endpoint to instantiate an internal IOQueue and start one or more worker threads to poll this ioqueue. This is probably the recommended strategy, so that polling of media sockets is done by a separate thread (and this is the default setting in PJSUA-LIB).
Alternatively, the application can use a single ioqueue for both SIP and media sockets, and poll the whole thing from a single thread, possibly the main thread. To use this, the application specifies the ioqueue instance to be used when creating the media endpoint and disables the worker threads. This strategy is probably preferable on small-footprint devices to reduce (or eliminate) threads in the system.
The flow of incoming RTP packets is as follows:
an internal worker thread in the Media Endpoint polls the ioqueue.
an incoming packet causes the ioqueue to call the on_rx_rtp() callback of the UDP media transport. This callback was previously registered by the UDP media transport to the ioqueue.
the on_rx_rtp() callback reports the incoming RTP packet to the media stream. The media stream was attached to the UDP media transport earlier.
the media stream unpacks the RTP packet using its internal RTP session, updates RX statistics, de-frames the payload according to the codec being used (there can be multiple audio frames in a single RTP packet), and puts the frames in the jitter buffer.
the processing of incoming packets stops here, as the frames in the jitter buffer will be picked up by the main flow (a call to pjmedia_port_get_frame() on the media stream) above.
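The way the network flow and the playback flow meet at the jitter buffer can be sketched with a plain FIFO. This is a simplified, single-threaded model: `sim_jbuf`, `sim_on_rx_rtp()` and `sim_get_frame()` are hypothetical names, frame ids stand in for encoded payloads, and a real jitter buffer also reorders by sequence number, adapts its delay, and must be protected against concurrent access from the two threads.

```c
#include <assert.h>
#include <stddef.h>

#define JB_CAP 8   /* hypothetical capacity, in frames */

typedef struct sim_jbuf {
    int      frames[JB_CAP];   /* frame ids stand in for encoded payloads */
    unsigned head, count;
} sim_jbuf;

/* Network flow: called once per received packet carrying `n` frames
 * with ids starting at `first_id`. Returns how many frames were
 * stored; frames that do not fit are dropped. */
static unsigned sim_on_rx_rtp(sim_jbuf *jb, int first_id, unsigned n)
{
    unsigned stored = 0;

    assert(jb != NULL);
    while (stored < n && jb->count < JB_CAP) {
        jb->frames[(jb->head + jb->count) % JB_CAP] = first_id + (int)stored;
        ++jb->count;
        ++stored;
    }
    return stored;
}

/* Playback flow: returns 1 and the oldest frame id if one is waiting,
 * or 0 when the buffer is empty and the caller should conceal the
 * loss (PLC). */
static int sim_get_frame(sim_jbuf *jb, int *frame_id)
{
    assert(jb != NULL && frame_id != NULL);
    if (jb->count == 0)
        return 0;
    *frame_id = jb->frames[jb->head];
    jb->head = (jb->head + 1) % JB_CAP;
    --jb->count;
    return 1;
}
```

In this model a packet with two frames satisfies two playback ticks; a third tick finds the buffer empty, which is the point where the stream would fall back to PLC.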