Understanding Audio Media Flow 

Introduction 

During a call, the media components are managed by PJSUA-LIB when PJSUA-LIB or PJSUA2 is used, or by the application itself if it uses the low-level PJSIP or PJMEDIA API directly. Typically, the media components for a (PJSUA-LIB) call are interconnected in a chain that runs from the sound device to the network.

The main building blocks of this chain, in order, are the following components:

an audio device stream (pjmedia_aud_stream), which represents a sound device,
a Sound Device Port (pjmedia_snd_port), which translates sound device callbacks into calls to the downstream media port’s pjmedia_port_put_frame()/pjmedia_port_get_frame(),
a Conference Bridge (pjmedia_conf),
a Media Stream (pjmedia_stream), which converts between PCM audio and encoded RTP/RTCP packets,
a Media Transport (pjmedia_transport), which transmits and receives RTP/RTCP packets to/from the network.

The media interconnection above would be set up as follows:

PJSUA-LIB (or the application) creates a conference bridge (pjmedia_conf) during initialization, and normally retains it throughout the lifetime of the application.
When making an outgoing call or receiving an incoming call, PJSUA-LIB opens an audio device stream (pjmedia_aud_stream) and creates a sound device port (pjmedia_snd_port) and a media transport instance, such as a UDP media transport. The listening address and port number of the transport are put in the local SDP that is given to the INVITE session.
Once the offer/answer session in the call is established, the pjsip_inv_callback::on_media_update callback is called and PJSUA-LIB creates a pjmedia_stream_info from both the local and remote SDP by using pjmedia_stream_info_from_sdp().
PJSUA-LIB creates a media stream (pjmedia_stream) with pjmedia_stream_create(), specifying the pjmedia_stream_info and the media transport created earlier. This creates the media stream according to the codec settings and other parameters in the media stream info, and also establishes the connection between the media stream and the media transport.
The application registers this media stream to the conference bridge with pjmedia_conf_add_port().
The application connects the media stream slot in the bridge to other slots, such as slot zero, which is normally connected to the sound device, with pjmedia_conf_connect_port().
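
The last two steps can be pictured with a toy model of the bridge's slot table. This is a self-contained sketch, not PJMEDIA code: toy_bridge, toy_add_port() and toy_connect() are hypothetical names that only mimic the roles of pjmedia_conf_add_port() and pjmedia_conf_connect_port(), and the fixed-size, one-directional connection matrix is an assumption made for illustration.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_SLOTS 4   /* hypothetical limit, for illustration only */

/* Hypothetical stand-in for the bridge: a slot table plus a
 * connection matrix where conn[src][sink] != 0 means "audio from
 * src is delivered to sink". */
typedef struct toy_bridge {
    const char *name[TOY_MAX_SLOTS];
    int         conn[TOY_MAX_SLOTS][TOY_MAX_SLOTS];
    int         count;
} toy_bridge;

/* Slot 0 is reserved for the sound device, as described in the text. */
static void toy_bridge_init(toy_bridge *b)
{
    memset(b, 0, sizeof(*b));
    b->name[0] = "sound device";
    b->count = 1;
}

/* Register a port and return its slot number, or -1 if the table is full. */
static int toy_add_port(toy_bridge *b, const char *name)
{
    if (b->count >= TOY_MAX_SLOTS)
        return -1;
    b->name[b->count] = name;
    return b->count++;
}

/* Connect one direction only: src -> sink. */
static void toy_connect(toy_bridge *b, int src, int sink)
{
    assert(src >= 0 && src < b->count);
    assert(sink >= 0 && sink < b->count);
    b->conn[src][sink] = 1;
}
```

In this model a two-way call needs two connections: toy_connect(&b, call, 0) so that the remote party is heard on the speaker, and toy_connect(&b, 0, call) so that the microphone reaches the remote party.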

The whole media flow is driven by the timing of the sound device, especially by the playback callback.

Audio playback flow (the main flow)

When pjmedia_aud_stream needs another frame to be played to the speaker, it calls the play_cb callback that was specified in pjmedia_aud_stream_create().
This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_get_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
The conference bridge calls pjmedia_port_get_frame() for all ports in the conference bridge. Then:
1. It mixes the signals together according to the port connections in the bridge, and delivers the mixed signal by calling pjmedia_port_put_frame() for all ports in the bridge according to their connections.
2. A pjmedia_port_get_frame() call by the conference bridge to a media stream (pjmedia_stream) causes the stream to pick one frame from the jitter buffer, decode the frame using the configured codec (or apply Packet Loss Concealment/PLC if the frame is lost), and return the PCM frame to the caller. Note that the jitter buffer is filled by another flow (the flow that polls the network sockets), which is described in a later section.
3. A pjmedia_port_put_frame() call by the conference bridge to a media stream causes the media stream to encode the PCM frame with the chosen codec, pack it into an RTP packet with its RTP session, update the RTCP session, schedule RTCP transmission, and deliver the RTP/RTCP packets to the underlying media transport that was previously attached to the stream. The media transport then sends the RTP/RTCP packets to the network.
4. Once these processes finish, the conference bridge returns the mixed signal for slot zero to the original pjmedia_port_get_frame() call.
pjmedia_snd_port receives the audio frame and returns it to the audio device stream, which completes the play_cb callback.
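
The pull, mix and push sequence inside the bridge can be sketched as a single function. This is a toy model under loud assumptions, not the pjmedia_conf implementation: toy_port, toy_bridge and toy_bridge_get_frame() are hypothetical, a "frame" is reduced to one 16-bit sample, and codecs, RTP and level adjustment are left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PORTS 3   /* slot 0 = sound device, slots 1..2 = streams */

/* Hypothetical port: 'rx' is what the port yields when the bridge
 * pulls a frame from it, 'tx' records the last frame pushed to it. */
typedef struct toy_port {
    int16_t rx;
    int16_t tx;
} toy_port;

typedef struct toy_bridge {
    toy_port port[TOY_PORTS];
    int      conn[TOY_PORTS][TOY_PORTS];   /* conn[src][sink] */
} toy_bridge;

/* Clamp a mixed sample to the 16-bit PCM range. */
static int16_t toy_clip(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One playback-driven tick: pull from every port, mix per sink,
 * push to every sink except slot 0, and return the slot 0 mix. */
static int16_t toy_bridge_get_frame(toy_bridge *b)
{
    int16_t in[TOY_PORTS];
    int16_t out0 = 0;
    int src, sink;

    assert(b);

    /* Pull one frame from every port (the "get_frame" calls). */
    for (src = 0; src < TOY_PORTS; ++src)
        in[src] = b->port[src].rx;

    /* Mix for each sink according to the connection matrix. */
    for (sink = 0; sink < TOY_PORTS; ++sink) {
        int32_t mix = 0;
        for (src = 0; src < TOY_PORTS; ++src)
            if (b->conn[src][sink])
                mix += in[src];
        if (sink == 0)
            out0 = toy_clip(mix);               /* goes to the speaker */
        else
            b->port[sink].tx = toy_clip(mix);   /* the "put_frame" call */
    }
    return out0;
}
```

The point of the sketch is that a single call made on behalf of the speaker drives both directions: every stream is read from and written to within the same tick.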

Audio recording flow 

The above flow only describes the flow in one direction, i.e. to the speaker device. But what about the audio flow coming from the microphone?

When the input sound (microphone) device has finished capturing one audio frame, it will report this event by calling rec_cb callback that was specified in pjmedia_aud_stream_create().
This callback belongs to pjmedia_snd_port. In this callback, pjmedia_snd_port calls pjmedia_port_put_frame() of its downstream port, which in this case is the conference bridge (pjmedia_conf).
When pjmedia_port_put_frame() is called on the conference bridge, the bridge just stores the PCM frame in an internal buffer. The frame is picked up later by the main flow (the pjmedia_port_get_frame() call to the bridge described above), when the bridge collects frames from all ports and mixes the signal.
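
The "store now, pick up later" behaviour of the capture path can be sketched with a small ring buffer. Everything here is hypothetical (toy_mic_buf, toy_put_frame(), toy_pick_frame(), the capacity of 8 frames, and one 16-bit sample standing in for a whole frame); the real bridge uses its own buffer type and is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CAP 8   /* hypothetical capacity, in frames */

typedef struct toy_mic_buf {
    int16_t  frame[TOY_CAP];
    unsigned head;    /* index of the oldest stored frame */
    unsigned count;   /* number of frames currently stored */
} toy_mic_buf;

/* Capture side ("rec_cb" -> put_frame): only store the frame.
 * Returns 0 on success, -1 if the buffer is full (frame dropped). */
static int toy_put_frame(toy_mic_buf *b, int16_t f)
{
    assert(b);
    if (b->count == TOY_CAP)
        return -1;
    b->frame[(b->head + b->count) % TOY_CAP] = f;
    b->count++;
    return 0;
}

/* Playback-driven side: fetch the oldest frame, or silence (0)
 * if the microphone has not delivered anything yet. */
static int16_t toy_pick_frame(toy_mic_buf *b)
{
    int16_t f;
    assert(b);
    if (b->count == 0)
        return 0;
    f = b->frame[b->head];
    b->head = (b->head + 1) % TOY_CAP;
    b->count--;
    return f;
}
```

No mixing happens in toy_put_frame(); all processing is deferred to the tick that calls toy_pick_frame(), which mirrors the division of work described above.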

Sound device timing problem 

Ideally, rec_cb and play_cb should be called alternately, one after the other, by the sound device. Unfortunately this is not always the case; on many low-end sound cards, it is quite common to get a burst of several consecutive rec_cb callbacks followed by a burst of play_cb calls.

Another less common problem with the sound device is when the total number of samples played and/or recorded by the sound device does not match the requested clock rate. This is what we call audio clock drift.
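
To put a number on clock drift, the sample count actually observed can be compared with the count the nominal clock rate predicts. The helper below is a hypothetical illustration (toy_drift_ppm() is not a PJMEDIA function) and assumes the elapsed time comes from a clock that is more trustworthy than the sound device.

```c
#include <assert.h>

/* Relative drift of the sound device clock, in parts per million.
 * Positive means the device produced more samples than the nominal
 * clock rate predicts (it runs fast); negative means it runs slow. */
static double toy_drift_ppm(unsigned clock_rate, double elapsed_sec,
                            double samples_seen)
{
    double expected;

    assert(clock_rate > 0 && elapsed_sec > 0.0);
    expected = (double)clock_rate * elapsed_sec;
    return (samples_seen - expected) / expected * 1e6;
}
```

For instance, a device opened at 8000 Hz that delivers 480048 samples in 60 seconds is about 100 ppm fast, which amounts to roughly one extra 20 ms frame (160 samples) every 200 seconds.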

Both of these problems can be analyzed with pjsystest; see Testing and optimizing audio device with pjsystest.

The conference bridge handles these problems by using an Adaptive Delay Buffer. The delay buffer continuously learns the optimal delay to apply to the audio flow at run-time, and may expand or shrink the buffer without distorting the audio quality. The drawback of using this buffer, however, is increased latency: the latency grows with the jitter/burst characteristics of the sound device.
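
Why bursts translate into latency can be illustrated by measuring the longest run of identical callbacks. The sketch below is not the Adaptive Delay Buffer algorithm: toy_burst_meter and its functions are hypothetical, the learned level never shrinks again, and the assumption that the required delay is simply the longest run times the frame duration is a simplification made for illustration.

```c
#include <assert.h>

typedef enum { TOY_OP_NONE, TOY_OP_PUT, TOY_OP_GET } toy_op;

typedef struct toy_burst_meter {
    toy_op   last;      /* last operation seen */
    unsigned run;       /* length of the current run of 'last' */
    unsigned max_run;   /* longest run seen so far */
} toy_burst_meter;

/* Record one callback: TOY_OP_PUT for capture, TOY_OP_GET for playback. */
static void toy_burst_update(toy_burst_meter *m, toy_op op)
{
    assert(m && op != TOY_OP_NONE);
    m->run = (op == m->last) ? m->run + 1 : 1;
    m->last = op;
    if (m->run > m->max_run)
        m->max_run = m->run;
}

/* Delay (in ms) this toy model would buffer so that a burst of
 * max_run frames can be absorbed; ptime_ms is the frame duration. */
static unsigned toy_delay_ms(const toy_burst_meter *m, unsigned ptime_ms)
{
    assert(m);
    return m->max_run * ptime_ms;
}
```

With strictly alternating callbacks and 20 ms frames the model settles at 20 ms, while a device that delivers three rec_cb calls followed by three play_cb calls pushes it to 60 ms.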

The maximum (sound device) jitter that can be accommodated by the conference bridge is controlled by the PJMEDIA_SOUND_BUFFER_COUNT macro, whose default value corresponds to around 150 ms. A very bad sound device may still overrun this buffer, in which case it is necessary to increase PJMEDIA_SOUND_BUFFER_COUNT in your config_site.h.
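
Such an override is an ordinary macro definition in config_site.h. The value 16 below is purely illustrative and not a recommendation; a suitable number depends on the bursts actually measured on the target device.

```c
/* config_site.h (fragment) */

/* Illustrative value only: let the conference bridge absorb larger
 * sound device bursts than the default setting permits. */
#define PJMEDIA_SOUND_BUFFER_COUNT  16
```

Because config_site.h is read at compile time, the library has to be rebuilt for the new value to take effect.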

The untimely nature of the sound device may also contribute to the overall jitter seen by the jitter buffer. See Jitter buffer features and operations for more information.

Incoming RTP/RTCP Packets 

Incoming RTP/RTCP packets are not driven by any of the flows above, but by a different flow (“thread”): the one that polls the socket descriptors of the media transport.

The standard implementation of the UDP media transport in PJMEDIA registers the RTP and RTCP sockets to a pj_ioqueue_t (see IOQUEUE documentation). The application can choose between different strategies for placing the ioqueue instance:

The application can instruct the Media Endpoint to instantiate an internal IOQueue and start one or more worker threads to poll this ioqueue. This is generally the recommended strategy, because the media sockets are then polled by a separate thread (and it is the default setting in PJSUA-LIB).
Alternatively, the application can use a single ioqueue for both SIP and media sockets, and poll everything from a single thread, possibly the main thread. To do this, the application specifies the ioqueue instance to be used when creating the media endpoint and disables the worker thread. This strategy is probably preferable on small-footprint devices, where it reduces (or eliminates) threads in the system.

The flow of incoming RTP packets is as follows:

An internal worker thread in the Media Endpoint polls the ioqueue.
An incoming packet causes the ioqueue to call the on_rx_rtp() callback of the UDP media transport. This callback was previously registered to the ioqueue by the UDP media transport.
The on_rx_rtp() callback reports the incoming RTP packet to the media stream. The media stream was attached to the UDP media transport with pjmedia_transport_attach().
The media stream unpacks the RTP packet using its internal RTP session, updates the RX statistics, de-frames the payload according to the codec being used (there can be multiple audio frames in a single RTP packet), and puts the frames in the jitter buffer.
The processing of incoming packets stops here, as the frames in the jitter buffer will be picked up by the main flow (a call to pjmedia_port_get_frame() on the media stream) described above.
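
The hand-over between the network flow and the playback flow can be sketched with an RTP header parser and a toy jitter buffer. The header layout follows RFC 3550; everything else is an assumption for illustration: toy_rtp, toy_jbuf and their functions are hypothetical, one byte stands in for a whole encoded frame, each packet is treated as carrying a single frame, and none of this reflects the actual PJMEDIA jitter buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_JB_SIZE 16   /* hypothetical capacity, in frames */

typedef struct toy_rtp {
    uint8_t  pt;            /* payload type */
    uint16_t seq;           /* sequence number */
    uint32_t ts;            /* timestamp */
    size_t   payload_off;   /* offset of the payload in the packet */
} toy_rtp;

/* Parse the RTP header (RFC 3550). Returns 0 on success, -1 on error. */
static int toy_rtp_parse(const uint8_t *pkt, size_t len, toy_rtp *out)
{
    size_t off;

    assert(pkt && out);
    if (len < 12 || (pkt[0] >> 6) != 2)
        return -1;                          /* too short or not RTP v2 */
    off = 12 + 4u * (pkt[0] & 0x0F);        /* skip the CSRC list */
    if (pkt[0] & 0x10) {                    /* header extension present */
        if (len < off + 4)
            return -1;
        off += 4 + 4u * (((size_t)pkt[off + 2] << 8) | pkt[off + 3]);
    }
    if (off > len)
        return -1;
    out->pt  = pkt[1] & 0x7F;
    out->seq = (uint16_t)((pkt[2] << 8) | pkt[3]);
    out->ts  = ((uint32_t)pkt[4] << 24) | ((uint32_t)pkt[5] << 16) |
               ((uint32_t)pkt[6] << 8)  |  (uint32_t)pkt[7];
    out->payload_off = off;
    return 0;
}

typedef struct toy_jbuf {
    uint8_t  present[TOY_JB_SIZE];
    uint16_t seq[TOY_JB_SIZE];
    uint8_t  frame[TOY_JB_SIZE];   /* one byte stands in for a frame */
    uint16_t next_seq;             /* next sequence number to play */
} toy_jbuf;

/* Network flow: store a frame under its sequence number. */
static void toy_jbuf_put(toy_jbuf *jb, uint16_t seq, uint8_t frame)
{
    unsigned i = seq % TOY_JB_SIZE;
    jb->present[i] = 1;
    jb->seq[i] = seq;
    jb->frame[i] = frame;
}

/* Playback flow: returns 1 and the frame if it has arrived, or 0 if
 * it is missing (where a real stream would apply PLC instead). */
static int toy_jbuf_get(toy_jbuf *jb, uint8_t *frame)
{
    unsigned i = jb->next_seq % TOY_JB_SIZE;
    int ok = jb->present[i] && jb->seq[i] == jb->next_seq;
    if (ok) {
        *frame = jb->frame[i];
        jb->present[i] = 0;
    }
    jb->next_seq++;
    return ok;
}
```

The two halves never call each other: toy_jbuf_put() belongs to the thread that polls the sockets, and toy_jbuf_get() to the playback-driven flow, so packets that arrive out of order are still played in sequence.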
