AI Connectivity
Available since 2.17: WebSocket client (#4859), PJMEDIA AI port (#4866), PJSUA2 wrapper (#4870).
Warning
This feature is experimental. The API may change in future releases.
Overview
PJMEDIA ships an AI media port that bridges the conference bridge to a real-time AI service over WebSocket, letting call audio be routed to a cloud model (speech-in, speech-out) the same way it is routed to any other conference port.
At a glance:
- Transport: WebSocket (ws:// or wss://), via the pjlib-util WebSocket client.
- Media: full-duplex PCM16 over the conference bridge; the port runs at the backend’s native rate (24 kHz for OpenAI) and the conference bridge handles any resampling.
- Backends: pluggable. A built-in OpenAI Realtime API backend is provided; other vendors can be added by implementing pjmedia_ai_backend_op.
- Events: connect/disconnect, transcripts, response start/done, and server VAD speech-started/stopped are delivered via an application callback.
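To make the media figures concrete, the sketch below computes the size of one mono PCM16 frame for a given clock rate and frame duration. The 20 ms frame time is an assumption chosen for illustration (the default ptime_msec is not stated on this page), and pcm16_frame_size is a hypothetical helper, not part of PJMEDIA.

```python
def pcm16_frame_size(clock_rate_hz, ptime_msec, channels=1):
    """Return (samples, bytes) for one PCM16 frame."""
    samples = clock_rate_hz * ptime_msec // 1000 * channels
    return samples, samples * 2  # 16-bit samples are 2 bytes each


# One 20 ms frame at the OpenAI backend's native 24 kHz rate.
samples, nbytes = pcm16_frame_size(24000, 20)
print(samples, nbytes)
```

A conference bridge running at a different rate (for example 16 kHz) produces differently sized frames, which is why the bridge has to resample between its own rate and the port's rate.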
The feature is available at two levels:
| Layer | Entry point |
|---|---|
| PJMEDIA | pjmedia_ai_port_create() |
| PJSUA2 | AudioMediaAiPort |
PJSUA2 API
AudioMediaAiPort wraps the PJMEDIA port as a standard AudioMedia
subclass, so it can be connected to/from any other conference port
(call media, sound device, player, recorder, etc.) with the usual
startTransmit()/stopTransmit().
    #include <pjsua2.hpp>
    #include <iostream>

    using namespace pj;

    class MyAiPort : public AudioMediaAiPort
    {
    public:
        // Called for every AI port event; keep it short and non-blocking.
        void onEvent(const AiMediaEvent &event) override
        {
            switch (event.type) {
            case PJMEDIA_AI_EVENT_CONNECTED:
                std::cout << "AI connected\n";
                break;
            case PJMEDIA_AI_EVENT_TRANSCRIPT:
                std::cout << "Transcript: " << event.text << "\n";
                break;
            case PJMEDIA_AI_EVENT_DISCONNECTED:
                std::cout << "AI disconnected, status=" << event.status << "\n";
                break;
            default:
                break;
            }
        }
    };

    // In application code, after the library has been started.
    // apiKey and callAudio (the call's AudioMedia) are assumed to exist.
    MyAiPort ai;
    ai.createPort();  // Uses the OpenAI Realtime backend
    ai.connect("wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview",
               apiKey);

    // Bridge the call both ways
    callAudio.startTransmit(ai);
    ai.startTransmit(callAudio);

    // ... when done ...
    ai.disconnect();
Python (via SWIG) mirrors the same shape:
    import pjsua2 as pj

    class MyAiPort(pj.AudioMediaAiPort):
        def onEvent(self, event):
            if event.type == pj.PJMEDIA_AI_EVENT_TRANSCRIPT:
                print("Transcript:", event.text)

    ai = MyAiPort()
    ai.createPort()
    ai.connect("wss://api.openai.com/v1/realtime?model=...", api_key)
    call_audio.startTransmit(ai)
    ai.startTransmit(call_audio)
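The URL and key passed to connect() can be assembled in a couple of small helpers. This is a minimal sketch: realtime_url and api_key_from_env are hypothetical names, the base URL is the one used in the examples above, and reading the key from OPENAI_API_KEY simply mirrors what the sample app expects.

```python
import os
from urllib.parse import urlencode


def realtime_url(model, base="wss://api.openai.com/v1/realtime"):
    """Build the WebSocket URL with the model as a query parameter."""
    return base + "?" + urlencode({"model": model})


def api_key_from_env(name="OPENAI_API_KEY"):
    """Read the API key from the environment, failing early if it is unset."""
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(name + " is not set")
    return key


print(realtime_url("gpt-4o-mini-realtime-preview"))
```

Failing early on a missing key gives a clearer error than waiting for the connection to fail.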
Note
onEvent() is invoked from the pjsua ioqueue worker thread. Keep
the callback non-blocking; marshal work to an application thread if
you need to do anything expensive (DB calls, UI updates, etc.).
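One way to keep the callback non-blocking is to hand events to a worker thread through a queue. The sketch below is independent of pjsua2 so that it runs on its own: EventPump is a hypothetical helper, and the string event types stand in for the pj.PJMEDIA_AI_EVENT_* constants. In a real port, onEvent() would only call pump.post(event.type, event.text). Copying the fields out of the event is a defensive assumption, not a documented requirement.

```python
import queue
import threading


class EventPump:
    """Moves events from the callback thread to an application worker thread."""

    def __init__(self, handler):
        self._queue = queue.Queue()
        self._handler = handler
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, event_type, text=""):
        # Called from the callback: enqueue a copy of the fields and return.
        self._queue.put((event_type, text))

    def close(self):
        # A None sentinel tells the worker to drain and exit.
        self._queue.put(None)
        self._worker.join()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                break
            self._handler(*item)  # Slow work (DB, UI) happens here.


received = []
pump = EventPump(lambda etype, text: received.append((etype, text)))
pump.post("transcript", "hello")
pump.post("disconnected")
pump.close()
print(received)
```

A single worker preserves event order, which matters when transcripts arrive as fragments.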
PJMEDIA API
For applications that are not using PJSUA2, the underlying C API can be used directly. The port needs an ioqueue, a timer heap, and a backend instance. In a pjsua-based app you can reuse the SIP endpoint’s ioqueue and timer heap; in a standalone pjmedia app, create your own.
    /* Create the backend (OpenAI Realtime). */
    pjmedia_ai_backend *backend;
    pjmedia_ai_openai_backend_create(pool, &backend);

    /* Fill in the port parameters, reusing the SIP endpoint's
     * ioqueue and timer heap. */
    pjmedia_ai_port_param prm;
    pjmedia_ai_port_param_default(&prm);
    prm.ioqueue = pjsip_endpt_get_ioqueue(pjsua_get_pjsip_endpt());
    prm.timer_heap = pjsip_endpt_get_timer_heap(pjsua_get_pjsip_endpt());
    prm.backend = backend;   /* port takes ownership */
    prm.cb.on_event = &on_ai_event;
    prm.user_data = my_ctx;
    /* Optional: prm.vad_enabled, prm.ptime_msec, prm.ssl_param */

    /* Create the AI port and add it to the conference bridge. */
    pjmedia_ai_port *ai_port;
    pjmedia_ai_port_create(pool, &prm, &ai_port);
    pjmedia_port *port = pjmedia_ai_port_get_port(ai_port);
    pjmedia_conf_add_port(conf, pool, port, NULL, NULL);

    /* Connect to the AI service. */
    pj_str_t url = pj_str("wss://api.openai.com/v1/realtime?model=...");
    pj_str_t token = pj_str(api_key);
    pjmedia_ai_port_connect(ai_port, &url, &token);
Destroy the port with pjmedia_port_destroy() when done; it
disconnects the WebSocket and destroys the backend it owns.
Events
| Event | Meaning |
|---|---|
| PJMEDIA_AI_EVENT_CONNECTED | WebSocket and backend session are ready; audio streaming starts. |
| PJMEDIA_AI_EVENT_DISCONNECTED | Connection lost or closed. |
| PJMEDIA_AI_EVENT_TRANSCRIPT | Text transcript fragment from the AI service. |
|  | AI response generation started / finished. |
|  | Server-side VAD detected start / end of user speech. |
Backends
The bundled OpenAI Realtime backend uses PCM16 at 24 kHz over a single
WebSocket, base64-encoded inside JSON events, with server-side VAD for
turn detection and barge-in. Other vendors can be added by implementing
pjmedia_ai_backend_op (prepare_connect, on_ws_connected,
encode_audio, on_rx_msg, destroy) and exposing a factory
similar to pjmedia_ai_openai_backend_create().
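The wire format described above can be sketched as follows. The event name input_audio_buffer.append and its audio field follow the OpenAI Realtime API as publicly documented. Whether the bundled backend emits exactly this shape is an assumption, as is the little-endian sample order. encode_audio_event and decode_audio_event are hypothetical helpers, not the backend's encode_audio implementation.

```python
import base64
import json
import struct


def encode_audio_event(samples):
    """Wrap a list of 16-bit samples in a JSON event with base64 audio."""
    pcm = struct.pack("<%dh" % len(samples), *samples)  # little-endian PCM16
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_audio_event(message):
    """Inverse of encode_audio_event: recover the sample list."""
    payload = json.loads(message)
    pcm = base64.b64decode(payload["audio"])
    return list(struct.unpack("<%dh" % (len(pcm) // 2), pcm))


frame = [0, 1, -1, 32767, -32768]
message = encode_audio_event(frame)
print(decode_audio_event(message) == frame)
```

Base64 grows the payload by a third: a 960-byte frame (20 ms at 24 kHz) becomes 1280 characters before the JSON envelope is added.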
Barge-in with the OpenAI backend is handled server-side: the AI port
keeps sending mic audio even while the AI is talking, so the server’s
VAD can detect interruptions and truncate the response automatically.
To avoid false barge-ins from the AI’s own playback leaking into the
mic, enable acoustic echo cancellation on the sound device
(pjmedia_snd_port_set_ec()) — the aidemo sample does this.
Limitations
The current PJSUA2 wrapper does not yet expose:
- SSL/TLS parameters (certificate verification, client certs, cipher suites). The C API accepts ssl_param; TLS uses defaults in PJSUA2.
- OpenAI session customization (voice, instructions, modalities): the backend defaults are used as-is.
- isConnected() / state query: connection state must be inferred from onEvent().
- JSON persistence for AiMediaPortParam.
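Until a state query is available, an application can track connection state itself from the events it receives. The sketch below is a standalone illustration: ConnectionTracker is a hypothetical helper, and the two string constants stand in for pj.PJMEDIA_AI_EVENT_CONNECTED and pj.PJMEDIA_AI_EVENT_DISCONNECTED. The lock is there because onEvent() runs on a different thread from the code that asks for the state.

```python
import threading

# Stand-ins for the pj.PJMEDIA_AI_EVENT_* constants (assumption for the sketch).
EVENT_CONNECTED = "connected"
EVENT_DISCONNECTED = "disconnected"


class ConnectionTracker:
    """Derives an is-connected flag from connect/disconnect events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._connected = False

    def on_event(self, event_type):
        # Call this from onEvent(); other event types leave the state alone.
        with self._lock:
            if event_type == EVENT_CONNECTED:
                self._connected = True
            elif event_type == EVENT_DISCONNECTED:
                self._connected = False

    def is_connected(self):
        with self._lock:
            return self._connected


tracker = ConnectionTracker()
tracker.on_event(EVENT_CONNECTED)
print(tracker.is_connected())
```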
Sample app
pjsip-apps/src/samples/aidemo.c is a minimal end-to-end demo:
local sound device ↔ OpenAI Realtime API. Build the samples, export
OPENAI_API_KEY, and run:
    export OPENAI_API_KEY="sk-..."
    ./pjsip-apps/bin/aidemo-x86_64-pc-linux-gnu
Useful options:
| Option | Description |
|---|---|
|  | Override the WebSocket URL (default: OpenAI Realtime endpoint). |
|  | Sound device clock rate in Hz. Defaults to the backend native rate (24 kHz for OpenAI); set e.g. |
|  | Null audio (no sound device); feeds silence. Useful for CI. |
|  | Duration in seconds for null-audio mode (default 10). |
|  | Log level 0–6 (default 4). |
Note
On Windows, log level 5+ can cause choppy audio because console output blocks the audio thread.