Local recording in Jitsi Meet

A progress report

Posted on: 2018-08-27 Last edited: 2018-08-28

Background & Motivation

Early this year, I began to collect recordings of stories in the Eastern Min language as part of an effort to build an audio corpus. We did a couple of interviews with people from all walks of life, mostly online, via the Chinese IM app WeChat (though the same idea applies if you use Skype). I recorded them by using a virtual sound card such as VoiceMeeter to redirect the incoming audio (i.e. that of the interviewee), along with audio from my local microphone, to recording software such as Audacity.

The problem was that the remote participant always sounded very telephone-y, and because I was often in a different country from the interviewee, the network connection was not optimal, leading to constant jitter. Things got worse when there were multiple people on the other end, because I could only receive a single compressed, mixed-down track from the IM app.

In the meantime, I read about how some people managed to co-host podcasts remotely with decent audio quality. The idea is to ask every participant of a call to record their microphone locally, in high fidelity, and to sync up and mix the tracks after the call. The real-time audio provided by Skype or WeChat (or whatever conferencing app) is then only used for coordination.

This is a clever trick. However, I cannot use it directly because my interviewees are usually not tech-savvy enough to know how to set up audio recording on their computers. The only way seems to be embedding such a feature into an existing conferencing app, so that it can be enabled with a few clicks. And I believe that with such a feature in an existing VoIP app, many audio creators could easily take advantage of this trick too.

This is where Jitsi Meet comes in. I came across Jitsi Meet when I was searching for WebRTC talks on YouTube. It blew my mind because it is such a complete WebRTC video conferencing solution, and it is open source! (Really, Jitsi Meet deserves more attention from the general public.) What is more, they are a Google Summer of Code organisation and at the time were accepting project proposals.

So I sent in my proposal (just in time...) and got accepted. Here is what I did:

Overview

Figure 1: Overview of the architecture of local recording in Jitsi Meet

The implementation of local recording is completely client-side (at least for now; more on that later). Jitsi Meet has both web (React) and mobile (React Native) clients, which share most of the business logic. So far I have only managed to implement this on the web client. There are three main components:

  1. RecordingController, which handles the signalling between participants and serves as a facade to the other components.
  2. RecordingAdapter(s). These are wrappers around the Web Audio API and different audio codecs. Having a uniform interface provides the flexibility to switch between formats and serves as an extension point for the future.
  3. SessionManager, which manages and persists information about each segment of the recording. This information is used to recover after a crash and to concatenate the segments of a recording.
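
To make the idea of a uniform adapter interface more concrete, here is a minimal sketch. The method names (start, stop, exportRecordedData) and the FakeAdapter class are assumptions made for illustration and are not necessarily what the real code uses; the fake keeps everything in memory so that it runs without any browser API.

```javascript
// Hypothetical interface: method names are illustrative assumptions,
// not the actual Jitsi Meet API.
class RecordingAdapter {
    /** Begins capturing audio. */
    async start() {
        throw new Error('start() not implemented');
    }

    /** Stops capturing audio. */
    async stop() {
        throw new Error('stop() not implemented');
    }

    /** Resolves with { format, data } for the audio captured so far. */
    async exportRecordedData() {
        throw new Error('exportRecordedData() not implemented');
    }
}

// In-memory stand-in for a codec-backed adapter (no browser APIs involved).
class FakeAdapter extends RecordingAdapter {
    constructor(format) {
        super();
        this.format = format;
        this.recording = false;
        this.chunks = [];
    }

    async start() {
        this.recording = true;
    }

    async stop() {
        this.recording = false;
    }

    // Simulates the audio pipeline delivering a chunk of samples.
    // Chunks that arrive while not recording are dropped.
    push(chunk) {
        if (this.recording) {
            this.chunks.push(chunk);
        }
    }

    async exportRecordedData() {
        return { format: this.format, data: this.chunks.flat() };
    }
}

// Caller code depends only on the shared interface, so formats are swappable.
async function stopAndExport(adapter) {
    await adapter.stop();
    return adapter.exportRecordedData();
}
```

With this shape, switching the output format only means constructing a different adapter; the calling code stays the same.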

Next, I will explain some key technical aspects of the implementation.

Web Audio API

As it turns out, it is not hard to record audio in the browser. MediaRecorder provides simple APIs to record audio from any MediaStream. But currently, most browsers only support the Opus codec.
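
Since codec support varies between browsers, one way to choose a format is to probe MediaRecorder.isTypeSupported, which is a real static method. The sketch below is only an illustration: the candidate list and the helper name pickRecorderMimeType are my assumptions, and the support check is injectable so the function can also run outside a browser.

```javascript
// Returns the first MIME type reported as supported, or null if none is.
// `isSupported` defaults to the browser's MediaRecorder.isTypeSupported and
// can be replaced with a stub when no browser is available.
function pickRecorderMimeType(candidates, isSupported) {
    const check = isSupported
        || (type => typeof MediaRecorder !== 'undefined'
            && MediaRecorder.isTypeSupported(type));

    return candidates.find(type => check(type)) || null;
}

// Candidate list is an assumption for illustration, in order of preference.
const CANDIDATES = [
    'audio/wav',
    'audio/webm;codecs=opus',
    'audio/ogg;codecs=opus'
];
```

In a browser, the result can be passed as the mimeType option when constructing a MediaRecorder for a MediaStream.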

If you want to record full quality, uncompressed audio, there is an option to receive the raw PCM bits from the browser using an AudioContext and a ScriptProcessorNode. You can either save those PCM bits as a WAVE file or encode it with an audio codec.
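
As a rough sketch of the "save those PCM bits as a WAVE file" option, the function below wraps mono samples in a standard 44-byte WAVE header. It assumes the samples arrive as a Float32Array in the range [-1, 1] (which is what a ScriptProcessorNode hands out per channel) and writes them as 16-bit PCM. The function name encodeWav is mine, not something from the Jitsi Meet code base.

```javascript
// Converts mono Float32 samples in [-1, 1] into a 16-bit PCM WAVE file.
function encodeWav(samples, sampleRate) {
    const bytesPerSample = 2;
    const dataSize = samples.length * bytesPerSample;
    const buffer = new ArrayBuffer(44 + dataSize);
    const view = new DataView(buffer);

    const writeAscii = (offset, text) => {
        for (let i = 0; i < text.length; i++) {
            view.setUint8(offset + i, text.charCodeAt(i));
        }
    };

    writeAscii(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);   // size of everything after this field
    writeAscii(8, 'WAVE');
    writeAscii(12, 'fmt ');
    view.setUint32(16, 16, true);             // size of the fmt chunk
    view.setUint16(20, 1, true);              // audio format 1 = PCM
    view.setUint16(22, 1, true);              // channel count: mono
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
    view.setUint16(32, bytesPerSample, true); // block align
    view.setUint16(34, 16, true);             // bits per sample
    writeAscii(36, 'data');
    view.setUint32(40, dataSize, true);

    for (let i = 0; i < samples.length; i++) {
        // Clamp, then scale to the signed 16-bit range.
        const s = Math.max(-1, Math.min(1, samples[i]));
        view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
    }

    return buffer;
}
```

In the browser, the resulting ArrayBuffer can be wrapped in a Blob and offered as a download.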

Flac: saves space while remaining lossless

An hour of audio recorded in WAVE can easily take up hundreds of megabytes. However, in most conversations, each participant is silent for roughly half of the time.
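
The "hundreds of megabytes" figure is easy to verify with a bit of arithmetic. The parameters below (44.1 kHz, 16-bit, mono) are assumed for illustration; the actual sample rate depends on the browser and audio device.

```javascript
// Size in bytes of uncompressed PCM audio, excluding the 44-byte WAVE header.
function pcmSizeBytes(sampleRate, bitsPerSample, channels, seconds) {
    return sampleRate * (bitsPerSample / 8) * channels * seconds;
}

// One hour of 44.1 kHz, 16-bit, mono audio (assumed parameters).
const oneHourMono = pcmSizeBytes(44100, 16, 1, 3600);
console.log(oneHourMono); // 317520000 bytes, i.e. about 318 MB
```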

This is where libflac.js, a JavaScript port of the C Flac encoding library, comes in handy. Flac is a lossless audio codec, which allows us to “compress away” those silent segments (there are space savings for the talking parts as well), because the size of the encoded output depends heavily on the content of the audio, and silence encodes to (almost) zero bytes. In practice, Flac can save more than 50% of space compared to WAVE, and the combined size of the recordings on all participants’ devices is roughly that of a single track covering the total speaking time.

This is useful because:

  1. The users will likely upload the recording somewhere (e.g. Google Drive) after the call, and smaller files are always desirable;
  2. The recordings will be persisted in each participant’s browser’s local storage, which does have a quota. A space-efficient codec lowers the risk of losing recordings because the quota has run out.

For this reason, we are making Flac the default format for local recording in Jitsi Meet.

Signalling: coordination of multiple participants

With recording and encoding solved, the next challenge is how to coordinate between participants in a Jitsi Meet conference so that they all start and end their recording simultaneously. Also, if someone joins a conference late or refreshes the browser, they should still be recorded. This requires synchronisation of the recording “state” for all participants in a conference.

Jitsi Meet uses XMPP under the hood for passing WebRTC offers, chat messages, and all sorts of system events. XMPP has a concept called “Presence”, which is state that you can persist in a conference room. I am leveraging this to store the current state of local recording in the room, and each participant’s browser checks this state to ensure they are in sync.

The following diagram illustrates what happens when the moderator in a 2-user conference starts the local recording. The signalling of the termination of local recording is similar, except that sendCommand carries a different message.

Figure 2: An illustration of how XMPP is used to coordinate local recording for all participants.

This is a naive approach though and has a few limitations. The primary one is that when the original participant who started the recording leaves the conference, the Presence gets erased by the XMPP server. An alternative approach is to layer this state on top of the same mechanism behind chat messages, which get persisted regardless of the presence of participants. There are also security implications which I will explain in the “Next steps” section.
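
The Presence-based coordination described above, including the limitation when the initiator leaves, can be simulated in a few lines. To be clear, FakeRoom is an in-memory stand-in that I made up for illustration: it is not the XMPP server or the lib-jitsi-meet API, and the command name 'localRecording' and its payload shape are assumptions.

```javascript
// In-memory stand-in for a conference room; NOT the real XMPP/lib-jitsi-meet API.
class FakeRoom {
    constructor() {
        this.commands = new Map(); // command name -> { sender, value }
        this.members = new Map();  // participant id -> listener callback
    }

    join(id, onCommand) {
        this.members.set(id, onCommand);
        // Late joiners immediately receive whatever state is persisted.
        for (const [name, { value }] of this.commands) {
            onCommand(name, value);
        }
    }

    sendCommand(sender, name, value) {
        this.commands.set(name, { sender, value });
        for (const listener of this.members.values()) {
            listener(name, value);
        }
    }

    leave(id) {
        this.members.delete(id);
        // Mirrors the limitation: state set by a participant is dropped
        // from the room when that participant leaves.
        for (const [name, { sender }] of this.commands) {
            if (sender === id) {
                this.commands.delete(name);
            }
        }
    }
}

// Each client keeps a local flag in sync with the state stored in the room.
function makeClient(room, id) {
    const client = { id, isRecording: false };

    room.join(id, (name, value) => {
        if (name === 'localRecording') {
            client.isRecording = value.on;
        }
    });

    return client;
}
```

A participant who joins after the moderator has left never sees the recording state, which is exactly the gap that a chat-message-based mechanism would close.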

Storage and exporting (Work in progress)

Many things can happen during a Jitsi Meet conference: someone might join late; browsers can crash; it might take a few seconds (a later feature); or the user might simply refresh the page.

All these events can lead to discontinuous recordings and make it hard to synchronise the tracks later on. Therefore, it is important to store the recorded audio somewhere other than in RAM. IndexedDB is a good candidate for this, as it allows us to store gigabytes of data with high performance.

Figure 3: Illustration of continuous recording segments and the “gaps” between them.

I mentioned earlier that it is SessionManager’s job to record and persist the start and end of each recording segment. With this information, we can calculate the length of the gap between consecutive recording segments.
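
The gap calculation itself is simple. The sketch below assumes that each segment is stored as an object with start and end timestamps in milliseconds; that shape is an assumption made for illustration rather than the actual SessionManager format.

```javascript
// Each segment is { start, end } in milliseconds (assumed shape).
// Returns the gap lengths between consecutive segments, in milliseconds.
function computeGaps(segments) {
    // Sort a copy by start time so the input array is left untouched.
    const sorted = [...segments].sort((a, b) => a.start - b.start);
    const gaps = [];

    for (let i = 1; i < sorted.length; i++) {
        // Clamp at zero in case clock adjustments make segments overlap.
        gaps.push(Math.max(0, sorted[i].start - sorted[i - 1].end));
    }

    return gaps;
}
```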

When it comes to exporting recordings with discontinuous segments, there are basically two approaches:

  1. Encode a period of silence equivalent in length to the gap, and concatenate it with the other segments; or
  2. Export each segment as a separate file and provide additional metadata to indicate the “offset” of each segment from the start.

For Flac, it is cheap to go with the first approach, because a period of silence, when encoded in Flac, takes up close to zero space (with the exception of block headers), and encoding it does not require significant CPU time. For WAVE, however, silence takes up the same amount of space as non-silence. And for Opus, concatenating segments involves decoding and re-encoding. This is another reason for making Flac the default format.
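
Both export approaches boil down to a little bookkeeping, sketched below. The helper names, the { start, end } segment shape, and the segment-N file labels are all hypothetical; the real exporter may organise this differently.

```javascript
// Approach 1: number of zero-valued PCM samples needed to fill a gap
// before the audio is handed to the encoder.
function silenceSampleCount(gapMs, sampleRate) {
    return Math.round((gapMs / 1000) * sampleRate);
}

// Approach 2: per-segment metadata with offsets (in milliseconds)
// relative to the start of the first segment.
function buildOffsetManifest(segments) {
    const sorted = [...segments].sort((a, b) => a.start - b.start);
    const origin = sorted.length ? sorted[0].start : 0;

    return sorted.map((segment, index) => ({
        file: `segment-${index}`, // hypothetical file label
        offsetMs: segment.start - origin,
        durationMs: segment.end - segment.start
    }));
}
```

With the manifest, an audio editor (or a small script) can place each exported file at the right position on the timeline without any silence being encoded at all.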

Trying it out

To enable this feature, you need to be using the latest Jitsi Meet and add these lines to config.js:

localRecording: {
    enabled: true,
    format: 'flac' // can be replaced with 'wav' or 'ogg'
}

A demo instance of Jitsi Meet with local recording enabled is available at https://radium2.jitsi.net.

At the moment, there is no toolbar button for this feature. To bring up the Local Recording Controls window, simply press the shortcut key L.

Figure 4: Local Recording Controls window with one participant

Click on "Start Recording" and the browser will start to record your local microphone.

Now, try to use another browser (ideally on another device, since different browsers might contend for audio devices) to join the same Jitsi Meet conference. On the moderator's screen, you should see the new participant show up, and shortly afterwards their icon will turn green (indicating that their browser has started recording locally).

Figure 5: Local Recording Controls window with two participants

If a participant joins the conference with an unsupported client (e.g. the Jitsi Meet mobile app), they will just show up as a gray dot.

Figure 6: Local Recording Controls window with a participant on an unsupported client.

After the conference is over, click “Stop Recording” on the moderator’s browser. Each participant will be signalled, and their browsers will start to export their recordings individually. Currently, this audio file is automatically saved as a download, but this experience can be improved in the future.

Figure 7: The recording is exported and saved as a download. This happens on each participant’s browser.

Then, depending on the use case, those individual, high-quality recordings can be mixed and edited in an audio editor like Audacity or Audition.

Next steps

There are still a few important areas that need to be worked on:

  • UI for user consent and downloading past recordings. The former is a UX requirement by the Jitsi Meet designer. The latter will be helpful for the users if they cannot find the file immediately saved after the conference.

  • The signalling mechanism needs an overhaul. The current approach falls short in two areas. First, there is no mechanism for a client to verify that a command was actually sent by the moderator, so a malicious user can fake XMPP messages to act as the moderator and activate local recording in the conference. Second, as previously noted, when the user who set the Presence (in this case the moderator) leaves the conference, the XMPP server removes that Presence extension from the corresponding XMPP room. This can cause a problem when the moderator's browser crashes, for example. I haven’t figured out exactly how to fix these, but a solution will likely involve Jicofo, since it is where user roles are assigned and it can therefore do the validation.

  • Ability to auto-upload recordings to an external service (like Google Drive or a custom file server). This can provide another level of convenience, as the participants will not have to manually send or upload their recordings (which can be annoying).

  • Mobile app. This feature will still feel incomplete without support in the mobile clients. I was definitely too optimistic in my original GSoC proposal, hoping that I could implement this in the mobile app as well. In terms of implementation, the code for signalling and coordination can be reused, but there needs to be an alternative RecordingAdapter for the mobile client. The React Native plugin (react-native-webrtc) that Jitsi Meet uses doesn’t have an implementation of AudioContext or MediaRecorder yet. So eventually I will need to dig into the native code for Android and iOS to get this working. But I will probably get to it once I have finished the other items.

Acknowledgement

I am grateful for the help of my awesome mentors Boris and Lenny, as well as the many other community members of Jitsi, without whom this project would not have been possible.

This work was supported by Google Summer of Code 2018.