Hidden complexity in BBB 3.0

I recently experimented with BigBlueButton 3.0-RC4 again and tried to find out why hasura-graphql-server and postgresql are hogging half a CPU all the time on a fresh BBB server with just a single meeting and only one participant. This seems to be less of an issue if you actually have some traffic on the server, because it certainly scales better with more users, but I was still curious and wanted to understand how this GraphQL stuff works in BBB. Oh boy, what an adventure!

Disclaimer: I know that BigBlueButton 3.0 was a ton of work and that it must have been a real pain to rip out and replace all the legacy Meteor stuff in a project this size. I do not want to criticize the decisions that were made or the work of individual developers. Taking shortcuts and prioritizing developer experience over performance is fine! Sometimes this is necessary to reach a goal with a limited time budget. I just want to express my surprise and concerns when I noticed how much complexity and overhead is buried below the surface. The core developers probably know most of this already, and I hope some of these issues can be tackled in future releases.

Update: The situation has improved significantly since this article and the related GitHub issues were published.

So, let's pick a (not really) random example and go on an adventure, shall we? Let's look at how BBB implements a "shared mouse cursor" that all participants of a meeting can see on the whiteboard. A simple feature, you might think.

Client/Server communication

A meeting has lots of state that needs to be shared with and synced between all connected clients, preferably in real time. Stuff like: who is in the meeting? Who is currently talking? Chat messages, whiteboard annotations and much more. In BigBlueButton 2.x this was achieved by building the client on top of Meteor and MongoDB. Meteor is a pretty good fit for apps that need this kind of state synchronization. Unfortunately, MongoDB stopped being free software a while ago, and Meteor was slow (the way it was used in BBB). Many bottlenecks in large meetings were blamed on Meteor, so it was replaced. Obviously, ripping out a framework and replacing it with something completely different is a ton of work, so choices were made and shortcuts were taken to make this possible at all without a full rewrite.

In BigBlueButton 3, Meteor was replaced with Hasura, a 'universal data access layer' that speaks GraphQL and is backed by a database. Clients connect to a websocket and can trigger mutations on state, or subscribe to state changes, both via GraphQL. This is also used to synchronize mouse cursor positions, exactly what we are interested in today.

So, what happens when the presenter in a meeting moves the mouse cursor two pixels to the left? The client calls an innocent-looking function named presentationPublishCursor() which sends half a kilobyte of JSON over the Hasura websocket:

{
    "id":"39980c1d-e8e5-45d5-8119-564828dca33d",
    "type":"subscribe",
    "payload": {
        "variables": {
            "whiteboardId":"2ef9c352f79b9bc20c4b9caaf5faeaca9dae7432-1740219557083/1",
            "xPercent":919.0178734915596,
            "yPercent":115.98215103149414
        },
        "extensions": {},
        "operationName":"PresentationPublishCursor",
        "query":"mutation PresentationPublishCursor($whiteboardId: String!, $xPercent: Float!, $yPercent: Float!) {\n  presentationPublishCursor(\n    whiteboardId: $whiteboardId\n    xPercent: $xPercent\n    yPercent: $yPercent\n  )\n}"
    }
}

This mutation of state is parsed and processed by the server and then acknowledged with two responses:

{"id":"39980c1d-e8e5-45d5-8119-564828dca33d","payload":{"data":{"presentationPublishCursor":true}},"type":"next"}
{"id":"39980c1d-e8e5-45d5-8119-564828dca33d","type":"complete"}

This happens once every 150ms by default, which is a lot, but required if you want somewhat smooth cursor movement. Every 250ms or so all clients are notified about the new cursor position, which looks like this:

{
    "type":"next",
    "id":"c51e8701-6282-479d-a4ae-131056cf95ba",
    "payload": {
        "data": {
            "pres_page_cursor_stream" : [{
                "xPercent":795.803587777274,
                "yPercent":83.839293888637,
                "lastUpdatedAt":"2025-02-22T10:19:53.119+00:00",
                "userId":"w_zbduowqxzbyo",
                "__typename":"pres_page_cursor"
            }]
        }
    }
}
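The 150ms cadence on the sending side is plain client-side throttling: cursor moves that arrive faster than the interval are simply dropped. A minimal sketch in Python of how such a throttle could work (hypothetical illustration only; the actual BBB client is JavaScript and may implement this differently):

```python
import time

class CursorThrottle:
    """Allow at most one cursor update per interval; drop the rest."""

    def __init__(self, interval_s=0.15, now=time.monotonic):
        self.interval_s = interval_s
        self.now = now                 # injectable clock for testing
        self.last_sent = float("-inf")

    def should_send(self):
        t = self.now()
        if t - self.last_sent >= self.interval_s:
            self.last_sent = t
            return True
        return False
```

With a fake clock, a move at t=0 is sent, a move 50ms later is dropped, and a move at t=160ms goes through again.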

So far, so good. A bit verbose perhaps, and there is some network and parsing overhead, but that does not explain why both hasura-graphql-server and postgresql are busy enough to use half a CPU all the time.
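Some back-of-the-envelope numbers to put the verbosity in perspective (payload sizes roughly estimated from the messages above, and the meeting size is a made-up example):

```python
# Rough estimates, not measurements: a ~500 byte mutation every 150 ms
# from the presenter, and a ~400 byte "next" event fanned out to every
# client roughly every 250 ms.
mutation_bytes = 500
send_interval_s = 0.15
broadcast_bytes = 400
broadcast_interval_s = 0.25
clients = 20  # hypothetical meeting size

upstream_bps = mutation_bytes / send_interval_s
fanout_bps = broadcast_bytes / broadcast_interval_s * clients

print(round(upstream_bps))  # a few KB/s from the presenter alone
print(round(fanout_bps))    # tens of KB/s of server fan-out
```

A few kilobytes per second for two floats is wasteful but harmless on its own; the interesting cost is elsewhere.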

Lots of SQL under the hood

So, what happens on the server side? Let's go deeper.

As already mentioned, BigBlueButton 3.0 switched from Meteor to Hasura to manage and sync state between clients. In this configuration, Hasura stores all its state (and also non-state, more on that later) in a PostgreSQL database. That's nice, Postgres is really great. But an odd choice for real-time data, you may think. Isn't something like Redis way more common for this use case?

When looking at the sources, it's obvious that most of the server code, query logic and database schema were not hand-crafted by developers, but generated by Hasura tooling based on configuration snippets like these. At least I hope so, because the resulting DB schema is huge!

Using a framework like Hasura does make sense in this context. It can speed up development and allows developers to focus on features while the framework handles all the details. I somewhat understand and do not really question this choice. Sometimes you have to resort to these tools to get things done. But they come at a price: the table responsible for mouse movements has 6 columns, 5 indices and two foreign keys, most of them compound. Oh, and there is a view that joins it with two additional tables. Yes, really:

CREATE TABLE "pres_page_cursor" (
    "pageId" varchar(100)  REFERENCES "pres_page"("pageId") ON DELETE CASCADE,
    "meetingId" varchar(100),
    "userId" varchar(50),
    "xPercent" numeric,
    "yPercent" numeric,
    "lastUpdatedAt" timestamp with time zone DEFAULT now(),
    CONSTRAINT "pres_page_cursor_pkey" PRIMARY KEY ("pageId","meetingId","userId"),
    FOREIGN KEY ("meetingId", "userId") REFERENCES "user"("meetingId","userId") ON DELETE CASCADE
);
create index "idx_pres_page_cursor_pageId" on "pres_page_cursor"("pageId");
create index "idx_pres_page_cursor_userID" on "pres_page_cursor"("meetingId","userId");
create index "idx_pres_page_cursor_userID_rev" on "pres_page_cursor"("userId", "meetingId");
create index "idx_pres_page_cursor_lastUpdatedAt" on "pres_page_cursor"("pageId","lastUpdatedAt");

CREATE VIEW "v_pres_page_cursor" AS
SELECT pres_page."presentationId", c.*,
        CASE WHEN pres_presentation."current" IS true AND pres_page."current" IS true THEN true ELSE false END AS "isCurrentPage"
FROM pres_page_cursor c
JOIN pres_page ON pres_page."pageId" = c."pageId"
JOIN pres_presentation ON pres_presentation."presentationId" = pres_page."presentationId";

You may already have a bad feeling and an idea what this means, and you are right: for every single meeting, every 150ms, this table is updated with new values and synced to disk. This is not an in-memory table, by the way. This is a real table with full ACID guarantees in a relational database. If the server crashes, your last mouse movement will be safe! What a relief!

To be fair, this is a particularly bad example. Mouse movements are low-priority, high-frequency events. They do not need to be persisted at all, let alone synced to disk. Connected clients know the last position, and new clients can simply wait for the next event to catch up. Most other state tracked in that DB is different: the current slide, for example, is actual state that needs to be communicated to new clients as soon as they connect, so it needs to be stored somewhere. But on disk? In a relational database?
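The ephemeral alternative is almost embarrassingly simple to model: keep the last known position per (meeting, user) in memory, and let newly connected clients catch up from the next event. A hypothetical sketch, not actual BBB code:

```python
class CursorState:
    """Last-known cursor positions, kept only in memory.

    Nothing is persisted: if the process restarts, the state is simply
    repopulated by the next cursor event from each client.
    """

    def __init__(self):
        self._positions = {}  # (meeting_id, user_id) -> (x, y)

    def publish(self, meeting_id, user_id, x, y):
        # Overwrite the previous position; history is irrelevant.
        self._positions[(meeting_id, user_id)] = (x, y)

    def snapshot(self, meeting_id):
        """What a newly connected client would receive as initial state."""
        return {user: pos for (m, user), pos in self._positions.items()
                if m == meeting_id}
```

No indices, no foreign keys, no fsync. The trade-off is obvious: you lose the generic query layer that Hasura gives you for free.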

We are not talking about much data here, maybe a couple of KB per meeting. Even on a heavily utilized BBB server, it would fit comfortably in memory. Using UNLOGGED tables for unimportant data may help reduce the IO load a little, but it's still the wrong tool for the job in my opinion. At least for high-frequency data like mouse movements or whiteboard drawings.

Not enough SQL? Hold my beer and watch (the query log)

All this still does not explain why the database is so busy all the time, even if nothing really happens and there is only one empty meeting on the server. So, what's causing all this load?

Earlier we saw that the updated cursor position is a "type":"next" event. This is the result of a GraphQL query with more than one result, in this case a GraphQL subscription. You request data from the server and want to be notified whenever the result of your query changes. That's a pretty straightforward way to implement synchronized state.

So, how does Hasura know when the result of a complex GraphQL query has changed? I'm pretty sure there are better ways to do this, and maybe this is just a configuration issue, but at the time of writing Hasura seems to just query the database over and over again in a fast loop. Now, if you look at one of those queries, you quickly understand why the database is busy all the time. I gave up trying to format and make sense of this absurdity. Here it is, in all its glory:

SELECT "__subs"."result_id",
       "__fld_resp"."root" AS "result"
FROM UNNEST(($1)::UUID[], ($2)::JSON[]) AS "__subs"("result_id", "result_vars")
LEFT OUTER JOIN LATERAL (
    SELECT json_build_object('pres_page_cursor', "_pres_page_cursor"."root") AS "root"
    FROM (
        SELECT coalesce(json_agg("root" ORDER BY "root.pg.userId" ASC NULLS LAST), '[]') AS "root"
        FROM (
            SELECT row_to_json((
                SELECT "_e"
                FROM (
                    SELECT "_root.base"."userId" AS "userId", "_root.or.user"."user" AS "user", 'pres_page_cursor' AS "__typename") AS "_e")) AS "root", "_root.base"."userId" AS "root.pg.userId"
                    FROM (
                        SELECT *
                        FROM "public"."v_pres_page_cursor"
                        WHERE ((((("public"."v_pres_page_cursor"."meetingId") = ((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-meetingid']))::varchar))
                            OR ((("public"."v_pres_page_cursor"."meetingId") IS NULL)
                            AND (((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-meetingid']))::varchar) IS NULL)))
                            AND ((EXISTS (
                                SELECT 1
                                FROM "public"."v_user_ref" AS "__be_0_v_user_ref"
                                WHERE (((("__be_0_v_user_ref"."userId") = ("public"."v_pres_page_cursor"."userId"))
                                    AND ((("__be_0_v_user_ref"."meetingId") = ("public"."v_pres_page_cursor"."meetingId"))
                                    AND ('true')))
                                    AND (('true')
                                    AND ((((("__be_0_v_user_ref"."isModerator") = (('true')::boolean))
                                    OR ((("__be_0_v_user_ref"."isModerator") IS NULL)
                                    AND ((('true')::boolean) IS NULL)))
                                    AND ('true'))
                                    AND ('true'))))))
                            OR (((("public"."v_pres_page_cursor"."meetingId") = ((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-cursornotlockedinmeeting']))::varchar))
                            OR ((("public"."v_pres_page_cursor"."meetingId") IS NULL)
                            AND (((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-cursornotlockedinmeeting']))::varchar) IS NULL)))
                            OR ((("public"."v_pres_page_cursor"."userId") = ((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-cursorlockeduserid']))::varchar))
                            OR ((("public"."v_pres_page_cursor"."userId") IS NULL)
                            AND (((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-cursorlockeduserid']))::varchar) IS NULL))))))
                            AND (("public"."v_pres_page_cursor"."isCurrentPage") = ((("__subs"."result_vars"#>>ARRAY['synthetic', '0']))::boolean)))
                        ORDER BY "userId" ASC NULLS LAST) AS "_root.base"
                    LEFT OUTER JOIN LATERAL (
                        SELECT row_to_json((
                            SELECT "_e"
                            FROM (
                                SELECT "_root.or.user.base"."name" AS "name", "_root.or.user.base"."presenter" AS "presenter", "_root.or.user.base"."role" AS "role", 'user_ref' AS "__typename") AS "_e")) AS "user"
                                FROM (
                                    SELECT *
                                    FROM "public"."v_user_ref"
                                    WHERE (((("_root.base"."userId") = ("userId"))
                                        AND (("_root.base"."meetingId") = ("meetingId")))
                                        AND ((("public"."v_user_ref"."meetingId") = ((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-meetingid']))::varchar))
                                        OR ((("public"."v_user_ref"."meetingId") IS NULL)
                                        AND (((("__subs"."result_vars"#>>ARRAY['session', 'x-hasura-meetingid']))::varchar) IS NULL))))
                                    LIMIT 1)
                                AS "_root.or.user.base")
                            AS "_root.or.user" ON ('true')
            ORDER BY "root.pg.userId" ASC NULLS LAST) AS "_root") AS "_pres_page_cursor") AS "__fld_resp" ON ('true')

This huge query is just for one type of subscription, and it is triggered once every 100ms to 250ms. Remember, this is for tracking mouse movement. Literally just two float values. There are many other types of state that clients subscribe to, and I have seen much bigger queries flying by. On an empty server this may still be okay-ish, since the actual database is tiny and most of it is cached in memory, but come on...
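Stripped of all the SQL, the polling model boils down to: re-run the query on every tick, compare the result with the previous one, and emit a "next" event only when something changed. A minimal model of that loop (my own simplification, not Hasura's actual implementation):

```python
import json

class PollingSubscription:
    """Re-evaluate a query on every tick; notify only on change."""

    def __init__(self, run_query, notify):
        self.run_query = run_query  # callable returning the current result
        self.notify = notify        # callable receiving changed results
        self._last = None

    def tick(self):
        # This is the expensive part: one full query execution per tick,
        # whether or not anything changed.
        result = self.run_query()
        key = json.dumps(result, sort_keys=True)
        if key != self._last:
            self._last = key
            self.notify(result)
```

The cost is paid on every tick, even for an idle meeting, which is exactly the baseline load observed on an empty server.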

Hasura tries to optimize this to a limited degree. Multiple subscriptions to the same query can be merged into batches of up to 1000 and handled by a single larger SQL query. This is probably the reason this model works at all for multiple clients and does not melt the CPU on busy servers, but there are still multiple subscriptions that hammer the database every 100ms with huge queries. There must be a better way.
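The batching idea itself is easy to model: subscribers with the same query shape are grouped, and their session variables are shipped to the database as one array parameter per batch, so each batch costs one round trip instead of one per subscriber. A simplified sketch of the grouping step (illustrative only; batch size and grouping key are assumptions):

```python
from collections import defaultdict

def batch_subscriptions(subs, batch_size=1000):
    """Group subscribers by query so each batch needs one DB round trip.

    subs: list of (query_id, variables) tuples.
    Returns {query_id: [batches of variables, each up to batch_size]}.
    """
    by_query = defaultdict(list)
    for query_id, variables in subs:
        by_query[query_id].append(variables)
    return {q: [vars_[i:i + batch_size]
                for i in range(0, len(vars_), batch_size)]
            for q, vars_ in by_query.items()}
```

That array-per-batch shipping is what the UNNEST at the top of the monster query above is doing: one set of result_id/result_vars per subscriber, all evaluated in one statement.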

How about a larger interval? Hasura actually defaults to one poll per second. This would be fine for some types of state, but not for mouse movement or whiteboard changes. Those need to be quicker. But the interval can only be configured globally, not per query, so that's not an option. This alone should be an indicator that, maybe, piping real-time mouse movement events through a relational database is not the best idea.

Conclusion

We all know this already: complexity is bad, more so when it hides below a layer of false simplicity. Choose the right tool for the job. There is no silver bullet. All of this applies here, too.

At the same time, I understand why Hasura was chosen as a Meteor replacement, and I do not think it was a bad decision. It adds flexibility and simplifies development, two very important factors in a project like BigBlueButton. But it's also important to know and understand the price we pay for this flexibility and convenience, and to regularly check whether it's still worth it.

Let's hope BBB can reduce the impact a little in future releases by moving to a more efficient pub/sub and state model for high-frequency data. A lot can be achieved by tweaking and optimizing the current approach, but only up to a certain limit. The flexibility of GraphQL is also its weakness. I can already see another rewrite coming in a couple of years, maybe even earlier.