I am trying to build a chat application similar to Slack. I want to understand how they have designed their database so that it returns so much information at once when someone loads a chat, and which database is a good fit for this problem. I am adding a screenshot for reference.
Initially I wanted to go with PostgreSQL and keep the tables normalized to keep things clean, but as I went ahead, normalization started to feel like a problem.
Users Table
id | name | email |
---|---|---|
1 | John | [email protected] |
2 | Sam | [email protected] |
Channels Table
id | channel_name |
---|---|
1 | Channel name 1 |
2 | Channel name 2 |
Participants table
id | user_id | channel_id |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 2 | 1 |
Chat table
id | user_id | channel_id | parent_id | message_text | total_replies | timestamp |
---|---|---|---|---|---|---|
1 | 1 | 1 | null | first message | 0 | – |
2 | 1 | 2 | 1 | second message | 10 | – |
3 | 1 | 3 | null | third message | 0 | – |
The Chat table has a column parent_id which tells whether a message is a parent or a child message. I don't want recursive child messages (threads within threads), so this is fine.
Emojis table
id | user_id | message_id | emoji_uni-code |
---|---|---|---|
1 | 1 | 12 | U123 |
2 | 1 | 12 | U234 |
3 | 2 | 14 | U456 |
4 | 2 | 14 | U7878 |
5 | 3 | 14 | U678 |
A person can react with many emojis to the same message.
When someone loads a chat, I want to fetch the last 10 messages inserted into the tables, together with all the emojis that have been reacted to each message, and the replies, like you can see in the image where it says "1 reply" with the person's profile picture (this can be more than one).
Now, to fetch this data I have to join all the tables, which could be a very heavy job on the back-end side, considering this is going to happen very frequently.
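For reference, the join described above might look roughly like this in PostgreSQL (table and column names are taken from the sketch above; the `emoji_uni-code` column is assumed to be renamed `emoji_unicode`, since a hyphen is not valid in an identifier):

```sql
-- Sketch: last 10 parent messages in a channel, with reactions
-- aggregated into a JSON array per message.
SELECT c.id,
       c.message_text,
       c.total_replies,
       c.timestamp,
       json_agg(json_build_object('emoji',   e.emoji_unicode,
                                  'user_id', e.user_id))
         FILTER (WHERE e.id IS NOT NULL) AS reactions
FROM chat AS c
LEFT JOIN emojis AS e ON e.message_id = c.id
WHERE c.channel_id = $1
  AND c.parent_id IS NULL
GROUP BY c.id, c.message_text, c.total_replies, c.timestamp
ORDER BY c.timestamp DESC
LIMIT 10;
```

With an index on `emojis(message_id)` this is less heavy than it looks, since the join only touches the 10 selected rows.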
What I thought is that I would add two more columns to the Chat table, profile_replies and emoji_reactions_count, both of the jsonb data type (PostgreSQL's binary JSON type), to store data something like this.
This is for the emoji_reactions_count column. There are two options; the first stores only counts:
{
  "U123": 123, // count of reactions for this emoji
  "U234": 12
}
When someone reacts, I would update this count and insert or delete the corresponding row in the Emojis table. Here I have a question: could too-frequent emoji reactions on a message become slow, since I need to update the count in the column above every time someone reacts?
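For what it's worth, a single-statement counter update in PostgreSQL is atomic at the row level, so this only becomes a contention problem on very hot messages. A sketch, assuming emoji_reactions_count is a jsonb column as described above:

```sql
-- Sketch: atomically increment one reaction counter inside jsonb.
-- COALESCE handles both a NULL column and the first reaction
-- for a given emoji code.
UPDATE chat
SET emoji_reactions_count = jsonb_set(
      COALESCE(emoji_reactions_count, '{}'::jsonb),
      '{U123}',
      to_jsonb(COALESCE((emoji_reactions_count->>'U123')::int, 0) + 1))
WHERE id = $1;
```

Each such UPDATE rewrites the whole jsonb value, which is the real cost to watch on messages with many distinct reactions.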
OR
Or, storing user IDs along with the count, like this. This looks better to me, because I can get rid of the Emojis table completely:
{
  "U123": {
    "count": 123, // count of reactions for this emoji
    "userIds": [1, 2, 3, 4] // ids of the users who reacted
  },
  "U234": {
    "count": 12,
    "userIds": [1, 2, 3, 4]
  }
}
This is for the profile_replies column:
[
  {
    "name": "john",
    "profile_image": "image url",
    "replied_on": "timestamp"
  },
  ... similar objects for the other repliers
]
Does this look like a fine solution, or is there anything I can do to improve it? Or should I switch to some NoSQL database like MongoDB or Cassandra? I have considered MongoDB, but it does not look very good either, because joins are slow when the data grows, whereas in SQL this is comparatively less of a problem.
Even though this is honestly more of a discussion and there is no perfect answer to such a question, I will try to point out things you might want to consider if rebuilding Slack:
As Alex Blex already commented, emoji reactions can be neglected at the very beginning of a chat software. Later on they could either be injected by some cache in your application (in middleware, the view, or wherever), or stored directly with your message. There is no need to JOIN anything on the database side.
Slack is organized in workspaces, in which you participate with the very same user. Every workspace can have multiple channels, and every channel can have multiple guests. Every user can join multiple workspaces (as admin, full member, single-channel, or multi-channel guest). Try to start with that idea.
I would refactor the channel wording to e.g. conversation, because basically (personal opinion here) I think there is not much of a difference between e.g. a channel with 10 members and a direct conversation involving 5 people, except for the fact that users can join (open) channels later on and see previous messages, which is not possible for closed channels and direct messages.
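Put together, a minimal layout along those lines could look like this (all table and column names are my own sketch, not Slack's actual schema):

```sql
-- Sketch: workspaces, conversations, and memberships.
CREATE TABLE workspaces (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE conversations (
    id           bigserial PRIMARY KEY,
    workspace_id bigint NOT NULL REFERENCES workspaces(id),
    is_open      boolean NOT NULL DEFAULT false  -- open channel vs. closed/direct
);

CREATE TABLE conversation_members (
    conversation_id bigint NOT NULL REFERENCES conversations(id),
    user_id         bigint NOT NULL,
    role            text   NOT NULL,  -- e.g. admin / member / guest
    PRIMARY KEY (conversation_id, user_id)
);
```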
Now for your actual database layout question:
So basically your query against the database whenever a user joins a new conversation is going to be:
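Something like this, assuming a hypothetical `messages` table keyed by workspace and conversation:

```sql
-- Sketch: initial load of a conversation, newest 10 messages first.
SELECT *
FROM messages
WHERE workspace_id = $1
  AND conversation_id = $2
ORDER BY id DESC
LIMIT 10;
```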
And when you start scrolling up, it's going to be:
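The same query with a keyset condition, using the oldest message already on screen as the cursor (again against the assumed `messages` table):

```sql
-- Sketch: load the next page of older messages while scrolling up;
-- $3 is the id of the oldest message currently displayed.
SELECT *
FROM messages
WHERE workspace_id = $1
  AND conversation_id = $2
  AND id < $3
ORDER BY id DESC
LIMIT 10;
```

Keyset pagination like this stays fast no matter how far back the user scrolls, unlike OFFSET-based paging.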
and so on… As you can already see, you can now select messages very efficiently by workspace and conversation if you additionally add an INDEX like the following (it might differ if you use partitioning):
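A composite index matching the WHERE and ORDER BY of the queries above, for the same hypothetical `messages` table:

```sql
-- Sketch: one index serves both the initial load and the scroll-up query.
CREATE INDEX idx_messages_ws_conv_id
    ON messages (workspace_id, conversation_id, id DESC);
```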
For the message format I would use something similar to what Twitter does; for more details please check their official documentation:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
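In that spirit, a primitive, versioned content object might look like this (all field names here are hypothetical, loosely modeled on Twitter's entity objects):

```json
{
  "v": 1,
  "type": "message",
  "text": "Hello @sam, check https://example.com",
  "entities": {
    "mentions": [{ "user_id": 2, "offset": 6, "length": 4 }],
    "urls": [{ "url": "https://example.com", "offset": 18, "length": 19 }]
  }
}
```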
Of course, e.g. your client v14 should know how to 'render' all objects from v1 to v14, but that's the great thing about message format versioning: it is backwards compatible, and you can launch a new format supporting more features whenever you feel like it; even a primitive `content` object carrying a format version is enough to start with.

The much more complicated question, IMO, is efficiently determining which messages have been read by each and every user. I will not go into detail on how to send push notifications, as that should be done by your backend application and not by polling the database.
Using the previously gathered data from "get recent workspace conversations" (something like

SELECT * FROM user_conversations ORDER BY last_read_dt DESC LIMIT 25

should do; in your case that is the Participants table, which would have to gain both last_read_message_id and last_read_dt), you can then run a query to get which messages have not been read yet.

And last but not least: I would highly recommend not trying to rebuild Slack, as there are so many more topics to cover, like security & encryption, API & integrations, and so on…
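For completeness, the unread-messages query hinted at above could be sketched like this (column names are hypothetical; the Participants table is assumed to have gained last_read_message_id as described):

```sql
-- Sketch: all messages in a user's conversations that are newer than
-- the last message that user has read.
SELECT m.*
FROM messages AS m
JOIN participants AS p
  ON p.channel_id = m.conversation_id
 AND p.user_id = $1
WHERE m.id > COALESCE(p.last_read_message_id, 0)
ORDER BY m.id;
```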