I am trying to build a chat application similar to Slack. I want to understand how they have designed their database so that it returns so much information at once when someone loads a chat, and which database is a good fit for this problem. I am adding a screenshot for reference.
Initially I wanted to go with PostgreSQL and keep the tables normalized to keep things clean, but as I went ahead, normalization started to feel like a problem.
Users Table
id | name | email |
---|---|---|
1 | John | [email protected] |
2 | Sam | [email protected] |
Channels Table
id | channel_name |
---|---|
1 | Channel name 1 |
2 | Channel name 2 |
Participants table
id | user_id | channel_id |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 2 | 1 |
Chat table
id | user_id | channel_id | parent_id | message_text | total_replies | timestamp |
---|---|---|---|---|---|---|
1 | 1 | 1 | null | first message | 0 | – |
2 | 1 | 2 | 1 | second message | 10 | – |
3 | 1 | 3 | null | third message | 0 | – |
The Chat table has a column parent_id which tells whether a message is a parent or a child message. I don't want recursive child messages (threads within threads), so this is fine.
Emojis table
id | user_id | message_id | emoji_uni-code |
---|---|---|---|
1 | 1 | 12 | U123 |
2 | 1 | 12 | U234 |
3 | 2 | 14 | U456 |
4 | 2 | 14 | U7878 |
5 | 3 | 14 | U678 |
A person can react with many emojis to the same message.
When someone loads a chat, I want to fetch the last 10 messages inserted into the tables, together with all the emojis that have been reacted to each message, and the replies, like you can see in the image where it says "1 reply" with the person's profile picture (this can be more than one).
Now, to fetch this data I have to join all the tables, which could be a very heavy job on the back-end side, considering this is going to happen very frequently.
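For reference, the join described above might look roughly like this in PostgreSQL (table and column names are taken from the sketch above; the `emoji_uni-code` column is assumed to be renamed `emoji_unicode`, since a hyphen is not valid in an identifier):

```sql
-- Sketch: last 10 parent messages in a channel, with reactions
-- aggregated into a JSON array per message.
SELECT c.id,
       c.message_text,
       c.total_replies,
       c.timestamp,
       json_agg(json_build_object('emoji',   e.emoji_unicode,
                                  'user_id', e.user_id))
         FILTER (WHERE e.id IS NOT NULL) AS reactions
FROM chat AS c
LEFT JOIN emojis AS e ON e.message_id = c.id
WHERE c.channel_id = $1
  AND c.parent_id IS NULL
GROUP BY c.id, c.message_text, c.total_replies, c.timestamp
ORDER BY c.timestamp DESC
LIMIT 10;
```

With an index on `emojis(message_id)` this is less heavy than it looks, since the join only touches the 10 selected rows.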
What I thought is that I would add two more columns to the Chat table, profile_replies and emoji_reactions_count, both of the jsonb data type (PostgreSQL's binary JSON type), to store data something like this.
This is for the emoji_reactions_count column. There are two options; the first stores only counts:
{
  "U123": 123, // count of reactions for this emoji
  "U234": 12
}
When someone reacts, I would update this count and insert or delete the corresponding row in the Emojis table. Here I have a question: could too-frequent emoji reactions on a message become slow, since I need to update the count in the column above every time someone reacts?
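For what it's worth, a single-statement counter update in PostgreSQL is atomic at the row level, so this only becomes a contention problem on very hot messages. A sketch, assuming emoji_reactions_count is a jsonb column as described above:

```sql
-- Sketch: atomically increment one reaction counter inside jsonb.
-- COALESCE handles both a NULL column and the first reaction
-- for a given emoji code.
UPDATE chat
SET emoji_reactions_count = jsonb_set(
      COALESCE(emoji_reactions_count, '{}'::jsonb),
      '{U123}',
      to_jsonb(COALESCE((emoji_reactions_count->>'U123')::int, 0) + 1))
WHERE id = $1;
```

Each such UPDATE rewrites the whole jsonb value, which is the real cost to watch on messages with many distinct reactions.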
OR
Or, storing user IDs along with the count, like this. This looks better to me, because I can get rid of the Emojis table completely:
{
  "U123": {
    "count": 123, // count of reactions for this emoji
    "userIds": [1, 2, 3, 4] // ids of the users who reacted
  },
  "U234": {
    "count": 12,
    "userIds": [1, 2, 3, 4]
  }
}
This is for the profile_replies column:
[
  {
    "name": "john",
    "profile_image": "image url",
    "replied_on": "timestamp"
  },
  ... similar objects for the other repliers
]
Does this look like a fine solution, or is there anything I can do to improve it? Or should I switch to some NoSQL database like MongoDB or Cassandra? I have considered MongoDB, but it does not look very good either, because joins are slow when the data grows, whereas in SQL this is comparatively less of a problem.
Even though this is honestly more of a discussion and there is no perfect answer to such a question, I will try to point out things you might want to consider if rebuilding Slack:
As Alex Blex already commented, emoji reactions can be neglected at the very beginning of a chat software. Later on they could either be injected by some cache in your application (in middleware, the view, or wherever), or stored directly with your message. There is no need to JOIN anything on the database side.
Slack is organized in workspaces, in which you participate with the very same user. Every workspace can have multiple channels, and every channel can have multiple guests. Every user can join multiple workspaces (as admin, full member, single-channel, or multi-channel guest). Try to start with that idea.
I would refactor the channel wording to e.g. conversation, because basically (personal opinion here) I think there is not much of a difference between e.g. a channel with 10 members and a direct conversation involving 5 people, except for the fact that users can join (open) channels later on and see previous messages, which is not possible for closed channels and direct messages.
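Put together, a minimal layout along those lines could look like this (all table and column names are my own sketch, not Slack's actual schema):

```sql
-- Sketch: workspaces, conversations, and memberships.
CREATE TABLE workspaces (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE conversations (
    id           bigserial PRIMARY KEY,
    workspace_id bigint NOT NULL REFERENCES workspaces(id),
    is_open      boolean NOT NULL DEFAULT false  -- open channel vs. closed/direct
);

CREATE TABLE conversation_members (
    conversation_id bigint NOT NULL REFERENCES conversations(id),
    user_id         bigint NOT NULL,
    role            text   NOT NULL,  -- e.g. admin / member / guest
    PRIMARY KEY (conversation_id, user_id)
);
```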
Now for your actual database layout question:
So basically your query against the database whenever a user joins a new conversation is going to be:
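Something like this, assuming a hypothetical `messages` table keyed by workspace and conversation:

```sql
-- Sketch: initial load of a conversation, newest 10 messages first.
SELECT *
FROM messages
WHERE workspace_id = $1
  AND conversation_id = $2
ORDER BY id DESC
LIMIT 10;
```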
And when you start scrolling up, it's going to be:
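The same query with a keyset condition, using the oldest message already on screen as the cursor (again against the assumed `messages` table):

```sql
-- Sketch: load the next page of older messages while scrolling up;
-- $3 is the id of the oldest message currently displayed.
SELECT *
FROM messages
WHERE workspace_id = $1
  AND conversation_id = $2
  AND id < $3
ORDER BY id DESC
LIMIT 10;
```

Keyset pagination like this stays fast no matter how far back the user scrolls, unlike OFFSET-based paging.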
and so on… As you can already see, you can now select messages very efficiently by workspace and conversation if you additionally add an INDEX like the following (it might differ if you use partitioning):
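A composite index matching the WHERE and ORDER BY of the queries above, for the same hypothetical `messages` table:

```sql
-- Sketch: one index serves both the initial load and the scroll-up query.
CREATE INDEX idx_messages_ws_conv_id
    ON messages (workspace_id, conversation_id, id DESC);
```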
For the message format I would use something similar to what Twitter does; for more details please check their official documentation:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
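In that spirit, a primitive, versioned content object might look like this (all field names here are hypothetical, loosely modeled on Twitter's entity objects):

```json
{
  "v": 1,
  "type": "message",
  "text": "Hello @sam, check https://example.com",
  "entities": {
    "mentions": [{ "user_id": 2, "offset": 6, "length": 4 }],
    "urls": [{ "url": "https://example.com", "offset": 18, "length": 19 }]
  }
}
```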
Of course, e.g. your client v14 should know how to 'render' all objects from v1 to v14, but that's the great thing about message format versioning: it is backwards compatible, and you can launch a new format supporting more features whenever you feel like it; even a primitive `content` object carrying a format version is enough to start with.

The much more complicated question, IMO, is efficiently determining which messages have been read by each and every user. I will not go into detail on how to send push notifications, as that should be done by your backend application and not by polling the database.
Using the previously gathered data from "get recent workspace conversations" (something like

SELECT * FROM user_conversations ORDER BY last_read_dt DESC LIMIT 25

should do; in your case that is the Participants table, which would have to gain both last_read_message_id and last_read_dt), you can then run a query to get which messages have not been read yet.

And last but not least: I would highly recommend not trying to rebuild Slack, as there are so many more topics to cover, like security & encryption, API & integrations, and so on…
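For completeness, the unread-messages query hinted at above could be sketched like this (column names are hypothetical; the Participants table is assumed to have gained last_read_message_id as described):

```sql
-- Sketch: all messages in a user's conversations that are newer than
-- the last message that user has read.
SELECT m.*
FROM messages AS m
JOIN participants AS p
  ON p.channel_id = m.conversation_id
 AND p.user_id = $1
WHERE m.id > COALESCE(p.last_read_message_id, 0)
ORDER BY m.id;
```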