- 1. Code Review: Python
- Go read this repo to get context for the overall project: https://gist.github.com/airbnb-robot/af6e9068639733bff79d4e3773a8d1dc
- There are 3 pull requests: 1, 2, 3
- They are in increasing order of complexity and difficulty.
- Pull request 1:
- This pull request is to fix the wrong type of consistency being sent through the API. The two types are "EVENTUAL" and "STRONG", but someone wrote "WEAK" and "STRONG", which caused a bug. This pull request fixes the bug.
- Things to note:
- * Use an Enum instead of the list ["EVENTUAL", "STRONG"] to check the value being sent (see the sketch after this list).
- * Ask about casing.
- * Ask the user to test all possible conditions for the enum value.
- * Python should use f-strings instead of other interpolation styles (%-formatting, .format()) wherever possible.
- * Ask the user to log whenever there's an error, or to add instrumentation and alerting. We should log the actual exception and NOT swallow it.
- * Ask whether we have overall system tests.
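- A minimal sketch of the Enum + logging suggestions (the Consistency name and parse_consistency helper are hypothetical, not from the PR):
```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)

class Consistency(Enum):
    EVENTUAL = "EVENTUAL"
    STRONG = "STRONG"

def parse_consistency(raw: str) -> Consistency:
    try:
        # Normalize casing up front so "strong" and "STRONG" behave the same.
        return Consistency(raw.strip().upper())
    except ValueError:
        # Log the actual bad value and reraise; never swallow the exception.
        logger.exception("Invalid consistency value: %r", raw)
        raise
```
- Testing "all possible conditions" then just means parametrizing a test over both enum members plus a few invalid inputs.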
- Pull request 2:
- For this PR, we add a DAO layer to fetch listings data from the listing service and host data from the hosts service. In order to do so, we first need to issue service calls to those services and then transform the raw responses into our internal DAO objects for related processing.
- * There's a bug in the transformation between the Listing and Listings objects. If you add a field to Listing that isn't in Listings, you'll get a KeyError when you do the transformation. Use dict.get(key) instead of dict[key].
- * Look at the DAOs that are used. They are frozen data classes in Python. Add slots to them; this reduces memory usage and speeds up attribute access (see the sketch after this list).
- * You should reraise exceptions and avoid ambiguous, catchall exceptions.
- * Ask the user to add unit tests for the retries.
- * The retry classes are very similar. You should ask the user to make a base class with the common functionality.
- * The type hinting for some of the data classes is incorrect. `x: datetime = None` should actually be `x: Optional[datetime] = None`.
- * Different exceptions should be handled in different ways. You should not treat every Exception the same way.
- * Suggest using async-await in some places for better performance.
- * If the listing service gives you a different number of results from what you expected, then you should log the difference between what you expected and what you got.
- * The logging should use static strings with extra parameters instead of dynamic, interpolated strings.
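- A sketch pulling several of these together (ListingDao and listing_from_response are hypothetical names; dataclass slots=True requires Python 3.10+):
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True, slots=True)  # slots reduce per-instance memory and speed up attribute access
class ListingDao:
    listing_id: int
    host_id: int
    # Optional[...] because the default is None; `deleted_at: datetime = None` lies to the type checker.
    deleted_at: Optional[datetime] = None

def listing_from_response(raw: dict) -> ListingDao:
    return ListingDao(
        listing_id=raw["id"],
        host_id=raw["host_id"],
        # .get() tolerates fields missing from the raw payload instead of raising KeyError;
        # real code would also parse the timestamp string into a datetime.
        deleted_at=raw.get("deleted_at"),
    )
```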
- Pull request 3:
- Atlantis has told us that we're calling their API too frequently, and they have a batch endpoint they'd like us to use instead. To take advantage of it when getting the status of many reservations, we will start batching up our calls to Atlantis without changing our external API. We plan to use a job queue for registrations and listings. Once the queue is large enough, or after enough time has passed, we'll pick a batch of items off the queue.
- * Leave a general comment saying that you should talk to Atlantis to get a precise rate limit or find out what the problem is. This is a Staff+ level signal.
- * Leave a comment saying that we should look for distinct listings and registrations instead of a list, which may have duplicates. This is a Staff+ level signal.
- * Ask what the traffic is from Airbnb to the city of Atlantis. Maybe this isn't the right approach?
- * Ask whether the queue is durable. Maybe you can use a different data structure instead. Maybe you can sync every few hours instead of every 500 events.
- * Use a `list(set(registration_numbers))` instead of a `list(registration_numbers)` to drop duplicate requests.
- * Do a bulk upload to YOUR databases instead of sequential updates.
- * Use better type hinting in the Python. Some methods don't have return types explicitly stated.
- * We should not have silent failures whenever we try to add jobs to the queue.
- * Run the queue using an external job instead of from the code layer.
- * The dequeuing code is not thread safe. You could try to send the same exact request from the queue multiple times.
- * Magic constants are spread throughout the file; pull them into named constants.
- * The way the queue is set up, you will have data loss after 500 events have been added. You need to fix this (see the sketch after this list).
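- One way the batcher could be structured; the class name, callback, and thresholds below are my own illustration, not the PR's code:
```python
import threading
import time
from typing import Callable

class AtlantisBatchQueue:
    """Illustrative thread-safe batcher for the Atlantis status calls."""

    MAX_BATCH = 500        # named constants instead of magic numbers
    MAX_AGE_SECONDS = 300  # flush on time as well as on size

    def __init__(self, flush_fn: Callable[[list[str]], None]) -> None:
        self._flush_fn = flush_fn
        self._items: list[str] = []
        self._lock = threading.Lock()  # one lock guards both enqueue and flush
        self._last_flush = time.monotonic()

    def add(self, registration_number: str) -> None:
        with self._lock:
            self._items.append(registration_number)
            if (len(self._items) >= self.MAX_BATCH
                    or time.monotonic() - self._last_flush >= self.MAX_AGE_SECONDS):
                self._flush_locked()

    def _flush_locked(self) -> None:
        # Deduplicate, then hand off and clear the buffer so nothing is
        # dropped once the 500-item threshold is crossed.
        batch = list(set(self._items))
        self._items = []
        self._last_flush = time.monotonic()
        self._flush_fn(batch)  # failures here must be logged and surfaced, not silent
```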
- 2. System Design
- This interview was highly structured. The interviewer began by talking a lot, and I started asking questions about what he was saying. He got a little confused or frustrated, and then told me he would just paste the question. His English was also pretty bad.
- * Create a group chat
- * Create an inbox, load the top 10 group chats that have the most recent messages for the user
- * Send a message to the group chat
- * Show messages of a group chat
- I asked a lot of questions like:
- - Do we need to delete chats for users if they request account deletion?
- - Do we need to consider security or admin tooling for the chat?
- He told me not to worry about any of that and just answer the 4 questions.
- He was very interested in my data schema.
- Cassandra and similar databases are just a waste. You don't need the write throughput.
- The main question is SQL vs Non-Relational. He said the traffic is about 1 million group chats per year; assume 20 messages per group chat. That's 20 million messages per year, which is roughly 55,000 messages / day, or about 2,300 messages / hour.
- SQL can definitely handle that. BUT you don't need the strong consistency. It's probably better to have high availability and easier horizontal scaling, so I went with DynamoDB.
- My interviewer had ABSOLUTELY NO IDEA how DynamoDB worked. He didn't know about partition keys, sort keys, or secondary indices. I basically wrote my data schema AS IF it were SQL, which felt strange.
- He asked me to walk through what happens in my system for each of the 4 workflows.
- Group chat - easy.
- Just create an entry in your group_chat table and corresponding entries in your chat_user table.
- Create an inbox and load the top 10 group chats.
- Simple - when the user logs in, you just query the top 10 most recently updated chats.
- Consider this schema
- group_chat
- id: INT PK
- created_at: datetime
- last_modified_at: datetime (idx)
- [Can also use a DB trigger to update last_modified_at.]
- chat_user
- id: INT PK
- user_id: INT (idx, FK references user.id, ON UPDATE CASCADE, ON DELETE CASCADE)
- group_chat_id: INT (idx, FK references group_chat.id, ON UPDATE CASCADE, ON DELETE CASCADE)
- created_at: datetime
- message
- id: INT PK
- group_chat_id: INT (idx, FK references group_chat.id, ON UPDATE CASCADE, ON DELETE CASCADE)
- sender_id: INT (idx, FK references user.id, ON UPDATE CASCADE, ON DELETE CASCADE)
- message: varchar(500)
- created_at: datetime
- Whenever you write a message, use a producer-consumer model. The consumer updates that group chat's last_modified_at timestamp as part of handling the write.
- Send a message in the group chat.
- I asked him whether we expect to send the message to the users who are online, or whether it should just show up next time the user refreshes their page. He said we want to push it.
- Simple - I immediately said there were two paths:
- 1. Saving the message in the database and making it ready to load for the people who were NOT CURRENTLY ONLINE.
- 2. Pushing the message to active users.
- He told me to ignore the second path. It was strange, because that one is pretty important.
- For #1, do the following
- User ----message----> [Chat Server] ---message---> Kafka ---message---> Consumer
- The consumer (1) writes the message to the message table, (2) updates the group_chat table, and (3) delivers the message to the other online users (he said to ignore this) using at-least-once delivery.
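- A toy version of that consumer, with in-memory structures standing in for the two tables (illustrative only):
```python
import time

# Stand-ins for the message and group_chat tables.
message_table: list[dict] = []
group_chat_table: dict[int, dict] = {1: {"last_modified_at": 0.0}}

def handle_message_event(event: dict) -> None:
    message_table.append(event)  # (1) write the message
    group_chat_table[event["group_chat_id"]]["last_modified_at"] = time.time()  # (2) bump the chat
    # (3) fan-out to online users would go here (ignored per the interviewer)

handle_message_event({"group_chat_id": 1, "sender_id": 42, "message": "hi"})
```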
- Show messages of a group chat
- This is an easy query when you click a group chat. It just loads the most recent messages.
- Follow up:
- What if we want to create groups of users and refer back to the existing groups whenever the same group makes another reservation?
- Simple.
- New Tables:
- group
- id: INT PK
- signature: text (idx)
- created_at: datetime
- group_member
- id: INT PK
- user_id: INT (idx, FK references ...)
- group_id: INT (idx, FK references group.id)
- created_at: datetime
- Sort the user_ids and make a key out of the concatenation. This will be the group.signature.
- E.g. if users 1, 2, 8 are in the group, then the key is "1,2,8"
- If you want to see if users 2, 8, 1 have ever formed a group before, then you look for the key
- "1,2,8"
- Then change the chat_user table to reference groups instead of users.
- Follow Up 2:
- How do you deliver the messages to users who are actively online?
- Go look at any system design for WhatsApp. Just use Redis + chat servers + messaging queues.
- Follow Up 3:
- How do you scale the system if you're creating 1,000,000 group chats per day?
- DynamoDB horizontally scales REALLY WELL. Etc. etc. His English was very bad and I don't think he was even paying attention.
- ****************************
- 3. Project Deep Dive
- I guess the interviewer customizes all his questions based on the project you choose.
- Talk about a big project you were a part of.
- Why were you chosen to lead it?
- What were the steps you chose to implement it and why?
- Talk about the other teams you worked with.
- Talk about implementation details.
- How long did it take, how many people did you lead, and what were the challenges?
- How did you define milestones and distribute work?
- How do you delegate work and make people successful vs. managing the overall timeline?
- What challenges did you have?
- What did you learn?
- What compromises did you have to make with the product manager's vision?
- What issues did you have with other teams?
- How did you maintain high quality, especially during the rollout?
- (Testing, Monitoring, alerting, rollout strategy)
- We ran out of time because he kept asking follow ups on things he wanted to know about.
- The recruiter told me I got a NO on this round because I didn't talk about my individual contributions enough. The manager seemed like he wanted to steer the interview and he never asked me about that, just a bunch of other stuff.
- ******************************
- 4. Coding
- You are given an array like [5, 4, 3, 2, 1, 3, 4, 0, 3, 4]
- Part 1:
- Print a terrain where each number represents the height of a column at that index.
- +
- ++    +  +
- +++  ++ ++
- ++++ ++ ++
- +++++++ ++
- ++++++++++ <--- base layer
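- One way to print it (my own sketch, not an official solution): walk the levels from the tallest column down to 1, then print the base row. The water parameter is unused here but lets the same renderer handle part 2.
```python
def render(terrain: list[int], water: list[int] | None = None) -> None:
    water = water or [0] * len(terrain)
    top = max(t + w for t, w in zip(terrain, water))
    for level in range(top, 0, -1):
        row = ""
        for t, w in zip(terrain, water):
            if t >= level:
                row += "+"      # ground
            elif t + w >= level:
                row += "W"      # standing water (part 2)
            else:
                row += " "
        print(row.rstrip())
    print("+" * len(terrain))   # base layer

render([5, 4, 3, 2, 1, 3, 4, 0, 3, 4])
```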
- Part 2:
- Imagine we drop a certain amount of water at a certain column. The water can flow in whichever direction makes sense. Print the terrain after all the water has fallen.
- dumpWater(terrain, waterAmount=8, column=1)
- Should render
- +
- ++WWWW+  +
- +++WW++ ++
- ++++W++ ++
- +++++++W++
- ++++++++++ <--- base layer
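- I'd simulate one unit at a time, in the spirit of LeetCode 755 "Pour Water": each unit first tries to settle at the lowest reachable spot to the left, then to the right, otherwise it stays at the drop column. That reading of "flows in whichever direction makes sense" is my assumption, but it reproduces the expected render above (function renamed to snake_case):
```python
def dump_water(terrain: list[int], water_amount: int, column: int) -> list[int]:
    """Return per-column water depths after dropping water_amount units at column."""
    levels = terrain[:]  # terrain plus accumulated water
    for _ in range(water_amount):
        placed = False
        for step in (-1, 1):  # try flowing left first, then right
            i = best = column
            # Walk while the neighbor is not higher, remembering the lowest spot seen.
            while 0 <= i + step < len(levels) and levels[i + step] <= levels[i]:
                i += step
                if levels[i] < levels[best]:
                    best = i
            if levels[best] < levels[column]:
                levels[best] += 1
                placed = True
                break
        if not placed:
            levels[column] += 1  # nowhere lower to flow; settle in place
    return [lv - t for lv, t in zip(levels, terrain)]

terrain = [5, 4, 3, 2, 1, 3, 4, 0, 3, 4]
water = dump_water(terrain, water_amount=8, column=1)
# render(terrain, water) from the part 1 sketch prints the picture above.
```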