Building RakuMend, a Conversational Recommendation Engine.
Post-hackathon writeup: RAG & making instruct-tuned LLMs behave on super-long context lengths.
So I won (second place, $3.6K) a hackathon a couple of months back in September. It was organised by Rakuten India and had generative AI as its theme. I finally got time to do a technical writeup.
Note: The LLM landscape has changed quite a lot since September, when I built this, so some of the discussion in this post might no longer be completely relevant.
One of the two products I built was RakuMend, a conversational shopping and recommendation engine powered by GPT-3.5, with Google’s Vertex AI handling image captioning (GPT-4 Vision wasn’t a thing at that time).
Rakuten has a global website, rakuten.com, which is basically a brand aggregator, and they get a cut every time a purchase is made through it.
I wanted to build a recommendation system around it: one that takes a natural language query (e.g., “I’m going on a work trip this weekend; suggest something comfy yet semi-formal”) and returns product recommendations from the inventory.
V1: ViT/CLIP + Cosine Similarity/Vector Search
The first version was powered by a simple vector search engine. I scraped approximately 10,000 products from Rakuten Ichiba (their Japanese counterpart), divided into two product categories: tops and bottoms. I used CLIP (OpenAI’s image–text embedding model) to get image vectors and ran a cosine similarity search to find matching outfits in the DB. I used Pinecone for my vector DB, but really, a NumPy array would have worked just fine.
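Here’s a minimal sketch of that version; the model name, file paths, and brute-force index are placeholders, and the real version kept the vectors in Pinecone rather than in memory:

```python
# Minimal sketch of V1: embed product images with CLIP, then brute-force
# cosine-similarity search. Model name and file paths are placeholders.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> np.ndarray:
    """Return a unit-normalised CLIP embedding for one product image."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        vec = model.get_image_features(**inputs)[0].numpy()
    return vec / np.linalg.norm(vec)

def embed_text(query: str) -> np.ndarray:
    """Return a unit-normalised CLIP embedding for a text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        vec = model.get_text_features(**inputs)[0].numpy()
    return vec / np.linalg.norm(vec)

# ~10k scraped product images (tops + bottoms); paths are placeholders.
product_paths = ["tops/0001.jpg", "bottoms/0002.jpg"]
index = np.stack([embed_image(p) for p in product_paths])

def search(query: str, k: int = 5) -> list[str]:
    """Cosine similarity is just a dot product here, since every vector is unit-normalised."""
    scores = index @ embed_text(query)
    return [product_paths[i] for i in np.argsort(-scores)[:k]]
```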
It worked “fine”. It was too slow to be used on any production workload with hundreds of categories, and the recommendations weren’t that great, tbh.
So onto the next version.
V2: GPT Functions and Full Text Search
Not having access to the whole product catalog was a bummer. But Rakuten’s full text search is really good.
So my idea was to leverage FTS with GPT function calling: get product ideas from GPT, generate search queries from them, scrape the top 5 results for each, and display them.
This was around the time OpenAI unveiled function calling and models specifically fine-tuned for it.
A note about GPT function calling: it’s not intuitive at all and is quite unreliable. I don’t know anyone using function calling in production. I recommend writing your own parser: run a simple regex over the returned (streamed) responses and call the function manually in Python.
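Something along these lines; the SEARCH("...") convention is purely illustrative and would be set up in the system prompt:

```python
# Illustrative version of the "write your own parser" approach: instruct the model
# (via the system prompt) to emit SEARCH("...") whenever it wants product results,
# then regex those calls out of the response and dispatch them yourself.
import re

CALL_PATTERN = re.compile(r'SEARCH\("([^"]+)"\)')

def rk_search(query: str, n: int = 5) -> list[dict]:
    """Full-text product search against Rakuten (the scraping module is shown later)."""
    ...

def execute_calls(llm_response: str) -> dict[str, list[dict]]:
    """Find every SEARCH("...") the model emitted and call the function manually."""
    return {query: rk_search(query) for query in CALL_PATTERN.findall(llm_response)}

# e.g. execute_calls('Pair SEARCH("slim beige chinos") with SEARCH("navy oxford shirt").')
```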
This is almost retrieval-augmented generation, except we use full-text search instead of vector similarity to find the top-k results:
This is the final product I demoed at the hack.
How it works and the code:
This is the scraping module that searches Rakuten’s website and returns the top n products for a given query:
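A sketch of that module (the search URL and CSS selectors are placeholders; Rakuten’s real markup differs and changes over time):

```python
# Sketch of the scraping module. The search endpoint and CSS selectors below
# are illustrative placeholders, not the exact ones used against rakuten.com.
import requests
from bs4 import BeautifulSoup

def rk_search(query: str, n: int = 5) -> list[dict]:
    """Run Rakuten's full-text search for `query` and return the top-n products."""
    resp = requests.get(
        "https://www.rakuten.com/search",          # placeholder endpoint
        params={"q": query},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    products = []
    for card in soup.select("div.product-card")[:n]:   # placeholder selector
        products.append({
            "title": card.select_one(".product-title").get_text(strip=True),
            "price": card.select_one(".product-price").get_text(strip=True),
            "image": card.select_one("img")["src"],
            "url": card.select_one("a")["href"],
        })
    return products
```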
This is the FastAPI endpoint that handles all the messages as well as users’ input images:
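A sketch of the endpoint; the exact request schema here is approximate, and the helpers it calls are the modules sketched further down:

```python
# Sketch of the chat endpoint. The request fields mirror the description below
# (prev_messages joined with "%", age/gender preferences, the high_def flag),
# but the exact schema is approximate.
from fastapi import FastAPI, File, Form, UploadFile

# Module paths below are placeholders for how things were split up.
from rakumend.captioning import caption_image
from rakumend.parser import parse_and_replace
from rakumend.responder import gpt_respond

app = FastAPI()

@app.post("/chat")
async def chat(
    message: str = Form(...),
    prev_messages: str = Form(""),        # earlier turns, joined with "%"
    age: int = Form(...),                 # user preferences sent with every query
    gender: str = Form(...),
    high_def: bool = Form(False),         # caption via Bard's web API instead of Vertex
    image: UploadFile | None = File(None),
):
    caption = None
    if image is not None:
        caption = await caption_image(await image.read(), high_def=high_def)

    reply = gpt_respond(message, prev_messages.split("%"), age, gender, caption)
    return parse_and_replace(reply)       # structured products for the frontend to render
```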
The high_def flag is one I added for when I wanted image captioning results from Bard’s reverse-engineered web API instead of Vertex AI’s PaLM API.
Previous messages are separated by a % delimiter. Each query also carries the user’s preferences (age and gender) to give the LLM more context about the user.
This is the GPT responder module: it parses the current user query and the previous queries, appends the system prompt, and passes the whole thing to the GPT-3.5 API. The LLM can be hot-swapped.
Important note: the key to making instruction-tuned models keep following instructions at long context lengths is to append the system prompt again at the end of the dialog chain. This has to do with the LLM architecture and how the attention heads behave over long contexts. Depending on the model you might not need it at all (Mistral models, for example, are pretty good at this). Figuring this out wasted a whole lot of my time at the hack.
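A sketch of the responder with that trick applied; it assumes the pre-1.0 openai client that was current at the time, and the actual system prompt is shown at the end of the post:

```python
import openai  # pre-1.0 client; openai.api_key is assumed to be set elsewhere

SYSTEM_PROMPT = "..."  # the actual prompt is at the end of the post

def gpt_respond(message, history, age, gender, caption=None, model="gpt-3.5-turbo"):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    # Previous turns arrive as one string split on "%"; role bookkeeping is simplified here.
    for turn in history:
        if turn.strip():
            messages.append({"role": "user", "content": turn.strip()})

    user_turn = f"[user: {age}, {gender}] {message}"
    if caption:
        user_turn += f"\nAttached image (captioned): {caption}"
    messages.append({"role": "user", "content": user_turn})

    # The trick from the note above: repeat the system prompt at the end of the
    # chain so the instructions survive long contexts.
    messages.append({"role": "system", "content": SYSTEM_PROMPT})

    resp = openai.ChatCompletion.create(model=model, messages=messages)
    return resp["choices"][0]["message"]["content"]
```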
Parse-and-replace module: it regexes the GPT response and returns a structured output after searching for each term using the rk_search module. This output is then rendered by my frontend.
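A sketch of it, reusing the illustrative SEARCH("...") convention from the function-calling note above:

```python
# Sketch of the parse-and-replace module: pull every search term out of the GPT
# response, run each through rk_search, and return a structured payload for the
# frontend, with the markers stripped from the chat text.
import re

from rakumend.search import rk_search  # the scraping module above; module path is a placeholder

CALL_PATTERN = re.compile(r'SEARCH\("([^"]+)"\)')

def parse_and_replace(llm_response: str) -> dict:
    """Turn the raw GPT reply into a payload the frontend can render."""
    recommendations = [
        {"term": term, "products": rk_search(term, n=5)}
        for term in CALL_PATTERN.findall(llm_response)
    ]
    # Replace each SEARCH("...") marker with the bare term so the reply reads as plain chat.
    reply = CALL_PATTERN.sub(lambda m: m.group(1), llm_response)
    return {"reply": reply, "recommendations": recommendations}
```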
Image captioning: I added a feature that lets users attach their own outfits and clothing items to the chat and ask the LLM for recommendations around them.
I used Google’s PaLM API (and Bard’s reverse-engineered API) to generate highly detailed image captions, which were then fed to my LLM.
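Structurally it looked something like this; vertex_caption and bard_caption are hypothetical stand-ins for the real clients, whose exact calls I’m not reproducing here:

```python
# Sketch of the captioning helper used by the /chat endpoint above. The two
# backend wrappers are hypothetical stand-ins for the real clients (Vertex AI's
# captioning model and the reverse-engineered Bard web API).
async def vertex_caption(image_bytes: bytes) -> str:
    """Call Vertex AI for a detailed caption (actual client call elided)."""
    raise NotImplementedError

async def bard_caption(image_bytes: bytes) -> str:
    """Call the reverse-engineered Bard web API (actual client call elided)."""
    raise NotImplementedError

async def caption_image(image_bytes: bytes, high_def: bool = False) -> str:
    """Return a highly detailed caption of the attached outfit for the LLM to reason over."""
    backend = bard_caption if high_def else vertex_caption
    return await backend(image_bytes)
```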
Here’s a demo of the feature:
Last but not least, these were my prompts:
Here’s another demo vid of the complete project:
This thing, coupled with another product I built (RakuTry, a virtual try-on extension [writeup soon]), was enough to get me second place (first went to an internal team, which is the usual outcome tbh).
GG’s