Read below for details on my approaches, results and learnings
Contents
- Why build this in the first place?
- Possible solutions
- Implementation
- Results
- Conclusions
📣 Update: If anyone was doubting just how fast things are moving with AI, OpenAI has already moved to make it much easier to do Q&A on documents its models haven't been trained on. While not available to the public at the time of writing, Isabella Fulford's plugin allows ChatGPT to search and retrieve snippets from your personal or organisational documents. It uses the text-embedding-ada-002 model to generate embeddings and stores them in vector databases for efficient search and retrieval. It's awesome to see the pace at which new tools are being developed and launched.
1. Why build this in the first place?
OpenAI's ChatGPT has demonstrated an impressive range of use-cases, from content creation to text summarisation. As the AI arms race continues to escalate, Large Language Models (LLMs) are only going to get better. They will be trained on the as-yet-unplundered portions of the public internet, with future iterations boasting billions more tunable parameters. In this context, it seems size indeed matters; the larger the training data and parameter count, the more nuanced and sophisticated the generated outputs have been.
However, what will really drive these models to become more potent and valuable is personalisation, specifically being able to train them on proprietary data. Imagine a generative AI that could craft drafts in a writer's unique voice rather than emulating a more famous author. Or, an AI that could conceptualise an artist's ideas in their distinctive style, rather than producing derivative pastiches of others' works that invite accusations of plagiarism. Or, for large companies, an AI that could distill the wealth of information on their intranets and internal wikis into an FAQ for new employees. These possibilities hold tremendous appeal.
So, just how easily can language models be configured to answer questions on a specific topic they have not been explicitly trained on?
For example, GPT knows nothing about me. My digital footprint is a teeny-tiny speck on the internet. It's very likely that information about me on social media, Kaggle, or my portfolio website was either overlooked entirely or simply too scant and fragmented to influence GPT's training process.
Here's ChatGPT driving the final nail in the coffin:
2. Possible solutions
Let's take a step back and consider our starting point. ChatGPT has already learned general linguistic patterns and representations from its extensive training, so it can actually take a decent stab at answering questions about things it hasn't explicitly seen before. This is known as "zero-shot learning" but, as we saw just now, this approach doesn't perform well for specific tasks that require domain-specific knowledge (in this case, about me).
To overcome this obstacle, we need to provide the model with new training or examples. One way to achieve this is through fine-tuning, where we take an existing pre-trained model and train it further on additional labelled data from a specific domain, allowing it to adjust its parameters as it learns. The end result is a new, slightly more specialised version of the base model. Compared to zero-shot learning, however, fine-tuning is much more data-intensive: we'd need a significant amount of labelled data about me, which we don't have.
What we're going to do instead is few-shot learning, in which the model is provided a relatively small amount of information about a topic before being asked the question. Under this method, we could provide a few paragraphs of information about me: my job, education, and interests, and then ask "What are Sai's hobbies?" In this case, the model would use the small amount of information it has been given, combined with its understanding of language, to generate a more accurate and informed response about me.
There are several approaches we can take to few-shot learning. In the next section, we'll explore the two methods I tested.
3. Implementation
Approach 1: PHP + Raw OpenAI API calls
My first working version was a simple webpage which takes a user's question and passes it to a PHP file that makes a call to an OpenAI API endpoint. I found a good template on GitHub which makes a cURL request to the Completions API ('https://api.openai.com/v1/completions'), but I adapted it to make a request to the Chat Completions API ('https://api.openai.com/v1/chat/completions').
The Completions API endpoint is suited to general language processing tasks, such as text completion and summarisation, while the Chat Completions API endpoint is specifically designed for conversational AI applications. While the former offers a high degree of flexibility in terms of input and output, the Chat Completions API is fine-tuned on conversational data and produces more natural-sounding responses to open-ended questions or prompts, which is ideal for the type of bot I had in mind.
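To make the difference concrete, here's roughly what the two request shapes look like, sketched in Python with the requests library for brevity rather than PHP/cURL (the model choices, prompt and question strings are just placeholders):

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# Completions API: a single free-form prompt string
completion = requests.post(
    "https://api.openai.com/v1/completions",
    headers=headers,
    json={"model": "text-davinci-003", "prompt": "Summarise the following text: ..."},
).json()

# Chat Completions API: a list of role-tagged messages
chat_completion = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Who is Sai Ulluri?"}],
    },
).json()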
One important difference with the Chat Completions API is that it allows you to specify a chain of messages as a preamble to any user input. Moreover, you can "prime" the model with instructions from the "system". Here's an example:
$messages = array(
    array(
        "role" => "system",
        "content" => "You are a kind, helpful assistant who answers questions about Sai Ulluri and his website."
    ),
    array(
        "role" => "user",
        "content" => $user_question
    )
);
Here, we're giving the model a clear definition of the role we want it to play before giving it the user's question. This makes it more likely to give relevant answers.
Notice above where we inject the $user_question into the $messages array. The key trick here is that we don't need to pass in only the user's question. We can embellish it with extra information to help the model answer the question. Assuming that the question will be somewhat related to me or the website, we want to pass along as much relevant information about me as possible (there are limits to this, as we'll explore soon).
What follows is a particularly quick-and-dirty implementation of the few-shotting approach, but it worked surprisingly well. Here's what I did:
I grabbed all the text from the handful of pages on my website along with the text from my CV/resume and stuck it in a .txt file.
Then, each time a user submits a question, the PHP file appends the question to the full corpus of information from the .txt file:
$sai_info = "Sai Ulluri is... blah blah blah... He went to school in... blah blah blah... His favourite food is ... blah blah blah ...";

$messages = array(
    array(
        "role" => "system",
        "content" => "You are a kind, helpful assistant who answers questions about Sai Ulluri and his website."
    ),
    array(
        "role" => "user",
        // Concatenate the context and the question using PHP's "." string operator
        "content" => "Context: " . $sai_info . " Based on the given context, answer the following question: " . $user_question
    )
);
This is the actual $messages array that gets sent off to the Chat Completions endpoint. We're "faking" that the user themselves is (hopefully) providing the answer to their own question via the context given in $sai_info.
One downside, though, is that we might also be passing in a lot of information that isn't relevant, leading to potentially lower-quality responses. If a user is asking "What camera does Sai use?", it's hardly helpful to be passing in a wall of text with details about my work experience or my artistic process for drawing.
Another consideration is cost: currently, for the gpt-3.5-turbo model that powers s.ai 🤖, OpenAI charges an admittedly paltry-sounding $0.002 per 1k tokens (~750 words), counting both what gets passed into the API and what comes back as the answer. While I'm unlikely to be breaking the bank running a modest application on my webpage (teeny-tiny speck, remember?), token count and cost become more important to manage for enterprise use-cases.
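For a rough sense of scale, here's a quick back-of-the-envelope estimate of what a single question costs under this approach, sketched in Python with tiktoken (the filename and question are placeholders, and it ignores the tokens in the model's reply):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# The full corpus that gets passed along with every single question
context = open("sai_info.txt").read()
question = "What camera does Sai use?"

prompt_tokens = len(enc.encode(context + question))
estimated_cost = prompt_tokens * 0.002 / 1000  # $0.002 per 1k tokens
print(f"{prompt_tokens} tokens ~ ${estimated_cost:.5f} per question")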
Approach 2: LangChain, Vector Store + Flask App
We can do better.
A more elegant few-shotting solution would be to utilise only the most relevant information about me to answer each question.
This second way to tackle the problem makes use of a set of tools in the LangChain library by Harrison Chase. LangChain makes it easier to build on top of and get the most out of language models. For example, you can chain multiple requests to a model to get it to articulate its "chain of thought" before it provides a final answer. Another helpful feature is the ability to create Prompt Templates that allow you to specify instructions to the model or embellish a user's question, streamlining what we did before in the PHP solution. Here's what it might look like in Python using LangChain:
from langchain.prompts import PromptTemplate

prompt_template = """
Answer as factually as possible.
If you do not know the answer, are not permitted to share that information, or if the question is not related to Sai or his website, say:
'I'm sorry, I don't know that information, feel free to ask Sai directly via email (sai@ulluri.com)'.
{context}
Question: {question}
Answer: """

# Create a PromptTemplate object with the prompt template and input variables
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
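To sanity-check the template, we can fill it in with a made-up context and question and inspect exactly what the model will receive:

# Preview the fully-formatted prompt (the context and question here are just placeholders)
print(prompt.format(
    context="Sai is a data scientist who also draws portraits and takes travel photos.",
    question="What does Sai do?"
))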
One final feature of LangChain that is particularly useful for the task at hand is how easily it integrates with vector stores; these are a compact way of storing information that makes search and retrieval much more efficient.
Imagine you're in a massive library of books and you're asked to find a specific piece of information. But, there's one problem. The books are not organised in any particular way and there's no central place where everything is catalogued. You'd have to go through each and every book until you found it. This would take a long time and might not even yield the results you want. This is where vector stores come in. They're like an index or catalog for your library. They contain important information about each book, such as the title, author, subject, and even specific keywords that describe the book's content. Now, when you're looking for a specific piece of information, you can consult the vector store to quickly locate the most relevant books.
In the context of large language models, vector stores work in a similar way. Instead of books, you might have a massive dataset of text documents, and instead of an index, you'd have a vector store that contains important information about each document, such as its topic, keywords, and other relevant metadata. This vector store can be used to efficiently search and retrieve the most relevant documents to answer questions over a custom dataset.
To create a vector store from a set of documents, in this case a bunch of .txt files with the information from my website and my CV/resume, we first need to split each document up into semantically related chunks of text using LangChain's TextSplitter. For example, one .txt file might have summaries I've written up about countries I've travelled to. The text splitting process might chunk these into paragraphs of text for each country.
from langchain.text_splitter import CharacterTextSplitter

# Split each document (one string per .txt file) into ~200-token chunks
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, separator="\n")

docs, metadatas = [], []
for i, d in enumerate(data):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": f"doc-{i}"}] * len(splits))  # record which source document each chunk came from
Then, these text chunks need to be converted into vectors using LangChain's Embeddings class. This process works by representing each chunk of text as a high-dimensional vector. The vectors are constructed in such a way that pieces of text that are similar in meaning are represented as similar vectors. As a final step, we can save our vector store in a compressed format such as a pickle file.
import pickle
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
with open("faiss_store.pkl", "wb") as f:
    pickle.dump(store, f)
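As a quick sanity check, we can query the store directly and see which chunks it considers most relevant to a question (the question here is just an example):

# Retrieve the three chunks most semantically similar to the question
relevant_chunks = store.similarity_search("What camera does Sai use?", k=3)
for chunk in relevant_chunks:
    print(chunk.page_content[:100], "...")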
With the corpus of information about me neatly packaged, all that's left to do is write a simple Flask app that takes in user questions via the chat interface, passes them along to our vector-store-enhanced language model and returns a response back.
What's more, we only have to create the vector store once, ahead of time, and it'll be there ready for any number of user questions.
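I haven't shown how the chain behind the app (vdb below) gets built, so here's a minimal sketch of one way to do it: loading the pickled FAISS store and wrapping it in a RetrievalQA chain that reuses the custom prompt from earlier. Treat the specifics (RetrievalQA, "stuff" chain type, temperature) as one plausible configuration rather than the only option.

import pickle

from flask import Flask, render_template, request
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

app = Flask(__name__)

# Load the FAISS store we pickled earlier (created once, ahead of time)
with open("faiss_store.pkl", "rb") as f:
    store = pickle.load(f)

# Wrap the store in a question-answering chain that uses the custom prompt
vdb = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)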
# Define a route for the index page
@app.route('/')
def index():
    return render_template('index.html')

# Define a route for a user submitting a question
@app.route('/submit', methods=['POST'])
def submit():
    query = request.form['input']        # get user's question
    response = get_answer(vdb, query)    # query the vector database
    return response

# Run the Flask app
if __name__ == '__main__':
    app.run()
When we run the query against the vector store, we can also return some additional metadata using get_openai_callback(), such as the number of tokens consumed by the user's question and the answer. From this, we can estimate how much that interaction cost us:
from langchain.callbacks import get_openai_callback

def get_answer(vector_db, query):
    """Queries a vector database with a query, retrieves output, calculates token count and cost,
    and returns a response dictionary."""
    with get_openai_callback() as cb:
        output = vector_db(query, return_only_outputs=True)
        response = {
            'answer': output['result'],                         # query output
            'token_count': cb.total_tokens,                     # token count returned by callback
            'cost': round(cb.total_tokens * 0.002 / 1000, 5)    # query cost ($0.002 per 1k tokens)
        }
    return response
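Putting it together, a question submitted through the chat interface flows through get_answer like this (example question only):

response = get_answer(vdb, "What camera does Sai use?")
print(response['answer'], response['token_count'], response['cost'])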
4. Results
So, which approach was better? Let's answer this on the basis of three metrics:
- Cost - just how much are we saving by using the vector store?
- Speed - is there a significant increase in latency due to the extra steps involved in the vector store approach?
- Quality - does restricting the information passed in prevent the model from giving good answers to multi-topic questions?
Let's see how it does on a few questions about me...
"Can I commission a drawing from Sai?"
| Approach | Response | Tokens used | Total Cost | Speed (secs) |
|---|---|---|---|---|
| Raw API Call | "Yes, Sai is currently open for new commissions for portrait artwork. You can inquire about pricing information by email." | 2462 | $0.00492 | 1.47 |
| LangChain Vector Store | "Yes, Sai is currently open for new commissions and you can inquire about pricing information by email at sai@ulluri.com." | 740 | $0.00148 | 1.90 |
⮕ Straight off the bat, it's clear that using a vector store significantly reduces the number of tokens that are used in this request. The speed is a little worse though.
"Tell me about his trip to Kenya"
| Approach | Response | Tokens used | Total Cost | Speed (secs) |
|---|---|---|---|---|
| Raw API Call | "Sai went to Kenya for two weeks in the summer of 2019. He split his time fairly evenly between the hustle and bustle of Nairobi, the vast expanse of the Maasai Mara, and the beautiful beaches of Diani on the coast. He met Maxwell, their adopted Black rhinoceros, at the Sheldrick Wildlife Orphanage, saw wild cheetahs for the first time, and found themselves surrounded by monkeys and duikers on a walk through Karura Forest. He would love to go back, particularly to visit Amboseli for its stunning views of Mt. Kilimanjaro." | 2568 | $0.00514 | 5.35 |
| LangChain Vector Store | "Sai went to Kenya for two weeks in the summer of 2019. He split his time between Nairobi, the Maasai Mara, and Diani on the coast. He visited the Sheldrick Wildlife Orphanage and met their adopted Black rhinoceros, saw wild cheetahs, and went on a walk through Karura Forest where he was surrounded by monkeys and duikers. He would love to go back to visit Amboseli for its stunning views of Mt. Kilimanjaro." | 811 | $0.00162 | 5.76 |
⮕ Again, both models give very similar responses, but the LangChain method is significantly cheaper.
"If Sai were a pizza topping, what would he be?"
| Approach | Response | Tokens used | Total Cost | Speed (secs) |
|---|---|---|---|---|
| Raw API Call | "I'm sorry, I don't know that information, feel free to ask Sai directly via email (sai@ulluri.com)." | 2471 | $0.00494 | 1.70 |
| LangChain Vector Store | "I'm sorry, I don't know that information, feel free to ask Sai directly via email (sai@ulluri.com)." | 776 | $0.00155 | 1.82 |
⮕ This question is here to test the models' willingness to admit they don't know something. There's nothing in the training data about what kind of pizza topping I'd be, so it's nice to see that both models refuse to make something up. Sorry folks, this is one secret I'll take to the grave. Or, you know, you could just email me like s.ai suggests...
"What type of photography does Sai do?"
| Approach | Response | Tokens used | Total Cost | Speed (secs) |
|---|---|---|---|---|
| Raw API Call | "Sai does travel photography and has a Nikon D750 camera. He also does portrait artwork with graphite and colored pencils." | 2464 | $0.00493 | 2.20 |
| LangChain Vector Store | "Sai does photography related to his travels and has photos from his visits to India, Italy, Zambia, Kenya, Japan, Peru, Norway, China, and Indonesia. He uses a Nikon D750 camera." | 755 | $0.00151 | 2.13 |
⮕ Here, the raw API call method actually seems to suffer a little from having too much irrelevant information injected along with the user's question. The question is squarely about my photography, but half of its answer is about my work as a portrait artist! The LangChain vector store, on the other hand, has better delineated the information about my photography from the information about my drawing.
"Tell me about Sai's recent data science projects and also in which year he went to university"
| Approach | Response | Tokens used | Total Cost | Speed (secs) |
|---|---|---|---|---|
| Raw API Call | "Sai's recent data science projects include text summaries with ChatGPT, building a wildlife database, a chessboard heatmap, and dinosaur species discoveries. He went to university from 2013 to 2016." | 2494 | $0.00499 | 2.29 |
| LangChain Vector Store | "Sai's recent data science projects include "Text summaries with ChatGPT" and "Building a wildlife database". He went to university in 2013-2016." | 1322 | $0.00264 | 2.27 |
⮕ If a user asks a two-in-one question like this that straddles distinct topics, we might expect a better answer from the raw API call approach, where everything about me is passed alongside the question. Yet the vector store approach demonstrates its versatility here, providing a clear answer to both parts of the user's query despite the answers being stored in separate chunks.
5. Conclusions
I had a lot of fun making this chatbot and learnt a ton.
Overall, across the five questions we tested, the LangChain + vector store approach resulted in a cost saving of 65% compared to raw API calls!
There's obviously so much more that could be done to tweak the LangChain model's performance, such as specifying clearer instructions about the model's role at the prompting stage or testing the impact of different chunk sizes during the creation of the vector store. Another improvement would come from incorporating persistent memory into the assistant, allowing it to reference its chat history and provide more tailored answers to future questions.
I also recognise that the information about me that I was passing along to the models was still limited. For an enterprise use-case, it would be crucial to test how well the system performs with hundreds, if not thousands, of documents to ensure that latency is manageable.
One improvement I'd love to make in the future is to extend the model to embed different types of data. There's only so much text data about me on my website, but my drawings and photos are a much richer source of information. An ability to respond to user requests with images directly sourced from my photo library would be an incredible addition to the chatbot's capabilities. Such multimodal capabilities are on the horizon for these models, and I'm really excited to see what we can build with them in the coming months and years.