Chatbot Using Langchain#

Introduction#

A chatbot usually consists of several major components:

  • chat model: this refers to the chat model interface in LangChain, i.e., the underlying large language model.

  • prompt template: Prompt templates make it easy to assemble prompts that combine default messages, user input, chat history, and (optionally) additional retrieved context (see the sketch right after this list).

  • memory: this refers to how the chatbot keeps track of previous conversational exchanges.

  • retriever (optional): These are useful if you want to build a chatbot with domain-specific knowledge.
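
To preview how these components fit together, here is a minimal sketch (not part of the original notebook, and no model call involved) of a prompt template that combines a default system message, the accumulated chat history, and the latest user input. The system message and the toy history below are made up for illustration.

from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema import AIMessage, HumanMessage

## A prompt template with a default system message, a slot for
## the chat history, and a slot for the latest user input
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

## Fill the template with a toy history and a new user question
prompt.format_messages(
    history=[HumanMessage(content="Hi!"), AIMessage(content="Hello! How can I help?")],
    input="What can you do?",
)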

Loading Environment Variables#

## Set env var OPENAI_API_KEY or load from a .env file:

from dotenv import load_dotenv
load_dotenv('/Users/alvinchen/.env')
True

Quickstart for a Chatbot#

  • With a plain chat model, we can get chat completions by passing one or more messages to the model. The chat model will respond with a message.

  • The chat model interface is based around messages rather than raw text. Of particular relevance to our task are the following types: AIMessage, HumanMessage, SystemMessage.

    • AIMessage: These are messages generated by the AI system, i.e., the chat model's responses. Whatever the model returns within LangChain comes back wrapped as an AIMessage.

    • SystemMessage: These are instructions passed to the chat model that set its behavior, persona, or constraints (for example, telling it to act as a professional translator). They are not user turns; they steer how the model responds to the messages that follow.

    • HumanMessage: These are messages coming from the human user. When users interact with the chatbot and send input, those messages are categorized as HumanMessages.

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

## Initialize the chat model
chat = ChatOpenAI(model='gpt-4', temperature=0.6)


## Human Message
chat.invoke(
    [
        HumanMessage(
            content="Translate this sentence from English to Chinese: Computational Linguistics is very challenging."
        )
    ]
)
AIMessage(content='计算语言学非常具有挑战性。')
## Human Message + System Message
messages = [
    SystemMessage(
        content="You are a professional translator that translates English to Taiwan Mandarin."
    ),
    HumanMessage(content="Computational linguistics is very challenging!"),
]
chat.invoke(messages)
AIMessage(content='計算語言學非常具有挑戰性!')
  • We can wrap the chat model in a ConversationChain, which comes with built-in memory for remembering previous conversational exchanges (see the quick check after the example below).

from langchain.chains import ConversationChain

conversation = ConversationChain(llm=chat)
conversation.invoke('Translate this sentence from English to Taiwan Mandarin: Computational Linguistics is very challenging.')
{'input': 'Translate this sentence from English to Taiwan Mandarin: Computational Linguistics is very challenging.',
 'history': '',
 'response': '"計算語言學非常具有挑戰性。"'}
conversation.invoke('Support the sentence with two examples.')
{'input': 'Support the sentence with two examples.',
 'history': 'Human: Translate this sentence from English to Taiwan Mandarin: Computational Linguistics is very challenging.\nAI: "計算語言學非常具有挑戰性。"',
 'response': 'Sure, here are two examples related to computational linguistics:\n\n1. 自然語言處理(NLP):這是計算語言學的一個重要分支,主要研究如何讓電腦能理解和生成人類語言。例如,機器翻譯就是NLP的一個應用,它需要理解來源語言的語義並嘗試在目標語言中進行準確的翻譯,這是一項非常具有挑戰性的任務。\n\n2. 語音識別:這也是計算語言學的一個主要研究領域,它涉及到讓電腦能識別並理解人類的語音。例如,智能助手如Siri或Alexa需要能夠理解用戶的語音指令並做出適當的反應。這需要大量的數據和複雜的算法,也是一項非常具有挑戰性的工作。'}

Memory#

from langchain.memory import ConversationBufferMemory

## Create Buffer Memory
memory = ConversationBufferMemory()

## Add some previous context 
memory.chat_memory.add_user_message("What day is today?")
memory.chat_memory.add_ai_message("Sunday")
memory.load_memory_variables({})
{'history': 'Human: What day is today?\nAI: Sunday'}
  • Summary Memory

from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryMemory

# llm = ChatOpenAI(temperature=0)
memory = ConversationSummaryMemory(llm=chat)
memory.save_context({"input": "Today is Sunday."}, {"output": "Ok."})
memory.save_context(
    {"input": "Today is November 26, 2023"},
    {"output": "oh, thank you for telling me"},
)
memory.load_memory_variables({})
{'history': 'The human states that today is Sunday, November 26, 2023 and the AI acknowledges this.'}
  • Put the memory and the LLM into the conversation chain

conversation = ConversationChain(llm=chat, memory = memory)
conversation.run("So what date was the day before yesterday?")
'The day before yesterday was Friday, November 24, 2023.'
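As a quick sanity check (not shown in the original notebook), we can reload the memory variables after this exchange; the summary should now also cover the question about the day before yesterday.

## The summary is re-generated after every exchange
memory.load_memory_variables({})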

Retrieval Chain#

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Loading PDF
loader = PyPDFLoader("../../../../ENC2045_demo_data/ENC2045Syllabus.pdf")
pages = loader.load()

## Split into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
documents = text_splitter.split_documents(pages)
vector = FAISS.from_documents(documents, embeddings)
retriever = vector.as_retriever()
retriever.invoke("What is ENC2045?")
[Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
 Document(page_content='Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2}),
 Document(page_content='course enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
 Document(page_content='of 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0})]
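The retriever returns four chunks by default. To control how many chunks get passed on to the chat model, as_retriever() also accepts search parameters; the value k=2 below is arbitrary and only for illustration.

## Restrict the retriever to the top-2 most similar chunks
retriever_top2 = vector.as_retriever(search_kwargs={"k": 2})
retriever_top2.invoke("What is ENC2045?")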
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate
from langchain.memory import ChatMessageHistory

chat_history_tracker = ChatMessageHistory()

QAPrompt = PromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context> 

Question: {input}""")
document_chain = create_stuff_documents_chain(chat, QAPrompt)
chat_history_tracker.messages
[]
document_chain.invoke(
    {
        "input": 'Please tell me about ENC2045 in simple words.',
        "context": documents,
    }
)
'ENC2045 is a course on Computational Linguistics. It focuses on using computational tools to process and understand natural language data. Topics include text processing, machine learning, deep learning, and neural networks. The course is very hands-on and includes classic examples of many task-oriented implementations. The main coding language used is Python, and students are expected to have a working knowledge of it. The course is designed specifically for linguistics majors, and it does not cover the mathematical operations behind the algorithms in depth. The course materials are available online and students should frequently check the website for updates.'
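In the call above, the retrieved documents are passed to the document chain manually via the context key. One way to let the retriever supply the context automatically (a sketch based on LangChain's create_retrieval_chain helper; the question string is only an example) is to wrap the retriever and the document chain together. The result of invoke() is a dictionary whose answer key holds the model's response.

from langchain.chains import create_retrieval_chain

## The retrieval chain first fetches relevant chunks with the retriever,
## then feeds them into the document chain as {context}
retrieval_chain = create_retrieval_chain(retriever, document_chain)
retrieval_chain.invoke({"input": "Who is the instructor of ENC2045?"})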

Notes#

  • It is still unclear to me how to combine Memory and Retrieval in a single chain. I think this part of LangChain is still under development, and for now the procedures can be less than transparent.

References#

  • This tutorial is based on the official LangChain documentation on Chatbots.