LangChain: An Introduction#

This image is from the LangChain official documentation.

What is LangChain?#

LangChain is an open-source framework designed for developers working with artificial intelligence (AI). It facilitates the integration of large language models (LLMs) like GPT-4 with external sources of computation and data. Here’s a breakdown of LangChain’s key components and functionalities:

1. Integration of Large Language Models (LLMs):#

  • LangChain allows developers to seamlessly connect LLMs such as GPT-4 to external data sources and computation platforms.

  • This integration enables developers to leverage the vast knowledge and capabilities of LLMs in combination with their own data and applications.

2. Addressing Specific Information Needs:#

  • While LLMs like GPT-4 possess extensive general knowledge, LangChain addresses the need for specific information from proprietary or domain-specific data sources.

  • Developers can utilize LangChain to connect LLMs to their own datasets, including documents, PDF files, or proprietary databases.

3. Dynamic Data Referencing:#

  • Unlike traditional methods that involve pasting snippets of text into chat prompts, LangChain allows for referencing entire databases of proprietary data.

  • Developers can segment their data into smaller chunks and store them in a vector database as embeddings, enabling efficient referencing and retrieval.

4. Pipeline for Language Model Applications:#

  • LangChain facilitates the development of language model applications following a structured pipeline.

  • The pipeline typically involves:

    • User input: Initial questions or queries from users.

    • Language model interaction: Sending user input to the LLM for processing.

    • Similarity search: Matching user queries with relevant data chunks in the vector database.

    • Action or response: Providing answers or taking actions based on the combined information from the LLM and vector database.
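The pipeline above can be sketched in plain Python. This is a toy illustration only, not LangChain code: `embed()` is a hypothetical stand-in for a real embedding model (here a simple character-frequency vector), and the final LLM call is mocked as a formatted prompt.

```python
def embed(text):
    # Hypothetical embedding: a character-frequency vector (illustration only;
    # a real system would call an embedding model).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# 1. Proprietary data, pre-chunked and "embedded" into a toy vector store
chunks = ["refund policy: 30 days", "shipping takes 5 business days"]
store = [(c, embed(c)) for c in chunks]

# 2. User input
query = "how long does shipping take?"

# 3. Similarity search: find the most relevant chunk
best = max(store, key=lambda item: cosine(embed(query), item[1]))[0]

# 4. Action or response: in a real pipeline this prompt goes to the LLM
prompt = f"Answer using this context: {best}\nQuestion: {query}"
print(prompt)
```

The similarity search correctly surfaces the shipping chunk rather than the refund chunk; LangChain's vector stores implement the same idea with real embeddings at scale.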

5. Practical Use Cases:#

  • LangChain’s capabilities enable a wide range of practical applications, particularly in personal assistance, education, and data analytics.

  • Examples include booking flights, transferring money, learning new subjects, and analyzing company data for insights.

6. Key Components of LangChain:#

  • LLM Wrappers: Facilitate connection to LLMs like GPT-4.

  • Prompt Templates: Dynamically generate prompts for LLMs based on user input.

  • Indexes: Extract relevant information from datasets for LLM processing.

  • Chains: Combine multiple components to build LLM applications for a specific task.

  • Agents: Enable LLMs to interact with external APIs for additional functionality.

7. Continuous Development and Expansion:#

  • LangChain is continually evolving, with new features and capabilities being added regularly.

  • The framework offers a flexible and scalable solution for developers looking to integrate LLMs into their applications.

In summary, LangChain provides developers with a powerful framework for harnessing the capabilities of LLMs and integrating them with external data sources, enabling the development of sophisticated language model applications across various domains.

Install#

# pip install langchain
# pip install langchain-community
# pip install langchain-core
# pip install -U langchain-openai
#!pip install langchain openai weaviate-client

API Setup#

To save environment variables in a .env file and use the dotenv library in Python to load them, follow these steps:

Saving Environment Variables in a .env File:#

  1. Create a new file in your project directory and name it .env. This file will store your environment variables.

  2. Add your environment variables to the .env file in the format VARIABLE_NAME=variable_value. For example:

    OPENAI_API_KEY=your_api_key
    DATABASE_URL=your_database_url
    

Using dotenv in Python to Load Environment Variables:#

  1. Install the dotenv library if you haven’t already installed it. You can install it using pip:

    pip install python-dotenv
    
  2. In your Python script, import the dotenv module:

    from dotenv import load_dotenv
    
  3. Load the environment variables from the .env file using the load_dotenv() function. Place this line at the beginning of your script:

    load_dotenv()
    
  4. Access the environment variables in your Python script using the os.environ dictionary. For example:

    import os
    
    api_key = os.environ.get('OPENAI_API_KEY')
    database_url = os.environ.get('DATABASE_URL')
    
    print("API Key:", api_key)
    print("Database URL:", database_url)
    

Notes:#

  • Make sure to add the .env file to your project’s .gitignore file to prevent sensitive information from being exposed.

  • You can also specify the path to the .env file if it’s located in a different directory:

    load_dotenv('/path/to/your/env/file/.env')
    

By following these steps, you can save environment variables in a .env file and use the dotenv library in Python to load them into your script. This approach helps keep sensitive information separate from your codebase and makes it easier to manage environment variables in your projects.

# Load environment variables

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
load_dotenv('/Users/alvinchen/.env')
True
find_dotenv()
'/Users/alvinchen/.env'

Basic Query#

This image is from the LangChain official documentation.

## Initialize the chat model
from langchain_openai import ChatOpenAI
chat = ChatOpenAI(model_name="gpt-4", temperature=0.3)
## Interact with the chat model immediately
response = chat.invoke("explain large language models in one sentence")
print(response.content, end='\n')
Large language models are advanced machine learning algorithms designed to understand and generate human-like text by being trained on a vast amount of data.

Messages#

# Import the schema for chat messages in order to query chat models such as GPT-3.5-turbo or GPT-4

from langchain_core.messages import (
    HumanMessage,
    SystemMessage,
    AIMessage
)
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
messages = [
    SystemMessage(content="You are an expert data scientist"),
    HumanMessage(content="Write a Python script that trains a neural network on simulated data ")
]
response = chat.invoke(messages)

print(response.content, end='\n')
Sure, here is an example Python script using the popular deep learning library TensorFlow to train a simple neural network on simulated data:

```python
import numpy as np
import tensorflow as tf

# Generate simulated data
np.random.seed(0)
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, 100)

# Define the neural network architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, batch_size=32)

# Evaluate the model
loss, accuracy = model.evaluate(X, y)
print(f'Loss: {loss}, Accuracy: {accuracy}')
```

In this script, we first generate simulated data with 2 features and binary labels. We then define a simple neural network with one hidden layer of 10 neurons and an output layer with a sigmoid activation function. We compile the model with binary crossentropy loss and train it on the simulated data for 10 epochs.

Finally, we evaluate the model on the training data and print out the loss and accuracy. You can modify this script to experiment with different neural network architectures, loss functions, optimizers, and hyperparameters.

Prompt Template#

# Import prompt and define PromptTemplate

from langchain_core.prompts import PromptTemplate

template = """
You are a college professor with an expertise in building deep learning models. 
Answer the {question} like I am five.
"""

prompt = PromptTemplate.from_template(
    template=template,
)
# Run the LLM with the PromptTemplate
response = chat.invoke(prompt.format(question="What is backpropagation?"))
print(response.content, end='\n')
Backpropagation is like a teacher helping you learn how to ride a bike by telling you what you did wrong and how to fix it, so you can get better and better at riding without falling off.

Chain#

chain = prompt | chat
response = chain.invoke({"question": "What is gradient descent?"})
print(response.content,end='\n')
Imagine you are trying to find the bottom of a hill by taking small steps downhill. Gradient descent is like a magical way to figure out which direction to step in order to get to the bottom of the hill faster. It helps us adjust our steps so we can reach the bottom of the hill (or the best solution) in the quickest way possible.
from langchain_core.output_parsers import StrOutputParser

chain2 = prompt | chat | StrOutputParser()
chain2.invoke({"question": "What is gradient descent?"})
'Imagine you are trying to find the bottom of a big slide in a playground. Gradient descent is like taking small steps down the slide until you reach the bottom. It helps us find the best way to adjust our deep learning model to make it work better.'

Chaining A Series of Prompts#

# Import LLMChain and define chain with language model and prompt as arguments.

from langchain.chains import LLMChain
chain = LLMChain(llm=chat, prompt=prompt)

# Run the chain only specifying the input variable.
print(chain.invoke("what is derivative?"))
{'question': 'what is derivative?', 'text': "Imagine you are playing with a toy car on a track that goes up and down hills. The derivative is like looking at how fast the car is going at different points on the track. If the car is going uphill, the derivative tells us how steep the hill is. If the car is going downhill, the derivative tells us how fast it's speeding up. It helps us understand how things are changing at different moments."}
# Define a second prompt 

second_prompt = PromptTemplate(
    input_variables=["prev_ans"],
    template="Translate the answer {prev_ans} into Traditional Chinese",
)
chain_two = LLMChain(llm=chat, prompt=second_prompt)
# Define a sequential chain using the two chains above: the second chain takes the output of the first chain as input

from langchain.chains import SimpleSequentialChain
overall_chain = SimpleSequentialChain(chains=[chain, chain_two], verbose=True)

# Run the chain specifying only the input variable for the first chain.
explanation = overall_chain.invoke("what is gradient descent?")
print(explanation)
> Entering new SimpleSequentialChain chain...
Imagine you are trying to find the bottom of a hill by taking small steps downhill. Gradient descent is like taking tiny steps in the direction that will help you reach the bottom of the hill faster. It's a method used by computers to adjust and improve their predictions in deep learning models.
想像一下,你正在尝试通过小步走下坡来找到山脚下。梯度下降就像是在朝着能让你更快到达山脚下的方向迈出微小步伐。这是计算机在深度学习模型中用来调整和改进预测的方法。

> Finished chain.
{'input': 'what is gradient descent?', 'output': '想像一下,你正在尝试通过小步走下坡来找到山脚下。梯度下降就像是在朝着能让你更快到达山脚下的方向迈出微小步伐。这是计算机在深度学习模型中用来调整和改进预测的方法。'}
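Conceptually, SimpleSequentialChain just pipes each step's output into the next step's input. A plain-Python analogue (the two functions below are hypothetical stand-ins for the LLM calls, not LangChain code):

```python
def explain(question):
    # Stand-in for the first chain (ELI5 explanation prompt + chat model)
    return f"ELI5 answer to: {question}"

def translate(prev_ans):
    # Stand-in for the second chain (translation prompt + chat model)
    return f"(Traditional Chinese translation of) {prev_ans}"

def sequential_chain(steps, user_input):
    # Feed the output of each step into the next, like SimpleSequentialChain
    out = user_input
    for step in steps:
        out = step(out)
    return out

result = sequential_chain([explain, translate], "what is gradient descent?")
print(result)
```

The real SimpleSequentialChain works the same way, except each step formats a prompt template and calls the chat model.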
# Import utility for splitting up texts and split up the explanation given above into document chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=0,
)

texts = text_splitter.create_documents([explanation["output"]])
# Individual text chunks can be accessed with "page_content"

texts[0].page_content
'想像一下你盲目地试图找到山坡的底部。梯度下降就像是在感觉最陡峭的方向上小步往下走,这样你最终会到达最'
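To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-width chunker. This is NOT the actual RecursiveCharacterTextSplitter algorithm, which first tries to split on separators such as `"\n\n"`, `"\n"`, and spaces before cutting at `chunk_size`; it only illustrates how the two parameters interact.

```python
def chunk_text(text, chunk_size=50, chunk_overlap=0):
    # Naive fixed-width chunking: each new chunk starts
    # (chunk_size - chunk_overlap) characters after the previous one,
    # so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "a" * 120
chunks = chunk_text(sample, chunk_size=50, chunk_overlap=10)
print([len(c) for c in chunks])  # [50, 50, 40]
```

With overlap, neighboring chunks share a margin of text, which reduces the chance that a relevant sentence is cut in half at a chunk boundary.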

Retrieval-Augmented Generation#

This image is from the LangChain official documentation.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../../../ENC2045_demo_data/ENC2045Syllabus.pdf") # /Users/alvinchen/Library/CloudStorage/GoogleDrive-alvinworks@gmail.com/My Drive/ENC2045_demo_data/ENC2045Syllabus.pdf
pages = loader.load()
pages
[Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements\nImportant course information will be posted on this web page and announced in class. You are responsible for all material that appears\nhere and should check this page for updates frequently.\n2024-02-23: The current version of the course website is based on the Spring Semester, 2021. It will be updated for the spring\nof 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please\nread the FAQ (FAQ.html) very carefully. This course is NOT OPEN to auditors.\n Course Description\nComputational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on the computational text\nanalytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural\nlanguage data (in spoken or textual formats). Therefore, this course aims to introduce useful strategies and common work\x00ows that\nhave been widely adopted by data scientists to extract useful insights from natural language data. 
In this course, we will focus on\ntextual data processing.\nA selective collection of potential topics may include:\nA Pipeline for Natural Language Processing\nText Normalization\nText Tokenization\nParsing and Chunking\nIssues for Chinese Language Processing (Word Segmentation)\nFeature Engineering and Text Vectorization\nTraditional Machine Learning\nClassi\x00cation Models (Naive Bayes, SVM, Logistic Regression)\nCommon Computational Tasks:\nSentiment Analysis\nTex Clustering and Topic Modeling\nDeep Learning and Neural Network\nNeural Language Model\nSequence Models\nRNN\nLSTM/GRU\nSequence-to-sequence Model\nAttention-based Models\nTransfer Learning\nLarge Language Models(LLM) & Retrieval-Augmented Generation (RAG)\nMultimodal Processing\nThis course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-\nclass theme-based tutorial sessions. The main coding language used in this course is Python  (https://www.python.org/). We will\nmake extensive use of the language. It is assumed that you know or will quickly learn how to code in Python. In fact, this course\nassumes that every enrolled student has working knowledge of Python. (If you are not sure if you ful\x00ll the prerequisite, please\ncontact the instructor \x00rst.)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
 Document(page_content='A test on Python Basics will be conducted on the \x00rst week of the class to ensure that every enrolled student ful\x00lls the prerequisite.\n(To be more speci\x00c, you are assumed to have already had working knowledge of all the concepts included in the book, Lean Python:\nLearn Just Enough Python to Build Useful Tools (https://www.amazon.com/Lean-Python-Learn-Enough-Useful/dp/1484223845)).\nThose who fail on the Python basics test are NOT advised to take this course.\nPlease note that this course is designed speci\x00cally for linguistics majors in humanities. For computer science majors, please note that\nthis course will not feature a thorough description of the mathematical operations behind the algorithms. We focus more on the\npractical implementation.\n Course Schedule\n(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)\nWeek Date Topic\nWeek 1 2023-02-23 Course Orientation and Computational Linguistics Overview\nWeek 2 2023-03-01 NLP Pipeline\nWeek 3 2023-03-08 Machine Learning Basics: Regression and Classi\x00cation\nWeek 4 2023-03-15 Naïve Bayes, Logistic Regression\nWeek 5 2023-03-22 Feature Engineering and Text Vectorization\nWeek 6 2023-03-29 Common NLP Tasks (Guest Speaker: Robin Lin from Droidtown Linguistic Tech. Co.\xa0Ltd.\xa0)\nWeek 7 2023-04-05 Holiday\nWeek 8 2023-04-12 Midterm Exam\nWeek 9 2023-04-19 Neural Network: A Primer\nWeek 10 2023-04-26 Deep Learning NLP and Word/Doc Embeddings\nWeek 11 2023-05-03 Sequence Model I: RNN and Neural Language Model\nWeek 12 2023-05-10 Sequence Model II: LSTM and GRU\nWeek 13 2023-05-17 Sequence Model III: Sequence-to-Sequence Model & Attention\nWeek 14 2023-05-24 Transformer, BERT, Transfer Learning, and Explainable AI\nWeek 15 2023-05-31 LLM, RAG, and Multimodal Processing\nWeek 16 2023-06-07 Final Exam\n Course Requirement\n', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 1}),
 Document(page_content=' Course Materials\nAll the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.\nThey will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.\n Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)\nInstructor’s Name: Alvin Chen\nOf\x00ce Hours: By appointment\nIf you have any further questions related to the course, please consult FAQ (FAQ.html) on our course website or write me at any time\nat alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw).\n Disclaimer & Agreement\nWhile I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any\nerrors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with\nno guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty\nof any kind, express or implied.\nYou may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent,\nyou cannot disclose con\x00dential information of the website (e.g., log-in username and password) to any third party.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2})]
## We can load webpages as context documents
# import bs4
# from langchain_community.document_loaders import WebBaseLoader
# loader = WebBaseLoader("https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/intro.html")
# pages = loader.load()
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter()
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")


documents = text_splitter.split_documents(pages)
vector = FAISS.from_documents(documents, embeddings)


docs = vector.similarity_search("What is ENC2045?", k=2)

for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
0: ENC2045: Computational Linguistics
This site is last-updated on 2024-02-23
 Annoucements
Important course information will be posted on this web page and announced in class. You are responsible for all material that appears
here and should check this page for updates frequently.
2024-02-23: The curr
2: Course Materials
All the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.
They will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.
 Logistics
Course Website: E
  • create_stuff_documents_chain(): This chain takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM. It passes ALL documents, so you should make sure they fit within the context window of the LLM you are using.

  • create_retrieval_chain(): This chain takes in a user inquiry, which is passed to the retriever to fetch relevant documents. Those documents (along with the original inputs) are then passed to an LLM (via create_stuff_documents_chain()) to generate a response.
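The composition of these two chains can be sketched in plain Python. `fake_retriever` and `fake_llm` below are hypothetical stand-ins for the FAISS retriever and the chat model; only the control flow mirrors what create_retrieval_chain() does:

```python
def fake_retriever(query):
    # Stand-in for vector.as_retriever(): returns relevant document texts
    return ["ENC2045 is a course on Computational Linguistics."]

def fake_llm(prompt):
    # Stand-in for the chat model: echoes the fact found in the context
    return "ENC2045 is a course on Computational Linguistics."

def retrieval_chain(inp):
    # 1. Retrieve documents for the user's query
    docs = fake_retriever(inp["input"])
    # 2. "Stuff" all documents into the prompt (create_stuff_documents_chain step)
    context = "\n\n".join(docs)
    prompt = (
        "Answer the following question based only on the provided context:\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {inp['input']}"
    )
    # 3. Generate the answer and return it alongside the inputs and context
    return {"input": inp["input"], "context": docs, "answer": fake_llm(prompt)}

print(retrieval_chain({"input": "What is ENC2045?"})["answer"])
```

The real chain returns the same dictionary shape, which is why the notebook below reads the result via `response["answer"]`.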

documents[:7]
[Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements\nImportant course information will be posted on this web page and announced in class. You are responsible for all material that appears\nhere and should check this page for updates frequently.\n2024-02-23: The current version of the course website is based on the Spring Semester, 2021. It will be updated for the spring\nof 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please\nread the FAQ (FAQ.html) very carefully. This course is NOT OPEN to auditors.\n Course Description\nComputational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on the computational text\nanalytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural\nlanguage data (in spoken or textual formats). Therefore, this course aims to introduce useful strategies and common work\x00ows that\nhave been widely adopted by data scientists to extract useful insights from natural language data. 
In this course, we will focus on\ntextual data processing.\nA selective collection of potential topics may include:\nA Pipeline for Natural Language Processing\nText Normalization\nText Tokenization\nParsing and Chunking\nIssues for Chinese Language Processing (Word Segmentation)\nFeature Engineering and Text Vectorization\nTraditional Machine Learning\nClassi\x00cation Models (Naive Bayes, SVM, Logistic Regression)\nCommon Computational Tasks:\nSentiment Analysis\nTex Clustering and Topic Modeling\nDeep Learning and Neural Network\nNeural Language Model\nSequence Models\nRNN\nLSTM/GRU\nSequence-to-sequence Model\nAttention-based Models\nTransfer Learning\nLarge Language Models(LLM) & Retrieval-Augmented Generation (RAG)\nMultimodal Processing\nThis course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-\nclass theme-based tutorial sessions. The main coding language used in this course is Python  (https://www.python.org/). We will\nmake extensive use of the language. It is assumed that you know or will quickly learn how to code in Python. In fact, this course\nassumes that every enrolled student has working knowledge of Python. (If you are not sure if you ful\x00ll the prerequisite, please\ncontact the instructor \x00rst.)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
 Document(page_content='A test on Python Basics will be conducted on the \x00rst week of the class to ensure that every enrolled student ful\x00lls the prerequisite.\n(To be more speci\x00c, you are assumed to have already had working knowledge of all the concepts included in the book, Lean Python:\nLearn Just Enough Python to Build Useful Tools (https://www.amazon.com/Lean-Python-Learn-Enough-Useful/dp/1484223845)).\nThose who fail on the Python basics test are NOT advised to take this course.\nPlease note that this course is designed speci\x00cally for linguistics majors in humanities. For computer science majors, please note that\nthis course will not feature a thorough description of the mathematical operations behind the algorithms. We focus more on the\npractical implementation.\n Course Schedule\n(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)\nWeek Date Topic\nWeek 1 2023-02-23 Course Orientation and Computational Linguistics Overview\nWeek 2 2023-03-01 NLP Pipeline\nWeek 3 2023-03-08 Machine Learning Basics: Regression and Classi\x00cation\nWeek 4 2023-03-15 Naïve Bayes, Logistic Regression\nWeek 5 2023-03-22 Feature Engineering and Text Vectorization\nWeek 6 2023-03-29 Common NLP Tasks (Guest Speaker: Robin Lin from Droidtown Linguistic Tech. Co.\xa0Ltd.\xa0)\nWeek 7 2023-04-05 Holiday\nWeek 8 2023-04-12 Midterm Exam\nWeek 9 2023-04-19 Neural Network: A Primer\nWeek 10 2023-04-26 Deep Learning NLP and Word/Doc Embeddings\nWeek 11 2023-05-03 Sequence Model I: RNN and Neural Language Model\nWeek 12 2023-05-10 Sequence Model II: LSTM and GRU\nWeek 13 2023-05-17 Sequence Model III: Sequence-to-Sequence Model & Attention\nWeek 14 2023-05-24 Transformer, BERT, Transfer Learning, and Explainable AI\nWeek 15 2023-05-31 LLM, RAG, and Multimodal Processing\nWeek 16 2023-06-07 Final Exam\n Course Requirement', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 1}),
 Document(page_content='Course Materials\nAll the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.\nThey will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.\n Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)\nInstructor’s Name: Alvin Chen\nOf\x00ce Hours: By appointment\nIf you have any further questions related to the course, please consult FAQ (FAQ.html) on our course website or write me at any time\nat alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw).\n Disclaimer & Agreement\nWhile I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any\nerrors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with\nno guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty\nof any kind, express or implied.\nYou may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent,\nyou cannot disclose con\x00dential information of the website (e.g., log-in username and password) to any third party.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2})]
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document

## define prompt template
prompt = PromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")


## create chain
document_chain = create_stuff_documents_chain(chat, prompt)

## When invoking the chain, provide the `context` documents


document_chain.invoke({
    "input": "What is ENC2045?",
    "context": documents
})
'ENC2045 is a course on Computational Linguistics.'
from langchain.chains import create_retrieval_chain

retriever = vector.as_retriever()

## Specific setting for retriever
# retriever = vector.as_retriever(
#     search_type="similarity_score_threshold", 
#     search_kwargs={"score_threshold": 0.3, "k":3})

retrieval_chain = create_retrieval_chain(retriever, document_chain)
response = retrieval_chain.invoke({"input": "Who is the instructor of the course ENC2045?"})
print(response["answer"])
The instructor of the course ENC2045 is Alvin Chen.

Chat History Management#

  • In addition to retrieving external documents as context information, the LLM also needs to consider the conversation history to give more precise answers.

  • create_history_aware_retriever(): This chain takes in the conversation history and uses it to generate a search query, which is then passed to the underlying retriever.
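The idea can be sketched in plain Python: rewrite the follow-up question into a standalone search query using the history, then hand that query to the retriever. Both `rewrite_query` and `retrieve` below are hypothetical stand-ins (the real chain asks the LLM to do the rewriting):

```python
def rewrite_query(chat_history, user_input):
    # Stand-in for the LLM-based query rewriter: a real implementation
    # prompts the chat model; this toy version just appends the history
    # so the retriever sees what "the four assignments" refers to.
    topics = " ".join(msg for _, msg in chat_history)
    return f"{user_input} (context: {topics})"

def retrieve(query):
    # Stand-in for the vector-store retriever
    return [f"docs matching: {query}"]

chat_history = [("human", "How many assignments do students need to do?"),
                ("ai", "Four.")]
docs = retrieve(rewrite_query(chat_history, "What are the four assignments?"))
print(docs[0])
```

Without the rewriting step, a bare query like "What are the four assignments?" carries no clue about the course being discussed, so retrieval quality degrades on follow-up questions.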

from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder, ChatPromptTemplate

# First we need a prompt that we can pass into an LLM to generate this search query

prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}"),
    ("user", "Given the above conversation, generate a search query to look up in order to get information relevant to the conversation")
])


history_chain = create_history_aware_retriever(chat, retriever, prompt)
chat_history = [HumanMessage(content="How many assignments do students need to do?"), 
                AIMessage(content="Four.")]

history_chain.invoke({
    "chat_history": chat_history,
    "input": "What are the four assignments?"
})
[Document(page_content='A test on Python Basics will be conducted on the \x00rst week of the class to ensure that every enrolled student ful\x00lls the prerequisite.\n(To be more speci\x00c, you are assumed to have already had working knowledge of all the concepts included in the book, Lean Python:\nLearn Just Enough Python to Build Useful Tools (https://www.amazon.com/Lean-Python-Learn-Enough-Useful/dp/1484223845)).\nThose who fail on the Python basics test are NOT advised to take this course.\nPlease note that this course is designed speci\x00cally for linguistics majors in humanities. For computer science majors, please note that\nthis course will not feature a thorough description of the mathematical operations behind the algorithms. We focus more on the\npractical implementation.\n Course Schedule\n(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)\nWeek Date Topic\nWeek 1 2023-02-23 Course Orientation and Computational Linguistics Overview\nWeek 2 2023-03-01 NLP Pipeline\nWeek 3 2023-03-08 Machine Learning Basics: Regression and Classi\x00cation\nWeek 4 2023-03-15 Naïve Bayes, Logistic Regression\nWeek 5 2023-03-22 Feature Engineering and Text Vectorization\nWeek 6 2023-03-29 Common NLP Tasks (Guest Speaker: Robin Lin from Droidtown Linguistic Tech. Co.\xa0Ltd.\xa0)\nWeek 7 2023-04-05 Holiday\nWeek 8 2023-04-12 Midterm Exam\nWeek 9 2023-04-19 Neural Network: A Primer\nWeek 10 2023-04-26 Deep Learning NLP and Word/Doc Embeddings\nWeek 11 2023-05-03 Sequence Model I: RNN and Neural Language Model\nWeek 12 2023-05-10 Sequence Model II: LSTM and GRU\nWeek 13 2023-05-17 Sequence Model III: Sequence-to-Sequence Model & Attention\nWeek 14 2023-05-24 Transformer, BERT, Transfer Learning, and Explainable AI\nWeek 15 2023-05-31 LLM, RAG, and Multimodal Processing\nWeek 16 2023-06-07 Final Exam\n Course Requirement', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 1}),
 Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements\nImportant course information will be posted on this web page and announced in class. You are responsible for all material that appears\nhere and should check this page for updates frequently.\n2024-02-23: The current version of the course website is based on the Spring Semester, 2021. It will be updated for the spring\nof 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please\nread the FAQ (FAQ.html) very carefully. This course is NOT OPEN to auditors.\n Course Description\nComputational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on the computational text\nanalytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural\nlanguage data (in spoken or textual formats). Therefore, this course aims to introduce useful strategies and common work\x00ows that\nhave been widely adopted by data scientists to extract useful insights from natural language data. 
In this course, we will focus on\ntextual data processing.\nA selective collection of potential topics may include:\nA Pipeline for Natural Language Processing\nText Normalization\nText Tokenization\nParsing and Chunking\nIssues for Chinese Language Processing (Word Segmentation)\nFeature Engineering and Text Vectorization\nTraditional Machine Learning\nClassi\x00cation Models (Naive Bayes, SVM, Logistic Regression)\nCommon Computational Tasks:\nSentiment Analysis\nTex Clustering and Topic Modeling\nDeep Learning and Neural Network\nNeural Language Model\nSequence Models\nRNN\nLSTM/GRU\nSequence-to-sequence Model\nAttention-based Models\nTransfer Learning\nLarge Language Models(LLM) & Retrieval-Augmented Generation (RAG)\nMultimodal Processing\nThis course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-\nclass theme-based tutorial sessions. The main coding language used in this course is Python  (https://www.python.org/). We will\nmake extensive use of the language. It is assumed that you know or will quickly learn how to code in Python. In fact, this course\nassumes that every enrolled student has working knowledge of Python. (If you are not sure if you ful\x00ll the prerequisite, please\ncontact the instructor \x00rst.)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
 Document(page_content='Course Materials\nAll the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.\nThey will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.\n Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)\nInstructor’s Name: Alvin Chen\nOf\x00ce Hours: By appointment\nIf you have any further questions related to the course, please consult FAQ (FAQ.html) on our course website or write me at any time\nat alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw).\n Disclaimer & Agreement\nWhile I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any\nerrors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with\nno guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty\nof any kind, express or implied.\nYou may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent,\nyou cannot disclose con\x00dential information of the website (e.g., log-in username and password) to any third party.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2})]
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's questions based on the below context:\n\n{context}"),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}"),
])

## chain that "stuffs" the retrieved documents into the prompt context
document_chain = create_stuff_documents_chain(chat, prompt)

## combine the history-aware retriever with the document chain
retrieval_chain = create_retrieval_chain(history_chain, document_chain)
retrieval_chain.invoke({
    "chat_history": chat_history,
    "input": "Are you saying four?"
})
{'chat_history': [HumanMessage(content='How many assignments do students need to do?'),
  AIMessage(content='Four.')],
 'input': 'Are you saying four?',
 'context': [Document(page_content='A test on Python Basics will be conducted on the \x00rst week of the class to ensure that every enrolled student ful\x00lls the prerequisite.\n(To be more speci\x00c, you are assumed to have already had working knowledge of all the concepts included in the book, Lean Python:\nLearn Just Enough Python to Build Useful Tools (https://www.amazon.com/Lean-Python-Learn-Enough-Useful/dp/1484223845)).\nThose who fail on the Python basics test are NOT advised to take this course.\nPlease note that this course is designed speci\x00cally for linguistics majors in humanities. For computer science majors, please note that\nthis course will not feature a thorough description of the mathematical operations behind the algorithms. We focus more on the\npractical implementation.\n Course Schedule\n(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)\nWeek Date Topic\nWeek 1 2023-02-23 Course Orientation and Computational Linguistics Overview\nWeek 2 2023-03-01 NLP Pipeline\nWeek 3 2023-03-08 Machine Learning Basics: Regression and Classi\x00cation\nWeek 4 2023-03-15 Naïve Bayes, Logistic Regression\nWeek 5 2023-03-22 Feature Engineering and Text Vectorization\nWeek 6 2023-03-29 Common NLP Tasks (Guest Speaker: Robin Lin from Droidtown Linguistic Tech. 
Co.\xa0Ltd.\xa0)\nWeek 7 2023-04-05 Holiday\nWeek 8 2023-04-12 Midterm Exam\nWeek 9 2023-04-19 Neural Network: A Primer\nWeek 10 2023-04-26 Deep Learning NLP and Word/Doc Embeddings\nWeek 11 2023-05-03 Sequence Model I: RNN and Neural Language Model\nWeek 12 2023-05-10 Sequence Model II: LSTM and GRU\nWeek 13 2023-05-17 Sequence Model III: Sequence-to-Sequence Model & Attention\nWeek 14 2023-05-24 Transformer, BERT, Transfer Learning, and Explainable AI\nWeek 15 2023-05-31 LLM, RAG, and Multimodal Processing\nWeek 16 2023-06-07 Final Exam\n Course Requirement', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 1}),
  Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements\nImportant course information will be posted on this web page and announced in class. You are responsible for all material that appears\nhere and should check this page for updates frequently.\n2024-02-23: The current version of the course website is based on the Spring Semester, 2021. It will be updated for the spring\nof 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please\nread the FAQ (FAQ.html) very carefully. This course is NOT OPEN to auditors.\n Course Description\nComputational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on the computational text\nanalytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural\nlanguage data (in spoken or textual formats). Therefore, this course aims to introduce useful strategies and common work\x00ows that\nhave been widely adopted by data scientists to extract useful insights from natural language data. 
In this course, we will focus on\ntextual data processing.\nA selective collection of potential topics may include:\nA Pipeline for Natural Language Processing\nText Normalization\nText Tokenization\nParsing and Chunking\nIssues for Chinese Language Processing (Word Segmentation)\nFeature Engineering and Text Vectorization\nTraditional Machine Learning\nClassi\x00cation Models (Naive Bayes, SVM, Logistic Regression)\nCommon Computational Tasks:\nSentiment Analysis\nTex Clustering and Topic Modeling\nDeep Learning and Neural Network\nNeural Language Model\nSequence Models\nRNN\nLSTM/GRU\nSequence-to-sequence Model\nAttention-based Models\nTransfer Learning\nLarge Language Models(LLM) & Retrieval-Augmented Generation (RAG)\nMultimodal Processing\nThis course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-\nclass theme-based tutorial sessions. The main coding language used in this course is Python  (https://www.python.org/). We will\nmake extensive use of the language. It is assumed that you know or will quickly learn how to code in Python. In fact, this course\nassumes that every enrolled student has working knowledge of Python. (If you are not sure if you ful\x00ll the prerequisite, please\ncontact the instructor \x00rst.)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
  Document(page_content='Course Materials\nAll the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.\nThey will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.\n Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)\nInstructor’s Name: Alvin Chen\nOf\x00ce Hours: By appointment\nIf you have any further questions related to the course, please consult FAQ (FAQ.html) on our course website or write me at any time\nat alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw).\n Disclaimer & Agreement\nWhile I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any\nerrors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with\nno guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty\nof any kind, express or implied.\nYou may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent,\nyou cannot disclose con\x00dential information of the website (e.g., log-in username and password) to any third party.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2})],
 'answer': 'The text does not provide information on the number of assignments students need to do in the course.'}
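Note that `invoke()` returns a plain dict with the keys `chat_history`, `input`, `context`, and `answer`, so post-processing the result needs no further LLM calls. A minimal sketch of such post-processing (the `result` below is a hand-made stand-in with the same keys as the real output above; real `context` entries are `Document` objects with a `.metadata` attribute rather than dicts):

```python
# Post-process the dict returned by retrieval_chain.invoke().
# `result` is a stand-in mirroring the keys of the real output above.
result = {
    "input": "Are you saying four?",
    "answer": "The text does not provide information on the number of assignments.",
    "context": [  # real entries are Document objects; plain dicts stand in here
        {"metadata": {"source": "ENC2045Syllabus.pdf", "page": 1}},
        {"metadata": {"source": "ENC2045Syllabus.pdf", "page": 0}},
    ],
}

answer = result["answer"]
pages = sorted(d["metadata"]["page"] for d in result["context"])
print(answer)
print(pages)   # which syllabus pages were retrieved → [0, 1]
```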
chat_history = [HumanMessage(content="How many assignments do students need to do?"), 
                AIMessage(content="Four."),
                HumanMessage(content='I am telling you that these four assignments include coding, reviewing, testing, and presentation.'),
                AIMessage(content='Thank you for the information.')]
retrieval_chain.invoke({
    "chat_history": chat_history,
    "input": "Can you repeat the four assignments?"
})
{'chat_history': [HumanMessage(content='How many assignments do students need to do?'),
  AIMessage(content='Four.'),
  HumanMessage(content='I am telling you that these four assignments include coding, reviewing, testing, and presentation.'),
  AIMessage(content='Thank you for the information.')],
 'input': 'Can you repeat the four assignments?',
 'context': [Document(page_content='A test on Python Basics will be conducted on the \x00rst week of the class to ensure that every enrolled student ful\x00lls the prerequisite.\n(To be more speci\x00c, you are assumed to have already had working knowledge of all the concepts included in the book, Lean Python:\nLearn Just Enough Python to Build Useful Tools (https://www.amazon.com/Lean-Python-Learn-Enough-Useful/dp/1484223845)).\nThose who fail on the Python basics test are NOT advised to take this course.\nPlease note that this course is designed speci\x00cally for linguistics majors in humanities. For computer science majors, please note that\nthis course will not feature a thorough description of the mathematical operations behind the algorithms. We focus more on the\npractical implementation.\n Course Schedule\n(The schedule is tentative and subject to change. Please pay attention to the announcements made during the class.)\nWeek Date Topic\nWeek 1 2023-02-23 Course Orientation and Computational Linguistics Overview\nWeek 2 2023-03-01 NLP Pipeline\nWeek 3 2023-03-08 Machine Learning Basics: Regression and Classi\x00cation\nWeek 4 2023-03-15 Naïve Bayes, Logistic Regression\nWeek 5 2023-03-22 Feature Engineering and Text Vectorization\nWeek 6 2023-03-29 Common NLP Tasks (Guest Speaker: Robin Lin from Droidtown Linguistic Tech. 
Co.\xa0Ltd.\xa0)\nWeek 7 2023-04-05 Holiday\nWeek 8 2023-04-12 Midterm Exam\nWeek 9 2023-04-19 Neural Network: A Primer\nWeek 10 2023-04-26 Deep Learning NLP and Word/Doc Embeddings\nWeek 11 2023-05-03 Sequence Model I: RNN and Neural Language Model\nWeek 12 2023-05-10 Sequence Model II: LSTM and GRU\nWeek 13 2023-05-17 Sequence Model III: Sequence-to-Sequence Model & Attention\nWeek 14 2023-05-24 Transformer, BERT, Transfer Learning, and Explainable AI\nWeek 15 2023-05-31 LLM, RAG, and Multimodal Processing\nWeek 16 2023-06-07 Final Exam\n Course Requirement', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 1}),
  Document(page_content='ENC2045: Computational Linguistics\nThis site is last-updated on 2024-02-23\n Annoucements\nImportant course information will be posted on this web page and announced in class. You are responsible for all material that appears\nhere and should check this page for updates frequently.\n2024-02-23: The current version of the course website is based on the Spring Semester, 2021. It will be updated for the spring\nof 2024.\n2023-12-24: This course is designed for linguistics majors. If you are NOT a linguistics major, please contact the instructor for\ncourse enrollment.\n2023-12-24: This course has prerequisites. A test on python basics will be conducted at the beginning of the semester. Please\nread the FAQ (FAQ.html) very carefully. This course is NOT OPEN to auditors.\n Course Description\nComputational Linguistics (CL) is now a very active sub-discipline in applied linguistics. Its main focus is on the computational text\nanalytics, which is essentially about leveraging computational tools, techniques, and algorithms to process and understand natural\nlanguage data (in spoken or textual formats). Therefore, this course aims to introduce useful strategies and common work\x00ows that\nhave been widely adopted by data scientists to extract useful insights from natural language data. 
In this course, we will focus on\ntextual data processing.\nA selective collection of potential topics may include:\nA Pipeline for Natural Language Processing\nText Normalization\nText Tokenization\nParsing and Chunking\nIssues for Chinese Language Processing (Word Segmentation)\nFeature Engineering and Text Vectorization\nTraditional Machine Learning\nClassi\x00cation Models (Naive Bayes, SVM, Logistic Regression)\nCommon Computational Tasks:\nSentiment Analysis\nTex Clustering and Topic Modeling\nDeep Learning and Neural Network\nNeural Language Model\nSequence Models\nRNN\nLSTM/GRU\nSequence-to-sequence Model\nAttention-based Models\nTransfer Learning\nLarge Language Models(LLM) & Retrieval-Augmented Generation (RAG)\nMultimodal Processing\nThis course is extremely hands-on and will guide the students through classic examples of many task-oriented implementations via in-\nclass theme-based tutorial sessions. The main coding language used in this course is Python  (https://www.python.org/). We will\nmake extensive use of the language. It is assumed that you know or will quickly learn how to code in Python. In fact, this course\nassumes that every enrolled student has working knowledge of Python. (If you are not sure if you ful\x00ll the prerequisite, please\ncontact the instructor \x00rst.)', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 0}),
  Document(page_content='Course Materials\nAll the course materials are available on the course website. Please consult the instructor for the direct link to the course materials.\nThey will be provided as a series of online packets (i.e., handouts, script source codes etc.) on the course website.\n Logistics\nCourse Website: ENC2045 Computational Linguistics (https://alvinntnu.github.io/NTNU_ENC2045/)\nInstructor’s Email Address: alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw)\nInstructor’s Name: Alvin Chen\nOf\x00ce Hours: By appointment\nIf you have any further questions related to the course, please consult FAQ (FAQ.html) on our course website or write me at any time\nat alvinchen@ntnu.edu.tw (mailto:alvinchen@ntnu.edu.tw).\n Disclaimer & Agreement\nWhile I have made every attempt to ensure that the information contained on the Website is correct, I am not responsible for any\nerrors or omissions, or for the results obtained from the use of this information. All information on the Website is provided “as is”, with\nno guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information, and without warranty\nof any kind, express or implied.\nYou may print a copy of any part of this website for your personal or non-commercial use. Without the author’s prior written consent,\nyou cannot disclose con\x00dential information of the website (e.g., log-in username and password) to any third party.', metadata={'source': '../../../../ENC2045_demo_data/ENC2045Syllabus.pdf', 'page': 2})],
 'answer': 'The four assignments include coding, reviewing, testing, and presentation.'}
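Because the answer above comes entirely from `chat_history` rather than from the retrieved context, keeping that list up to date between turns is the caller's job. A small plain-Python helper for growing the history (role/content tuples here for illustration; in the calls above these are `HumanMessage`/`AIMessage` objects):

```python
# Append each completed turn so the next invoke() sees the whole dialogue.
def add_turn(history, user_input, answer):
    history.append(("human", user_input))
    history.append(("ai", answer))
    return history

history = [("human", "How many assignments do students need to do?"),
           ("ai", "Four.")]
add_turn(history, "Can you repeat the four assignments?",
         "The four assignments include coding, reviewing, testing, and presentation.")
print(len(history))  # → 4
```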

Memory#

  • Right now, the Memory module is still under active development.

  • To work with Memory, we will use the legacy chain, langchain.chains.LLMChain(), whose compatibility with the LCEL framework is still being worked out.

from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory


# Notice that "chat_history" is present in the prompt template
template = """You are a nice college professor having a conversation with a student.

Previous conversation:
{chat_history}

New student's question: {question}
Response:"""

prompt = PromptTemplate.from_template(template)
# Notice that we need to align the `memory_key`
memory = ConversationBufferMemory(memory_key="chat_history", k = 1) # note: `k` is silently ignored by this class
conversation = LLMChain(
    llm=chat,
    prompt=prompt,
    verbose=True, ## print the fully formatted prompt at each call
    memory=memory
)
conversation.invoke("what is your name?")
> Entering new LLMChain chain...
Prompt after formatting:
You are a nice college professor having a conversation with a student.

Previous conversation:


New student's question: what is your name?
Response:

> Finished chain.
{'question': 'what is your name?',
 'chat_history': '',
 'text': "My name is Professor Johnson. It's nice to meet you."}
memory.chat_memory.add_user_message("I think your name is Alvin Chen, right?")
memory.chat_memory.add_ai_message("Yes. My name is Alvin Chen.")
conversation.invoke("So what is your name really?")
> Entering new LLMChain chain...
Prompt after formatting:
You are a nice college professor having a conversation with a student.

Previous conversation:
Human: what is your name?
AI: My name is Professor Johnson. It's nice to meet you.
Human: I think your name is Alvin Chen, right?
AI: Yes. My name is Alvin Chen.

New student's question: So what is your name really?
Response:

> Finished chain.
{'question': 'So what is your name really?',
 'chat_history': "Human: what is your name?\nAI: My name is Professor Johnson. It's nice to meet you.\nHuman: I think your name is Alvin Chen, right?\nAI: Yes. My name is Alvin Chen.",
 'text': 'My name is Alvin Chen. I apologize for any confusion earlier.'}
## ConversationBufferMemory ignores `k` and always stores the full history;
## a sliding window requires ConversationBufferWindowMemory instead.
memory.load_memory_variables({})
{'chat_history': "Human: what is your name?\nAI: My name is Professor Johnson. It's nice to meet you.\nHuman: I think your name is Alvin Chen, right?\nAI: Yes. My name is Alvin Chen.\nHuman: So what is your name really?\nAI: My name is Alvin Chen. I apologize for any confusion earlier."}
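As the output shows, all four turns are still in memory: `ConversationBufferMemory` always stores the full history, and the `k` argument is silently ignored. The windowed behavior lives in a separate class, `langchain.memory.ConversationBufferWindowMemory`, which keeps only the last `k` exchanges. The windowing logic itself is simple; a plain-Python sketch of the idea (not LangChain's actual implementation):

```python
# Keep only the last k (human, ai) exchanges, as ConversationBufferWindowMemory does.
def window_history(turns, k=1):
    return turns[-2 * k:] if k > 0 else []

turns = [("Human", "what is your name?"),
         ("AI", "My name is Professor Johnson. It's nice to meet you."),
         ("Human", "I think your name is Alvin Chen, right?"),
         ("AI", "Yes. My name is Alvin Chen.")]

print(window_history(turns, k=1))  # only the most recent exchange survives
```

Swapping `ConversationBufferMemory` for `ConversationBufferWindowMemory(memory_key="chat_history", k=1)` in the cell above should make the window actually take effect.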

References#