Langchain Applications — Part 3 — Embedding Models

Shishir Singh
4 min read · Jun 19, 2023


This article is part of a series explaining Langchain applications with simple Python code examples using OpenAI. Part 2 discussed Chat Model applications.

Langchain Concepts — Part 3 introduced Models. In this article, we will take a deeper look at Embedding Model applications.

Embedding Models

Embedding models in LangChain are used to transform the text into numerical representations, or embeddings, that can be processed by machine learning algorithms. These embeddings are used in various natural language processing (NLP) tasks, such as understanding text, analyzing sentiments, and translating languages.

In LangChain, these models can generate embeddings for both queries and documents. When a query is embedded, the text string is converted into an array of numbers, each representing a dimension in the embedding space. For documents, the embed_documents method takes a list of text strings and returns a list of their respective embeddings.

LangChain integrates with different model providers for generating embeddings. The OpenAIEmbeddings class, for instance, uses the OpenAI API to create embeddings, and this can be done using either OpenAI's API key or Azure's OpenAI API key. Other integrations include CohereEmbeddings, TensorFlowEmbeddings, and HuggingFaceInferenceEmbeddings​.

In terms of handling API usage, LangChain provides additional features such as setting a timeout, handling rate limits, and dealing with API errors. For instance, the timeout option can be set when instantiating an Embeddings model to stop waiting for a response after a certain amount of time. The maxConcurrency option specifies the maximum number of concurrent requests to the provider, which helps manage rate limits. If a model provider returns an error, LangChain has a built-in mechanism to retry the request up to 6 times with exponential backoff, and this can be modified with the maxRetries option.
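In the Python package used in these examples, the equivalent options are exposed as snake_case constructor parameters. Here is a minimal sketch, assuming your LangChain version supports request_timeout and max_retries on OpenAIEmbeddings:

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=openai_api_key,  # assumed to hold a valid OpenAI API key
    request_timeout=30,  # stop waiting for a response after 30 seconds
    max_retries=3,       # retry with exponential backoff up to 3 times instead of the default 6
)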

Example: Embed Query and Embed Document

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=openai_api_key
)

The first line is an import statement that imports the OpenAIEmbeddings class from the langchain.embeddings.openai module. This module provides various text embedding models from OpenAI. The OpenAIEmbeddings class is a wrapper for the OpenAI API that allows you to access and use OpenAI's embedding models.

The second statement creates an instance of the OpenAIEmbeddings class and assigns it to the variable embeddings. The instance is initialized with two parameters: model and openai_api_key. The model parameter specifies which OpenAI model to use; in this case it is "text-embedding-ada-002", a neural network model that generates text embeddings. Text embeddings are numerical representations of texts that capture their semantic and syntactic information. The openai_api_key parameter is the secret key that authenticates your access to the OpenAI API.
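Rather than hard-coding the key in source, a common pattern is to read it from an environment variable. The sketch below assumes the key has already been exported as OPENAI_API_KEY:

import os

# Assumes the OPENAI_API_KEY environment variable has been set beforehand.
openai_api_key = os.environ["OPENAI_API_KEY"]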

Embed Query

text = "This is a test query."
query_result = embeddings.embed_query(text)
print(query_result)

embed_query(text: str) → List[float]

This is a method of the OpenAIEmbeddings class that allows you to get an embedding for a query text using OpenAI’s embedding endpoint. A query text is a string that contains a question or a keyword that you want to use to search for relevant documents or texts.

Abridged...
[-0.005056409165263176, 0.00508662685751915, -0.005231000017374754, -0.01525652315467596, 0.01798282004892826, -0.00454271025955677, -0.002179023576900363, -0.004317757207900286, 0.025490211322903633, -0.0025987124536186457, 0.025329051539301872, 0.04311042279005051, -0.029599804431200027, 0.0020010757725685835, 0.0001051844737958163, -0.022119272500276566, -0.011126786470413208, -0.0015377392992377281, -0.004374834708869457]
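The result is a plain Python list of floats; for text-embedding-ada-002, each embedding should have 1,536 dimensions, which you can confirm with a quick check:

print(len(query_result))   # expected: 1536 for text-embedding-ada-002
print(type(query_result))  # <class 'list'>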

Embed Document

documents = ["This is a sample document.", "This is another sample document."]
document_embeddings = embeddings.embed_documents(texts=documents, chunk_size=1000)
print(document_embeddings)

embed_documents(texts: List[str], chunk_size: Optional[int] = 0) → List[List[float]]

This is a method of the OpenAIEmbeddings class that allows you to get embeddings for a list of texts using OpenAI's embedding endpoint. Embeddings are numerical representations of texts that capture their semantic and syntactic features, and they can be used for tasks such as search, clustering, and recommendations. As expected, we get two embedding lists in the output, one per document.

documents = ["This is a sample document.", "This is another sample document."]
document_embeddings = embeddings.embed_documents(texts=documents, chunk_size=1000)
print(len(document_embeddings))
2
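To show how these embeddings support tasks like search, here is a small sketch that scores each sample document against the query embedded earlier; a higher score means a closer semantic match. Note that cosine_similarity below is a hypothetical helper written for illustration, not a LangChain API:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the sample documents against the query embedding from embed_query above.
for doc, doc_embedding in zip(documents, document_embeddings):
    print(f"{cosine_similarity(query_result, doc_embedding):.4f}  {doc}")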

Conclusion

LangChain’s embedding models, as demonstrated through Python examples, offer a robust and versatile approach to transforming text into numerical representations, or embeddings, which are instrumental in various natural language processing tasks. The system’s compatibility with different model providers, like OpenAI, Cohere, TensorFlow, and HuggingFaceInference, underscores its flexibility and broad applicability.

LangChain’s rich features, such as its built-in mechanism for handling API usage, including timeouts, rate limits, and error management, provide additional robustness. These mechanisms ensure that the system can efficiently handle and recover from potential disruptions during the embedding process.

The example provided, which uses the OpenAIEmbeddings class from LangChain, showcases how straightforward it is to integrate these powerful embedding models into applications. Whether it’s embedding a simple query or a set of documents, developers can generate numerical representations of their texts with ease and precision.

In summary, LangChain’s embedding models present a flexible, efficient, and powerful solution for transforming text into numerical form, an essential step in many machine learning and NLP tasks. The ability to customize and manage these models makes them an invaluable tool for developers and researchers working with text data.

In the next set of articles, we will cover the other key concepts: Prompts (Part 4), followed by Indexes, Memory, Chains, and Agents.

The GitHub Python code will be shared after the conclusion of the series.


Written by Shishir Singh

Digital Assets, Blockchains, DLTs, Tokenization & Protocols & AI Intersection