How to load a GitHub repository with LangChain
Quick answer
Use LangChain's GitLoader or GithubFileLoader class to load documents directly from a GitHub repository. GitLoader clones the repository locally with git, while GithubFileLoader fetches files through the GitHub API; both convert the files into LangChain Document objects for further processing.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2.0"
- Git installed on your system
- OpenAI API key (optional for downstream tasks)
Setup
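Install LangChain together with the langchain-community package, which provides the document loaders, and GitPython, which GitLoader uses to clone repositories. Make sure git itself is installed on your system. Set your OpenAI API key as an environment variable if you plan to use LangChain with OpenAI models.
pip install "langchain>=0.2.0" langchain-community GitPython
Step by step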
This example uses LangChain's GitLoader to clone a GitHub repository and load its files as documents.
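from langchain_community.document_loaders import GitLoader

# Clone the GitHub repo into ./langchain_repo and load its files
repo_url = "https://github.com/hwchase17/langchain"
loader = GitLoader(
    repo_path="./langchain_repo",
    clone_url=repo_url,
    branch="master",  # GitLoader defaults to "main"; this repo's default branch is "master"
)
docs = loader.load()

# Print the first 500 characters of the first document
print(docs[0].page_content[:500])
Output (illustrative; the first file returned depends on the repository's current contents)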
# langchain/__init__.py
"""LangChain is a framework for developing applications powered by language models."""
from langchain.schema import *  # noqa
from langchain.chains import *  # noqa
from langchain.llms import *  # noqa
from langchain.prompts import *  # noqa
from langchain.vectorstores import *  # noqa
from langchain.embeddings import *  # noqa
from langchain.document_loaders import *  # noqa
__version__ = "0.2.0"
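Before passing the loaded documents on, it can help to see what kinds of files came back. The helper below is a minimal sketch, not part of LangChain: `count_by_file_type` is a hypothetical name, and it assumes each document's metadata has a "file_type" key holding the file extension, which is what GitLoader appears to set. Documents without that key are counted as "unknown".

```python
from collections import Counter
from typing import Any, Iterable


def count_by_file_type(docs: Iterable[Any]) -> Counter:
    """Count documents per file extension using each document's metadata."""
    # Assumption: metadata contains a "file_type" key such as ".py".
    # Any object with a `.metadata` dict works, so the helper does not
    # depend on LangChain being importable.
    return Counter(doc.metadata.get("file_type", "unknown") for doc in docs)
```

With the `docs` list from the example above, `count_by_file_type(docs).most_common(5)` would list the five most frequent extensions.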
Common variations
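You can use GithubFileLoader to fetch files through the GitHub API instead of cloning, restricting the load to a branch or to specific file types. It requires a GitHub personal access token. GitLoader accepts branch and file_filter arguments as well. Async loading and integration with LangChain chains for LLM processing are also common.
import os
from langchain_community.document_loaders import GithubFileLoader

loader = GithubFileLoader(
    repo="hwchase17/langchain",  # "owner/name" form, not a full URL
    branch="master",  # this repo's default branch
    access_token=os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"],
    file_filter=lambda file_path: file_path.endswith(".py"),
)
docs = loader.load()
print(f"Loaded {len(docs)} Python files from the repo.")
Output (the count depends on the repository's current contents)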
Loaded 50 Python files from the repo.
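When a single lambda becomes too limiting, the filter can be built by a small factory function. The sketch below is illustrative only: `make_extension_filter` and its parameters are hypothetical names, not LangChain APIs. It returns a predicate that takes a file path string and returns a bool, the shape that a loader's file_filter argument expects.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable


def make_extension_filter(
    extensions: Iterable[str],
    exclude_dirs: Iterable[str] = (),
) -> Callable[[str], bool]:
    """Build a predicate that keeps files with the given extensions."""
    allowed = {ext.lower() for ext in extensions}
    excluded = set(exclude_dirs)

    def file_filter(file_path: str) -> bool:
        # Normalise Windows separators so the same filter works everywhere.
        path = PurePosixPath(file_path.replace("\\", "/"))
        # Reject the file if any parent directory is in the excluded set.
        if excluded.intersection(path.parts[:-1]):
            return False
        return path.suffix.lower() in allowed

    return file_filter


# Keep Python and Markdown files, but skip anything under a "tests" directory.
py_and_md = make_extension_filter([".py", ".md"], exclude_dirs=["tests"])
```

The resulting `py_and_md` function can be passed as `file_filter=py_and_md` when constructing a loader. Note that the directory check matches any path component, so a clone directory that is itself named "tests" would exclude everything.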
Troubleshooting
- If cloning fails, ensure git is installed and accessible in your system PATH.
- Check your internet connection and repository URL for typos.
- For private repos, configure SSH keys or use authentication tokens.
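The first and last checks above can be sketched in a few lines of standard-library Python. Both function names are hypothetical. The token approach assumes that the Git host accepts a personal access token as the user part of an HTTPS clone URL, which GitHub supports at the time of writing; SSH keys remain the alternative.

```python
import shutil
from urllib.parse import quote, urlsplit, urlunsplit


def git_available() -> bool:
    """Return True if a git executable can be found on the system PATH."""
    return shutil.which("git") is not None


def with_token(clone_url: str, token: str) -> str:
    """Embed an access token in an HTTPS clone URL (assumed auth scheme)."""
    parts = urlsplit(clone_url)
    if parts.scheme != "https":
        raise ValueError("token authentication needs an https:// clone URL")
    # Percent-encode the token so special characters cannot break the URL.
    netloc = f"{quote(token, safe='')}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The returned URL can be passed as clone_url to GitLoader. Read the token from an environment variable rather than hard-coding it, and keep in mind that git stores the clone URL, token included, in the cloned repository's .git/config.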
Key Takeaways
- Use LangChain's GitLoader (clones with git) or GithubFileLoader (uses the GitHub API) to load GitHub repos as documents.
- Ensure git is installed and repo URLs are correct to avoid cloning errors.
- Filter files by extension or branch for targeted loading.
- Set environment variables for API keys and authentication securely.
- Loaded documents can be used directly with LangChain chains and LLMs.