Large Language Models (LLMs) like ChatGPT and Llama-2 have been 🔥on fire 🔥. I’ve been using these models for a while and recently realized that while I extensively use them to help me program faster, I usually leave them out of my target code. I recently worked through a highly manual task that involved a small amount of fuzzy reasoning. Naturally, after spending all that time, I wanted to know whether an LLM could have handled the job. Manually prompting ChatGPT showed some promising results, but conducting a thorough analysis through ChatGPT’s web chat interface would have been unreasonable.
In classic two-birds-one-stone fashion, I used this to explore how I can programmatically interact with LLMs. Taking on this project would enable me to efficiently assess the performance of LLMs for the task at hand (my research question) and teach me how to access LLMs programmatically (teach me a new skill). This post covers my research and learning journey; it catalogs some of the LLM technologies I interacted with and discusses their capabilities and limitations.
Approaches to Programmatic LLM Access
As mentioned above, efficiently leveraging LLMs at scale often requires programmatic access. In this post, I explore two main methods: running Llama-2 locally on my MacBook Pro and interacting with OpenAI’s GPT-3.5 through the ChatGPT API. Each approach has its unique advantages and challenges.
Local LLM
Running a local LLM provides significant control over data privacy, as all processing happens in an environment you control. This control is particularly beneficial for sensitive or confidential tasks. These benefits come at the cost of setup complexity, computational limitations, and limited scalability.
Online LLM API
An online LLM usually offers the advantage of tapping into a vendor’s robust cloud infrastructure (e.g., GCP, AWS, Azure). Using online LLMs ensures rapid response times and eliminates the need for extensive local computational resources. The setup is relatively straightforward, reducing technical overhead and making it more accessible. Additionally, the scalability of this approach is well-suited for handling large volumes of queries or complex computational tasks. However, this convenience comes with considerations around data privacy, as sensitive information is processed externally. There is also the potential for costs associated with API usage and the reliance on a stable internet connection for uninterrupted interaction.
For my local LLM exploration, I decided to use Llama-2. This decision was influenced by the need to explore ways to protect data privacy by processing data on my machine. I used an early 2023 MacBook Pro with an M2 Pro Chip and 32GB RAM. There are many ways to set up a local Llama-2 instance.
Local Llama Choices and Setup
The options I considered included:
- Building it from scratch – This would have offered the most customization but required significant technical expertise and time.
- Ollama – An alternative that provides a more streamlined setup process.
- llama-cpp-python – I chose this option due to its easy setup and robust documentation. This approach was greatly simplified by following this helpful blog post, which provided clear instructions and resources.
The setup process involved:
- Downloading the .gguf file: This contains the actual model, and I sourced the file from Hugging Face.
- Installing llama-cpp-python: This was a straightforward process of employing pip as per below.
pip install llama-cpp-python
Llama Coding and Configuration
The coding aspect was relatively straightforward:
# Required import
from llama_cpp import Llama

# Location of the GGUF model
model_path = '/home/jovyan/Downloads/llama-2-7b-chat.Q2_K.gguf'

# Create a llama model with a maxed-out context window
model = Llama(model_path=model_path, n_ctx=4096)
However, I encountered a hiccup with the initial boilerplate code, which didn’t have the context length set and defaulted to something much smaller than 4096. This led to issues with prompt length during my initial experiment. I needed to max out the context length because I passed substantial amounts of text to the LLM.
Calling the Llama
The snippet below illustrates creating a prompt, setting model parameters, and running the model to obtain a response.
# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"

# Llama-2 chat template: the system message sits inside <<SYS>> tags,
# and everything is wrapped in a single [INST] instruction block
prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

# Model parameters
max_tokens = 100

# Run the model
output = model(prompt, max_tokens=max_tokens, echo=True)

# Print the model output
print(output['choices'][0]['text'])
It’s relatively straightforward. The one thing to note for folks used to web-based chat LLM interfaces is that the prompt has two components: the system message and the user message. The user message is what you would type into web-based ChatGPT. The system message is additional information that the system (e.g., the developer) sends to the LLM to help shape its behavior. While I need to do more research, as a developer you can pack information into both parts.
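To make that two-part structure explicit, here is a minimal helper of my own (a sketch, not part of llama-cpp-python) that assembles a Llama-2 chat prompt from the two messages:

```python
def build_llama2_prompt(system_message: str, user_message: str) -> str:
    """Assemble a Llama-2 chat prompt from system and user messages."""
    # The system message lives inside <<SYS>> tags; the whole
    # conversation turn sits in one [INST] instruction block.
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )
```

This keeps the template in one place, so experiments with different system messages don’t require touching the f-string each time.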
Local Llama Performance Limitations
Regarding performance, my local Llama-2 setup was quite slow, with response times exceeding a minute per query.
This highlighted one of the critical trade-offs of a local setup: computational power versus data privacy and control.
A final note is that I was using a relatively powerful personal machine; however, the way I was using llama-cpp-python may not have been taking full advantage of the hardware.
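If I revisit this, the first thing I would try is llama-cpp-python’s GPU offload option. This is an untested sketch assuming a Metal-enabled build of the library and the same model file as above:

```python
from llama_cpp import Llama

# Same model path as before (adjust to wherever the .gguf file lives)
model_path = '/home/jovyan/Downloads/llama-2-7b-chat.Q2_K.gguf'

# n_gpu_layers=-1 asks llama-cpp-python to offload all layers to the GPU
# (Metal on Apple Silicon); n_threads controls CPU threads for the rest.
model = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1, n_threads=8)
```

Whether this helps depends on how the package was built, so benchmarking before and after would be the honest test.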
ChatGPT API
After exploring the local setup with Llama-2, I turned my attention to the ChatGPT API. N.B. there are other ways to access the ChatGPT API (such as Azure). My initial step was to briefly skim the OpenAI documentation, which I promptly discarded once I found some code to get me started.
Initial Research and Costs
The OpenAI Playground was a valuable resource. It allowed me to experiment with different prompts and settings, giving me a feeling for setting up the ChatGPT API, as you can use it to generate boilerplate code. One thing to note is that even with a subscription to ChatGPT Plus, separate payment is required for API usage. I was initially concerned about the potential costs, but it was cheap.
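For rough budgeting, back-of-the-envelope token math is enough. A small sketch of that arithmetic, with illustrative per-1K-token rates (not current pricing; check OpenAI’s pricing page before relying on them):

```python
def estimate_cost(n_queries, in_tokens, out_tokens,
                  in_rate=0.001, out_rate=0.002):
    """Estimate total USD cost for a batch of identicallly-sized queries.

    in_rate/out_rate are illustrative USD-per-1K-token prices, not
    authoritative figures.
    """
    per_query = (in_tokens / 1000) * in_rate + (out_tokens / 1000) * out_rate
    return n_queries * per_query

# e.g., 500 queries with ~1,000 input tokens and a one-word reply each
print(estimate_cost(500, 1000, 10))
```

Even with generous assumptions about prompt size, batches of a few hundred classification-style queries come out to well under a dollar, which matches what I saw.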
Setting Up ChatGPT API Access
For the implementation, I used the OpenAI Python library, a straightforward and powerful tool for interacting with ChatGPT. Here’s the code I used (based on the current version of the OpenAI package, available as of November 28, 2023):
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are an expert academic ophthalmologist who is conducting a systematic review..."},
        {"role": "user",
         "content": "Some technical details... Please respond with one word: \"relevant\" or \"irrelevant\""},
    ],
)
ChatGPT API Performance
The performance of this setup was impressive. For 500 queries, the average response time was around 4 seconds. Many responses were even faster, with a median time of 0.6 seconds. This was a significant improvement over the local Llama-2 setup. However, I noticed several queries took 10 minutes, likely due to throttling implemented by OpenAI.
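One way to soften those throttling spikes is to wrap each call in retries with exponential backoff. This is a generic pattern sketch of my own (the OpenAI client also has its own retry settings, so treat this as illustrative rather than required):

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff and jitter.

    `call` is any function that may raise on a transient failure,
    e.g. a lambda wrapping a single API request.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Sleep base, 2*base, 4*base, ... plus a little jitter
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```

The jitter matters when many clients retry at once; without it, they all hammer the API again at the same instant.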
In terms of cost, I was surprised at how inexpensive it was. Running more than 500 queries amounted to only about 60 cents, which was WAY cheaper than I expected!
I did the Llama-2 coding over an evening and took on the ChatGPT API coding the following morning. In total, it took less than 5 hours! Both approaches were straightforward. I was worried about the cost of the online LLM, but that wasn’t an issue, especially considering how much time it saved me compared to the local LLM.
As always, there’s optimization to be done. For instance, while using the ChatGPT API, I initially sent individual messages. However, I later realized that the OpenAI client might be capable of handling multiple messages simultaneously. I need to check on this, but the message data structure implies it, and I imagine it would significantly increase efficiency. Another important consideration that I still need to discuss is deployment. Although I’ve done deployments on local machines, it is often best to use a cloud service provider, and all the major ones now provide LLMs.
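While I haven’t verified any batching in the API itself, a simpler speed-up for many independent queries is client-side concurrency. A sketch using a thread pool, where classify is a hypothetical stand-in for a single API call:

```python
from concurrent.futures import ThreadPoolExecutor

def classify(abstract):
    # Stand-in for one ChatGPT API request; in practice this would
    # wrap client.chat.completions.create(...) for a single abstract.
    return "relevant" if "retina" in abstract else "irrelevant"

def classify_all(abstracts, max_workers=8):
    # Fan independent, order-preserving queries out across a thread pool
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, abstracts))
```

Since each query here is independent and mostly network-bound, a handful of worker threads should cut wall-clock time roughly in proportion to the pool size, up to whatever rate limit applies.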
The primary motivation behind this exploration was a quick academic study, the details of which will be revealed in due time. The overall goal was to assess the efficacy of an LLM in assisting with a labor-intensive aspect of research. Without programmatic LLM access, this would have been impossible to determine. Based on how easy it was to set up this experiment, I am now interested in exploring other tasks that involve sifting through large volumes of academic literature.
The results of this study are still being tabulated (beep-boop), and I am excited about what they will reveal about the capabilities and limitations of LLMs in academic research. Once the results are ready, I plan to share them here, providing insights into the practical application of LLMs in a real-world research scenario.
P.S. Exploring Ollama
Ollama is another potential avenue for running LLMs locally. I plan to check it out to see how its dockerized deployments perform. Running the LLM in a Docker container on my machine was my original goal, but my initial attempts failed badly.
P.P.S. Handling the OpenAI Key Securely
Like many other APIs, you need an API key to access OpenAI’s ChatGPT API calls. I don’t like storing the key in plain text in my Jupyter notebook (just in case I share the notebook publicly). To address this, I developed this little code snippet that I put in my Jupyter notebooks that use the ChatGPT API:
# Required imports
import getpass
import os

# Prompt for the API key
OPENAI_API_KEY = getpass.getpass("Enter your OPENAI_API_KEY: ")
# Set the key as an environment variable
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# Verify the environment variable is set
print("Environment variable OPENAI_API_KEY set successfully.")
This method uses getpass to securely input the API key and os to set the key as an environment variable.
This approach keeps the key out of the codebase, reducing the risk of accidental exposure.
I want to thank Kevin Quinn for reviewing this post and providing feedback on it and the project!