Getting Started with Gemini API: A Comprehensive Practical Guide
Getting Started with Google's Latest Multi-Modal AI Model
Gemini, Google's latest large language model, marks a significant leap forward in multimodal AI, answering queries that combine images, audio, and text. At the same time, Bard, its predecessor, is making a notable comeback powered by the new model. Together they promise to change the way we interact with information.
This hands-on tutorial shows you how to set up the Gemini API on your machine and use it. We will walk through the main Python API functions, such as generating text and understanding images, so you can make the most of Gemini's capabilities in a simple way.
Table of Contents:
What is the Gemini Model?
Setting Up Working Environment & Getting Started
Customizing the Model Response
Gemini Pro Vision
Chat Conversations Using Gemini
Embeddings Model with Gemini
1. What is the Gemini Model?
Gemini is a novel AI model that emerges from collaborative efforts among various Google teams, including Google Research and Google DeepMind. Uniquely designed as a multimodal entity, Gemini possesses the ability to comprehend and process diverse forms of data, encompassing text, code, audio, images, and video.
As Google’s most advanced and extensive AI creation to date, Gemini stands out for its exceptional flexibility, enabling seamless operation across a broad spectrum of systems, ranging from expansive data centers to compact mobile devices. This adaptability holds the promise of transforming the landscape of AI application development and scalability for businesses and developers alike.
There are three versions of the Gemini model designed for different use cases:
Gemini Ultra: Largest and most advanced AI capable of performing complex tasks.
Gemini Pro: A balanced model that has good performance and scalability.
Gemini Nano: Most efficient for mobile devices.
Gemini Ultra takes center stage with cutting-edge performance, surpassing GPT-4 across various metrics. Notably, it achieves a milestone by outperforming human experts on the Massive Multitask Language Understanding benchmark. This benchmark evaluates proficiency in world knowledge and problem-solving across 57 diverse subjects, highlighting Gemini Ultra’s advanced capabilities in understanding and addressing complex challenges.
2. Setting Up Working Environment & Getting Started
To use the Gemini API, you first need an API key, which you can get from Google AI Studio. Click the “Get an API key” button and then click “Create an API key in a new project”.
Copy the API key and set it as an environment variable, or assign it to a variable in code only if you do not intend to share the code with anyone. The next step is to install the Python SDK using pip:
!pip install -q -U google-generativeai
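If you go the environment-variable route, you can export the key in your shell before launching Python. The value below is a placeholder; paste your own key from Google AI Studio:

```shell
# Store the key in an environment variable so it never appears in source code.
export GEMINI_API_KEY="your-api-key-here"  # placeholder value
```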
After that, we will set the API key to Google’s GenAI and initiate the instance.
import google.generativeai as genai
import os
gemini_api_key = os.environ["GEMINI_API_KEY"]
genai.configure(api_key = gemini_api_key)
After setting up the API key, using the Gemini Pro model to generate content is simple. Provide a prompt to the `generate_content` function and display the output as Markdown.
from IPython.display import Markdown
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content("What is a Large Language Model?")
Markdown(response.text)
Gemini can generate multiple responses, called candidates, for a single prompt. You can select the most suitable one. In this case, we had only one response.
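If you request multiple candidates, you can inspect them all instead of reading only `response.text`. A small sketch, with attribute names following the google-generativeai Python SDK:

```python
# Each candidate carries its generated parts plus finish-reason and
# safety metadata; print the text of every part of every candidate.
for candidate in response.candidates:
    for part in candidate.content.parts:
        print(part.text)
```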
3. Customizing the Model Response
You can customize the response using the `generation_config` argument. Here we limit the candidate count to 1, add the stop sequence “OpenAI”, and set the maximum output tokens and the temperature.
response = model.generate_content(
    'Tell me the story of the rise of LLM and chatbots.',
    generation_config=genai.types.GenerationConfig(
        candidate_count=1,
        stop_sequences=['OpenAI'],
        max_output_tokens=200,
        temperature=0.7)
)
Markdown(response.text)
4. Gemini Pro Vision
Gemini Pro Vision is a large multimodal Gemini model that accepts both text and visual inputs (images and video) and generates relevant text responses.
Gemini Pro Vision is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from images and video. It’s adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.
Let's start with loading an image using the following code:
import PIL.Image
img = PIL.Image.open('/content/images.jpg')
img
Let’s load the Gemini Pro Vision model and provide it with the image.
model = genai.GenerativeModel('gemini-pro-vision')
response = model.generate_content(img)
Markdown(response.text)
The Temple Mount is a holy site located in Jerusalem. It is the holiest site in Judaism and the third holiest site in Islam. The Temple Mount is home to the Al-Aqsa Mosque, the Dome of the Rock, and the Western Wall. The Temple Mount is a disputed territory, and its status is a major source of conflict between Israelis and Palestinians.
The model accurately identified the site and provided additional information about its history and significance.
We will now provide text and the image to the API. We have asked the vision model to write a blog about the conflict and the rights of Palestinians using the image as a reference.
response = model.generate_content(["Write a blog post about Palestinian rights using the image as a reference.", img])
Markdown(response.text)
The image shows two young men standing in front of the Dome of the Rock, a Muslim shrine in Jerusalem. They are both wearing Palestinian flags around their shoulders. The image is a powerful symbol of Palestinian identity and resistance. It is a reminder of the Palestinian people’s long struggle for freedom and independence.
The Dome of the Rock is a holy site for Muslims, Jews, and Christians. It is located in the Old City of Jerusalem, which is a disputed territory between Israel and Palestine. The Palestinians believe that the Dome of the Rock is the site of the Prophet Muhammad’s ascent to heaven. The Israelis believe that the Dome of the Rock is the site of the Second Temple.
The Dome of the Rock has been a flashpoint of violence between Israelis and Palestinians for many years. In 1967, Israel captured the Old City of Jerusalem from Jordan. Since then, the Israelis have controlled the Dome of the Rock. However, the Palestinians still consider the Dome of the Rock to be a Muslim holy site.
In 2000, the Israeli Prime Minister Ariel Sharon visited the Dome of the Rock. This visit sparked the Second Intifada, a Palestinian uprising that lasted for five years. The Second Intifada ended in 2005, but the conflict between Israelis and Palestinians continues.
The Dome of the Rock is a symbol of the Palestinian people’s struggle for freedom and independence. It is a reminder of the long history of conflict between Israelis and Palestinians. The Dome of the Rock is also a symbol of hope for peace. It is a reminder that Israelis and Palestinians must find a way to live together in peace.
5. Chat Conversations Using Gemini
We can set up the model for a back-and-forth chat session, so that it remembers the context and responds based on the previous turns. In this example, we start a chat session and ask the model to help us get started with Large Language Models.
model = genai.GenerativeModel('gemini-pro')
chat = model.start_chat(history=[])
chat.send_message("Can you please guide me on how to start with Large Language Models?")
chat.history
You can see that the `chat` object is saving the history of the user and model messages. We can also display them in Markdown style.
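For instance, a short sketch that renders the stored history in a notebook; the attribute names follow the `Content` objects returned by the google-generativeai SDK:

```python
from IPython.display import Markdown, display

# Walk the stored turns; each entry has a role ("user" or "model")
# and a list of parts containing the text.
for message in chat.history:
    display(Markdown(f"**{message.role}**: {message.parts[0].text}"))
```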
Let’s ask the follow-up question about how to fine-tune LLMs:
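A sketch of that follow-up call; the exact wording of the question is our own:

```python
# The chat object threads the earlier turns in automatically,
# so the model sees the full conversation context.
response = chat.send_message("How can I fine-tune a Large Language Model?")
Markdown(response.text)
```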
We can see that the model answered the follow-up question, and the chat history now contains the first question and its answer as well.
6. Embeddings Model with Gemini
Embedding models are gaining popularity in the realm of context-aware applications. The Gemini embedding-001 model facilitates the transformation of words, sentences, or entire documents into dense vectors that capture semantic meaning. This vectorized representation enables straightforward comparisons of textual similarity by evaluating the corresponding embedding vectors.
We can pass the text to the `embed_content` function to convert it into embeddings.
output = genai.embed_content(
    model="models/embedding-001",
    content="Can you please guide me on how to start playing Dota 2?",
    task_type="retrieval_document",
    title="Embedding of Dota 2 question")

print(output['embedding'][0:10])
[0.060604308, -0.023885584, -0.007826327, -0.070592545, 0.021225851, 0.043229062, 0.06876691, 0.049298503, 0.039964676, 0.08291664]
We can convert multiple chunks of text into embeddings by passing a list of strings to the `content` argument.
output = genai.embed_content(
    model="models/embedding-001",
    content=[
        "Can you please guide me on how to start playing Dota 2?",
        "Which Dota 2 heroes should I start with?",
    ],
    task_type="retrieval_document",
    title="Embedding of Dota 2 question")

for emb in output['embedding']:
    print(emb[:10])
[0.060604308, -0.023885584, -0.007826327, -0.070592545, 0.021225851, 0.043229062, 0.06876691, 0.049298503, 0.039964676, 0.08291664]
[0.04775657, -0.044990525, -0.014886052, -0.08473655, 0.04060122, 0.035374347, 0.031866882, 0.071754575, 0.042207796, 0.04577447]
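With the vectors in hand, comparing texts reduces to comparing vectors. A minimal, self-contained sketch of cosine similarity using NumPy, with toy vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real Gemini embeddings have many more dimensions.
v1 = [0.06, -0.02, -0.01, -0.07]
v2 = [0.05, -0.04, -0.01, -0.08]
print(round(cosine_similarity(v1, v2), 3))
```

Values close to 1 indicate semantically similar texts; in a retrieval setting you would rank documents by this score against a query embedding.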
Are you looking to start a career in data science and AI and do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM