bene : studio is a global consultancy, helping startups, enterprises and HealthTech companies to have better product
Google Gemini from a Software Engineering POV
Article by Máté Nagy, Software Engineer
Unsure where to start with the Gemini AI API? This guide simplifies your journey by summarizing the key functionalities relevant to our use cases. Dive into clear explanations and practical examples, without wading through bulky documentation. Explore further details at your own pace with the provided references. Let’s discover the potential of Gemini AI together!
The Gemini API documentation is concise and well written, making it easy to familiarize yourself with Gemini AI's main features. Below I summarize the key points essential for our use cases; the bibliography at the end lists the original documentation for anyone who wishes to explore more thoroughly.
Terms of service extract
Here are some points from the Terms of Service that can be important in the decision of whether Gemini AI should be used in a project. The key takeaways are:
We cannot use Gemini as a means to make a diagnosis, or even to aid that process. In applications where FDA approval is needed, use of the service is mostly off the table. This does not mean we cannot use it to create a better user experience in an application, but the permissible use cases in such software are very limited.
Use of the service is limited to products serving end users: building applications that provide APIs for other products is prohibited. Creating an API as a layer within a product, supplying data to its own front end, is an allowed use.
Another drawback is that by using Gemini AI through any interface, we grant Google a license to use our content for improvement and development of its services. As part of developing Gemini AI, human reviewers can read, annotate, and process the service's inputs and outputs, which would most likely grossly violate HIPAA.
The Terms of Service also state that any output provided by their LLMs does not constitute medical, legal, financial, or other professional advice, nor medical treatment or diagnosis. Providing services that legally fall into those categories is therefore highly discouraged and could expose the company to the related liabilities.
These terms could perhaps be circumvented by classifying a service as a wellness product, so that it does not fall under the laws regulating clinical practice. Such solutions should be considered with care, however, as the US legal system is very open to litigation in this area. Every product skirting the boundaries of the mentioned uses should be reviewed with legal professionals in the matter. This introduction (like the output of Google's LLMs) does not constitute legal advice.
You will only use the Services directly or in connection with a service that you offer directly to end users, i.e., you will not use the Services (e.g., Gemini API) to power another application programming interface.
– Generative AI APIs Additional Terms of Service
Most importantly:
You may not use the Services in clinical practice, to provide medical advice, or in any manner that is overseen by or requires clearance or approval from a medical device regulatory agency.
– Generative AI APIs Additional Terms of Service
The license you grant to Google under the “Submission of Content” section in the API Terms also extends to any content (e.g., prompts, images, sources) you submit through the Services or any other API interface (e.g., Firebase Extensions). Google uses this data, consistent with our Privacy Policy, to provide, improve, and develop Google products and services and machine learning technologies, including Google’s enterprise features, products, and services.
– Generative AI APIs Additional Terms of Service
To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account and API key before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Services.
– Generative AI APIs Additional Terms of Service
Don’t rely on the Services for medical, legal, financial, or other professional advice. Any content regarding those topics is provided for informational purposes only and is not a substitute for advice from a qualified professional. Content does not constitute medical treatment or diagnosis.
– Generative AI APIs Additional Terms of Service
Gemini AI’s features
Gemini AI consists of two models: gemini-pro and gemini-pro-vision. The main difference between the two is that gemini-pro-vision accepts prompts containing images. While gemini-pro only accepts and outputs text, gemini-pro-vision requires at least one image, accepts as many as 16 individual images, and optionally takes additional text as part of the prompt. Providing text to accompany image prompts enables captioning and similar tasks.
Google provides two versions of their AI models' API: v1 and v1beta. While the beta version has more features available, it is not recommended for production use, as it is subject to major breaking changes. The stable v1 version will not receive breaking changes to its interface and will receive full support over the lifetime of the version.
Gemini Pro
Since most use cases will not require image-based prompts and the gemini-pro model is more feature-rich, we will dive deeper into its features. Most of these features are interfaces to the underlying LLM that add extra functionality on top of it.
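As a sketch of how the model is accessed over the REST API, the snippet below sends a single text prompt to gemini-pro through the v1beta generateContent endpoint. The helper names and the GEMINI_API_KEY environment variable are my own conventions, not part of the API; the request and response shapes follow the REST reference.

```python
import json
import os
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"  # swap in v1 for the stable surface

def build_text_request(prompt: str) -> dict:
    # Minimal generateContent body: one user turn with a single text part.
    return {"contents": [{"parts": [{"text": prompt}]}]}

def generate(prompt: str, api_key: str, model: str = "gemini-pro") -> str:
    url = f"{API_ROOT}/models/{model}:generateContent?key={api_key}"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_text_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The generated text sits in the first candidate's first content part.
    return body["candidates"][0]["content"]["parts"][0]["text"]

if os.getenv("GEMINI_API_KEY"):  # only make the network call when a key is configured
    print(generate("Explain tokens in one sentence.", os.environ["GEMINI_API_KEY"]))
```

Google also ships official client libraries that wrap these calls; the raw request is shown here only to make the payload shape explicit.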
Multi-turn conversations
This feature provides a chat-like mode for accessing the model’s functionality. This mode retains the user’s chat history for the duration of the conversation. You can also customize the experience of the user by providing instructions and context in the form of conversation history when initializing the chat. This way the model can provide a more fine-tuned output.
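A conversation is represented in the request body as a list of turns with alternating "user" and "model" roles, and the seeded instructions simply become the first turns of that history. The helper below is a hypothetical sketch of building such a body for generateContent; the field names follow the v1beta REST reference.

```python
def build_chat_request(history, new_message):
    """history: list of (role, text) tuples, roles alternating "user"/"model"."""
    contents = [{"role": role, "parts": [{"text": text}]} for role, text in history]
    contents.append({"role": "user", "parts": [{"text": new_message}]})
    return {"contents": contents}

# Customize the model by seeding the history with an instruction turn
# and an acknowledging model turn before the real conversation starts.
seed = [
    ("user", "You are a friendly cooking assistant. Keep answers short."),
    ("model", "Understood! I will keep my cooking advice short."),
]
request = build_chat_request(seed, "How long should I boil an egg?")
```

After each exchange you append the model's reply and the user's next message to the history, so the model keeps the full conversation as context.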
Streamed responses
Streamed responses are a feature for normal text-based prompting and chat. The feature enables you to pass forward the response while it is generated and not wait for the whole response to return it to the user.
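Streamed responses arrive as a sequence of partial response chunks, each shaped like a full response but carrying only the newly generated text. The sketch below shows one way to forward each partial text onward while assembling the full answer; the sample chunk shape mirrors the generateContent response, while the helper names are my own.

```python
def chunk_text(chunk: dict) -> str:
    # Each streamed chunk has the same shape as a full response.
    parts = chunk["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)

def forward_stream(chunks, emit):
    """Pass each partial text to `emit` as it arrives; return the full text."""
    collected = []
    for chunk in chunks:
        text = chunk_text(chunk)
        emit(text)  # e.g. write to an open HTTP response toward the user
        collected.append(text)
    return "".join(collected)
```

This way the user starts reading the beginning of the answer while the rest is still being generated.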
Embeddings
Embeddings are vector representations of phrases, used for semantic search, text classification, and clustering; for example, you can use them to search for relevant context in vector databases. Since this is a heavy topic and a niche use case, I will leave the in-depth explanation to the documentation.
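To give a flavor of the workflow: you request a vector for each text via the v1beta embedContent endpoint (model embedding-001), then compare vectors with a similarity measure such as cosine similarity. The request-body helper follows the REST reference; the similarity function is plain math, not part of the API.

```python
import math

def build_embed_request(text: str) -> dict:
    # Body for the embedContent endpoint (v1beta), using the embedding-001 model.
    return {"model": "models/embedding-001", "content": {"parts": [{"text": text}]}}

def cosine_similarity(a, b):
    # Semantic search ranks stored vectors by their similarity to the query vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In a real application the stored vectors would live in a vector database, which performs this ranking at scale.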
Technical details
The API provides access to both the Gemini model family and the legacy PaLM generative AI models; we won't be exploring the PaLM models in this document. Google provides rich client libraries for development, and their documentation is thorough; please check it for reference when needed. In addition to the REST API, client libraries are available for Python, Go, Node.js, Web (JavaScript), Android (Kotlin), and Swift.
Model limits
The tables below list each model's limits and properties as of writing this tutorial (December 2023). You can also query the actual metadata and attributes of the models through the API.
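Querying a model's metadata is a plain GET on the model resource; the returned JSON includes fields such as inputTokenLimit and outputTokenLimit. A minimal sketch, assuming a GEMINI_API_KEY environment variable (my own convention):

```python
import json
import os
import urllib.request

def model_info_url(model: str, api_key: str, version: str = "v1") -> str:
    # GET on a model resource returns its metadata (token limits, etc.).
    return (f"https://generativelanguage.googleapis.com/{version}"
            f"/models/{model}?key={api_key}")

if os.getenv("GEMINI_API_KEY"):  # only fetch when a key is configured
    url = model_info_url("gemini-pro", os.environ["GEMINI_API_KEY"])
    with urllib.request.urlopen(url) as resp:
        info = json.load(resp)
    print(info.get("inputTokenLimit"), info.get("outputTokenLimit"))
```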
Gemini Pro’s limits:
Property | Value |
---|---|
Input | text |
Output | text |
Functions | Generates text; handles multi-turn conversational format |
Shots | Can handle zero-, one-, and few-shot tasks |
Input token limit | 30720 |
Output token limit | 2048 |
Rate limit | 60 requests per minute |
Gemini Pro Vision’s limits:
Property | Value |
---|---|
Input | text and images |
Output | text |
Functions | Can take multimodal inputs: text and images |
Shots | Can handle zero-, one-, and few-shot tasks |
Input token limit | 12288 |
Input image limits | Maximum of 16 individual images; maximum of 4 MB for the entire prompt, including images and text; no specific limit on the number of pixels in an image, but larger images are scaled down to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio |
Image MIME types | PNG – image/png, JPEG – image/jpeg, WEBP – image/webp, HEIC – image/heic, HEIF – image/heif |
Output token limit | 4096 |
Rate limit | 60 requests per minute |
The embedding model's limits:
Property | Value |
---|---|
Input | text |
Output | text embeddings |
Functions | Generates text embeddings for the input text |
Input token limit | Optimized for creating embeddings for text of up to 2048 tokens |
Rate limit | 1500 requests per minute |
A little intro to LLMs to get you started
This section goes into some concepts of Large Language Models to provide insight into the usage and prompt engineering necessary to create applications. These terms are generally used across the industry, so the definitions provided here will not surprise anyone already familiar with an alternative LLM.
Model parameters
Max output tokens
A token is approximately four characters, so 100 tokens correspond to roughly 60–80 words. This parameter caps the number of tokens generated, so you can bound the length of the answer you desire.
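The four-characters-per-token rule of thumb gives a quick way to sanity-check prompt and output sizes against the token limits before sending a request. A rough heuristic only; real tokenization varies by content:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: one token is roughly four characters of English text.
    return max(1, round(len(text) / 4))

# A 30720-token input limit therefore fits on the order of ~120k characters.
assert estimate_tokens("a" * 400) == 100
```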
Temperature
Temperature controls the randomness of sampling and is applied together with topK and topP. Set to 0, the model always gives the same deterministic answer for a prompt. Higher-temperature answers seem more creative, but sometimes you want the more stable results of a lower setting.
topK
topK controls how many of the most probable tokens in the model's vocabulary are considered at each step: with a topK of 1, the model always picks the single most probable token. The resulting shortlist of candidate tokens is then filtered further by topP.
topP
The topP parameter filters the list of candidate tokens further: tokens are added from most probable to least probable until the sum of their probabilities reaches the topP value. The default value for this parameter is 0.95.
stop_sequences
You can set a sequence of characters to be used as a stop sequence. When the defined sequence appears in the generated output, the generation will stop.
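Pulling the parameters above together, here is a sketch of a generationConfig body as it would be sent alongside the prompt in a REST request. The field names follow the v1beta reference; the helper itself and its validation are hypothetical.

```python
def build_generation_config(temperature=None, top_k=None, top_p=None,
                            max_output_tokens=None, stop_sequences=None):
    """Return a generationConfig dict, leaving out unset parameters."""
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    config = {
        "temperature": temperature,
        "topK": top_k,
        "topP": top_p,
        "maxOutputTokens": max_output_tokens,
        "stopSequences": stop_sequences,
    }
    return {key: value for key, value in config.items() if value is not None}

# A fairly deterministic configuration that stops at the first blank line.
config = build_generation_config(temperature=0.2, top_p=0.95,
                                 max_output_tokens=256, stop_sequences=["\n\n"])
```

The resulting dict would be attached to the request body as its "generationConfig" field.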
Range of parameters
Model parameter | Values (range) |
---|---|
Candidate count | 1-8 (integer) |
Temperature | 0.0-1.0 |
Max output tokens | Use get_model (Python) to determine the maximum number of tokens for the model you are using. |
TopP | 0.0-1.0 |
Types of prompts
Zero-shot prompts
The prompt only relies on instructions and optionally on content to process. The developer does not provide examples.
Commonly used patterns:
- Instruction – content
- Instruction – content – instruction
- Continuation (continue the provided input)
One-shot prompts
The prompt contains a single example of the pattern to be recognized by the LLM. The example is typically a pairing of some sort, provided as a category–value pair. By providing the pair and then a category on its own, we prompt the model to fill in the missing value.
Few-shot prompts
The concept is the same as with one-shot prompts, but by providing multiple examples, we can enable the LLM to complete complex and sophisticated analysis on a dataset.
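To make the pattern concrete, a few-shot prompt can be assembled by laying out the example pairs and leaving the last value blank for the model to fill in. The helper and its labels are illustrative, not an API feature:

```python
def few_shot_prompt(examples, query, input_label="Text", output_label="Sentiment"):
    """Lay out category-value example pairs, then leave the final value blank."""
    lines = []
    for text, label in examples:
        lines += [f"{input_label}: {text}", f"{output_label}: {label}"]
    # The trailing empty label prompts the model to complete the pattern.
    lines += [f"{input_label}: {query}", f"{output_label}:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("I love this phone.", "positive"),
     ("The battery died in a day.", "negative")],
    "Great screen, terrible speakers.",
)
```

With one example pair this degenerates into a one-shot prompt; with none, a zero-shot prompt.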
Safety settings
One of the most important settings in an AI application concerns safety. The Gemini API provides safety settings that set boundaries for the model's generated output, and these can be adjusted to the needs of the application. In addition to the settings available to the developer, some core safeguards are pre-set and not modifiable; these mainly cover content that endangers child safety and other core harms. The adjustable safety filters cover the following categories:
- Harassment
- Hate speech
- Sexually explicit
- Dangerous
You can optionally allow some content that falls under the previously mentioned categories, depending on your use case. The Gemini API's safety filters block content based on the probability of it being harmful, not its severity, so you need to carefully test and consider the level of blocking your application needs. Safety settings are part of the request you send to the service, so they can be adjusted on a per-request basis.
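Per-request adjustment means the safetySettings list simply rides along in the request body next to the prompt. A sketch under the v1beta field names, with a hypothetical validating helper:

```python
# The four adjustable harm categories, as named in the REST reference.
ADJUSTABLE_CATEGORIES = {
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
}

def build_safety_settings(thresholds: dict) -> list:
    """Map {category: threshold} into the safetySettings list sent per request."""
    for category in thresholds:
        if category not in ADJUSTABLE_CATEGORIES:
            raise ValueError(f"not an adjustable category: {category}")
    return [{"category": c, "threshold": t} for c, t in thresholds.items()]

# Loosen the harassment filter for this request; keep dangerous content strict.
settings = build_safety_settings({
    "HARM_CATEGORY_HARASSMENT": "BLOCK_ONLY_HIGH",
    "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_LOW_AND_ABOVE",
})
request_body = {
    "contents": [{"parts": [{"text": "..."}]}],
    "safetySettings": settings,
}
```

Categories not listed in the request keep their default threshold.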
Definitions of safety settings
Categories | Descriptions |
---|---|
Harassment | Negative or harmful comments targeting identity and/or protected attributes. |
Hate speech | Content that is rude, disrespectful, or profane. |
Sexually explicit | Contains references to sexual acts or other lewd content. |
Dangerous | Promotes, facilitates, or encourages harmful acts. |
The exact API reference for the category ratings of harm can be found here.
The following table describes the settings available for blocking unwanted content. You can adjust these settings for each category we discussed to accommodate your use case. The settings compare against the probability of the output falling into one of the categories. For example, if you only want to block content with a high probability of being harmful in a category, set that category's level to Block few; everything with a lower probability will be let through.
The default setting for all categories is Block some.
Threshold (Google AI Studio) | Threshold (API) | Description |
---|---|---|
Block none | BLOCK_NONE | Always show regardless of the probability of unsafe content |
Block few | BLOCK_ONLY_HIGH | Block when high probability of unsafe content |
Block some | BLOCK_MEDIUM_AND_ABOVE | Block when medium or high probability of unsafe content |
Block most | BLOCK_LOW_AND_ABOVE | Block when low, medium or high probability of unsafe content |
– | HARM_BLOCK_THRESHOLD_UNSPECIFIED | Threshold is unspecified; block using default threshold |
Safety feedback
If some content is caught by a filter, Gemini AI will block that content. The API response will contain a safety feedback field indicating which safety setting was triggered and how severe the offence was. The offence is described as a safety rating, which consists of the offending category and the probability of the harm. The probability classifications can be seen in the table below.
Probability | Description |
---|---|
NEGLIGIBLE | Content has a negligible probability of being unsafe |
LOW | Content has a low probability of being unsafe |
MEDIUM | Content has a medium probability of being unsafe |
HIGH | Content has a high probability of being unsafe |
For example, if the output from the model contains text related to harassment and the filter blocks it for having a high probability of falling into that category, the safety rating in the response will have the category field set to HARASSMENT and the probability set to HIGH.
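In code, reading that feedback means walking the safety ratings in the response. The sketch below extracts ratings at or above a given probability level from a blocked-prompt response; the field names follow the v1beta REST reference, while the sample response and helper are illustrative.

```python
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

def triggered_ratings(response: dict, min_probability="HIGH"):
    """Collect (category, probability) pairs at or above a probability level."""
    threshold = PROBABILITY_ORDER.index(min_probability)
    ratings = []
    # Ratings for a blocked prompt live under promptFeedback; ratings for
    # generated output live on each candidate instead.
    for rating in response.get("promptFeedback", {}).get("safetyRatings", []):
        if PROBABILITY_ORDER.index(rating["probability"]) >= threshold:
            ratings.append((rating["category"], rating["probability"]))
    return ratings

# Illustrative shape of a response whose prompt was blocked for harassment.
blocked = {
    "promptFeedback": {
        "blockReason": "SAFETY",
        "safetyRatings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "probability": "HIGH"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE"},
        ],
    }
}
```

Logging these ratings during testing is a practical way to tune the thresholds discussed above.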
Safety considerations from the viewpoint of the developer
Gemini AI's documentation recommends keeping the following points in mind while developing an AI-integrated application. Google recommends an iterative approach, repeating the "considering adjustments" and "performing tests" steps until an acceptable configuration is reached:
- Understanding the safety risks of your application
- Considering adjustments to mitigate safety risks
- Performing safety testing appropriate to your use case
- Soliciting feedback from users and monitoring usage
Accessing Gemini services
To use Gemini you will need an API key, which you can obtain by following the relevant documentation. For now, you will need to fully enable the Workspace Early Access apps for your Gmail account and grant access to some of the data related to it, and even then it is not guaranteed that you will be able to create an API key.
Bibliography
Gemini API overview
Terms of service
API versions
Oauth
Embeddings guide
LLM concepts
Intro into prompting
Prompt best practices
Multimodal prompts
Semantic retrieval
Function calling
Safety settings
Safety guidance
Harm category reference
Safety settings reference
Content filter reference
Safety feedback reference