bene : studio is a global consultancy, helping startups, enterprises and HealthTech companies to have better product
Google Gemini from a Software Engineering POV
Article by Máté Nagy, Software Engineer
Unsure where to start with the Gemini AI API? This guide simplifies your journey by summarizing the key functionalities relevant to our use cases. Dive into clear explanations and practical examples, without wading through bulky documentation. Explore further details at your own pace with the provided references. Let’s discover the potential of Gemini AI together!
The Gemini API documentation is concise and well written, making it easy to familiarize yourself with Gemini AI's main features. Below I summarize the key points essential for our use cases; the bibliography at the end lists the original documentation for anyone who wishes to explore more thoroughly.
Terms of service extract
Here are some points from the Terms of Service that can be important in the decision of whether Gemini AI should be used in a project. The key takeaways are:
We cannot use Gemini as a means to make a diagnosis, or even to aid that process. In applications where FDA approval is needed, use of the service is mostly off the table. This does not mean we cannot use it to create a better user experience in an application, but the permissible use cases in such software are very limited.
Use of the service is limited to products serving end users: building applications that provide APIs for other products is prohibited. Creating an API as a layer within a product, supplying data to its own front end, is an allowed use.
Another drawback is that by using Gemini AI through any interface, we grant Google a license to use our content for improvement and development of its services. As part of developing Gemini AI, human reviewers can read, annotate, and process the service's inputs and outputs, which would most likely grossly violate HIPAA.
The Terms of Service also state that any output provided by their LLMs does not constitute medical, legal, financial, or other professional advice, nor medical treatment or diagnosis. Providing services that legally fall into those categories is therefore highly discouraged and could expose the company to the related liabilities.
These terms could perhaps be circumvented by classifying a service as a wellness product, so that it does not fall under the laws regulating clinical practice. Such solutions should be considered with care, however, as the US legal system is very open to litigation in this area. Every product skirting the boundaries of the mentioned uses should be reviewed with legal professionals in the matter. This introduction (like the output of Google's LLMs) does not constitute legal advice.
You will only use the Services directly or in connection with a service that you offer directly to end users, i.e., you will not use the Services (e.g., Gemini API) to power another application programming interface.
– Generative AI APIs Additional Terms of Service
Most importantly:
You may not use the Services in clinical practice, to provide medical advice, or in any manner that is overseen by or requires clearance or approval from a medical device regulatory agency.
– Generative AI APIs Additional Terms of Service
The license you grant to Google under the “Submission of Content” section in the API Terms also extends to any content (e.g., prompts, images, sources) you submit through the Services or any other API interface (e.g., Firebase Extensions). Google uses this data, consistent with our Privacy Policy, to provide, improve, and develop Google products and services and machine learning technologies, including Google’s enterprise features, products, and services.
– Generative AI APIs Additional Terms of Service
To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account and API key before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Services.
– Generative AI APIs Additional Terms of Service
Don’t rely on the Services for medical, legal, financial, or other professional advice. Any content regarding those topics is provided for informational purposes only and is not a substitute for advice from a qualified professional. Content does not constitute medical treatment or diagnosis.
– Generative AI APIs Additional Terms of Service
Gemini AI’s features
Gemini AI consists of two models: gemini-pro and gemini-pro-vision. The main difference between the two is that gemini-pro-vision accepts prompts containing images. While gemini-pro only accepts and outputs text, gemini-pro-vision requires at least one image, accepts as many as 16 individual images, and optionally takes additional text as part of the prompt. Providing text to accompany image prompts enables captioning and similar tasks.
Google provides two versions of their AI models' API: v1 and v1beta. While the beta version has more features available, it is not recommended for production use, as it is subject to major breaking changes. The stable v1 version will not receive breaking changes to its interface and will receive full support over the lifetime of the version.
Gemini Pro
Since most use cases will not require image-based prompts and the gemini-pro model is more feature-rich, we will dive deeper into its features. Most of these features are interfaces to the underlying LLM that add extra functionality on top of it.
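As a sketch of how the model is accessed over the REST API, the snippet below sends a single text prompt to gemini-pro through the v1beta generateContent endpoint. The helper names and the GEMINI_API_KEY environment variable are my own conventions, not part of the API; the request and response shapes follow the REST reference.

```python
import json
import os
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"  # swap in v1 for the stable surface

def build_text_request(prompt: str) -> dict:
    # Minimal generateContent body: one user turn with a single text part.
    return {"contents": [{"parts": [{"text": prompt}]}]}

def generate(prompt: str, api_key: str, model: str = "gemini-pro") -> str:
    url = f"{API_ROOT}/models/{model}:generateContent?key={api_key}"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_text_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The generated text sits in the first candidate's first content part.
    return body["candidates"][0]["content"]["parts"][0]["text"]

if os.getenv("GEMINI_API_KEY"):  # only make the network call when a key is configured
    print(generate("Explain tokens in one sentence.", os.environ["GEMINI_API_KEY"]))
```

Google also ships official client libraries that wrap these calls; the raw request is shown here only to make the payload shape explicit.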
Multi-turn conversations
This feature provides a chat-like mode for accessing the model’s functionality. This mode retains the user’s chat history for the duration of the conversation. You can also customize the experience of the user by providing instructions and context in the form of conversation history when initializing the chat. This way the model can provide a more fine-tuned output.
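A conversation is represented in the request body as a list of turns with alternating "user" and "model" roles, and the seeded instructions simply become the first turns of that history. The helper below is a hypothetical sketch of building such a body for generateContent; the field names follow the v1beta REST reference.

```python
def build_chat_request(history, new_message):
    """history: list of (role, text) tuples, roles alternating "user"/"model"."""
    contents = [{"role": role, "parts": [{"text": text}]} for role, text in history]
    contents.append({"role": "user", "parts": [{"text": new_message}]})
    return {"contents": contents}

# Customize the model by seeding the history with an instruction turn
# and an acknowledging model turn before the real conversation starts.
seed = [
    ("user", "You are a friendly cooking assistant. Keep answers short."),
    ("model", "Understood! I will keep my cooking advice short."),
]
request = build_chat_request(seed, "How long should I boil an egg?")
```

After each exchange you append the model's reply and the user's next message to the history, so the model keeps the full conversation as context.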
Streamed responses
Streamed responses are a feature for normal text-based prompting and chat. The feature enables you to pass forward the response while it is generated and not wait for the whole response to return it to the user.
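Streamed responses arrive as a sequence of partial response chunks, each shaped like a full response but carrying only the newly generated text. The sketch below shows one way to forward each partial text onward while assembling the full answer; the sample chunk shape mirrors the generateContent response, while the helper names are my own.

```python
def chunk_text(chunk: dict) -> str:
    # Each streamed chunk has the same shape as a full response.
    parts = chunk["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)

def forward_stream(chunks, emit):
    """Pass each partial text to `emit` as it arrives; return the full text."""
    collected = []
    for chunk in chunks:
        text = chunk_text(chunk)
        emit(text)  # e.g. write to an open HTTP response toward the user
        collected.append(text)
    return "".join(collected)
```

This way the user starts reading the beginning of the answer while the rest is still being generated.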
Embeddings
Embeddings are vector representations of phrases, used for semantic search, text classification, and clustering; for example, you can use them to search for relevant context in vector databases. Since this is a heavy topic and a niche use case, I will leave the in-depth explanation to the documentation.
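To give a flavor of the workflow: you request a vector for each text via the v1beta embedContent endpoint (model embedding-001), then compare vectors with a similarity measure such as cosine similarity. The request-body helper follows the REST reference; the similarity function is plain math, not part of the API.

```python
import math

def build_embed_request(text: str) -> dict:
    # Body for the embedContent endpoint (v1beta), using the embedding-001 model.
    return {"model": "models/embedding-001", "content": {"parts": [{"text": text}]}}

def cosine_similarity(a, b):
    # Semantic search ranks stored vectors by their similarity to the query vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In a real application the stored vectors would live in a vector database, which performs this ranking at scale.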
Technical details
The API provides access to both the Gemini model family and the legacy PaLM generative AI models; we won't be exploring the PaLM models in this document. Google provides rich client libraries for development, and their documentation is thorough; please check it for reference when needed. In addition to the REST API, client libraries are available for Python, Go, Node.js, Web (JavaScript), Android (Kotlin), and Swift.
Model limits
The tables below list each model's limits and properties as of writing this tutorial (December 2023). You can also query the actual metadata and attributes of the models through the API.
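Querying a model's metadata is a plain GET on the model resource; the returned JSON includes fields such as inputTokenLimit and outputTokenLimit. A minimal sketch, assuming a GEMINI_API_KEY environment variable (my own convention):

```python
import json
import os
import urllib.request

def model_info_url(model: str, api_key: str, version: str = "v1") -> str:
    # GET on a model resource returns its metadata (token limits, etc.).
    return (f"https://generativelanguage.googleapis.com/{version}"
            f"/models/{model}?key={api_key}")

if os.getenv("GEMINI_API_KEY"):  # only fetch when a key is configured
    url = model_info_url("gemini-pro", os.environ["GEMINI_API_KEY"])
    with urllib.request.urlopen(url) as resp:
        info = json.load(resp)
    print(info.get("inputTokenLimit"), info.get("outputTokenLimit"))
```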
Gemini Pro’s limits:
Property | Value |
---|---|
Input | text |
Output | text |
Functions | Generates text; handles multi-turn conversational format |
Shots | Can handle zero-, one-, and few-shot tasks |
Input token limit | 30720 |
Output token limit | 2048 |
Rate limit | 60 requests per minute |
Gemini Pro Vision’s limits:
Property | Value |
---|---|
Input | text and images |
Output | text |
Functions | Can take multimodal inputs: text and images |
Shots | Can handle zero-, one-, and few-shot tasks |
Input token limit | 12288 |
Input image limits | Maximum of 16 individual images; maximum of 4 MB for the entire prompt, including images and text; no specific limit on the number of pixels in an image, but larger images are scaled down to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio |
Image MIME types | PNG – image/png, JPEG – image/jpeg, WEBP – image/webp, HEIC – image/heic, HEIF – image/heif |
Output token limit | 4096 |
Rate limit | 60 requests per minute |
The embedding model's limits:
Property | Value |
---|---|
Input | text |
Output | text embeddings |
Functions | Generates text embeddings for the input text |
Input token limit | Optimized for creating embeddings for text of up to 2048 tokens |
Rate limit | 1500 requests per minute |
A little intro to LLMs to get you started
This section goes into some concepts of Large Language Models to provide insight into the usage and prompt engineering necessary to create applications. These terms are generally used across the industry, so the definitions provided here will not surprise anyone already familiar with an alternative LLM.
Model parameters
Max output tokens
A token is approximately four characters, so 100 tokens correspond to roughly 60–80 words. This parameter caps the number of tokens generated, so you can bound the length of the answer you desire.
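The four-characters-per-token rule of thumb gives a quick way to sanity-check prompt and output sizes against the token limits before sending a request. A rough heuristic only; real tokenization varies by content:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: one token is roughly four characters of English text.
    return max(1, round(len(text) / 4))

# A 30720-token input limit therefore fits on the order of ~120k characters.
assert estimate_tokens("a" * 400) == 100
```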
Temperature
Temperature controls the randomness of sampling and is applied together with topK and topP. Set to 0, the model always gives the same deterministic answer for a prompt. Higher-temperature answers seem more creative, but sometimes you want the more stable results of a lower setting.
topK
topK controls how many of the most probable tokens in the model's vocabulary are considered at each step: with a topK of 1, the model always picks the single most probable token. The resulting shortlist of candidate tokens is then filtered further by topP.
topP
The topP parameter filters the list of candidate tokens further: tokens are added from most probable to least probable until the sum of their probabilities reaches the topP value. The default value for this parameter is 0.95.
stop_sequences
You can set a sequence of characters to be used as a stop sequence. When the defined sequence appears in the generated output, the generation will stop.
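Pulling the parameters above together, here is a sketch of a generationConfig body as it would be sent alongside the prompt in a REST request. The field names follow the v1beta reference; the helper itself and its validation are hypothetical.

```python
def build_generation_config(temperature=None, top_k=None, top_p=None,
                            max_output_tokens=None, stop_sequences=None):
    """Return a generationConfig dict, leaving out unset parameters."""
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    config = {
        "temperature": temperature,
        "topK": top_k,
        "topP": top_p,
        "maxOutputTokens": max_output_tokens,
        "stopSequences": stop_sequences,
    }
    return {key: value for key, value in config.items() if value is not None}

# A fairly deterministic configuration that stops at the first blank line.
config = build_generation_config(temperature=0.2, top_p=0.95,
                                 max_output_tokens=256, stop_sequences=["\n\n"])
```

The resulting dict would be attached to the request body as its "generationConfig" field.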
Range of parameters
Model parameter | Values (range) |
---|---|
Candidate count | 1-8 (integer) |
Temperature | 0.0-1.0 |
Max output tokens | Use get_model (Python) to determine the maximum number of tokens for the model you are using. |
TopP | 0.0-1.0 |
Types of prompts
Zero-shot prompts
The prompt only relies on instructions and optionally on content to process. The developer does not provide examples.
Commonly used patterns:
- Instruction – content
- Instruction – content – instruction
- Continuation (continue the provided input)
One-shot prompts
The prompt contains a single example of the pattern to be recognized by the LLM. The example is typically a pairing of some sort, provided as a category–value pair. By providing the pair and then a category on its own, we prompt the model to fill in the missing value.
Few-shot prompts
The concept is the same as with one-shot prompts, but by providing multiple examples, we can enable the LLM to complete complex and sophisticated analysis on a dataset.
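To make the pattern concrete, a few-shot prompt can be assembled by laying out the example pairs and leaving the last value blank for the model to fill in. The helper and its labels are illustrative, not an API feature:

```python
def few_shot_prompt(examples, query, input_label="Text", output_label="Sentiment"):
    """Lay out category-value example pairs, then leave the final value blank."""
    lines = []
    for text, label in examples:
        lines += [f"{input_label}: {text}", f"{output_label}: {label}"]
    # The trailing empty label prompts the model to complete the pattern.
    lines += [f"{input_label}: {query}", f"{output_label}:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("I love this phone.", "positive"),
     ("The battery died in a day.", "negative")],
    "Great screen, terrible speakers.",
)
```

With one example pair this degenerates into a one-shot prompt; with none, a zero-shot prompt.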
Safety settings
One of the most important settings in an AI application concerns safety. The Gemini API provides safety settings that set boundaries for the model's generated output, and these can be adjusted to the needs of the application. In addition to the settings available to the developer, some core safeguards are pre-set and not modifiable; these mainly cover content that endangers child safety and other core harms. The adjustable safety filters cover the following categories:
- Harassment
- Hate speech
- Sexually explicit
- Dangerous
You can optionally allow some content that falls under the previously mentioned categories, depending on your use case. The Gemini API's safety filters block content based on the probability of it being harmful, not its severity, so you need to carefully test and consider the level of blocking your application needs. Safety settings are part of the request you send to the service, so they can be adjusted on a per-request basis.
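Per-request adjustment means the safetySettings list simply rides along in the request body next to the prompt. A sketch under the v1beta field names, with a hypothetical validating helper:

```python
# The four adjustable harm categories, as named in the REST reference.
ADJUSTABLE_CATEGORIES = {
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
}

def build_safety_settings(thresholds: dict) -> list:
    """Map {category: threshold} into the safetySettings list sent per request."""
    for category in thresholds:
        if category not in ADJUSTABLE_CATEGORIES:
            raise ValueError(f"not an adjustable category: {category}")
    return [{"category": c, "threshold": t} for c, t in thresholds.items()]

# Loosen the harassment filter for this request; keep dangerous content strict.
settings = build_safety_settings({
    "HARM_CATEGORY_HARASSMENT": "BLOCK_ONLY_HIGH",
    "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_LOW_AND_ABOVE",
})
request_body = {
    "contents": [{"parts": [{"text": "..."}]}],
    "safetySettings": settings,
}
```

Categories not listed in the request keep their default threshold.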
Definitions of safety settings
Categories | Descriptions |
---|---|
Harassment | Negative or harmful comments targeting identity and/or protected attributes. |
Hate speech | Content that is rude, disrespectful, or profane. |
Sexually explicit | Contains references to sexual acts or other lewd content. |
Dangerous | Promotes, facilitates, or encourages harmful acts. |
The exact API reference for the category ratings of harm can be found here.
The following table describes the settings available for blocking unwanted content. You can adjust these settings for each category we discussed to accommodate your use case. The settings compare against the probability of the output falling into one of the categories. For example, if you only want to block content with a high probability of being harmful in a category, set that category's level to Block few; everything with a lower probability will be let through.
The default setting for all categories is Block some.
Threshold (Google AI Studio) | Threshold (API) | Description |
---|---|---|
Block none | BLOCK_NONE | Always show regardless of the probability of unsafe content |
Block few | BLOCK_ONLY_HIGH | Block when high probability of unsafe content |
Block some | BLOCK_MEDIUM_AND_ABOVE | Block when medium or high probability of unsafe content |
Block most | BLOCK_LOW_AND_ABOVE | Block when low, medium or high probability of unsafe content |
– | HARM_BLOCK_THRESHOLD_UNSPECIFIED | Threshold is unspecified; block using default threshold |
Safety feedback
If some content is caught by a filter, Gemini AI will block that content. The API response will contain a safety feedback field indicating which safety setting was triggered and how severe the offence was. The offence is described as a safety rating, which consists of the offending category and the probability of the harm. The probability classifications can be seen in the table below.
Probability | Description |
---|---|
NEGLIGIBLE | Content has a negligible probability of being unsafe |
LOW | Content has a low probability of being unsafe |
MEDIUM | Content has a medium probability of being unsafe |
HIGH | Content has a high probability of being unsafe |
For example, if the output from the model contains text related to harassment and the filter blocks it for having a high probability of falling into that category, the safety rating in the response will have the category field set to HARASSMENT and the probability set to HIGH.
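In code, reading that feedback means walking the safety ratings in the response. The sketch below extracts ratings at or above a given probability level from a blocked-prompt response; the field names follow the v1beta REST reference, while the sample response and helper are illustrative.

```python
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

def triggered_ratings(response: dict, min_probability="HIGH"):
    """Collect (category, probability) pairs at or above a probability level."""
    threshold = PROBABILITY_ORDER.index(min_probability)
    ratings = []
    # Ratings for a blocked prompt live under promptFeedback; ratings for
    # generated output live on each candidate instead.
    for rating in response.get("promptFeedback", {}).get("safetyRatings", []):
        if PROBABILITY_ORDER.index(rating["probability"]) >= threshold:
            ratings.append((rating["category"], rating["probability"]))
    return ratings

# Illustrative shape of a response whose prompt was blocked for harassment.
blocked = {
    "promptFeedback": {
        "blockReason": "SAFETY",
        "safetyRatings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "probability": "HIGH"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE"},
        ],
    }
}
```

Logging these ratings during testing is a practical way to tune the thresholds discussed above.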
Safety considerations from the viewpoint of the developer
Gemini AI's documentation recommends keeping the following points in mind while developing an AI-integrated application. Google recommends an iterative approach, repeating the "considering adjustments" and "performing tests" steps until an acceptable configuration is reached:
- Understanding the safety risks of your application
- Considering adjustments to mitigate safety risks
- Performing safety testing appropriate to your use case
- Soliciting feedback from users and monitoring usage
Accessing Gemini services
To use Gemini you will need an API key, which you can obtain by following the relevant documentation. For now, you will need to fully enable the Workspace Early Access apps for your Gmail account and grant access to some of the data related to it, and even then it is not guaranteed that you will be able to create an API key.
Bibliography
Gemini API overview
Terms of service
API versions
Oauth
Embeddings guide
LLM concepts
Intro into prompting
Prompt best practices
Multimodal prompts
Semantic retrieval
Function calling
Safety settings
Safety guidance
Harm category reference
Safety settings reference
Content filter reference
Safety feedback reference