QuantumBar Keeper

Using GPT3 with speech to text and text to speech to create a nice bartender AI.

Setup

Use the requirements.txt file to install all required packages with pip:

python3 -m pip install -r requirements.txt
pipenv install -r requirements.txt

For speech recognition you can use the following:

  • CMU Sphinx (works offline)
  • Google Speech Recognition (may require an API key for better performance)

If you want to use Sphinx and have problems installing pocketsphinx, try installing it with pipwin:

python3 -m pip install pipwin
python3 -m pipwin install pocketsphinx

If pip can't find the transformers package, install it directly from GitHub:

python3 -m pip install git+https://github.com/huggingface/transformers

Duplicate config_template.json and rename the copy to config.json. You can make any adjustments you prefer in the new config.json. The program always uses the content of the file named config.json.
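The duplication step can also be scripted. A minimal sketch (the helper name is made up here; run it from the project root where config_template.json lives):

```python
# Create config.json from the template unless one already exists,
# so a customized config.json is never overwritten.
import shutil
from pathlib import Path

def ensure_config(template="config_template.json", target="config.json"):
    """Copy the template to config.json if no config.json exists yet."""
    if not Path(target).exists():
        shutil.copy(template, target)
    return Path(target)
```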

Usage

The program uses the default microphone and speaker of the system. Please configure your inputs and outputs and their respective levels in your system settings.

Starting the program in a console:

python3 quantumbar_keeper.py

If no arguments are supplied, the first text-to-speech voice is selected and normal mode is engaged, which is the usual chit-chat between the user and the AI.

You can end the program with Ctrl+C or the API.

Normal Mode

The program prints all AI parameters in use. It then waits for a voice to respond to the line:

AI: Hello! I am the bartender of Quantum bar. How can I help you?

It will wait some time for a sound on the microphone. If no sound is heard it will ask again after a while.

When a sound is heard and interpreted as language, the speech-to-text result is shown in the console and passed to the AI. The AI answers the interpreted input and the user can react to it in turn. The chit-chat can continue as long as you want.

If the program stops hearing sounds, it resumes waiting for another human to arrive. The AI will summarize your chat with her (the summary is not read out loud).

Optional Arguments

  • -h, --help show this help message and exit
  • -m MODE, --mode MODE Change the AI mode of the program: detecthuman (detects a human and stops after the human stops responding), timed (gives the user the time specified in -ct to speak with the AI), tokenbased (gives the user time based on the token count specified in -ct), charbased (gives the user time based on the char count specified in -ct). Default: charbased
  • -ct CHATTIME, --chattime CHATTIME Set the time a human has for a session with the AI in timed mode (in seconds). Sets the max token count in tokenbased mode and the max char count in charbased mode. Default: 7000
  • -d DEADTIME, --deadtime DEADTIME Set the time the AI waits before resetting when no human is detected. Time is in seconds. Default: 20
  • -t TESTMODE, --testmode TESTMODE Change the test mode of the program: voicescan-text (tests the TTS voices on the system without them saying anything), voicescan-voice (same as voicescan-text but with the voices talking; caution, might take long!), test-all (tests everything: tts, stt, etc.), test-text_only (same as test-all but without stt), apitest (no AI bot, only the API), audio-dev (displays all audio devices), audio-test (plays a random sentence), latex (tests LaTeX PDF generation), print (tests LaTeX PDF generation and sends the result to the default printer). Leave empty for normal operation.
  • -i MIC, --mic MIC Select the microphone ID for speech-to-text operations
  • -o OUTPUT, --output OUTPUT Select the output device ID for AI output
  • -p PHRASELIMIT, --phraselimit PHRASELIMIT The phrase_time_limit parameter is the maximum number of seconds a phrase may continue before recording stops and the part of the phrase processed so far is returned. The resulting audio is the phrase cut off at the time limit. If phrase_time_limit is None, there is no phrase time limit. Time is in seconds. Default: None
  • -r RECTIMEOUT, --rectimeout RECTIMEOUT The recognizer timeout parameter is the maximum number of seconds to wait for a phrase to start before giving up. Time is in seconds. Default: 10
  • -e RECENERGY, --recenergy RECENERGY Set the speech recognizer's energy threshold. If the dynamic threshold is activated, this is the starting value. Higher values make it less sensitive. Default: 400
  • -g, --recdynenable Set this to turn on the speech recognizer's dynamic energy threshold. Default: off
  • -l, --createpdfs Set this to turn on the LaTeX PDF generation of the summary. Default: off
  • -pp, --printpdfs Set this to send the PDF to the standard printer automatically. Default: off
  • -c CONFIGNAME, --configname CONFIGNAME Specify the location of the GPT config

Logging

Every session is automatically logged to the directory "log" inside the project folder. The file name of each session is printed at startup.

PDFs and Printing

The program can automatically generate a PDF for each user chat summary. The default template can be found in pdf_gen. Generated files are placed in the folder summary_pdfs.

PDF generation requires a LaTeX installation. Please follow the official instructions. The required packages depend on the PDF template used.

Automatic printing can also be enabled. This sends the PDF file automatically to the default printer. Please set your default printing settings (e.g. paper size) in the system settings.

On Windows a separate installation of Ghostscript is required. Please download the latest release here.

Windows EXE (experimental)

You can create a Windows executable with auto-py-to-exe. Just install it with pip and then launch it with auto-py-to-exe. Use the provided .json for the needed configuration; file locations have to be adjusted! The output can then be found in the folder "output/quantumbar_keeper", which can be zipped for distribution.

API

The app automatically generates a web API with Flask. In development mode the API is reachable at localhost:5000. The following endpoints are accessible (unless otherwise stated, the endpoints return plain HTML):

  • /: Home with help text
  • /config/: Home of the current configuration endpoint. The following subpoints are available:
    • mode/: current mode of the AI bot, string: [detecthuman, timed, tokenbased, charbased]
    • ai/: current GPT-3 AI configurations in json format, see below
    • end_conditions/: current chat end conditions, List[string]
    • voiceID/: current voice ID in use, string
    • recognizer/: current speech recognition settings in json format, see below
    • deadtime/: current deadtime setting, integer
    • chattime/: current chat time/tokens/chars setting, integer
  • /state/: Home of the current state endpoint. The following subpoints are available:
    • state/: current state of the AI bot, string: [idle, chat_ai_query, summary_ai_query, sentiment_ai_query, interrupt_ai_query, recognize_speech, record_speech, ai_talking]
    • sentiment/: current sentiment of the user line, float: [positive=1, neutral=0, negative=-1]
    • start_chattime/: datetime when the bot has detected a human chat partner, string: "%Y-%m-%d, %H:%M:%S"
    • tokens/: used tokens of the currently running chat, integer
    • chars/: used chars of the currently running chat, integer
    • chat_counter/: counter for how many times the chat has ended, integer
    • summary/: summary of the chat, will update once chat is over, string
  • /humandetected/1/: sets the start trigger for the chat; should be reset after the start, otherwise the next chat will start right away
  • /humandetected/0/: sets the stop trigger for the chat
  • /add_user_id/STRING/: used to pass user id as a string to open ai
  • /del_user_id/STRING/: used to delete user id as a string from open ai
  • /stopServer/: Sends a signal to the app to stop in 5 seconds. This shuts down the entire app, not only the API. It displays a confirmation that the request has been processed.
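The endpoints above can be queried from any HTTP client. A minimal Python sketch using only the standard library (the helper names are made up here; the base URL assumes the Flask development server on localhost:5000):

```python
# Hypothetical client helpers for the endpoints listed above;
# the running app is assumed to be reachable at localhost:5000.
import urllib.request

BASE_URL = "http://localhost:5000"

def endpoint(path):
    """Build the full URL for an API path such as '/state/state/'."""
    return BASE_URL + path

def get_text(path):
    """Fetch an endpoint and return its body as a string."""
    with urllib.request.urlopen(endpoint(path)) as resp:
        return resp.read().decode("utf-8")

def set_human_detected(detected):
    """Set (1) or reset (0) the human-detected trigger."""
    flag = "1" if detected else "0"
    return get_text(f"/humandetected/{flag}/")
```

For example, get_text("/state/state/") returns the current bot state and set_human_detected(True) starts a chat.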

Configuration

You will have to supply an "OPENAI_KEY" environment variable. You can easily save it in a .env file like this: OPENAI_KEY=put_your_key_here
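A .env file is just plain KEY=value lines. If you are not using a loader such as python-dotenv, the key can be read by hand; a sketch of the idea (not the project's actual loading code):

```python
# Parse KEY=value lines from a .env file, skipping blank lines and
# comments. Sketch only; packages like python-dotenv do the same job.
import os

def load_env(path=".env"):
    env = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Export without clobbering variables already set in the environment:
# for key, value in load_env().items():
#     os.environ.setdefault(key, value)
```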

The GPT3 response parameters can be adjusted in "config.json". The voice ID can also be set there, as well as the stop conditions the user can say.

Text to speech

This application can be used with the larynx and pyttsx3 text-to-speech engines. By specifying the voice engine in the config file you can switch between them. Please provide either the voice ID for pyttsx3 or the correct relative path to the voice onnx model for larynx.

Larynx models should be placed in the folder tts. Voices and vocoders can be downloaded from here.

Speech to text

Several speech-to-text models are available. Google and Sphinx can be used as provided by the speech_recognition Python package. It is also possible to use Whisper by OpenAI for transcription; additionally, Whisper's built-in feature for automatic translation to English can be used.

Json Config

Please use the provided config_template.json as a reference and save your settings as config.json.

AI

{
    "AI_NAME": {
        "MODEL": string,
        "STOP": string,
        "TEMPERATURE": float,
        "TOP_P": float,
        "FREQUENCY_PENALTY": float,
        "PRESENCE_PENALTY": float,
        "BEST_OF": float,
        "MAX_TOKENS": integer,
        "START_CHAT_LOG": string,
        "PRECHAT": string,
        "POSTCHAT": string
    }
}

Human Substitution Texts

{
    "HUMAN_TEXT_UNRESPONSIVE": string,
    "HUMAN_TEXT_CHAT_TIMEOUT": string,
    "HUMAN_TEXT_UNSAFE_FILTER": string
}

Chat Parameters

{
    "CHATTIME": integer,
    "INTRO_LINE": string,
    "TIMEUP_LINE": string,
    "CHAT_END_SENTENCE": string,
    "END_CONDITIONS": [
        string,
        string
    ]
}

TTS Parameters

{
    "VOICE_ID": string,
    "VOICE_ENGINE": string
}

STT Parameters

{    
    "STT_ENGINE": string, // "google", "sphinx", "whisper", "whisper-translate"
}
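All of the sections above live together in one config.json, so reading the settings back is plain JSON access. A minimal sketch, assuming the keys shown above (the helper name is made up here):

```python
# Load config.json and access a few of the settings documented above.
import json

def load_config(path="config.json"):
    with open(path) as f:
        return json.load(f)

# cfg = load_config()
# cfg["STT_ENGINE"]      # "google", "sphinx", "whisper" or "whisper-translate"
# cfg["VOICE_ENGINE"]    # TTS engine, with cfg["VOICE_ID"] selecting the voice
# cfg["END_CONDITIONS"]  # list of phrases that end the chat
```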