Using LangChain & LLMs to split up transcribed text

Created on 2024-02-11T21:14:34-06:00

This card assumes you have already used a tool (such as whisper.cpp) to transcribe a three-hour broadcast and now want something a bit more readable than two hundred pages of text.

# where the magic happens
from langchain_community.llms import Ollama
llm = Ollama(model="mistral")

# to break up text so it fits within prompts
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc = None
with open("/tmp/speech.wav.txt", "r") as f:
   doc = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
doc = splitter.split_text(doc)

for par in doc[0:2]:
   print("-----")
   #for k in llm.stream(f"Give a bullet point summary of the following text: {par}"):
   for k in llm.stream(f"Rewrite the following text in a single sentence: {par}"):
      print(k, end='')
   print()

This card came about because of a very long and often obnoxious news pundit: transcribing multiple hours of broadcast produced a two-hundred-page ebook that was still considerably tedious to read. Do be warned that inaccuracy is inevitable, both from Whisper converting the speech into text and especially from Mistral reinterpreting cut-down portions of that text to create the summaries.

An interesting follow-up would be to keep the time codes from the subtitle format and produce summaries at five-minute intervals. That would let you refer back to the original video when you want more detail, and it would be a little more coherently structured than simply cutting the news program wherever the text happened to fill the buffer.
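That idea could be sketched roughly as follows, assuming SRT-style timestamps from the transcription; the bucket_cues helper, the cue format, and the five-minute window are invented for illustration. Each bucket's combined text could then be fed to llm.stream the same way as the chunks above.

```python
import re

# Match SRT-style timestamps such as 01:02:03,500 (comma or dot separator).
TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(stamp):
   """Convert an SRT timestamp like 01:02:03,500 to seconds."""
   h, m, s, ms = (int(x) for x in TIME_RE.match(stamp).groups())
   return h * 3600 + m * 60 + s + ms / 1000

def bucket_cues(cues, window=300):
   """Group (start_stamp, text) cues into window-second intervals.

   Returns (interval_start_seconds, joined_text) pairs in time order,
   so each interval can be summarized with a time reference attached.
   """
   buckets = {}
   for stamp, text in cues:
      key = int(to_seconds(stamp) // window)
      buckets.setdefault(key, []).append(text)
   return [(k * window, " ".join(v)) for k, v in sorted(buckets.items())]

# Toy cues standing in for a parsed subtitle file.
cues = [
   ("00:00:12,000", "Good evening."),
   ("00:03:45,500", "Tonight's top story..."),
   ("00:06:10,000", "In other news..."),
]
for start, text in bucket_cues(cues):
   print(f"[{start // 60} min] {text}")
```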