Creating a lyric search engine with OpenAI and Supabase

Github repo

Goal

Create a one page site that uses OpenAI's createEmbedding API and Supabase vector database extension to search song lyrics using natural language. You can grab the code from the lyricGPT Github.


Introduction

When you create a Supabase application, they give you a Postgres database. In a few click you will be able to add an extension that enables you to store the vectors of data you want to search.


This closely follows Supabase's tutorial: Storing OpenAI embeddings in Postgres with pgvector.


Create and storing embeddings of the song lyrics

After signing up for supabase, create your first project with default settings.


Next we need to click on the terminal icon on the left sidebar. It should say SQL editor. We are going to pass in SQL commands to our supabase datbase.

terminal

Let's enable the vector extension:

create extension vector;

For this project I grabbed data from Kaggle. It has song lyrics of 81 artist broken up into folders. The next command creates the the documents tables that will store song information, lyrics, and it's generated embedding. If you want to learn about embedding databases, I found Cloudflare's blog post to be the most helpful.

create table documents (
  id bigserial primary key,
  content text,
  title text,
  artist text,
  lyrics text,
  embedding vector (1536)
);

Creating a Supabase application means you are standing up an impressive amount of infastructure, most importantly, a Postgres database. Supabase's superpower is they take the Postgres database and turn it into a platofrm for to build applications on top. Most impressively they do this all open source and by focusing on Postgres fundamentals. If you get really into Supabase, you are getting really into Postgres. It's a win, win.


Here we are going to create a Postgres Function. match_documents is a function that will take in a query embedding and return the most similar documents. We will use this function to search for songs.

create or replace function match_documents (
  query_embedding vector(1536),
  similarity_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  artist text,
  title text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.content,
    documents.artist,
    documents.title,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > similarity_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;

Thinking about perfomance, lets add an index:

create index on documents
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);

Run create-next-app to get a folder with a nextjs app. Open that folder in your favorite code editor.

bunx create-next-app@latest --typescript --tailwind --eslint --src-dir --use-bun --app --import-alias='@/*' lyric-gpt

Don't have bun? Install instructions


Download the Rap Genius Lyrics Set and store it in a folder inside your next project called rawdata. I used the first 10 folders. You can use more or less.


Because I think we might want to call this code in the future, we should make an /api route for creating vector embeddings. This function is somewhat specifc to the structure of the rap genius data set, but I think it could be easily modified for other datasets. It will first parse the folder /rawData for filepaths and then read the csv files. We then batch 10 songs at a time and send them to openai to generate the embeddings.


We are sending a somewhat dumb way of generating embeddings. We are combining the artist, title, and lyrics into one string. This is not ideal, but it works. I think a better way would be to break up the lyrics into chunks and generate embeddings for each chunk. This would allow us to search for a specific part of a song.


Once we get the embeddings back from openai, we insert them into our database. We are using supabase's javascript client to insert the data.

import * as fs from "fs";
import * as path from "path";
import { parse } from "csv-parse";
import { createClient } from "@supabase/supabase-js";
import { Configuration, OpenAIApi } from "openai";
 
const supabaseUrl = process.env.supabaseUrl ?? "";
const supabaseKey = process.env.SUPABASE_KEY ?? "";
const supabaseClient = createClient(supabaseUrl, supabaseKey);
const configuration = new Configuration({
  apiKey: process.env.openApiKey,
});
const openAi = new OpenAIApi(configuration);
 
export async function GET() {
  const lyricPath = await getDocuments();
  for (let i = 0; i < lyricPath.length; i++) {
    console.log("processing", lyricPath[i])
    let allSongRows: songRow[] = [];
    const path = lyricPath[i];
    const fileContent = fs.readFileSync(path, { encoding: "utf-8" });
    const parseCsv = parse(fileContent, {
      delimiter: ",",
      columns: headers,
    });
 
    await new Promise((resolve, reject) => {
      parseCsv.on("readable", () => {
        let record;
        while ((record = parseCsv.read())) {
          allSongRows.push(record);
        }
      });
      parseCsv.on("error", (err) => reject(err));
      parseCsv.on("end", () => resolve("done"));
    });
 
    for (let i = 0; i < allSongRows.length; i += 10) {
      const rows = allSongRows.slice(i, i + 10);
      const allInsertsPromises = rows.map(async (row) => {
        const songStringObj = ` ${row.artist} ${row.title} ${row.lyrics}`;
        const { data, error } = await supabaseClient
          .from("documents")
          .select("id")
          .eq("title", row.title)
          .eq("artist", row.artist);
        if (error) console.log(error);
        if (data && data.length > 0) {
          console.log("skipping", row.title, row.artist, data);
          return null;
        }
        const songString = songStringObj
          .replace(/\n/g, " ")
          .trim()
          .toLowerCase();
        const embeddingResponse = await openAi
          .createEmbedding({
            model: "text-embedding-ada-002",
            input: songString,
          })
          .catch((err) => {
            console.log("error", err);
          });
 
        if (!embeddingResponse) {
          return null;
        }
        const [{ embedding }] = embeddingResponse.data.data;
        return await supabaseClient.from("documents").insert({
          title: row.title,
          artist: row.artist,
          lyrics: row.lyrics,
          content: songString,
          embedding,
        });
      });
      await Promise.all(allInsertsPromises);
    }
  }
  return new Response("ok");
}
 
type songRow = {
  artist: string;
  id: string;
  lyrics_owner_id: string;
  primary_artist_id: string;
  primary_artist_name: string;
  song_art_image_thumbnail_url: string;
  title: string;
  url: string;
  pageviews: string;
  lyrics: string;
};
 
const headers = [
  "artist",
  "id",
  "lyrics_owner_id",
  "primary_artist_id",
  "primary_artist_name",
  "song_art_image_thumbnail_url",
  "title",
  "url",
  "pageviews",
  "lyrics",
];
const folderPath = "./rawData/";
async function getDocuments() {
  // get file paths of all csv
  const files = fs.readdirSync(folderPath); // Read the files synchronously
  return files.filter(isCsvFile).map(toFilePath); // Extract the CSV file paths
}
 
function isCsvFile(file: string) {
  return path.extname(file) === ".csv";
}
 
function toFilePath(file: string) {
  return path.join(folderPath, file);
}

To run this: bun run dev and then go to localhost:3000/api/generateEmbeddings in your browser. This will take a while. I think it took me about 30 minutes to run.


Now that we have inserted the songs into our database, we can start writing our search function.

THis function takes the user query, generates an embbeding, and search our database for the most similar.

Thats it.

import { openAi, supabaseClient } from "@/utils";
 
export const corsHeaders = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Headers":
    "authorization, x-client-info, apikey, content-type",
};
 
export async function GET(req: Request) {
  // Handle CORS
  if (req.method === "OPTIONS") {
    return new Response("ok", { headers: corsHeaders });
  }
 
  // Extract search query from request URL
  const url = new URL(req.url);
  const query = url.searchParams.get('query') || '';
  // OpenAI recommends replacing newlines with spaces for best results
  const input = query.replace(/\n/g, " ");
 
  // Generate a one-time embedding for the query itself
  const embeddingResponse = await openAi.createEmbedding({
    model: "text-embedding-ada-002",
    input,
  });
 
  const [{ embedding }] = embeddingResponse.data.data;
 
  // Ideally for context injection, documents are chunked into
  // smaller sections at earlier pre-processing/embedding step.
  const { data: documents, error: err } = await supabaseClient.rpc(
    "match_documents_new",
    {
      query_embedding: embedding,
      similarity_threshold: 0.78, // Choose an appropriate threshold for your data
      match_count: 10, // Choose the number of matches
    }
  );
 
  const songs: `${string} - ${string}`[] = documents.map(
    (x: { title: string; artist: string; content: string }) => {
      return `${x.artist} - ${x.title}`;
    }
  );
 
  // Remove duplicates
  const uniqueSongs = [...new Set(songs)];
  console.log(uniqueSongs);
 
  return new Response(
    JSON.stringify({
      query,
      uniqueSongs,
    }),
    {
      headers: { ...corsHeaders, "Content-Type": "application/json" },
    }
  );
}

// make query about hip hop To run this: bun run dev and then go to localhost:3000/api/searchVector?query=happy+hip+hop in your browser. This will return a json object with the query and the most similar songs.

To make sure you have all the correct dependencies, here is my package.json

{
  "name": "lyricgpt",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "@supabase/supabase-js": "^2.14.0",
    "@types/node": "18.15.11",
    "@types/react": "18.0.33",
    "@types/react-dom": "18.0.11",
    "common-tags": "^1.8.2",
    "csv-parse": "^5.3.6",
    "encoding": "^0.1.13",
    "eslint": "8.37.0",
    "eslint-config-next": "13.2.4",
    "gpt3-tokenizer": "^1.1.5",
    "next": "13.2.2",
    "node-fetch": "^3.3.1",
    "oneline": "^1.0.3",
    "openai": "^3.2.1",
    "react": "18.2.0",
    "react-countup": "^6.4.2",
    "react-dom": "18.2.0",
    "theme-ui": "^0.15.7",
    "typescript": "5.0.3"
  },
  "devDependencies": {
    "@types/common-tags": "^1.8.1",
    "@types/mdx": "^2.0.4",
    "autoprefixer": "^10.4.14",
    "postcss": "^8.4.21",
    "tailwindcss": "^3.3.1"
  }
}

Exploring more

To make the frontend usable you need to edit you app/page.tsx file and create a component file that will host the form client side.

import { QueryAndSongList } from "@/components";
 
export default function Page() {
  return (
    <div className="min-h-screen bg-gradient-to-r min-w-full from-blue-600/25 to-sky-600/25">
      <div className="md:container md:mx-auto  mx-6">
        <div className="flex flex-col justify-center items-center  ">
          <h1 className="bg-gradient-to-r from-gray-900 via-gray-700 to-gray-900 inline-block text-transparent bg-clip-text text-7xl md:text-8xl font-bold h-28 mt-8">
            LyricGPT{" "}
          </h1>
          <QueryAndSongList />
        </div>
      </div>
    </div>
  );
}
 

And then create a file in the component folder called QueryAndSongList.tsx

"use client";
import { Dispatch, FormEvent, SetStateAction, useState } from "react";
 
export function QueryAndSongList() {
  const [isLoading, setIsLoading] = useState(false);
  const [songList, setSongList] = useState<`${string} - ${string}`[]>([]);
 
  return (
    <>
      <QueryForm setSongList={setSongList} setIsLoading={setIsLoading} />
 
      {isLoading ? (
        <Loading />
      ) : (
        <div className="flex flex-col space-y-8 mt-4 mb-4">
          {songList.map((song, index) => {
            return (
              <div key={index}>
                <div className="relative md:min-w-[600px] mx-auto">
                  <div className="absolute -inset-1 bg-gradient-to-r from-sky-800 to-sky-900 rounded-lg blur opacity-25"></div>
                  <div className="relative  px-7 py-6 bg-white ring-1 ring-gray-900/5 rounded-lg leading-none flex items-center justify-start space-x-6">
                    <div className="h-full flex justify-center items-center w-full">
                      <p className="text-slate-800 text-3xl">{song}</p>
                    </div>
                  </div>
                </div>
              </div>
            );
          })}
        </div>
      )}
    </>
  );
}
 
function QueryForm({
  setSongList,
  setIsLoading,
}: {
  setSongList: Dispatch<SetStateAction<`${string} - ${string}`[]>>;
  setIsLoading: Dispatch<SetStateAction<boolean>>;
}) {
  const handleSubmit = async (
    event: FormEvent<HTMLFormElement> & {
      target: {
        query: HTMLInputElement;
      };
    }
  ) => {
    setIsLoading(true);
    event.preventDefault();
    const data = {
      query: event.target.query.value,
    };
    const JSONdata = JSON.stringify(data);
 
    // Form the request for sending data to the server.
    const options = {
      // The method is POST because we are sending data.
      method: "POST",
      // Tell the server we're sending JSON.
      headers: {
        "Content-Type": "application/json",
      },
      // Body of the request is the JSON data we created above.
      body: JSONdata,
    };
 
    // Send the form data to our forms API on Vercel and get a response.
    const response = await fetch("/api/searchVector", options);
 
    // Get the response data from server as JSON.
    // If server returns the name submitted, that means the form works.
    const result = await response.json();
    setSongList(result.uniqueSongs);
    setIsLoading(false);
  };
  return (
    <>
      <form onSubmit={handleSubmit} className="w-full">
        <div className=" flex w-full  flex-col items-center mb-3 mt-4 justify-center">
          <input
            type="text"
            autoComplete="false"
            id="query"
            name="query"
            placeholder="Happy rap songs"
            className="px-3 py-3 w-full md:w-[600px]  placeholder-slate-300 text-slate-600 relative bg-white  rounded text-sm border-0 shadow outline-none focus:outline-none focus:ring-sky-800 focus:ring hover:ring hover:ring-sky-800 pr-10"
          />
          <span className="z-10 h-full leading-snug font-normal absolute text-center text-slate-300  bg-transparent rounded text-base items-center justify-center w-8 right-0 pr-3 py-3">
            <i className="fas fa-user"></i>
          </span>
          <button hidden type="submit"></button>
          <button className="mt-8 w-48 rounded-xl bg-slate-700/50 hover:bg-slate-800/50 h-8 cursor-pointer">
            Search
          </button>
        </div>
      </form>
    </>
  );
}
function Loading() {
  return (
    <div className="flex justify-center items-center my-10">
      <div className="animate-spin rounded-full h-24 w-24 border-b-4 border-slate-400 "></div>
    </div>
  );
}

If you visit localhost:3000 you should see a form that will allow you to search for songs. This is hitting the api route searchVector and our Supabase database. Snazzy!


Conclusion

While it takes a bit of work to get the frontend together, Supabase vector integration and vercel api routes make building full stack application enjoyable.