Create a one page site that uses OpenAI's createEmbedding
API and Supabase vector database extension to search song lyrics using natural language. You can grab the code from the lyricGPT Github.
When you create a Supabase application, they give you a Postgres database. In a few click you will be able to add an extension that enables you to store the vectors of data you want to search.
This closely follows Supabase's tutorial: Storing OpenAI embeddings in Postgres with pgvector.
After signing up for supabase, create your first project with default settings.
Next we need to click on the terminal icon on the left sidebar. It should say SQL editor. We are going to pass in SQL commands to our supabase datbase.
Let's enable the vector extension:
create extension vector;
For this project I grabbed data from Kaggle. It has song lyrics of 81 artist broken up into folders. The next command creates the the documents tables that will store song information, lyrics, and it's generated embedding. If you want to learn about embedding databases, I found Cloudflare's blog post to be the most helpful.
create table documents (
id bigserial primary key,
content text,
title text,
artist text,
lyrics text,
embedding vector (1536)
);
Creating a Supabase application means you are standing up an impressive amount of infastructure, most importantly, a Postgres database. Supabase's superpower is they take the Postgres database and turn it into a platofrm for to build applications on top. Most impressively they do this all open source and by focusing on Postgres fundamentals. If you get really into Supabase, you are getting really into Postgres. It's a win, win.
Here we are going to create a Postgres Function. match_documents
is a function that will take in a query embedding and return the most similar documents. We will use this function to search for songs.
create or replace function match_documents (
query_embedding vector(1536),
similarity_threshold float,
match_count int
)
returns table (
id bigint,
content text,
artist text,
title text,
similarity float
)
language plpgsql
as $$
begin
return query
select
documents.id,
documents.content,
documents.artist,
documents.title,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > similarity_threshold
order by documents.embedding <=> query_embedding
limit match_count;
end;
$$;
Thinking about perfomance, lets add an index:
create index on documents
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
Run create-next-app to get a folder with a nextjs app. Open that folder in your favorite code editor.
bunx create-next-app@latest --typescript --tailwind --eslint --src-dir --use-bun --app --import-alias='@/*' lyric-gpt
Don't have bun? Install instructions
Download the Rap Genius Lyrics Set and store it in a folder inside your next project called rawdata
. I used the first 10 folders. You can use more or less.
Because I think we might want to call this code in the future, we should make an /api
route for creating vector embeddings. This function is somewhat specifc to the structure of the rap genius data set, but I think it could be easily modified for other datasets.
It will first parse the folder /rawData
for filepaths and then read the csv files.
We then batch 10 songs at a time and send them to openai to generate the embeddings.
We are sending a somewhat dumb way of generating embeddings. We are combining the artist, title, and lyrics into one string. This is not ideal, but it works. I think a better way would be to break up the lyrics into chunks and generate embeddings for each chunk. This would allow us to search for a specific part of a song.
Once we get the embeddings back from openai, we insert them into our database. We are using supabase's javascript client to insert the data.
import * as fs from "fs";
import * as path from "path";
import { parse } from "csv-parse";
import { createClient } from "@supabase/supabase-js";
import { Configuration, OpenAIApi } from "openai";
const supabaseUrl = process.env.supabaseUrl ?? "";
const supabaseKey = process.env.SUPABASE_KEY ?? "";
const supabaseClient = createClient(supabaseUrl, supabaseKey);
const configuration = new Configuration({
apiKey: process.env.openApiKey,
});
const openAi = new OpenAIApi(configuration);
export async function GET() {
const lyricPath = await getDocuments();
for (let i = 0; i < lyricPath.length; i++) {
console.log("processing", lyricPath[i])
let allSongRows: songRow[] = [];
const path = lyricPath[i];
const fileContent = fs.readFileSync(path, { encoding: "utf-8" });
const parseCsv = parse(fileContent, {
delimiter: ",",
columns: headers,
});
await new Promise((resolve, reject) => {
parseCsv.on("readable", () => {
let record;
while ((record = parseCsv.read())) {
allSongRows.push(record);
}
});
parseCsv.on("error", (err) => reject(err));
parseCsv.on("end", () => resolve("done"));
});
for (let i = 0; i < allSongRows.length; i += 10) {
const rows = allSongRows.slice(i, i + 10);
const allInsertsPromises = rows.map(async (row) => {
const songStringObj = ` ${row.artist} ${row.title} ${row.lyrics}`;
const { data, error } = await supabaseClient
.from("documents")
.select("id")
.eq("title", row.title)
.eq("artist", row.artist);
if (error) console.log(error);
if (data && data.length > 0) {
console.log("skipping", row.title, row.artist, data);
return null;
}
const songString = songStringObj
.replace(/\n/g, " ")
.trim()
.toLowerCase();
const embeddingResponse = await openAi
.createEmbedding({
model: "text-embedding-ada-002",
input: songString,
})
.catch((err) => {
console.log("error", err);
});
if (!embeddingResponse) {
return null;
}
const [{ embedding }] = embeddingResponse.data.data;
return await supabaseClient.from("documents").insert({
title: row.title,
artist: row.artist,
lyrics: row.lyrics,
content: songString,
embedding,
});
});
await Promise.all(allInsertsPromises);
}
}
return new Response("ok");
}
type songRow = {
artist: string;
id: string;
lyrics_owner_id: string;
primary_artist_id: string;
primary_artist_name: string;
song_art_image_thumbnail_url: string;
title: string;
url: string;
pageviews: string;
lyrics: string;
};
const headers = [
"artist",
"id",
"lyrics_owner_id",
"primary_artist_id",
"primary_artist_name",
"song_art_image_thumbnail_url",
"title",
"url",
"pageviews",
"lyrics",
];
const folderPath = "./rawData/";
async function getDocuments() {
// get file paths of all csv
const files = fs.readdirSync(folderPath); // Read the files synchronously
return files.filter(isCsvFile).map(toFilePath); // Extract the CSV file paths
}
function isCsvFile(file: string) {
return path.extname(file) === ".csv";
}
function toFilePath(file: string) {
return path.join(folderPath, file);
}
To run this: bun run dev
and then go to localhost:3000/api/generateEmbeddings
in your browser. This will take a while. I think it took me about 30 minutes to run.
Now that we have inserted the songs into our database, we can start writing our search function.
THis function takes the user query, generates an embbeding, and search our database for the most similar.
Thats it.
import { openAi, supabaseClient } from "@/utils";
export const corsHeaders = {
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Headers":
"authorization, x-client-info, apikey, content-type",
};
export async function GET(req: Request) {
// Handle CORS
if (req.method === "OPTIONS") {
return new Response("ok", { headers: corsHeaders });
}
// Extract search query from request URL
const url = new URL(req.url);
const query = url.searchParams.get('query') || '';
// OpenAI recommends replacing newlines with spaces for best results
const input = query.replace(/\n/g, " ");
// Generate a one-time embedding for the query itself
const embeddingResponse = await openAi.createEmbedding({
model: "text-embedding-ada-002",
input,
});
const [{ embedding }] = embeddingResponse.data.data;
// Ideally for context injection, documents are chunked into
// smaller sections at earlier pre-processing/embedding step.
const { data: documents, error: err } = await supabaseClient.rpc(
"match_documents_new",
{
query_embedding: embedding,
similarity_threshold: 0.78, // Choose an appropriate threshold for your data
match_count: 10, // Choose the number of matches
}
);
const songs: `${string} - ${string}`[] = documents.map(
(x: { title: string; artist: string; content: string }) => {
return `${x.artist} - ${x.title}`;
}
);
// Remove duplicates
const uniqueSongs = [...new Set(songs)];
console.log(uniqueSongs);
return new Response(
JSON.stringify({
query,
uniqueSongs,
}),
{
headers: { ...corsHeaders, "Content-Type": "application/json" },
}
);
}
// make query about hip hop
To run this: bun run dev
and then go to localhost:3000/api/searchVector?query=happy+hip+hop
in your browser. This will return a json object with the query and the most similar songs.
To make sure you have all the correct dependencies, here is my package.json
{
"name": "lyricgpt",
"version": "0.1.0",
"private": true,
"scripts": {
"dev": "next dev",
"build": "next build",
"start": "next start",
"lint": "next lint"
},
"dependencies": {
"@supabase/supabase-js": "^2.14.0",
"@types/node": "18.15.11",
"@types/react": "18.0.33",
"@types/react-dom": "18.0.11",
"common-tags": "^1.8.2",
"csv-parse": "^5.3.6",
"encoding": "^0.1.13",
"eslint": "8.37.0",
"eslint-config-next": "13.2.4",
"gpt3-tokenizer": "^1.1.5",
"next": "13.2.2",
"node-fetch": "^3.3.1",
"oneline": "^1.0.3",
"openai": "^3.2.1",
"react": "18.2.0",
"react-countup": "^6.4.2",
"react-dom": "18.2.0",
"theme-ui": "^0.15.7",
"typescript": "5.0.3"
},
"devDependencies": {
"@types/common-tags": "^1.8.1",
"@types/mdx": "^2.0.4",
"autoprefixer": "^10.4.14",
"postcss": "^8.4.21",
"tailwindcss": "^3.3.1"
}
}
To make the frontend usable you need to edit you app/page.tsx
file and create a component file that will host the form client side.
import { QueryAndSongList } from "@/components";
export default function Page() {
return (
<div className="min-h-screen bg-gradient-to-r min-w-full from-blue-600/25 to-sky-600/25">
<div className="md:container md:mx-auto mx-6">
<div className="flex flex-col justify-center items-center ">
<h1 className="bg-gradient-to-r from-gray-900 via-gray-700 to-gray-900 inline-block text-transparent bg-clip-text text-7xl md:text-8xl font-bold h-28 mt-8">
LyricGPT{" "}
</h1>
<QueryAndSongList />
</div>
</div>
</div>
);
}
And then create a file in the component folder called QueryAndSongList.tsx
"use client";
import { Dispatch, FormEvent, SetStateAction, useState } from "react";
export function QueryAndSongList() {
const [isLoading, setIsLoading] = useState(false);
const [songList, setSongList] = useState<`${string} - ${string}`[]>([]);
return (
<>
<QueryForm setSongList={setSongList} setIsLoading={setIsLoading} />
{isLoading ? (
<Loading />
) : (
<div className="flex flex-col space-y-8 mt-4 mb-4">
{songList.map((song, index) => {
return (
<div key={index}>
<div className="relative md:min-w-[600px] mx-auto">
<div className="absolute -inset-1 bg-gradient-to-r from-sky-800 to-sky-900 rounded-lg blur opacity-25"></div>
<div className="relative px-7 py-6 bg-white ring-1 ring-gray-900/5 rounded-lg leading-none flex items-center justify-start space-x-6">
<div className="h-full flex justify-center items-center w-full">
<p className="text-slate-800 text-3xl">{song}</p>
</div>
</div>
</div>
</div>
);
})}
</div>
)}
</>
);
}
function QueryForm({
setSongList,
setIsLoading,
}: {
setSongList: Dispatch<SetStateAction<`${string} - ${string}`[]>>;
setIsLoading: Dispatch<SetStateAction<boolean>>;
}) {
const handleSubmit = async (
event: FormEvent<HTMLFormElement> & {
target: {
query: HTMLInputElement;
};
}
) => {
setIsLoading(true);
event.preventDefault();
const data = {
query: event.target.query.value,
};
const JSONdata = JSON.stringify(data);
// Form the request for sending data to the server.
const options = {
// The method is POST because we are sending data.
method: "POST",
// Tell the server we're sending JSON.
headers: {
"Content-Type": "application/json",
},
// Body of the request is the JSON data we created above.
body: JSONdata,
};
// Send the form data to our forms API on Vercel and get a response.
const response = await fetch("/api/searchVector", options);
// Get the response data from server as JSON.
// If server returns the name submitted, that means the form works.
const result = await response.json();
setSongList(result.uniqueSongs);
setIsLoading(false);
};
return (
<>
<form onSubmit={handleSubmit} className="w-full">
<div className=" flex w-full flex-col items-center mb-3 mt-4 justify-center">
<input
type="text"
autoComplete="false"
id="query"
name="query"
placeholder="Happy rap songs"
className="px-3 py-3 w-full md:w-[600px] placeholder-slate-300 text-slate-600 relative bg-white rounded text-sm border-0 shadow outline-none focus:outline-none focus:ring-sky-800 focus:ring hover:ring hover:ring-sky-800 pr-10"
/>
<span className="z-10 h-full leading-snug font-normal absolute text-center text-slate-300 bg-transparent rounded text-base items-center justify-center w-8 right-0 pr-3 py-3">
<i className="fas fa-user"></i>
</span>
<button hidden type="submit"></button>
<button className="mt-8 w-48 rounded-xl bg-slate-700/50 hover:bg-slate-800/50 h-8 cursor-pointer">
Search
</button>
</div>
</form>
</>
);
}
function Loading() {
return (
<div className="flex justify-center items-center my-10">
<div className="animate-spin rounded-full h-24 w-24 border-b-4 border-slate-400 "></div>
</div>
);
}
If you visit localhost:3000
you should see a form that will allow you to search for songs. This is hitting the api route searchVector
and our Supabase database. Snazzy!
While it takes a bit of work to get the frontend together, Supabase vector integration and vercel api routes make building full stack application enjoyable.