Entity Resolution: Harnessing the Power of Vector Databases

Oliver
July 17, 2023
Data Collaboration at Scale

This article looks at entity resolution and, in particular, at how vector databases help find semantically similar entities to resolve.

What is entity resolution?

Before we look into vector databases, let’s quickly recap what entity resolution is.

Entity resolution, also known as record linkage or deduplication, refers to the process of identifying and merging records that refer to the same real-world entity.

It’s a crucial task in various domains, including customer data management, fraud detection, and information retrieval.

Traditional entity resolution methods

Traditional entity resolution methods typically rely on rule-based or probabilistic approaches.

While these techniques have served us well, they often struggle when dealing with large-scale datasets or entities with complex relationships.
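To make the rule-based approach concrete, here is a minimal sketch using only Python's standard library. The `normalise` and `is_match` helpers, the sample names, and the 0.85 threshold are all assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, drop full stops and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def is_match(a, b, threshold=0.85):
    """Rule: two names match if their normalised strings are similar enough.

    The threshold is an illustrative assumption, not a tuned value.
    """
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


# Differences in case and punctuation are handled by the rule.
same_company = is_match("Acme Ltd.", "ACME Ltd")

# An abbreviation shares too few characters with the full name to pass.
abbreviation = is_match("Intl Business Machines", "IBM")
```

A rule like this catches formatting differences, but it only compares characters, so a pair such as an abbreviation and its full name falls below the threshold.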

This is where vector databases come to the rescue!

What makes vector databases good at finding similar entities?

So, what makes vector databases so good at finding semantically similar entities to resolve?

The key is that they store vector representations of entities, produced by machine learning models, and can search those vectors by similarity.

Vector representations capture the essence of entities by encoding their characteristics into high-dimensional vectors.

In a hand-built feature vector, each dimension can correspond to a single attribute. For example, in a customer entity, dimensions could represent attributes like age, gender, purchase history, etc. In a learned embedding, the representation is distributed: a characteristic is typically spread across many dimensions rather than tied to one.

By employing vector representations, we can transform entities into a mathematical space where relationships between entities can be quantified. This is known as an embedding space, where entities with similar characteristics are closer to each other, and dissimilar entities are farther apart.
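As a rough illustration of "closer" and "farther apart", the sketch below compares made-up three-dimensional vectors using cosine similarity, one common way to measure proximity. Real embeddings have hundreds or thousands of dimensions, and the entity names and values here are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
acme_ltd = [0.90, 0.10, 0.30]
acme_limited = [0.85, 0.15, 0.35]
zenith_bakery = [0.10, 0.90, 0.20]

similar = cosine_similarity(acme_ltd, acme_limited)
dissimilar = cosine_similarity(acme_ltd, zenith_bakery)

print(f"Acme Ltd vs Acme Limited:  {similar:.3f}")
print(f"Acme Ltd vs Zenith Bakery: {dissimilar:.3f}")
```

The two Acme vectors point in almost the same direction, so their score is close to 1, while the bakery vector scores much lower.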

Now, let’s look at how vector databases use these embeddings to perform efficient entity resolution.

How do vector databases use embeddings?

Vector databases employ advanced indexing techniques, such as approximate nearest neighbor (ANN) search algorithms, to rapidly find entities with similar embeddings. As a side note, we use the HNSW (hierarchical navigable small world) algorithm, which is a multilayered graph approach to nearest neighbor search. These algorithms are specifically designed to handle high-dimensional vector spaces efficiently.
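To give a feel for graph-based search, here is a toy sketch of a single-layer neighbor graph with greedy traversal. It is not HNSW itself and not how VectorLink is implemented: HNSW adds multiple layers, incremental insertion, and a wider candidate list. The function names and the tiny dataset are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(vectors, k):
    """Link every vector to its k nearest neighbors.

    Brute force is used here for simplicity; it is only fine for toy data.
    """
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: euclidean(v, vectors[j]),
        )
        graph[i] = others[:k]
    return graph


def greedy_search(vectors, graph, query, entry=0):
    """Walk the graph, always stepping to the neighbor closest to the query.

    Stops when no neighbor improves on the current node, so the result can
    be a local optimum. That is why this kind of search is approximate.
    """
    current = entry
    current_dist = euclidean(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = euclidean(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


# Ten toy points along a line; each is linked to its 2 nearest neighbors.
vectors = [(float(i), 0.0) for i in range(10)]
graph = build_knn_graph(vectors, k=2)

query = (7.2, 0.0)
found, distance = greedy_search(vectors, graph, query, entry=0)
print(f"nearest index: {found}, distance: {distance:.2f}")
```

The search only measures distances to the neighbors of the nodes it visits, rather than to every stored vector, which is where the speed-up on large collections comes from.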

When resolving entities, vector databases enable us to perform similarity searches based on the semantic characteristics encoded in the embeddings. We can search for entities that are similar to a given target entity by measuring how close their embeddings are to the target’s.
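A similarity search for resolution candidates might look like the sketch below. It uses an exact scan over a small in-memory dictionary rather than a real vector database, and the record IDs, vectors, `find_candidates` helper, and 0.95 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_candidates(target, records, top_k=3, threshold=0.95):
    """Return up to top_k (record_id, score) pairs scoring at least threshold.

    Results are ordered from most to least similar. The threshold is an
    illustrative assumption; in practice it would be tuned on labeled data.
    """
    scored = [(rid, cosine_similarity(target, vec)) for rid, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(rid, score) for rid, score in scored[:top_k] if score >= threshold]


# Hypothetical stored embeddings keyed by record ID.
records = {
    "crm-001": [0.85, 0.15, 0.35],
    "crm-002": [0.10, 0.90, 0.20],
    "billing-417": [0.88, 0.12, 0.31],
}

# Embedding of the entity we want to resolve.
target = [0.90, 0.10, 0.30]

matches = find_candidates(target, records)
for record_id, score in matches:
    print(f"{record_id}: {score:.4f}")
```

The records returned are candidates, not confirmed duplicates: a follow-up step, such as a business rule or a human review, would decide whether to merge them.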

A key advantage of vector databases is their ability to handle large-scale datasets and complex relationships. Since the vector representations capture the nuanced features of entities, they can identify semantically similar entities across various domains and contexts.

To summarize…

The benefits of vector databases in entity resolution:

Efficiently handle large-scale datasets.
Account for complex relationships between entities.
Enable accurate resolution by leveraging semantic similarities.

We have recently added a vector database sidecar called VectorLink to TerminusCMS. It enables you to use vector embeddings with your data to leverage semantic tools. We believe that knowledge graphs plus vector search is the future and we have it now. You can try it out for free by signing up to TerminusCMS and choosing the community package (no credit card required). You need an OpenAI API key to use VectorLink. Take a look at OpenAI’s website for its restrictions and terms.

If you’d like more technical information about how we built VectorLink, read the Building a Vector Database blog.

If you’re interested in entity resolution for your business, get in touch to discuss your requirements and to see how we can help.