NTD in AI: DBSCAN

Non-technical definitions in AI

DBSCAN, an acronym for Density-Based Spatial Clustering of Applications with Noise, is an algorithm used in unsupervised learning to find patterns of clustering in data.

Unlike [k-Means], which is a centroid-based algorithm, DBSCAN does not require practitioners to specify the number of clusters in advance. For k-Means, the results are sensitive to the number of clusters (the “k” in “k-Means”), which has to be chosen manually.

DBSCAN instead uses the idea of density rather than centroids to find clusters. The 2 hyperparameters in this algorithm are the maximum distance from a selected point within which another point counts as its neighbour (typically denoted with $\epsilon$), and the minimum number of points within that distance required to form a cluster, n.

With these 2 parameters, the clusters are found iteratively. First, an example in the set to be clustered is chosen. The number of members within distance $\epsilon$ of that point is counted. If it is n or more, then the point and those members become the first cluster.
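The density check described above can be sketched in a few lines of Python. This is only an illustration: the function name `neighbours_within`, the toy points, and the values chosen for $\epsilon$ and n are all made up for this example, and counting the chosen point as one of its own neighbours is an assumption (a common convention, but not the only one).

```python
import math

def neighbours_within(points, index, eps):
    """Return the indices of all points within distance eps of points[index].

    The chosen point is included in its own neighbourhood (an assumption).
    """
    centre = points[index]
    return [j for j, p in enumerate(points) if math.dist(centre, p) <= eps]

# Hypothetical toy data: three points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
eps = 0.5  # assumed distance threshold (epsilon)
n = 3      # assumed minimum number of points

close = neighbours_within(points, 0, eps)  # [0, 1, 2]
is_dense = len(close) >= n                 # True: enough neighbours to start a cluster
```

With these values the first point has enough close neighbours to start a cluster, while the far-away point at (5.0, 5.0) has only itself nearby and would not.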

Each member in that first cluster is then tested. If the member has n or more neighbours within distance $\epsilon$, then any of those neighbours not already in the cluster are added to it. If there are fewer than n neighbours within distance $\epsilon$, then no new examples are added. This is repeated until no more new members can be added to the cluster.

Then, a second example in the dataset not in the first cluster is chosen and the process is repeated. This iterative process is run until all points either belong to a cluster or are outliers. Hence, without labeled examples, clusters in the data can be found.
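For readers who like to see the steps written out, here is a minimal sketch of the whole procedure in plain Python. It is not a reference implementation: the names `dbscan`, `region_query` and `NOISE`, the label `-1` for outliers, and the toy data are all assumptions made for illustration, and a point is counted as one of its own neighbours.

```python
import math

NOISE = -1  # label used here for outliers (an assumed convention)

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], itself included."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def dbscan(points, eps, n):
    """Return one label per point: 0, 1, 2, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not looked at yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < n:
            labels[i] = NOISE  # provisional: a later cluster may still claim it
            continue
        cluster += 1  # enough neighbours, so start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # joins the cluster but adds no new members
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= n:
                queue.extend(more)  # dense member: its neighbours join too
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, n=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters (two here) was never given to the function; it falls out of the choice of $\epsilon$ and n. The isolated point at (50, 50) ends up labelled as an outlier.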

See also k-Means, clustering, unsupervised learning.

Machine learning is a technical subject, and the use of technical terms by engineers has the potential to get in the way of clear communication with non-engineers, especially in a business setting. In spare moments I started to put together simple, non-technical definitions of nouns and verbs used in the field of machine learning as a kind of Rosetta Stone for non-engineers. This is a work-in-progress which I may collect into a book one day. This is one of those definitions.
