sexta-feira, 25 de setembro de 2015

pgAFIS: Biometric Elephant

Biometric technologies are increasingly used in civilian applications. One example is that in the 2014 elections, 21.6 million Brazilians (15% of the country's voters) in 762 municipalities (including 15 capitals) should use biometrics in electronic voting machines in order to reduce the risk of errors, fraud and slowness.

In India a government pioneering initiative promises to create the largest biometric database in the world. In the country there are over half a billion people who lack any kind of identification, making it impossible for them to receive government aid, open bank accounts, apply for loans, get a driver's license, among others. Expecting to register the biometric data of more than one million people a day, the project hopes to have by the end of 2014 the records of all 1.2 billion Indians in its database.

Thus, while scratching on the subject, I found out that the sourcecodes of biometric routines from FBI/NIST are public. Such routines are the heart of AFIS (Automatic Fingerprint & Identification System), among which two algorithms stand out: the characteristics extractor (feature extractor) and the comparator of minutiae (matcher).

After studying this NIST code (in C language), I decided to create support for such features natively on my favorite DBMS, PostgreSQL. And so it was born pgAFIS (Automated Fingerprint Identification System support for PostgreSQL), an extension capable of providing feature extraction and minutiae comparison from SQL statements within PostgreSQL.

But how does that work in practice?

Well, let's start by data modeling. You'll need to create in the database a table in which information of fingerprints will be stored, as described below:

Table "public.fingerprints"
 Column |     Type     | Modifiers 
 id     | character(5) | not null
 pgm    | bytea        | 
 wsq    | bytea        | 
 mdt    | bytea        | 
 xyt    | text         | 
    "fingerprints_pkey" PRIMARY KEY, btree (id)
  • "pgm" stores fingerprints raw images (PGM)
  • "wsq" stores fingerprints images in a compressed form (WSQ)
  • "mdt" stores fingerprints templates in  XYTQ  type in a binary format (MDT)
  • "xyt" stores fingerprints minutiae data in textual form
The data type used in these columns is "bytea" (an array of bytes), similar to BLOB in other DBMSs. PGM and WSQ formats are open and well-known in the market. On the other hand, MDT is a format I created. :D

Therefore, fingerprints images may be obtained in raw format and stored on "pgm" column in that table. PGM format is a kind of bitmap, where there is no compression. Fingerprint readers usually store image files using this format.

Now pgAFIS comes into play! A particular function is able to convert the binary data from a PGM content into a WSQ format (a kind of JPEG created by NIST/FBI). This is done by running this SQL statement:

UPDATE fingerprints
SET wsq = cwsq(pgm, 2.25, 300, 300, 8, null);

The second step where pgAFIS acts is the extraction of the fingerprints local characteristics (the minutiae) from the WSQ content. The result is the information in XYTQ type (i.e., horizontal and vertical coordinates, slope angle and quality) of each minutia. Again, you'll only need to execute a single SQL statement:

UPDATE fingerprints
SET mdt = mindt(wsq, true);

To get an idea of storing volume: fingerprint images in the PGM format of 300x300 pixels of 90 kB, when compressed in the WSQ format occupy 28 kB of disk space. After extracting minutiae and generating the content in the MDT format (specific to pgAFIS), that data occupies measly 150 bytes! The textual version of MDT, XYT, takes a little more: about 300 bytes!

  length(pgm) AS raw_bytes,
  length(wsq) AS wsq_bytes,
  length(mdt) AS mdt_bytes,
  length(xyt) AS xyt_chars
FROM fingerprints

  id   | pgm_bytes | wsq_bytes | mdt_bytes | xyt_chars 
 101_1 |     90015 |     27895 |       162 |       274
 101_2 |     90015 |     27602 |       186 |       312
 101_3 |     90015 |     27856 |       146 |       237
 101_4 |     90015 |     28784 |       154 |       262
 101_5 |     90015 |     27653 |       194 |       324
(5 rows)

These processes are part of the of biometrics Retrieve of biometrics represented in the figure below:

Very cool, but what we do with this pile of binary data? Biometric comparisons!

Biometric systems are able to perform two essential comparison operations: Verification and Identification.

In Verification, also called Authentication or [1:1] Search, a sample fingerprint is compared against a single fingerprint already stored in the database. This process can be seen in the figure below:

And that's where pgAFIS strikes again! With this extension, this operation can also be performed through a simple SQL statement:

SELECT (bz_match(a.mdt, b.mdt) >= 20) AS match
FROM fingerprints a, fingerprints b
WHERE = '101_1' AND = '101_6';

In this statement, the fingerprint identified as "101_1" is compared to "101_6". If the level of similarity between the two is at least 20, we can consider that both are equal, that is, refer to the same finger of the same person. It is a quick procedure, because the identification of the person is given at the start and the system only needs to return a boolean value: "yes" or "no".

Oppositely, Identification or [1:N] Search is a costly process for the system, since a sample fingerprint must be compared against all existing records in the biometrics database. In addition, its feedback is a set of possible identifications found in the database, that is, those considered most similar to the sample according to the comparison algorithm. This process can be seen in the figure below:

Even with this exception that the process may adversely affect the server, pgAFIS provides a feature to handle this risk: the limitation on the number of possible suspects. In SQL, a search example would be like this:

SELECT AS probe, AS sample,
  bz_match(a.mdt, b.mdt) AS score
FROM fingerprints a, fingerprints b
WHERE = '101_1' AND !=
  AND bz_match(a.mdt, b.mdt) >= 23
In this example, the three fingerprints that more resemble the one labeled "101_1" are returned.

Did you enjoy it? Have fun with it! Contribute! Source codes are available on GitHub at