Why Normalization Matters in Legal Analytics
UniCourt is a leader in providing access to meaningful, actionable legal analytics. Our machine learning technology normalizes entity names so that users can obtain valuable insights into the entities involved in litigation, spur business intelligence and development, devise winning litigation strategies, and stay informed about the legal issues affecting their interests. Without accurate data processing, however, legal analytics can lead to poor business decisions founded on flawed information. As the saying goes, “garbage in, garbage out.”
“Correctly identifying entities is one of the foundations for building reliable analytics,” says Dr. Yongxin Yan, VP of Analytics at UniCourt and former Chief Scientist at Lex Machina. Although issues of spelling and syntax may appear trivial, cases involving a particular party, attorney, or judge may be missing from an analytics report if the data is pulled from sources where an entity name is misspelled, lacks a proper suffix, or uses the wrong syntax. If two law firms are erroneously conflated, their case count may be artificially doubled. If one law firm is erroneously identified as two separate firms, its case count may be cut in half. In fact, poor entity normalization continues to be one of the leading sources of error in analytics. As such, entity normalization is a necessary precursor to producing meaningful legal analytics.
The Building Blocks of Entity Normalization
Entity normalization, also called entity linking or entity disambiguation, is a data processing procedure that identifies mentions of entities, disambiguates them, and organizes the data around those entities. In the context of legal documents, normalization identifies and collects variations in names and spellings, combines them with other information, and then makes the best determination of the actual underlying entities. “In any data set … the same entity may appear in different forms,” Dr. Yan explains. This means that when a user performs a search, the normalization technology will locate nearly all variations of an entity across different sources, such as data from different court systems, state bar organizations, and legal documents.
For example, imagine you are searching for records on an attorney named Philip Paul De Luca. This name can appear throughout official court records in myriad forms, accounting for various representations of his title, different spellings and misspellings of his name, the inclusion of middle initials, and more. Here are just a few examples of the various ways Mr. De Luca may be represented throughout court data sets:
- DELUCA PHILLIP P. LAW OFFICES OF
- DELUCA, PHILIP PAUL
- LAW OFFICES – PHILIP P. DELUCA ESQ
- LAW OFFICE OF PHILIP P. DELUCA – PHILIP P. DELUCA
- LAW OFCS OF PHILIP P DELUCA
- DELUCA, PHILIP P.
- DELUCA PHILLIP
- DELUCA PHILIP
- LAW OFFICES OF PHILIP DELUCA
- DE LUCA, PHILIP PAUL
- PHILIP P. DE LUCA
- PHILIP P DE LUCA
- DE LUCA PHILIP P LAW OFFICES OF
- DE LUCA PHILIP P. LAW OFFICES (ASSOC’D)
In order to gather accurate information about Mr. De Luca’s cases, we need a system that identifies these various iterations of his name as the same attorney. This same problem exists for judges, parties, and business entities.
Normalization is the solution to this problem. With our machine learning process, a user who searches “Phillip De Luca” will locate results for each of the above iterations of the name. Without this technology, a user would miss out on valuable information on Mr. De Luca that would be triggered by a search using an alternate spelling or representation.
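The variant-matching idea can be sketched in a few lines. The following is a simplified illustration, not UniCourt's actual pipeline: it reduces each raw court-record string to a rough key by stripping firm boilerplate, then fuzzy-compares keys using Python's standard `difflib`. The noise-word list, the surname-fusing step, and the 0.85 threshold are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# Illustrative noise words for law-firm boilerplate (an assumption, not a real list).
FIRM_NOISE = {"LAW", "OFFICE", "OFFICES", "OFCS", "OF", "ESQ", "ASSOC", "THE"}

def name_key(raw: str) -> str:
    """Reduce a raw court-record string to a rough person-name key."""
    s = re.sub(r"[^A-Z ]", " ", raw.upper())   # drop punctuation, digits, dashes
    s = s.replace("DE LUCA", "DELUCA")         # fuse the split surname (example-specific shortcut)
    tokens = [t for t in s.split() if t not in FIRM_NOISE and len(t) > 1]
    return " ".join(sorted(set(tokens)))       # order-insensitive, deduplicated

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy-compare reduced keys so minor misspellings still match."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio() >= threshold
```

A production system would combine many more signals, such as bar numbers, addresses, and co-occurring parties, rather than rely on string similarity alone.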
From our perspective, the ultimate goal of normalization is to identify and verify real world entities and connect all other relevant information to their various aliases and names, such as their phone numbers, email addresses, bar numbers (for attorneys), office addresses, and their registered companies. With proper identification of real world entities you can build more accurate analytics, you can boost your CRM efficacy, and you can create countless new and innovative uses from access to reliable, structured data.
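One way to picture that end result is a record per real-world entity, with its verified attributes and every observed alias pointing back to it. The sketch below is hypothetical; the field names and sample aliases are illustrative, not UniCourt's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class NormalizedEntity:
    """One real-world entity plus its verified attributes and observed aliases."""
    canonical_name: str
    entity_type: str                                # e.g. "attorney", "law_firm"
    bar_number: Optional[str] = None                # attorneys only
    phone_numbers: List[str] = field(default_factory=list)
    email_addresses: List[str] = field(default_factory=list)
    office_addresses: List[str] = field(default_factory=list)
    aliases: Set[str] = field(default_factory=set)  # raw forms seen in court data

# An alias index resolves any raw string from a court record to its entity.
de_luca = NormalizedEntity("Philip Paul De Luca", "attorney")
de_luca.aliases.update({"DELUCA, PHILIP P.", "LAW OFCS OF PHILIP P DELUCA"})
alias_index: Dict[str, NormalizedEntity] = {a: de_luca for a in de_luca.aliases}
```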
Inherent Challenges of Entity Normalization
Normalization is not yet a foolproof solution, explains Dr. Yan. Machine learning experts are constantly working to improve the way data is gathered and organized through normalization. “People may initially underestimate the importance and difficulty of this effort,” Dr. Yan explains. He states that the most challenging aspects of streamlining the normalization process include the following:
Name Inconsistency: This includes inconsistent or ad hoc use of abbreviations, spelling errors, and name changes. For instance, in the Philip De Luca example, the attorney is represented across different sources in at least fourteen different ways.
Confusing or Transposing Names: Different entities may have identical or similar names. For example, the law firm Steptoe & Johnson PLLC can be confused with Steptoe & Johnson LLP if a user searches only for “Steptoe & Johnson.”
Entity Inconsistency: When two law firms merge into one, how should we gather statistics for the three entities involved? When one law firm splits into three, with one firm retaining the old name and the other two separately merging into two different law firms, how should we calculate statistics for the six entities involved? Entities frequently split, merge, acquire other entities, and change names due to partner movement. Their names may change in the process and may relate to previous entities in complex ways. Different users may subconsciously make different assumptions when they use analytics.
Entity Complexity: A large conglomerate may comprise multiple entities that are related in complex ways and share the same words in their names. For instance, in legal documents “Wells Fargo” may refer to Wells Fargo & Company, Wells Fargo Bank N.A., Wells Fargo Bank Minnesota, or Wells Fargo Financial, depending on the context of the case. Different legal analytics users, or even the same user on different occasions, may use the same name to refer to different entities or combinations of entities.
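The Steptoe & Johnson case above illustrates why raw string similarity alone cannot settle these questions: the two firms' names are nearly identical, so a matcher has to treat the legal suffix as a hard disambiguating signal. Here is a minimal sketch of that idea, with an assumed suffix list and similarity threshold:

```python
import re
from difflib import SequenceMatcher

# Organizational suffixes that distinguish otherwise similar firm names
# (an illustrative subset, not an exhaustive list).
LEGAL_SUFFIXES = {"LLP", "PLLC", "LLC", "PC", "PA", "LP", "INC", "CORP"}

def suffix(name: str):
    """Return the trailing organizational suffix of a firm name, if any."""
    tokens = re.sub(r"[^A-Z\s]", "", name.upper()).split()
    return tokens[-1] if tokens and tokens[-1] in LEGAL_SUFFIXES else None

def likely_same_firm(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat a suffix mismatch (PLLC vs. LLP) as decisive even when the rest
    of the name is nearly identical; otherwise fall back to fuzzy matching."""
    sa, sb = suffix(a), suffix(b)
    if sa and sb and sa != sb:
        return False
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= threshold
```

With this rule, “Steptoe & Johnson PLLC” and “Steptoe & Johnson LLP” stay separate despite their near-identical spellings.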
How Normalization Facilitates Meaningful Analytics
In the Philip De Luca example, gathering the various spellings and representations of Mr. De Luca’s name is beneficial because users can learn more about him and the cases he handles. Normalization can take this one step further. When coupled with other pieces of data like party names or judges, users can gather much more data on Mr. De Luca, such as the types of clients he typically represents, what types of cases he handles, the results of certain actions (motions) in those cases, which judges he faces most frequently, and which venues he routinely appears in. This information can also help opposing counsel make a series of valuable inferences about Mr. De Luca or other entities of interest and leverage this data to devise an informed litigation strategy.
Beyond improving your search results in the UniCourt application, our normalization combined with our APIs allows you to enrich your internal databases with more complete data sets. This means you can not only obtain actionable intelligence from UniCourt’s legal analytics, but that you can also layer our data over your own matter management system or CRM system to create your own business intelligence and future opportunities for business development.
We are proud to be a leader in providing access to meaningful, actionable legal analytics, and proud that our machine learning technology normalizes court data so you can make the most of public records.