February 19, 2021

The Challenges of Symbology in Financial Data

When it comes to symbology and identifiers in financial data, it gets complicated. The Head of IEX Cloud, Tim Baker, walks through the nuances and problems commonly faced by customers and industry players across the space.

Tim Baker

Identity Crisis

A few years ago, my wife wanted a sculpture for our front yard garden for her birthday. She found what she liked, and we commissioned the piece that duly arrived on a truck from Houston two months later. The artist had titled the piece Symbolism – I suppose because it incorporated the “∞” symbol. Since Iʼd found myself immersed in the topic of Symbology, we renamed it as such: “Symbology.”

I think about Symbology almost every day. According to Wikipedia, “a symbol in computer programming is a primitive data type whose instances have a unique human-readable form. Uniqueness is enforced by holding them in a symbol table. … and most common indirectly is their use to create object linkages.”

Bloombergʼs OpenFIGI site states: “Symbology refers to more than a code – it is the methodology and system for defining how data is related and how that information is conveyed”.

For the past dozen or so years, I have followed the evolution of this topic as data integration and provenance have become mission critical in the financial data space. There remain persistent challenges and costs associated with something as basic as confirming that a security is the one you think it is – before you trade it, analyze it, connect it with associated data, study it in a chart, or regress it against other time series data.

Challenges with symbology are part of a bigger issue around reference data – with no agreed-upon standards, vendors are left to their own devices to create their own ontologies, treatments, and yes, symbology. Attempts to create utilities to solve these challenges have had mixed results. EDM Councilʼs FIBO, for instance, looks to create consensus across the industry. Finding a solution is perhaps made more difficult by the presence of “too many cooks” – and potentially by the involvement of some of the vendors in these initiatives. But then how do you come to a consensus when the stakes are so high, and so many people and organizations must agree?

For entities (companies and other organizations), the Global Legal Entity Identifier Foundation (GLEIF) has made great strides in helping the financial services industry solve the entity identification issue – and has openly licensed the data for free use and reuse across the industry. But the reality is that challenges continue, as vendors keep proprietary identifiers locked down, either because they believe this keeps clients tied into their data offering or because the identifier earns licensing revenue in its own right.

I get it – maintaining this complex network of identifiers and symbology is non-trivial and can be expensive. Firms have started to open up a little – take the open PermID from Refinitiv, or FIGI from Bloomberg. Both are somewhat “open,” and certainly useful in their own right. They also differ considerably in their scope. For instance, FIGI is solely securities orientated, whereas PermID not only addresses equities but also includes related identifier classes such as companies, people, and sectors.

Refinitiv org diagram

So Why Is This So Important?

We live in an ever more interconnected and complex world, where there are more and more complex securities and information about those securities. Even while the number of listed companies has been falling, the advent of derivatives, ETFs and indices means that a breakage in one place can have a ripple effect across the industry. There are now more ETFs and funds in the U.S. than there are underlying stocks. The whole value chain of financial services is dependent on tickers and symbology to work – so that pre-trade, trade, and post-trade processes work seamlessly.

But fundamental flaws in symbology still make for brittle systems across the industry. Downstream fixes, and the lack of standards for those fixes, mean that there are fundamental inconsistencies in the vendor output on which professionals and systems rely. Such ambiguity makes it hard for quants to build reliable models, while systems that process vast amounts of unstructured data require high-quality entity data for named entity recognition (NER). More generally, machine learning applications require clean data to improve the signal, so poor and patchy entity data will worsen results.

Perhaps a topical example will help make my point: take the recent listing of Snowflake on the New York Stock Exchange (NYSE) – ticker SNOW. You might be surprised to hear that as recently as 2017, SNOW was the NYSE-listed ticker for a ski resort developer called Intrawest until it was bought by private equity group Fortress and delisted.

A Google search for Intrawest will give the impression that itʼs still listed with the ticker SNOW:

Intrawest google search screenshot

It gets worse: go back to 2000 and youʼll find that SNOW used to be IGN Entertainment, which changed its ticker from SNOW to IGNX in 2002! There is also an Amsterdam-listed stock that still trades with the ticker SNOW (the ISIN is NL0010627865).

The solution, of course, is not to use human-readable tickers as your primary identifier for securities in the first place. Better, instead, to rely on the unique identifier assigned to a security when it is listed or created. The human-readable ticker and market identifiers should be metadata associated with that identifier, along with other descriptive information about the security. In most markets the identifier will typically be an ISIN, or a CUSIP in the U.S. and Canada.
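To make the idea concrete, here is a minimal Python sketch of a security record keyed by a permanent identifier, with the ticker stored as dated metadata. The class names, the `resolve` helper, the placeholder IDs, and the year-level dates are all my own assumptions for illustration, not reference data and not any vendorʼs schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TickerAssignment:
    """A ticker is metadata: valid on one market for a limited period."""
    ticker: str
    market: str
    start: date
    end: Optional[date]  # None means the assignment is still active


@dataclass
class Security:
    security_id: str  # permanent key (an ISIN, CUSIP, FIGI, ...); placeholders here
    name: str
    tickers: List[TickerAssignment] = field(default_factory=list)


def resolve(securities, ticker, market, as_of):
    """Return the securities that held `ticker` on `market` on the date `as_of`."""
    matches = []
    for sec in securities:
        for t in sec.tickers:
            active = t.start <= as_of and (t.end is None or as_of <= t.end)
            if t.ticker == ticker and t.market == market and active:
                matches.append(sec)
                break
    return matches


# Placeholder IDs and rough, year-level dates, for illustration only.
universe = [
    Security("ID-INTRAWEST", "Intrawest",
             [TickerAssignment("SNOW", "NYSE", date(2014, 1, 1), date(2017, 12, 31))]),
    Security("ID-SNOWFLAKE", "Snowflake",
             [TickerAssignment("SNOW", "NYSE", date(2020, 9, 1), None)]),
]

for day in (date(2016, 6, 1), date(2021, 2, 19)):
    print(day, [s.name for s in resolve(universe, "SNOW", "NYSE", day)])
```

The point of the design is that a question like “who was SNOW?” only has an answer once you supply a market and a date; the permanent identifier never needs either.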

So, back to our Snowflake example – Intrawestʼs common stock has a CUSIP of 46090K109, whereas the CUSIP for Snowflakeʼs Class B stock is TC8Q0SS64. Neither is very recognizable to the human eye – but to a machine they work.
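Part of what makes these identifiers machine-friendly is that the last character is a check digit. The sketch below shows the check-digit arithmetic as I understand it for CUSIPs (a "modulus 10 double add double" scheme) and ISINs (a Luhn check over the letter-expanded string). The function names are mine, and the special CUSIP characters `*`, `@`, and `#` are deliberately not handled. A passing checksum only means the string is well formed; it says nothing about whether the security exists or whether you are licensed to use the identifier.

```python
def _char_value(ch: str) -> int:
    """Digits map to 0-9 and letters to 10-35 (A=10 ... Z=35)."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    raise ValueError(f"unsupported character: {ch!r}")


def cusip_check_digit(body: str) -> int:
    """Compute the check digit for the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(body.upper()):
        v = _char_value(ch)
        if i % 2 == 1:  # double every second character (2nd, 4th, ...)
            v *= 2
        total += v // 10 + v % 10  # add the digits of the (possibly doubled) value
    return (10 - total % 10) % 10


def is_valid_cusip(cusip: str) -> bool:
    cusip = cusip.strip().upper()
    if len(cusip) != 9 or not ("0" <= cusip[8] <= "9"):
        return False
    try:
        return cusip_check_digit(cusip[:8]) == int(cusip[8])
    except ValueError:
        return False


def is_valid_isin(isin: str) -> bool:
    isin = isin.strip().upper()
    if len(isin) != 12:
        return False
    try:
        # Expand letters to two-digit numbers, then run a Luhn check.
        digits = "".join(str(_char_value(ch)) for ch in isin)
    except ValueError:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:  # double every second digit, counting from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


print(is_valid_cusip("46090K109"))    # True
print(is_valid_cusip("TC8Q0SS64"))    # True
print(is_valid_isin("NL0010627865"))  # True
```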

I wonʼt get into how ISINs and CUSIPs are licensed (here is a handy link that gives more information), but needless to say they are certainly not free for any systematic or business use of the data. Redistribution of such identifiers also needs to be licensed, so data vendors wonʼt deliver reference or price data to a customer unless they attest that they have a CUSIP or ISIN license. What that means is that even if you have decided to adopt one of the commercial open identifiers like FIGI or PermID, youʼll still struggle to map these back to the corresponding CUSIP or ISIN without that license from the vendor and the CUSIP Service Bureau.

To summarize, most security identifiers have their flaws, are tightly licensed, are expensive to use, and arenʼt easily cross-referenced to each other.

Figure 3: Alacra Entity and Security Identifiers

Alacra table

As mentioned, securities are issued by companies, and the Legal Entity Identifier (LEI) is a sound and open identifier for issuers and financial counterparties. GLEIF and the Association of National Numbering Agencies (ANNA) launched a daily file linking ISINs to their issuersʼ LEIs. Itʼs a big file, too large to be loaded into Excel. While useful, it contains just the ISIN and the LEI with no other metadata. It also currently doesnʼt include retired ISINs – so I assume itʼs not that useful for disambiguating my SNOW example. GLEIF also provides a one-to-one mapping table to a companyʼs BIC. Refinitiv also publishes a link between the LEI and the firmʼs open identifier PermID – which does open more possibilities.
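A file that is too big for Excel is still easy to handle if you stream it rather than load it. Here is a minimal Python sketch that reads a mapping file row by row and keeps only the ISINs you care about. I am assuming a CSV with a header row and columns named `LEI` and `ISIN`; check the actual download before relying on those names, and treat the file name in the comment as hypothetical.

```python
import csv
from typing import Iterable, Iterator, Set, Tuple


def find_leis(lines: Iterable[str], wanted_isins: Set[str],
              isin_col: str = "ISIN", lei_col: str = "LEI") -> Iterator[Tuple[str, str]]:
    """Yield (ISIN, LEI) pairs for the requested ISINs, one row at a time.

    `lines` can be an open file handle, so memory use stays flat
    regardless of how large the mapping file is.
    """
    for row in csv.DictReader(lines):
        if row[isin_col] in wanted_isins:
            yield row[isin_col], row[lei_col]


# Usage (file name is hypothetical):
#
#   with open("isin_lei_mapping.csv", newline="") as fh:
#       matches = dict(find_leis(fh, {"NL0010627865"}))
```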

Schemas to classify companies are also very challenging, not least because most approaches to classifying a company into a sector require a one-to-one mapping. This is difficult for a firm like Apple, which can be categorized simultaneously as Software, Hardware, Gaming, or Banking! The public schemas are limited, and the private schemas (such as the Global Industry Classification Standard, or GICS, and the Industry Classification Benchmark, or ICB) can be pricey and are not openly licensed. GICS (owned by MSCI) and ICB (owned by FTSE) are multi-tiered schemas and arenʼt that different in their approach. Other vendors have their own industry classifications to help reduce costs. For example, Refinitiv has a multi-level schema called Thomson Reuters Business Classification (TRBC), and the Primary Business Sector is mapped on (this field is only licensed for non-commercial use). Of course, the real value of these schemas is the mapping of them to the companies, which requires a process and people.

So Where Are We Heading?

I wish there were good news here – there isnʼt much yet. Despite the flaws and challenges I have identified, the industry seems very set in its ways. Exchanges allow the reuse of tickers and have outsourced the issuance and monetization of identifiers to commercial entities. Vendors have probably gone as far as they will go to open up their internal schemas. Mapping across these schemas is left up to the better-staffed customers. There is no indication that these competing firms will collaborate in the space.

The good news is that as more advanced systems for matching entity data become available, it becomes easier for firms to reliably spot errors and disambiguate entity data. Larger firms can afford to properly license datasets, so larger vendors will deliver cross-referencing files to them. Smaller firms, however, either find the costs prohibitive or fly under the radar for as long as possible – and we know where that could end up!

Personally, Iʼd like to see tighter standards and a more modern approach to making source data more accessible and open. For example, it should be easy to query public data to identify securities that have been delisted. Ticker re-use should be disallowed. Yes, itʼs fun to have SNOW as a ticker – but was there any consideration as to the downstream consequences?

SEC rules example

There are rules out there – in Canada “symbols previously used by other issuers cannot be reassigned for 53 weeks.” In the U.S., reuse is permitted after 90 days, unless the change causes investor confusion.


Suggestions for the Alternative Data Community

Needless to say, the more an alternative data provider can do to ease the customerʼs burden of testing and integrating the data, the better. Symbology is a major pain point, so take the time to understand the space. If your data is at the entity level, append the PermID or LEI – both are open, and customers will be able to resolve to a stock ticker if required. Or better still, add the tickerʼs PermID and FIGI in there for good measure – the more the better.
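As one way to append a FIGI, here is a hedged Python sketch that prepares a request for OpenFIGIʼs mapping service. The endpoint URL, the `idType`/`idValue`/`exchCode` job fields, and the `X-OPENFIGI-APIKEY` header reflect my understanding of the v3 API at the time of writing, so treat them as assumptions and confirm them against the current OpenFIGI documentation. The `build_mapping_request` helper is my own name, and the sketch only builds the request; it does not send it.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current OpenFIGI documentation.
OPENFIGI_MAPPING_URL = "https://api.openfigi.com/v3/mapping"


def build_mapping_request(jobs, api_key=None):
    """Build (but do not send) a POST request asking OpenFIGI to map identifiers."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-OPENFIGI-APIKEY"] = api_key  # optional key for higher rate limits
    body = json.dumps(jobs).encode("utf-8")
    return urllib.request.Request(
        OPENFIGI_MAPPING_URL, data=body, headers=headers, method="POST"
    )


# One mapping job per identifier you want resolved.
jobs = [{"idType": "TICKER", "idValue": "SNOW", "exchCode": "US"}]
request = build_mapping_request(jobs)

# To send it (requires network access):
#
#   with urllib.request.urlopen(request) as resp:
#       results = json.load(resp)
```

Note that a ticker-based job inherits every ambiguity described above, so store the returned identifier alongside the date you resolved it.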

Make sure you document your schema, and any adjustments you have made. Keep an eye on things too – as you now know – identifiers change through time!

Finally, leverage as many of the open standards out there as you can:

  • Legal Entity Identifier (LEI): Good for banks and big companies and provides a cross reference to ISIN and BIC.
  • FIGI (Bloomberg): Has become more developer-friendly and has gained traction across the industry.
  • PermID (Refinitiv): Great API, and a matching tool that helps create common licensing. Includes LEI mapping.
  • Open FactSet: Symbology service, for a fee. Although, I think for data partners, you will get a pass.
  • Most other large vendors provide matching and cross-referencing services for a fee, and these are often bundled with other products – donʼt forget to ask if you are a customer!

Iʼm not explicitly saying avoid the fee-liable identifiers like the CUSIP and ISIN. It is just that they can be expensive, especially for a startup. There may also be cases where the use of CUSIPs or ISINs is unavoidable – for instance, when working with fixed income data – but there are approaches you can use to keep costs down and stay within the licensing regime. Also, just because a CUSIP is published on the SEC website, it does not mean you are licensed to use it or distribute it.

And finally

Here is Symbology – outside our front door (and a deer).

Symbology photo