Easier SEC Disclosure NLP with Standardized HTML

Ingesting SEC disclosures for algorithmic natural language processing (NLP) is difficult because the HTML in the filings is poorly formed. Calcbench API users can now access standardized disclosure HTML.

For instance, Microsoft's Contingencies note looks like this  -




but the HTML looks like this -



Everything is a paragraph, there is no hierarchy, and the headers are not marked up as headers.


Calcbench's standardized HTML looks like this -



The hierarchy of headers is correct, and each header sits in a section with the text to which it refers.
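Once the headers are real header tags, recovering the outline of a note takes only a few lines. Below is a minimal sketch using Python's standard-library html.parser. The sample markup, the OutlineParser class, and the outline function are all hypothetical and made up for illustration; the sketch assumes the standardized HTML uses h1 to h6 tags nested in section elements, which may differ from what Calcbench actually returns.

```python
from html.parser import HTMLParser

# Hypothetical sample: NOT real Calcbench output, just an assumed shape
# with header tags nested inside <section> elements.
SAMPLE_HTML = """
<section>
  <h1>Contingencies</h1>
  <section>
    <h2>First Topic</h2>
    <p>Placeholder paragraph one.</p>
  </section>
  <section>
    <h2>Second Topic</h2>
    <p>Placeholder paragraph two.</p>
    <p>Placeholder paragraph three.</p>
  </section>
</section>
"""

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collects each header with its level and the text that follows it."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one dict per header: level, title, text
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            # A new header starts a new entry; the digit in the tag is its level.
            self.sections.append({"level": int(tag[1]), "title": "", "text": ""})
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first header
        key = "title" if self._in_header else "text"
        current = self.sections[-1]
        current[key] = (current[key] + " " + text).strip()


def outline(html):
    """Return a list of {level, title, text} dicts in document order."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.sections


# Print the outline, indented by header level.
for s in outline(SAMPLE_HTML):
    print("  " * (s["level"] - 1) + s["title"] + ": " + s["text"])
```

The same approach applied to the original filing HTML would find no headers at all, because there every header is just another paragraph.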


To get the standardized HTML, use the disclosure API (Calcbench API access required) and pass standardized=True to the DisclosureSearchResults objects returned by the disclosure_search method; see the documentation.


