Easier SEC Disclosure NLP with Standardized HTML
Ingesting SEC disclosures for algorithmic natural language processing (NLP) is difficult because the HTML is poorly formed. Now Calcbench API users can access standardized disclosure HTML.
For instance, Microsoft's Contingencies note looks like this:
Everything is a paragraph, there is no hierarchy, and the headers are not marked up as headers.
Calcbench's standardized HTML looks like this:
The hierarchy of headers is correct, and each header sits in a section with the text to which it refers.
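To illustrate why a correct header hierarchy helps NLP, here is a minimal sketch that turns sectioned HTML into an outline of (level, heading, body text) using only Python's standard library. The sample markup, the OutlineParser class, and the outline function are assumptions made up for this illustration; the snippet is not actual Calcbench output, and the real tag layout may differ.

```python
from html.parser import HTMLParser

# Hypothetical snippet: this tag layout is an assumption for illustration,
# not actual Calcbench output.
SAMPLE_HTML = """
<section>
  <h1>Contingencies</h1>
  <p>Overview text.</p>
  <section>
    <h2>Legal Matters</h2>
    <p>Details about legal matters.</p>
  </section>
  <section>
    <h2>Other Contingencies</h2>
    <p>Details about other matters.</p>
  </section>
</section>
"""

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collect [level, heading text, body text] for each heading found."""

    def __init__(self):
        super().__init__()
        self.entries = []          # one [level, heading, body] list per heading
        self._in_heading = False   # True while inside an <h1>..<h6> element

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # The digit in the tag name is the heading level.
            self.entries.append([int(tag[1]), "", ""])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.entries:
            return  # skip whitespace and any text before the first heading
        slot = 1 if self._in_heading else 2
        current = self.entries[-1]
        current[slot] = (current[slot] + " " + text).strip()


def outline(html):
    """Return a list of (level, heading, body) tuples in document order."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return [tuple(entry) for entry in parser.entries]


sections = outline(SAMPLE_HTML)
for level, heading, body in sections:
    # Indent each heading by its level to show the hierarchy.
    print("  " * (level - 1) + heading)
```

With everything-is-a-paragraph HTML, this kind of outline cannot be recovered without guessing which paragraphs are really headers; with real heading tags, each block of text can be attributed to the heading it belongs to.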
To get the standardized HTML, use the disclosure API (Calcbench API access required) and pass standardized=True to the DisclosureSearchResults objects returned by the disclosure_search method; see the documentation. See the example notebook.