27. Data Cleaning Tutorial

Published: May 10, 2023

Clean, accurate, and complete data is the foundation of an algorithmic trading system. Inaccurate data degrades the algorithm's performance both during backtesting and in live trading.

Common Data Errors

  • Data collection programming errors. Common in internal data collection systems and third-party data services.

  • Partial data loss errors. Common when the data collection or storage system runs into unexpected problems. For example, a data source changing its data structure can lead to data loss.

  • Raw data errors. Occur when the data supply for the whole market experiences issues that cannot be fixed in the short term. An example is an erroneous buy order for tens of millions of contracts on the Vietnamese derivatives market.

  • Manual input errors. Occur in financial statement data and other information whose data entry has not yet been automated.

  • Empty data fields. Appear when documents are scanned incorrectly, for example the financial statements of stock tickers on UPCOM.

  • Time differences. Occur when a data source updates with a large delay relative to the issued timestamp. This error is most common in financial statement data of small companies on the UPCOM market.

  • Past data adjustments. These are particularly difficult errors to fix, since historical data may be adjusted for various reasons. They are common in stock data.

In Vietnam, errors also frequently occur in data other than price and volume. Algorithmic traders should evaluate data sources carefully before developing algorithms. For inaccurate or skewed data that cannot be collected automatically, traders should consider skipping the data source until the cost of the data falls to an acceptable level.

How to Clean Data

Standardize Data

  • Define data standards. Different systems will always have different data standards, but data needs to be standardized to ensure compatibility, reliable data retrieval, and long-term scalability of the system. For example, in a Python system the time standard is a Unix timestamp stored as a float (a minimal sketch follows this list).

  • Standardize data. Input data typically arrives in formats that differ from those used in the algorithmic trading system, so it should be standardized before integration.

  • Define the trusted base data source. This is the most reliable source and can provide first-tier data for the entire market. For price and volume data on Vietnam's derivatives market, HNX is the trusted base source. The base source typically offers the highest accuracy and stability, and defining and using it improves system stability and reduces the overhead cost of data cleaning.
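
A minimal sketch of this standardization step in Python, assuming raw vendor records arrive with heterogeneous time formats (epoch numbers, datetime objects, or ISO 8601 strings); the record field names are hypothetical and not taken from any particular data provider.

    from datetime import datetime, timezone

    def to_unix_timestamp(value):
        """Normalize a heterogeneous time value to a float Unix timestamp."""
        if isinstance(value, (int, float)):
            # Assumption: values above ~1e12 are epoch milliseconds, not seconds.
            return float(value) / 1000.0 if value > 1e12 else float(value)
        if isinstance(value, datetime):
            # Naive datetimes are assumed to be UTC.
            return value.replace(tzinfo=value.tzinfo or timezone.utc).timestamp()
        if isinstance(value, str):
            # ISO 8601 strings, e.g. "2023-05-10T10:44:05+07:00".
            return datetime.fromisoformat(value).timestamp()
        raise TypeError(f"Unsupported time value: {value!r}")

    def standardize_record(raw: dict) -> dict:
        """Map a raw vendor record (hypothetical field names) onto the system schema."""
        return {
            "time": to_unix_timestamp(raw["time"]),
            "symbol": str(raw["symbol"]).upper(),
            "price": float(raw["price"]),
            "volume": int(raw["volume"]),
        }

Every record entering the system then carries the same field names and types, regardless of which source produced it.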

Data Validation

Compare two data sources to validate and clean data. Basic validation techniques include the following (a combined sketch appears after the list):

  • Identify missing data. Add the missing data to the system database or remove the affected records entirely.

  • Identify duplicate data and remove it. Duplicates are usually related to a listed-company event and are often published at different times.

  • Identify and remove data anomalies. Anomalies can be detected with simple statistical and probabilistic methods. One example from the Vietnamese market is a derivative price that remains unchanged because trading volume exceeds the processing capacity of the trading system, resulting in a constant price over a long period of time. Another example is the appearance of unusually large orders of millions of contracts while the average daily trading volume is around 200,000 contracts.

  • Identify invalid data. Trading prices that violate ceiling and floor prices or other straightforward constraints are clear signals to review and clean the data. Typically, one invalid data point helps identify a wider range of data that needs cleaning and adjustment.
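
A minimal validation sketch in Python, assuming both sources are available as pandas DataFrames with the hypothetical columns "time", "symbol", "price", and "volume"; the rolling window, the z-score threshold, and the function name are illustrative assumptions rather than fixed rules.

    import pandas as pd

    def validate(primary: pd.DataFrame, reference: pd.DataFrame,
                 ceiling: float, floor: float) -> pd.DataFrame:
        """Cross-check the primary data against a reference source and drop bad rows."""
        key = ["time", "symbol"]

        # Missing data: rows present in the reference source but absent from the primary data.
        merged = reference.merge(primary, on=key, how="left",
                                 suffixes=("_ref", ""), indicator=True)
        missing = merged[merged["_merge"] == "left_only"]

        # Duplicates: the same record reported more than once.
        deduped = primary.drop_duplicates(subset=key, keep="first")

        # Anomalies: flag volumes far above the recent average using a simple
        # rolling z-score; window size and threshold are assumptions.
        vol = deduped["volume"]
        z = (vol - vol.rolling(500, min_periods=50).mean()) / vol.rolling(500, min_periods=50).std()
        anomalies = deduped[z.abs() > 6]

        # Invalid data: prices outside the ceiling/floor band.
        invalid = deduped[(deduped["price"] > ceiling) | (deduped["price"] < floor)]

        print(f"missing={len(missing)}, anomalies={len(anomalies)}, invalid={len(invalid)}")
        return deduped.drop(anomalies.index.union(invalid.index))

Rows flagged as missing would then either be backfilled from the trusted base source or removed entirely, as described above.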

In addition, some data errors are easy to recognize with a close look. An example is the closing volume in the Vietnamese stock market, which may not be updated correctly by third-party service providers.