This section covers the data aspects of algorithmic trading: the two groups of data, the principles of data collection, and common data management issues.

Two Data Groups in Algorithmic Transactions

There are two main groups of data in algorithmic trading: input trading data (market data, financial data, commodity data) and output trading data (algorithmic trading data). Sometimes, the output trading data may serve as input for other algorithms.

Input trading data includes, but is not limited to, the standard data in algorithmic trading described in article 26. An example is real-time market data for a financial instrument. The rate at which real-time data arrives varies with market size.
Output trading data is the data generated while trading systems operate. It includes buy and sell signals derived from input trading data, as well as the orders the algorithms place. Input trading data drives the algorithms, whereas output trading data is used to monitor them. Output data also supports the research and development of new algorithms. During operations, it is important to keep only essential information, both to avoid overloading the system operator and to optimize data storage.

Two Important Criteria in Selecting Data Sources

Data latency and completeness are the two main criteria when selecting data sources. In addition, defining key concepts and mastering the technology are of utmost importance in collecting and managing data.

Data Latency. Latency is the time difference between when data is generated and when an algorithm receives it. Latency is a key aspect of real-time trading, regardless of market size. As a market grows, investors also need to be concerned about data size, but latency always takes priority. The reason is that in both large and small markets, if latency is too high, the trading algorithm cannot take advantage of market opportunities. However, in trading data, algorithmic traders may tolerate higher latency than usual.
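The definition of latency above can be illustrated with a minimal sketch. The function name `latency_ms` and the sample timestamps are hypothetical, and the sketch assumes that the clocks at the data source and at the receiver are synchronized, which has to be ensured separately in practice.

```python
from datetime import datetime, timezone


def latency_ms(generated_at: datetime, received_at: datetime) -> float:
    """Return the delay in milliseconds between data generation and receipt."""
    return (received_at - generated_at).total_seconds() * 1000.0


# Hypothetical tick: generated at the source, received 250 ms later.
generated = datetime(2024, 1, 2, 9, 15, 0, 0, tzinfo=timezone.utc)
received = datetime(2024, 1, 2, 9, 15, 0, 250_000, tzinfo=timezone.utc)

print(latency_ms(generated, received))  # 250.0
```

Tracking this value per data source over time is one way to compare candidate sources on latency.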

Data latency can be minimized by accessing the best available real-time trading data source. Because latency is the top priority when choosing a data source, it drives the critical choice of the physical location of the data collection system or data server. Due to regional and service limitations, investors should anticipate technical problems when setting up and maintaining services. Planning for these problems also helps optimize costs when implementing trading algorithms in the future.

Data Completeness (Data Coverage). Data completeness, or coverage, is the ratio of the data obtained from a source to the actual data. For example, suppose a market has a total trading time of 360 minutes and a data source, due to technical limitations, can only record 336 minutes of trading data. In this case, the completeness of the data source is 336 divided by 360, which is approximately 93.33%. In most cases, it is not straightforward to measure the coverage of transaction data in real time. Therefore, having multiple data sources and cross-checking them against each other is highly important.
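The coverage calculation and the idea of cross-checking sources can be sketched as follows. The function names and the representation of each source as a set of minute indices are assumptions made for illustration only.

```python
def coverage(recorded_minutes: float, total_minutes: float) -> float:
    """Return the ratio of recorded trading time to total trading time."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return recorded_minutes / total_minutes


def missing_minutes(source, reference):
    """Return the minutes present in `reference` but absent from `source`."""
    return set(reference) - set(source)


# The example from the text: 336 of 360 minutes recorded.
ratio = coverage(336, 360)
print(f"{ratio:.2%}")  # 93.33%

# Hypothetical cross-check: source A stops early, source B covers the full session.
source_a = set(range(336))
source_b = set(range(360))
gaps = missing_minutes(source_a, source_b)
print(len(gaps))  # 24
```

A cross-check like this only reveals gaps that at least one other source has filled, which is why the number and independence of sources matter.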

Data Storage and Management in Algorithmic Trading

Key concepts. The first thing to consider is not the storage system but what data should be stored. Some basic concepts to define from the beginning are prices, orders, signals, and order executions. Selecting the right concepts helps in planning and designing a more efficient data system. A reliable database contributes to fast, smooth, flexible, and scalable system development, and makes the data storage system easy to upgrade in the future.
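One possible way to pin down these concepts before choosing any storage technology is to write them as plain record types. The sketch below uses Python dataclasses; every class and field name is a hypothetical choice for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Price:
    symbol: str
    timestamp: datetime
    value: float


@dataclass(frozen=True)
class Signal:
    symbol: str
    timestamp: datetime
    side: str  # "buy" or "sell"


@dataclass(frozen=True)
class Order:
    order_id: str
    symbol: str
    side: str
    quantity: int
    limit_price: float


@dataclass(frozen=True)
class Execution:
    order_id: str  # links the execution back to its order
    timestamp: datetime
    filled_quantity: int
    fill_price: float


# Hypothetical flow: a signal leads to an order, which is later executed.
ts = datetime(2024, 1, 2, 9, 15)
signal = Signal(symbol="ABC", timestamp=ts, side="buy")
order = Order(order_id="o-1", symbol=signal.symbol, side=signal.side,
              quantity=100, limit_price=101.5)
execution = Execution(order_id=order.order_id, timestamp=ts,
                      filled_quantity=100, fill_price=101.4)
```

Writing the fields down this way makes the relationships between concepts explicit, here the shared `order_id`, before any tables or keys are designed.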

Technology selection. Once concepts and data fields are defined, the choice of technology stack is less important. Having the right tools and the skills to use them matters more than the specific technology chosen.

Data storage management. There are two main types of data storage in algorithmic trading: temporary storage for real-time trading data, and a database for historical trading data.

Temporary storage. This is a repository of real-time trading data, including market, index, and commodity data. Since algorithms must reach data as quickly as possible, storing it in an in-memory database or a caching system such as Redis is a reasonable choice. Note that the data should have an expiry time, because it is mostly used for intraday trading. Temporary data from the previous day, or from any previous period, is usually unusable. Using stale data can lead to serious consequences, especially for intraday trading algorithms. Therefore, checking that data is correct in terms of time and setting expiry times are mandatory. If data from previous days or periods needs to be accessed, a database is more suitable than a temporary storage system.
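The expiry behavior described above can be sketched with a small in-memory cache. This is a simplified stand-in written in pure Python so that it runs without a server; it is not Redis itself. The class name `ExpiringCache`, the key format, and the 60-second expiry are assumptions. In Redis, a comparable effect comes from setting a key with an expiry, for example `SET key value EX 60`.

```python
import time


class ExpiringCache:
    """Minimal in-memory key-value store whose entries expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if the key is missing or has expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop stale data so it cannot be reused
            return None
        return value


# Hypothetical usage with a fake clock.
now = [0.0]
cache = ExpiringCache(clock=lambda: now[0])
cache.set("price:ABC", 101.5, ttl_seconds=60)
fresh = cache.get("price:ABC")  # still valid
now[0] = 61.0                   # 61 seconds later
stale = cache.get("price:ABC")  # expired, so None is returned
```

Returning nothing for an expired key forces the algorithm to handle missing data explicitly instead of silently trading on a price from a previous period.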

Database. When a trading day ends, the intraday trading data should be written to the database for long-term use. This data is used to research and develop new algorithms, backtest existing algorithms, and improve algorithms already in operation. Popular databases include Postgres and MySQL. These databases are used by data management and mining tools like Elasticsearch, Logstash, and Kibana (ELK) to display, illustrate, track, report, summarize, analyze, and evaluate historical data.
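The end-of-day write can be sketched as below. The text mentions Postgres and MySQL; this sketch substitutes Python's built-in `sqlite3` module with an in-memory database purely so that it runs without a server. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3


def persist_intraday(conn, rows):
    """Write (symbol, timestamp, price) rows to a long-term prices table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "symbol TEXT NOT NULL, ts TEXT NOT NULL, price REAL NOT NULL, "
        "PRIMARY KEY (symbol, ts))"
    )
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT OR REPLACE INTO prices (symbol, ts, price) VALUES (?, ?, ?)",
            rows,
        )


# Hypothetical intraday data collected during the trading day.
conn = sqlite3.connect(":memory:")
rows = [
    ("ABC", "2024-01-02T09:15:00", 101.5),
    ("ABC", "2024-01-02T09:16:00", 101.7),
]
persist_intraday(conn, rows)
persist_intraday(conn, rows)  # re-running does not create duplicate rows

count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(count)  # 2
```

Keying rows on symbol and timestamp makes the end-of-day job safe to re-run after a failure, since repeated writes replace rows rather than duplicating them.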

28. Data Management in Algorithmic Trading
