TL;DR
Over 2022/23, while working at Mainstream Renewable Power on an internal web application called StationManager (used by the Energy Analysis Group), I maintained a "data pipeline" which fetches sensor readings from some of the world's most remote places and transforms them into useful data sets - the data sets upon which decisions to build renewables (wind turbines or solar panels) on a site hinge.

By switching to a dedicated timeseries database (`Postgres`/`TimescaleDB`) and standardising all readings into a consistent format before storage, I was able to greatly simplify both the system & the code needed to process the readings.
Adapting the system was not straightforward, but was ultimately worth the effort:
- I reduced the number of components in the system, which reduced the number of failure modes, and the lines of code required to glue them together - adding ~1,000 lines enabled removing ~25,000
- I could (and did) add tests using representative sample data to guarantee the behaviour of importing and exporting timeseries readings
I owe much of the motivation behind this project to Hadley Wickham’s “Tidy Data” paper & Dan McKinley’s Choose Boring Technology.
The system’s job -
Its core value lies in fetching files from remote loggers (which record sensor readings) and transforming them into files useful for analysis[^4] -
Over 2014/15 a brave individual (Paul Hughes) pulled the bulk of a system together. It then passed through the hands of three more people (Sean Hayes, Andrew McGregor & Tomasz Jama-Lipa) before reaching me, with each person adding their own twist to keep it alive & make it useful.
After a year of struggling to keep the show on the road, I spent a year rebuilding its foundations.
So how did it work? And what did I do differently?
I want to thank my manager of the last two years, Romain Molins, without whose backing none of this would have been possible.
Getting started
It took me 2-3 months before I was "comfortable" enough to make a code change -

- The server crashed multiple times a week
- The code was untested[^5]
- I couldn't run the code on my laptop because there was no setup in place for a local developer environment
- Not all dependencies (`Python` & non-`Python` 3rd party libraries) in use were documented
- I had never used the `Python` web framework in which the web application was written (`Django`), or anything like it
- I didn't know anything about relational databases, upon which the web application relies to store data
First up, the crashes.

It turns out that the web application was crashing because it was running on `Command Prompt` on a Windows Virtual Machine in which `Quick-Edit Mode` was enabled (the default behaviour). On Romain's hunch, I turned it off & bingo, no more crashes.
I managed to hunt down all dependencies & set up a reproducible developer environment on my laptop, using `poetry`[^6] for `Python` & `Docker Compose`[^7] for everything else (like databases & non-`Python` libraries). Later, when `Docker` refused to run on my machine due to networking issues, I was forced to rebuild this developer environment in `nix` via `devenv.sh`.
With my new superpower I could now change things. And so I did. And something broke. And so I patched it. And something broke…
Under pressure to fix bugs, I made changes without first covering the system in tests[^QIO], and I suffered for it. So I paused all code changes and set about testing the most common usage scenarios.
I scripted a bot to replicate these scenarios by clicking through a browser via `Selenium`, and I automated running this bot before any code change could be accepted via `GitHub Actions`[^8], which by luck was made possible by the work on the local developer environment.
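For illustration, here's a minimal sketch of what one of those browser-clicking checks might look like - the URL, login fields and page text are hypothetical stand-ins rather than the real StationManager ones.

```python
# A minimal Selenium smoke test, run headlessly so it also works in CI.
# The URL, form fields and expected page text are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://localhost:8000/login")
    driver.find_element(By.NAME, "username").send_keys("test-user")
    driver.find_element(By.NAME, "password").send_keys("test-password")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    # The scenario passes if the station list renders after logging in
    assert "Stations" in driver.page_source
finally:
    driver.quit()
```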
I was now somewhat happier to make changes to the web application.
I was still scared, however, to touch the data pipeline - the `Python` glue that transformed raw data into useful file outputs. I was scared because I couldn't easily test it, so I couldn't be sure that my changes wouldn't break things. Why?
This web application depends on a database, so before testing a particular thing one has to -
- Replace the database with a “test” database
- Write to the “test” database all data that is required for the thing to work
This is “okay” to do & well documented online since it’s standard practice.
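In `Django` this looks roughly like the sketch below - `django.test.TestCase` swaps in a throwaway "test" database automatically and rolls it back between tests; the `Sensor` model, its fields and the URL are hypothetical.

```python
# A minimal sketch of a Django test backed by a "test" database.
# The app, model, fields and URL are hypothetical stand-ins.
from django.test import TestCase

from stationmanager.models import Sensor  # hypothetical


class SensorListTests(TestCase):
    def setUp(self):
        # Written to the "test" database, never the real one
        Sensor.objects.create(name="WS10.180", data_type="Wind Speed")

    def test_wind_speed_sensor_is_listed(self):
        response = self.client.get("/sensors/")
        self.assertContains(response, "WS10.180")
```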
The pipeline, however, depended on two different databases (`MySQL` & `Microsoft SQL Server`) on two different servers and on "cache"[^9] files -

I couldn't for the life of me work out how to replace the two databases with a "test" database to test this "glue" code well enough to confidently make changes to it[^10].
And this “glue” code was only the tip of the iceberg.
It was the job of a collection of different tools to fetch readings from remote loggers & import these files to the sensor readings database, and any of them going wrong caused issues downstream. So for a year I (mostly) avoided touching this.

And there certainly were problems which prevented the entire team from doing their job[^11].
So when I was finally confident that I could rebuild the data pipeline more cleanly on top of better foundations, I did.
The “old” way -
How files were fetched from remote loggers
To fetch files from remote loggers manufactured by `Campbell Scientific`, it's quite straightforward[^12] …
- Install their `LoggerNet` software on a `Microsoft Windows` server
- Configure the remote sensors (or "loggers") in `LoggerNet`

… & tada, the files are fetched & saved to this `Microsoft Windows` server.
But what about the non-`Campbell Scientific` loggers?

Nothing off-the-shelf corresponding to `LoggerNet` existed. Remote sensors are normally in very remote places (like deserts) so connecting to them is not cheap or easy. So a 3rd party, `SmartGrid Technologies`, was hired to sync files from the non-`Campbell Scientific` loggers to their cloud-based filesystem. These files still needed to be synced from there, so a `Python` job was created to do so.
These files are normally compressed & encrypted. So another `Python` job was created to automatically unzip them (via `7zip`) & decrypt them (via `ZPH2CSV.exe`) where relevant.

These jobs need to be run periodically. So a batch file specifying the `Python` jobs was scheduled in `Task Scheduler` on the same server as the web application.
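A sketch of what such a job might look like is below - the folder layout and the exact `7zip` & `ZPH2CSV.exe` invocations are assumptions rather than the real configuration.

```python
# A minimal sketch of the unzip & decrypt job.
# Paths and command-line arguments are assumptions.
import subprocess
from pathlib import Path

INBOX = Path(r"C:\data\inbox")      # where synced files land (hypothetical)
STAGING = Path(r"C:\data\staging")  # where readable text files end up (hypothetical)

# Unzip every synced archive into the staging folder
for archive in INBOX.glob("*.zip"):
    subprocess.run(["7z", "x", str(archive), f"-o{STAGING}", "-y"], check=True)

# Decrypt any encrypted logger files into plain text
for encrypted in STAGING.glob("*.zph"):
    subprocess.run(["ZPH2CSV.exe", str(encrypted)], check=True)
```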
How files were imported to a database
Most logger files only contain readings for at most a few days, so all files associated with a particular logger need to be amalgamated.
`Campbell Scientific` provide an application called `LNDB` which automatically exports logger readings to database tables. Each database table contains all readings for a particular logger & looks like …
timestamp | sensor_name_1 | … | sensor_name_n |
---|---|---|---|
value | value | … | value |
Again, what about the other loggers?
Again, nothing off-the-shelf existed, and so a custom equivalent was built in `Python` -
Reading text files & importing them to a database table is surprisingly hard -
- How is it encoded?
- Are the first few lines metadata?
- What columns represent timestamps? How are they formatted?
- Are there columns in the file that are not reflected in the database table? If so, the database table must be updated!
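To give a flavour of the guesswork involved, here's a minimal `pandas` sketch for one hypothetical vendor's conventions - the encoding, the number of metadata rows and the timestamp format are exactly the things that differ per manufacturer.

```python
# A minimal sketch of parsing one (hypothetical) vendor's logger file.
# The encoding, metadata rows and timestamp format are guesses that
# would need to be adapted per manufacturer.
import pandas as pd

df = pd.read_csv(
    "logger_123_2023-11-27.csv",
    encoding="latin-1",        # some vendors don't use utf-8
    skiprows=4,                # first few lines are logger metadata
    parse_dates=["timestamp"],
    dayfirst=True,             # timestamps like 27/11/2023 00:00
)

# Any column not yet present in the database table forces a schema change
print(df.columns)
```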
`LNDB` knows what type of file it has to deal with, since all files are `Campbell Scientific`. The other manufacturers each have their own conventions. So a `Python` job was needed to adapt to the conventions of each type of logger. It also tracked which files had been imported & which had not, so this import status was "viewable" by the team.
`IEA Task 43` is worth a mention here. This valiant cross-organisational team is pushing to standardise data exchange to help relieve this particular burden.
How about importing multiple files at the same time? The team used a “Task Queue” …
A task queue works like a restaurant. The waiters add an order to the queue & the chefs pull orders from the queue when they have time to process it.
… to queue import jobs and process these jobs using as many workers as available.
In practice, the task queue `huey` didn't actually run tasks in parallel, as this wasn't well supported[^13] on either `Windows` or the task queue database `sqlite`. It only ran one job at a time.
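For context, queueing an import with `huey` (backed by `sqlite`, as the old system was) looks roughly like this sketch - the task body and its argument are hypothetical.

```python
# A minimal sketch of queueing file imports with huey (sqlite-backed).
# The task body and its argument are hypothetical.
from huey import SqliteHuey

huey = SqliteHuey(filename="task_queue.sqlite")


@huey.task()
def import_logger_file(path: str) -> None:
    # parse the file & write its readings to the readings database
    ...


# Calling the decorated function enqueues it; a worker picks it up later
import_logger_file("logger_123_2023-11-27.csv")
```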
Finally, the importer also handled files uploaded directly to the file server. To do so the team would have to …
- Connect to the Virtual Private Network (VPN)
- Map this remote file server as a network drive
- Copy & paste across
… and since mapped network drives didn’t allow granting “write access” without also granting “delete access” it was very possible that someone could accidentally delete everything.
How sensor readings were cleaned
Before readings can be used for analysis they need to be processed[^14] and cleaned[^15] every time new data is added to a source. Oftentimes a particular source might be associated with millions of readings[^16], so how was this scaled?
This one’s a bit hairy.
- `pandas`, a `Python` data-manipulation library, asks the "sensor metadata" database for sensor metadata (calibrations, type of data measured …), connection strings to the "readings" database, & user-specified timestamps of erroneous readings via `Django` (powered by `mysqlclient`)
- `pandas` asks the "timeseries" database for sensor readings via `SQLAlchemy` (powered by `pyodbc` & `ODBC Driver for SQL Server 17`)
- If specified, `pandas` caches readings to a `pickle` file to skip round trips to the timeseries database
- `pandas` re-calibrates sensor readings & filters out erroneous ones using rules & user-specified timestamps of erroneous readings

This is all glued together by a magic `Python` class called `StationRaw`. It hides this system complexity behind a friendly interface.
How sensor readings were accessed
Now for the “visible” parts -
Data was accessed from either the web application (built using the `Django` web framework), or from a `Jupyter Notebook` server - a web-based interactive computing platform.
The web application receives two types of requests: a "normal" request and a "big" request.

Most requests are "normal" requests, and are well served by `Django` -

- To display a web page it asks a database for the data it needs to render the files that the browser needs (`HTML`, `CSS` & `JavaScript`) to display a user interface
- To serve "static" files like `csv` text files or images it can just send them directly (typically by routing to another tool like `NGINX` or `Apache`)
"Big" requests are harder to handle. If someone wanted to re-export a particular source then millions of readings needed to be processed to do so, and so they had to wait however long it took for the `Python` "glue" to pull everything together.
The web application also enabled configuring the data pipeline to change the exported data sets. One could flag erroneous readings graphically so that they would be filtered out, or change a particular sensor’s settings so the resulting readings would be shifted.
Back in 2014/15 there was a push internally to do all energy analysis in `Python` in `Jupyter Notebooks`; however, by the time I started, these notebooks had been largely phased out in favour of dedicated tooling like `Windographer`, and were only used for scenarios not yet covered by such tooling.
To make the lives of the team easier, the notebooks could be accessed and run from a browser over a Virtual Private Network (VPN) connection without having to install anything.
Since the notebooks relied on the "glue" and the "glue" relied on `Django`, the notebooks shared their environment with the web application. So if someone ran `!pip install X` they could break the web application, or `!del /S C:\*` they could wipe the server. If they ran a notebook that used a lot of RAM or CPU, or if multiple people forgot to close their running notebooks, they could use up the server's resources and bring down the web application & database. This coupling also made rolling out updates to the web application harder, since an update might bring down the notebook server and interrupt a running notebook.
If any step in the data pipeline broke or went down (for whatever reason) then data access failed, and someone would have to work out where and why.
The “new” way -
How files are now fetched from remote loggers
There wasn't much more to be done here other than some housekeeping[^17], since the loggers are configured on-site to use either `LoggerNet` or the `SmartGrid` connection.
How files are now imported to a database
Now for some heftier changes.
Sensor readings were stored in one database, sensor metadata in another, and data was processed in `Python`. The web application needs a database, so why not also use the same database for storing timeseries?

If all data is stored in one database, then why not ask the database to process readings instead of using `Python`?
How data is stored can introduce a lot of accidental complexity!
Let’s say we have two database tables corresponding to two different loggers …
timestamp | Wind Speed 10m 180deg | Wind Direction 10m 180deg |
---|---|---|
2023-11-27 00:00:00 | 5 | 60 |
… | … | … |
… for one and …
timestamp | WS10.180 | WD10.180 |
---|---|---|
2023-11-27 00:00:00 | 4 | 70 |
… | … | … |
… for another.
Clearly both `Wind Speed 10m 180deg` and `WS10.180` mean the same thing, but to a computer it's not so clear.
How do I find all wind speed readings? Each column name corresponds to a height, magnetic orientation and a type of reading, so I first need to know what each column means.
How do I link "metadata" to each column?

I can't easily express this in the database language `SQL`, so it's hard to link at the database level. `Python` is much more flexible, so I can load the data out of the database, hold readings and metadata separately, and join them as required, albeit in a rather complex manner[^18].
But what if the readings were instead standardised before storing them?
Now how do I find all wind speed readings?
Easy. I no longer need to resort to `Python`, since I can join readings to their metadata in the database on a matching sensor identifier -
So …
timestamp | sensor_id | value |
---|---|---|
2023-11-27 00:00:00 | 1 | 5 |
2023-11-27 00:00:00 | 2 | 60 |
2023-11-27 00:00:00 | 3 | 4 |
2023-11-27 00:00:00 | 4 | 70 |
… joins with …
sensor_id | data_type | height | magnetic_orientation |
---|---|---|---|
1 | “Wind Speed” | 10 | 180 |
2 | “Wind Direction” | 10 | 180 |
3 | “Wind Speed” | 10 | 180 |
4 | “Wind Direction” | 10 | 180 |
… on matching `sensor_id` to form …
timestamp | sensor_id | value | data_type | height | magnetic_orientation |
---|---|---|---|---|---|
2023-11-27 00:00:00 | 1 | 5 | “Wind Speed” | 10 | 180 |
2023-11-27 00:00:00 | 2 | 60 | “Wind Direction” | 10 | 180 |
2023-11-27 00:00:00 | 3 | 4 | “Wind Speed” | 10 | 180 |
2023-11-27 00:00:00 | 4 | 70 | “Wind Direction” | 10 | 180 |
… from which I can now filter on “Wind Speed” or do whatever else I might require.
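As a sketch, that join is only a few lines of `SQL` (run here from `Python`); the `reading` & `sensor` table and column names are illustrative rather than the real schema.

```python
# A minimal sketch of the readings-to-metadata join.
# Table & column names ("reading", "sensor", ...) are illustrative.
from django.db import connection

SQL = """
SELECT r.timestamp, r.value, s.data_type, s.height, s.magnetic_orientation
FROM reading AS r
JOIN sensor AS s ON s.id = r.sensor_id
WHERE s.data_type = 'Wind Speed'
ORDER BY r.timestamp;
"""

with connection.cursor() as cursor:
    cursor.execute(SQL)
    wind_speed_readings = cursor.fetchall()
```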
But won't timeseries readings be really slow to store & query since the table will contain a lot[^16] of readings?

Yes. In a traditional relational database like `Microsoft SQL Server`, `Postgres` or `MySQL` it will be, since these databases are designed to store & query new entries row by row.
But what about timeseries databases?

`TimescaleDB` is an extension to the `Postgres`[^19] database that enables working with timeseries readings from within the database. It provides a special table called a `Hypertable` which speeds up insert & query performance for timeseries tables[^20]. So provided that its performance was comparable to `Python`, I saw it as a viable alternative.

The web application framework, `Django`, works well with `Postgres`, so by switching to `TimescaleDB` I found I could store all sensor metadata & all timeseries in the same database.
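Creating a `Hypertable` is a single call on an ordinary `Postgres` table, roughly as sketched below; the `reading` table, its columns and the connection string are illustrative.

```python
# A minimal sketch of turning a plain Postgres table into a TimescaleDB hypertable.
# The table name, columns and connection string are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=stationmanager")  # hypothetical connection string
with conn, conn.cursor() as cursor:
    cursor.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS reading (
            timestamp TIMESTAMPTZ NOT NULL,
            sensor_id INTEGER NOT NULL,
            value DOUBLE PRECISION
        );
        """
    )
    # Behind the scenes this partitions the table into time-based "chunks"
    cursor.execute(
        "SELECT create_hypertable('reading', 'timestamp', if_not_exists => TRUE);"
    )
```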
So three databases …
… became, well, two databases …
Now this change didn't come for free; I had to …

- Build an importer on the web application to parse sensor files and import them to the `TimescaleDB` database
- Adapt the web application so it would accept sensor files sent from another program (via an "Application Programming Interface" built on `django-rest-framework`)
- Build an exporter to send files to the web application alongside their file type, since the importer needs to know what type of file it is dealing with before it can import it
- Build out a user interface on the web application to allow manually uploading files
- Rebuild the task queue on `dramatiq`[^13] to run multiple file import tasks at the same time[^21] (see the sketch after this list)
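The `dramatiq` rebuild looks roughly like the sketch below; the broker choice (Redis here) and the task body are assumptions rather than necessarily what StationManager uses.

```python
# A minimal dramatiq sketch for running file imports across parallel workers.
# The Redis broker and the task body are assumptions.
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(host="localhost"))


@dramatiq.actor(max_retries=3)
def import_uploaded_file(file_id: int) -> None:
    # look up the uploaded file, parse it according to its file type,
    # & write its readings into the timeseries table
    ...


# The web application enqueues one task per uploaded file;
# worker processes pick them up & run them in parallel
import_uploaded_file.send(42)
```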
Having said all that, I figured it was worth the cost for the simplicity it enables for the next step - data cleaning.
How sensor readings are now cleaned
Once all readings for all sensors were stored in a single table, I could link each sensor reading to its corresponding metadata, re-calibrate, and filter out erroneous readings in only a few lines of `SQL`, the database language.
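A sketch of that query is below. The table & column names (and the idea of storing both the generic and the sensor-specific slope & offset on the sensor row) are illustrative, but the arithmetic follows the `f(x) = mx + c` calibration described in the footnotes: recover the raw frequency from the generically-calibrated value, re-apply the correct slope & offset, and drop readings inside user-flagged periods.

```python
# A minimal sketch of re-calibrating & filtering readings in SQL.
# Table & column names are illustrative.
from django.db import connection

SQL = """
SELECT
    r.timestamp,
    r.sensor_id,
    ((r.value - s.generic_offset) / s.generic_slope) * s.true_slope + s.true_offset
        AS calibrated_value
FROM reading AS r
JOIN sensor AS s ON s.id = r.sensor_id
LEFT JOIN flagged_period AS f
    ON f.sensor_id = r.sensor_id
   AND r.timestamp BETWEEN f.start_at AND f.end_at
WHERE f.id IS NULL;  -- drop readings inside any user-flagged period
"""

with connection.cursor() as cursor:
    cursor.execute(SQL)
    cleaned_readings = cursor.fetchall()
```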
What was not so straightforward, however, was exporting files in the formats that I wanted -

- Big queries can be slow. Asking a database for millions of readings can be slow since - unlike `pandas` - databases read from disk as well as memory[^22].
- Reformatting query results in `SQL` is hard, so I decided to query in batches & stream each batch through `Python` to reformat
Ideally I would have liked to reformat readings from one reading per row to one column per sensor in `Postgres`, since switching to `Python` is slow and inherently more complex. However, I found reformatting & exporting to files via `COPY TO` surprisingly hard in `Postgres`. I wanted to stream from `Postgres` into a `zip` file of multiple text files where each text file represents a different type of sensor (wind speed, direction etc). I just found this too difficult to express in `SQL`. It bothered me that exporting a file for a big source could take longer than 15 minutes.
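What I ended up with looks roughly like the sketch below: query in batches via a server-side cursor and write each sensor type into its own text file inside a `zip`. The table names, data types and output path are illustrative, and a real implementation would spool each file to disk rather than holding it in memory.

```python
# A minimal sketch of exporting one source to a zip of per-sensor-type csv files.
# A server-side (named) cursor streams rows in batches instead of loading
# millions of readings at once. Names & paths are illustrative.
import csv
import io
import zipfile

import psycopg2

conn = psycopg2.connect("dbname=stationmanager")  # hypothetical connection string

with zipfile.ZipFile("source_1.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for i, data_type in enumerate(["Wind Speed", "Wind Direction"]):
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["timestamp", "sensor_id", "value"])

        with conn.cursor(name=f"export_{i}") as cursor:
            cursor.itersize = 50_000  # rows fetched per round trip to Postgres
            cursor.execute(
                "SELECT r.timestamp, r.sensor_id, r.value "
                "FROM reading r JOIN sensor s ON s.id = r.sensor_id "
                "WHERE s.data_type = %s ORDER BY r.timestamp",
                [data_type],
            )
            for row in cursor:
                writer.writerow(row)

        archive.writestr(f"{data_type}.csv", buffer.getvalue())
```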
How to speed up queries?
I could use database indexes[^23] and a task queue to run multiple queries at once. Parallel tasks in this case, however, are not as easy, since exporting millions of readings to a file is very resource intensive -

How many workers should be in the task queue?[^24]

What if the database runs out of connections?[^25]
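Adding and checking an index looks roughly like the sketch below; the index & table names are illustrative.

```python
# A minimal sketch of adding an index for the export query & checking it is used.
# Index & table names are illustrative.
from django.db import connection

with connection.cursor() as cursor:
    # Export queries filter by sensor & scan forward in time
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS reading_sensor_id_timestamp_idx "
        "ON reading (sensor_id, timestamp DESC);"
    )
    # EXPLAIN ANALYZE reveals whether the planner actually used the index
    cursor.execute(
        "EXPLAIN ANALYZE "
        "SELECT timestamp, value FROM reading WHERE sensor_id = 1 ORDER BY timestamp;"
    )
    for (line,) in cursor.fetchall():
        print(line)
```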
How sensor readings are now accessed
Now finally back to the “visible” parts -
Somewhat sadly for the backend engineer, the bulk of this work is mostly “invisible” other than -
- Faster data access
- Faster “big” requests
- Manual file uploads
- Direct access to “raw” files of sensor readings
The system was designed so that data is exported to files in advance, so data access doesn't require any extra work - it's as simple as accessing files.
The “big” requests, like re-exporting files, are offloaded to a task queue which executes them when it has time to do so.
The `Jupyter Notebook` server asks the web application for data (via an "Application Programming Interface" or API) rather than using its `Python` code directly, so the notebooks are portable. They can be run locally or on a dedicated multi-user cloud platform.
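From a notebook, that access now looks roughly like this sketch; the endpoint, query parameters and token handling are hypothetical.

```python
# A minimal sketch of a notebook asking the web application's API for readings.
# The endpoint, query parameters and token are hypothetical.
import pandas as pd
import requests

response = requests.get(
    "https://stationmanager.example.com/api/readings/",
    params={"source": "source_1", "data_type": "Wind Speed"},
    headers={"Authorization": "Token <your-api-token>"},
    timeout=60,
)
response.raise_for_status()

readings = pd.DataFrame(response.json())
```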
Closing Remarks
After all of that, I still had to manage complexity and so I still had failure modes[^26].

I didn't manage a 100% smooth transition from one system to the other. There were issues, and on more than one occasion the data pipeline went down. I made my best effort at covering the new system in code tests (to check each part was doing what I designed it to do) but it's really hard to cover all scenarios. Mistakes can't be avoided, but they can be managed. The only way to remain sane is to find out about a problem as soon as it occurs, via email alerts or otherwise[^27]. Tests will only get you so far.
On the bright side, I was finally able to fully test the data flow on sample data from all of the logger manufacturers used so far, so I could (mostly) guarantee the behaviour of importing, processing & exporting. This “test suite” can now be built upon each time an issue with the data pipeline occurs that I hadn’t anticipated, and in this manner strengthened.
The aim of this project was not to provide flashy new things, but rather to set up foundations which can be built upon for years to come & on which a developer can rely to keep them out of trouble.
Has this been achieved?
Only time will tell!
Footnotes
[^1]: Sensor readings are tracked by loggers. Loggers save readings to text files; these files are synced to a server, processed, cleaned, amalgamated, reformatted into useful files, & made accessible via a friendly web application user interface.

[^2]: The team need to know if loggers are syncing or if a sensor is faulty, so meteorological station managers can go on-site & fix them if needs be. As well as flagging erroneous readings automatically, the system provides a user interface for manually flagging.

[^3]: In cases where a particular type of analysis is not yet well served by 3rd party tooling, the team uses a shared `Jupyter Notebook` server to explore & visualise data in `Python`.

[^4]: In most cases a `Windographer` file; `Windographer` is a 3rd party tool used to analyze & visualize wind resource data.

[^5]: Software tests check that code does what it was designed to do. They provide software engineers with guard-rails, since if an aspect of a system is well tested then you might find out if your change breaks something before you roll it out.

[^6]: `poetry` is a `Python` library for tracking & managing other 3rd party `Python` libraries. Sometimes `Python` libraries depend on non-`Python` libraries, which `poetry` by design cannot manage, so one has to resort to something like `Docker` (or `conda` or `nix`).

[^7]: `Docker` lets you define an operating system in a configuration file; it can then spin this up in the background & launch your code in it. If you share the configuration with others they can use it to spin up the same operating system as you. `Docker Compose` lets you define a configuration file in which multiple systems are defined, typically a database & an operating system. It will spin up all of these systems & link them to one another.

[^8]: `GitHub Actions` lets you define a configuration file which specifies actions (`bash` commands) to run in scenarios like "on receiving proposals for code changes".

[^9]: If something takes a long time to run & is run multiple times, it is common to cache it to a file or a database & use the cache to skip reruns.

[^10]: The "glue" code accesses connection strings (including credentials) for one database from the other database, so I could fill one "test" database with connection strings pointing towards the other "test" database, which once filled with sample test data would do the job. The glue was now kind of tested but still a mess, so this left me with a flakily tested mess. The code complexity reflected the system complexity.
[^11]: In one case, a tool that fetches files of sensor readings from remote sensors (`LoggerNet`) went down for a few days. Upon restoring it we noticed that the "readings" database was missing a few days of readings, and the tool that imports these files to this database (`LNDB`) refused to backfill these missing readings. So I had to manually import each file for each gap. In another case, I attempted to update `pandas`, a 3rd party `Python` library used to "glue" the pipeline together, to the latest version. This update resulted in invalid sensor readings being exported from the system to analysts. It took us a few weeks to notice & luckily had no impact, but was stressful nonetheless.

[^12]: At least from a software engineer's perspective; installing loggers & configuring `LoggerNet` is not something I have experience with.

[^13]: I couldn't figure out how to use the prior task queue engine `huey` to run tasks in parallel on a `Windows` operating system. Most task queues use `*nix`-only features for parallelism so don't bother supporting it. `Windows` has been a hard constraint on us, and can really limit tooling options. Thankfully, `dramatiq` supports `Windows` & proved itself to be a viable alternative.

[^14]: For example: wind speed is typically measured by an anemometer, which counts the number of rotations per second (or frequency) of its spinning cups. To convert these rotations to wind speed one needs to know an anemometer's "calibrations", which are measured in a wind tunnel. Most of the loggers recorded wind speed using "generic calibrations", and so need to be "re-calibrated" using calibrations that are specific to a particular sensor. If sensor readings are not re-calibrated using the correct settings then the readings will be off & the analysis relying on them will be wrong. Mathematically, "calibrations" refer to the slope, `m`, and offset, `c`, in `f(x) = mx + c` where `x` refers to frequency & `f(x)` to wind speed.

[^15]: Sometimes there are issues with a sensor. If it requires replacement, it takes time to go on-site and do so. The values recorded in the interim will be wrong, so they need to be filtered out.

[^16]: Typically each logger records average, standard deviation, minimum & maximum values every 10 minutes for each sensor. If a logger linked to 20 sensors records for 6 years, then it will produce `20 sensors * 4 reading types * 6 readings/hour * 24 hours * 365 days * 6 years = 25,228,800 readings`.

[^17]: Credentials & file paths were hard-coded into the `Python` job so I pulled them into a `TOML` file to make them easier to edit. I also adapted the scripts into notebooks, each notebook doing one thing (unzipping, decrypting, file syncing), to make them easier to read & edit.

[^18]: Like -

    ```python
    {
        "source_1": {
            "timeseries": {
                "timestamp": ["2023-11-27 00:00:00", ...],
                "Wind Speed 10m 180deg": [5, ...],
                "Wind Direction 10m 180deg": [60, ...],
            },
            "metadata": {
                "Wind Speed 10m 180deg": {
                    "data_type": "Wind Speed",
                    "magnetic_orientation": 180,
                    "height": 10,
                },
                "Wind Direction 10m 180deg": {
                    "data_type": "Wind Direction",
                    "magnetic_orientation": 180,
                    "height": 10,
                },
            },
        },
        "source_2": {
            "timeseries": {
                "timestamp": ["2023-11-27 00:00:00", ...],
                "WS10.180": [4, ...],
                "WD10.180": [70, ...],
            },
            "metadata": {
                "WS10.180": {
                    "data_type": "Wind Speed",
                    "magnetic_orientation": 180,
                    "height": 10,
                },
                "WD10.180": {
                    "data_type": "Wind Direction",
                    "magnetic_orientation": 180,
                    "height": 10,
                },
            },
        },
    }
    ```
[^19]: It's a very popular database, see StackOverflow's 2023 Developer Survey.

[^20]: As of 2023-11-20 `TimescaleDB` by default chunks readings in compressed hypertables into intervals of one week, so `Postgres` doesn't need to read all rows for queries that only care about a particular period of time, see Hypertables.

[^21]: Except this time the task queue engine `dramatiq` was actually able to run tasks in parallel on `Windows`.

[^22]: Since they're careful with memory usage; this also means that a single database query won't use up all of a server's memory & crash it.

[^23]: I found out the hard way that if you don't create appropriate indexes for your queries then they will take forever to run. `TimescaleDB` wrote up a very helpful blog on this topic. I managed to improve performance quite a lot by wrapping my slow queries in `EXPLAIN ANALYSE` to see whether or not they actually used the indexes I created for them.

[^24]: I found estimating the appropriate number of workers for the task queue to be somewhat of a fine art. I experimented with various numbers while watching resource usage to guess appropriate numbers.

[^25]: The web application and the workers both needed connections. I ran into trouble when my task queue workers exhausted the `Postgres` connection pool, which caused the connected web application to crash. I worked out that I could limit the number of connections by routing my `Postgres` connection through a connection pool via `PgBouncer`, which forced reusing connections rather than spinning up new ones. This helped but wasn't enough. I found that `Postgres` was still spinning up parallel workers to answer particular queries if the query planner decided this was necessary, so only after fiddling with `max_parallel_workers_per_gather` & `max_parallel_workers` in `postgresql.conf` was I able to bring this under control.

[^26]: Failure modes -

    - Sensor readings are not uploaded to the web application
    - Sensor readings are not imported to the database
    - A file export fails
    - `LoggerNet` or `SmartGrid` go down, so no new sensor readings are fetched
    - A database backup fails

[^27]: A Workflow Orchestration tool like `prefect` might have given me a lot of comfort; however, I was hesitant to lock us into another cloud product.