“How do you eat an octopus? One bite at a time…”
When I first encountered Singer, I was armed with basic Python skills and experience with REST APIs. I was familiar with data integration and ETL, but I had no knowledge of the Singer platform. After learning how to use Singer to set up a tap and pipe data to a target, I immediately wanted to learn how to build a tap of my own. I began coding a very basic tap with a few endpoints. The Singer project provides documentation, but I struggled with where and how to start.
As I worked my way through the process I began to see how everything fits together – and once I could see it, I thought it would be helpful to map it out for other new Singer developers. I created this infographic to serve as a guideline based on my experience developing several taps.
Let me walk you through the steps I take.
As I begin tap development, I seek to understand the data source API: authentication, endpoints, query parameters (especially sorting and filtering), pagination, error codes, and rate limiting. I do this by reading the API documentation and by running REST GET requests with an app like Talend’s API Tester. For each of the streams, I record the endpoint URL, query parameters, primary key, bookmark fields, and other metadata in streams.py. I examine the API response formats, nested objects and arrays, and field data types and create a schema.json file for each endpoint. (Singer Tools’ singer-infer-schema can help out here.) Once you have the schema.json files and a well-documented streams.py containing the necessary stream and endpoint metadata, writing the discover.py and schema.py for a new tap becomes almost “boilerplate,” meaning they don’t change much from tap to tap.
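To make the streams.py idea concrete, here is a minimal sketch of stream metadata driving a boilerplate discovery step. The stream names, paths, fields, and inline schemas are hypothetical; a real tap would load each schema from its JSON file and would likely lean on the singer-python library rather than hand-built dictionaries.

```python
import json

# Hypothetical stream metadata, in the spirit of a streams.py file.
STREAMS = {
    "customers": {
        "path": "customers",
        "key_properties": ["id"],
        "replication_method": "INCREMENTAL",
        "replication_keys": ["updated_at"],
    },
    "plans": {
        "path": "plans",
        "key_properties": ["id"],
        "replication_method": "FULL_TABLE",
        "replication_keys": [],
    },
}

# Hypothetical schemas; a real tap would read these from schema JSON files.
SCHEMAS = {
    "customers": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": ["null", "string"]},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    },
    "plans": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "label": {"type": ["null", "string"]},
        },
    },
}


def build_metadata(stream_name, schema):
    """Build Singer-style metadata entries for one stream and its fields."""
    meta = STREAMS[stream_name]
    # Primary keys and bookmark fields are always replicated.
    automatic = set(meta["key_properties"]) | set(meta["replication_keys"])
    entries = [{
        "breadcrumb": [],
        "metadata": {
            "table-key-properties": meta["key_properties"],
            "forced-replication-method": meta["replication_method"],
            "valid-replication-keys": meta["replication_keys"],
            "inclusion": "available",
        },
    }]
    for field in schema["properties"]:
        inclusion = "automatic" if field in automatic else "available"
        entries.append({
            "breadcrumb": ["properties", field],
            "metadata": {"inclusion": inclusion},
        })
    return entries


def discover():
    """Return a catalog dictionary with one entry per stream."""
    streams = []
    for name, meta in STREAMS.items():
        schema = SCHEMAS[name]
        streams.append({
            "stream": name,
            "tap_stream_id": name,
            "schema": schema,
            "key_properties": meta["key_properties"],
            "metadata": build_metadata(name, schema),
        })
    return {"streams": streams}


if __name__ == "__main__":
    print(json.dumps(discover(), indent=2))
```

Because everything stream-specific lives in the two dictionaries, adding an endpoint means adding metadata, not touching the discovery logic.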
With the REST API query tool, I figure out the API calls, credentials, and keys needed to authenticate. The tap_config.json file stores the credentials (OAuth client/secret, user/password, etc.). The client.py code has a class that authenticates with those credentials and provides GET and POST request functions. It also includes API call metrics and a rate limit and backoff “decorator” to handle 429 and 500 errors with retries. To instantiate and run the client, I create the master controller, __init__.py. It’s a pretty simple management module that calls the other files (client.py, discover.py, and sync.py) to do the real work.
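The backoff decorator is the piece I found least obvious at first, so here is a rough standard-library sketch of the pattern. Everything named here is an assumption for illustration: the exception classes, the `retry_with_backoff` decorator, the `access_token` config key, and the injected `send` callable that stands in for the real HTTP call. Taps often use the `requests` and `backoff` packages for this instead.

```python
import functools
import time


class RateLimitError(Exception):
    """Raised when the API answers 429 (too many requests)."""


class ServerError(Exception):
    """Raised when the API answers with a 5xx status."""


def retry_with_backoff(exceptions, max_tries=5, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Retry the wrapped function, waiting longer after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of retries, surface the error
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator


def raise_for_status(status_code):
    """Map retryable HTTP status codes to exceptions."""
    if status_code == 429:
        raise RateLimitError("rate limit exceeded")
    if status_code >= 500:
        raise ServerError("server error {}".format(status_code))


class Client:
    """Hypothetical API client; `send` stands in for the real HTTP call."""

    def __init__(self, config, send):
        self.token = config["access_token"]  # hypothetical config key
        self.send = send

    @retry_with_backoff((RateLimitError, ServerError))
    def get(self, path, params=None):
        status, body = self.send("GET", path, params, self.token)
        raise_for_status(status)
        return body
```

Injecting `send` and `sleep` keeps the sketch testable without a network connection or real waiting.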
Sync.py controls the data replication. In examining the API, I determine how the tap will loop through the endpoint streams, and for each stream, how the tap will loop through the API calls and data results. Depending on the requirements, this may include date windowing, pagination and offsets, query filtering and sorting, parent-child subqueries, and other methods to loop or chunk through the data. I strive to make a general metadata-driven process based on the streams.py metadata. Sync.py has a master sync function (to loop through all of the endpoint streams) and functions to sync an endpoint (to date-window and/or page through batches of results), transform records (calling transform.py to transform each record or batch of records), and process records (to send a batch of results to the target). It also has functions for getting and setting bookmarks and updating the currently syncing stream.
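Here is a stripped-down sketch of that sync structure: a master loop over streams, an endpoint sync that pages by offset, and bookmark helpers. It assumes offset pagination, an ISO 8601 timestamp bookmark, and a hypothetical `fetch_page` callable in place of the real client; the transform step is left out, and messages are written as plain JSON lines instead of through singer-python.

```python
import json
import sys


def write_message(message, out):
    """Write one Singer message as a JSON line."""
    out.write(json.dumps(message) + "\n")


def get_bookmark(state, stream, default):
    return state.get("bookmarks", {}).get(stream, default)


def set_bookmark(state, stream, value):
    state.setdefault("bookmarks", {})[stream] = value


def sync_endpoint(fetch_page, stream, meta, state, start_date, out, page_size):
    """Page through one endpoint and emit records at or after the bookmark."""
    bookmark_field = meta["bookmark_field"]
    last = get_bookmark(state, stream, start_date)
    max_seen = last
    write_message({"type": "SCHEMA", "stream": stream,
                   "schema": meta["schema"],
                   "key_properties": meta["key_properties"]}, out)
    offset = 0
    while True:
        records = fetch_page(stream, offset, page_size)
        for record in records:
            value = record[bookmark_field]
            # ISO 8601 strings in one format compare correctly as text.
            # Using >= re-sends records equal to the bookmark so none are missed.
            if value >= last:
                write_message({"type": "RECORD", "stream": stream,
                               "record": record}, out)
                max_seen = max(max_seen, value)
        if len(records) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += page_size
    set_bookmark(state, stream, max_seen)
    write_message({"type": "STATE", "value": state}, out)


def sync(fetch_page, streams, state, start_date, out=sys.stdout, page_size=100):
    """Master sync: loop through every stream and track which one is running."""
    for stream, meta in streams.items():
        state["currently_syncing"] = stream
        write_message({"type": "STATE", "value": state}, out)
        sync_endpoint(fetch_page, stream, meta, state, start_date, out, page_size)
    state["currently_syncing"] = None
    write_message({"type": "STATE", "value": state}, out)
    return state
```

Saving the state that comes out of one run and passing it into the next is what turns the initial load into an incremental one.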
I initially like to build each of these components in a virtual environment (using pyenv, virtualenv, and virtualenvwrapper) and a Jupyter Notebook, where I create and test each of the functions. Then I move the functions into sync.py and transform.py and begin my integration tests from the command line, using the Singer Discover utility (to test out discovery mode and create my catalog.json) and the Singer Tools singer-check-tap utility as my first target. Once I resolve any bugs, I set up my next target, target-stitch, which uses the Stitch Import API connector. I store my Stitch organization ID and import token in my target_config.json. I first try piping data to target-stitch in dry-run mode and then in normal sync mode. With normal sync mode, I test the initial load (based on the config.json start date) and the ongoing incremental load (using the state.json). With discovery mode and the catalog.json, I test syncing all of the endpoints and fields and then only a subset of the endpoints and fields.
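When testing a subset of endpoints and fields, it helps to see how selection is read back out of the catalog. The sketch below assumes the usual Singer convention of a `selected` flag in each metadata entry, with `automatic` fields always included; the helper names are my own, and singer-python provides its own utilities for this.

```python
def metadata_map(entry):
    """Index a catalog entry's metadata list by breadcrumb."""
    return {tuple(m["breadcrumb"]): m["metadata"]
            for m in entry.get("metadata", [])}


def selected_streams(catalog):
    """Return the tap_stream_id of every stream marked as selected."""
    names = []
    for entry in catalog["streams"]:
        stream_md = metadata_map(entry).get((), {})
        if stream_md.get("selected") is True:
            names.append(entry["tap_stream_id"])
    return names


def selected_fields(entry):
    """Return the fields to replicate for one selected stream."""
    mdata = metadata_map(entry)
    fields = []
    for name in entry["schema"]["properties"]:
        field_md = mdata.get(("properties", name), {})
        inclusion = field_md.get("inclusion")
        if inclusion == "unsupported":
            continue  # never replicated, even if selected
        if inclusion == "automatic" or field_md.get("selected") is True:
            fields.append(name)
    return fields
```

Editing the `selected` flags in catalog.json and re-running the tap is then enough to exercise both the full and the partial sync.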
At the root level of the tap, I make sure to include a few files for packaging and versioning the tap (setup.py, MANIFEST.in, LICENSE, CHANGELOG.md, .gitignore) and a README.md that documents the streams, endpoints, and commands needed to run the tap. Finally, before committing the code to the tap’s Git repository, I check the code quality with Pylint and resolve any issues.
That whole process for developing a tap matches the processes and files at the bottom of the infographic from left to right. Singer provides a template for working through the otherwise daunting task of building an integration piece by piece in a structured, manageable, modular, and ordered process — to allow you to eat the octopus one bite at a time.
And now a word from our sponsor:
The idea behind the Singer open source project is to allow anyone to build a reusable data integration and run it either on their own hardware or on Stitch – taking advantage of our monitoring, alerting, credential management, and autoscaling infrastructure.
If you’ve thought about writing your own taps and this post has you excited to get started, you’ll find documentation in the getting started repository on the Singer GitHub project. Once you’ve put a tap together, learn how to contribute it to the Singer repository and share it with the community.
By: Jeff Huth – Bytecode IO’s Data Engineer, Architect and Analyst