Click to learn more about author Greg Nist.
I was frustrated.
I’d spent two-and-a-half years as a consultant developing software for a state government agency, and I felt like we were running on a treadmill. Our ability to help the agency run its operations more effectively was constrained, not by the usual suspects of excessive bureaucracy, leadership issues, or a talent deficiency, but by the very foundation that our applications were built upon: the database.
A Business Problem Caused by Technology
The agency was responsible for processing all forms of state taxes (corporate, sales, individual, etc.), and it needed a service that would enable a call center operator to quickly locate a return when a constituent requested information or support.
At this time, only about half of state individuals and corporations filed electronic returns. This meant a lot of paper was being processed. Returns were scanned, form types identified, OCR and ICR completed, and manual correction of data performed on low-confidence results. The result was that bad data landed in the database, hindering the ability of the agency to deliver value. The service built to meet the agency’s needs provided a basic SQL interface to a few key fields in the database such as name, tax ID, and street address. Unfortunately, it wasn’t helping the agency. If the call center operator couldn’t find the return based on a few simple checks, the system kicked off an inefficient manual process to find the actual paper return.
What the agency really needed was the power to unlock all the data that it had. It needed to find returns based on any data on the return, perform a wildcard search, and apply filters to the results to narrow in on the right match. Plus, the service required semantic intelligence. If a caller identified himself as “Johnny,” the service should understand that “John,” “Jon,” and “Jonathan” might also be relevant matches.
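To make that wish list concrete, here is a minimal sketch in plain Python of the three behaviors: name-variant expansion, a wildcard match over any value on the return, and exact-match filters. It is not how the agency's service or any particular database implements search; the `NAME_VARIANTS` table, the `search` function, and the sample returns are all hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical nickname table; a real system would rely on a curated thesaurus.
NAME_VARIANTS = {
    "johnny": {"john", "jon", "jonathan"},
}


def expand_name(name):
    """Return the lowercased name plus any known variants."""
    key = name.lower()
    return {key} | NAME_VARIANTS.get(key, set())


def search(returns, first_name=None, pattern=None, **filters):
    """Match on expanded first name, a wildcard over all values, and exact filters."""
    names = expand_name(first_name) if first_name else None
    hits = []
    for doc in returns:
        # Semantic step: accept any known variant of the caller's name.
        if names and doc.get("first_name", "").lower() not in names:
            continue
        # Wildcard step: the pattern may match any value on the return.
        if pattern and not any(
            fnmatchcase(str(value).lower(), pattern.lower())
            for value in doc.values()
        ):
            continue
        # Filter step: every supplied filter must match exactly.
        if any(doc.get(key) != value for key, value in filters.items()):
            continue
        hits.append(doc)
    return hits


# Made-up sample data.
RETURNS = [
    {"id": 1, "first_name": "Jonathan", "last_name": "Smith",
     "city": "Springfield", "tax_year": 2011},
    {"id": 2, "first_name": "Jon", "last_name": "Smythe",
     "city": "Shelbyville", "tax_year": 2011},
    {"id": 3, "first_name": "Mary", "last_name": "Smith",
     "city": "Springfield", "tax_year": 2011},
]

matches = search(RETURNS, first_name="Johnny", pattern="sm*th*", tax_year=2011)
print([doc["id"] for doc in matches])  # [1, 2]
```

A linear scan like this would not scale to a real tax database. The point is only that all three kinds of criteria apply to the same documents in a single pass, which is what the fixed-field SQL interface could not offer.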
And lastly, the agency needed the ability to query across different shapes of the same business object or entity. In this project, the main entity was the tax return. Tax return forms changed year to year. We didn’t want to have to spend costly cycles redesigning our database and reprogramming our services every year.
But we did.
The development team I was on spent over half of our time working to change, clean up, and adapt the brittle foundation of our software, which was an Oracle RDBMS running in a local data center.
RDBMS technology has brought amazing benefits to businesses and is a proven tool for certain use cases and data. But data complexity and volume have increased, and so have user expectations. Personalized and intuitive user experiences (like the service we needed to provide) and the ability to meet complex regulatory requirements that span an entire organization are of critical importance. Relational-only solutions can be limiting because the time and cost required to implement these types of complex projects on an RDBMS are high.
This really hit home on my project. Requirement changes, the introduction of a new data source, or simply adding annual updates to tax forms sent us back through the costly cycle of reworking the schema and rewriting ETL. We wanted to embrace agile development, but our database prevented us from truly realizing the benefits.
I began a journey to explore new technologies and approaches to enable the team to shift development time toward what really mattered most to the agency: providing innovative data services to power the apps that delivered value.
A Technology Problem Solved with Technology
As with most journeys, you don’t reach your destination overnight. For me, making the mental shift from RDBMS to a next-generation database had a learning curve. Here are the lessons I learned that will help you make changes, too.
Lesson One: Build on a sturdy but flexible foundation. When making the choice to include a new database in your enterprise architecture, picking the right one is critical. Almost every vendor says it is multi-model now, but when I was starting out nine years ago, this wasn’t the case. There was a broad NoSQL landscape and no clear winners. There were column stores, key-value, graph, and document databases. Each had its pros and cons, and no one had yet figured out the right uses.
I wanted something to manage many document types such as JSON, XML, full-text, and binary while also being able to use RDF triples. I found that a NoSQL database platform provided the integrated architecture that I needed.
Think deeply about what your project needs are and what platform works best. For my project, we were building applications to process tax returns, and we were literally processing documents. Storing the return data as a document was a natural fit. Then you add in the ability to have zero to many schemas and to define those iteratively. This was valuable because it eliminated the pain of spending massive amounts of project time adapting the “one schema to rule them all” to handle the annual changes to the tax code and forms. Couple that with built-in search and semantic capability, plus tooling that enabled non-technical domain experts to participate in the process, and I saw that I needed the flexibility of a non-relational solution.
Lesson Two: Assess all requirements. We were enabling an enterprise. That word “enterprise” is important here. In my project, we had to manage personally identifiable information (PII) including financial details for individuals and corporations. To even be considered for a mission-critical project like this, a database must meet certain requirements. It needs to be scalable with flexible deployment options in the cloud (any cloud), have a robust security model to protect certain PII in order to meet compliance requirements, provide the ability to perform ACID transactions to ensure data integrity, and have built-in capabilities for high availability and disaster recovery.
Depending on your use case, a multi-model approach that is cloud neutral is likely to maximize your power and flexibility.
Lesson Three: Start with a business objective. The best database projects begin with a focus on what matters most to the customer. Start your first sprint by picking a business goal that delivers significant value and focus on the data services that will deliver that value.
Then, get the data you need to deliver that service and nothing more. You know you’re going to need more data over time and that’s OK. You don’t need to get everything up front. Your foundation is flexible.
And then bring that data (regardless of source and shape) into your database using orchestration tools such as Apache NiFi, MuleSoft, or other favorites within your enterprise. And when you’re working in an industry with strict compliance requirements (who isn’t?), you’ll want to make sure that your database will track the provenance and lineage of the data for you automatically as you bring the data in.
Ideally, the solution you are using will no longer need to begin with the long drawn-out process of trying to identify every possible requirement (impossible), every needed data source (likely to change), and then building the perfect schema (no such thing). All this before you can even begin real development work.
Getting the data in is fast and easy when the database doesn’t require a schema and supports multiple models. Don’t get me wrong: schemas are necessary and good. But the relational model that requires you to have exactly one at all times is limiting and rigid.
In short, bringing agile development to the database tier was a must, so that the database could be an enabler for effective agile development throughout the IT organization.
But just because it’s easy to get data into the database, it doesn’t mean that you’ll leave it as-is forever.
Lesson Four: Model when you need it, where you need it. Even with a flexible, schema-agnostic database as your foundation, data modeling is still important work. But there is a key difference between the non-relational NoSQL approach and the RDBMS. With a NoSQL approach, such as the one that solutions like MarkLogic provide, you no longer have to think of data modeling through the lens of tables and rows and columns and relationships and third normal form.
Instead, you’ll again focus on the services that need to be delivered and the data they will consume. Then based on those service requirements, you add model to the data that you’ve already got loaded. You add just the model that you need for the sprint that you’re working on by curating the data right inside the database.
Data curation enables you to quickly get the data into a shape that supports the service that is being built. With the right solution, that means using no-code tooling to configure your entities, map different shapes and sources of data to your entities, validate or transform data properties, denormalize (possibly), and master your data. The curation process should be fast, flexible, iterative, and ongoing as you progress through additional sprints and deliver on customer needs.
Regardless of what you do to the data during the curation process, you are implementing a fast way to store the data, metadata, and provenance all in the same place, queryable together, and independently secured. You get the ability to work with multiple models, and when needed you can leverage RDF triples to express relationships between entities as well as add additional context and meaning to your data.
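One way to picture curation is as wrapping each raw return in an envelope that holds a canonical entity, its provenance, and the untouched original side by side. The sketch below assumes two made-up form layouts for two tax years. The field names, the `MAPPINGS` table, and the `curate` function are hypothetical illustrations, not the API of any specific product.

```python
from datetime import datetime, timezone

# Hypothetical per-year mappings: canonical property -> field name on that year's form.
MAPPINGS = {
    2010: {"taxpayer_id": "ssn", "name": "taxpayer_name", "total_income": "line_22"},
    2011: {"taxpayer_id": "tin", "name": "full_name", "total_income": "total_income_amt"},
}


def curate(raw, tax_year, source):
    """Wrap a raw return in an envelope: canonical entity, provenance, original."""
    mapping = MAPPINGS[tax_year]
    entity = {canonical: raw[field] for canonical, field in mapping.items()}
    entity["total_income"] = float(entity["total_income"])  # normalize the type
    entity["tax_year"] = tax_year
    return {
        "entity": entity,
        "provenance": {
            "source": source,
            "mapping_year": tax_year,
            "curated_at": datetime.now(timezone.utc).isoformat(),
        },
        "original": raw,  # kept as loaded, so nothing is lost
    }


# Two differently shaped returns for the same (made-up) taxpayer.
raw_2010 = {"ssn": "000-00-0001", "taxpayer_name": "Pat Doe", "line_22": "51000"}
raw_2011 = {"tin": "000-00-0001", "full_name": "Pat Doe", "total_income_amt": 53250.5}

docs = [
    curate(raw_2010, 2010, "paper-scan"),
    curate(raw_2011, 2011, "e-file"),
]

# One query now works across both shapes through the canonical entity.
incomes = {
    doc["entity"]["tax_year"]: doc["entity"]["total_income"]
    for doc in docs
    if doc["entity"]["taxpayer_id"] == "000-00-0001"
}
print(incomes)  # {2010: 51000.0, 2011: 53250.5}
```

When next year's form changes, the only thing that has to change in this sketch is a new entry in the mapping table. The documents already loaded stay as they are.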
As you add model to your data, you can also continue to organize the data over time. Continuing to evolve your data over time is a key enabler to agile development. For example, you can add new properties and indexes, or group data into collections, both of which can then be used to power the needed query interfaces against the data.
Also consider features that enable tighter security permissions where needed, controlling who can access data at the document level or even go one level deeper and control who can access data within a document down to the JSON property or XML element and attribute level.
For my project, this capability was critical. For example, access to PII such as a taxpayer’s Social Security number, home address, and income would not be appropriate for most users, and granting it would expose the agency to unnecessary risk.
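As a rough illustration of property-level protection, the sketch below removes protected properties from a document unless the reader holds a permitted role. In a real deployment the database enforces this itself; the `PROTECTED` rule table, the role names, and the `redact` function here are hypothetical stand-ins.

```python
import copy

# Hypothetical rules: property path -> roles allowed to read that property.
PROTECTED = {
    ("taxpayer", "ssn"): {"auditor"},
    ("taxpayer", "home_address"): {"auditor", "call_center_supervisor"},
    ("income",): {"auditor"},
}


def redact(doc, roles):
    """Return a copy of doc without the protected properties the roles cannot read."""
    view = copy.deepcopy(doc)  # never mutate the stored document
    for path, allowed in PROTECTED.items():
        if allowed & set(roles):
            continue  # at least one role grants access to this property
        parent = view
        for key in path[:-1]:
            parent = parent.get(key, {})
        parent.pop(path[-1], None)
    return view


# Made-up sample document.
tax_return = {
    "return_id": "R-1",
    "taxpayer": {
        "name": "Pat Doe",
        "ssn": "000-00-0001",
        "home_address": "1 Main St",
    },
    "income": 51000,
}

operator_view = redact(tax_return, ["call_center_operator"])
print(operator_view)  # {'return_id': 'R-1', 'taxpayer': {'name': 'Pat Doe'}}
```

The operator can still locate the return and confirm the taxpayer's name, while the Social Security number, home address, and income never leave the data tier for that role.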
Change can be unnerving but is often necessary in order to move forward.
Once I made the mental shift from using SQL to using search, combined that with the ability to load data quickly and easily without a lot of upfront work, and saw both the value of iterative curation inside the database and how to do it, the lightbulb finally went on in my (formerly) RDBMS-dominated brain.
And the answer to the root cause of my frustration was clear: find a unique and more effective approach to data integration.
By bringing agile to the database layer, developer focus throughout the entire enterprise can finally shift towards what matters most to the customer: delivering to them the apps they need to run the organization and doing so with less time and cost than with alternative technology choices.