By Ian Rowlands.
It’s been a while since I posted. We’ve had a busy summer working on the launch of our latest product. However, I’ve also taken some time to experiment with some of the tools that Citizen Data Scientists (CDS, for short) might be working with, to get a better feel for what the role actually involves.
I’ve played with a pair of programming languages (Python and R), a Data Wrangling/Preparation tool (Trifacta) and a couple of analytics platforms (Qlik and Tableau). Just to be clear, I’m not plugging or criticizing any of these tools. My experience says as much about me and my background as it does about the tools.
Since I raised it, here’s my background. In the early 1970s, I learned to program at college: BASIC, COBOL, Fortran, Pascal, a little APL and a little Ada. My first “real” job was with a heavy chemicals business. I learned to program in IBM Assembler and PL/I, as well as writing some “real” COBOL and SAS. I worked with multiple DBMSs and TP systems. I took what I knew into the vendor world, and until the mid-90s wrote quite a bit of code and chased bugs in some massive applications.
The CDS tools were a LOT harder to get going with than I had expected. If you don’t believe me or think it’s because I’m getting old, try it for yourself. Here are some conclusions I drew for anybody wanting to stage a support environment for the CDS:
- Not Everybody Can Be a CDS. The role demands experience and a detailed understanding of aspects of the business. A good CDS knows what questions to ask, and wants to ask and answer them. My experimentation was only useful because I was pursuing a particular objective. A good CDS is willing to keep fighting for answers and has dealt with the frustration of data complexity and imperfection. There has to be a commitment to getting usable results and making decisions based on data. The ability to learn the tools is important – but that technical facility has no value without the business-like approach.
- Don’t Let Your CDSs Wildcat. There has to be a framework, to define both the areas to be explored and the portfolio of tools to be used. I explain some of the reasons below.
- Think About How You Want to Split Work Between CDSs and IT Specialists. CDSs are unlikely to survive in isolation. They will need support. I hope it’s self-evident that they should focus on doing data work – not on making the exercise possible in the first place. In getting started, I found that some of the tools I was trying to use needed significant setup work before they were useful.
- Do a Lot of Homework. There are many tools available, and their capabilities overlap. I would be very grateful if someone would produce a usable CDS tool landscape. When I thought about doing my little project, I looked around but didn’t find anything genuinely helpful. The major building blocks seem to be data ingestion, understanding, preparation, analysis, documentation, publication, collaboration, and governance.
- For Every Tool You Decide to Make Available, Provide a Supporting Environment. Identify the training, reading material, and useful websites. Have an “expert” on hand to assist with knotty problems. I spent a fair bit of time deciding which of the vast amount of available material was helping me and which was just a fascinating diversion.
- Make Every Effort to Be Sure That CDSs Document Their Work. One thing that has not changed is that metadata is vital! Even after a couple of weeks, I found myself having to go back and re-understand data I had created but left, for a short time, untouched.
- Think About Governance. You need to decide what data goes into the Data Lake, and who can access it, but there might be a much more important question: under what circumstances do results, or solutions based on those results, get used to drive business decisions?
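To make the building blocks above a little more concrete, here is a minimal sketch of the first few steps a CDS might walk through – ingestion, understanding, preparation, and documentation – using Python and pandas. The column names and values are entirely hypothetical, standing in for a real business extract; it illustrates the shape of the workflow, not any particular tool’s approach.

```python
import io
import pandas as pd

# Hypothetical raw extract (a stand-in for a real business file).
raw = io.StringIO(
    "order_id,region,amount\n"
    "1,East,100.0\n"
    "2,West,\n"
    "3,East,250.5\n"
)

# Ingestion: load the data.
df = pd.read_csv(raw)

# Understanding: profile what we actually have before trusting it.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
})
print(profile)

# Preparation: handle the missing amount explicitly, not silently.
df["amount"] = df["amount"].fillna(0.0)

# Documentation: record what each column means and what was done to
# it, so the dataset is still intelligible weeks later.
data_dictionary = {
    "order_id": "Unique order identifier",
    "region": "Sales region name",
    "amount": "Order value; missing values imputed as 0.0",
}
print(df)
```

Even a toy example like this shows why the documentation step matters: the data dictionary is the only record that the missing values were imputed rather than observed.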
It’s obviously not a complete list, but it gets towards the bigger point that bore more and more heavily on me as I pursued my little project: we have been here before. The emergence of Big Data technologies might be the third or fourth time that new tools have appeared to let business users work more closely with data on their own initiative, with a much-lessened dependence on data specialists.
There were specialized time-sharing systems, then spreadsheets, SQL-based tools, and BI technologies. As each new technology emerged, the same problems emerged with it. Costs got a little out of control; data got to places it shouldn’t have; “experts” spent too much time helping others less able than themselves; and a lot of what was done was left undocumented, leaving no audit trail to explain business decisions. Each iteration has been more expensive. We stand at the beginning of an exciting new data journey. Now is the time to make sure that we learn the lessons of data history!