Data wrangling – just a phase or here to stay?

There have been myriad articles, blogs and posts about Data Wrangling in the last 12-24 months. I don’t intend to rehash those here. If you want a summary of Data Wrangling, take a look at this article by Lukas Biewald: http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html

We have worked with many clients where Data Wrangling was the largest part of the professional services engagement. Forget for a moment the desired outcomes of better, faster, new or amazing insights: that part only comes after we get the data into the cluster. Getting the data into a usable format for the Hadoop cluster, and then ensuring it stays that way, is usually a major piece of effort. Others have described it as ‘janitorial’ work. That label hides the high level of complexity in choosing how to map the raw data into formats that the client will want to use and that are suitable for the Hadoop cluster to ingest.
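To give a feel for why that mapping is more than janitorial, here is a minimal sketch in Python using only the standard library. Everything in it is made up for illustration: the two sources, their field names, the target schema (`device_id`, `timestamp`, `temp_c`) and the assumption that the CSV source reports times in UTC. Real client data will differ, and each of those decisions has to be made with the client.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical raw inputs: the same kind of reading from two made-up sources.
CSV_SOURCE = "device,reading_time,temp_f\nsensor-1,03/15/2015 14:30,68.0\n"
JSON_SOURCE = (
    '[{"id": "sensor-2", "ts": 1426429800, '
    '"temperature": {"celsius": 21.5}}]'
)


def from_csv(text):
    """Map CSV rows (US-style dates, Fahrenheit) onto the target schema."""
    for row in csv.DictReader(io.StringIO(text)):
        when = datetime.strptime(row["reading_time"], "%m/%d/%Y %H:%M")
        yield {
            "device_id": row["device"],
            # Assumption: the CSV source reports times in UTC.
            "timestamp": when.replace(tzinfo=timezone.utc).isoformat(),
            # Convert Fahrenheit to Celsius.
            "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 2),
        }


def from_json(text):
    """Map nested JSON records (epoch seconds, Celsius) onto the same schema."""
    for rec in json.loads(text):
        when = datetime.fromtimestamp(rec["ts"], tz=timezone.utc)
        yield {
            "device_id": rec["id"],
            "timestamp": when.isoformat(),
            "temp_c": float(rec["temperature"]["celsius"]),
        }


# One list of uniform records, ready to be written out for ingestion.
records = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
```

Even in this toy case there are three decisions per source (naming, time representation, units), and the time zone of the CSV data had to be guessed. Multiply that by dozens of sources and the effort described above is easy to believe.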

So – given that Data Wrangling is now a well-known concept – is that skill going to be required for some time, or will tools emerge (tools again…) that partly or fully automate the process?

There are commercial products out there such as Trifacta and ClearStory Data, and multiple open source tools such as Tabula, DataWrangler (confused yet?) and various R packages. You can even use Python (with Pandas).

Many of these cross over into Dashboarding and Visualization – even Datameer could be considered a Data Wrangling tool in some ways.

The question is – will Data Wrangling as a required skill set, and, more importantly, as a major element of Big Data projects, disappear under an onslaught of products that can do it quicker and more cost effectively?

My view is – in your dreams. The old standby of the three V’s shows why. Volumes are increasing, velocity is increasing and, most importantly for Data Wrangling, variety is going to continue to accelerate. Think about IoT, real-time streaming, transactional data, unstructured data from legacy systems and new use cases emerging every day. For sure, the standard Data Wrangling tasks involving SQL, CSV, XML, JSON etc. will be handled by products. But with the ever-growing number of data sources, and as Big Data continues to redefine enterprise computing, I don’t think Data Wrangling is going to disappear for a while yet. Customers should prepare to keep spending a large share of their Big Data budgets on simply getting data ready to be ingested.

Want to know more?

For a good list of free Data Wrangling tools, visit the Varonis blog: http://blog.varonis.com/free-data-wrangling-tools/

Ben Lorica gave a good summary in January 2015: http://radar.oreilly.com/2015/01/lessons-from-next-generation-data-wrangling-tools.html

What do you think – is Data Wrangling going to disappear and be replaced by products like Trifacta?

If you have additional questions, get in touch with us!
