The Data Wrangler's Handbook: Simple Tools for Powerful Results

ALA Member: $61.19
Price: $67.99
Item Number: 978-0-8389-1909-5
Published: 2019
Publisher: ALA Neal-Schuman
Pages: 176
Width: 6"
Height: 9"
Format: Softcover
AP Categories: A, C, I

Data manipulation and analysis are far easier than you might imagine—in fact, using tools that come standard with your desktop computer, you can learn how to extract, manipulate, and analyze data (and metadata) of any size and complexity. In this handbook, data wizard Banerjee will familiarize you with easily digestible but powerful concepts that will enable you to feel confident working with data. With his expert guidance, you’ll learn how to

  • use a single-word command to sort files of any size by any criteria, identify duplicates, and perform numerous other common library tasks;
  • understand data formats, delimited text and CSV files, XML, JSON, scripting, and other key components of data;
  • undertake more sophisticated tasks such as comparing files, converting data from one format to another, reformatting values, combining data from multiple files, and communicating with APIs (Application Programming Interfaces);
  • save time and stress with simple techniques for transforming text and recognizing symbols that perform important tasks, supported by a Regular Expression cheat sheet, a glossary, and other tools.

Library technologists and those involved in maintaining and analyzing data and metadata will find Banerjee’s resource essential.

List of Figures and Tables
Acknowledgments
Introduction

Chapter 1    Getting Started with the Command Line
Finding the Command Line

  • Mac
  • Windows

Meet the Command Line

Chapter 2    Command Line Concepts
Two Powerful Symbols

  • Direct Output to a File (Greater Than Symbol)
  • Direct Output to Another Program (Pipe Symbol)

Command Substitution
Regular Expressions—The Swiss Army Knife for Data

  • Literal Characters
  • Special Characters
  • Wildcard Characters
  • Logical Operators
  • Grouping

Scripting

Chapter 3    Understanding Formats, by David Forero

Chapter 4    Simplify Complicated Problems
Isolating Specific Data Elements
Converting Data into Formats That Are Easier to Work With

Chapter 5    Delimited Text
CSV (Comma Separated Values)

  • Commas and Quotation Marks in CSV Files
  • Multiline Fields in CSV Files

Multivalued Fields in Delimited Files

Chapter 6    XML
So What Is XML, Really?
What Makes XML So Useful?
Why Is XML So Easy?

  • DOM (Document Object Model)
  • XPath
  • XSLT (eXtensible Stylesheet Language Transformations)

Working with Large XML Files
Working with Complex XML Files
XmlStarlet
Installing XmlStarlet
Converting XML Documents

Chapter 7    JSON (JavaScript Object Notation)

Chapter 8    Scripting
Variables
Arguments
Conditional Execution
Loops

Chapter 9    Solving Common Problems
Viewing Large Files
Locating Files That Contain Particular Data
Finding Files with Specific Characteristics
Working with Internal Metadata
Working with APIs
Combining Data from Different Sources
Other Tasks

Chapter 10    Conclusions
One-Line Wonders
Locating, Viewing, and Performing Basic File Operations

  • Combine Information from Multiple Files into a Single File
  • Combine Three Files, Each Consisting of a Single Column into a Three-Column Table
  • Extract 1,000 Random Lines or Records from a File
  • Find Files with Specific Characteristics
  • Find All Lines in All Files in the Current Directory as Well as All Subdirectories Containing a Regular Expression
  • Identify All Files in Current Directories and Subdirectories That Contain a Value
  • List All Files in Current Directory and Subdirectories over a 100 MB in Order of Decreasing Size
  • List the Names, Pixel Dimensions, and File Sizes of All Files in the Current Directory and Subdirectories in Tab Delimited Format
  • Print Line Number of File That Match Occurred On
  • Split Large Files into Smaller Chunks with Each File Breaking on a Line
  • View 200 Characters Starting at Position 38562 in a File
  • View Lines 4369–4374 of a File

Retrieving and Sending Information over a Network

  • Retrieve a Document from the Web and Send It to a File
  • Send an XML Document to an API Requiring HTTP Authentication

Sorting, Counting, Deduplication, and File Comparison

  • Combine Two Files on a Common Field
  • Compare Two Sorted Files
  • Count Occurrences for Each Entry in a File, Listed in Order of Decreasing Frequency
  • Count Records Containing an Expression
  • Count Words, Lines, and Characters in Files
  • Identify All Unique Entries and Supply a Count of How Many Times Each Occurs
  • Sort a File and Remove Duplicates, Show Only Duplicated Entries, or Show Only Unique Entries

Useful Scripting Operations

  • Capture Parameters Passed to a Script
  • Divide a Line into Parameters
  • Iterate through Every Item in Parameter List
  • Perform a Loop
  • Perform an Operation Conditionally
  • Run a Script on Every Line of a File
  • Send the Output of a Command as Arguments to Another Command
  • Send the Output of a Command to Another Command
  • Send the Output of a Command to a File
  • Store the Output of a Command in a Variable
  • Use Foreign Character Sets in a Terminal Window

Transforming Text

  • Convert File of Dates to YYYY-MM-DD Format
  • Convert to Title Case
  • Convert to Upper Case
  • Convert List of Names from Direct Order to Indirect Order
  • Extract and Manipulate All Lines in a File That Match a Complex Pattern
  • Extract and Manipulate All Entries in All Files in an Entire Directory Hierarchy That Match a Pattern
  • Remove Lines from a File That Match a Pattern
  • Remove Carriage Return Characters Inserted by Windows Programs from a File
  • Remove Newline Characters from a File
  • Replace Newlines in a File with Character 7 (Bell)
  • Replace Search_Expr with Replace_Expr Only on Lines That Contain Condition_Expr
  • Replace Search_Expr with Replace_Expr Except on Lines That Contain Condition_Expr
  • Replace Smart Quotes with Straight Quotes

Working with Delimited Files

  • Convert Comma Delimited File Where Some Values Are Quoted and Some Values Are Not to Tab Delimited
  • Convert Multiline Records to Table
  • Extract Individual Fields from Files
  • Find the Most Common Values in the Second Field of a File
  • Find All Lines in Tab Delimited File Not Containing Six Fields
  • Fix Delimited File That Contains Line Breaks in Fields
  • Remove Trailing and Leading Whitespace from Tab Delimited Data Fields
  • Reorder Fields in a Tab Delimited File

Working with JSON and XML

  • Add an Attribute to an XML Document
  • Add an Element to an XML Document
  • Apply XSLT Stylesheet to XML Document
  • Convert JSON to Tab Delimited Format
  • Delete Elements, Attributes, or Values Based on XPath Expressions
  • Display Structure of XML File
  • Pretty Print JSON Document
  • Pretty Print XML Document

Glossary
Symbols That Perform Important Tasks
Useful Commands
Regular Expression Cheat Sheet
Index

Kyle Banerjee

Kyle Banerjee has wrangled data for diverse purposes in academic, government, and nonprofit environments since 1996. He firmly believes that understanding people is the key to building services of the future from the systems and data of the past, and his professional interests revolve around understanding workflows and identifying opportunities in data previously thought inconsistent or incomplete. He has published several books and numerous articles on a variety of topics related to applying technology in library settings.

"I highly recommend The Data Wrangler’s Handbook for anyone who now manipulates data or may need to do so in the future. In Banerjee’s words, 'If these tasks [that require data wrangling] sound intimidating, this book is for you. You will understand everything in this book even if you have no special technical knowledge or programming experience.'"
— Technicalities

"Written in a clear and accessible manner and filled with helpful examples ... It will be especially useful to persons with little technical experience but an interest in learning the basics of data wrangling. Highly recommended."
— Catholic Library World