Synthbot

Problems that data cleaning tools face

Jul 18th, 2021
General situation:
- The people willing to spend time cleaning data are usually not programmers.
- Cleaning data often means collecting more data or building more datasets that are not directly related to the original task.
- Most of the good data cleaning tools are made for content creators, not developers.
- You don't know ahead of time what data you'll need to retain and check.

Problems that come up:
- The labels need to be human-writable and compatible with whatever input the external software accepts. That means you need modules that convert between different formats, which is a PITA for people to deal with. (Think: "Run this tool to convert to this format, use this external tool to clean X, run this tool to convert the outputs to this other format, use this other external tool to clean Y...")
- Your tool is almost certainly not going to be able to collect the additional data you need for cleaning. You'll need to direct the user to collect that data, and you don't know what formats it's going to come in. For example, cleaning audio data for a show may require collecting subtitles or alternate audio versions in other languages. Fixing this requires a feedback loop between development and the user, which is a problem since the user probably isn't a developer.
- Data is error-prone in semantic ways (e.g., a typo in the speaker name, or a homophone used by mistake). Some of this can be fixed just by creating dictionaries to keep track of what's valid and what's not. Some of it requires heuristics or a grammar checker (e.g., for a lowercase "L" typed in place of a capital "I", which actually happened several times). Some of it requires external tools (e.g., a forced aligner to check whether transcripts are complete), which require yet more data and use yet more formats. Some of these tools can't be modified easily, so when a tool gets something wrong, you either need to track it in some standard and flexible way, or you need to find some other tool that can accept feedback and not repeatedly throw the same errors.
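
A rough sketch of the dictionary-plus-heuristic idea from the last bullet, including a way to record false alarms so they aren't reported again. Everything here is made up for illustration: the speaker names, the issue codes, the line IDs, and the check_line function are assumptions, not part of any existing tool.

```python
import re

# Hypothetical dictionary of valid speaker names. In practice the people
# doing the cleaning would build this up over time.
KNOWN_SPEAKERS = {"Narrator", "Alice", "Bob"}

# Heuristic: a standalone lowercase "l" is most likely a mistyped capital "I".
LONE_L = re.compile(r"\bl\b")


def check_line(line_id, speaker, text, suppressed=frozenset()):
    """Return (code, detail) issues for one transcript line.

    `suppressed` holds (line_id, code, detail) triples that a human has
    already reviewed and marked as fine, so they are not reported again.
    """
    issues = []
    if speaker not in KNOWN_SPEAKERS:
        issues.append(("unknown-speaker", speaker))
    for match in LONE_L.finditer(text):
        issues.append(("lone-l", match.start()))
    return [
        (code, detail)
        for code, detail in issues
        if (line_id, code, detail) not in suppressed
    ]


if __name__ == "__main__":
    print(check_line("ep1-0001", "Alise", "l think so"))
```

The suppression set is the "standard and flexible way" of tracking tool mistakes in miniature. It lives outside the checker, so the same approach could in principle wrap an external tool that can't be modified.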
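
One way to take some of the sting out of the "convert, clean, convert again" chain from the first bullet is a small registry of converters that can be chained in one call. This is only a sketch: the format names ("label_text", "records", "csv"), the tab-separated label layout, and the convert helper are all hypothetical.

```python
import csv
import io

# Registry of converter functions keyed by (source_format, target_format).
CONVERTERS = {}


def converter(src, dst):
    """Decorator that registers a function as the src -> dst converter."""
    def register(func):
        CONVERTERS[(src, dst)] = func
        return func
    return register


@converter("label_text", "records")
def parse_labels(text):
    # Assumed layout: one "start<TAB>end<TAB>speaker" label per line.
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, speaker = line.split("\t")
        rows.append({"start": float(start), "end": float(end), "speaker": speaker})
    return rows


@converter("records", "csv")
def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["start", "end", "speaker"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def convert(data, path):
    """Run data through each consecutive pair of formats in `path`."""
    for src, dst in zip(path, path[1:]):
        data = CONVERTERS[(src, dst)](data)
    return data


if __name__ == "__main__":
    print(convert("0.0\t1.5\tAlice\n", ["label_text", "records", "csv"]))
```

The user still has to run the external tools by hand, but the conversions between them collapse into one step per hop instead of one separate script per format pair.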
And we got lucky with audio data because the external tools were all reasonably fast and programmable *enough*. We're doing the same thing for puppet animation data, and I can easily see us running into the same problems plus more.
- There are very few viable tools. The ones that exist are designed for highly specialized content creators, aren't designed to support batch processing, use extremely complex proprietary formats, and are complex enough that we can't just ship a binary and expect everyone to be able to run it.
- With audio, you can select data by selecting two timepoints. That doesn't work for puppet animation, where you have layers and hierarchies of overlapping symbols, and where there's no useful way to see all symbols at once (since layers can occlude other layers and some errors are only visible in some frames).
- It is NOT possible to fully clean animation data in its original format, because some animations can only be described in that format through hacks.
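
To make the selection problem concrete, here is a sketch of what a "selection" might look like in each case. The class names, the symbol path tuple, and the overlaps rule are assumptions for illustration, not how any animation tool actually represents its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSelection:
    # Audio: two timepoints are enough to identify the data.
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class AnimationSelection:
    # Puppet animation: you also need to say *which* symbol in the
    # hierarchy, e.g. ("scene", "character", "left_arm").
    symbol_path: tuple
    first_frame: int
    last_frame: int

    def overlaps(self, other):
        """True if one symbol contains the other and the frame ranges intersect."""
        shorter, longer = sorted((self.symbol_path, other.symbol_path), key=len)
        nested = longer[: len(shorter)] == shorter
        frames_intersect = (
            self.first_frame <= other.last_frame
            and other.first_frame <= self.last_frame
        )
        return nested and frames_intersect


if __name__ == "__main__":
    arm = AnimationSelection(("scene", "character", "left_arm"), 10, 20)
    body = AnimationSelection(("scene", "character"), 15, 30)
    print(arm.overlaps(body))
```

Even this simplified version needs a hierarchy path on top of a frame range, and it still says nothing about occlusion between layers, which is the part that makes errors visible only in some frames.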