Cleaning up a text file with a variety of CR and LF line breaks

Question

I'm trying to import into mySQL a text file containing memos. I don't know how they managed to do this, but while the memo field is consistently terminated by CR LF, parts of the text itself contains a mixture of CR, LF, and CR LF line breaks as well.

Naturally this breaks my ability to import it, as there is no clear indication of what constitutes a line break. Roughly half the data is lost during import, and 25% of what made the cut ends up truncated.

Is there any feasible way of sorting this mess out? It was originally exported from Access.

Thanks!

Why would some combination of CRLF and LF characters naturally break your ability to import the data? Line breaks should not matter much. How are you importing? — Ярослав Рахматуллин, Aug 27 '13 at 12:52
LOAD DATA LOCAL INFILE 'source_data.csv' INTO TABLE my_table FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\r\n'
The last line was a recent addition in an effort to try to solve this import problem. The original problem as well was that a memo might get imported up until a line break, I assume mySQL thinks it's the start of the next line so it dumps the rest of the text because it's not valid CSV and then goes to the next line, which sometimes doesn't work just the same. — Ivan, Aug 27 '13 at 13:05
Delete your last comment and update your question with (A) description of the client/tool/environment you are using to import (phpmyadmin? something else?), (B) short sample of problematic input text, (C) table definitions for the relevant columns / records. Good luck. — Ярослав Рахматуллин, Aug 27 '13 at 13:11

score 2 · Accepted Answer · edited Mar 08 '17 at 16:58

2

If the good CR+LF comes outside of the single quotes that surround the Memo field, the following lookbehind should work to match the CR+LF that you want to delete:

(?<!')\r\n

Regular expression visualization

Edit live on Debuggex

Original Answer:

Can you run a regular expression on the text file?

\r\n?|\n

Replace every match with your desired newline code, such as \r\n.

edited Mar 08 '17 at 16:58

Community

1

answered Aug 27 '13 at 14:06

Dane

1,897
14
19

Thanks, but that would just break it further. I can live with getting rid of CR and LFs individually; it's the random CR+LFs that ruin everything. – Ivan Aug 27 '13 at 14:11
Ohhhh. So you've got like, "Blah blah\r\nblah blah"\r\n ? And you want to keep the final one, but not the one that's in the middle of the text? What comes immediately after the real \r\n? Is it just Column A with some random data? A sample would be great, but I think I have an idea for another regex... – Dane Aug 27 '13 at 14:28
Actually, yeah, we need some sample. Isn't the memo field enclosed by single quotes? Can't you just keep the \r\n that's preceded by a single quote and remove the others? This lookahead would seem to match all the bad ones: (?<!')\r\n – Dane Aug 27 '13 at 14:34
Thanks for your efforts. I think you may be on to something with that last regex. I understand their importance but I don't understand a damn thing about writing them so I appreciate your help! Testing it out now; early results looked promising. – Ivan Aug 27 '13 at 15:39

M K · Answer 2 · 2013-08-27T13:15:13.527

0

Depending on what you'd like to retain, either CR-LF or LF, you could use a unix2dos/dos2unix style utility that can convert and cleanup your import file from CR-LF to LF and vice versa so that it's uniform throughout.

If you're on a Unix/Linux platform, you should have dos2unix and unix2dos commands available.
If you're on Windows and do not have Cygwin (or equivalent), you could try tofrodos.

edited Aug 27 '13 at 13:15

answered Aug 27 '13 at 12:56

M K

2,792

score 0 · Answer 3 · answered Aug 27 '13 at 13:21

If there are CR LF's inside the data itself, and they are not enclosed with something sort of qualifier (for example ""), then obviously those lines will be broken into multiple lines. Best bet is to clean up the data in Access and export again. If the CR LFs are qualified, then you should be able to manipulate the file (I'd probably use AWK) and replace the unqualified CR LF with \n.

In order to get more precise, I'd need to know a bit more about the data.

Cleaning up a text file with a variety of CR and LF line breaks

3 Answers3