I have a massive file of customer account information that is currently stored in a single column, like the sample below. I want to split each row on the ":" separator and place the data after each ":" into its own column for that field. My ultimate goal is to turn this into CSV form, to import somewhere for data analytics and/or to build a database.
firstName:John
middleName:null
lastName:Doe
companyName:John Doe Corp
suffix:null
primaryEmail:johndoe@johndoe.com
primaryPhone:555.555.5555
secondaryEmail:johndoe@johndoe.com
secondaryPhone:null
Also, this is not the complete set of rows per customer; each customer record is 55 rows.
[ACCEPTED ANSWER]
Using perl, which is present on any desktop or server Linux distro:
perl -lne '
BEGIN{$,=","}                    # set the output field separator to ","
($k,$v)=split":",$_,2;           # split on the FIRST ":" only
next unless defined $v;          # skip lines that contain no ":"
for($k,$v){s/"/""/g,$_=qq{"$_"}if/[$,"]/}  # CSV-quote fields containing the separator or a quote
$k=$t{$k}//=$t++;                # map each field name to a stable column index
if(exists$f[$k]){print@f;@f=()}  # a repeated key starts a new record: flush the previous one
$f[$k]=$v;                       # store the value in its column
END{print@f;print STDERR sort{$t{$a}<=>$t{$b}}keys%t}  # last record; header goes to stderr
' your_file
This should convert the file to standard CSV, except that the header (the first line with the field names) is printed to stderr, after the whole file has been processed. You can save the two streams with ... >body 2>hdr and then run cat hdr body > final_file.csv.
No special significance is attached to empty lines, etc.: a record is simply a cluster of fields with distinct names, in whatever order they appear.
Fields which contain either , or " will be put inside "...", and any inner " will be escaped by doubling it as "" (using the CSV convention).
You can adjust the field separator by changing $,="," to e.g. $,="|" (or $,="\t" for Tabs). You can get rid of the quoting & escaping by removing the for($k,$v){ ... } line.
This could also be done in awk (though NOT in sed or tr), only it will be a bit more complicated: awk has no way to print an entire array at once (you have to loop through it), nor a way to split a string into a limited number of fields (you'd have to use a substr trick for that).
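For illustration, a minimal sketch of that substr trick (the sample line and the "=" output format are made up): it splits at the first ":" only, so a value that itself contains colons survives intact.

```shell
# Split each line at the FIRST ":" via index()/substr(), instead of FS=":",
# so values containing further colons (e.g. URLs) are not truncated.
printf 'homepage:http://example.com\n' |
awk '{
  i = index($0, ":")                 # position of the first ":"
  key = substr($0, 1, i - 1)        # everything before it
  val = substr($0, i + 1)           # everything after it, colons and all
  print key "=" val
}'
# prints: homepage=http://example.com
```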
For the sake of completeness, an awk-based solution.
The awk-script (let's call it convert_csv.awk):
#!/bin/awk -f
BEGIN{FS=":";OFS=","}
# Process all non-empty lines
NF>0{
# Check if the "key" part of the line was not yet encountered, both globally
# and for the currently processed record.
# If not encountered globally yet, add to list of headers (=columns).
new_hdr=1; new_key=1;
for (i=1; i<=n_hdrs; i++) {if (hdr[i]==$1) new_hdr=0;}
if (new_hdr) hdr[++n_hdrs]=$1;
for (key in val) {if (key==$1) new_key=0;}
# Once no globally new keys are found, consider the "list of headers" as
# complete and print it as CSV header line.
if (!new_hdr && !hdr_printed)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
hdr_printed=1;
}
# If the current key was already found in the currently processed record,
# we assume that a new record was started, and print the data collected
# so far before collecting data on the next record.
if (!new_key)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
delete val;
}
# Associate the "value" part of the line with the "key", perform transformations
# as necessary. Since both the 'gsub()' function used for escaping '"' to '""'
# and the 'index()' function used to locate ',' return non-zero if an occurrence
# was found, the sum of both return values being > 0 indicates that the field
# must be quoted.
quote=gsub("\"","\"\"",$2)+index($2,",");
if (quote) $2="\""$2"\"";
val[$1]=$2;
}
# Print the last record. If it was the only record, print the header line, too (this
# is the case if 'new_hdr' is still 'true' at end-of-file).
END {
if (new_hdr)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
}
for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
}
The script is explained in the comments, but basically it looks for sets of unique keys and considers a record complete once a key that was already encountered appears again; it then prints the record and clears the temporary buffer before collecting the next record. It also applies the transformations indicated by @mosvy, so that special characters in the fields adhere to the CSV convention.
Call this as
awk -f convert_csv.awk input.txt
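As a self-contained check, the script can be reproduced in a here-document and run on a hypothetical two-record stream (the names are made up) to show where it flushes one record and starts the next:

```shell
# Reproduce convert_csv.awk verbatim (shebang omitted; awk -f does not need it).
cat > convert_csv.awk <<'EOF'
BEGIN{FS=":";OFS=","}
NF>0{
  new_hdr=1; new_key=1;
  for (i=1; i<=n_hdrs; i++) {if (hdr[i]==$1) new_hdr=0;}
  if (new_hdr) hdr[++n_hdrs]=$1;
  for (key in val) {if (key==$1) new_key=0;}
  if (!new_hdr && !hdr_printed)
  {
    for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
    hdr_printed=1;
  }
  if (!new_key)
  {
    for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
    delete val;
  }
  quote=gsub("\"","\"\"",$2)+index($2,",");
  if (quote) $2="\""$2"\"";
  val[$1]=$2;
}
END {
  if (new_hdr)
  {
    for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
  }
  for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
}
EOF

# Two records: "firstName" repeating signals the start of the second record.
printf '%s\n' 'firstName:John' 'lastName:Doe' 'firstName:Jane' 'lastName:Roe' |
awk -f convert_csv.awk
# Expected output:
# firstName,lastName
# John,Doe
# Jane,Roe
```

The header is emitted the moment firstName repeats (no globally new key), the first record is flushed at the same point, and the END block emits the last record.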
Comments:
A tab-to-csv (or csvlite) transformation should be possible. – steeldriver
firstName? – pLumo