Parsing files in Linux

Parsing a CSV file with bash and awk

Greetings, Habr reader!

I needed to translate the interface of a certain system. The translation for each form lives in a separate XML file, and the files are scattered in groups across folders, which is very inconvenient. I decided to build a single dictionary so that the translations of all forms could be worked on in Excel. This task in turn breaks down into two subtasks: extract the information from all the XML files into a single CSV file, and, once the translation is done, recreate the XML files with their original structure from that CSV file. I chose bash and awk as the tools. There is no point describing the first subtask, as it is fairly trivial. But how do you parse the CSV file?

There is plenty of information on this topic on the Internet, but most examples cope easily only with the simple cases. I found nothing suitable for, say, this:

./web/analyst/xml/list.template.xml;test;"t ""test""; est"
./web/analyst/xml/list.template.xml;%1 _s found. Displaying %2 through %3;Найдено объектов: %1. Отображено с %2 по %3

In Excel these lines look like this:

File | Tag | Translation
./web/analyst/xml/list.template.xml | test | t "test"; est
./web/analyst/xml/list.template.xml | %1 _s found. Displaying %2 through %3 | Найдено объектов: %1. Отображено с %2 по %3

Taking an example from OpenNET as a starting point, I decided to modify it. Here is the text of the awk program:
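The awk listing did not survive in this copy of the article. Below is a minimal sketch of such a parser, reconstructed to handle the sample above: semicolon-delimited fields, double-quoted fields that may contain semicolons, and doubled quotes ("") as an escaped quote. The file name parse_csv.awk and the tab-separated output are assumptions.

# parse_csv.awk -- sketch: split a semicolon-delimited CSV line into fields,
# honouring quoted fields with embedded semicolons and "" escapes
{
    s = $0
    n = 0
    while (length(s) > 0) {
        if (substr(s, 1, 1) == "\"") {
            s = substr(s, 2)                         # drop the opening quote
            field = ""
            done = 0
            while (!done) {
                p = index(s, "\"")
                if (p == 0) {                        # malformed: no closing quote
                    field = field s; s = ""; done = 1
                } else if (substr(s, p + 1, 1) == "\"") {
                    field = field substr(s, 1, p - 1) "\""   # "" -> "
                    s = substr(s, p + 2)
                } else {
                    field = field substr(s, 1, p - 1)        # closing quote
                    s = substr(s, p + 1)
                    done = 1
                }
            }
            if (substr(s, 1, 1) == ";") s = substr(s, 2)     # skip separator
        } else {
            p = index(s, ";")
            if (p == 0) { field = s; s = "" }
            else { field = substr(s, 1, p - 1); s = substr(s, p + 1) }
        }
        fields[++n] = field
    }
    # demo output: file, tag and translation separated by tabs
    for (i = 1; i <= n; i++) printf "%s%s", fields[i], (i < n ? "\t" : "\n")
}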

And here is the relevant fragment of the bash script (XML_PATH is a variable holding the path where the folders with the XML files are located):
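That fragment is missing here as well; a minimal sketch of what it may have looked like (parse_csv.awk and dictionary.csv are assumed names):

# Run the parser over the translated dictionary; the awk program is expected
# to recreate the XML files with their original structure under $XML_PATH.
awk -f parse_csv.awk -v xml_path="$XML_PATH" dictionary.csv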

Source

How to Parse or View XML Code in Linux Command Line

XML is an abbreviation for Extensible Markup Language. Since XML is both a markup language and a file format, it is widely used for the storage, transmission, and reconstruction of arbitrary data. The set of rules XML defines makes it possible to encode documents in formats that are both machine-readable and human-readable.

There is a downside to calling XML a human-readable language: it is challenging to read and write when poorly formatted. For instance, you will find it difficult to visually comprehend a single long line of XML code that lacks element indentation.

For instance, consider how the code of an XML file looks in a Linux terminal.

View XML Content in Linux

The screen capture above shows a valid XML file. However, due to its irregular formatting, it is difficult for the human eye to read and understand.
Throughout this tutorial, we will be referencing this file as our input file before pretty-printing it on our Linux command line shell environments.

Out of the many approaches to formatting and printing an XML file on the Linux terminal, we will look into two ideal solutions:

1. Parsing XML Files Using xmllint Command

The xmllint command is part of the libxml2 package. Its primary role is checking the validity of XML files, evaluating XPath expressions, and parsing XML files.

The --format option in the xmllint command helps reformat and re-indent a targeted XML file as per the following syntax:
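In sketch form, where file.xml stands for any target file:

$ xmllint --format file.xml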

Let us use the xmllint command to reformat our sample mailing.xml file.
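One likely invocation (redirect the output to a file if you want to keep the result):

$ xmllint --format mailing.xml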

Parsing XML Files in Linux

The command execution above has added an XML declaration ( <?xml version="1.0"?> ) and re-indented the nested XML elements, making the file easy to read.

2. Parsing XML Files Using xmlstarlet Command

The xmlstarlet command is a toolkit for querying, editing, validating, and formatting XML documents. Its fo (format) subcommand pretty-prints an XML file in much the same way as xmllint.

View XML Content in Linux
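A minimal sketch of the invocation that produces output like the capture above (mailing.xml as before; fo is short for format):

$ xmlstarlet fo mailing.xml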

The man page of the xmlstarlet toolkit provides more formatting options for your XML file.

With these discussed approaches to pretty-printing XML files in Linux, you should have no problem expanding your knowledge on the usage of these commands after visiting their associated man pages.

Source

How To Parse CSV Files In Bash Scripts In Linux

Work With CSV Files In Bash Scripts

Comma-separated values, aka CSV, is a semi-structured data format that uses a comma as the delimiter between fields. CSV files are very popular among data professionals, who have to deal with many of them and process them to create insights. In this article, we will be focusing on how to parse CSV files in Bash shell scripts in Linux.

In most parts of this article, I will be using the awk and sed tools for CSV parsing instead of combining different commands like grep, cut, and tr.

The awk utility reduces the complexity of piping multiple commands or writing a loop with logic to grab the data. Instead, you can write a one-liner code in awk to do the job.

1. Preparing CSV File For Processing

Your CSV file may be generated from a database or an API, or you might have run some commands and converted the output to CSV format. In any case, you have to analyze the dataset before running your logic on top of it.

As a best practice, you should cleanse your dataset before using it. Why? There may be empty cell values, improperly formatted headers, extra columns that are not required for processing, and more.

I am using the below CSV data, which I grabbed from Kaggle for demonstration purposes.

1.1. Replace Empty Cells

In some cases, the CSV file will not have any values in particular cells. Take a look at the below screenshot where there are some empty cells between the columns.

I would always replace them with "NA" or "No value" so that no cells are left empty. You can use the following awk snippet to replace any empty cell with your desired value. In this case, I am replacing the empty cells with "No value".
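A sketch of such a snippet (data.csv and new-data.csv are placeholder names):

$ awk 'BEGIN { FS=","; OFS="," } { for (i = 1; i <= NF; i++) { if ($i == "") $i = "No value" }; print }' data.csv > new-data.csv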

The way this snippet works: the field separator and the output field separator are both set to comma ( FS=","; OFS="," ). A for loop iterates through each cell in a line, and if a cell is found empty ( $i == "" ), it is replaced with "No value" ( $i = "No value" ). You have to redirect the changes to a new file.


1.2. Capitalize The Header

CSV files may or may not have headers. But if there is a header, I would always capitalize it for better readability. You can do it easily using awk or sed. I will show you both ways.
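A sketch of the awk version (data.csv and new-data.csv are placeholder names):

$ awk '
NR == 1 {
    # first line only: convert the header to uppercase
    $0 = toupper($0)
}
{ print }
' data.csv > new-data.csv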

Here, we check whether the line is the first line using ( NR == 1 ) and use the toupper() function to capitalize it. The same snippet can be written as a one-liner.
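For instance:

$ awk 'NR == 1 { $0 = toupper($0) } { print }' data.csv > new-data.csv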

Using awk, you again have to redirect the changes to a new file. Instead, you can use sed to apply the changes directly to the file. Here \U converts the matched text to uppercase. If you want lowercase conversion, use \L.
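A sketch with GNU sed (the -i option edits the file in place; \U is a GNU extension):

$ sed -i '1s/.*/\U&/' data.csv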

1.3. Remove Trailing Comma

Your CSV file may have a comma at the end. To clean the trailing commas, you can follow the below method.

I have purposely added a trailing comma from lines 7 to 11 in my data file.

To remove all the trailing commas, run the following sed command:
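For example, with GNU sed editing in place:

$ sed -i 's/,*$//' data.csv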

Now we are done with the cleaning part. There may be a few more steps required for you but that depends on how your CSV file is structured and what needs to be cleaned.

2. Pretty Print CSV File In Terminal

If you are trying to display CSV files in the terminal, there are a few options for printing the file in tabular format, which gives you better readability.

2.1. Column Command

The first approach is to use the column command. The column command accepts an input separator, which is set to comma below, and splits each line into columns aligned in a table. You can also set your own custom delimiters.
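A sketch, using the same placeholder file:

$ column -s, -t < data.csv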

2.2. CSV Look Command

Csvlook is a utility that comes with the csvkit package. There is no need to set a delimiter as we did with the column command.
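For example (assuming the csvkit package is installed):

$ csvlook data.csv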

2.3. Python Pretty Table

If you have the python prettytable module installed, then you can run the following one-liner and redirect the CSV file to generate the table.
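One possible form of that one-liner (from_csv is the CSV loader the prettytable module ships):

$ python3 -c 'import sys; from prettytable import from_csv; print(from_csv(sys.stdin))' < data.csv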

You can also create an alias for the one-liner and pass the file name as an argument.
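Since a plain alias cannot take a positional argument, a small shell function is one way to sketch it (csvprint is a hypothetical name):

csvprint() {
    python3 -c 'import sys; from prettytable import from_csv; print(from_csv(sys.stdin))' < "$1"
}
$ csvprint data.csv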

3. Grabbing Data From CSV File

3.1. Print Row & Column Count

To get the number of columns in the CSV file, run the following command. Here the variable NF represents the number of fields split by a comma as the delimiter.
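For example:

$ awk -F, 'NR == 1 { print NF }' data.csv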

To get the number of rows, run the following command. Here the variable NR represents the current record, i.e., each line is considered one record.
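For example:

$ awk 'END { print NR }' data.csv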

To skip the first line (header) and calculate the number of lines, run the following command.
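For example:

$ awk 'NR > 1 { count++ } END { print count }' data.csv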

3.2. Print Entire CSV File

This is pretty simple. You can use cat or awk to print the entire CSV file.
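For example, either of these prints the whole file:

$ cat data.csv
$ awk '{ print }' data.csv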

3.3. Print Only Header From CSV File

Printing the header alone will give you a nice overview of what type of data your CSV file holds. You can use the head or awk command to grab the header alone.
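For example, either of:

$ head -n 1 data.csv
$ awk 'NR == 1' data.csv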

3.4. Exclude Header Line

To exclude the header line and print all other lines, use the awk command. The condition NR > 1 makes awk skip the first line.
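For example:

$ awk 'NR > 1' data.csv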

Sed can also be used to exclude the first line and print all other lines. The 1d command deletes the first line and prints all other lines to stdout (the terminal).
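For example:

$ sed '1d' data.csv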

3.5. Print Particular Columns

We can use the column position to print an entire column. There are two approaches to achieve this: the first uses awk, and the second uses loops. Awk is much simpler for grabbing a column.

Awk by default splits the line based on the delimiter and stores the values in $1, $2, $3, etc. The default delimiter for awk is whitespace.

Take a look at the below snippet, where the field separator ( FS="," ) and the output field separator ( OFS="," ) are set to comma. The print statement prints the first, second, and sixth columns.
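A sketch of such a snippet:

$ awk 'BEGIN {
    FS = ","
    OFS = ","
}
{
    print $1, $2, $6
}' data.csv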

You can write the above snippet in one-liner too.
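For instance:

$ awk 'BEGIN { FS=","; OFS="," } { print $1, $2, $6 }' data.csv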

Now the second approach would be to use loops.
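A sketch of the loop approach (printing the first column; the column choice is illustrative):

#!/usr/bin/env bash
# split each line on commas into an array named "fields"
while IFS=, read -r -a fields; do
    echo "${fields[0]}"    # first column; bash array indices start at 0
done < data.csv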

Let me explain what exactly happens when you run the above snippet.

  • We are setting the internal field separator IFS to comma.
  • Using the read command, we are creating an array named "fields" and redirecting the input file to the while loop.
  • For each iteration, it reads a line and stores it as array elements in "fields", so you can use the array index position to grab a particular column alone.

Note: Array index values start from 0.

3.6. Print Row That Matches The Condition

If you wish to print the rows that match a certain condition, then you can do it easily using awk . Let’s go over a few scenarios.

To print all the rows that match a value in a column, run the following command. Here I am trying to print all rows that match the value "India" in column 6.
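A sketch, assuming the country sits in column 6 of the sample dataset:

$ awk -F, '$6 == "India"' data.csv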

To print all rows that do not match a certain value, run the following command. Instead of the equality operator, we use the not-equal operator.
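For instance:

$ awk -F, '$6 != "India"' data.csv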

You can also run a condition check on more than one column using the logical AND and logical OR operators. Let's say I want to check all the rows that have the country as "India" and the batting hand as "Right_hand".
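A sketch (column positions follow the sample dataset):

$ awk -F, '$4 == "Right_hand" && $6 == "India"' data.csv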

Here, $4 points to the 4th column and $6 points to the 6th column. The symbol && is used as a logical AND operator to evaluate two conditions.

If you wish to include the header along with the results of the conditional check, use the following command. The first rule prints the header line using NR == 1; the conditional check with the logical AND operator then prints the matching rows.
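One way to sketch it (the first rule prints the header, the second prints the matching rows):

$ awk -F, 'NR == 1; $4 == "Right_hand" && $6 == "India"' data.csv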

If you wish to print or redirect the output, run the entire command inside a subshell by enclosing it in brackets.
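For example (result.csv is a placeholder):

$ (awk -F, 'NR == 1; $4 == "Right_hand" && $6 == "India"' data.csv) > result.csv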

A note about Csvkit

So far, everything we have seen in this article is simple and straightforward. But when your CSV file has a complex structure, it becomes tedious to parse with the above approach. There is a utility called csvkit, which is excellent for working with CSV files in bash.

The problem with the csvkit utility is that it is not installed by default in your distribution, so you might have to install it manually. In a corporate environment this may not be possible, since there may be restrictions on installing external packages. But this utility is worth mentioning, and we will create a separate detailed article for it.

Conclusion

In this guide, we have seen how to work with CSV files using awk and sed. You can also use other utilities like cut, grep, and tr to get the desired result, but awk and sed will make your life simpler and reduce the complexity of writing a lot of code. If you have any feedback, do mention it in the comment section; we will be happy to hear from you.

Source

How to Parse the Tab-Delimited File Using `awk`

A tab is used as the separator in a tab-delimited file. This type of text file is created to store various kinds of text data in a structured format. Several commands exist in Linux to parse this type of file. The `awk` command is one of the ways to parse a tab-delimited file, and its uses for reading tab-delimited files are shown in this tutorial.

Create a tab-delimited file:

Create a text file named users.txt with the following content to test the commands of this tutorial. This file contains each user's name, email, username, and password.

users.txt

Name Email Username Password

Md. Robin [email protected] robin89 563425

Nila Hasan [email protected] nila78 245667

Mirza Abbas [email protected] mirza23 534788

Aornob Hasan [email protected] arnob45 778473

Nuhas Ahsan [email protected] nuhas34 563452

Example-1: Print the second column of a tab-delimited file using the -F option

The following `awk` command will print the second column of a tab-delimited text file. Here, the '-F' option is used to define the field separator of the file.
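A sketch of the command:

$ awk -F '\t' '{ print $2 }' users.txt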

The following output will appear after running the command. The second column of the file contains the users' email addresses, which are displayed as output.

Example-2: Print the first column of a tab-delimited file using the FS variable

The following `awk` command will print the first column of a tab-delimited text file. Here, the FS (Field Separator) variable is used to define the field separator of the file.

$ awk '{ print $1 }' FS='\t' users.txt

The following output will appear after running the command. The first column of the file contains the users' names, which are displayed as output.

Example-3: Print the third column of a tab-delimited file with formatting

The following `awk` command will print the third column of the tab-delimited text file with formatting, by using the FS variable and printf. Here, the FS variable defines the field separator of the file.
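A sketch; the %10s field width is only illustrative:

$ awk 'BEGIN { FS = "\t" } { printf "%10s\n", $3 }' users.txt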

The following output will appear after running the command. The third column of the file contains the usernames, which are printed here.

Example-4: Print the third and fourth columns of the tab-delimited file by using OFS

OFS (Output Field Separator) is used to add a field separator in the output. The following `awk` command will divide the content of the file based on tab(\t) separator and print the 3rd and 4th columns using the tab(\t) as a separator.

$ awk -F "\t" '{ OFS = "\t"; print $3, $4 > ("output.txt") }' users.txt
$ cat output.txt

The following output will appear after running the above commands. The 3rd and 4th columns contain the username and password, which have been printed here.

Example-5: Substitute the particular content of the tab-delimited file

The sub() function is used in the `awk` command for substitution. The following `awk` command will search for the number 45 and substitute it with the number 90 if the number exists in the file. After the substitution, the content of the file will be stored in the output.txt file.

$ awk -F "\t" '{ sub(45, 90); print }' users.txt > output.txt
$ cat output.txt

The following output will appear after running the above commands. The output.txt file shows the modified content after the substitution. Here, the content of the 5th line has been modified, and 'arnob45' is changed to 'arnob90'.

Example-6: Add string at the beginning of each line of a tab-delimited file

In the following `awk` command, the '-F' option is used to split the content of the file based on the tab (\t). OFS is used to add a comma (,) as the field separator in the output. The sub() function is used to add the string '-->' at the beginning of each line of the output.
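A sketch of such a command:

$ awk -F "\t" '{ OFS = ","; sub(/^/, "-->"); print $1, $2, $3, $4 }' users.txt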

The following output will appear after running the above command. Each field value is separated by a comma (,), and the string is added at the beginning of each line.

Example-7: Substitute the value of a tab-delimited file by using the gsub() function

The gsub() function is used in the `awk` command for global substitution. All string values in the file will be replaced wherever the search pattern matches. The main difference between the sub() and gsub() functions is that sub() stops after the first match in a record, while gsub() keeps searching to the end of the record and replaces every match. The following `awk` command will search for the words 'nila' and 'Nila' globally in the file and substitute every occurrence with the text 'Invalid Name'.
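A sketch matching the described behaviour:

$ awk -F "\t" '{ gsub(/Nila|nila/, "Invalid Name"); print }' users.txt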

The following output will appear after running the above command. The word 'nila' occurs twice in the 3rd line of the file, and both occurrences have been replaced by the text 'Invalid Name' in the output.

Example-8: Print the formatted content from a tab-delimited file

The following `awk` command will print the first and second columns of the file, formatted using printf. The output will show each user's name with the email address enclosed in brackets.
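A sketch of the command:

$ awk -F "\t" '{ printf "%s (%s)\n", $1, $2 }' users.txt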

The following output will appear after running the above commands.

Conclusion

Any tab-delimited file can be easily parsed and printed with another delimiter by using the `awk` command. The ways of parsing tab-delimited files and printing them in different formats have been shown in this tutorial through multiple examples. The uses of the sub() and gsub() functions in the `awk` command for substituting content in a tab-delimited file are also explained. I hope this tutorial helps readers parse tab-delimited files easily after practicing the examples properly.

About the author

Fahmida Yesmin

I am a trainer of web programming courses. I like to write articles and tutorials on various IT topics. I have a YouTube channel where many types of tutorials based on Ubuntu, Windows, Word, Excel, WordPress, Magento, Laravel, etc. are published: Tutorials4u Help.

Source
