Import Tabular Data
- Tabular data can be stored in a spreadsheet or a delimited text file.
- Import it with the readtable function:
myTable = readtable('tablename.txt');
- access the variable using dot notation:
P=myTable.Pressure;
- When the text file has extra header lines before the data, use the HeaderLines option to give the number of lines to skip:
myTable = readtable('tablename.txt','HeaderLines',5)
- Use the CommentStyle option to ignore comment lines marked with a specific symbol:
myTable = readtable('tablename.txt','CommentStyle','##');
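As a minimal sketch of the options above, the following writes a small demo file and reads it back. The file contents, the variable names Pressure and Temperature, and the explicit 'Delimiter' and 'ReadVariableNames' settings are assumptions made for illustration; they are not part of the original notes.

```matlab
% Write a small demo file; the contents are hypothetical.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'Logged by sensor A\n');      % header line 1
fprintf(fid,'Units: kPa, degC\n');        % header line 2
fprintf(fid,'Pressure,Temperature\n');    % variable names
fprintf(fid,'101.3,20.5\n');
fprintf(fid,'## calibration pause\n');    % comment line to ignore
fprintf(fid,'99.8,21.0\n');
fclose(fid);

% Skip the two header lines and ignore lines marked with ##.
myTable = readtable(fname,'Delimiter',',','HeaderLines',2, ...
    'ReadVariableNames',true,'CommentStyle','##');

P = myTable.Pressure;   % extract one variable with dot notation
delete(fname);          % clean up the temporary file
```

Setting the delimiter explicitly is a cautious choice here; readtable can usually detect it on its own.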
Representing Discrete Categories:
- By default, text variables that take only a finite set of values are imported as a cell array of character vectors, which can use a lot of memory.
- Store such data in a categorical array instead:
x = categorical(x)
- You can specify all the possible values as additional inputs to categorical: the second input lists the unique category values in the original array, and the third input gives the names that correspond to these categories:
v = [ 10 5 0 0 ];
levels = { 'beg' 'mid' 'last' };
categorical(v,[0 5 10],levels)
ans =
last mid beg beg
- To preserve the inherent ordering of the categories, use the Ordinal option:
v = [2 4 1 1];
levels = {'tiny','small','big','huge'};
c = categorical(v,[1 2 3 4],levels,'Ordinal',true)
c =
small huge tiny tiny
- Categorical arrays support == for comparison, and ordinal categorical arrays also support > and <:
y=='small'
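A short sketch of these comparisons, reusing the ordinal example from this section. The output names isSmall, aboveSmall, and cats are chosen here for illustration.

```matlab
v = [2 4 1 1];
levels = {'tiny','small','big','huge'};
c = categorical(v,[1 2 3 4],levels,'Ordinal',true);

isSmall = (c == 'small');     % true only where c is 'small'
aboveSmall = (c > 'small');   % true where c is 'big' or 'huge'
cats = categories(c);         % category names in their defined order
```

The > comparison only works because the array was created with 'Ordinal',true; without it, MATLAB has no ordering to compare against.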
Preprocessing Data
- Calculations Involving NaNs:
- mean without NaN:
y=mean(y,'omitnan')
- median without NaN:
y=median(y,'omitnan')
- To test two arrays for numeric equality while treating NaNs in the same positions as equal, use isequaln:
test = isequaln(x,y)
- Locating Missing Data:
- replace the NaN value with zero:
x(isnan(x)) = 0
- delete the NaN value:
x(isnan(x)) = [];
- Use ismissing on a table to locate any kind of missing value:
missingDataLocations = ismissing(tableName);
- Use the any function to determine which rows contain at least one true value:
trueRows = any(grid,2)
The second input, 2, tells any to work along the 2nd dimension, that is, across the columns of each row, so the result has one value per row.
- Categories and Set Operations:
- The categories function returns the unique categories within a categorical array:
cats = categories(variableName);
- The setdiff function returns the set difference between its first and second inputs:
d = setdiff(a,b)
The returned variable d holds the values in a that are not present in b.
- Merge several categories of a categorical array into one:
a = mergecats(a,{'small' 'medium' 'large'},'size')
- Discretizing Continuous Data:
- Use the discretize function to sort values into discrete bins:
binNum = discretize(x,0:0.2:1)
- NaNs and values outside the range of the bin edges are not assigned to any bin (the result is NaN for them). Add -Inf or Inf to the vector of bin edges if you want bins for values beyond the edges.
- To discretize data into named categories, use the Categorical option:
cats = {'on','off'};
binNum = discretize(x,0:0.5:1,'Categorical',cats);
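The preprocessing steps in this section can be sketched on a few made-up values. All the data and variable names below are hypothetical and only serve to show what each call returns.

```matlab
x = [1 NaN 3 4];
m = mean(x,'omitnan');             % mean of the non-NaN values 1, 3, 4
sameWithNaN = isequaln(x,x);       % true: NaNs in matching places count as equal
sameStrict = isequal(x,x);         % false: NaN never equals NaN

xZero = x;  xZero(isnan(xZero)) = 0;    % replace NaN with zero
xDrop = x;  xDrop(isnan(xDrop)) = [];   % delete the NaN entries

% Hypothetical table with a missing number and a missing text entry
T = table([1;NaN;3],{'a';'';'c'},'VariableNames',{'Num','Txt'});
missingLoc = ismissing(T);              % logical matrix, same size as T
rowsWithMissing = any(missingLoc,2);    % one logical value per row

vals = [0.1 0.45 0.9 1.5];
binNum = discretize(vals,0:0.2:1);           % 1.5 lies outside the edges
binNumInf = discretize(vals,[0:0.2:1 Inf]);  % the extra bin catches 1.5
```

Working on copies (xZero, xDrop) keeps the original vector intact, which makes it easier to compare the two ways of handling NaNs.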
Graphics Formatting Function
- A plot in MATLAB is a collection of graphics objects. You can change the properties of the graphics object by providing additional inputs to the function that created the graphics object.
- Plot line properties:
plot(x,y,'*','MarkerSize',8,'MarkerFaceColor',[0.5 0.5 1])
- Scatter plot:
- Set the marker sizes with the optional third input, which must be either a scalar or a vector of the same length as the data:
scatter([4 5 6 7],[9 11 13 15],[25 50 75 100])
- Specify the marker style, here black diamond markers:
scatter(xData,yData,15,'kd')
- Fill the markers:
scatter(xData,yData,15,'kd','filled')
- Functions for Customizing Appearance:
- The xlim function changes the limits of the x-axis:
xlim([1 10])
- The grid command controls whether grid lines are displayed:
grid('on')
grid('minor')
- The axis command changes the style of the axes:
axis('tight')
axis('square')
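A minimal sketch that combines the formatting calls from this section. The hidden figure and the choice to set the marker through the Marker property (rather than a combined 'kd' input) are assumptions made to keep the example conservative.

```matlab
xData = [4 5 6 7];
yData = [9 11 13 15];
fig = figure('Visible','off');   % hidden figure so no window pops up

% One marker size per point, black color, filled markers.
s = scatter(xData,yData,[25 50 75 100],'k','filled');
s.Marker = 'd';                  % diamond markers, set as a property

xlim([1 10])      % set the x-axis limits
grid('on')        % show grid lines
axis('square')    % make the axes box square
```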
Importing Data from Multiple Files
- Create Datastores:
- A datastore is just a reference to a file or a set of files. Creating a datastore does not automatically import any data into MATLAB.
- Use the datastore function with the file or folder location as the input:
ds = datastore('dirName/fileName.txt');
At this point, only a reference to the data file has been created.
- The preview function shows the first few lines of data in the file:
preview(datastoreVariable)
- Since the datastore variable holds information about the file rather than the data itself, this information is available through its properties:
ds.propertyName
The properties include VariableNames, Files, NumHeaderLines, and MissingValue.
- Modify Datastore Properties
- Ignore lines that begin with the character sequence '//' by modifying the CommentStyle property:
dat.CommentStyle = '//'
- Set the ReadVariableNames property to false if the file has no line containing the variable names:
dat.ReadVariableNames = false
- Set the variable names:
dat.VariableNames = {'color','size','act','age','inflated'}
- Import Data into MATLAB
- Use read and readall to read data through a datastore:
data = ds.read;
- The read function reads up to the number of lines given by the ReadSize property of the datastore (20000 by default).
- If File1.txt has more than 20000 rows, only the first 20000 are read. The next call to read continues from where the previous one stopped.
- The reset function returns the datastore to the beginning of the first data file.
- After resetting, use the readall function to read all the data.
- Importing Datatypes Directly
- The TextscanFormats property returns a cell array with the format used to read each column of data:
fmt = dat.TextscanFormats
- By default, numeric columns are represented with %f and non-numeric columns are read as text, denoted by %q. The format specifier for a categorical is %C, and for a datetime it is %D.
- To change the datatype of a variable while importing, use curly braces to index into a cell of TextscanFormats and set it to the appropriate value:
ds.TextscanFormats{1} = '%q'
- Skipping Columns of Data
- To import only a subset of columns, set the SelectedVariableNames property. Only the variables listed there are imported:
ds.SelectedVariableNames = {'Name','Date'}
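The datastore workflow described in this section can be sketched end to end on a small demo file. The file contents, the tiny ReadSize of 2, and the variable names are hypothetical and chosen only so the chunked reading is visible.

```matlab
% Write a small demo file; the name and contents are hypothetical.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'Name,Value\n');
fprintf(fid,'a,1\nb,2\nc,3\nd,4\ne,5\n');
fclose(fid);

ds = datastore(fname);          % only a reference, no data imported yet
ds.ReadSize = 2;                % read at most 2 rows per call (tiny, for the demo)

nRows = 0;
while hasdata(ds)               % keep reading until the datastore is exhausted
    chunk = read(ds);
    nRows = nRows + height(chunk);
end

reset(ds);                              % back to the start of the first file
ds.SelectedVariableNames = {'Value'};   % import only this column
allData = readall(ds);
delete(fname);                          % clean up the temporary file
```

Reading in chunks like this is mainly useful when the files are too large to load at once; for small files readall alone is enough.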
Analyzing Groups within Data
- Find unique groups of data
- The findgroups function groups the values in an array and returns a group number for each value:
v = {'tiger' 'lion' 'lion' 'tiger'};
grpNums = findgroups(v)
- Request a second output to get the group values as well:
[grpNum,grpVal] = findgroups(v)
Here grpVal contains 'lion' and 'tiger'.
- histcounts can count the number of observations in each group:
counts = histcounts(grpNum,'BinMethod','integers');
- findgroups also accepts multiple grouping inputs. In addition to the group number, it then returns the group values for each input variable:
[grpNum,petVals,genderVals] = findgroups(pets,gender)
- Aggregating Grouped Data
- Use the splitapply function to apply an operation to each group of data:
splitapply(@min,data,grpNums)
- You can then plot the grouped results as a bar chart:
[gNum1,gName1] = findgroups(mnth);
avgWS = splitapply(@mean,hurrs.Windspeed,gNum1);
bar(avgWS);
xticklabels(gName1)
- The helper function monthNum2Name converts month numbers to names:
xticklabels(monthNum2Name(gName1));
xtickangle(45)
- Aggregating Grouped Data into a Prescribed Format
- You might want to see how the groups relate to each other by aggregating the data and storing the results in a particular structure. accumarray can do that.
- The first input to accumarray is a matrix built from the findgroups results, with one column of group numbers per grouping variable. The second input is the data to be aggregated. The third input is left empty ([]), and the fourth input is the function used for aggregation:
avgP = accumarray([G1 G2],Price,[],@mean)
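The grouping functions in this section can be tried on a small made-up data set. The pets, gender, and weight values below are hypothetical; the expected results in the comments follow directly from them.

```matlab
% Hypothetical observations, stored as column vectors
pets   = {'cat';'dog';'cat';'dog';'cat'};
gender = {'f';'f';'m';'m';'f'};
weight = [4;10;5;12;3];

[gNum,gName] = findgroups(pets);                   % gName is {'cat';'dog'}
avgW = splitapply(@mean,weight,gNum);              % [4; 11]
counts = histcounts(gNum,'BinMethod','integers');  % [3 2]

% Two grouping variables: rows index pets, columns index gender
[g1,petVals] = findgroups(pets);
[g2,genderVals] = findgroups(gender);
avgGrid = accumarray([g1 g2],weight,[],@mean);     % [3.5 5; 10 12]
```

Here avgGrid(1,2) is the mean weight of male cats, so the row and column positions line up with petVals and genderVals.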
Customizing Graphics Objects
- Accessing Graphics Objects
- To modify the properties of a graphics object, the first step is to obtain a variable (sometimes called a handle) that refers to the particular graphics object.
- Obtain the graphics object variable by assigning output from the graphics functions:
f = figure
- By assigning the output of the plot command, you obtain a line object variable:
p=plot(x,y)
- To get the graphics object variables for a plot that already exists, use the functions gcf, gca, and gco to obtain the current figure, axes, and selected object (“get current figure/axes/object”):
fig = gcf;
Querying and Modifying Properties:
- use dot notation with the property name to return object property value, e.g.:
ax = gca; fw = ax.FontWeight
- use the dot notation to assign a value to an object property:
ax.FontWeight = 'bold'
ax.XTick = [1,4,8,12]
- modify the data values of an existing plot:
p.XData = linspace(0,1,12)
- remember that to change a property of the axes, get the axes object with gca.
% create a figure containing 2 axes and 2 line plots
fig = figure;
ax1 = axes;
l1 = plot(t,y1);
axis tight;
ax2 = axes('Position',[.6 .6 .25 .25]);
l2 = plot(ax2,t,y2);
The Graphics Object Hierarchy
- All graphics objects are part of a hierarchy that starts with the root, the main display containing the MATLAB environment. You can make use of the graphics object hierarchy to obtain a specific graphics object after a plot is created.
- Besides using gca, you can also get the axes graphics object through the Children property of the figure:
ax = fig.Children
- The scatter and line plots are the children of the axes:
p = ax.Children
- The x- and y-axis rulers and labels are reached through properties of the axes such as XAxis and XLabel:
xLab = ax.XLabel;
xLab.FontName = 'Garamond'; % change only the font of the x-axis label
xAx = ax.XAxis;
xAx.TickDirection = 'out';
xAx.FontName = 'Courier';
- After plotting a bar chart with two or more bar series, you can change a property of only one of them:
ax = gca;
b = ax.Children;
b(1).FaceColor = [1 0 0];
xAx = ax.XAxis;
xAx.FontWeight = 'bold';
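A small sketch of walking the hierarchy for a bar chart with two series. The data matrix and the hidden figure are assumptions for illustration.

```matlab
fig = figure('Visible','off');     % hidden figure for the demo
bar([1 2; 3 4; 5 6]);              % two series of three bars each
ax = gca;                          % axes that contain the bars
b = ax.Children;                   % the Bar objects, one per series
b(1).FaceColor = [1 0 0];          % recolor one series only
xAx = ax.XAxis;                    % the x-axis ruler object
xAx.FontWeight = 'bold';
```

Only the series stored in b(1) turns red; the other Bar object in b keeps its default color.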
The following is a summary of this section:
ds = datastore('fuelEconomy2.txt')
ds.ReadSize = 362;
data = ds.read;
[gNum, gNames] = findgroups(data.NumCyl )
avgMPG = splitapply(@mean, data.CombinedMPG, gNum)
b = bar(avgMPG);
xlabel('Number of cylinders')
title('Average MPG')
% Customize the chart
f = gcf;
a = gca;
f.Color = [0.81 0.87 0.9];
a.Color = [0.81 0.87 0.9];
a.Box = 'off';
a.YAxisLocation = 'right';
a.YGrid = 'on';
a.GridColor = [1 1 1];
a.GridAlpha = 1;
a.XTickLabel = gNames;
a.YLim = [0 40];
ax = a.XAxis;
ax.TickDirection = 'out';
b.FaceColor = [0,0.31,0.42];
b.BarWidth = 0.5;
Images and 3-D Surface Plots
- Making a grid
- The meshgrid function converts vectors of points into matrices that represent a grid of points in the x-y plane:
[X,Y] = meshgrid(x,y)
- Interpolating Scattered Data
- Interpolating irregularly located data to a regular grid takes two steps: 1) using the scattered data to create an interpolating function and 2) evaluating the interpolant at the desired locations.
- The griddata function performs both steps in one call. The first three inputs are the original data, and the next two inputs are the locations at which you want the interpolated values:
zInterp = griddata(xOrig,yOrig,zOrig, xNew,yNew);
- Visualizing Surfaces:
s = surf(X,Y,Z)
- The color of the lines between the patches is determined by the EdgeColor property:
s.EdgeColor = 'interp'
- Colormaps and Indexed Colors:
- The colors in a surface are determined by indexing into a color lookup table, called a colormap, that is associated with the parent figure window.
- Each point on a surface has a color data value. These values are stored in the CData property of the surface.
- The color data values are mapped onto the colormap over a range of values. This range is set by the CLim property of the axes:
c = ax.CLim
- The default colormap of a figure is called parula. You can change it with the colormap function:
colormap(jet)
- Creating Indexed-Color Images
- Using pcolor: the pcolor function has the same syntax as surf. In fact, it creates a flat surface with ZData all set to 0 and CData set to Z.
- The direction of the y-axis can be changed using the axis command:
axis xy
axis ij
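The steps of this section, from scattered points to a surface and an indexed-color image, can be sketched as follows. The five sample points lie on the plane z = x + y, which is a made-up data set chosen so the interpolated values are easy to check; the hidden figures are also an assumption.

```matlab
% Hypothetical scattered samples of the plane z = x + y
xOrig = [0; 1; 0; 1; 0.5];
yOrig = [0; 0; 1; 1; 0.5];
zOrig = xOrig + yOrig;

% Regular grid of query points inside the sampled region
[X,Y] = meshgrid(0.25:0.25:0.75);
Z = griddata(xOrig,yOrig,zOrig,X,Y);   % interpolate onto the grid

figure('Visible','off');
s = surf(X,Y,Z);
s.EdgeColor = 'interp';    % color the edges by interpolation
colormap(jet)              % replace the default parula colormap
ax = gca;
c = ax.CLim;               % data range mapped onto the colormap

figure('Visible','off');
p = pcolor(X,Y,Z);         % flat surface colored by Z
axis ij                    % y-axis values increase from top to bottom
```

The query grid stays strictly inside the sampled square because griddata returns NaN for points outside the convex hull of the data.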
Import Unstructured Data
- Low-Level File I/O
- Use the fopen function to open a file. This does not open the file in an editor but instead opens a connection between MATLAB and the file for reading or writing data (read-only by default):
fi = fopen('economy.txt');
- The returned value is a unique file identifier used to reference the open file.
- Once the file is opened, you can use the file identifier to read data from the file.
- The fgetl function takes the file identifier as input and reads the next line of the file, which is the first line right after opening:
fgetl(fi)
ans =
date, unrate, gdp, feddebt
- Subsequent reads return subsequent lines. The file position indicator moves to the beginning of the next line after each read, so a second call to fgetl returns the second line of the file.
- Use the frewind command to move the file position indicator back to the beginning of the file:
frewind(fi)
- To close a file that has been opened by fopen, use the fclose function:
fclose(fi)
- If a file was opened but its identifier was not stored, use fclose('all') to close all open files.
- Importing a Block of Formatted Data:
- You can read in data from a file with arbitrary formatting using the textscan function.
- The textscan function has two required inputs: a file identifier and a format specification string, e.g.
data = textscan(fid,'%D%q%f%f%f')
This converts the first unit of data to a date (%D), the second to a string (%q), and the next three units to double precision numbers (%f). The pattern is repeated indefinitely, so the sixth unit is a date, the seventh is a string, and so on.
- Use cell indexing to extract the data from a column:
secondColumn = data{2}
- The %q specifier reads a field as text, so it can handle both numbers and strings.
- Data is read from the file sequentially in blocks delimited by whitespace (by default) or by a specific delimiting character (if provided):
data = textscan(fid,'%D%f%q%C','Delimiter','\t')
- Reading stops as soon as a match is not found, even if there are additional matches later in the file. In general, textscan applies the format string pattern as many times as possible until a match fails.
- When a text file contains header lines, you can still import data by instructing textscan to skip a number of lines before attempting to read data:
data = textscan(fid,'%D%f','HeaderLines',5);
- To read only a specific number of lines from a file, give the number as a third input to textscan. The following code reads five rows:
data = textscan(fid,'%D%f%f',5);
- Remember that textscan also advances the file position indicator, so a second call does not start from the beginning of the file but from the point where the previous call stopped.
- Reading Sections with Headers:
- Parsing Data in Text:
- Use the strfind function to find the index values where certain phrases or characters appear:
iv = strfind('cat,dog,goat',',')
iv =
4 8
- The strsplit function splits a line of text into individual strings in a cell array. It splits on whitespace unless a delimiter is provided as an optional input:
C = strsplit('cat dog goat')
C =
'cat' 'dog' 'goat'
- The strcmp function can find a word within a list of strings in a cell array. It compares the values and returns a logical array:
strcmp(C,'goat')
ans =
0 0 1
- The deblank function removes trailing white space:
data = deblank(data)
- Processing Data in Blocks:
- You may need to programmatically adjust the format specification string that you pass to textscan. The repmat function can help:
formatSpecString = repmat('%q',1,5)
formatSpecString =
'%q%q%q%q%q'
fstr = ['%D' repmat('%f',1,3)]
fstr =
'%D%f%f%f'
- The feof function tests whether the end of a file has been reached. This can be used in a while loop to read data until the end of the file:
while ~feof(fid) ... end
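The low-level functions from this section can be combined into one short sketch that reads a header line and then the data in blocks. The file contents, the column layout, and the block size of one row are hypothetical and kept tiny on purpose.

```matlab
% Write a small demo file; the contents are hypothetical.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'date, unrate, gdp\n');
fprintf(fid,'a 1.5 2.5\nb 3.5 4.5\n');
fclose(fid);

fid = fopen(fname);                 % open a connection for reading
header = fgetl(fid);                % first line; position moves to line 2
names = strsplit(header,', ');      % split the header on ', '

fstr = ['%q' repmat('%f',1,2)];     % builds '%q%f%f'
labels = {};
values = [];
while ~feof(fid)                    % read one row per call until the end
    block = textscan(fid,fstr,1);
    labels = [labels; block{1}];            % text column
    values = [values; block{2} block{3}];   % numeric columns
end
fclose(fid);
delete(fname);                      % clean up the temporary file
```

Growing labels and values inside the loop is fine for a demo; for large files it is usually better to read bigger blocks per call.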
Review of data types
- Extract Data from a Table:
- Use curly braces {} to extract the contents of the table, for example as a numeric array
- Use parentheses () to extract the data as a table
- Merge data:
- Key variables are the variables that are common to all sources and uniquely identify each observation, or row. join merges two tables on their key variables:
T12 = join(T1,T2)
- innerjoin keeps only the observations whose key values are common to both tables:
C = innerjoin(A,B);
- outerjoin includes every observation (row) from both tables:
C = outerjoin(A,B)
- Setting the MergeKeys option to true returns a table in which the key values of A and B are merged into one variable in C:
C = outerjoin(A,B,'MergeKeys',true);
- Represent Dates and Times:
- To convert a cell array of date strings to a datetime array, use the datetime function:
dates = datetime(dates);
- The hour function returns the hour component of each datetime value.
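A compact sketch of the table and datetime operations reviewed here. The two small tables, their variable names (ID, A, B), and the date strings with an explicit 'InputFormat' are hypothetical.

```matlab
% Two hypothetical tables sharing the key variable ID
T1 = table([1;2;3],[10;20;30],'VariableNames',{'ID','A'});
T2 = table([2;3;4],[200;300;400],'VariableNames',{'ID','B'});

colA = T1{:,'A'};      % curly braces: the underlying numeric array
subT = T1(:,'A');      % parentheses: a smaller table

Cin  = innerjoin(T1,T2);                    % only IDs 2 and 3
Cout = outerjoin(T1,T2,'MergeKeys',true);   % IDs 1 to 4, one ID variable

% Convert date strings to datetime and pull out the hour
dateStrs = {'2020-01-15 08:30:00';'2020-02-20 17:45:00'};
dates = datetime(dateStrs,'InputFormat','yyyy-MM-dd HH:mm:ss');
h = hour(dates);
```

Rows of Cout that have no match in one of the tables get NaN in the variables coming from that table.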