RecordCollection(CollectionWithIDs)

class metaknowledge.RecordCollection(inCollection=None, name='', extension='', cached=False, quietStart=False)

A container for a large number of individual records.

RecordCollection provides ways of creating Records from an isi file, string, list of records or directory containing isi files.

If there are issues during creation, the RecordCollection is declared bad: bad is set to True, and it will then mostly return None or False. The attribute error contains the exception that occurred.

A RecordCollection also has a name attribute, which is also what __repr__() returns. It is used to auto-generate the names of files and can be set at creation. Note, though, that any operation that modifies the RecordCollection’s contents will update the name to include what occurred.

Customizations

The Records are contained within a set, so many of the set operations are defined (pop, union, in, …). Records are hashed by their WOS string, so no duplication can occur. The comparison operators <, <=, >, >= are based strictly on the number of Records within the collection, while equality looks for an exact match on the Records.
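The semantics described above can be illustrated with a minimal sketch. TinyCollection is a hypothetical stand-in written for this example, not metaknowledge's implementation, and plain strings stand in for Records hashed by their WOS string.

```python
from collections.abc import MutableSet


class TinyCollection(MutableSet):
    """Hypothetical set-backed container; not the library's own class."""

    def __init__(self, items=()):
        self._items = set(items)

    # Container methods required by MutableSet.
    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    # Mutators required by MutableSet.
    def add(self, item):
        self._items.add(item)

    def discard(self, item):
        self._items.discard(item)

    # Ordering compares sizes only, as the text describes.
    def __lt__(self, other):
        return len(self) < len(other)

    def __le__(self, other):
        return len(self) <= len(other)

    def __gt__(self, other):
        return len(self) > len(other)

    def __ge__(self, other):
        return len(self) >= len(other)

    # Equality requires exactly the same members.
    def __eq__(self, other):
        return isinstance(other, TinyCollection) and self._items == other._items


a = TinyCollection(["WOS:0001", "WOS:0002", "WOS:0002"])  # the duplicate collapses
b = TinyCollection(["WOS:0003"])
merged = a | b  # union comes from the MutableSet mixin methods
```

Here b < a holds because b has fewer members, even though b is not a subset of a.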

__init__

inCollection is the object containing the information about the Records to be constructed. It can be an isi file, string, list of records or directory containing isi files.

Parameters

inCollection : optional [str] or None

the name of the source of WOS records. It can be skipped to produce an empty collection.

If a file is provided, it is first checked to see if it is a WOS file (the header is checked). Records are then read from it one by one until the ‘EF’ string is found, indicating the end of the file.

If a directory is provided, each file in the directory is first checked for the correct header, and all those that have it are then read like individual files. The records are then collected into a single set in the RecordCollection.
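The file-reading steps can be sketched as follows. This is a simplified, hypothetical reader and not the library's parser. It assumes the usual WOS plain-text layout, in which the header line starts with 'FN', a 'VR' version line follows, and each record ends with an 'ER' line. Only the header check and the 'EF' terminator come from the description above.

```python
def read_records(lines):
    """Yield each record as a list of its lines (hypothetical helper)."""
    iterator = iter(lines)
    header = next(iterator, "")
    if not header.startswith("FN "):  # assumed header marker
        raise ValueError("not a WOS file: header line missing")
    record = []
    for raw in iterator:
        line = raw.rstrip("\n")
        if line == "EF":  # end of file marker
            break
        if line == "ER":  # assumed end-of-record marker
            yield record
            record = []
        elif line and not line.startswith("VR "):
            record.append(line)


sample = """FN Thomson Reuters Web of Science
VR 1.0
PT J
AU Doe, J
TI First title
ER

PT J
AU Roe, R
TI Second title
ER

EF
"""

records = list(read_records(sample.splitlines()))
```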

name : optional [str]

The name of the RecordCollection, defaults to the empty string. If left empty, the name of the RecordCollection is set to the name of the file or directory used to create the collection. If provided, the name is set to name.

extension : optional [str]

The suffix searched for when a directory is read for files. By default it is empty, so all files are read.

cached : optional [bool]

Default False. If True and inCollection is a directory (a string giving the path to a directory), the initialized RecordCollection will be saved in the directory as a Python pickle with the suffix '.mkDirCache'. If the RecordCollection is then initialized a second time, it will be recovered from that file, which is much faster than reparsing every file in the directory.

metaknowledge saves the names of the parsed files as well as their last modification times and will check these when recreating the RecordCollection, so modifying existing files or adding new ones will result in the entire directory being reanalyzed and a new cache file being created. The extension given to __init__() is taken into account as well and each suffix is given its own cache.
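The invalidation rule can be sketched as below. This is an illustration under stated assumptions and not metaknowledge's code: snapshot, load_or_parse and the cache file name are hypothetical, and parse stands in for the real record parser. The idea shown is that a cache is reused only when the stored file names and modification times match the directory's current state.

```python
import os
import pickle
import tempfile

CACHE_SUFFIX = ".mkDirCache"


def snapshot(directory, extension=""):
    """Map each matching file name to its last modification time."""
    result = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(CACHE_SUFFIX) or not os.path.isfile(path):
            continue
        if name.endswith(extension):
            result[name] = os.path.getmtime(path)
    return result


def load_or_parse(directory, parse, extension=""):
    """Return (result, cache_was_used); one cache file per extension."""
    # Hypothetical naming scheme for the per-extension cache file.
    cache_path = os.path.join(directory, "cache" + extension + CACHE_SUFFIX)
    current = snapshot(directory, extension)
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as handle:
            saved_snapshot, saved_result = pickle.load(handle)
        if saved_snapshot == current:
            return saved_result, True
    # Names or modification times differ, so reparse and rewrite the cache.
    result = parse(sorted(current))
    with open(cache_path, "wb") as handle:
        pickle.dump((current, result), handle)
    return result, False


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.isi"), "w") as handle:
        handle.write("placeholder")
    first = load_or_parse(folder, list, ".isi")
    second = load_or_parse(folder, list, ".isi")
    with open(os.path.join(folder, "b.isi"), "w") as handle:
        handle.write("placeholder")
    third = load_or_parse(folder, list, ".isi")
```

The second call reuses the cache, while adding b.isi forces a full reparse on the third call.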

Note Loading a pickle allows arbitrary Python code execution, so only use caches that you trust.

__init__(inCollection=None, name='', extension='', cached=False, quietStart=False)

Basically a collections.abc.MutableSet wrapper for a set with a bunch of extra record keeping attached.

citeFilter(keyString='', field='all', reverse=False, caseSensitive=False)

Filters Records by some string, keyString, in their citations and returns all Records with at least one citation possessing keyString in the field given by field.

dropNonJournals(ptVal='J', dropBad=True, invert=False)

Drops the non-journal Records from the collection. This is done by checking ptVal against the PT tag.

findProbableCopyright()

Finds the (likely) copyright string from all abstracts in the RecordCollection

forBurst(tag, outputFile=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, stemmer=None)

Creates a pandas-friendly dictionary with 2 columns, 'year' and 'word'. Each row is a word that occurred in the field given by tag in a Record, together with the year of that Record. Unfortunately, getting the month or day with any accuracy has proved to be impossible, so year is the only option.

forNLP(outputFile=None, extraColumns=None, dropList=None, lower=True, removeNumbers=True, removeNonWords=True, removeWhitespace=True, removeCopyright=False, stemmer=None)

Creates a pandas-friendly dictionary with each row a Record in the RecordCollection and the columns the fields that natural language processing uses (id, title, publication year, keywords and the abstract). By default the abstract is processed to remove non-word, non-space characters, and the case is lowered.

genderStats(asFractions=False)

Creates a dict ({'Male' : maleCount, 'Female' : femaleCount, 'Unknown' : unknownCount}) with the numbers of male, female and unknown names in the collection.

getCitations(field=None, values=None, pandasFriendly=True, counts=True)

Creates a pandas-ready dict with each row a different citation from the contained Records and columns containing the original string, year, journal, author’s name and the number of times it occurred.

There are also options to filter the output citations with field and values.

localCiteStats(pandasFriendly=False, keyType='citation')

Returns a dict with all the citations in the CR field as keys and the number of times they occur as the values

localCitesOf(rec)

Takes in a Record, WOS string, citation string or Citation and returns a RecordCollection of all records that cite it.

makeDict(onlyTheseTags=None, longNames=False, raw=False, numAuthors=True, genderCounts=True)

Returns a dict with each key a tag and each value a list holding that tag's value for every Record in the collection. None is given when a Record has no value for the tag, and the Records appear in the same order in every list.

When used with pandas: pandas.DataFrame(RC.makeDict()) returns a data frame with each column a tag and each row a Record.
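The column-oriented shape described above can be illustrated without the library. In this sketch make_dict_sketch is a hypothetical helper and plain dicts stand in for Records. It only shows how None placeholders keep every list aligned by position.

```python
def make_dict_sketch(records, tags):
    """Build {tag: [value per record]}, using None for missing values."""
    return {tag: [record.get(tag) for record in records] for tag in tags}


records = [
    {"TI": "First title", "PY": 2001, "AU": ["Doe, J", "Roe, R"]},
    {"TI": "Second title", "AU": ["Poe, E"]},  # this record has no PY value
]

table = make_dict_sketch(records, ["TI", "PY", "AU"])

# Values at the same index belong to the same record, so rows can be rebuilt.
rows = list(zip(*table.values()))
```

Because every list has the same length and order, a dict of this shape can be handed directly to a DataFrame constructor.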

networkBibCoupling(weighted=True, fullInfo=False, addCR=False)

Creates a bibliographic coupling network based on citations for the RecordCollection.

networkCitation(dropAnon=False, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, directed=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, recordToCite=True, addCR=False, _quiet=False)

Creates a citation network for the RecordCollection.

networkCoAuthor(detailedInfo=False, weighted=True, dropNonJournals=False, count=True, useShortNames=False, citeProfile=False)

Creates a coauthorship network for the RecordCollection.

networkCoCitation(dropAnon=True, nodeType='full', nodeInfo=True, fullInfo=False, weighted=True, dropNonJournals=False, count=True, keyWords=None, detailedCore=True, detailedCoreAttributes=False, coreOnly=False, expandedCore=False, addCR=False)

Creates a co-citation network for the RecordCollection.

rpys(minYear=None, maxYear=None, dropYears=None, rankEmptyYears=False)

This implements Referenced Publication Years Spectroscopy, a technique for finding important years in citation data. The authors of the original papers have a website with more information.

This function computes the spectra of the RecordCollection and returns a dictionary mapping strings to lists of ints. The strings are the names of the columns and each list is a column. The lists are ordered, so the values sharing an index across the lists form a row. This is intended to be read directly by pandas DataFrames.

The columns returned are:

  1. 'year', the years of the counted citations. Missing years are inserted with a count of 0, unless they fall outside the range from the lowest to the highest cited year and the default bounds are used, e.g. if the highest year is 2016, 2017 will not be inserted unless maxYear has been set to 2017 or higher
  2. 'count', the number of times the year was cited
  3. 'abs-deviation', the deviation from the 5-year median, calculated as the absolute deviation of the year's count from the median of the counts for that year, the preceding 2 years and the next 2 years
  4. 'rank', the rank of the year, the highest ranked year being the one with the highest deviation, the second highest being the one with the second highest deviation and so on. All years with a count of 0 are given the rank 0 by default
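
The four columns above can be reproduced with a short sketch. This is an illustration under stated assumptions and not the library's implementation: rpys_sketch is a hypothetical function, the 5-year window is simply truncated at the ends of the range, and rank 1 is given to the largest deviation. The source text does not specify either of the last two points.

```python
from collections import Counter
from statistics import median


def rpys_sketch(cited_years, min_year=None, max_year=None):
    """Return 'year', 'count', 'abs-deviation' and 'rank' columns."""
    counts = Counter(cited_years)
    low = min(counts) if min_year is None else min_year
    high = max(counts) if max_year is None else max_year
    years = list(range(low, high + 1))
    # Years with no citations inside the range get a count of 0.
    count_col = [counts.get(year, 0) for year in years]
    deviation_col = []
    for i, count in enumerate(count_col):
        # Assumption: the window shrinks near the first and last years.
        window = count_col[max(0, i - 2): i + 3]
        deviation_col.append(abs(count - median(window)))
    # Assumption: rank 1 is the largest deviation; zero counts keep rank 0.
    cited = [i for i, count in enumerate(count_col) if count > 0]
    cited.sort(key=lambda i: deviation_col[i], reverse=True)
    rank_col = [0] * len(years)
    for rank, i in enumerate(cited, start=1):
        rank_col[i] = rank
    return {
        "year": years,
        "count": count_col,
        "abs-deviation": deviation_col,
        "rank": rank_col,
    }


spectrum = rpys_sketch([1990, 1990, 1990, 1992, 1993, 1993])
```

With truncated windows of even length the median, and so the deviation, may be a non-integer value.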

writeBib(fname=None, maxStringLength=1000, wosMode=False, reducedOutput=False, niceIDs=True)

Writes a BibTeX entry to fname for each Record in the collection.

If the Record is of a journal article (PT J), the BibTeX type is set to 'article'; otherwise it is set to 'misc'. The ID of the entry is the WOS number, and all the Record’s fields are given as entries with their long names.

Note This is not meant to be used directly with LaTeX: none of the special characters have been escaped, and a large number of unnecessary fields are provided. niceIDs and maxStringLength are provided only to make conversions easier.

Note Record entries that are lists have their values separated with the string ' and ', as this is the form BibTeX understands.
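A minimal sketch of the entry layout described above, assuming nothing about the library's actual output beyond what the text states. bib_entry_sketch and the field names in the example are hypothetical, and no special characters are escaped here either.

```python
def bib_entry_sketch(wos_id, pub_type, fields):
    """Format one entry: 'article' for PT J, 'misc' otherwise."""
    entry_type = "article" if pub_type == "J" else "misc"
    lines = ["@{}{{{},".format(entry_type, wos_id)]
    for name, value in fields.items():
        if isinstance(value, list):
            # List values are joined with ' and '.
            value = " and ".join(str(item) for item in value)
        lines.append("  {} = {{{}}},".format(name, value))
    lines.append("}")
    return "\n".join(lines)


entry = bib_entry_sketch(
    "WOS:000000000000001",  # placeholder ID
    "J",
    {"title": "A title", "author": ["Doe, Jane", "Roe, Richard"]},
)
other = bib_entry_sketch("WOS:000000000000002", "B", {"title": "A book"})
```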

writeCSV(fname=None, splitByTag=None, onlyTheseTags=None, numAuthors=True, genderCounts=True, longNames=False, firstTags=None, csvDelimiter=', ', csvQuote='"', listDelimiter='|')

Writes all the Records from the collection into a csv file with each row a record and each column a tag.

writeFile(fname=None)

Writes the RecordCollection to a file. The written file’s format is identical to that of files downloaded from WOS. The order of the Records written is random.

yearSplit(startYear, endYear, dropMissingYears=True)

Creates a RecordCollection of Records from the years between startYear and endYear inclusive.