Practical AI Transparency: Revealing Datafication and Algorithmic Identities

How does one do research on algorithms and their outputs when confronted with inherent algorithmic opacity and black box-ness, as well as with the limitations of API-based research and the data access gaps imposed by platforms’ gate-keeping practices? This article outlines the methodological steps we undertook to manoeuvre around these obstacles. It is a “byproduct” of our investigation into datafication and how algorithmic identities are produced for personalisation, ad delivery and recommendation. Following Paßmann and Boersma’s (2017) suggestion to pursue “practical transparency” and to focus on particular actors, we experiment with different avenues of research. We develop and employ an approach of letting the platforms speak and making the platforms speak. In doing so, we also use non-traditional research tools, such as transparency and regulatory tools, and repurpose them as objects of/for study. Empirically testing the applicability of this integrated approach, we elaborate on the possibilities it offers for the study of algorithmic systems, while remaining aware of its limitations and shortcomings.


INTRODUCTION
Today, there is almost no area of everyday life that has not been mediated or impacted by Artificial Intelligence (AI). From recommender systems for news, apps, routes and products, to job applications, financial services, health care, education, criminal justice, etc., individuals are increasingly, to a lesser or greater degree, subjected to automated decision making (ADM) by some kind of algorithmic and AI system. More and more decisions impacting individuals are based on what we can call 'algorithmic identity' (Cheney-Lippold, 2011; but also Jarrett, 2014; Reigeluth, 2014), guided by extensive profiles about people, uncovering their affinities and interests and predicting their behaviour. With the ubiquity of these ADM and AI systems, it becomes a matter of urgency to be able to investigate them, reveal their workings, and explain their outputs and impact.
We could say that this article is a "byproduct" of our attempt to investigate how algorithmic identities are being produced by a few sampled actors (Facebook, Google, Quantcast and Oracle) participating in the process of algorithmic identity building for personalisation, ad delivery and recommendation. For us that meant a critical investigation of the inner workings of algorithmic systems and of the datafication practices that enable the creation of algorithmic identities: in particular, the actors participating in these processes, the types of data that are used, the sources of data and, importantly, their relation to the inference-making processes that are the building blocks of algorithmic identities. But an analytical inquiry like this confronted us with the question of how to investigate these issues. How does one do research on algorithms and their outputs when confronted with inherent algorithmic opacity and black box-ness, as well as with the limitations of API-based research and the data access gaps imposed by platforms' gate-keeping practices? How does one overcome and manoeuvre around the limitations when dealing with data provided by or extracted from these platforms, while being aware and critical of the 'methodological bias' (Marres and Gerlitz, 2015), the un-rawness of data (Gitelman, 2013) and the level of mediation (van Es et al., 2013)? This led to our research focus: investigating how an individual algorithmic identity is created by a few different platforms. To achieve this, we experiment with novel avenues for methodologically investigating this process and the underlying datafication processes and practices. In doing so, we also contribute towards answering the question of how to uncover and explain datafication and algorithmic identity.
This paper documents and discusses our attempts and experiments in studying and researching algorithms while doing digital social research (Lindgren, 2019). We rely on data collection methods that manoeuvre around API restrictions and make use of non-traditional data sources, such as transparency and regulatory tools. We do this using a mixed method design for which we developed and adopted two approaches: letting the platforms speak and making the platforms speak. This led to an investigation on two levels, interface and software, employing two corresponding overarching methods, technography and digital methods (Rogers, 2017). Our experience and results with this methodological setup show that, while there are limitations, the approach enables an in-depth critical inquiry and generates valuable new insights into the processes of algorithmic construction of identity, data extraction and inferential analytics.
What follows is an outline of the techniques we employed to work around the limitations we encountered when dealing with platforms' algorithmic systems for the purpose of research. We see this approach as just one possible path for doing research on algorithms, AI and platforms. As such, it is not meant to be taken as a generalizable, "apply-to-all" approach; it aims foremost to inspire, test and experiment, and to explore the possibilities and limits of different approaches and tools. We will elaborate on the rationale behind the approach, the methods chosen, the particular tools and research protocols applied, as well as the specific steps taken. First, we discuss some recent developments in digital social research, the restrictions imposed by platforms and the very nature of algorithms, and elaborate on how these impact the ability to do digital social research. This is followed by an outline of our methodological design and rationale, detailing the particular steps and tools we used. A substantial part is dedicated to the elaboration of our results. We conclude with a discussion of the advantages and limitations of the particular methodological choices and tools.

PRACTICAL AI TRANSPARENCY
The networked infrastructure of the internet, with its technological capacity to track user movements across different web sites, apps and servers, has given rise to an industry of web analytics firms that are actively amassing information on individuals and fine-tuning computer algorithms to make sense of that data. Via the process of datafication, 'the transformation of the social actions of their users to quantified data' (Mayer-Schönberger and Cukier, 2013, p. 78), and the collection of data via tracking technologies, combined with the analytics capabilities of algorithms and companies, the aim of many of these companies is to create what Cheney-Lippold (2011) calls 'algorithmic identity': "an identity formation that works through mathematical algorithms to infer categories of identity on otherwise anonymous beings" (p. 165). Datafication can be understood both as action and as aim. As an action, it means "transformation of social action into online quantified data, thus allowing for real-time tracking and predictive analysis" (van Dijck, 2014, p. 198). As an aim, it relates to the pursuit to collect, monitor, analyse, understand and use people's behaviour for behaviour prediction and affinity profiling, but also for 'unstated preset purposes' (van Dijck, 2014, p. 205). Raley (2013) calls the latter 'data speculation', i.e. a value yet to be added to the data and 'informational patterns still to come' (p. 123). This is closely tied, first, to the belief "in the objective quantification and potential tracking of all kinds of human behaviour and sociality through online media technologies" (ibid., p. 198), which van Dijck refers to as 'dataism', and second, to the 'collect everything' approach (van Dijck, 2014; Sadowski, 2019; Andrejevic & Gates, 2014). The creation of an algorithmic identity is possible because of the process of datafication, and datafication and dataism are the building blocks for behaviour prediction and affinity profiling, used, among other things, for targeted advertising and personalisation. However, this algorithmic identity is a construct: "'it is not the personal identity of the embodied individual but rather the actuarial or categorical profile of the collective which is of foremost concern' to new, unenclosed surveillance networks" (Hier, 2003: 402 in Cheney-Lippold, 2011, p. 177). So how do we investigate the process of algorithmic identity and the underlying processes of datafication?
In his recent article, Axel Bruns (2019) (rightfully) states that the APIcalypse has arrived and that it seriously impacts our ability as social science researchers to critically study society via the digital. The APIcalypse restricts who has access to the platforms' data via their Application Programming Interfaces (APIs), making access to data either impossible or possible only for the chosen few and under strict conditions. Hence it limits the possibilities to inspect and investigate phenomena happening "in the digital". The importance of this gatekeeping is even greater if we consider that the online is never a separate realm, as decisions made about individuals based on the digital traces they leave behind can impact their offline lives as well. Being severely restricted in investigating the digital and the algorithmic seriously impacts the ability of researchers and scientists to investigate and criticise these systems, hold their proprietors to account and remedy their outputs.
To borrow the definition by Venturini and Rogers (2019), API based research is an approach to computational social sciences and digital sociology based on the extraction of records from the datasets made available by online platforms through their application programming interfaces (or APIs). This type of research has allowed the collection of detailed information on large populations, thereby effectively challenging the distinction between qualitative and quantitative methods (p. 1).
As such, this approach enabled the study of a variety of phenomena pertaining to the interplay and mutual influence of technology and society, and yielded numerous findings not previously possible at such a large scale. However, it is neither the only approach undertaken or proposed by researchers and scholars for studying the digital, nor the best one. As Venturini (2018) notes, 'when all you have is a Twitter feed, everything looks like a hashtag' (p. 4210). This refers to the limitations imposed by the affordances of the platforms when we see them as objects of research. We use these statements as an entry point to discuss some of the approaches developed for studying the relationship between the digital, the algorithmic and the societal, together with their limitations and shortcomings. In the paragraphs that follow we briefly outline some of them, and then outline our own methodological approach as a response to both the APIcalypse and the dominant discourse of API-dependability for research.
The approach of auditing algorithms was proposed by Sandvig et al. (2014) and entails different techniques to uncover the inner workings of algorithmic agents. Depending on the infrastructure and affordances of the system, the objective (input, output or system) and the available resources, these techniques range from relying on APIs and the use of software and hardware infrastructures to users' input, in order to investigate either the code or the outputs of the system. Weltevrede (2016) talks about the adoption of a device-driven approach as a way to focus on 'the specific strategies or intents embedded in algorithms' (p. 106) and to repurpose the 'analytical affordances of the algorithmic systems/devices' (ibid.). Because algorithms are techno-epistemological devices, the analytical inquiry is dependent on the system's affordances, that is, on what the system allows to be seen and what it keeps from view. As such, it requires a combination of different types of methodological and conceptual resources to study device-captured data points. This approach shares similarities with reverse engineering, a diagnostic approach that allows for an observation of the relationship between inputs and outputs, and a way to obtain 'missing knowledge' (Bucher, 2012, p. 79) and grasp a model of how the particular algorithmic system works. As a strategy to see what the algorithm pays attention to, it is a "process of articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works" (Diakopoulos, 2014, p. 13).
All these approaches are characterised by a move away from the quest to open the black box and towards investigating algorithms in action, at work, in practice. It is a quest towards 'unknowing algorithms' (Bucher, 2018; Ananny and Crawford, 2016), studying them as 'part of specific situations' (Bucher, 2018, p. 49) and uncovering the actor-network assemblages/configurations (Ananny and Crawford, 2016). By observing the effects of the system, researchers are able to overcome the obstacles of the "black box" and to assess the 'operational principles of systems' (Bucher, 2012, p. 77) and their actual workings. Additionally, investigating algorithms as an assemblage(s), to borrow Ananny and Crawford's (2016) suggestion, is to look at them as a system and across a/the system.
However, this doesn't solve (all of) the difficulties of socially investigating algorithms. Algorithms are predominantly patent-protected and proprietary software, with their inherent opacity stemming from the underlying machine-learning process at work. It is never a single algorithm, but always an algorithmic system of interconnected and interrelated algorithms (Gillespie, 2014; Bucher, 2018). In addition, these systems are in a 'perpetual beta' state (Weltevrede, 2016), with constant and continuous A/B testing, fine tuning and upgrades, making the study of algorithmic systems almost a study of a 'historical object' (Bucher, 2012). All this, coupled with the research affordances (Rogers, 2013) of, and the restricted access to, the platforms' APIs and code, crucially limits digital social research and complicates the task of developing and applying the appropriate methodological apparatus and tools.
Faced with these inherent characteristics, the calls for transparency of algorithmic systems, initially aiming towards total transparency, have shifted their focus significantly. Paßmann and Boersma (2017) differentiate between two notions of transparency: formalised transparency, which seeks to see more of the content of the black box and 'obtain more positive knowledge' (p. 140), and practical transparency, which does not try to open the black box, but to 'develop skills without raising the issue of openability' (ibid.). These skills should help researchers deal with the (parts of the) algorithms about which we still have no knowledge, and probably never will. Thus, the aim is actually to ask and investigate how to 'behave towards what remains black after all' (ibid., p. 140). In order to find ways to work around these unknowns, the authors suggest other sources, external to the algorithms, that will help turn 'unknown unknowns in to known unknowns' (p. 145), such as ethnographic data or other sources of everyday knowledge. Our research follows the principle of practical transparency.

METHODOLOGICAL PATHWAYS THROUGH THE (ALGORITHMIC) SYSTEM
In his book Design research and the new learning, Buchanan (2001) states: By definition, a system is the totality of all that is contained, has been contained, and may yet be contained within it. We can never see or experience this totality. We can only experience our personal pathway through a system (p. 12).
This corresponds with the methodological and sampling approach that we adopt in our empirical research: zooming in on a few platforms but looking at the wider system/assemblages of actors participating in the creation of an algorithmic identity of a single individual. While focusing on a research subject of one, we also expand our research to the other social and technological actors partaking in the process. In the following section we elaborate on the methodological setup and discuss the specific aspects we took into consideration and the limitations and opportunities we faced.
Methodological design. Our methodological approach is the result of a two-way process. First, we built our research on an assessment of the analytical affordances of the platforms in our study and of the mechanisms and tools known and available to the researchers. We tested and experimented with a variety of digital methods and tools, ranging from API-access ones to scraping ones. Through a process of going back and forth, we finalised the list of tools based on their applicability to the research questions and their particular affordances, while constantly being aware of their level of mediation (van Es et al., 2018).
Next, we experimented with the method of an interface walkthrough, where we mimicked and rehearsed ordinary use (a researchers-as-users perspective) (Dieter and Tkacz, 2020). In that way we investigated what could be collected and used as data for research through what was available via the interface of the platforms. However, if we were to experiment with "out of the (black) box" approaches and tools, we had to think both more critically and more creatively. In doing so, we "took advantage" of the newly established regulatory and transparency mechanisms and repurposed them as objects/tools for study. The platforms we queried have developed and made available (confined) gateways to transparency and explainability, as an attempt to provide more information on data collection and personalisation practices. We decided to experiment with these transparency and accountability tools and see if we could repurpose them as objects for study. Additionally, we were curious to investigate how Article 15 of the General Data Protection Regulation (GDPR), its corresponding recitals and in particular the Data Subject Access Request mechanism (European Commission, 2016) could be used for academic research.
Approach. Faced with the above-mentioned challenges, one of the strategies we created, tested and employed was to work with what is available and to be creative in finding ways to do research, relying on the affordances of the platforms themselves and repurposing transparency and regulatory tools as objects for/of study. 1 We define these approaches as letting the platforms speak and making the platforms speak, focusing on achieving practical transparency (Paßmann and Boersma, 2017) through investigating algorithms in action and studying them by observing their outputs and effects.
Sampling. The insights collected and discussed in this paper are the result of data originating from one research subject (n=1). The data was collected via different means over a period of six months. 2 Choosing the personalised, one-research-subject-only approach allows for the observation of real user-algorithmic agent interactions, where "pre-existing profiles, browsing histories, technology fingerprints, and other organically developed profile information are used" (Bodo et al., 2018, p. 143). This real-world observation is advantageous in comparison with the use of sock puppet audits or dummy users, as it overcomes the shortcomings of 'non-adequate approximations of real-life users' (ibid.), allowing for investigation of the effects of algorithmic agents on individual users (ibid., p. 144). As such, the detailed (data) account of a personalised experience offers an overview of 'the whole spectrum of online and offline, personalised and nonpersonalised information flows' (ibid., p. 145). Additionally, the insights offered by small data allow for more context-aware research and for granularity and depth of the data and the findings, achieved by combining various methods, complementing data and triangulating the findings (Crawford, 2013). As the method and type of data should follow the research question (Van Es et al., 2017), small data gathered using digital data analysis enables a qualitative and contextualised investigation (Kitchin, 2014).
Focusing on algorithms in action around a user (a real individual, with a browsing history and data scattered around the digital space and different online and offline databases) who exhibits real-life behaviour and for whom information "in the wild" already exists enables not just non-lab experimentation, but also taking full advantage of additional non-traditional research tools, such as Data Subject Access Requests. We are aware that one of the difficulties with the auto-technographic approach is that it is highly individualised and personalised, "as the observation of the interface is confined to the 'me-centric view of the researcher's own account" (Weltevrede, 2016, p. 107). However, this approach both enables us to manoeuvre around the "black-boxed" systems and follows the advice by boyd and Crawford (2012) that 'the size of data should fit the research question being asked' (p. 670).
Interests, Interactions with Advertisers (Facebook), Advertisers that have uploaded contact details (Facebook), Why am I seeing this Ad (Facebook), assigned interests by Google and reasons for assigning them. Data was collected in the period of November 27, 2018 to June 6, 2019, with different recording periods for different insights, following the browsing behaviour of one research subject. A research web browser was set prior to the start of the data collection process.
2 Data was collected in the period from November 27, 2018 to June 6, 2019, from Brussels, Belgium. The data collection period however differs between the different tools used and the related observation. This is elaborated in more detail in the sections related to each particular tool.

RESEARCHING ALGORITHMS IN ACTION
The letting the platform speak approach relies on what the platforms themselves allow to be seen and to be visible at an interface level, without the assistance of additional data collection tools, relying on the affordances of the presentation layer of the platforms and their front-end. Literally, it means looking at what information platforms willingly provide and reveal via the user interface. This approach also helps uncover the platform's politics of visibility, i.e. what the algorithmic system itself decides to make visible and the insights it permits willingly. In addition to the focus on the interface, this approach entails the use of externally available sources that describe and reveal the workings of the system (Bucher, 2012a, p. 74): technical documentation, specifications, patents, media talks, but also help sections for users and advertisers. What we did in a novel way, however, and where we add to the repository of research methods, is to use the transparency tools enabled by platforms (such as Ad Settings, Data Explanations and Ad Explanations), the privacy policies and the Data Subject Access Requests enabled by the GDPR. In that sense, we employed a 'multisite technography' (Bucher, 2012, p. 73): as algorithmic systems are always assemblages and always in interaction with other actors and systems, be they technical or human, all these "sites" can be used as sources of data and insights for digital social research. In terms of data collection methods, we adopted technography, as defined by Taina Bucher (2012): "a descriptive-interpretative approach to the understanding of software, rooted in a critical reading of the mechanisms and operational logic of technology" (p. 71). This is employed via observation, where the daily changes in the information provided by the platforms are observed and recorded. This approach was chosen because it allows for a granular, detailed dossier of the interaction and communication between the user and the system, and it enables insights into the actors they are in communication with and into 'the interplay between a diverse set of actors (both human and nonhuman)' (Bucher, 2012, p. 69). This is especially important for the investigation of the actor-network around the data collected and the sources used for affinity profiling and algorithmic identity-building, their position within the network and in 'particular sociotechnical events' (Latour, 2005: 128 in Bucher, 2012, p. 72).
The making the platform speak approach, on the other hand, looks for insights not by relying on what the software willingly makes visible, but by forcing the software to reveal itself and its inner workings. It relies on the use of automated scraping and crawling tools and of tools relying on platforms' Application Programming Interfaces (APIs). In that sense it also makes visible the politics of knowledge of the platform, i.e. what the platform allows to be known if one has the knowledge and tools to seek it. While this can be more insightful, it is still limited. This approach aims to make the system reveal itself, in order to gain more in-depth knowledge or insights by looking not just at how it produces outputs, but also at things not visible at an interface level and to the human eye. In that regard, this is an analysis done at a software level. This approach implies that the algorithmic devices and systems will be forced to speak, meaning that the "analytical gaze" goes beneath the surface of what is visible and tries to uncover some of the inner workings of these systems.
We specifically set up a research browser through which the platforms and other actors would be able to gather as much information as possible on the behaviour, actions and patterns of behaviour of the research subject and thus provide personalised search results, ads and recommendations. This enabled us to investigate, as objectively as possible, the datafication practices and the creation of algorithmic identity, while being aware of the multitude of factors affecting data collection and algorithmic outputs in the form of personalisation. In addition, we were able to further investigate the assigned algorithmic identity via the outputs provided both by the search engine (Google) and browser (Chrome) used and by the platforms visited during the data collection phase. 3 Steps were also undertaken to allow for as much data collection and data sharing between Facebook and third parties as possible, by setting up the preferences, permission and settings options. 4
3 We set up the research browser by installing a "clean browser", deleting all the previous cookies, browsing history and preferences, and setting the preferences to allow for maximum data collection by the platform and associated third parties: cookies were enabled, keeping record of web and app activity and location was enabled (location history, device information - info about contacts, calendars, apps, and other device data to improve users' experience across Google services, voice and audio activity, YouTube search history, YouTube watch history), as well as "Chrome browsing history and activity from websites and apps that use Google services" (that includes: activity from sites and apps that partner with Google to show ads; Chrome history (if Chrome Sync is turned on); app activity, including data that apps share with Google; Android usage & diagnostics, like battery level, how often you use your device and apps, and system errors). Ad settings were adjusted too, enabling ad personalisation and giving Google permission to show ads based on the user's activity on Google services (such as Search or YouTube) and on websites and apps that partner with Google to show ads. Whenever consent was asked by websites in regard to data collection (in accordance with the GDPR), consent was given.
4 The steps we took to set up and allow Facebook to maximise the data collection for the research subject were the following: changing the privacy settings and enabling data collection and data-sharing between the Facebook family of companies and services; allowing "Ads based on data from partners" and "Ads based on your activity on Facebook Company Products that you see elsewhere"; allowing Facebook Audience Network. With the setup of the research browser to enable third-party tracking, Facebook was granted access to the full browsing, off-Facebook, behaviour of the research subject.

INVESTIGATING DATAFICATION AND ALGORITHMIC IDENTITIES
We start our analysis by investigating datafication practices and the network of actors around the research subject. This is an important starting point, as the creation of an algorithmic identity relies on behavioural data collected via tracking elements present both on the web 5 and in apps. This step, additionally, guides the further analysis of the process of algorithmic identity creation: what data is seen as a worthy signal and what behaviour is taken as important/proxy for affinity profiling, 'grouping people according to their assumed interests rather than their personal traits' (Wachter, 2019, p. 33), based on proxies (friends, likes, groups, IP address and similar). Importantly, we are interested to see whether only 'raw' data is taken as the basis for inferences or whether there are other (hidden) mechanisms and 'cooked' data (Gitelman, 2013). The structuring of the results follows the same path: we first elaborate on our approach and findings regarding datafication and then focus on methods to investigate and assess algorithmic identity.

Investigating datafication
In order to investigate the formation of an algorithmic identity, our first step was to investigate the datafication practices surrounding a user. This provided us with insights into two interrelated aspects: the sources taken as input for the prediction outputs (the 'qualities, preferences, characteristics, intentions, needs and wants of users' (Lehtiniemi, 2016, p. 4), affinities and interests) and the network of companies that collect (behavioural) data about the user (traces of user actions and interactions), as well as their dominance and variety. For this we used diverse sources of insight, collecting data on different levels (interface and software) and using a mixed method approach. We did this in three consecutive phases: first, we used automated tools to record tracking behaviour and data collection; second, we used privacy policies as a source of information regarding the data collection practices of platforms and companies; lastly, we used transparency and regulatory tools as objects for studying datafication practices.
5 In this research we focus only on tracking datafication actors via web platforms.

Figure 1. Overview of the tools used for investigating datafication and insights gathered
Firstly, by using digital methods and tools, we collected information on the third-party trackers using the browser extension TrackingObserver 6 and the automated web scanner Privacy Score. 7 This was done at a software level.
Both tools offer different insights in correspondence with their aim, affordances and information structure. As a result, they are suitable for different aspects and levels of analysis. Because of its ability to track all browsing behaviour around a particular user, TrackingObserver enables investigation of the network of third-party trackers and companies around a particular user and their unique browsing behaviour. From the data collected during a six-month period, 8 we obtained valuable insights into the network relations and data exchange practices of a multitude of actors.
These data were later used as a source for further investigation. We triangulated the data obtained via the initial data collection with data available from other sources (WhoTracksMe 9 and Better.fyi 10), which enabled us to reveal the companies behind the trackers, analyse their presence, and detect the type of trackers and their particular purpose. The analysis showed the dominance of a few companies in the network, representing the majority of trackers on the visited websites (Figure 2). However, we also observed a long tail of many different actors (a large number of trackers, each present on few websites) that captured data about the user's behaviour, supporting similar findings by Binns et al. (2018b). Categorising the detected trackers based on a taxonomy, we discovered a vast and well-developed network of ad networks, accounting for more than half (57.23%) of the detected unique trackers. These findings are important for several reasons. The detected long tail is worrying, as it indicates that a great number of companies obtain some, partial data from the research subject and from users in general. This is even more of a cause for concern because profiling-oriented businesses, faced with a lack of informational awareness and with 'information gaps' (Crain, 2018, p. 91), need to infer data and predict behaviours using analytics and modelling to fill those gaps. If these sources are 'data poor', the inferences and algorithmic identities (poorly) built on them will inevitably be inaccurate, further affecting automated decision-making processes.

As with TrackingObserver, if we look at the presence of company trackers in the sampled websites, we again encounter a well-developed network, dominated by Google and distantly followed by Amazon, Oracle, Facebook, Conde Nast and Quantcast (Figure 3). The analysis further shows that most third-party trackers are set via cookies (73.41%) and for the purpose of advertising (83.23%) (Figure 4). Cross-referencing data collected via Privacy Score with data from Better.fyi and WhoTracksMe enabled us to detect the purpose of tracking and the tracking type (Figure 4). This additionally confirms that most online surveillance serves to accumulate data for online behavioural advertising, that is, personalised and targeted advertising based on the prediction of interests and on affinity profiling.

The insights collected from these two tools guided the subsequent research steps. It was expected that Google and Facebook would be the most prominent companies. However, the not-insignificant presence of data brokers such as Oracle and Quantcast motivated further investigation into the data these companies hold about the research subject and the algorithmic identity they assigned. Data brokers are important actors since they are businesses whose revenue model revolves around aggregating information about individuals from a variety of public and private sources […] who sell access to the collected data to third parties, including advertisers, marketers, and political campaigns (Venkatadri et al., 2018, p. 1).
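The kind of aggregation behind the company shares and the long tail described above can be illustrated with a minimal Python sketch. The observation records, the company names and the `long_tail_max_sites` threshold are hypothetical; they are neither the study's data nor the export format of TrackingObserver or Privacy Score.

```python
from collections import defaultdict

# Hypothetical records: (website visited, tracker domain, owning company).
observations = [
    ("news.example", "tracker-a.example", "BigCo"),
    ("shop.example", "tracker-a.example", "BigCo"),
    ("news.example", "tracker-b.example", "BigCo"),
    ("shop.example", "tracker-b.example", "BigCo"),
    ("blog.example", "tracker-c.example", "SmallCo"),
    ("shop.example", "tracker-d.example", "OtherCo"),
]

def company_share(records):
    """Percentage of unique trackers that belong to each company."""
    company_of = {tracker: company for _site, tracker, company in records}
    counts = defaultdict(int)
    for company in company_of.values():
        counts[company] += 1
    total = len(company_of)
    return {company: round(100 * n / total, 2) for company, n in counts.items()}

def long_tail(records, long_tail_max_sites=1):
    """Trackers seen on at most `long_tail_max_sites` distinct websites."""
    sites_by_tracker = defaultdict(set)
    for site, tracker, _company in records:
        sites_by_tracker[tracker].add(site)
    return sorted(
        tracker
        for tracker, sites in sites_by_tracker.items()
        if len(sites) <= long_tail_max_sites
    )

shares = company_share(observations)
tail = long_tail(observations)
```

In this toy input, one company owns half of the unique trackers, while two trackers appear on a single website each and so fall into the long tail.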
We investigated their role and the data they hold by looking at what is detectable about the datafication practices of different actors at an interface level. To do so, we experimented with data from less traditional sources: the privacy policies of the most dominant tracking companies detected in the previous step, the transparency tools made available by the actors themselves, and the regulatory tools, namely the Data Access Request mechanisms enabled by Article 15 of the GDPR.
We started with the privacy policies as investigation tools. We sampled two platforms (Google and Facebook) and two data brokers (Oracle and Quantcast) detected previously. To get a better structured initial overview, we used the machine learning tool PriBot 13 to collect data on (1) what kind of data is being collected about users and (2) the reasons for data collection. Although privacy policies can be information-rich sources, we narrowed our analysis to these two aspects only, as they are the most relevant for our research question.

Figure 5. Overview of the type of data and specification of data types in the sampled privacy policies (information source: PriBot)
By analysing and comparing the information obtained from the PriBot tool, we compiled a list of all possible data types that could be collected by these actors (Figure 5). The overview lists all the data types, with their definitions, that were or could be captured about the research subject and further (re)used, and which are of particular interest to the sampled companies. This reveals a dominant 'collect all' approach, in which the (legal) principle of data minimization is not respected and a lot of data that is not necessary for establishing a connection or providing a service is captured.
We additionally analysed and cross-referenced the findings for each of the actors (Figure 6), which allowed us to uncover the relations between the types of data collected under each of the sampled policies, the stated reasons for collection and the actors that collect each type. The analysis shows that the most under-defined category, "Other data", is the most frequently captured, although none of the policies states explicitly what kind of data this is, leaving many doors open for misuse and abuse. Looking at the column with particular actors, it is noticeable that, apart from Google (not unexpectedly), Oracle is the actor that follows most closely in the number of data types it may capture. Figure 6 shows how "messy" data collection is, and how different types of data can be used for various purposes. We can detect, for example, that Personalisation & Customisation is a reason for data collection for all sampled companies, and that the following types of data are used for that purpose: user profile, IP address and device IDs, location, contact information, computer information, generic personal information, cookies and tracking elements, user online activities and other data. For Marketing purposes, companies use financial data, contact information, generic personal information, cookies and tracking elements, user online activities and other data.
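The cross-referencing of data types, stated purposes and actors described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and the actor labels ("Actor A", "Actor B") are hypothetical and do not reproduce the PriBot output or the contents of any real policy.

```python
from collections import defaultdict

# Hypothetical records: (actor, data type, stated purpose).
records = [
    ("Actor A", "location", "Personalisation & Customisation"),
    ("Actor A", "contact information", "Marketing"),
    ("Actor B", "location", "Personalisation & Customisation"),
    ("Actor B", "other data", "Personalisation & Customisation"),
]

def cross_reference(rows):
    """Index the records by purpose and by data type."""
    types_by_purpose = defaultdict(set)
    actors_by_type = defaultdict(set)
    for actor, data_type, purpose in rows:
        types_by_purpose[purpose].add(data_type)
        actors_by_type[data_type].add(actor)
    return types_by_purpose, actors_by_type

def purposes_shared_by_all(rows):
    """Purposes that every actor in the sample states at least once."""
    all_actors = {actor for actor, _type, _purpose in rows}
    actors_by_purpose = defaultdict(set)
    for actor, _type, purpose in rows:
        actors_by_purpose[purpose].add(actor)
    return sorted(p for p, actors in actors_by_purpose.items() if actors == all_actors)

types_by_purpose, actors_by_type = cross_reference(records)
shared = purposes_shared_by_all(records)
```

Indexing the same records in both directions is what makes it possible to ask which data types serve a given purpose and which actors collect a given data type.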

Figure 6. Diagram of data collected and stated reasons for collection across companies (information source: PriBot)
What we call transparency tools are designed by the platforms with a specific purpose in mind: to increase transparency and accountability towards users and regulators (Facebook Newsroom, 2020; Google Blog, 2018). However, here we repurpose them as objects for study in order to investigate datafication practices and sources. For this particular case, we looked into Google's and Facebook's data explanation and ad explanation mechanisms.
Google's Ad preference page 14, for example, shows the interests inferred about each particular user, briefly elaborating on the logic and process behind them. This allows us to investigate where the (behavioural) data originates. With this information, we can see how data is captured and transferred and thus gain insights into the datafication and data sharing network. We followed and recorded the data a few times a week over a period of two and a half months 15, during which we collected 183 distinct interests assigned to the research subject. Our research showed that Google estimates interests by using and/or combining data from: (1) activity on Google services/products; (2) activity on Google combined with activity on other websites and apps; (3) activity on non-Google (outside Google) services; and (4) visits to an advertiser's website/app. 16 This also gives insights into the structuring of information and the degree of (non)disclosure by the platform itself, which affects the degree and scope of possible research insights. However, these systems are highly volatile: when checking the explanations again at the time of writing, we noticed that Google had added one more source, "similarity to other users". For example, for the categorisation "Homeownership Status", Google categorises the research subject as "Renter" with the explanation: "Google estimates this demographic because your signed in activity on Google services (such as Search or YouTube) is similar to people who've told Google they're in this category". Three months earlier, the research subject had been categorised as "Homeowner", based on the same sources (see Figure 7).

Facebook offers more transparency mechanisms, of which we used the data explanations 17 and ad explanations 18. We used these two tools to collect information on the sources of data, the type of data (whether or not personal data) and the actors in the datafication network, as well as, equally important, the mechanisms and sharing practices between the actors in the network.
The insights provided show that Facebook datafies users both on- and off-platform, with the latter being prevalent. Additional sources of insight into the workings of the platform's tracking system, such as the guidelines Facebook itself offers to advertisers, show that this data originates from websites that integrate the Facebook Pixel tracking technology and is handed to the platform by the clients (websites/apps) that integrate it. Clients uploading a contact list to Facebook is another source of data feeding the platform. These two sources (Pixel and List) contain personal data and constitute 68.25% of the off-platform data ending up at Facebook. The only data originating from on-platform behaviour is the data gathered by tracking clicks on ads shown in Facebook's Newsfeed. Recording the data available via the "advertisers who use contact list added to Facebook" tab shows that a very high percentage (75%) of the companies listed collected personal data from other sources, not from the user themselves, without the user's explicit consent and without information about the source being provided. This potentially points to a well-developed network of actors in the (personal) data sharing network.

17 Data explanations provide the user with a list of attributes Facebook has inferred about them, how they were inferred and what information is used to target them with advertisements (see Andreou et al., 2018 for a more detailed explanation of the mechanisms). The data explanations are accessible via an Ad Preferences Page (https://www.facebook.com/ads/settings) and they provide information structured in the following way: Your interests, Advertisers and Businesses, Your information and Ad Settings.
18 Ad explanations provide the user with an explanation of why a particular ad was served. They are accessible via the "Why am I seeing this?" button above every ad served on the user's Newsfeed.
Repurposing Facebook's ad explanation tool, particularly the "Why am I seeing this ad" option, we were able to collect information on the data sources used for personalised ad targeting. 19 We did this on both levels (interface and software), using observation to record the data from the interface and the automated tool AdAnalyst to collect data at a software level. Following an analysis of the explanations provided, we were able to uncover the relations between the sources of data used, the types of data used, the analytical processes at play and the particular reasons for personalised ad targeting, shown in the figure below (Figure 8). For example, if the targeting is based on a particular interest, behavioural data will be used to make that inference. This data could originate either from Facebook (by tracking the activity of the user) or from advertisers and/or data brokers, using inferential and predictive analytics. These analytics methods are used to infer user preferences, attributes and opinions and to predict behaviour (Wachter and Mittelstadt, 2018, p. 4).
Reading, structuring and coding the information collected and recorded provided us with additional insights. Apart from insights into the processes behind the ad-targeting analytics and the input/output relations, it also revealed that data can originate both on- and off-platform, and that it can be volunteered (by the user), obtained (via partners, data brokers and advertisers) or captured by Facebook. Different types of data are taken as signals for affinities/interests. These range from location and age to languages spoken, activities and social neighbourhood, that is, tracking the social network of relations between individuals/users and taking this as a data signal for further affinity profiling and commodification for ad targeting. This 'data inference process' (Andreou et al., 2018, p. 3) is important because it allows the advertising platform to infer users' preferences and attributes, which are later used for affinity profiling and for building an algorithmic identity, and in turn as a basis for commodification (targeted advertising).

Figure 8. Alluvial diagram of sources of data and inferences for Facebook
The last strategy we used for uncovering and investigating the data sources, actors and mechanisms for inferential analytics, prediction and building algorithmic identity was repurposing the Data Subject Access Rights mechanisms as an object for study. Article 15 of the GDPR, in force since May 2018, enables data subjects to request and obtain access to any personal data being held and processed by a data controller. Executed in the correct manner, it should give information on the purposes of the processing, the categories of personal data concerned, the recipients or categories of recipients to whom the personal data have been or will be disclosed, and whether automated decision-making (including profiling) is present. The latter entails providing meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject (European Commission, 2016). Repurposed for academic research, Data Subject Access Requests (DSARs) would give information on the sources of data (categories of personal data concerned), the network of actors with access to the data and the algorithmic identity/affinities assigned by the controller.
Six DSARs were filed, of which only one response (by Oracle) was entirely suitable for analysis. 20 The data obtained from Quantcast, although incomplete, enabled some crucial observations. The first observation pertains to the well-established and wide network of data sharing and the exchange system between data brokers. Oracle relies on six other data brokers to collect data and infer affinities and interests (Eyeota, OnAudience, Lotame, Bombora, AuDigent and Affinity Answers). This complicates the quest of tracking where data originates and where its final destination is, making it difficult to later contest or rectify the data in question. The second observation concerns the risk of inaccurate inferences: if one data broker makes an inaccurate inference, this information is further shared across the ecosystem. Closely inspecting the data provided by Oracle, we observed that some of the inaccurate data Oracle holds originates from Eyeota, which obtained it from Bombora. The reliance on other partners and data brokers is also indicated in the data obtained from Quantcast, in their "Audience Grid" data file, which points to a widely adopted practice. This might have serious consequences for the data subject, resulting not just in erroneous profiling but also (potentially) in effects on their access to services and opportunities.

20 Requests were sent to Bumble, Oracle, Criteo, Quantcast, Facebook and Acxiom. Only Oracle provided data that can be used for the purposes of the research. The file obtained from Quantcast was "unreadable" in the sense that it contained only a few unique rows, duplicated tens of thousands of times (96,659 data entries in total). Criteo asked for additional identification checks, and because of time constraints it was decided not to follow through.
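A first readability check on a DSAR export, such as establishing that a large file consists of a handful of rows repeated many times, can be sketched with the Python standard library. The column names and values below are hypothetical and do not reflect the structure of the actual Quantcast or Oracle files.

```python
import csv
import io
from collections import Counter

def summarise_rows(csv_text):
    """Return the header and a count of how often each distinct row occurs."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counts = Counter(tuple(row) for row in reader)
    return header, counts

# Hypothetical export: two distinct rows, repeated three and two times.
sample = (
    "segment,source\n"
    + "sports,partner-x\n" * 3
    + "travel,partner-y\n" * 2
)

header, counts = summarise_rows(sample)
total_entries = sum(counts.values())   # all data rows, duplicates included
unique_entries = len(counts)           # distinct rows only
```

Comparing `total_entries` with `unique_entries` gives a quick indication of how much of a file is duplication before any substantive analysis is attempted.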
The "unsuccessful" DSARs also demonstrate that access to personal data held by online platforms is more often than not a complex and uncertain process. Because companies interpret the DSAR procedure, and the GDPR in general, differently, there are apparently substantial differences in what data is considered personal and thus eligible to be provided by the data controllers. 21 Sometimes the data controllers have long and extensive procedures (like Criteo) or they try to bypass meaningful information by directing users towards other available data (Facebook). Even when a request is successful, the data obtained might not be readable (as in the Quantcast case), the file might be incomplete, and the logic behind the presented and provided data and information might not be available or accessible to the user.

Investigating algorithmic identity
Next we investigated the workings of the algorithmic systems of a web platform (Google), a social media platform (Facebook) and one data broker (Oracle). We took the inferences as proxies, or representations, for investigating the assigned algorithmic identity. We decided to sample these two platforms and the data broker based on the results from the datafication phase of the research, where most trackers originated from these three actors (which as such hold the most data on the research subject), and on their affordances for research.

20 (continued) Facebook provided data, but with no additional meaningful information, and the data corresponds to that provided on its platform via the "Download your information" tool. Acxiom provided an answer stating that no data is collected from individuals residing in Belgium.
21 In an attempt to obtain data from the dating app Bumble, the platform representatives stated that they can only provide a registration date, IP addresses and profile photos (source: personal correspondence).

Figure 9. Overview of the tools used for algorithmic identity and insights gathered
The three different data controllers are investigated in order to assess the assigned algorithmic identity and to test the possibility of research using the inferred interests as proxies. As Figure 9 shows, we employed a variety of methods and tools, at different levels, to analyse various aspects of the inferential analytics at play and their outputs.
Google's Ad settings tool 22 was observed at frequent intervals for two and a half months and was used to record the assigned interests. Based on the 184 observed interests, triangulated with a list of categories (Brave, 2019) indicating the particular category an interest belongs to, and enhanced with a close reading of the categories, we were able to get an overview of the most dominant categories the research subject was categorised in (Figure 10). The daily recording of the interactions and assigned interests shows that these are often immediate outputs of simple browsing behaviour, but also that they are unstable and disappear, so that no historical database of inferred interests is available (for research or personal insights). Some of the interests disappear on a daily basis, while others remain for longer periods of time or for the entire period of data collection. This is significant from the point of view of the reliability of the collected data: researchers must be aware of the instability of the data and of the potential inability to collect what is available. This underlines the dependence on, and significance of, information structuring and information visibility, which can be seen as a politics of both visibility and knowledge, controlled by the platforms themselves. Andreou et al. (2018) point to the same characteristic of Facebook's transparency tools, referring to it as snapshot/temporal completeness.
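Because no historical record of inferred interests is offered by the platform, a researcher has to keep their own snapshots and compare them. A minimal Python sketch of such a comparison follows; the snapshot labels and interest names are hypothetical and are not the interests recorded in this study.

```python
from collections import Counter

# Hypothetical snapshots: recording label -> interests shown at that moment.
snapshots = {
    "day-1": ["Cycling", "Cooking"],
    "day-2": ["Cycling", "Travel"],
    "day-3": ["Cycling"],
}

def interest_persistence(recorded):
    """Count in how many snapshots each interest appears."""
    seen = Counter()
    for interests in recorded.values():
        seen.update(set(interests))
    return seen

def changes_between(earlier, later):
    """Interests that appeared or disappeared between two snapshots."""
    earlier, later = set(earlier), set(later)
    return {"added": sorted(later - earlier), "removed": sorted(earlier - later)}

persistence = interest_persistence(snapshots)
diff = changes_between(snapshots["day-1"], snapshots["day-2"])
```

An interest present in every snapshot can be read as stable over the recording period, whereas one that appears once would be lost entirely if that day had not been recorded.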

Figure 10. Frequency of categories of interests as assigned to the research subject by Google
Reading the assigned interests as text, we were able to construct an overview of the algorithmic identity assigned by Google (Figure 11). The use of the auto-technographic approach, as well as the fact that we are relying on and working with data from a real individual, enables us to test the assumptions made by the algorithms and assess their truthfulness. In our case, the algorithmically constructed identity is in sharp discrepancy with the research subject's sense of real identity and does not represent their actual life conditions (financial, familial, or employment). The findings from the data collected from Oracle are similar, with the important difference that online data brokers often lack basic demographic data and thus have to infer it from browsing behaviour in order to fill the 'information gap' (Crain, 2018), unlike platforms such as Google and Facebook, which both rely on data volunteered by users and have more access to the daily behaviour of users. However, we must be aware as researchers that an important aspect of reading and interpreting the data is concealed by the platforms: there is a lack of information on how these attributes are assigned and on what the inferential analytics process is. This potentially affects the comprehensiveness of the data collected by the researcher and, consequently, the analysis itself.

Figure 11. A close-reading illustration of an algorithmic identity as assigned by Google

Figure 12. A close-reading illustration of an algorithmic identity as assigned by Oracle
As for the possibility of investigating the algorithmic identity assigned by Facebook, using the very affordances of the platform itself we were able to draw an overview of the assigned general affinity towards certain categories, via a few available "points". As data sources we used the data explanations (revealing the reason for assigning the interests 23) and the ad explanation feature, both at an interface level (via observation and recording of data) and at a software level (AdAnalyst tool).
The data collected via the data explanation feature gives not only insights into the dominant assigned interests by category (Figure 13), but also points to the very specific categorisation practices Facebook uses for profiling and targeting. Closely reading the list of interests, it becomes visible that Facebook constructs very narrow categories (e.g. headphones; old style and new style dates; Conversion (gridiron football); Right-to-work law; particular movies/songs etc.) that might enable very specific targeting, and also that many of them simply do not make sense (e.g. non-resident Indian and person of Indian origin; hydrogen) or can be regarded as potentially sensitive information (Gay Pride, LGBT community).

Figure 13. Dominance of interests assigned by Facebook, per category
Although the process of matching a user with a particular ad is complex, because the outcome depends not only on the advertising platform and its matching algorithm but also on event-specific factors 24, the explanatory tool "Why am I seeing this ad?", when repurposed as an object for study, can provide significant information regarding the particular behaviour, activities and interests of the research subject used for automated behavioural targeting. Combining the insights collected manually via the interface with the data collected automatically via AdAnalyst at a software level provided significant insights. The first finding relates to the type of data that algorithmic systems consider an important aspect of the research subject's algorithmic identity, to be later taken into account for personalised behavioural ad targeting. 25 The second relates to the affordances of the different research methods and tools, and the different depth and scope of insight that they enable. AdAnalyst offers different insights as it has access to more parameters at a software level, not accessible via the interface. One example is the distinction between the general ad explanation served to the research subject (as a user) and what is indicated as the reason for the particular user to be targeted. Additionally, insights can be obtained about the targeting parameters set by the advertisers. As Figure 14 shows, what the research subject has been targeted on (e.g. bicycle as an interest) might be just one of the campaign targets set by the advertiser. These can sometimes differ, and in that sense AdAnalyst provides more in-depth insights than are available when looking only at an interface level.

23 At the time of the data collection and analysis of the Facebook data (11/27/2018 to 04/16/2019), Facebook was providing three very generic explanations of why an interest/affinity was inferred, but no further information: "you clicked an ad related with the interest" (64.37%), "you liked a page related with the interest" (31.80%) or "installed an app" (0.5% of the entries). Additionally, it added "liked their page or post" (3.30%): data recorded in May 2019 showed that Facebook made changes to its "assigned interests" explanation, adding one more "reason" to the previous three.
24 Such as the competing advertisers at the particular moment when an ad is about to be served, the specific requests/objectives set by the advertisers and the characteristics of the users available on the platform at a particular moment in time (Andreou et al., 2018, p. 3).

Figure 14. Screenshot of AdAnalyst's interface
The screenshot above is interesting for analysis because, via the section "The advertisers targeted other users with", it provides valuable insights into the parameters Facebook uses for targeting. We can observe that, apart from the well-known Lookalike audience, Personally Identifiable Information (PII) 26, Social Neighbourhood and similar, it also targets users based on data from data brokers, on behaviours (e.g. expats in France), on operating system and version (based on where Facebook was accessed from) and on biographical data (Master's degree).

25 For example, the research subject has liked a page, is or was at a particular location, belongs to a particular age group, etc.
26 Personally Identifiable Information (PII) is considered any data that can be used to identify a specific individual, such as name, email, phone number, IP address, location address, online identifier, biometric records and similar. For a more detailed definition, see GDPR Art. 4 (1).
Another avenue for investigating and assessing an assigned algorithmic identity is to repurpose the particular ads served to the user, more specifically the textual part of each ad. As the purpose of ads is to nudge users to take a particular action, ads are served targeting specific interests of particular users, with the aim of steering actions or behaviour. In that sense, ads could uncover the assigned affinities and, at an aggregate level, the algorithmic identity. Thus, a semantic analysis of 1,553 served ads, collected both manually (interface level) and using AdAnalyst (software level), was carried out. Only unique ads were taken into consideration. The tool CorText (Munk, 2019) was used to detect the semantic clusters forming from the corpus of served ads. The frequency of semantic co-occurrence can be read as a signal of the attributes the user is more targetable for, or most prone to act on. It can also be seen as enabling an insight into how a particular user is seen by the algorithms, given that the most dominant reasons for targeting are being part of a lookalike audience and having specific user interests.

As Marres and Gerlitz (2015) observe, social media platforms 'do not present us with raw data, but rather with specially formatted information' (p. 22). The formatting of this data, both at an interface and at a software (API) level, inevitably has methodological implications for research. By "standardising" the presentation of data and the way it is made visible, the platforms guide researchers through what is available to be seen and investigated. The perspective, methods and insights are limited by the affordances of each platform and of its algorithmic and API system(s). Keeping this in mind is important for discussing the scope and depth of available information when employing the set of research methods and tools in this empirical research. Marres and Gerlitz (2015) call this 'methodological bias' (p. 22) and rightfully ask: "is it really the researcher that here 'decides' to use this method, or is this decision rather informed by the object of study with its associated tools and metrics?" (ibid.). Even if we, as researchers, are not limited in the strict sense of the word, we are nudged and steered towards a particular configuration of analytic practices via the platforms', APIs' and software's own 'sampling techniques, options for analysis and modes of visualization' (p. 31).
Another potentially problematic aspect, both of relying on APIs and their data and, when access to them is denied, of adopting a method of data collection based on observation, is the constant change in what platforms make available. This highlights the constant revision of their politics of visibility and politics of knowledge, implemented via changes at an interface and software level. Barrett and Kreiss (2019) call this platform transience, a concept they use to describe the sudden changes platforms make in their policies, procedures and/or affordances, which affect the ability to do critical research, as they make platforms continuously changeable and ephemeral in significant ways. Right after the end of the data collection phase of this research, Facebook changed its data and ad explanation structures, now putting more information at the disposal of users (Facebook Newsroom, 2019). This is not just problematic in the sense that it makes data collected in different time periods potentially incomparable; it also makes the study of algorithmic systems almost a study of 'historical objects' (Bucher, 2012). We as researchers will always be bound by what platforms decide to make available, either via the interface or the API. With platforms closing their APIs and giving data access only to the "chosen" few (for example, Facebook's Social Science One 27), a move described by Bruns (2019) as 'corporate data philanthropy', the data access gap will only widen. Hence our ability to study technology, society, and the intersection of the two will narrow and become potentially very limited.
Considering this, and considering the increasing limitations on how research can be done and what can be obtained as valuable knowledge, as a result of immanent methodological bias, API restrictions and the impenetrability of black boxes, we are faced with the question of how successful and valid the research we conducted was. At an interface level, the methodological design imposed some particular limitations. These are once again related to how the platforms organise their information: how Facebook's and Google's ad settings are organised, how much they reveal, how the data obtained via Data Subject Access Requests is organised, how readable it is and, finally, what is made available through these "interfaces" and what is concealed, left out or not provided. Another related aspect concerns the nature of observation as a research method and of the auto-technographic approach. As outlined by Weltevrede (2016), this is always a me-centric view, highly individualised and personalised (p. 107), as are our experience of and the content provided on these platforms. Additionally, we need to be aware of the complexities that arise when one would like to transfer this same methodological design and setting to a sample comprising more than one research subject. That would require additional and modified steps, setting up the research environment and testing the possibilities for obtaining valuable and valid data, considering all the complexities of browsing histories, habits and patterns that particular research subjects could exhibit.
These exact same limitations can be seen as an advantage, as they enable 'real user-algorithmic agent interactions' (Bodo et al., 2018, p. 143). Being able to observe these enriches the quality of the insights but, more importantly, allows us to see the wider 'socio-technological assemblage' (ibid.) and the networks between different actors. And while it might not provide a picture of the totality of the system, it does provide a valuable, although partial, reconstruction of the complexity of these algorithmic assemblages.
By using the affordances of the different methods, at different levels of visibility (interface and software), for analytical inquiry, and by combining the findings, new and more in-depth insights were made possible. This is reinforced by repurposing objects of/for study, such as the data explanations, ad explanations, data subject access requests and similar, as a strategy to overcome the limitations and to uncover and make visible what was previously not revealable. While we had to adjust to the affordances, and thus the limitations, of methods and tools, this research and methodological strategy offered ways to be innovative and, by learning what is possible, to look for new avenues, new perspectives, new sources of data and thus new insights for digital social research. In that regard, the methodological design of this research is successful, as it provides access to new insights and enables a more in-depth inquiry into the processes of algorithmic construction of identity, data extraction and inferential analytics, and into the ecosystem of actors and networks around these surveillance practices. At a software level, automated tools enabled more in-depth knowledge and helped to better investigate aspects hidden from the interface and the eye. However, the approach has its limitations, emanating from the nature of platforms' APIs, which are also limited in scope and applicability by their very affordances. They have their own "politics of visibility", limiting what can be seen and uncovered. At an interface level, the daily, detailed observation and recording of the workings and outputs of the system enable more granular insights and observations of the subtle changes in and by algorithmic systems.
With our research, we tried to manoeuvre around the restrictions for research imposed by APIs and black boxes and to find ways to investigate opaque algorithmic systems. Following Paßmann and Boersma's (2017) suggestion for pursuing practical transparency, complemented by what they call formalized transparency, we made use of sources external to the algorithms, their APIs and black boxes as a way to detect and make known the unknowns. While APIs are an important research entry point, they are not the only one. We experimented with different approaches to circumvent the limitations for research imposed by platforms' gatekeeping practices. In doing so, we got close to what can be called 'digital fieldwork' (Venturini and Rogers, 2019): exploring, experimenting with, testing and employing various new approaches, sources and ways to collect data and capture the interactions between the algorithms and users, mediated via interfaces and APIs. With that, we proposed (just) one of the possible avenues for overcoming data access gaps and algorithmic opacity in doing digital social research. While the question of if and how platforms should provide access to data for researchers is not a focus of this paper, it remains an important one. We are of the opinion that while such access is necessary, thorough digital social research should use and rely on other methods, techniques and data access points in combination with API data. We see this as the only approach that will provide a comprehensive view of the socio-technological assemblages, their outputs and impact.

FUNDING STATEMENT AND ACKNOWLEDGMENTS
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was in part supported by Fonds voor Wetenschappelijk Onderzoek - Vlaanderen (FWO) [grant number G054919N]. Ana Pop Stefanija was an Erasmus

Figure 2. The most prevalent trackers per company in the research dataset as captured by TrackingObserver, data triangulated with WhoTracksMe and Better.fyi 11

Figure 3. Frequency of third-party trackers per website. Data source: Privacy Score

Figure 4. Categories of trackers per website.Data source: Privacy Score, data triangulated with WhoTracksMe and Better.fyi

Figure 7. Screenshots from the research subject's Google Ad Settings page. The one on the left dates from April 27, 2020; the one on the right is from July 21, 2020

Figure 15. Network mapping of semantic clusters from served ads on Facebook, using CorText