Saturday, April 20, 2024

If you use AI Copilot on GitHub, you may be sued!

Some People Abandoned GitHub For That

GitHub recently announced a shiny new product: an artificial intelligence (AI) called Copilot. Copilot is machine-learning-powered software that helps you write code, and its code generation capabilities are impressive. But it is also true that some people are leaving GitHub, or worrying about lawsuits, because of it.

Copilot behaves much like other OpenAI-powered code generation tools: when the user writes a comment describing what they want, the AI generates source code matching that description. What makes Copilot unique is that it also takes the initiative, constantly offering autocomplete-style suggestions as the user types.
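To make the comment-driven workflow concrete, here is an illustrative sketch (not an actual Copilot transcript): the developer types only the comment and the function signature, and a Copilot-style tool proposes a plausible body such as the one below.

```python
# The developer writes only the comment and signature; a Copilot-style
# tool then suggests a completion like the body that follows.
# (Illustrative sketch only -- not real Copilot output.)

# Return the n-th Fibonacci number (0-indexed: fib(0) == 0).
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))  # → 55
```

The point of such tools is that the suggestion arrives unprompted as you type, rather than only when you explicitly ask for it.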

GitHub Copilot is not the only code generation AI built on OpenAI's language models: Microsoft Power Apps, an application development platform developed by Microsoft, also includes a built-in code generation feature. The table below compares the two.
Comparison of the Microsoft Power Apps code generation feature and GitHub Copilot

Base language AI: Power Apps uses GPT-3; Copilot uses OpenAI Codex (GPT-3 fine-tuned for code generation).
Supported programming languages: Power Apps targets Microsoft Power Fx; Copilot works best with Python, JavaScript, TypeScript, Ruby, and Go, and supports dozens of other languages.
How to use: the Power Apps feature is built into Microsoft Power Apps; Copilot is used as an extension for Visual Studio Code.

By the way, GitHub has been under the Microsoft umbrella since 2018, so both of the tools above are effectively under Microsoft's control.

Copilot sounds great, doesn’t it? Readers who know me know that I often get excited about artificial intelligence. I have even published a book explaining technologies like Copilot. But machine learning is surrounded by thorny issues, and GitHub has faced one such dilemma since day one of Copilot’s announcement. The drama around machine learning applications usually starts with data, and the Copilot fuss follows that rule. In Copilot’s case specifically, the question is how GitHub collected the data used to build its algorithms.

(*Translation Note 2) In general, machine learning applications can amplify biases and discrimination latent in their training data. On this issue, see the AINOW article “What is the current state of ‘discrimination by AI’? Case studies, causes, and initiatives around the world”. The issue with GitHub Copilot, however, is copyright, as discussed below.


Like other machine learning algorithms, Copilot learns how to do things (how to write code) from data about what works (code). According to GitHub, the AI was trained on billions of lines of code extracted from GitHub repositories. So when Copilot writes code for its users, it is drawing on billions of lines of data.

Unfortunately, users have no way of knowing whether the algorithm created a particular piece of code itself or stole it from a licensed code repository.

And when I say “stole,” I mean it quite literally.

A software engineer posted on Twitter an image of the code Copilot generated after being asked (in a comment) to write an “about me” page. Funnily enough, the code was taken directly from a real person’s page.


Regarding the possibility of generating personal data, GitHub Copilot’s FAQ answers as follows.
Because GitHub Copilot is trained on publicly available code, its training set includes publicly available personal data. Internal testing has shown that it is very rare for GitHub Copilot’s suggestions to contain personal data verbatim from the training set. In some cases, the model suggests what appears to be personal data (email addresses, phone numbers, access keys, and so on) that is actually synthesized from patterns in the training data. For the technical preview, we implemented a rudimentary filter that blocks email addresses shown in standard formats, but with enough effort it is still possible to get the model to suggest this kind of content.
As described above, the possibility of generating real personal data is not zero. Extracting personal data from large-scale training data in this way is sometimes called a “training data extraction attack,” and research demonstrating such attacks has been published.
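GitHub has not published how its email-blocking filter works; the following is only a minimal sketch of that kind of safeguard, suppressing any suggestion that contains a standard-form email address. The regex and function name are assumptions for illustration, not GitHub’s implementation.

```python
import re

# Rough pattern for a "standard form" email address (illustrative only;
# real-world email matching is considerably more involved).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def should_block(suggestion: str) -> bool:
    """Return True if the suggested code contains an email-like string."""
    return EMAIL_RE.search(suggestion) is not None

print(should_block('send_mail("alice@example.com")'))  # True
print(should_block("x = a @ b  # matrix multiply"))    # False
```

As the FAQ itself concedes, a filter like this only catches the standard form: obfuscated addresses (“alice at example dot com”) would slip straight through.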

Here’s another seriously hilarious source code sample from Copilot. A GIF uploaded by a user shows the AI writing a function taken straight from the repository of the video game Quake III Arena, original comments included.

(*Translation Note 4) “Quake III Arena” is an FPS released in 1999 by id Software, the developer of “DOOM”. It was designed specifically for multiplayer play.

This is the fundamental problem with Copilot: you cannot tell which code is Copilot’s own invention and which is copied verbatim from another source.

The GitHub Copilot FAQ answers the question “Does GitHub Copilot repeat code from the training set?” as follows.
GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, a suggestion may contain verbatim snippets from the training set. This finding comes from a detailed study of the model’s behavior. Many of these cases of copied code occur when there is not enough context (in particular, when editing an empty file) or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that repeats the training set, so that you can make good real-time decisions about GitHub Copilot’s suggestions.
Once the origin tracker mentioned in this answer is implemented, it should become possible to identify copied code.
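GitHub has not published how the origin tracker works, so the following is only a naive sketch of the underlying problem: flag a suggestion when it shares a sufficiently long token n-gram with an indexed training corpus. All names and the threshold (8 tokens) are assumptions for illustration.

```python
def ngrams(tokens, n=8):
    """All contiguous n-token windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus_files, n=8):
    """Index every token n-gram seen in the training corpus."""
    index = set()
    for text in corpus_files:
        index |= ngrams(text.split(), n)
    return index

def looks_copied(suggestion, index, n=8):
    """Flag a suggestion if any n-gram of it appears verbatim in the corpus."""
    return any(g in index for g in ngrams(suggestion.split(), n))

corpus = ["for (int i = 0; i < n; i++) sum += a[i];"]
index = build_index(corpus)
print(looks_copied("total = 0; for (int i = 0; i < n; i++) sum += a[i];", index))  # True
print(looks_copied("completely novel code with no overlap at all here ok", index))  # False
```

A real system would have to be far more robust, since trivial renaming or reformatting defeats exact n-gram matching; but even this toy version shows why such detection can only run against a fixed snapshot of the training data.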

Another Twitter user noted that the software could act as a vehicle for laundering open source code into commercial products.

(*Translation Note 7) The translation of the above tweet is as follows.
GitHub Copilot, as GitHub itself admits, is trained on piles of GPL code, so I’m unclear how this isn’t a form of laundering open source code into commercial works. The line “GitHub Copilot doesn’t usually reproduce exact chunks of code” isn’t very satisfying.
The sentence quoted in the tweet is from GitHub’s response about Copilot’s code copying, translated in Note 6.

According to GitHub’s Terms of Service, users of the platform grant GitHub and its legal successors “the right to store, archive, parse, and display [their] Content, and make incidental copies, as necessary to provide the Service.” However, having their source code used as training data for Copilot may not be what users had in mind when they signed up for the service.


In the Copilot FAQ, GitHub claims that “the code you create with the help of GitHub Copilot belongs to you.” However, on the popular programmer forum Hacker News, some argue that Copilot infringes copyright. The AI should only be able to use code whose owners have licensed it for commercial use, but it clearly uses any code regardless of license.

(*Translation Note 8) In the GitHub Copilot FAQ, the following answer is given regarding the copyright of the code generated by the AI.
GitHub Copilot is a tool, like a compiler or a pen. The suggestions GitHub Copilot generates, and the code you create with its help, belong to you, and you are responsible for them. As with any code you write yourself, we encourage you to carefully test, review, and vet it.
Also, in response to the question “Do I need to credit GitHub Copilot for helping me write code?”, the FAQ answers:
No, the code you create with the help of GitHub Copilot belongs to you. While all friendly robots enjoy the occasional thank-you note, you are under no obligation to credit GitHub Copilot. As with using a compiler, the output you produce with GitHub Copilot belongs to you.
From these responses, it is clear that GitHub’s official position is that Copilot is merely a tool, like a compiler, and therefore has no claim to copyright in the code it generates.

In another Hacker News thread, users worried that the tool could lead them to unknowingly use copyrighted code and be sued. One user called Copilot a “legal time bomb,” while another added a personal anecdote: “I’m in charge of product security for a large company, and we’re already moving in the direction of banning Copilot […]”

Some people are abandoning GitHub because of this situation.

Personally, I believe in a future where we can write code faster with assistants powered by machine learning. But GitHub Copilot isn’t that future. In the case of this AI, there are just too many concerns around data collection and use.

I expect to see more and more similar services in the future, but if they aren’t ethically and wisely crafted, they won’t really succeed.

GitHub isn’t the only company building AI that generates code. What impact will these new algorithms have on developers? Will AI replace programmers? Check out the article below.
